
Synthetic data boosts readmission prediction

April 20, 2026 By Matthew Solan 4 min read

Models trained on data augmented with synthetic records predicted 30-day hospital readmissions among patients with type 2 diabetes mellitus, chronic obstructive pulmonary disease, and heart failure more accurately than models trained on the original data alone.

In a retrospective cohort study published in BMJ Open, researchers developed an explainable machine learning (ML) framework incorporating structured electronic health record data and information extracted from clinical notes.  

Researchers used the Medical Information Mart for Intensive Care IV database and included patients with type 2 diabetes mellitus (n = 12,735), chronic obstructive pulmonary disease (COPD) (n = 14,050), and heart failure (n = 7,097), who were admitted between 2008 and 2019. Patients who died during the index hospitalization were excluded, and all included patients had complete 30-day follow-up. 

The primary outcome was unplanned all-cause readmission within 30 days of discharge. Predictors spanned six domains: demographics and socioeconomic status; comorbidities; clinical history and acute illness severity; disease-specific markers and therapies; behavioral and social factors; and care continuity indicators. Natural language processing was used to extract variables such as medication nonadherence and social support from unstructured notes. 

Synthetic data were generated using three approaches: the synthetic minority over-sampling technique (SMOTE), a conditional tabular generative adversarial network (CTGAN), and a tabular variational autoencoder (TVAE). Augmentation was applied within cross-validation folds to avoid information leakage.
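Why the augmentation step sits inside each fold can be illustrated with a minimal Python sketch. It is not the study's code: the function names, the toy interpolation (a simplified stand-in for SMOTE that pairs random minority rows rather than nearest neighbours), and the assumption that label 1 marks the minority readmission class are all hypothetical.

```python
import random

def smote_like(minority, n_new, rng):
    """Create n_new synthetic rows by interpolating between random pairs of minority rows.
    Simplification: real SMOTE interpolates toward nearest neighbours."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic

def stratified_folds(labels, k, rng):
    """Assign each row index to one of k folds, keeping class proportions similar."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def augmented_splits(X, y, k=5, seed=0):
    """Yield (X_train, y_train, test_idx). Synthetic rows are added to the
    training part only, so held-out rows never influence the generator."""
    rng = random.Random(seed)
    for test_idx in stratified_folds(y, k, rng):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        # Assumption: label 1 is the minority (readmitted) class.
        minority = [x for x, label in zip(X_train, y_train) if label == 1]
        n_new = (len(y_train) - len(minority)) - len(minority)
        if n_new > 0 and len(minority) >= 2:
            new_rows = smote_like(minority, n_new, rng)
            X_train = X_train + new_rows
            y_train = y_train + [1] * len(new_rows)
        yield X_train, y_train, test_idx
```

Generating synthetic rows before splitting would let information from held-out patients shape the training data, which tends to inflate cross-validated scores. Here each test fold contains only original rows.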

The study evaluated four supervised machine learning models: logistic regression, random forest, gradient boosting, and extreme gradient boosting. Model performance was assessed using stratified five-fold cross-validation, with metrics including accuracy, area under the receiver operating characteristic curve (AUROC), precision, recall, and F1 score.
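For readers less familiar with these metrics, the following Python sketch shows how accuracy, precision, recall, and F1 score follow from binary predictions. The function name and the toy labels are illustrative assumptions, not taken from the study.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = readmitted)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Precision: share of predicted readmissions that were real.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of real readmissions that were predicted.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example: 4 readmitted patients, 6 not readmitted.
toy = classification_metrics([1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
                             [1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
```

Because readmitted patients are typically the minority class, accuracy alone can look high even when many readmissions are missed, which is one reason recall and F1 score are reported alongside it.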

Ensemble models outperformed logistic regression. Gradient boosting achieved the highest performance in COPD and type 2 diabetes mellitus, whereas extreme gradient boosting performed best in heart failure based on F1 score. 

In cross-validation, gradient boosting achieved an F1 score of 0.84 in COPD, with accuracy of 0.89, precision of 0.89, and recall of 0.79. In type 2 diabetes mellitus, it achieved an F1 score of 0.83, accuracy of 0.89, precision of 0.87, and recall of 0.80. In heart failure, extreme gradient boosting achieved an F1 score of 0.80, with accuracy of 0.87, precision of 0.86, and recall of 0.76.  

Discrimination was strong across cohorts, with AUROC values ranging from 0.91 to 0.95. 
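AUROC can be read as the probability that a randomly chosen readmitted patient receives a higher predicted risk than a randomly chosen patient who was not readmitted. A minimal Python sketch of that pairwise definition follows. The function name and toy scores are hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ordered correctly.
example = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Under this reading, an AUROC of 0.91 means the model ranks the readmitted patient higher in roughly 91 of 100 such pairs, while 0.5 corresponds to chance.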

On held-out test sets, gradient boosting achieved an F1 score of 0.92 and recall of 0.96 in COPD, with an AUROC of 0.93. In heart failure, both gradient boosting and extreme gradient boosting achieved F1 scores of 0.91 and AUROCs of 0.93. In type 2 diabetes mellitus, gradient boosting achieved an F1 score of 0.91 and an AUROC of 0.94. The models demonstrated high precision (0.88 to 0.91) and recall (0.90 to 0.96) across cohorts. 

Across cohorts, higher illness severity scores and greater comorbidity burden were key predictors of readmission. Medication nonadherence increased the odds of readmission by 1.51 times in COPD, 1.37 times in heart failure, and 1.29 times in type 2 diabetes mellitus. Social and behavioral factors, including limited social support and gaps in preventive care, were also influential. Immunization status, outpatient follow-up, and insurance type contributed to risk stratification.  
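To give a sense of scale, the Python sketch below converts an odds multiplier into a change in absolute risk. It assumes the reported figures are odds ratios (for example, 1.51 for COPD) and uses a hypothetical 20% baseline readmission risk that does not come from the study. Applying an adjusted odds ratio to a single baseline is itself a simplification.

```python
def risk_after_odds_ratio(baseline_risk, odds_ratio):
    """Convert a baseline probability to odds, apply the odds ratio, and convert back."""
    if not 0.0 < baseline_risk < 1.0:
        raise ValueError("baseline_risk must be strictly between 0 and 1")
    odds = baseline_risk / (1.0 - baseline_risk)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)

# Hypothetical baseline of 20%; 1.51 is the COPD figure, read as an odds ratio.
copd_risk = risk_after_odds_ratio(0.20, 1.51)
```

With these assumptions, a 20% baseline risk rises to roughly 27%, which shows that an odds ratio of 1.51 is not the same as a 51% increase in the probability of readmission.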

Sensitivity analyses showed improved model performance when patients with fewer comorbidities were analyzed separately. In type 2 diabetes mellitus, excluding comorbid conditions increased F1 score from 0.87 to 0.92 and recall from 0.80 to 0.88. 

Synthetic data augmentation, particularly with the tabular variational autoencoder, improved recall and overall model stability compared with the other techniques.

The researchers noted several limitations, including the use of a single-center dataset without external validation, potential underdocumentation of social determinants of health, and lack of evaluation of real-world implementation or cost-effectiveness. 

“Incorporating social and behavioral risk factors coupled with clinical severity markers supports the adoption of multidimensional risk prediction frameworks to capture real-world patient complexity and identify actionable drivers of readmission,” the researchers wrote. 

The study was supported by the American College of Clinical Pharmacy Foundation. The researchers reported having no relevant conflicts.

 

AACE Endocrine AI is published by Conexiant under a license arrangement with the American Association of Clinical Endocrinology, Inc. (AACE®). The ideas and opinions expressed in AACE Endocrine AI do not necessarily reflect those of Conexiant or AACE. For more information, see Policies.
