Modeling Diabetes Risk and Progression With Public Health Data: Ontology-Guided, Simulation-Capable Digital Twin Study.
Digital twins (DTs) enable data-driven, simulation-capable representations of individual health trajectories and are an emerging paradigm in health care. However, DT development remains limited by the scarcity of standardized, temporally structured, and multidomain data suitable for modeling chronic disease progression. Most existing DT studies rely on narrowly scoped or proprietary datasets, which restricts generalizability. Public health datasets, such as the Midlife in the United States study, provide rich biopsychosocial information but are underused because of their structural complexity and the lack of semantic integration frameworks.
This study aimed to develop and evaluate an ontology-guided, agent-orchestrated framework for constructing offline, simulation-capable, and progression-aware DTs from public health datasets. Using diabetes as a case study, the framework integrates agent-based orchestration, medical ontologies, and large language model (LLM)-assisted semantic reasoning with machine learning to support explainable feature structuring, risk prediction, and predictive "what-if" progression analysis.
A 6-stage DT framework was developed and applied to Midlife in the United States wave 2 (baseline) and wave 3 (follow-up) data. Ontology- and LLM-assisted feature selection identified predictors across biological, behavioral, psychosocial, and socioeconomic domains. Cleaned and harmonized data were used to train predictive models (random forest, eXtreme gradient boosting, and logistic regression) to estimate diabetes onset at follow-up. A state-transition simulator was implemented to model between-wave progression dynamics, quantify transitions across low-, medium-, and high-risk states, and evaluate predictive "what-if" scenarios such as weight reduction and lifestyle improvement. Model performance was assessed using accuracy, F1 score, area under the receiver operating characteristic curve (AUC), and calibration metrics.
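The simulator logic described above can be illustrated with a minimal sketch. It is not the study's implementation: the `toy_risk` function stands in for a trained model, its coefficients are invented, and the cut points separating low-, medium-, and high-risk states (0.2 and 0.5) are assumptions chosen only for illustration. The sketch bins predicted probabilities into risk states, tallies state transitions, and re-scores the same individuals under a "what-if" weight change (with height fixed, a 10% weight reduction scales BMI by 0.9).

```python
import math
from collections import Counter

STATES = ("low", "medium", "high")


def risk_state(p, low_cut=0.2, high_cut=0.5):
    """Bin a predicted probability into a risk state (cut points are assumptions)."""
    if p < low_cut:
        return "low"
    if p < high_cut:
        return "medium"
    return "high"


def toy_risk(bmi, active):
    """Hypothetical stand-in for a trained classifier; coefficients are invented."""
    z = 0.15 * (bmi - 27.0) - 0.8 * active
    return 1.0 / (1.0 + math.exp(-z))


def simulate(people, weight_change):
    """Re-score each (bmi, active) pair after scaling BMI by (1 + weight_change)."""
    return [
        risk_state(toy_risk(bmi * (1.0 + weight_change), active))
        for bmi, active in people
    ]


def transition_counts(before, after):
    """Count how many individuals move from each state to each other state."""
    return Counter(zip(before, after))


# Toy cohort: (BMI, physically active flag). Values are illustrative only.
people = [(24.0, 1), (31.0, 0), (36.0, 0), (28.0, 1)]

baseline = simulate(people, 0.0)
placebo = simulate(people, 0.0)    # 0% change: should reproduce the baseline
reduced = simulate(people, -0.10)  # simulated 10% weight reduction

transitions = transition_counts(baseline, reduced)
for (src, dst), n in sorted(transitions.items()):
    print(f"{src} -> {dst}: {n}")
```

The placebo run is the useful sanity check here: because nothing is perturbed, any difference from the baseline would point to nondeterminism in the pipeline rather than to an effect of the scenario. As in the study, shifts produced by such re-scoring are model-estimated and do not imply causal intervention effects.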
From 9976 candidate variables, ontology- and LLM-guided selection retained the 200 most relevant predictors spanning biological, behavioral, psychosocial, and socioeconomic domains. Predictive modeling achieved strong discrimination, with random forest (AUC=0.82, accuracy=0.76) and eXtreme gradient boosting (AUC=0.81, accuracy=0.75) outperforming logistic regression (AUC=0.78). The state-transition simulator reproduced realistic progression patterns: 33.9% (1414/4174) of participants changed risk states between waves, and the high-risk group increased from 10.8% (451/4174) to 32.2% (1344/4174). Next-state prediction accuracy reached 92.5%. In predictive "what-if" analyses, a simulated 10% weight reduction lowered the number of model-estimated diabetes cases by 98 (from 576 to 478). A placebo test (0% weight change) produced a difference of less than 0.3% in the risk distribution, indicating that the simulator is stable under a null intervention.
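The reported proportions can be rechecked directly from the counts given above. The short script below uses only numbers stated in this abstract; the derived quantities (the relative reduction in estimated cases and the percentage-point rise in the high-risk group) are simple arithmetic on those counts, not additional results from the study.

```python
# Counts as reported in the abstract.
N = 4174                               # participants with both waves
changed = 1414                         # changed risk state between waves
high_w2, high_w3 = 451, 1344           # high-risk group at wave 2 and wave 3
cases_before, cases_after = 576, 478   # model-estimated diabetes cases


def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)


print(pct(changed, N))            # 33.9
print(pct(high_w2, N))            # 10.8
print(pct(high_w3, N))            # 32.2
print(cases_before - cases_after)  # 98

# Derived for context: relative drop in estimated cases under the
# simulated 10% weight reduction, and the rise in the high-risk share.
relative_drop = pct(cases_before - cases_after, cases_before)
high_risk_rise = pct(high_w3 - high_w2, N)
print(relative_drop)   # 17.0 (percent of baseline estimated cases)
print(high_risk_rise)  # 21.4 (percentage points)
```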
This study presents a foundational, ontology-guided, and agent-orchestrated framework for constructing offline, simulation-capable, and progression-aware DTs from public datasets. By combining semantic reasoning, multidomain predictors, and predictive "what-if" progression simulation, the framework transforms static population data into longitudinal, interpretable representations of individual health trajectories. The proof-of-concept application to diabetes demonstrates that public health data can support robust and explainable DT models for exploratory risk analysis and hypothesis generation, without implying causal intervention effects or direct clinical decision support.