Predicting 3-year depressive symptoms among middle-aged and older adults in rural China using random forest: insights from the China Health and Retirement Longitudinal Study.
Under China’s dual urban-rural economic structure, rural regions face low socioeconomic status, inadequate healthcare resources, and neglect of mental health, which contribute to a higher prevalence of depression among middle-aged and older adults (aged 45 years and above) in these regions.
This prospective cohort study used data from 6,183 rural Chinese middle-aged and older adults in the China Health and Retirement Longitudinal Study (CHARLS, 2018–2020). A random forest model was developed to predict the 3-year incidence of depressive symptoms. Independent risk factors were identified via chi-square tests followed by binary logistic regression (Odds Ratios [ORs] and 95% Confidence Intervals [CIs] reported for significant variables, p < 0.05). The model’s performance and clinical utility were assessed using standard metrics and Decision Curve Analysis (DCA). SHapley Additive exPlanations (SHAP) values were used to determine each feature’s impact on predictions. A subgroup analysis also compared depression-related characteristics in middle-aged (45–59 years) versus older adults (≥ 60 years) with incident depressive symptoms.
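As an illustration of the quantity behind Decision Curve Analysis, the following is a minimal sketch of the standard net benefit formula, TP/N − (FP/N) × pt/(1 − pt), evaluated at a threshold probability pt. The function names and the toy labels and probabilities are hypothetical and are not taken from the study; the authors' actual DCA implementation is not described here.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of intervening on everyone with predicted risk >= threshold."""
    n = len(y_true)
    # True positives: flagged and actually developed symptoms.
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    # False positives: flagged but did not develop symptoms.
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # Odds of the threshold weight the harm of a false positive.
    weight = threshold / (1.0 - threshold)
    return tp / n - (fp / n) * weight


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: intervene on every participant."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical toy data (NOT CHARLS data): 1 = incident depressive symptoms.
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0]
toy_probs = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.7, 0.05]

model_nb = net_benefit(toy_labels, toy_probs, 0.5)
treat_all_nb = net_benefit_treat_all(toy_labels, 0.5)
```

A model is usually considered useful at a given threshold when its net benefit exceeds both the treat-all reference and zero (the treat-none reference); a decision curve repeats this comparison across a range of thresholds.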
Over a 3-year follow-up, 1,629 (26.35%) participants developed incident depressive symptoms. The random forest model was optimized using Recursive Feature Elimination (RF-RFE), which selected 28 key predictors from an initial 33. After threshold adjustment (optimal threshold = 0.43) to maximize the F1-score, the model achieved an accuracy of 0.736, precision of 0.499, recall of 0.607, F1-score of 0.548, and an AUC of 0.776 (95% CI: 0.763–0.788). The mean Brier score was 0.163 ± 0.006. DCA confirmed its clinical utility. Key protective factors identified via logistic regression included being male, higher education, and internet access. Conversely, increased age, poor self-rated health, lower life satisfaction, and functional limitations were significant risk factors for incident depressive symptoms.
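The reported figures can be cross-checked with simple arithmetic, and the idea of threshold adjustment can be sketched as a sweep over candidate cut-offs that keeps the one with the highest F1-score. The sketch below uses only the precision, recall, and counts stated above; the helper names, the candidate thresholds, and the toy data are hypothetical and do not reproduce the study's procedure or data.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall(y_true, y_prob, threshold):
    """Precision and recall when predicting positive for prob >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_f1_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    return max(candidates,
               key=lambda t: f1_from(*precision_recall(y_true, y_prob, t)))


# Consistency checks on the reported values.
reported_f1 = round(f1_from(0.499, 0.607), 3)      # 0.548
incidence_pct = round(100 * 1629 / 6183, 2)        # 26.35

# Hypothetical toy data (NOT CHARLS data) to show the sweep itself.
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0]
toy_probs = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.7, 0.05]
chosen = best_f1_threshold(toy_labels, toy_probs, [0.25, 0.35, 0.5, 0.65])
```

Maximizing F1 typically moves the cut-off below 0.5 when the positive class is the minority, trading some precision for recall; this is consistent with the reported threshold of 0.43 and a recall higher than the precision.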
The random forest model demonstrates moderate predictive ability to estimate the risk of depressive symptoms in individuals aged 45 and above in rural China over the next 3 years. It offers a potentially valuable screening tool for rural regions with low mental health awareness and high depression prevalence, enabling more targeted interventions and prevention strategies.
The online version contains supplementary material available at 10.1186/s40359-025-03513-2.