Machine learning models for detecting suicidal ideation in Chinese in-patients with major depressive disorder: A single-centre retrospective study.
Suicide claims more than 720,000 lives annually, and major depressive disorder (MDD) carries the highest population-attributable risk. Suicidal ideation (SI), the most proximal modifiable predictor of a suicide attempt, is poorly captured by subjective scales. We developed and internally validated machine-learning models to detect SI in Chinese in-patients with MDD using routine electronic medical records.
This study aimed to develop and internally validate machine-learning (ML) models that exploit routine electronic medical record (EMR) data to identify recent SI in Chinese in-patients with MDD.
A retrospective cohort of 721 in-patients with MDD, including 399 with suicidal ideation (SI-positive), was drawn from the Fourth People's Hospital of Hefei between January 2020 and August 2023. The dataset was split, with stratification, into training (70%) and test (30%) sets. All preprocessing steps (median imputation and Z-score normalization) and Boruta feature selection were fitted exclusively on the training set using R software (version 4.4.2); to limit multicollinearity, variables with a variance inflation factor (VIF) > 5 or a pairwise Pearson correlation coefficient |r| > 0.75 were removed. Six machine-learning algorithms (random forest [RF], logistic regression [LR], LightGBM, support vector machine [SVM], K-nearest neighbor [KNN], and XGBoost) were tuned using GridSearchCV with 10-fold stratified cross-validation, with class weights set via the class_weight = 'balanced' parameter. Model performance was evaluated on the held-out test set using the area under the receiver operating characteristic curve (AUC), area under the precision-recall curve (PR-AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). SHAP analysis, implemented in Python (version 3.12), was used to aid model interpretation. Subgroup analysis (stratified by sex and age) and sensitivity analysis (comparing the optimal RF model with the traditional LR baseline model via DeLong's test) were conducted to assess the robustness of the proposed model and its advantage over the baseline.
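The leakage-free preprocessing described above can be illustrated with a minimal sketch. It is not the authors' code (they report using R for these steps); it uses only the Python standard library, synthetic age values, and hypothetical helper names (stratified_split, fit_preprocessor, apply_preprocessor). Only the class counts (399 SI-positive of 721) and the 70/30 split come from the text. The point it shows is that the median, mean, and standard deviation are learned from the training rows and then reused unchanged on the test rows.

```python
import random
import statistics


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) with the class ratio preserved in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def fit_preprocessor(train_values):
    """Learn imputation and scaling parameters from training values only (None = missing)."""
    observed = [v for v in train_values if v is not None]
    median = statistics.median(observed)
    imputed = [median if v is None else v for v in train_values]
    mean = statistics.fmean(imputed)
    # Population SD is an assumption here; fall back to 1.0 for a constant column.
    sd = statistics.pstdev(imputed) or 1.0
    return {"median": median, "mean": mean, "sd": sd}


def apply_preprocessor(values, params):
    """Impute and z-score any split using the training-set parameters."""
    filled = [params["median"] if v is None else v for v in values]
    return [(v - params["mean"]) / params["sd"] for v in filled]


# Synthetic stand-in data: class counts from the abstract, feature values invented.
labels = [1] * 399 + [0] * 322
rng = random.Random(42)
ages = [None if rng.random() < 0.05 else rng.gauss(45, 12) for _ in labels]

train_idx, test_idx = stratified_split(labels)
params = fit_preprocessor([ages[i] for i in train_idx])
train_z = apply_preprocessor([ages[i] for i in train_idx], params)
test_z = apply_preprocessor([ages[i] for i in test_idx], params)  # no refitting on test data
```

Fitting the parameters on the full dataset instead would let information from the test rows influence the training features, which tends to make test-set performance look better than it is.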
Random Forest achieved the best discrimination (AUC = 0.857) and maintained stable discrimination across sexes (AUC = 0.843 in females and 0.822 in males), with higher sensitivity in females (0.883) and higher specificity in males (0.900). Regarding the top risk features, SI-positive patients, compared with SI-negative patients, were predominantly male (74.4% vs. 53.7%), married (80.7% vs. 63.4%), and had a lower educational level (83.7% vs. 29.5% without higher education). Furthermore, both their current age and their age at depression onset were significantly greater (all P < 0.001).
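For readers less familiar with the threshold-based metrics reported above, the following sketch shows how sensitivity, specificity, PPV, NPV, and accuracy follow from a confusion matrix, overall and per subgroup. The labels, predictions, and group codes are invented toy values, not study data, and the function names (confusion_metrics, metrics_by_group) are hypothetical.

```python
def confusion_metrics(y_true, y_pred):
    """Compute threshold-based metrics from binary labels and binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # SI-positive patients correctly flagged
        "specificity": ratio(tn, tn + fp),  # SI-negative patients correctly cleared
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


def metrics_by_group(y_true, y_pred, groups):
    """Apply confusion_metrics separately within each subgroup (for example, sex)."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, value in enumerate(groups) if value == g]
        result[g] = confusion_metrics([y_true[i] for i in idx], [y_pred[i] for i in idx])
    return result


# Toy example: 1 = SI-positive, 0 = SI-negative; "F"/"M" are illustrative group codes.
y_true = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 1, 1, 0, 0]
groups = ["F"] * 5 + ["M"] * 5

overall = confusion_metrics(y_true, y_pred)
by_sex = metrics_by_group(y_true, y_pred, groups)
```

Because each subgroup has its own confusion matrix, sensitivity can be higher in one group while specificity is higher in the other, even when the AUCs are similar.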
ML models, especially Random Forest, can effectively identify recent SI risk in Chinese MDD patients using readily available clinical data.
Authors
Gu Gu, Zheng Zheng, Xie Xie, Pan Pan, Zhang Zhang, Wang Wang, Chen Chen, Cheng Cheng