Adjusting Covariate Misclassification in Electronic Health Records-Based Machine Learning Prediction Models.

This study developed and evaluated methods to adjust for misclassification errors in electronic health record (EHR)-derived covariates, using group-wise and individualized weights based on observed sensitivity and specificity, with the aim of reducing bias in predictive modeling. Logistic regression, XGBoost, and neural network models were used to predict follow-up adherence in lung cancer screening. The Lung-RADS category, extracted via natural language processing (NLP), was adjusted using group-wise weights and individualized weights derived from kernel and multinomial regression. Models with adjusted covariates were compared to naïve (unadjusted) and oracle (true-value) models. Performance, assessed by the area under the receiver operating characteristic curve (AUROC) across 10%, 20%, and 30% validation sets, showed that adjusted models outperformed naïve models, improving AUROC by 0.3%-10.4%. Compared to oracle models, adjusted models reduced the AUROC gap to 2.0%-7.5%. Individualized weights provided more precise corrections than group-wise weights. This scalable framework mitigates misclassification bias in EHR-derived covariates, enhancing predictive accuracy without resource-intensive manual review.
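
The abstract does not give the weighting formula, so the following is only a minimal sketch of one plausible reading of group-wise adjustment. It assumes a manually validated subset in which both the true and the NLP-extracted category are known, estimates P(true category | extracted category) from that subset, and replaces each extracted category with that probability vector. The function names `groupwise_weights` and `adjust_covariate` and the toy labels are hypothetical and are not taken from the study.

```python
import numpy as np


def groupwise_weights(true_labels, observed_labels, n_categories):
    """Estimate P(true = k | observed = j) from a validated subset.

    Assumption: one weight vector per observed category (group-wise),
    computed from the validation confusion matrix. This is an
    illustrative stand-in, not the authors' published estimator.
    """
    counts = np.zeros((n_categories, n_categories))
    for t, o in zip(true_labels, observed_labels):
        counts[o, t] += 1  # rows: observed category, columns: true category
    row_totals = counts.sum(axis=1, keepdims=True)
    # Observed categories absent from the validation subset keep an
    # identity row, i.e. they are left unadjusted.
    return np.where(
        row_totals > 0,
        counts / np.maximum(row_totals, 1),
        np.eye(n_categories),
    )


def adjust_covariate(observed_labels, weights):
    """Replace each observed category with its row of class probabilities."""
    return weights[np.asarray(observed_labels)]


# Toy validation subset with two categories (hypothetical values).
validated_true = [0, 0, 0, 1, 1, 1]
validated_observed = [0, 0, 1, 1, 1, 0]
weights = groupwise_weights(validated_true, validated_observed, n_categories=2)

# Soft-coded covariate for new records that only have the extracted category.
adjusted = adjust_covariate([0, 1, 1], weights)
```

The adjusted matrix can be passed to a downstream model in place of the one-hot extracted category. Individualized weights, as described in the abstract, would instead let these probabilities vary with patient-level covariates, for example through a multinomial regression fitted on the validated subset.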
Cancer
Chronic respiratory disease
Access
Advocacy

Authors

Yang Yang, Wu Wu, Liu Liu, Bian Bian, Liang Liang, Guo Guo