Clustering-Informed Shared-Structure Variational Autoencoder for Missing Data Imputation in Large-Scale Healthcare Data.

Despite advances in healthcare data management, missing data in electronic health records (EHR) and patient-reported outcomes remain a persistent challenge, limiting the usability of these sources in healthcare analytics. Conventional imputation methods often struggle to capture complex nonlinear relationships, require extensive computation time, and handle only a limited range of missing data mechanisms. To overcome these challenges, we propose the clustering-informed shared-structure variational autoencoder (CISS-VAE), which draws on the strengths of Bayesian neural networks. This model can capture complex associations and accommodate various missing data mechanisms, including missing not at random (MNAR). We also develop iterative learning algorithms that further improve imputation accuracy while preventing overfitting. Comprehensive simulations demonstrate that our model is more accurate than traditional and contemporary methods. We apply our method to EHR data from early-stage breast cancer patients at Memorial Sloan Kettering Cancer Center, aiming to mitigate the impact of missing data and enhance health monitoring and analyses.
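
As a rough illustration of the general idea behind autoencoder-based imputation (and not the CISS-VAE implementation itself, whose details are not given here), the sketch below scores a reconstruction only on observed entries and then fills missing entries with the reconstructed values. The function names `masked_mse` and `fill_missing` are hypothetical, and `x_hat` is a hard-coded stand-in for a decoder's output; the clustering, shared-structure, and variational components are omitted entirely.

```python
import math

def masked_mse(x, x_hat):
    """Mean squared error computed over observed (non-NaN) entries only."""
    errs = [
        (a - b) ** 2
        for row, row_hat in zip(x, x_hat)
        for a, b in zip(row, row_hat)
        if not math.isnan(a)  # skip missing entries: no ground truth to compare
    ]
    return sum(errs) / len(errs) if errs else 0.0

def fill_missing(x, x_hat):
    """Keep observed values; use the reconstruction where x is missing."""
    return [
        [b if math.isnan(a) else a for a, b in zip(row, row_hat)]
        for row, row_hat in zip(x, x_hat)
    ]

# Toy data: one missing entry, marked with NaN.
x = [[1.0, math.nan], [3.0, 4.0]]
# Assumed stand-in for a trained model's reconstruction of x.
x_hat = [[1.5, 2.0], [3.0, 3.0]]

loss = masked_mse(x, x_hat)
imputed = fill_missing(x, x_hat)
```

Restricting the loss to observed entries keeps the model from being penalized on values it cannot check, while the fill step leaves recorded measurements unchanged.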

Authors

Khadem Charvadeh, Seier, Panageas, Vaithilingam, Gönen, Chen