Missing Data Imputation for Health Care Big Data using Denoising Autoencoder with Generative Adversarial Network
Abstract
Missing data imputation is a key problem in health care analytics, covering the challenges and strategies involved in handling incomplete data in medical records, clinical trials, and health surveys. Health care data may be missing for a variety of reasons, including survey non-response, data entry errors, and information left unrecorded during clinical appointments. This paper introduces a novel approach to imputing missing data using a hybrid model that integrates denoising autoencoders with generative adversarial networks (GANs). We begin by highlighting the prevalence of missing data in health care datasets and its potential impact on analytical outcomes. The proposed methodology leverages the denoising autoencoder’s ability to reconstruct data from noisy inputs, coupled with the GAN’s ability to generate synthetic data that closely resembles real data. By combining these two neural network architectures, our model demonstrates an enhanced capability to predict and fill in missing data points. To validate our approach, we conducted experiments on several large-scale health care datasets with varying degrees of artificially introduced missingness. The performance of our model was benchmarked against traditional imputation methods, such as mean imputation and k-nearest neighbors, as well as against standalone denoising autoencoders and GANs. Our results indicate a significant improvement in imputation accuracy over these baselines, as measured by root mean square error (RMSE) and mean absolute error (MAE), confirming the efficacy of the hybrid model in handling missing data robustly.
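The evaluation protocol described in the abstract (hide known values, impute them, then score with RMSE and MAE) can be illustrated with a minimal sketch. This is not the authors' code. It covers only the mean imputation baseline, not the hybrid model. It assumes values are hidden completely at random, which the abstract does not specify, and it uses a small synthetic table in place of real health care data. All function names and the two columns are hypothetical.

```python
import math
import random

def mask_completely_at_random(rows, missing_rate, seed=0):
    """Hide each entry independently with probability missing_rate (an assumed mechanism).

    Returns (observed, mask); mask[i][j] is True where a value was hidden.
    """
    rng = random.Random(seed)
    mask = [[rng.random() < missing_rate for _ in row] for row in rows]
    observed = [[None if m else v for v, m in zip(row, mrow)]
                for row, mrow in zip(rows, mask)]
    return observed, mask

def mean_impute(observed):
    """Fill each missing entry with the mean of the observed values in its column."""
    n_cols = len(observed[0])
    col_means = []
    for j in range(n_cols):
        seen = [row[j] for row in observed if row[j] is not None]
        col_means.append(sum(seen) / len(seen) if seen else 0.0)
    return [[col_means[j] if v is None else v for j, v in enumerate(row)]
            for row in observed]

def rmse_mae(truth, imputed, mask):
    """Score only the hidden entries, since observed entries are copied through unchanged."""
    errors = [t - p
              for trow, prow, mrow in zip(truth, imputed, mask)
              for t, p, m in zip(trow, prow, mrow) if m]
    if not errors:
        raise ValueError("mask hides no entries; nothing to score")
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae

# Synthetic stand-in for a health care table (hypothetical columns: age, systolic BP).
rng = random.Random(42)
truth = [[rng.gauss(50, 12), rng.gauss(120, 15)] for _ in range(200)]

observed, mask = mask_completely_at_random(truth, missing_rate=0.2, seed=1)
imputed = mean_impute(observed)
rmse, mae = rmse_mae(truth, imputed, mask)
print(f"mean imputation: RMSE={rmse:.2f}, MAE={mae:.2f}")
```

Any other imputer, including a learned model, could be swapped in for `mean_impute` and scored with the same `rmse_mae` call, provided it sees only `observed` and never `truth`. Repeating the run at several values of `missing_rate` would mirror the "varying degrees of artificially introduced missingness" mentioned above.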