Missing data occur in every field, and most researchers handle them with a simple approach. Such approaches, however, may introduce bias and lead to inaccurate results. This study explores which method is suitable for large-sample data with multivariate missing-data patterns. We used cross-sectional survey data on youth health risk behavior in Beijing. Using R, we simulated data sets with randomly missing values at different proportions of missing data, based on the survey data set. Each incomplete data set was processed with complete case analysis (CCA), single imputation (SI), and multiple imputation (MI), yielding 30 completed data sets in total. Logistic regression was then used to analyze these completed data sets. Akaike's Information Criterion (AIC) was used to compare the advantages and disadvantages of the three methods, and other indicators, such as the significance of the regression coefficients (β) and the fraction of missing information (FMI), were used to evaluate the applicability of MI. Compared with the original data set K, the AIC of data sets processed by CCA and SI gradually decreased, and the relative error gradually increased, as the proportion of missing data increased. The AIC of data sets processed by MI changed only slightly. As the proportion of missing data increased, especially beyond 30%, the number of variables whose regression coefficients were not significant and the value of FMI gradually increased. Across the different proportions of missing data, MI performed well compared with CCA and SI. First, when dealing with missing values under MCAR (missing completely at random), we recommend using MI instead of CCA and SI. Second, the change in FMI can also be used as an indicator of how well MI handles the missing data. Third, MI is suitable for large-sample survey data, and a proportion of missing data of no more than 30% is its proper scope of application.
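To make the FMI indicator concrete, here is a minimal Python sketch of two steps from the procedure described above: masking values completely at random, and pooling the results from several imputed data sets with Rubin's rules. It is an illustration under assumptions, not the authors' code. The study itself used R, and the function names `mcar_mask` and `pool_rubin`, the example numbers, and the use of Rubin's classical degrees of freedom are choices made here.

```python
import random
from statistics import mean, variance


def mcar_mask(rows, proportion, seed=0):
    """Copy `rows`, setting each cell to None with probability `proportion`.

    Every cell is masked independently of all data values, which is the
    MCAR assumption. (Hypothetical helper, not taken from the paper.)
    """
    rng = random.Random(seed)
    return [[None if rng.random() < proportion else v for v in row]
            for row in rows]


def pool_rubin(estimates, variances):
    """Pool m estimates and their squared standard errors (Rubin's rules).

    Returns (pooled estimate, total variance, fraction of missing information).
    """
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    u_bar = mean(variances)        # within-imputation variance
    b = variance(estimates)        # between-imputation variance (m - 1 denominator)
    t = u_bar + (1 + 1 / m) * b    # total variance
    if b == 0:
        # Identical estimates across imputations: no information lost.
        return q_bar, t, 0.0
    riv = (1 + 1 / m) * b / u_bar          # relative increase in variance
    df = (m - 1) * (1 + 1 / riv) ** 2      # Rubin's classical degrees of freedom
    fmi = (riv + 2 / (df + 3)) / (1 + riv)
    return q_bar, t, fmi


# Made-up coefficients for one predictor from m = 5 imputed data sets.
betas = [0.50, 0.54, 0.46, 0.52, 0.48]
squared_ses = [0.010] * 5
beta_pooled, total_var, fmi = pool_rubin(betas, squared_ses)
print(f"pooled beta={beta_pooled:.3f}, total var={total_var:.4f}, FMI={fmi:.3f}")
```

A larger between-imputation variance relative to the within-imputation variance pushes the FMI up. This matches the reported pattern that FMI rises as the proportion of missing data grows.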
Published in: Science Journal of Public Health (Volume 7, Issue 5)
DOI: 10.11648/j.sjph.20190705.13
Page(s): 151-158
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright: Copyright © The Author(s), 2019. Published by Science Publishing Group
Keywords: Survey Data, Missing Value, Multiple Imputation (MI), Complete Case Analysis (CCA), Single Imputation (SI)
APA Style
Lingling Wang, Dandan Zhang, Jiali Duan, Ruoran Lyu. (2019). Comparison of Methods for Processing Missing Values in Large Sample Survey Data. Science Journal of Public Health, 7(5), 151-158. https://doi.org/10.11648/j.sjph.20190705.13
ACS Style
Lingling Wang; Dandan Zhang; Jiali Duan; Ruoran Lyu. Comparison of Methods for Processing Missing Values in Large Sample Survey Data. Sci. J. Public Health 2019, 7(5), 151-158. doi: 10.11648/j.sjph.20190705.13
AMA Style
Lingling Wang, Dandan Zhang, Jiali Duan, Ruoran Lyu. Comparison of Methods for Processing Missing Values in Large Sample Survey Data. Sci J Public Health. 2019;7(5):151-158. doi: 10.11648/j.sjph.20190705.13
@article{10.11648/j.sjph.20190705.13,
  author = {Lingling Wang and Dandan Zhang and Jiali Duan and Ruoran Lyu},
  title = {Comparison of Methods for Processing Missing Values in Large Sample Survey Data},
  journal = {Science Journal of Public Health},
  volume = {7},
  number = {5},
  pages = {151-158},
  doi = {10.11648/j.sjph.20190705.13},
  url = {https://doi.org/10.11648/j.sjph.20190705.13},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.sjph.20190705.13},
  year = {2019}
}
TY  - JOUR
T1  - Comparison of Methods for Processing Missing Values in Large Sample Survey Data
AU  - Lingling Wang
AU  - Dandan Zhang
AU  - Jiali Duan
AU  - Ruoran Lyu
Y1  - 2019/09/26
PY  - 2019
N1  - https://doi.org/10.11648/j.sjph.20190705.13
DO  - 10.11648/j.sjph.20190705.13
T2  - Science Journal of Public Health
JF  - Science Journal of Public Health
JO  - Science Journal of Public Health
SP  - 151
EP  - 158
PB  - Science Publishing Group
SN  - 2328-7950
UR  - https://doi.org/10.11648/j.sjph.20190705.13
VL  - 7
IS  - 5
ER  -