Model selection and model averaging in the presence of missing values

Gopal Pillay, Khuneswari (2015) Model selection and model averaging in the presence of missing values. PhD thesis, University of Glasgow.


Model averaging has been proposed as an alternative to model selection, intended to overcome the underestimation of standard errors that results from model selection. Both model selection and model averaging become more complicated in the presence of missing data. Three model selection approaches (RR, STACK and M-STACK) and model averaging using three model-building strategies (non-overlapping variable sets, inclusive and restrictive strategies) were explored as ways of combining results from multiply-imputed data sets, using a Monte Carlo simulation study on some simple linear and generalized linear models. Imputation was carried out using chained equations (via the "norm" method in the R package MICE). The simulation results showed that STACK performs better than RR and M-STACK in terms of model selection and prediction, whereas model averaging performs slightly better than STACK in terms of prediction. The inclusive and restrictive strategies perform better in terms of prediction, but the non-overlapping variable sets strategy performs better for model selection.

STACK and model averaging using all three model-building strategies were proposed to combine the results from a multiply-imputed data set from the Gateshead Millennium Study (GMS). The performance of STACK and model averaging was compared using mean square error of prediction (MSE(P)) in a 10% cross-validation test. The results showed that STACK using an inclusive strategy provided better prediction than model averaging. This agrees with the results of a simulation study designed to mimic the GMS data. In addition, the inclusive strategy for building imputation and prediction models was better than the non-overlapping variable sets and restrictive strategies. The presence of highly correlated covariates and response is believed to have led to better prediction in this particular context.
Model averaging using non-overlapping variable sets performs better only if an auxiliary variable is available. However, STACK using an inclusive strategy performs well even when no auxiliary variable is available. Therefore, it is advisable to use STACK with an inclusive model-building strategy and highly correlated covariates (where available) to make predictions in the presence of missing data. Alternatively, model averaging with non-overlapping variable sets can be used if an auxiliary variable is available.
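
For readers unfamiliar with the RR approach named above (Rubin's rules, per the keywords), the following is a minimal sketch of how a single parameter estimate might be pooled across m imputed data sets. It shows only the standard pooling formulas and is not taken from the thesis: the function name pool_rubin and the numbers in the example are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets with Rubin's rules.

    estimates: the m point estimates, one per imputed data set
    variances: the m squared standard errors of those estimates
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)      # pooled point estimate
    w = mean(variances)          # within-imputation variance
    b = variance(estimates)      # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b      # total variance
    return q_bar, t


# Hypothetical values for illustration only (not results from the thesis).
estimates = [1.0, 2.0, 3.0]
variances = [0.5, 0.5, 0.5]
pooled, total_var = pool_rubin(estimates, variances)
print(pooled, total_var)
```

The total variance is larger than the average within-imputation variance whenever the estimates differ between imputed data sets, which is how the pooled standard error reflects the uncertainty due to the missing values.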

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Keywords: model selection, model averaging, missing data, STACK, M-STACK, Rubin's rules, MICE, inclusive strategy, auxiliary variable
Subjects: H Social Sciences > HA Statistics
Q Science > QA Mathematics
Colleges/Schools: College of Science and Engineering > School of Mathematics and Statistics > Statistics
Funder's Name: UNSPECIFIED
Supervisor's Name: McColl, Professor John
Date of Award: 2015
Unique ID: glathesis:2015-6834
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 10 Nov 2015 12:27
Last Modified: 20 Nov 2015 16:28
