Investigating statistical approaches to handling missing data in the context of the Gateshead Millennium Study

Gordon, Claire Ann (2010) Investigating statistical approaches to handling missing data in the context of the Gateshead Millennium Study. MSc(R) thesis, University of Glasgow.


A commonly occurring problem in all kinds of studies is that of missing data. Missing values can occur for a number of reasons, including equipment malfunctions and, more typically, subjects recruited to a study not participating fully. In particular, in a longitudinal study, one or more of the repeated measurements on a subject might be missing. The way in which missing values are dealt with depends on the data analyst's experience with statistical techniques. The most common approach is complete case analysis, i.e. removing cases with missing values for any of the variables and running the analysis on the remaining cases. Although this method is very straightforward to implement and is used by the vast majority of data analysts, it can lead to biased results unless data are missing completely at random. Complete case analysis can also dramatically reduce the sample size of the study, as only those cases for which all variables are measured are included in the analysis. The complete case analysis method is therefore "not generally recommended" (Diggle et al., 2002). Alternative approaches involve filling in (or imputing) values for the incomplete cases, making "more efficient use of the available data" (Schafer, 1997). The purpose of this thesis is to compare and contrast the results obtained from analysing the relationship between growth and feeding behaviour in the first year of life using complete case analysis and three imputation methods: single hot-decking, multiple hot-decking and the EM algorithm. The data used in this research come from the Gateshead Millennium Study, a prospective study of a cohort of just over 1,000 babies. In practical terms, the purpose of the work is to confirm the conclusions from the published complete case analysis.
It is of more theoretical interest to determine which imputation method is the most appropriate for dealing with missing data in this study. Chapter 1 introduces the problem of missing data and how they may arise, and describes the Gateshead Millennium Study data, to which all the missing data methods will be applied. It concludes by giving the aims of this thesis. Chapter 2 provides an in-depth review of various missing data approaches and indicates which characteristics of the missing data have to be considered in order to determine which of these approaches can be employed to deal with the missing values. Also in Chapter 2, various aspects of the Gateshead Millennium Study data are reviewed. Measures of growth and feeding behaviour in the first year of life are described, as these are important variables in the published analysis. Chapter 3 assesses how complete the Gateshead Millennium Study data are by producing a detailed description of each of the questions in each of the questionnaires. This is achieved by examining the wave non-response, section non-response and item non-response for each of the six questionnaires. Chapter 4 recreates the results of the complete case analyses of the relationship between growth and feeding in the first year of life, which have already been performed and published in the paper "How Does Maternal and Child Feeding Behaviour Relate to Weight Gain and Failure to Thrive? Data From a Prospective Birth Cohort" (Wright et al., 2006a). This chapter also gives insight into whether it is appropriate to assume that the missing data mechanism is missing completely at random (MCAR), and therefore whether it is reasonable to believe the results obtained from the complete case analysis. Chapter 5 focusses on the various methods used to impute the missing values in the Gateshead Millennium Study data. This chapter begins by considering the EM algorithm.
It gives details of how the EM algorithm was performed and the results obtained. In addition to the EM algorithm, this chapter also considers the procedures and results for single imputation and multiple imputation by hot-decking. This chapter concludes by comparing the results of these methods with one another and with the complete case analysis results from Chapter 4. Finally, Chapter 6 provides a summary of the results from the various missing data methods applied and discusses alternative methods that could also have been applied.
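
As an illustration only, the difference between complete case analysis and hot-deck imputation can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure used in the thesis: the toy records, the variable names (`sex`, `weight_gain`) and the function names are all hypothetical, donors are drawn at random from observed cases in the same class, and the multiple-imputation step simply averages the estimate over the imputed data sets. The EM algorithm is not shown.

```python
import random
from statistics import mean

# Hypothetical toy records (not Gateshead Millennium Study data).
# None marks a missing value.
RECORDS = [
    {"id": 1, "sex": "F", "weight_gain": 0.4},
    {"id": 2, "sex": "F", "weight_gain": None},
    {"id": 3, "sex": "F", "weight_gain": 0.6},
    {"id": 4, "sex": "M", "weight_gain": 0.5},
    {"id": 5, "sex": "M", "weight_gain": None},
    {"id": 6, "sex": "M", "weight_gain": 0.9},
]


def complete_cases(rows, variables):
    """Keep only the rows in which every listed variable is observed."""
    return [row for row in rows if all(row[v] is not None for v in variables)]


def hot_deck(rows, target, by, rng):
    """Single hot-deck imputation: replace each missing `target` value with
    a value drawn at random from observed cases in the same `by` class.
    Returns new rows; the input rows are left unchanged."""
    donors = {}
    for row in rows:
        if row[target] is not None:
            donors.setdefault(row[by], []).append(row[target])
    completed = []
    for row in rows:
        new_row = dict(row)
        pool = donors.get(new_row[by])
        if new_row[target] is None and pool:
            new_row[target] = rng.choice(pool)
        completed.append(new_row)
    return completed


def multiple_hot_deck_mean(rows, target, by, m=5, seed=0):
    """Multiple hot-deck imputation: impute m times, estimate the mean of
    `target` in each completed data set, and average the m estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(m):
        completed = hot_deck(rows, target, by, rng)
        # Values with no donor in their class stay missing and are skipped.
        estimates.append(
            mean(r[target] for r in completed if r[target] is not None)
        )
    return mean(estimates)


if __name__ == "__main__":
    cc = complete_cases(RECORDS, ["weight_gain"])
    print(len(cc), "of", len(RECORDS), "cases are complete")
    print("complete case mean:", mean(r["weight_gain"] for r in cc))
    print("multiple hot-deck mean:",
          multiple_hot_deck_mean(RECORDS, "weight_gain", "sex"))
```

The sketch shows why the two approaches can disagree: complete case analysis discards every record with a missing value, whereas hot-decking retains all records and borrows values from similar observed cases.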

Item Type: Thesis (MSc(R))
Qualification Level: Masters
Additional Information: The questionnaires in Appendix A of this thesis are the intellectual property of the Gateshead Millennium Study Team.
Keywords: Missing Data, Missing Data Mechanisms, Complete Case Analysis, EM Algorithm, Hot-deck Imputation, Multiple Imputation, Gateshead Millennium Study
Subjects: H Social Sciences > HA Statistics
Colleges/Schools: College of Science and Engineering > School of Mathematics and Statistics > Statistics
Funder's Name: UNSPECIFIED
Supervisor's Name: McColl, Prof. John
Date of Award: 2010
Depositing User: Miss Claire A Gordon
Unique ID: glathesis:2010-2312
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 05 Jan 2011
Last Modified: 10 Dec 2012 13:53
