Gordon, Claire Ann
Investigating statistical approaches to handling missing data in the context of the Gateshead Millennium Study.
MSc(R) thesis, University of Glasgow.
A commonly occurring problem in all kinds of studies is that of
missing data. These missing values can occur for a number of
reasons, including equipment malfunctions and, more typically,
subjects recruited to a study not participating fully. In
particular, in a longitudinal study, one or
more of the repeated measurements on a subject might be missing.
The way in which missing values are dealt with depends on the data
analyst's experience with statistical techniques. The most common
way in which data analysts proceed is to use the complete case
analysis method, i.e. removing cases with missing values for any of
the variables and running the analysis on the remaining cases.
Although this method is very straightforward to implement and is
used by the vast majority of data analysts, it can lead to biased
results unless data are missing completely at random. Complete case
analysis can also dramatically reduce the sample size of the study, as
only those cases for which all variables are measured are included
in the analysis. Therefore the complete case analysis method is "not
generally recommended" (Diggle et al., 2002). Alternative approaches to
the complete case analysis method involve filling in (or imputing)
values for the incomplete
cases, making "more efficient use of the available data" (Schafer, 1997).
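As a rough illustration of the complete case analysis described above, the following minimal Python sketch drops every case with a missing value and shows how the sample shrinks. The records and the variable names (`weight_gain`, `feeding_score`) are hypothetical and are not taken from the Gateshead Millennium Study data.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"id": 1, "weight_gain": 0.4, "feeding_score": 3.0},
    {"id": 2, "weight_gain": None, "feeding_score": 2.0},
    {"id": 3, "weight_gain": -0.1, "feeding_score": None},
    {"id": 4, "weight_gain": 0.2, "feeding_score": 4.0},
]


def complete_cases(rows, variables):
    """Keep only the rows with an observed value for every listed variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


kept = complete_cases(records, ["weight_gain", "feeding_score"])

# Half of the toy sample is discarded, although each dropped case
# still carries one observed measurement.
print(f"{len(kept)} of {len(records)} cases retained")
```

In this toy example two of the four cases are discarded even though each still holds an observed value, which is the loss of information that imputation methods aim to avoid.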
The purpose of this thesis is to compare and contrast the results
obtained from analysing the relationship between growth and feeding
behaviour in the first year of life using the complete case analysis
and three imputation methods: single hot-decking, multiple
hot-decking and the EM algorithm. The data used in this research
come from the Gateshead Millennium Study, a prospective study of a
cohort of just over 1,000 babies. In practical terms, the purpose of
the work is to confirm the conclusions from the published
complete-case analysis. It is of more theoretical interest to
determine which imputation method
is the most appropriate for dealing with missing data in this study.
Chapter 1 provides an introduction to the problem of missing data
and how they may arise and a description of the Gateshead Millennium
Study data, to which all the missing data methods will be applied.
It concludes by giving the
aims of this thesis.
Chapter 2 provides an in-depth review of various missing data
approaches and indicates which characteristics of the missing data
have to be considered in order to determine which of these
approaches can be employed to deal with the missing values. Also in
Chapter 2, various aspects of the Gateshead Millennium Study data
are reviewed. Measures of growth and feeding behaviour in the first
year of life are described as these are important variables in the
Chapter 3 assesses how complete the Gateshead Millennium Study data
are by producing a detailed description of each of the questions in
each of the questionnaires. This is achieved by examining the Wave
Non-response and Item Non-response for each of the six questionnaires.
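The distinction between the two kinds of non-response can be sketched as follows, assuming the usual definitions: wave non-response means a whole questionnaire was not returned, and item non-response means a returned questionnaire has individual questions left unanswered. The data layout and the names `classify` and `responses` are assumptions made for this illustration only.

```python
def classify(questionnaire):
    """Classify one subject's response to one questionnaire.

    questionnaire is None if nothing was returned, otherwise a dict
    mapping question names to answers (None for an unanswered item).
    """
    if questionnaire is None:
        return "wave non-response"
    if any(answer is None for answer in questionnaire.values()):
        return "item non-response"
    return "complete"


# Hypothetical responses from three subjects to a single questionnaire.
responses = {
    "subject_a": {"q1": "yes", "q2": 3},
    "subject_b": {"q1": "no", "q2": None},
    "subject_c": None,
}

summary = {subject: classify(q) for subject, q in responses.items()}
print(summary)
```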
Chapter 4 recreates the results of the complete case analyses of
the relationship between development of growth and feeding in the
first year of life, which have already been performed and published
in the paper "How Does Maternal and Child Feeding Behaviour
Relate to Weight Gain and Failure to Thrive? Data From a Prospective
Birth Cohort" (Wright et al., 2006a). This chapter also gives insight as to
whether or not it is appropriate to assume that the missing data
mechanism is MCAR and
therefore whether or not it is reasonable to believe the results obtained from the complete case analysis.
Chapter 5 focusses on the various methods used to impute the missing
values in the Gateshead Millennium Study data. This chapter begins
by considering the EM algorithm. It gives details of how the EM
algorithm was performed and the results obtained. In addition to
the EM algorithm, this chapter also considers the procedures and
results for single imputation and multiple imputation by
hot-decking. This chapter concludes by comparing the results of
these methods to one another and also
to the complete case analysis results from Chapter 4.
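To make the idea of hot-decking concrete, here is a minimal sketch of a simple random hot-deck, in which each missing value is replaced by the observed value of a randomly chosen donor case. This is an assumption about the general technique, not a description of the procedure used in the thesis; the function name `hot_deck_impute` and the toy values are hypothetical. Multiple imputation is sketched by repeating the draw with different random seeds to produce several completed data sets.

```python
import random


def hot_deck_impute(values, rng):
    """Fill each missing value (None) with a randomly drawn observed value."""
    donors = [v for v in values if v is not None]
    if not donors:
        raise ValueError("no observed values available to act as donors")
    return [v if v is not None else rng.choice(donors) for v in values]


# Hypothetical variable with two missing entries.
observed = [3.0, None, 2.0, 4.0, None]

# Single imputation: one completed data set.
single = hot_deck_impute(observed, random.Random(0))

# Multiple imputation: several completed data sets, each analysed
# separately before the results are combined.
multiple = [hot_deck_impute(observed, random.Random(seed)) for seed in range(5)]

print(single)
print(len(multiple), "completed data sets")
```

Observed values are left unchanged; only the missing entries differ between the completed data sets, and that variation reflects the uncertainty introduced by imputation.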
Finally, Chapter 6 provides a summary of the results from the
various missing data methods applied and discusses various
alternative methods which could also have been performed.
||The questionnaires in Appendix A of this thesis are the intellectual property of the Gateshead Millennium Study Team.
||Missing Data, Missing Data Mechanisms, Complete Case Analysis, EM Algorithm, Hot-deck Imputation, Multiple Imputation, Gateshead Millennium Study
||H Social Sciences > HA Statistics
||College of Science and Engineering > School of Mathematics and Statistics > Statistics
||McColl, Prof. John
|Date of Award:
Miss Claire A Gordon
||Copyright of this thesis is held by the author.
||05 Jan 2011
||10 Dec 2012 13:53