High-dimensional Bayesian variable selection with applications to genome-wide association studies

Bangchang, Kannat Na (2024) High-dimensional Bayesian variable selection with applications to genome-wide association studies. MSc(R) thesis, University of Glasgow.


Abstract

Genome-wide association studies (GWAS) are a type of study that aims to detect genetic variation that may be linked to a disease. In variable selection, a major challenge arises when the number of covariates is huge compared to the number of observations. Even if proper priors allow variable selection to be carried out via Bayesian methods, an extremely high number of covariates (e.g. many thousands or even millions) compared to the number of observations (e.g. a few hundred) raises two major problems: the huge computational time needed to analyse each dataset, and the sparsity in the number of covariates associated with the response. Using data splitting for variable selection in this setting can lead to a significant reduction in computational time.

GWAS typically contain many thousands of covariates (i.e. DNA variants), which makes variable selection an exceptionally computationally intensive process. Additionally, with large datasets, the MCMC sampler often becomes inefficient in terms of CPU time and fails to converge. We investigated whether splitting the whole dataset into a number of small sub-datasets before running Bayesian variable selection (BVS) reduces the running time of the MCMC sampler and improves the mixing of the Markov chain. At the same time, we need to investigate the impact of data splitting on the properties and accuracy of the resulting model. When the data are split across columns (i.e. subsetting variables), some of the sub-datasets may not contain any of the covariates associated with the response.
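The column-wise splitting described above can be illustrated with a minimal sketch. This is not the thesis's implementation: the random partitioning scheme, the toy dimensions, the simulated genotype matrix and the helper name `split_columns` are all assumptions made for illustration. It shows how, with a sparse set of associated covariates, most sub-datasets end up containing none of them.

```python
import numpy as np

def split_columns(n_covariates, n_blocks, rng):
    """Randomly partition column indices 0..n_covariates-1 into n_blocks disjoint blocks.

    Hypothetical helper: a random partition is only one possible splitting scheme.
    """
    permuted = rng.permutation(n_covariates)
    return [np.sort(block) for block in np.array_split(permuted, n_blocks)]

rng = np.random.default_rng(0)
n_obs, n_covariates, n_blocks = 200, 5000, 50  # assumed toy dimensions

# Toy genotype matrix with entries 0/1/2 (assumed coding, simulated data).
X = rng.integers(0, 3, size=(n_obs, n_covariates)).astype(float)

# Assume only 5 covariates are truly associated with the response.
causal = rng.choice(n_covariates, size=5, replace=False)

blocks = split_columns(n_covariates, n_blocks, rng)
sub_datasets = [X[:, idx] for idx in blocks]  # one sub-dataset per block of columns

# Count the blocks that contain at least one associated covariate.
blocks_with_signal = sum(np.intersect1d(idx, causal).size > 0 for idx in blocks)
print(f"{blocks_with_signal} of {n_blocks} blocks contain an associated covariate")
```

Each sub-dataset keeps all observations but only a subset of the columns, so a variable selection run on it only ever sees the covariates in its own block.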

Hence, the covariates selected in each sub-dataset by Bayesian variable selection must be combined to determine the final set of associated covariates. This procedure could introduce biases, so we assessed how it affects the error in the estimation of regression coefficients and other parameters.
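One simple way to picture the combining step is sketched below. It assumes that each sub-dataset run yields a posterior inclusion probability per covariate and that a covariate is kept when that probability exceeds 0.5; both the 0.5 threshold and the helper name `combine_selections` are illustrative assumptions, not the rule used in the thesis, and the numbers are made up.

```python
import numpy as np

def combine_selections(blocks, inclusion_probs, threshold=0.5):
    """Map per-block selections back to global column indices and take their union.

    blocks: list of arrays of global column indices, one array per sub-dataset.
    inclusion_probs: list of arrays of posterior inclusion probabilities,
        aligned with `blocks` (assumed output of a per-block BVS run).
    threshold: assumed cut-off for declaring a covariate selected.
    """
    selected = [idx[pip > threshold] for idx, pip in zip(blocks, inclusion_probs)]
    return np.unique(np.concatenate(selected))

# Toy example: 9 covariates split into 3 blocks (made-up values).
blocks = [np.array([0, 3, 7]), np.array([1, 4, 8]), np.array([2, 5, 6])]
inclusion_probs = [
    np.array([0.90, 0.10, 0.20]),
    np.array([0.05, 0.60, 0.30]),
    np.array([0.10, 0.20, 0.10]),  # this block selects nothing
]

final_set = combine_selections(blocks, inclusion_probs)
print(final_set)  # [0 4]
```

Because each block is analysed without the covariates in the other blocks, the per-block inclusion probabilities are not those of a joint analysis, which is one route by which the combined set and any coefficients estimated from it could be biased.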

Finally, we applied this technique to a real GWAS dataset on heart disease from Prof. Sandosh Padmanabhan’s lab at Cardiovascular Sciences at Glasgow.

Item Type: Thesis (MSc(R))
Qualification Level: Masters
Additional Information: Supported by funding from the Ministry of Higher Education, Science, Research and Innovation, Royal Thai Government.
Subjects: H Social Sciences > HA Statistics
Q Science > QA Mathematics
Colleges/Schools: College of Science and Engineering > School of Mathematics and Statistics > Statistics
Supervisor's Name: Gupta, Professor Mayetri
Date of Award: 2024
Depositing User: Theses Team
Unique ID: glathesis:2024-84229
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 12 Apr 2024 10:42
Last Modified: 12 Apr 2024 10:42
Thesis DOI: 10.5525/gla.thesis.84229
URI: https://theses.gla.ac.uk/id/eprint/84229
