Wandy, Joe (2017) Unsupervised Bayesian explorations of mass spectrometry data. PhD thesis, University of Glasgow.
Abstract
In recent years, large-scale, untargeted studies of the compounds that serve as workers in the cell (proteins) and the small molecules involved in essential life-sustaining chemical processes (metabolites) have provided insights into a wide array of fields, such as medical diagnostics, drug discovery, personalised medicine and many others. Measurements in such studies are routinely performed using liquid chromatography mass spectrometry (LC-MS) instruments. From these measurements, we obtain a set of peaks having mass-to-charge, retention time (RT) and intensity values. Before further analysis is possible, the raw LC-MS data has to be processed in a data pre-processing pipeline. In the alignment step of the pipeline, peaks from multiple LC-MS measurements have to be matched. In the identification step, the identities of the unknown compounds in the sample that generate the observed peaks have to be assigned. Using tandem mass spectrometry, fragmentation peaks characteristic of a compound can be obtained and used to help establish the identity of the compound. Alignment and identification are challenging because the true identities of the entire set of compounds in the sample are unknown, and a single compound can produce many observed peaks, each with a potential drift in its retention time value. These observed peaks are not independent as they can be explained as being generated by the same compound.
The aim of this thesis is to introduce methods to group these related peaks and to use these groupings to improve alignment and assist in identification during data pre-processing. Firstly, we introduce a generative model to group related peaks by their retention time. This information is used to influence direct-matching alignment, bringing related peak groups closer during matching. Investigations using benchmark datasets reveal that improved alignment performance is obtained from this approach. Next, we also consider mass information in the grouping process, resulting in PrecursorCluster, a model that performs the grouping of related peaks in metabolomics by their explainable mass relationships, RT and intensity values. Through a second-stage process that matches these related peak groups, peak alignment is produced. Experiments on benchmark datasets show that an improved alignment performance is obtained, while uncertainties in matched peaksets can also be extracted from the method. In the next section, we expand upon this two-stage method and introduce HDPAlign, a model that performs the clustering of related peaks within and across multiple LC-MS runs at once. This allows for matched peaksets and their respective uncertainties to be naturally extracted from the model. Finally, we look at fragmentation peaks used for identification and introduce MS2LDA, a topic model to group related fragmentation features. These groups of related fragmentation features potentially correspond to substructures shared by metabolites and can be used to assist data interpretation during identification. This final section corresponds to a work in progress and points to many interesting avenues for future research.
Item Type: | Thesis (PhD) |
---|---|
Qualification Level: | Doctoral |
Keywords: | bioinformatics, metabolomics, machine learning, Bayesian inference |
Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science; Q Science > QH Natural history > QH301 Biology |
Colleges/Schools: | College of Science and Engineering > School of Computing Science |
Supervisor's Name: | Rogers, Dr. Simon |
Date of Award: | 2017 |
Depositing User: | Joe Wandy |
Unique ID: | glathesis:2017-7928 |
Copyright: | Copyright of this thesis is held by the author. |
Date Deposited: | 13 Feb 2017 13:49 |
Last Modified: | 16 Mar 2017 09:11 |
URI: | https://theses.gla.ac.uk/id/eprint/7928 |