Hierarchical hidden Markov models with applications to BiSulfite-sequencing data

Ghosh, Tusharkanti (2018) Hierarchical hidden Markov models with applications to BiSulfite-sequencing data. PhD thesis, University of Glasgow.

Printed Thesis Information: https://eleanor.lib.gla.ac.uk/record=b3308224

Abstract

DNA methylation is an epigenetic modification with significant roles in various biological processes such as gene expression and cellular proliferation. Aberrant DNA methylation patterns, compared to those of normal cells, have been associated with a large number of human malignancies and potential cancer symptoms. In DNA methylation studies, an important objective is to detect differences between two groups under distinct biological conditions, for example between cancer/ageing and normal cells. BiSulfite sequencing (BS-seq) is currently the gold standard for experimentally measuring genome-wide DNA methylation. Recent advances in BS-seq technologies have made DNA methylation profiles at single-base-pair resolution more accurate in terms of genome coverage. The main objective of my thesis is to identify differential patterns of DNA methylation between proliferating and senescent cells. For efficient detection of differential methylation patterns, this thesis adopts a Bayesian latent variable modelling approach. One such class of models is the hidden Markov model (HMM), which can detect the underlying latent (hidden) structure in the data. In this thesis, I propose a family of Bayesian hierarchical HMMs for identifying differentially methylated cytosines (DMCs) and differentially methylated regions (DMRs) from BS-seq data; DMCs and DMRs serve as important indicators for better understanding cancer and other related diseases. I introduce HMMmethState, a model-based hierarchical Bayesian technique for identifying DMCs from BS-seq data. My novel HMMmethState method implements hierarchical HMMs to account for spatial dependence among the CpG sites over genomic positions in BS-seq methylation data. In particular, this thesis is concerned with developing hierarchical HMMs for the differential methylation analysis of BS-seq data within a Bayesian framework.
In these models, aberrant DNA methylation is driven by two latent states, a differentially methylated state and a similarly methylated state, which can be interpreted as the methylation status of CpG sites and which evolve over genomic positions as a first-order Markov chain. I first design a (homogeneous) discrete-index hierarchical HMM in which the methylated counts, given the methylation status of the CpG sites, follow a Beta-Binomial emission distribution specific to the methylation state. However, this model does not incorporate the genomic positional variations among the CpG sites, so I develop a (non-homogeneous) continuous-index hierarchical HMM in which the transition probabilities between methylation statuses depend on the genomic positions of the CpG sites. The Beta-Binomial emission model, however, does not take into account the correlation between the methylated counts of the proliferating and senescent cells, which has been observed in the BS-seq data analysis. I therefore develop a hierarchical Normal-logit Binomial emission model that induces correlation between the methylated counts of the proliferating and senescent cells. Furthermore, to perform parameter estimation for my models, I implement efficient Markov Chain Monte Carlo (MCMC) based algorithms. In this thesis, I provide an extensive study of model comparison and model adequacy for all the models using Bayesian model checking. In addition, I show the performance of all the models using Receiver Operating Characteristic (ROC) curves. I illustrate the models by fitting them to a large BS-seq dataset and apply model selection criteria to the dataset to select the best model. I also compare the performance of my methods with that of existing methods for detecting DMCs. I demonstrate how the HMMmethState-based algorithms outperform the existing methods in simulation studies in terms of ROC curves.
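As a rough illustration of the model structure described above, the following Python sketch evaluates the log-likelihood of a two-state HMM over CpG sites with the forward algorithm, using Beta-Binomial emissions and transition probabilities that depend on the distance between sites. It is not the HMMmethState implementation: the thesis estimates parameters by MCMC within a hierarchical Bayesian model, whereas this sketch treats all parameters as fixed and known. The function names, the layout of the emission parameters (one Beta-Binomial per cell group and state), and the continuous-time two-state parameterisation of the transitions (rates rate_01 and rate_10) are assumptions made for illustration only.

```python
import math


def log_beta_binomial(y, n, a, b):
    """Log-pmf of a Beta-Binomial(n, a, b) at y methylated reads out of n."""
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    log_beta_num = math.lgamma(y + a) + math.lgamma(n - y + b) - math.lgamma(n + a + b)
    log_beta_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_choose + log_beta_num - log_beta_den


def transition_matrix(distance, rate_01, rate_10):
    """Transition probabilities of a two-state continuous-time Markov chain
    over a gap of `distance` bases (assumed parameterisation, rates > 0)."""
    total = rate_01 + rate_10
    moved = -math.expm1(-total * distance)  # 1 - exp(-total * distance)
    p01 = rate_01 / total * moved
    p10 = rate_10 / total * moved
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a short list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(sites, emission_params, rate_01, rate_10,
                           initial=(0.5, 0.5)):
    """Forward algorithm in log space.

    sites: list of (position, y1, n1, y2, n2), sorted by distinct positions,
           where group 1 and group 2 are the two cell conditions.
    emission_params[state]: ((a1, b1), (a2, b2)), hypothetical Beta-Binomial
           parameters for the two groups in that latent state
           (0 = similarly methylated, 1 = differentially methylated).
    """
    def log_emission(state, y1, n1, y2, n2):
        (a1, b1), (a2, b2) = emission_params[state]
        return (log_beta_binomial(y1, n1, a1, b1)
                + log_beta_binomial(y2, n2, a2, b2))

    prev_pos, *counts = sites[0]
    log_alpha = [math.log(initial[s]) + log_emission(s, *counts) for s in (0, 1)]
    for pos, *counts in sites[1:]:
        trans = transition_matrix(pos - prev_pos, rate_01, rate_10)
        log_alpha = [
            logsumexp([log_alpha[r] + math.log(trans[r][s]) for r in (0, 1)])
            + log_emission(s, *counts)
            for s in (0, 1)
        ]
        prev_pos = pos
    return logsumexp(log_alpha)


# Usage with made-up numbers: three CpG sites, two cell groups.
example_params = {0: ((2.0, 2.0), (2.0, 2.0)), 1: ((5.0, 1.0), (1.0, 5.0))}
example_sites = [(100, 8, 10, 7, 10), (130, 9, 10, 1, 10), (900, 5, 10, 5, 10)]
example_loglik = forward_log_likelihood(example_sites, example_params, 0.01, 0.03)
```

In this parameterisation, sites that lie close together are likely to share a latent state, while widely separated sites become nearly independent. Setting the distance between consecutive sites to a constant recovers a homogeneous, discrete-index chain.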
I present the DMRs obtained by applying the proposed HMMmethState method to the BS-seq datasets. The results show that the hierarchical HMMs can be applied under unconditioned settings to identify DMCs in high-throughput BS-seq data. The predicted DMCs can also help in understanding the phenotypic changes associated with human ageing.

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Subjects: H Social Sciences > HA Statistics
Colleges/Schools: College of Science and Engineering > School of Mathematics and Statistics > Statistics
Supervisor's Name: Gupta, Dr Mayetri and Macaulay, Dr Vincent
Date of Award: 2018
Depositing User: Dr Tusharkanti Ghosh
Unique ID: glathesis:2018-9036
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 04 May 2018 11:38
Last Modified: 28 May 2018 10:27
URI: https://theses.gla.ac.uk/id/eprint/9036
