Algorithms for viral haplotype reconstruction and bacterial metagenomics: resolving fine-scale variation in next generation sequencing data

Schirmer, Melanie (2014) Algorithms for viral haplotype reconstruction and bacterial metagenomics: resolving fine-scale variation in next generation sequencing data. PhD thesis, University of Glasgow.

Abstract

The discovery of DNA has been one of the biggest catalysts in genomic research. Sequencing has enabled us to access the wealth of information encoded in DNA and has provided the basis for ground-breaking achievements such as the first complete human genome sequence. Furthermore, it has tremendously advanced our understanding of life-threatening genetic disorders and of bacterial and viral infections. With the recent advent of next generation sequencing (NGS) technologies, sequencing has become accessible to the majority of researchers, and metagenomic sequencing has become widely available. However, to realise the true potential of NGS, sophisticated and tailor-made bioinformatic programs are essential to translate the collected data into meaningful information.

My thesis explored the potential of resolving fine-scale variation in NGS data. Identifying and correcting artificial fine-scale variation, in the form of biases and errors, is imperative in order to draw valid conclusions. Furthermore, resolving natural fine-scale variation, in the form of single nucleotide polymorphisms (SNPs) and closely related species or strains, is critical for the development of effective treatments and the characterisation of diseases.

In recent years, Illumina has emerged as the global market leader in DNA sequencing. However, the biases and errors associated with this high-throughput sequencing technology are still poorly understood, which has precluded the development of effective noise removal algorithms. In addition, many programs were not designed for Illumina data or metagenomic sequencing. Therefore, a better understanding of the idiosyncrasies encountered in Illumina data is essential, and programs must be tested and benchmarked on realistic and reliable in silico data sets to reveal not only their true capacities but also their limitations.

I conducted the largest in vivo study to date of Illumina error profiles in combination with state-of-the-art library preparation methods. For the first time, a direct connection between experimental design factors and systematic errors was established, providing detailed insight into the nature of Illumina errors. Further, I tested various error removal techniques and developed a sophisticated Illumina amplicon noise removal algorithm, enabling researchers to choose optimal processing strategies for their particular data sets.

In addition, I devised several simulation tools that accurately reflect artificial and natural fine-scale variation. These include a flexible and efficient read simulation program, the only one that can directly reflect the impact of experimental design factors, and a program that simulates the evolution of a virus into a quasi-species. These programs formed the basis for two comprehensive benchmarking studies that revealed the capacities and limitations of viral haplotype reconstruction programs and taxonomic classification programs, respectively. My work furthers our knowledge of Illumina sequencing errors and will facilitate more accurate and effective analyses of sequencing data sets.

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Keywords: Bioinformatics, DNA sequencing, amplicons, metagenomics, next generation sequencing, Illumina, error profiles, viral haplotype reconstruction, Dirichlet process mixture model
Subjects: Q Science > QA Mathematics
Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Q Science > QR Microbiology
Q Science > QR Microbiology > QR355 Virology
T Technology > TD Environmental technology. Sanitary engineering
Colleges/Schools: College of Science and Engineering > School of Engineering
Funder's Name: UNSPECIFIED
Supervisor's Name: Quince, Dr. Christopher and Sloan, Professor William T.
Date of Award: 2014
Depositing User: Melanie Schirmer
Unique ID: glathesis:2014-5627
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 22 Oct 2014 07:46
Last Modified: 22 Oct 2014 07:48
URI: http://theses.gla.ac.uk/id/eprint/5627
