Sequence data mining and characterisation of unclassified microbial diversity

Modha, Sejal (2022) Sequence data mining and characterisation of unclassified microbial diversity. PhD thesis, University of Glasgow.


In the last two decades, sequencing has become increasingly affordable and is now a routine tool for studying the microbial community of a given environment. Metagenomics has revolutionised the way microbes are identified and studied in this age of biological data science because it provides a relatively unbiased view of the composition of the microbial communities we interact with every day, which are integral to our ecosystem. These technological advances have led to an exponential growth of raw data repositories that save, distribute and archive metagenomic datasets. Because metagenomics presents the ultimate opportunity to capture, explore and identify uncultivated microbial genomic sequences, these datasets harbour a large proportion of unknown sequences that bear no similarity to the known sequences readily available in standard sequence data repositories. The aim of this thesis was to systematically catalogue, quantify and potentially characterise the unknown sequences embedded within metagenomic datasets. To this end, a comprehensive, portable, modular framework called UnXplore was developed to determine the proportion of unknown sequences in human microbiome datasets. UnXplore was applied to a range of human microbiomes and showed that, on average, 2% of assembled sequences were categorised as unknown, meaning that they did not bear any sequence similarity to known sequences. A third of the unknown sequences were shown to contain large open reading frames, indicating the coding potential and biological origin of the unknowns. Furthermore, a small proportion of these potentially coding sequences were shown to have functional similarities to known proteins, as they were deemed to contain known protein domain signatures. These results indicated that the unknown sequences captured through the UnXplore framework were not artefacts and were indeed of biological origin.
To test this formally, supervised k-mer-based machine learning models were devised, tested and validated. These models are currently distributed in a package called TetraPredX, which can accurately predict whether a sequence originated from a bacterium, archaeon, virus or plasmid. TetraPredX models were applied to the unknown sequence dataset and revealed that the majority of unknown sequences are of biological origin. Furthermore, TetraPredX results demonstrated that >70% of all long unknown sequences (i.e. >1 kb) are likely to be of viral origin, indicating an unexplored diversity of viruses that is yet to be fully characterised and classified. To catalogue the diversity of virus sequences in the human microbiome samples analysed here, an extensive virus discovery analysis was carried out on the contigs assembled through UnXplore. This helped to characterise a vast diversity of prokaryotic, eukaryotic and unclassified virus sequences captured in a range of human microbiomes. The results obtained here demonstrate the need to systematically interrogate metagenomic datasets to fully comprehend and compile the presence of both known and unknown uncultivated microbes within them. A comprehensive survey of metagenomic datasets carried out in this manner would provide a more complete picture of the known and unknown organisms that surround us.
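
As an illustration of the kind of feature a k-mer-based classifier might consume, the sketch below computes a normalised k-mer frequency vector for a nucleotide sequence. The abstract does not describe the TetraPredX feature set or its models; the function name kmer_frequencies, the default k=4 (suggested only by the package name) and the handling of ambiguous bases are assumptions made for this example, not the thesis implementation.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return normalised k-mer frequencies in a fixed A/C/G/T ordering.

    Hypothetical helper for illustration only; not part of TetraPredX.
    """
    seq = sequence.upper()
    # All 4**k possible k-mers, in lexicographic order over A, C, G, T.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k in the sequence.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing ambiguous bases (e.g. N) are excluded from the total.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


vector = kmer_frequencies("ACGTACGT")
print(len(vector))  # 256 features for k=4
```

A vector of this form, computed per contig, could in principle be passed to any supervised classifier; the choice of model and training data is outside what the abstract specifies.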

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Colleges/Schools: College of Medical, Veterinary and Life Sciences > School of Infection & Immunity > Centre for Virus Research
Funder's Name: Medical Research Council (MRC)
Supervisor's Name: Robertson, Prof. David L., Orton, Dr. Richard J. and Hughes, Dr. Joseph
Date of Award: 2022
Depositing User: Theses Team
Unique ID: glathesis:2022-83156
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 05 Oct 2022 15:41
Last Modified: 06 Oct 2022 08:01
Thesis DOI: 10.5525/gla.thesis.83156