Uncovering the mutational landscape of SARS-CoV-2 using machine learning methods

Lamb, Kieran Daniel (2024) Uncovering the mutational landscape of SARS-CoV-2 using machine learning methods. PhD thesis, University of Glasgow.

Abstract

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the causative pathogen behind the coronavirus disease 2019 (COVID-19) pandemic. Following its emergence in Wuhan, in the Hubei province of China, SARS-CoV-2 infected millions of people around the world, and the pandemic has since become one of the deadliest on record. As part of the pandemic response, an unprecedented number of viral genomes were sequenced to produce the world's largest dataset of viral sequencing data. In this thesis, we use machine learning methods to discover more about the mutational landscape of the virus from this sequencing data. We use mutational signature analysis to identify the mutational processes that generate the mutations SARS-CoV-2 uses to adapt and evolve over time. We show that these processes are dynamic and shift in activity throughout the pandemic. We show that different variants of concern (VOCs) show different levels of mutational process activity, which may relate to differences in the intrinsic virology of these lineages. We next show how large language models (LLMs) that have traditionally been used in natural language processing (NLP) can be used to produce meaningful representations of viral proteins. These representations can distinguish between proteins from different virus VOCs, generate metrics that can evaluate every possible mutation in the protein, and even predict putative evolutionary trajectories that correlate with the real emergence dates. We also show that model logits identify epistatic interactions disturbed by mutations and identify positions of structural conservation. Much of this analysis can be completed using a single sequence, and it can also be applied in a surveillance scenario in which the representations of new sequences are compared against those of currently circulating or prior lineages.
Finally, we show how identifying mutational patterns using co-occurrence highlights interesting pairs of mutations that may be selected for by the virus and its selective environment. Using mutational contexts, language models and the virus phylogeny, we can investigate how these mutations might benefit the virus and improve our understanding of how linked mutations appear in circulating viruses. In summary, this thesis shows how techniques from machine learning can help us learn more about the evolutionary processes, dynamics and effects of changing viral proteins using genomic sequence data.

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Q Science > QR Microbiology
Colleges/Schools: College of Medical Veterinary and Life Sciences
Funder's Name: Medical Research Council (MRC)
Supervisor's Name: Robertson, Professor David L. and Yuan, Dr. Ke
Date of Award: 2024
Depositing User: Theses Team
Unique ID: glathesis:2024-84637
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 24 Oct 2024 16:07
Last Modified: 30 Oct 2024 09:55
Thesis DOI: 10.5525/gla.thesis.84637
URI: https://theses.gla.ac.uk/id/eprint/84637