Machine learning for the prediction of protein-protein interactions

Reyes, José Antonio (2010) Machine learning for the prediction of protein-protein interactions. PhD thesis, University of Glasgow.

Printed Thesis Information: https://eleanor.lib.gla.ac.uk/record=b2705960

Abstract

The prediction of protein-protein interactions (PPI) has recently emerged as an important problem in bioinformatics and systems biology, because most essential cellular processes are mediated by these kinds of interactions. In this thesis we focus on the prediction of co-complex interactions, where the objective is to identify and characterize protein pairs which are members of the same protein complex.

Although high-throughput methods for the direct identification of PPI have been developed in recent years, it has been demonstrated that the data obtained by these methods are often incomplete and suffer from high false-positive and false-negative rates. To deal with this technology-driven problem, several machine learning techniques have been employed to improve the accuracy and reliability of predicted interacting protein pairs, demonstrating that the combined use of direct and indirect biological insights can improve the quality of predictive PPI models. This task has commonly been viewed as a binary classification problem. However, the nature of the data creates two major problems. Firstly, the classes are imbalanced: the number of positive examples (pairs of proteins which really interact) is much smaller than the number of negative ones. Secondly, the selection of negative examples is based on unreliable assumptions, which could introduce bias in the classification results.
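The effect of class imbalance on evaluation can be illustrated with a small arithmetic sketch. The counts below (100 interacting pairs against 9,900 non-interacting pairs) are hypothetical and are not taken from the thesis; the helper `binary_metrics` is likewise an illustrative name.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical data set: 100 interacting pairs, 9,900 non-interacting pairs.
n_pos, n_neg = 100, 9900

# A trivial classifier that labels every pair as "non-interacting".
always_negative = binary_metrics(tp=0, fp=0, tn=n_neg, fn=n_pos)
print(always_negative)  # (0.99, 0.0, 0.0)
```

Under these assumed counts the trivial classifier reaches 99% accuracy while recovering no interactions at all, which is why accuracy alone is a poor guide when positives are rare.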

The first part of this dissertation addresses these drawbacks by exploring the use of one-class classification (OCC) methods for the task of PPI prediction. OCC methods use examples of just one class to generate a predictive model, which is consequently independent of the kind of negative examples selected; additionally, these approaches are known to cope with imbalanced class problems. We designed and carried out a performance evaluation study of several OCC methods for this task. We also undertook a comparative performance evaluation with several conventional learning techniques.
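The idea of learning from positive examples only can be sketched with a deliberately simple one-class model. This is an illustrative stand-in, not one of the OCC methods evaluated in the thesis: the class `CentroidOneClass`, the two-dimensional feature vectors, and the quantile parameter are all assumptions made for the example.

```python
import math


class CentroidOneClass:
    """Toy one-class model: accept points that lie close to the centroid
    of the positive training examples. No negative examples are used."""

    def __init__(self, quantile=0.9):
        self.quantile = quantile

    def fit(self, positives):
        n, dim = len(positives), len(positives[0])
        self.centroid = [sum(x[j] for x in positives) / n for j in range(dim)]
        distances = sorted(self._distance(x) for x in positives)
        # Threshold taken at an approximate quantile of the training distances.
        index = min(n - 1, int(self.quantile * n))
        self.threshold = distances[index]
        return self

    def _distance(self, x):
        return math.dist(x, self.centroid)

    def predict(self, x):
        """Return True if x is predicted to belong to the positive class."""
        return self._distance(x) <= self.threshold


# Hypothetical feature vectors for known interacting pairs,
# e.g. (co-expression score, shared-annotation score).
positives = [(0.8, 0.9), (0.7, 0.8), (0.9, 0.85), (0.75, 0.95), (0.85, 0.8)]
model = CentroidOneClass(quantile=0.9).fit(positives)

print(model.predict((0.82, 0.88)))  # close to the positives
print(model.predict((0.1, 0.2)))    # far from the positives
```

Because the model is fitted without any negative set, its decision boundary cannot be biased by how negatives are chosen. In practice a library implementation such as scikit-learn's `OneClassSVM` would be a more realistic choice than this sketch.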

Furthermore, we draw attention to a new potential drawback which appears to affect the performance of PPI prediction. This is associated with the composition of the positive gold standard set, which contains a high proportion of examples associated with interactions of ribosomal proteins. We demonstrate that this situation indeed biases the classification task, resulting in an over-optimistic performance result. The prediction of non-ribosomal PPI is a much more difficult task. We investigate some strategies to improve the performance of this subtask, integrating new kinds of data as well as combining diverse classification models generated from different sets of data.

In this thesis, we also undertook a preliminary validation study of the new PPI predicted using OCC methods. To achieve this, we focus on three main aspects: searching the literature for biological evidence that supports the new predictions; analysing the properties of the predicted PPI networks; and identifying highly interconnected groups of proteins which can be associated with new protein complexes.
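One common way to look for highly interconnected groups in an interaction network is to extract its k-core. The sketch below is only an illustration of that general idea and is not necessarily the procedure used in the thesis; the function `k_core` and the protein labels `P1` to `P6` are hypothetical.

```python
def k_core(edges, k):
    """Return the nodes of the k-core: the largest subgraph in which every
    node has at least k neighbours inside the subgraph."""
    neighbours = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions in this sketch
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Repeatedly remove nodes with fewer than k remaining neighbours.
    changed = True
    while changed:
        changed = False
        for node in list(neighbours):
            if len(neighbours[node]) < k:
                for other in neighbours.pop(node):
                    neighbours[other].discard(node)
                changed = True
    return set(neighbours)


# Hypothetical predicted interactions: P1..P4 are fully connected,
# while P5 and P6 hang off the group as a sparse chain.
predicted_edges = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
    ("P2", "P3"), ("P2", "P4"), ("P3", "P4"),
    ("P4", "P5"), ("P5", "P6"),
]

dense_group = k_core(predicted_edges, k=3)
print(sorted(dense_group))  # ['P1', 'P2', 'P3', 'P4']
```

In this toy network the 3-core isolates the four mutually interacting proteins, which is the kind of dense group that might be examined as a candidate complex.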

Finally, this thesis explores a slightly different area, related to the prediction of PPI types. This is associated with the classification of PPI structures (complexes) contained in the Protein Data Bank (PDB) database according to their function and binding affinity. Given the relatively small number of crystallized protein complexes available, it is not possible at the moment to link these results with the ones obtained previously for the prediction of PPI complexes. However, this could become possible in the near future, when more PPI structures become available.

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Keywords: machine learning, protein-protein interactions, bioinformatics, systems biology
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Q Science > Q Science (General)
Colleges/Schools: College of Science and Engineering > School of Computing Science
Supervisor's Name: Gilbert, Dr. David
Date of Award: 2010
Depositing User: Mr. Jose Antonio Reyes
Unique ID: glathesis:2010-1474
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 22 Jan 2010
Last Modified: 10 Dec 2012 13:40
URI: https://theses.gla.ac.uk/id/eprint/1474
