Implementation, adaptation and evaluation of statistical analysis techniques for next generation sequencing data

Fulton, Rachael Louise (2010) Implementation, adaptation and evaluation of statistical analysis techniques for next generation sequencing data. MSc(R) thesis, University of Glasgow.

Deep sequencing is a new high‐throughput sequencing technology intended to lower the cost of DNA sequencing beyond what was previously thought possible using standard methods. Analysis of sequencing data such as SAGE (serial analysis of gene expression) and microarray data has been a popular area of research in recent years. The continuing development of these technologies and the variety of data they produce have underlined the need for efficient analysis techniques.

Various methods for the analysis of sequencing data have been developed in recent years, both for SAGE data, which are discrete, and for microarray data, which are continuous. These include simple analysis techniques, hierarchical clustering techniques (both Bayesian and Frequentist) and various methods for finding differential expression between groups of samples. These methods range from simple comparison techniques to more complicated computational methods, which attempt to isolate the more subtle dissimilarities in the data.

This thesis applies various analysis techniques to unpublished deep sequencing data. The analysis was approached in three parts. The first examined clustering techniques previously developed for SAGE data, namely the Poisson C / Poisson L algorithm and a Bayesian hierarchical clustering algorithm, and evaluated and adapted them for use on the deep sequencing data. The second examined methods for finding differentially expressed tags in the dataset. These tags are of interest because it is believed that finding tags which are significantly up- or down-regulated across groups of samples could be useful in the treatment of certain diseases.

Finally, due to the lack of published data, a simulation study was constructed in which various models were used to simulate data, so that the techniques mentioned above could be assessed on data with pre‐defined sample groupings and differentially expressed tags. The main goals of the simulation study were the validation of the analysis techniques previously discussed and estimation of false positive rates for this type of large, sparse dataset.
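As a rough illustration of this kind of simulation study, the sketch below generates a tag-by-sample count matrix with known sample groups and a known set of differentially expressed tags. It is not the thesis's simulation code: the function names, the parameters (`base_rate`, `fold_change`), the choice to raise rates only in the second group, and the use of zero truncation for the truncated Poisson are all assumptions made for the example.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) variate using Knuth's multiplication method.

    Suitable for small to moderate rates only (exp(-rate) must not underflow).
    """
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_zero_truncated_poisson(rate, rng):
    """Draw from a Poisson conditioned on being positive, by rejection.

    Zero truncation is an assumption; the source does not specify the truncation.
    """
    while True:
        value = sample_poisson(rate, rng)
        if value > 0:
            return value


def simulate_counts(n_tags, group_sizes, n_de, fold_change, base_rate,
                    seed=0, truncated=False):
    """Simulate counts with pre-defined groups and differentially expressed tags.

    Returns (counts, labels, de_tags), where counts[tag][sample] is a count,
    labels[sample] is the group index of each sample, and de_tags is the set
    of tag indices whose rate is multiplied by fold_change in group 1.
    """
    rng = random.Random(seed)
    labels = [g for g, size in enumerate(group_sizes) for _ in range(size)]
    de_tags = set(rng.sample(range(n_tags), n_de))
    draw = sample_zero_truncated_poisson if truncated else sample_poisson
    counts = []
    for tag in range(n_tags):
        row = []
        for group in labels:
            rate = base_rate
            if tag in de_tags and group == 1:
                rate *= fold_change  # hypothetical effect: only group 1 differs
            row.append(draw(rate, rng))
        counts.append(row)
    return counts, labels, de_tags
```

Because the groupings and the differentially expressed tags are known in advance, the output of a clustering or differential expression method run on `counts` can be compared against `labels` and `de_tags`, for example to estimate a false positive rate.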

The Bayesian algorithm yielded surprising results, producing no hierarchy and so suggesting no evidence of clustering. However, promising results were obtained for the adapted Poisson C / Poisson L algorithm, applied using various models to fit the data and various measures of similarity. Further investigation is needed to confirm whether it is suitable for the clustering of deep sequencing data in general, especially where there are three or more groups of interest.

From the results of the differential expression analysis it can be deduced that the overdispersed log linear method is the most reliable method for the analysis of differential expression, particularly when compared to simple tests such as the 2‐sample t‐test and the Wilcoxon signed rank test. This deduction is based on the overlap of its results with those of other methods and on the more reasonable number of differentially expressed tags detected, in contrast to the number detected using the adapted log ratio method. However, none of this can be confirmed, as no information was known about the tags in either dataset.

The success of the Poisson C / Poisson L algorithm on both the Poisson and Truncated Poisson simulated datasets suggests that the method of simulation is acceptable for the assessment of clustering algorithms developed for use on sequencing data. However, evaluation of the differential expression analysis performed on the simulated data indicates that further work is needed on the method of simulation to increase its reliability.

The algorithms presented can be adapted for use on any form of discrete data. From the work done here, there is evidence that the adapted Poisson C / Poisson L algorithm is a promising technique for the analysis of deep sequencing data.

Item Type: Thesis (MSc(R))
Qualification Level: Masters
Keywords: Deep sequencing, simulating data, clustering algorithms
Subjects: Q Science > QH Natural history > QH426 Genetics
H Social Sciences > HA Statistics
Colleges/Schools: College of Science and Engineering > School of Mathematics and Statistics > Statistics
Supervisor's Name: Khanin, Dr. Raya
Date of Award: 2010
Depositing User: Miss Rachael Fulton
Unique ID: glathesis:2010-1718
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 13 Apr 2010
Last Modified: 10 Dec 2012 13:45
