Xue, Jinghao (2008) Aspects of generative and discriminative classifiers. PhD thesis, University of Glasgow.
Abstract
In recent years, under the new terminology of generative and discriminative classifiers, research interest in classical statistical approaches to discriminant analysis has re-emerged in the machine learning community. In discriminant analysis, observations with measured features $\mathbf{x}$ are classified into classes labelled by a categorical variable $y$. {\em Generative classifiers}, also termed the sampling paradigm, such as normal-based discriminant analysis and the na\"{i}ve Bayes classifier, model the joint distribution $p(\mathbf{x}, y)$ of the measured features $\mathbf{x}$ and the class labels $y$, factorised in the form $p(\mathbf{x}|y)p(y)$, where $p(\mathbf{x}|y)$ is a data-generating process (DGP), and learn the model parameters through maximisation of the likelihood with respect to $p(\mathbf{x}|y)p(y)$. {\em Discriminative classifiers}, also termed the diagnostic paradigm, such as logistic regression, model the conditional distribution $p(y|\mathbf{x})$ of the class labels given the features, and learn the model parameters through maximising the conditional likelihood based on $p(y|\mathbf{x})$. In order to exploit the best of both worlds, it is necessary to first compare generative and discriminative classifiers and then combine them. In this thesis, we first performed some empirical and simulation studies to extend and comment on a highly-cited report~\citep{Ng:01}, which compared the na\"{i}ve Bayes classifier or normal-based linear discriminant analysis (LDA) with linear logistic regression (LLR). Then we studied extensively two hybrid-learning techniques, namely the hybrid generative-discriminative algorithm~\citep{Raina:03} and the generative-discriminative tradeoff (GDT) approach~\citep{Bouchard:04}, for combining the generative and discriminative classifiers. Based on our results from these studies, we proposed a joint generative-discriminative modelling approach to classification.
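As a compact restatement of the two learning principles described above, assume a training sample $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ and a parameter vector $\theta$ (both symbols are notation introduced here for illustration and are not taken from the abstract). The generative and discriminative estimates can then be sketched as
\[
\hat{\theta}_{\mathrm{gen}} = \arg\max_{\theta} \sum_{i=1}^{n} \left[ \log p(\mathbf{x}_i | y_i; \theta) + \log p(y_i; \theta) \right],
\qquad
\hat{\theta}_{\mathrm{dis}} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(y_i | \mathbf{x}_i; \theta).
\]
In both cases a new observation $\mathbf{x}$ is typically assigned to the class $\arg\max_{y} p(y|\mathbf{x})$; a generative classifier obtains this posterior through Bayes' theorem, $p(y|\mathbf{x}) \propto p(\mathbf{x}|y)p(y)$, whereas a discriminative classifier models it directly.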
In addition, we extended our investigation to generative and discriminative hidden Markov models, the latent variable models for structured data. We also developed discriminative approaches for a specific application, that of histogram-based image thresholding. The contributions of this thesis are the following. First,~\citet{Ng:01} claimed that there exist two distinct regimes of performance between the generative and discriminative classifiers with regard to the training-set size; however, our empirical and simulation studies, as presented in Chapter \ref{ch:ng}, suggest that the claimed existence of two distinct regimes is not so reliable. In addition, for real-world datasets, so far there is no theoretically correct, general criterion for choosing between the discriminative and the generative approaches to classification of an observation $\mathbf{x}$ into a class $y$; the choice depends on the relative confidence one has in the correctness of the specification of either $p(y|\mathbf{x})$ or $p(\mathbf{x}, y)$. This may to some extent explain why~\citet{Efron:75} and~\citet{ONeill:80} prefer LDA but other empirical studies may prefer LLR instead. Furthermore, we suggest that the pairing of either LDA assuming a common diagonal covariance matrix (LDA-$\Lambda$) or the na\"{i}ve Bayes classifier with LLR may not be perfect, and hence it may not be reliable to generalise any claim derived from the comparison between LDA-$\Lambda$ or the na\"{i}ve Bayes classifier and LLR to all generative and discriminative classifiers. Secondly, in Chapter \ref{ch:gdt}, we present the interpretation and asymptotic relative efficiency (ARE) of the GDT approach for linear and quadratic normal discrimination without model misspecification, and compare its ARE with those of its generative and discriminative counterparts. The classification performance of the GDT is compared with those of LDA and LLR on simulated datasets.
We argue that the GDT is a generative model integrating both discriminative and generative learning. It is therefore sensitive to model misspecification of the data-generating process and, in practice, its discriminative component may behave differently from a truly discriminative approach. Amongst the three approaches that we compare, the asymptotic efficiency of the GDT is lower than that of the generative approach when no model misspecification occurs. In addition, without model misspecification, LDA performs the best; with model misspecification, the GDT may perform the best at an optimal tradeoff between its discriminative and generative components, and LLR, a truly discriminative classifier, in general performs well when the training-sample size is reasonably large. Thirdly, in Chapter \ref{ch:hyb}, we interpret the hybrid algorithm from three perspectives, namely class-conditional probabilities, class-posterior probabilities and loss functions underlying the model. We suggest that the hybrid algorithm is by nature a generative model with its parameters learnt through both generative and discriminative approaches, in the sense that it assumes a scaled data-generation process and uses scaled class-posterior probabilities to perform discrimination. Our suggestion can also be applied to its multi-class extension. In addition, using simulated and real-world data, we compare the performance of the normalised hybrid algorithm as a classifier with that of the na\"{i}ve Bayes classifier and LLR. Our simulation studies suggest in general the following: if the covariance matrices are diagonal matrices, the na\"{i}ve Bayes classifier performs the best; if the covariance matrices are full matrices, LLR performs the best. Our studies also suggest that the hybrid algorithm may provide worse performance than either the na\"{i}ve Bayes classifier or LLR alone.
Fourthly, based on our studies presented in Chapters \ref{ch:ng},~\ref{ch:gdt} and~\ref{ch:hyb}, we propose in Chapter \ref{ch:jgd} a joint generative-discriminative modelling (JGD) approach to classification, by partitioning variables into two subsets based on statistical tests of the DGP. Our JGD approach adopts statistical tests, such as normality tests, of the assumed DGP for each variable to justify the use of generative approaches for the variables which satisfy the tests and of discriminative approaches for other variables. Such a partition of variables and a combination of generative and discriminative approaches are derived in a probabilistic rather than a heuristic way. We have concentrated on particular choices for the generative and discriminative components of our models, but the overall principle is quite general and can accommodate many other special versions. Of course, we must ensure that the assumptions underlying the resulting generative classifiers can be tested statistically. Numerical results from real UCI and gene-expression data and from simulated data demonstrate promising performance of this new approach for practical application to both low- and high-dimensional data. Fifthly, in Chapter \ref{ch:hmm}, we study the assumption of ``mutual information independence'', which is used by~\citet{Zhou:05} for deriving the so-called discriminative hidden Markov model (DHMM). We suggest that the mutual information assumption (\ref{equ:dhmm:mi1}) results in the DHMM, while another mutual information assumption (\ref{equ:ghmm2:mi1}) results in its generative counterpart, the GHMM. However, in practice, whether or not the assumptions are reasonable and how the corresponding HMMs perform can be data-dependent; research efforts to explore an adaptive switching between or combination of these two models may be worthwhile.
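The variable-partitioning step of the JGD approach described above can be illustrated with a minimal sketch. It is not the procedure used in the thesis: the choice of the Jarque-Bera test, the rule that a variable must pass within every class, the significance level `alpha`, and the names `jarque_bera_pvalue` and `partition_variables` are all assumptions made here for illustration.

```python
import math


def jarque_bera_pvalue(values):
    """Asymptotic p-value of the Jarque-Bera normality test (assumed test choice)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0.0:
        return 0.0  # a constant variable is treated as non-normal
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Under normality JB is asymptotically chi-squared with 2 degrees of
    # freedom, whose survival function is exp(-x / 2).
    return math.exp(-jb / 2.0)


def partition_variables(X, y, alpha=0.05):
    """Split variable indices into (generative, discriminative) subsets.

    Assumed rule: a variable joins the generative subset only if normality
    is not rejected at level alpha within every class.
    """
    n_vars = len(X[0])
    classes = sorted(set(y))
    generative, discriminative = [], []
    for j in range(n_vars):
        passes = all(
            jarque_bera_pvalue([row[j] for row, c in zip(X, y) if c == k]) >= alpha
            for k in classes
        )
        (generative if passes else discriminative).append(j)
    return generative, discriminative


# Hypothetical two-class data: variable 0 looks Gaussian, variable 1 is skewed.
from statistics import NormalDist

n = 200
grid = [(i + 0.5) / n for i in range(n)]
normal_scores = [NormalDist().inv_cdf(u) for u in grid]
skewed_scores = [-math.log(1.0 - u) for u in grid]
X = [[z + shift, e] for shift in (0.0, 2.0) for z, e in zip(normal_scores, skewed_scores)]
y = [k for k in (0, 1) for _ in range(n)]
print(partition_variables(X, y))
```

In such a sketch the variables in the first subset would be handled by a generative model and the remainder by a discriminative one; any other test of the assumed DGP could be substituted for the Jarque-Bera test.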
Meanwhile, we suggest that the so-called output-dependent HMMs could be represented in a state-dependent manner, and vice versa, essentially by application of Bayes' theorem. Finally, in Chapter \ref{ch:img}, we present discriminative approaches to histogram-based image thresholding, in which the optimal threshold is derived from the maximum likelihood based on the conditional distribution $p(y|x)$ of $y$, the class indicator of a grey level $x$, given $x$. The discriminative approaches can be regarded as discriminative extensions of the traditional generative approaches to thresholding, such as Otsu's method~\citep{Otsu:79} and Kittler and Illingworth's minimum error thresholding (MET)~\citep{Kittler:86}. As illustrations, we develop discriminative versions of Otsu's method and MET by using discriminant functions corresponding to the original methods to represent $p(y|x)$. These two discriminative thresholding approaches are compared with their original counterparts on selecting thresholds for a variety of histograms of mixture distributions. Results show that the discriminative Otsu method consistently provides relatively good performance. Although it has higher computational complexity than the original methods in parameter estimation, its robustness and model simplicity can justify the discriminative Otsu method for scenarios in which the risk of model misspecification is high and the computation is not demanding.
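For orientation, the traditional (generative) Otsu method that serves as a baseline above can be sketched as follows. This is the classical between-class-variance criterion, not the discriminative version developed in the thesis; the function names and the synthetic histogram are hypothetical and introduced only for illustration.

```python
import math


def otsu_threshold(hist):
    """Return the grey level t that maximises the between-class variance.

    hist[g] is the number of pixels with grey level g; pixels with
    grey level <= t form one class and the remaining pixels the other.
    """
    total = float(sum(hist))
    overall_mean = sum(g * h for g, h in enumerate(hist)) / total

    best_t, best_var = 0, -1.0
    w0 = 0.0  # proportion of pixels with grey level <= t
    m0 = 0.0  # cumulative first moment up to t: sum over g <= t of g * p(g)
    for t, h in enumerate(hist):
        p = h / total
        w0 += p
        m0 += t * p
        w1 = 1.0 - w0
        if w0 <= 0.0 or w1 <= 0.0:
            continue  # one class is empty, so t is not a valid split
        between_var = (overall_mean * w0 - m0) ** 2 / (w0 * w1)
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def gaussian_mixture_histogram(levels=256):
    """Hypothetical test histogram: two equal Gaussian modes at 60 and 180."""
    def bump(g, mean, sd):
        return math.exp(-0.5 * ((g - mean) / sd) ** 2)
    return [1000.0 * (bump(g, 60, 10) + bump(g, 180, 10)) for g in range(levels)]


threshold = otsu_threshold(gaussian_mixture_histogram())
print(threshold)  # a grey level between the two modes
```

The search is a single pass over the grey levels, which reflects the low cost of parameter estimation in the original method that the abstract contrasts with its discriminative extension.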
Item Type:  Thesis (PhD) 

Qualification Level:  Doctoral 
Subjects:  H Social Sciences > HA Statistics; Q Science > QA Mathematics 
Colleges/Schools:  College of Science and Engineering > School of Mathematics and Statistics > Statistics 
Supervisor's Name:  Titterington, Professor D.M. 
Date of Award:  2008 
Depositing User:  Mr Jinghao Xue 
Unique ID:  glathesis:2008-272 
Copyright:  Copyright of this thesis is held by the author. 
Date Deposited:  10 Jun 2008 
Last Modified:  10 Dec 2012 13:17 
URI:  http://theses.gla.ac.uk/id/eprint/272 