Aspects of generative and discriminative classifiers

Xue, Jinghao (2008) Aspects of generative and discriminative classifiers. PhD thesis, University of Glasgow.

In recent years, under the new terminology of generative and discriminative classifiers, research interest in classical statistical approaches to discriminant analysis has re-emerged in the machine learning community. In discriminant analysis, observations with measured features $\mathbf{x}$ are classified into classes labelled by a categorical variable $y$. {\em Generative classifiers}, also termed the sampling paradigm, such as normal-based discriminant analysis and the na\"{i}ve Bayes classifier, model the joint distribution $p(\mathbf{x}, y)$ of the measured features $\mathbf{x}$ and the class labels $y$, factorised in the form $p(\mathbf{x}|y)p(y)$, where $p(\mathbf{x}|y)$ is a data-generating process (DGP), and learn the model parameters by maximising the likelihood based on $p(\mathbf{x}|y)p(y)$. {\em Discriminative classifiers}, also termed the diagnostic paradigm, such as logistic regression, model the conditional distribution $p(y|\mathbf{x})$ of the class labels given the features, and learn the model parameters by maximising the conditional likelihood based on $p(y|\mathbf{x})$. To exploit the best of both worlds, one must first compare generative and discriminative classifiers and then combine them. In this thesis, we first performed empirical and simulation studies that extend and comment on a highly cited report~\citep{Ng:01}, which compared the na\"{i}ve Bayes classifier or normal-based linear discriminant analysis (LDA) with linear logistic regression (LLR). We then studied extensively two hybrid-learning techniques for combining generative and discriminative classifiers, namely the hybrid generative-discriminative algorithm~\citep{Raina:03} and the generative-discriminative tradeoff (GDT) approach~\citep{Bouchard:04}. Based on the results of these studies, we proposed a joint generative-discriminative modelling approach to classification.
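The contrast between the two paradigms can be made concrete with a minimal numerical sketch (simulated data only; the data, estimators and settings below are illustrative assumptions, not experiments from the thesis): a Gaussian na\"{i}ve Bayes classifier is fitted through the joint likelihood $p(\mathbf{x}|y)p(y)$, while a logistic regression is fitted by gradient ascent on the conditional likelihood $p(y|\mathbf{x})$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two Gaussian classes with unit diagonal covariance
n = 500
X0 = rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(n, 2))
X1 = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(n, 2))
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n), np.ones(n)])

def fit_naive_bayes(X, y):
    """Generative: estimate p(x|y) as independent per-feature Gaussians, plus priors p(y)."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0), len(Xc) / len(X))
    return params

def nb_predict(params, X):
    """Classify by the larger joint log-likelihood log p(x|y) + log p(y)."""
    scores = []
    for c in (0, 1):
        mu, var, prior = params[c]
        loglik = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
        scores.append(loglik + np.log(prior))
    return (scores[1] > scores[0]).astype(float)

def fit_logistic(X, y, lr=0.1, iters=2000):
    """Discriminative: maximise the conditional log-likelihood of p(y|x) by gradient ascent."""
    Xb = np.hstack([np.ones((len(X), 1)), X])  # add intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w += lr * Xb.T @ (y - p) / len(y)      # gradient of the mean log-likelihood
    return w

def lr_predict(w, X):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return (Xb @ w > 0).astype(float)

nb_params = fit_naive_bayes(X, y)
w = fit_logistic(X, y)
acc_nb = (nb_predict(nb_params, X) == y).mean()
acc_lr = (lr_predict(w, X) == y).mean()
```

On such well-separated, correctly specified Gaussian data both classifiers recover a near-optimal linear boundary; the interesting comparisons, as studied in this thesis, arise under model mis-specification and at small training-sample sizes.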
In addition, we extended our investigation to generative and discriminative hidden Markov models, latent-variable models for structured data. We also developed discriminative approaches for a specific application, that of histogram-based image thresholding. The contributions of this thesis are the following. First,~\citet{Ng:01} claimed that there exist two distinct regimes of performance between the generative and discriminative classifiers with regard to the training-set size; however, our empirical and simulation studies, as presented in Chapter \ref{ch:ng}, suggest that the existence of two such distinct regimes cannot be claimed reliably. In addition, for real-world datasets there is so far no theoretically correct, general criterion for choosing between the discriminative and the generative approaches to classification of an observation $\mathbf{x}$ into a class $y$; the choice depends on the relative confidence one has in the correctness of the specification of either $p(y|\mathbf{x})$ or $p(\mathbf{x}, y)$. This goes some way towards explaining why~\citet{Efron:75} and~\citet{ONeill:80} prefer LDA while other empirical studies may prefer LLR instead. Furthermore, we suggest that pairing LLR with either LDA assuming a common diagonal covariance matrix (LDA-$\Lambda$) or the na\"{i}ve Bayes classifier may not be a perfect match, and hence claims derived from comparisons between LDA-$\Lambda$ or the na\"{i}ve Bayes classifier and LLR may not generalise reliably to all generative and discriminative classifiers. Secondly, in Chapter \ref{ch:gdt}, we present the interpretation and asymptotic relative efficiency (ARE) of the GDT approach for linear and quadratic normal discrimination without model mis-specification, and compare its ARE with those of its generative and discriminative counterparts. The classification performance of the GDT is compared with those of LDA and LLR on simulated datasets.
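A convenient way to write such a generative-discriminative tradeoff as a single interpolated objective (the exact parameterisation used in~\citep{Bouchard:04} may differ in detail) is
\begin{equation*}
\hat{\theta}_{\lambda} \;=\; \arg\max_{\theta} \sum_{i=1}^{n} \Bigl[\, (1-\lambda)\,\log p(\mathbf{x}_i, y_i; \theta) \;+\; \lambda\,\log p(y_i \mid \mathbf{x}_i; \theta) \,\Bigr], \qquad \lambda \in [0, 1],
\end{equation*}
so that $\lambda = 0$ recovers the purely generative (joint-likelihood) estimator, $\lambda = 1$ recovers the purely discriminative (conditional-likelihood) estimator, and intermediate values of $\lambda$ trade off the two components.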
We argue that the GDT is a generative model integrating both discriminative and generative learning. It is therefore sensitive to mis-specification of the data-generating process and, in practice, its discriminative component may behave differently from a truly discriminative approach. Amongst the three approaches that we compare, the asymptotic efficiency of the GDT is lower than that of the generative approach when no model mis-specification occurs. In addition, without model mis-specification, LDA performs best; with model mis-specification, the GDT may perform best at an optimal tradeoff between its discriminative and generative components, and LLR, a truly discriminative classifier, in general performs well when the training-sample size is reasonably large. Thirdly, in Chapter \ref{ch:hyb}, we interpret the hybrid algorithm from three perspectives, namely the class-conditional probabilities, the class-posterior probabilities and the loss functions underlying the model. We suggest that the hybrid algorithm is by nature a generative model whose parameters are learnt through both generative and discriminative approaches, in the sense that it assumes a scaled data-generating process and uses scaled class-posterior probabilities to perform discrimination. Our suggestion also applies to its multi-class extension. In addition, using simulated and real-world data, we compare the performance of the normalised hybrid algorithm as a classifier with those of the na\"{i}ve Bayes classifier and LLR. In general, our simulation studies suggest the following: if the covariance matrices are diagonal, the na\"{i}ve Bayes classifier performs best; if the covariance matrices are full, LLR performs best. Our studies also suggest that the hybrid algorithm may perform worse than either the na\"{i}ve Bayes classifier or LLR alone.
Fourthly, based on our studies presented in Chapters \ref{ch:ng},~\ref{ch:gdt} and~\ref{ch:hyb}, we propose in Chapter \ref{ch:jgd} a joint generative-discriminative modelling (JGD) approach to classification, in which the variables are partitioned into two subsets on the basis of statistical tests of the DGP. Our JGD approach adopts statistical tests, such as normality tests, of the assumed DGP for each variable, to justify the use of generative approaches for the variables that satisfy the tests and of discriminative approaches for the other variables. Such a partition of the variables and combination of the generative and discriminative approaches are derived in a probabilistic rather than a heuristic way. We have concentrated on particular choices for the generative and discriminative components of our models, but the overall principle is quite general and can accommodate many other specific versions. Of course, we must ensure that the assumptions underlying the resulting generative classifiers can be tested statistically. Numerical results from real UCI and gene-expression data and from simulated data demonstrate promising performance of this new approach for practical application to both low- and high-dimensional data. Fifthly, in Chapter \ref{ch:hmm}, we study the assumption of ``mutual information independence'', which is used by~\citet{Zhou:05} to derive the so-called discriminative hidden Markov model (D-HMM). We suggest that one mutual information assumption (\ref{equ:dhmm:mi1}) results in the D-HMM, while another mutual information assumption (\ref{equ:ghmm2:mi1}) results in its generative counterpart, the G-HMM. However, in practice, whether or not the assumptions are reasonable and how the corresponding HMMs perform can be data-dependent; research efforts to explore an adaptive switching between, or combination of, these two models may be worthwhile.
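The variable-partitioning step can be sketched as follows. This is a deliberately crude stand-in for the statistical tests of the assumed DGP: a numpy-only sample-skewness check replaces a formal normality test, and the data, threshold and variable names are all illustrative assumptions rather than anything specified in the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_skewness(x):
    """Standardised third moment of a sample."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

def partition_variables(X, y, skew_threshold=0.5):
    """Route a variable to the generative block only if, within every class,
    its sample skewness is small (roughly symmetric, Gaussian-like);
    otherwise route it to the discriminative block."""
    generative, discriminative = [], []
    for j in range(X.shape[1]):
        gaussian_like = all(
            abs(sample_skewness(X[y == c, j])) < skew_threshold
            for c in np.unique(y)
        )
        (generative if gaussian_like else discriminative).append(j)
    return generative, discriminative

# Toy data: variable 0 is Gaussian within each class,
# variable 1 is exponential (heavily skewed) within each class.
n = 500
y = np.concatenate([np.zeros(n), np.ones(n)])
x_gauss = np.concatenate([rng.normal(0.0, 1.0, n), rng.normal(2.0, 1.0, n)])
x_expon = np.concatenate([rng.exponential(1.0, n), rng.exponential(2.0, n)])
X = np.column_stack([x_gauss, x_expon])

gen_vars, disc_vars = partition_variables(X, y)
```

In a fuller implementation the skewness check would be replaced by a proper normality test with a controlled significance level, and the generative and discriminative components would then be fitted to their respective variable subsets and combined probabilistically, as the chapter describes.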
Meanwhile, we suggest that the so-called output-dependent HMMs can be represented in a state-dependent manner, and vice versa, essentially by application of Bayes' theorem. Finally, in Chapter \ref{ch:img}, we present discriminative approaches to histogram-based image thresholding, in which the optimal threshold is derived by maximising the likelihood based on the conditional distribution $p(y|x)$ of $y$, the class indicator of a grey level $x$, given $x$. The discriminative approaches can be regarded as discriminative extensions of traditional generative approaches to thresholding, such as Otsu's method~\citep{Otsu:79} and Kittler and Illingworth's minimum error thresholding (MET)~\citep{Kittler:86}. As illustrations, we develop discriminative versions of Otsu's method and MET, using discriminant functions corresponding to the original methods to represent $p(y|x)$. These two discriminative thresholding approaches are compared with their original counterparts in selecting thresholds for a variety of histograms of mixture distributions. The results show that the discriminative Otsu method consistently provides relatively good performance. Although its parameter estimation is of higher computational complexity than that of the original methods, its robustness and model simplicity justify the discriminative Otsu method in scenarios where the risk of model mis-specification is high and the extra computation is affordable.
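For reference, the generative baseline that the discriminative version extends, classical Otsu thresholding, can be sketched in a few lines of numpy. This implements only the standard between-class-variance criterion on a toy bimodal histogram; the thesis's discriminative variant, which fits $p(y|x)$ directly, is not reproduced here.

```python
import numpy as np

def otsu_threshold(hist):
    """Classical (generative) Otsu: choose the grey level t that maximises
    the between-class variance of the histogram split at t."""
    p = hist.astype(float) / hist.sum()
    levels = np.arange(len(hist))
    omega = np.cumsum(p)             # class-0 probability mass up to each level
    mu = np.cumsum(p * levels)       # first moment up to each level
    mu_total = mu[-1]
    # Between-class variance; guard against empty classes at the extremes.
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b = np.nan_to_num(sigma_b)
    return int(np.argmax(sigma_b))

# Toy bimodal grey-level histogram: two well-separated Gaussian modes.
rng = np.random.default_rng(2)
samples = np.concatenate([
    rng.normal(60.0, 10.0, 5000),
    rng.normal(180.0, 10.0, 5000),
]).astype(int)
hist = np.bincount(np.clip(samples, 0, 255), minlength=256)
t = otsu_threshold(hist)
```

With equal-weight, equal-variance modes the selected threshold falls near the midpoint between them; the discriminative extensions studied in the chapter are motivated precisely by histograms where such generative assumptions break down.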