Variable selection for supervised and semi-supervised mixtures of contaminated Gaussian distributions

Sanchez Gomez, Jorge Alfredo (2024) Variable selection for supervised and semi-supervised mixtures of contaminated Gaussian distributions. PhD thesis, University of Glasgow.

Abstract

Finite mixture models are versatile modelling tools for grouped data (Böhning, 2000; Fraley and Raftery, 1998b; McLachlan and Basford, 1988). This has led to their application in a variety of settings, such as classification and clustering problems. Like any model, they have assumptions and limitations. One common assumption is that there are no contaminated observations present in the data (or in the classes/clusters) (Barnett et al., 1994; Becker and Gather, 1999; Bock, 2002; Gallegos and Ritter, 2009).

A popular approach to dealing with contaminated observations is a finite mixture model with contaminated Gaussian component distributions (Punzo and McNicholas, 2016). Each contaminated Gaussian models the data with two components, one for non-contaminated and one for contaminated data. However, a limitation of the contaminated Gaussian mixture model is that, as a complex and usually highly parameterised model, it is not well suited to data with a very large number of variables. The purpose of the current thesis is to extend the applicability of this model to such data.
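
As an illustration of the two-component structure described above, the following is a minimal sketch of a single contaminated Gaussian density in its usual form, alpha * N(mean, Sigma) + (1 - alpha) * N(mean, eta * Sigma), where alpha is the proportion of non-contaminated points and eta > 1 inflates the covariance for contaminated points. The function names are hypothetical, and the diagonal covariance is an assumption made only to keep the example dependency-free; it is not taken from the thesis.

```python
import math


def gaussian_pdf_diag(x, mean, var):
    """Gaussian density with a diagonal covariance (assumption for brevity)."""
    log_pdf = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_pdf += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_pdf)


def contaminated_gaussian_pdf(x, mean, var, alpha, eta):
    """alpha * N(mean, var) + (1 - alpha) * N(mean, eta * var)."""
    good = alpha * gaussian_pdf_diag(x, mean, var)
    bad = (1 - alpha) * gaussian_pdf_diag(x, mean, [eta * v for v in var])
    return good + bad


def prob_good(x, mean, var, alpha, eta):
    """Posterior probability that x comes from the non-contaminated part."""
    good = alpha * gaussian_pdf_diag(x, mean, var)
    return good / contaminated_gaussian_pdf(x, mean, var, alpha, eta)


# Illustrative parameter values only (not from the thesis).
mean, var, alpha, eta = [0.0, 0.0], [1.0, 1.0], 0.9, 10.0
print(prob_good([0.0, 0.0], mean, var, alpha, eta))  # point at the centre
print(prob_good([5.0, 5.0], mean, var, alpha, eta))  # point far in the tail
```

A point is typically flagged as contaminated when this posterior probability falls below 0.5; in a mixture, the same calculation would be applied within each component.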

In order to preserve the original variables, rather than looking at projection methods, a greedy search (Meek, 1997) approach for variable selection is customized for a mixture of contaminated Gaussian distributions in the supervised and semi-supervised learning frameworks. The performance of this approach in both settings is explored on both simulated and plasmode data. The criterion used to choose variables is based on classification performance. The results show that incorporating this criterion in the tailored variable selection algorithm improved classification performance in most cases compared with using all variables (and often compared with using the set of variables known, in the simulations, to be the true class-separating variables). Nevertheless, the performance in identifying contaminated samples was more mixed. The proposed variable selection procedure removed some variables that do not contain class information but do contain contamination information. This hurt the ability of the model to identify contaminated samples, especially in cases where contamination is highly likely to be present in all variables. To summarize, the proposed variable selection algorithm seemed to perform well in both supervised and semi-supervised settings in terms of classification. However, the performance in predicting contaminated samples depends on the type of contamination and its association with class separation. There is a slight decrease in performance in predicting contaminated samples in cases where the contamination is present in all the variables.
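
The general idea of a greedy search driven by classification performance can be sketched as follows. This is a generic forward-only search, not the algorithm developed in the thesis: the function names are hypothetical, a simple nearest-centroid classifier stands in for the contaminated Gaussian mixture classifier, and accuracy is computed on the training data purely to keep the toy example short (in practice a cross-validated criterion would be the safer choice).

```python
def greedy_forward_selection(n_vars, score, min_gain=0.0):
    """Add, one at a time, the variable that most improves score(subset)."""
    selected, best = [], float("-inf")
    remaining = set(range(n_vars))
    while remaining:
        # Score every candidate variable when added to the current subset.
        cand = {j: score(selected + [j]) for j in remaining}
        j_best = max(sorted(cand), key=cand.get)  # ties go to lowest index
        if cand[j_best] - best <= min_gain:
            break  # no candidate improves the criterion, so stop
        selected.append(j_best)
        best = cand[j_best]
        remaining.remove(j_best)
    return selected, best


def nearest_centroid_accuracy(X, y, subset):
    """Stand-in criterion: training accuracy of a nearest-centroid classifier."""
    classes = sorted(set(y))
    centroids = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        centroids[c] = [sum(r[j] for r in rows) / len(rows) for j in subset]
    correct = 0
    for x, label in zip(X, y):
        def dist2(c):
            return sum((x[j] - m) ** 2 for j, m in zip(subset, centroids[c]))
        if min(classes, key=dist2) == label:
            correct += 1
    return correct / len(y)


# Toy data: variable 0 separates the classes, variables 1 and 2 are noise.
X = [
    [0.0, 1.0, 5.0], [0.2, 3.0, 1.0], [0.1, 2.0, 3.0],
    [1.0, 1.2, 4.8], [1.2, 2.9, 1.1], [0.9, 2.1, 3.2],
]
y = [0, 0, 0, 1, 1, 1]

selected, best = greedy_forward_selection(
    3, lambda s: nearest_centroid_accuracy(X, y, s)
)
print(selected, best)
```

Because the criterion here rewards class separation only, a variable that carries contamination information but no class information would never be added, which mirrors the trade-off reported in the abstract.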

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Additional Information: Supported by funding from the Ministry of Higher Education, Science, Technology and Innovation (SENESCYT).
Subjects: H Social Sciences > HA Statistics
Q Science > QA Mathematics
Colleges/Schools: College of Science and Engineering
Funder's Name: Ministry of Higher Education, Science, Technology and Innovation (SENESCYT)
Supervisor's Name: Dean, Dr. Nema and Neocleous, Dr. Tereza
Date of Award: 2024
Depositing User: Theses Team
Unique ID: glathesis:2024-84652
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 05 Nov 2024 09:42
Last Modified: 05 Nov 2024 09:44
Thesis DOI: 10.5525/gla.thesis.84652
URI: https://theses.gla.ac.uk/id/eprint/84652
