Enhancing data representation in distributed machine learning

Aladwani, Tahani (2025) Enhancing data representation in distributed machine learning. PhD thesis, University of Glasgow.

Abstract

Distributed computing devices, ranging from smartphones to edge micro-servers and collectively referred to as clients, can gather and store diverse types of data, such as images and voice recordings. This wide array of data sources has the potential to significantly enhance the accuracy and robustness of Deep Learning (DL) models across a variety of tasks. However, this data is intrinsically heterogeneous because of differences in users’ preferences, lifestyles, locations, and other factors. Consequently, it requires comprehensive preprocessing (e.g., labeling, filtering, relevance assessment, and balancing) before it is suitable for building effective and reliable models. This thesis therefore explores the feasibility of conducting predictive analytics and model inference on edge computing (EC) systems when access to data is limited, and on clients’ devices through federated learning (FL) when direct access to data is entirely restricted.

The first part of this thesis focuses on reducing the data transmission rate between clients and EC servers by employing techniques such as data and task caching, identifying data overlaps, and evaluating task popularity. While this strategy can substantially reduce data offloading, it does not entirely eliminate dependence on third-party entities.

The second part of this thesis eliminates the dependency on third-party entities by implementing FL, where direct access to raw data is not possible. In this context, node and data selection are guided by predictions and model performance. The objective is to identify the most suitable nodes and relevant data for training by clustering nodes based on data characteristics and analyzing the overlap between query boundaries and cluster boundaries.
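
As a rough illustration of the overlap analysis mentioned above, the sketch below scores each node by how much of a query's feature range falls inside the bounding box of that node's data cluster. It is a minimal sketch under stated assumptions, not the method used in the thesis: the axis-aligned box representation, the names `overlap_ratio` and `rank_nodes`, and the volume-based score are all hypothetical choices made for this example.

```python
from typing import Dict, List, Sequence, Tuple

# Assumption: queries and clusters are both summarised as axis-aligned boxes,
# i.e. one (low, high) interval per feature. The abstract does not specify this.
Box = Sequence[Tuple[float, float]]


def overlap_ratio(query: Box, cluster: Box) -> float:
    """Return the fraction of the query box's volume that lies inside the cluster box."""
    shared_volume = 1.0
    query_volume = 1.0
    for (q_lo, q_hi), (c_lo, c_hi) in zip(query, cluster):
        width = q_hi - q_lo
        if width <= 0:
            raise ValueError("each query interval must have positive width")
        # Length of the intersection of the two intervals in this dimension.
        shared = min(q_hi, c_hi) - max(q_lo, c_lo)
        if shared <= 0:
            return 0.0  # disjoint in one dimension means no overlap at all
        shared_volume *= shared
        query_volume *= width
    return shared_volume / query_volume


def rank_nodes(query: Box, node_boxes: Dict[str, Box]) -> List[Tuple[str, float]]:
    """Rank nodes by overlap with the query, dropping nodes with no overlap."""
    scored = ((node, overlap_ratio(query, box)) for node, box in node_boxes.items())
    return sorted((item for item in scored if item[1] > 0), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    query = [(0.0, 10.0), (0.0, 10.0)]
    nodes = {
        "A": [(5.0, 15.0), (0.0, 10.0)],   # covers half of the query range
        "B": [(20.0, 30.0), (0.0, 10.0)],  # disjoint from the query
        "C": [(0.0, 10.0), (0.0, 10.0)],   # covers the query fully
    }
    print(rank_nodes(query, nodes))
```

Under these assumptions, nodes whose clusters do not intersect the query are excluded from training, and the remaining nodes are ordered by how much of the query they cover.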

The third part of this thesis introduces a mechanism designed to support classification tasks, such as image classification. These tasks present significant challenges when building models on distributed data, particularly due to issues like label shifting or missing labels across clients. To address these challenges, the proposed method mitigates the impact of imbalances across clients by employing multiple cluster-based meta-models, each tailored to specific label distributions.

The fourth part of this thesis introduces a two-phase federated self-learning framework, termed 2PFL, which addresses the challenges of extreme data scarcity and skewness when training classifiers over distributed labeled and unlabeled data. 2PFL can achieve high-performance models even when the labeled data amounts to only 10% to 20% of the available unlabeled data.

The concluding chapter underscores the importance of adaptable learning mechanisms that can respond to continuous changes in clients’ data volume, requirements, formats, and protection regulations. By incorporating the EC layer, we can alleviate concerns related to data privacy, reduce the volume of data that must be offloaded, expedite task execution, and facilitate the training of complex models.

For scenarios demanding stricter privacy-preserving measures, FL offers a viable solution, enabling multiple clients to collaboratively train models while adhering to user privacy protection, data security, and government regulations. However, due to the indirect access to data inherent in FL, several challenges must be addressed to ensure the development of high-performance models. These challenges include imbalanced data distribution across clients, partially labeled data, and fully unlabeled data, all of which are explored and demonstrated through experimental evaluations.

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Additional Information: Supported by a scholarship from the Saudi Ministry of Education.
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Q Science > QA Mathematics > QA76 Computer software
Colleges/Schools: College of Science and Engineering > School of Computing Science
Funder's Name: Saudi Ministry of Education
Supervisor's Name: Anagnostopoulos, Dr. Christos and Deligianni, Dr. Fani
Date of Award: 2025
Depositing User: Theses Team
Unique ID: glathesis:2025-85246
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 23 Jun 2025 15:46
Last Modified: 23 Jun 2025 15:49
Thesis DOI: 10.5525/gla.thesis.85246
URI: https://theses.gla.ac.uk/id/eprint/85246