Yi, Zixuan (2026) Effective multi-modal and multi-domain graph-based recommender systems via self-supervised learning. PhD thesis, University of Glasgow.
Abstract
The rapid expansion of digital applications and services has created an urgent need for sophisticated top-K recommendation models capable of matching users with items aligned to their interests. Graph-based recommender systems have become a cornerstone for tackling this information overload. However, their effectiveness is often hindered by fundamental limitations, including vulnerability to noisy interactions, the so-called over-smoothing problem, and an inability to sufficiently leverage rich and complex information sources. In particular, many existing models fall short in mining and integrating supervisory signals from diverse modalities (e.g., text, images) and domains (e.g., books, movies). This thesis argues that the top-K recommendation effectiveness of graph-based recommender systems can be enhanced through novel Self-Supervised Learning (SSL) techniques designed to explicitly mine and integrate multi-modality and multi-domain signals.
This thesis aims to address existing research gaps in the literature by proposing a suite of novel graph-based recommender techniques using the SSL paradigm. It addresses three primary challenges: (i) enhancing the fundamental expressive power of graph neural architectures to alleviate the over-smoothing problem (i.e., representations becoming indistinguishable after repeated graph aggregation operations) and to overcome the inability of existing approaches to denoise noisy implicit interactions in top-K recommendation; (ii) enhancing the modality encoding capabilities to address the insufficient modality fusion and the isolated multi-modal recommendation pipeline in top-K multi-modal recommendation; and (iii) improving the knowledge transfer capability and generalisability of graph-based recommender systems for top-K recommendation in multi-domain settings.
To enhance the expressiveness of graph neural architectures, we propose two novel graph-based recommender models at different architectural levels. First, Positional Graph Contrastive Learning (PGCL) operates at the message-passing level and integrates graph positional encodings (e.g., Laplacian eigenvectors) into a new graph message-passing function. This architecture, trained with an SSL loss, generates highly distinguishable user and item embeddings, thereby improving the expressive power of graph-based recommender systems while alleviating the over-smoothing problem. Second, the Diffusion Graph Transformer (DiffGT) leverages a new graph transformer model at the overall architecture level to denoise the noisy implicit user-item interactions within a diffusion process. In particular, DiffGT applies an SSL loss to maximise the agreement between the original user/item embeddings and the denoised user/item embeddings during model training, thereby improving the model's expressiveness and yielding improved top-K recommendation performance.
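As a rough illustration of what Laplacian-eigenvector positional encodings look like on a user-item graph, the sketch below computes them for a toy bipartite interaction matrix. It is not PGCL's implementation: the function name `laplacian_positional_encoding`, the toy matrix `interactions`, and the choice of the symmetric normalised Laplacian are all assumptions made for illustration.

```python
import numpy as np

def laplacian_positional_encoding(adj: np.ndarray, k: int) -> np.ndarray:
    """Hypothetical helper: k eigenvectors of the symmetric normalised
    Laplacian, taken from the smallest non-trivial eigenvalues upwards."""
    deg = adj.sum(axis=1)
    # Guard against isolated nodes (degree 0) when taking the inverse sqrt.
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    lap = np.eye(adj.shape[0]) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order with orthonormal eigenvectors.
    _, eigvecs = np.linalg.eigh(lap)
    # Skip the first eigenvector, which is trivial for a connected graph.
    return eigvecs[:, 1:k + 1]

# Toy data (assumed): 2 users x 3 items, 1 = observed implicit interaction.
interactions = np.array([[1.0, 1.0, 0.0],
                         [0.0, 1.0, 1.0]])
n_users, n_items = interactions.shape

# Bipartite adjacency over users followed by items.
adj = np.zeros((n_users + n_items, n_users + n_items))
adj[:n_users, n_users:] = interactions
adj[n_users:, :n_users] = interactions.T

pe = laplacian_positional_encoding(adj, k=2)  # one row per user/item node
```

Each row of `pe` is a position vector for one node. How such vectors enter the message-passing function is specific to the thesis and is not reproduced here.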
To address the insufficient modality encoding issue, we first focus on enhancing the modality fusion within multi-modal graph-based recommender systems. In particular, we propose the Multi-modal Graph Contrastive Learning (MMGCL) model, which introduces modality-specific graph augmentations as positive samples and a modality-aware negative sampling strategy in an SSL loss. This enhances the modality fusion process, unlike existing approaches, which often treat each modality with equal importance. We further enhance the modality fusion by using Large Multi-modal (LMM) encoders. We show that by using SSL to enable deep alignment across modalities, these LMM encoders significantly outperform the shallow alignment methods common in existing graph-based recommender systems. Beyond modality fusion, two longstanding isolation problems, namely an isolated feature extraction process and an isolated modality encoding process, impede the effective mining of self-supervised signals across multiple modalities and remain unresolved in existing multi-modal graph-based recommender systems. To address these isolation problems, we introduce the Unified multi-modal Graph Transformer (UGT), a novel end-to-end architecture that uses SSL to unify the multi-modal representations into the same semantic space, thereby enhancing top-K multi-modal recommendation within a unified graph transformer architecture.
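For readers unfamiliar with contrastive SSL losses, the following is a minimal sketch of a generic InfoNCE-style objective with in-batch negatives. It is an assumption-laden illustration, not the loss used by MMGCL or UGT: the function name `info_nce`, the temperature value, and the random toy embeddings are hypothetical, and the thesis's modality-aware negative sampling is not modelled.

```python
import numpy as np

def info_nce(anchor: np.ndarray, positive: np.ndarray,
             temperature: float = 0.2) -> float:
    """Generic InfoNCE: row i of `positive` is the positive for row i of
    `anchor`; every other row in the batch acts as a negative."""
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature                # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 16))                  # e.g. one augmented view
view_b = view_a + 0.01 * rng.normal(size=(8, 16))  # a slightly perturbed view

loss_aligned = info_nce(view_a, view_b)
# Shifting the rows pairs each anchor with the wrong positive.
loss_mismatched = info_nce(view_a, np.roll(view_b, 1, axis=0))
```

The loss is low when the two views of the same node agree and high when the pairs are mismatched, which is the sense in which such an objective "maximises agreement" between views.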
Next, to address the insufficient domain transfer capabilities in existing cross-domain models, we introduce two novel graph-based approaches. The first, Personalised Graph Prompt-based Recommendation (PGPRec), is an ID-based approach that enables effective and parameter-efficient cross-domain knowledge transfer. Specifically, PGPRec first uses SSL to pre-train a graph encoder, ensuring that it learns high-quality and generalisable knowledge across domains. This knowledge is then effectively transferred from a single source domain to a target domain via personalised and item-wise graph prompts. To further enhance generalisation and reduce reliance on ID-based features, inspired by the model soup paradigm, we propose AdapterSoupRec, which leverages multi-modal large language models (MLLMs) to generate universal item representations. In particular, we use SSL and cross-entropy losses to enable MLLMs to generate highly generalisable representations and effective model configurations. These configurations are then combined via a weighted average (i.e., the ‘model soup’ technique) to create a more effective set of model parameters, thereby achieving improved top-K recommendation in a multi-domain setting.
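The weighted-average step of a model soup can be sketched as below, assuming each model configuration is a dictionary of same-shaped parameter arrays. The function name `weighted_soup`, the parameter key `"w"`, and the mixing weights are hypothetical. How AdapterSoupRec selects and weights its configurations is not described in this abstract.

```python
import numpy as np

def weighted_soup(param_sets, weights):
    """Average several parameter dictionaries key by key.
    Weights are normalised to sum to 1 before averaging."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    keys = param_sets[0].keys()
    return {k: sum(wi * ps[k] for wi, ps in zip(w, param_sets)) for k in keys}

# Two toy adapter configurations (assumed values).
adapter_a = {"w": np.array([0.0, 4.0])}
adapter_b = {"w": np.array([4.0, 0.0])}

# Weights 1:3 normalise to 0.25 and 0.75.
soup = weighted_soup([adapter_a, adapter_b], weights=[1.0, 3.0])
# soup["w"] is 0.25 * adapter_a["w"] + 0.75 * adapter_b["w"]
```

Because the averaging happens in parameter space, the result is a single set of parameters, so inference cost matches that of one model rather than an ensemble.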
Overall, this thesis contributes novel and effective SSL-enhanced graph-based recommender models that systematically address the challenges of limited architectural expressiveness, insufficient modality encoding, and insufficient domain transfer capabilities. Our extensive experimental evaluations on numerous real-world datasets validate the thesis statement, demonstrating that recommendation effectiveness is significantly enhanced by explicitly mining more supervision signals from diverse modalities and domains. These contributions make progress towards the development of effective graph-based recommender systems and pave the way for future research directions in top-K recommender systems.
| Item Type: | Thesis (PhD) |
|---|---|
| Qualification Level: | Doctoral |
| Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
| Colleges/Schools: | College of Science and Engineering > School of Computing Science |
| Supervisor's Name: | Ounis, Professor Iadh and Macdonald, Professor Craig |
| Date of Award: | 2026 |
| Depositing User: | Theses Team |
| Unique ID: | glathesis:2026-85783 |
| Copyright: | Copyright of this thesis is held by the author. |
| Date Deposited: | 27 Feb 2026 16:06 |
| Last Modified: | 27 Feb 2026 16:12 |
| Thesis DOI: | 10.5525/gla.thesis.85783 |
| URI: | https://theses.gla.ac.uk/id/eprint/85783 |