Ge, Xuri (2024) Towards context-aware image semantic representation via modality relational reasoning and embedding. PhD thesis, University of Glasgow.
Full text available as: PDF (17MB)
Abstract
Representation learning is a machine learning technique aimed at automatically discovering the most informative features of raw data, transforming it into a representation that captures the characteristics essential to a specific task. Instead of relying on manual feature engineering, representation learning enables models to learn these features directly from the data, often leading to more accurate and robust performance across a wide range of artificial intelligence (AI) applications. In fields such as computer vision (CV) and natural language processing (NLP), representation learning helps models understand complex, high-dimensional data by focusing on meaningful patterns and structures within the input. This capability is fundamental to enabling deep learning models to generalize effectively and adapt to diverse challenges in real-world scenarios.
Unlike modalities such as text or speech, which carry explicit semantic expressions, image data is inherently complex and ambiguous, requiring the extraction of richer spatial and contextual information. In particular, the diversity and complexity of entities and their relations, together with the ambiguity of semantic expressions, make it challenging to capture and represent image features accurately. These difficulties arise both in unimodal visual representation learning and in multimodal joint representation learning that includes vision; consequently, effective visual representation learning demands more sophisticated techniques to overcome them and achieve robust performance.
This thesis is geared towards context-aware image semantic representation learning via modality relational reasoning and embedding methods. Our research aims to advance the understanding and methodology of combining contextual relationship information, from either a single visual modality or multiple joint modalities, to enhance visual semantic representations. Two tasks are studied in depth: unimodal facial action unit (FAU) recognition and multimodal image-sentence retrieval (ISR). We explore the effectiveness of various visual relational reasoning and embedding approaches in these two tasks. On the one hand, for FAU recognition, we explore relational reasoning and information transfer between different muscle regions to improve the final facial representations. We first propose a biLSTM-based implicit relational reasoning and embedding method with skip connections (Skip-BiLSTM) and verify the effectiveness of relational reasoning for face representation. We then encode explicit muscle relations into muscle features and propose a Graph Neural Network (GNN) model with local-global interactions to further enhance face representation capability. In our latest work, we introduce language-guided supervision for FAU recognition, which brings language-level local and global relational reasoning to face representation learning and ultimately achieves better AU recognition performance.
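To make the idea of region-level relational reasoning concrete, the sketch below shows, in PyTorch, one plausible way a bidirectional LSTM with a skip connection could pass information between per-region facial features before multi-label AU classification. It is only an illustrative sketch, not the author's Skip-BiLSTM implementation; the class name, feature dimensions, number of regions, and pooling choice are all assumptions.

```python
# Illustrative sketch only: BiLSTM-based relational reasoning over facial
# muscle-region features with a skip connection (assumed shapes and dimensions).
import torch
import torch.nn as nn

class SkipBiLSTMSketch(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, num_aus=12):
        super().__init__()
        # The BiLSTM propagates information between region features in both directions.
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        # One binary output per action unit over the fused representation.
        self.classifier = nn.Linear(feat_dim + 2 * hidden_dim, num_aus)

    def forward(self, region_feats):
        # region_feats: (batch, num_regions, feat_dim) local features per muscle region.
        relational, _ = self.bilstm(region_feats)               # (batch, num_regions, 2*hidden_dim)
        fused = torch.cat([region_feats, relational], dim=-1)   # skip connection keeps original features
        pooled = fused.mean(dim=1)                              # aggregate regions into a face-level feature
        return torch.sigmoid(self.classifier(pooled))           # multi-label AU probabilities

# Example with random tensors standing in for extracted region features.
faces = torch.randn(4, 8, 512)           # 4 faces, 8 muscle regions, 512-d features
print(SkipBiLSTMSketch()(faces).shape)   # torch.Size([4, 12])
```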
On the other hand, we explore the effectiveness of different multimodal relationship reasoning and encoding approaches to improve representation learning, especially for complex images, in multimodal interaction tasks. We first explore the contribution of a novel multimodal tree-structured relational reasoning and embedding approach to multimodal feature representation learning in the image-sentence retrieval task. We then introduce scene recognition for semantic relational preprocessing of complex image scenes and utilize graph convolutional networks (GCNs) for further relational reasoning and embedding (termed relationship-aware GCNs), which further improves multimodal feature representation capability, especially for complex visual representations. Finally, we explore the effectiveness of a semantic and spatial relation-based salient object enhancement approach within the visual modality for image-sentence retrieval during multimodal alignment optimization.
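Similarly, the following minimal sketch illustrates one graph-convolution step of the kind used for relationship-aware reasoning over detected image regions; the adjacency matrix is assumed to come from the relational preprocessing described above. Layer name, dimensions, and the adjacency normalisation are illustrative assumptions, not the thesis's relationship-aware GCN.

```python
# Illustrative sketch only: one GCN layer aggregating region features along
# relationship-derived edges (assumed shapes; adjacency assumed row-normalised).
import torch
import torch.nn as nn

class RelationGCNLayerSketch(nn.Module):
    def __init__(self, in_dim=1024, out_dim=1024):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj):
        # node_feats: (batch, num_regions, in_dim) features of detected regions.
        # adj: (batch, num_regions, num_regions) relationship-aware edge weights.
        messages = torch.bmm(adj, node_feats)      # aggregate features from related regions
        return torch.relu(self.linear(messages))   # transform and apply a non-linearity

# Example: 36 regions per image; a softmax over random scores stands in for relation weights.
x = torch.randn(2, 36, 1024)
adj = torch.softmax(torch.randn(2, 36, 36), dim=-1)
print(RelationGCNLayerSketch()(x, adj).shape)      # torch.Size([2, 36, 1024])
```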
Experimental results demonstrate that visual representation learning based on relational reasoning and embedding effectively improves visual feature representation and further enhances the semantic and relational expressiveness of fundamental visual features, for both unimodal FAU recognition and multimodal image-sentence retrieval tasks.
| Item Type: | Thesis (PhD) |
|---|---|
| Qualification Level: | Doctoral |
| Additional Information: | Supported by funding from the China Scholarship Council. |
| Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
| Colleges/Schools: | College of Science and Engineering > School of Computing Science |
| Funder's Name: | China Scholarship Council |
| Supervisor's Name: | Jose, Professor Joemon |
| Date of Award: | 2024 |
| Depositing User: | Theses Team |
| Unique ID: | glathesis:2024-84783 |
| Copyright: | Copyright of this thesis is held by the author. |
| Date Deposited: | 06 Jan 2025 14:41 |
| Last Modified: | 06 Jan 2025 16:01 |
| Thesis DOI: | 10.5525/gla.thesis.84783 |
| URI: | https://theses.gla.ac.uk/id/eprint/84783 |
| Related URLs: | |