Researching an enhanced multimodal learning framework for improved inter-modal and intra-modal alignment

Long, Zijun (2024) Researching an enhanced multimodal learning framework for improved inter-modal and intra-modal alignment. PhD thesis, University of Glasgow.

Full text available as:
PDF: 2024LongPhD.pdf (34MB)

Abstract

In recent decades, machine learning research has predominantly focused on single-modal data. However, the emergence of multimodal data, such as images or videos accompanied by text, particularly on social media platforms, has underscored the importance of advancing multimodal learning. This thesis centers on multimodal learning, exploring ways to enhance the performance of multimodal models—specifically those utilizing vision and language modalities. It aims to improve the understanding and integration of multimodal data, thereby boosting performance in downstream tasks such as crisis response, robotics, cross-modal retrieval, and recommendation.

In this thesis, we argue that enhancing shallow inter-modal and intra-modal alignment in existing multimodal approaches can improve performance across different tasks by enabling deeper alignment. To address this, we introduce a novel multimodal learning framework, named MCA, designed to improve multimodal learning performance while maintaining flexibility across various downstream tasks. The framework comprises three core components: Mixture-of-Modality-Experts (MoME), Contrastive Learning Techniques, and Adapter Methods, each offering unique functionalities.

Firstly, the Mixture-of-Modality-Experts (MoME) component is designed to manage a diverse range of input modalities and improve inter-modal alignment. Recent years have seen a significant shift towards multimodal learning, yet many existing models are mere amalgamations of single-modal models, using fusion layers to merge separate vision and language models. This approach often leads to shallow alignment and can compromise the effectiveness of multimodal models. To overcome these limitations, MoME enables a unified model architecture, incorporating a modality-specific expert system adept at processing multimodal data (notably vision and language) for a variety of downstream tasks, such as classification and image-text retrieval. Benefiting from this design, MoME can process different combinations of input: unimodal, multimodal, or mixed.
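
To make the expert-routing idea concrete, the sketch below shows a minimal mixture-of-modality-experts transformer block in the style of VLMo/BEiT-3: self-attention is shared across modalities, while each token is dispatched to a modality-specific feed-forward expert. The class name, layer sizes, and the three-expert split are illustrative assumptions, not the exact MoME configuration used in the thesis.

```python
import torch
import torch.nn as nn

class MoMEBlock(nn.Module):
    """Minimal mixture-of-modality-experts transformer block (illustrative).

    Self-attention is shared across modalities; each token is routed to a
    modality-specific feed-forward expert (e.g. vision, language, vision-language).
    """

    def __init__(self, dim=768, num_heads=12, num_experts=3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One feed-forward network per modality expert.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x, modality_ids):
        # x: (batch, seq_len, dim); modality_ids: (batch, seq_len) expert index per token.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out                      # attention shared by all modalities
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for idx, expert in enumerate(self.experts):
            mask = modality_ids == idx        # route each token to its modality expert
            if mask.any():
                out[mask] = expert(h[mask])
        return x + out
```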

Secondly, to enhance intra-modal and inter-modal alignment and bolster performance across both unimodal and multimodal contexts, we investigated several innovative contrastive learning techniques. Initially, our research focused on label-aware contrastive learning for image models, resulting in a robust encoder for image inputs. Subsequently, we introduced an Optimized Learning Fusion strategy, termed CLCE, designed to refine the optimization process by integrating the cross-entropy loss function with the contrastive learning loss function. Furthermore, we developed a debiased contrastive learning approach aimed at mitigating label noise within the contrastive learning framework, thereby further enhancing model performance. Collectively, these methodologies fortify the contrastive learning component of our multimodal learning framework, significantly deepening inter-modal alignment and augmenting overall effectiveness.
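
As an illustration of the idea behind CLCE, the sketch below combines the cross-entropy loss with a label-aware (supervised) contrastive term computed over a batch of normalised features. The weighting factor `lam`, the temperature, and the exact positive/negative construction are assumptions for exposition; the thesis's precise formulation (including the debiased variant) may differ.

```python
import torch
import torch.nn.functional as F

def clce_loss(features, logits, labels, lam=0.5, temperature=0.1):
    """Illustrative combined loss: cross-entropy plus a label-aware (supervised)
    contrastive term on L2-normalised features. Not the thesis's exact formulation."""
    ce = F.cross_entropy(logits, labels)

    z = F.normalize(features, dim=1)                  # (batch, dim)
    sim = z @ z.t() / temperature                     # pairwise similarities
    batch = labels.size(0)
    self_mask = torch.eye(batch, dtype=torch.bool, device=z.device)
    # Positives: other samples in the batch that share the anchor's label.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    # Log-softmax over all other samples in the batch (exclude self-similarity).
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)   # avoid -inf * 0 on the diagonal

    # Average negative log-probability of positives per anchor (skip anchors with none).
    pos_counts = pos_mask.sum(1)
    valid = pos_counts > 0
    per_anchor = -(log_prob * pos_mask).sum(1)
    supcon = (per_anchor[valid] / pos_counts[valid]).mean() if valid.any() \
        else torch.zeros((), device=z.device)

    return lam * ce + (1.0 - lam) * supcon
```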

Thirdly, to address the challenges of efficiency and practicality associated with large-scale models, we have developed an innovative approach to transfer learning utilizing adapters. As the size of Multimodal Large Language Models (MLLMs) increases, their adaptation to specific tasks becomes more complex, primarily due to heightened computational and memory requirements. Traditional fine-tuning methods, while effective, are resource-intensive and necessitate extensive, task-specific training. Although various adaptation methods have been proposed to mitigate these issues, they often result in inadequate inter-modal alignment, compromising the models’ overall effectiveness. In response to these challenges, we present the MultiWay-Adapter (MWA), a novel method equipped with an ‘Alignment Enhancer’. This feature significantly improves inter-modal alignment, facilitating efficient model transferability with minimal tuning. Consequently, the MWA emerges as a highly efficient and effective method for adapting MLLMs, substantially enhancing their utility across a broader range of applications.
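
The following sketch illustrates the general adapter-tuning pattern that the MultiWay-Adapter builds on: a small bottleneck module with a residual connection is added to a frozen backbone, so only a few parameters are trained. The 'Alignment Enhancer' itself is not reproduced here; the module and helper names are hypothetical.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic bottleneck adapter (down-project, non-linearity, up-project, residual).
    Illustrates adapter tuning only; it does not reproduce the MultiWay-Adapter's
    Alignment Enhancer."""

    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)      # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))


def freeze_backbone_except_adapters(model):
    """Train only the parameters belonging to adapter modules; the MLLM backbone stays frozen."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name.lower()
```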

Each proposed approach within the framework is rigorously assessed using one or more specially curated datasets for that component. This evaluation includes a detailed analysis of the approaches, identifying suitable settings for their deployment, and providing insights into their performance characteristics.

This thesis has made contributions to the field of multimodal learning by enhancing both intra-modal and inter-modal alignment, improving computational efficiency, and validating the proposed MCA framework in real-world applications. Our evaluations provide multiple pieces of evidence of improved alignment and enhanced performance across various metrics on the evaluated datasets, supporting our thesis statement. These advancements pave the way for future research and development in creating more effective and efficient multimodal systems.

Furthermore, this thesis provides a comprehensive evaluation and optimization of the proposed framework across various domains, such as crisis response, robotics, and cross-modal retrieval. Insightful findings are drawn from an extensive series of experiments covering the proposed multimodal learning framework. The results presented within this thesis highlight the improvements our framework contributes both to the overarching benchmarks of multimodal learning and to a wide array of downstream applications.

We applied multimodal learning to crisis response, addressing the limitation of prior works that primarily use single-modality content. This thesis examines the importance of integrating multiple modalities for crisis content categorization. We design a multimodal learning framework that fuses textual and visual inputs, leveraging both to classify content based on specific tasks. Using the CrisisMMD dataset, we demonstrate effective automatic labeling, achieving an average F1 score of 88.31% across the relevance and humanitarian category classification tasks. We also analyze the success and failure cases of unimodal and multimodal models.
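
For illustration only, the sketch below shows a generic late-fusion classifier of the kind described above: text and image inputs are encoded separately, their embeddings are concatenated, and a small classification head predicts the category. The encoder interfaces, hidden sizes, and class name are assumptions rather than the thesis's exact architecture.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Generic late-fusion sketch for crisis content categorization: encode the
    text and the attached image separately, concatenate the embeddings, classify.
    The fusion design actually used in the thesis may differ."""

    def __init__(self, text_encoder, image_encoder, text_dim, image_dim, num_classes):
        super().__init__()
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_classes),
        )

    def forward(self, text_inputs, image_inputs):
        t = self.text_encoder(text_inputs)            # (batch, text_dim)
        v = self.image_encoder(image_inputs)          # (batch, image_dim)
        return self.classifier(torch.cat([t, v], dim=1))
```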

The second application of our multimodal learning framework is in robotic vision, which involves tasks such as object detection, segmentation, and identification. Integrating specialized models into a unified vision pipeline poses engineering challenges and costs. Multimodal Large Language Models (MLLMs) have emerged as effective backbones for various tasks. Leveraging the pre-training capabilities of MLLMs simplifies the framework, reducing the need for task-specific encoders. The large-scale pre-trained knowledge in MLLMs allows for easier fine-tuning and superior performance in robotic vision tasks. We introduce the RoboLLM framework, equipped with a BEiT-3 backbone, to handle all visual perception tasks in the ARMBench challenge. RoboLLM outperforms existing baselines and significantly reduces the engineering burden of model selection and tuning.

The third application in this thesis is text-to-image retrieval, which finds relevant images based on text queries. This is crucial for digital libraries, e-commerce, and multimedia databases. While multimodal models show state-of-the-art performance in some retrieval tasks, they struggle with large-scale, diverse, and ambiguous real-world needs due to computational costs and injective embeddings. To address this, we present the two-stage Coarse-to-Fine Index-shared Retrieval (CFIR) framework for efficient large-scale long-text to image retrieval. The first stage, Entity-based Ranking (ER), handles query ambiguity using a multiple-queries-to-multiple-targets paradigm. The second stage, Summary-based Re-ranking (SR), refines rankings with summarized queries. We also propose a specialized Decoupling-BEiT-3 encoder for both stages, enhancing computational efficiency with vector-based similarity inference. Evaluations on the AToMiC dataset show that CFIR outperforms existing MLLMs by up to 11.06% in Recall@1000, while reducing training and retrieval times by 68.75% and 99.79%, respectively.
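
To convey the coarse-to-fine idea, the sketch below implements a generic two-stage dense-retrieval pipeline over a shared image-embedding index: entity-derived query vectors first gather a broad candidate pool, which a single summary-query vector then re-ranks. Function and parameter names are hypothetical, and the actual CFIR components (ER, SR, and the Decoupling-BEiT-3 encoder) are not reproduced.

```python
import numpy as np

def two_stage_retrieval(entity_query_vecs, summary_query_vec, image_index,
                        k_coarse=1000, k_final=100):
    """Generic coarse-to-fine retrieval sketch (not the exact CFIR pipeline).

    Stage 1 (coarse): each entity extracted from the long text query retrieves
    candidates from a shared image-embedding index; candidate sets are merged.
    Stage 2 (fine): the merged candidates are re-ranked against one embedding
    of the summarised query. Embeddings are assumed L2-normalised, so dot
    product equals cosine similarity.
    """
    # Stage 1: entity-based ranking -> union of per-entity top-k candidates.
    candidates = set()
    for q in entity_query_vecs:                       # each q has shape (dim,)
        scores = image_index @ q                      # (num_images,)
        candidates.update(np.argsort(-scores)[:k_coarse].tolist())
    candidates = np.array(sorted(candidates))

    # Stage 2: summary-based re-ranking over the shared index rows.
    rerank_scores = image_index[candidates] @ summary_query_vec
    order = np.argsort(-rerank_scores)[:k_final]
    return candidates[order], rerank_scores[order]
```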

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Colleges/Schools: College of Science and Engineering > School of Computing Science
Supervisor's Name: McCreadie, Dr. Richard and Aragon Camarasa, Dr. Gerardo
Date of Award: 2024
Depositing User: Theses Team
Unique ID: glathesis:2024-84814
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 16 Jan 2025 10:23
Last Modified: 16 Jan 2025 10:23
Thesis DOI: 10.5525/gla.thesis.84814
URI: https://theses.gla.ac.uk/id/eprint/84814
