Effective multi-modal conversational recommendation

Wu, Yaxiong (2024) Effective multi-modal conversational recommendation. PhD thesis, University of Glasgow.


Conversational recommender systems have recently received much attention for addressing the information asymmetry problem in information seeking, by eliciting the dynamic preferences of users and taking actions based on their current needs through multi-turn, closed-loop interactions. Despite recent advances in uni-modal conversational recommender systems that use only natural-language interfaces for recommendations, leveraging both visual and textual information effectively for multi-modal conversational recommender systems has not yet been fully researched. In particular, multi-modal conversational recommender systems are expected to leverage the multi-modal information (such as the natural-language feedback of users and textual/visual representations of recommendation items) during the communication between users and recommender systems.

In this thesis, we aim to effectively track and estimate the users’ dynamic preferences from the multi-modal conversational recommendations (in particular with vision-and-language-based interactions), so as to develop realistic and effective multi-modal conversational recommender systems. In particular, we are motivated to answer the following questions: (1) how to better understand the users’ natural-language feedback and the corresponding recommendations with the partial observability of the users’ preferences over time; (2) how to better track the users’ preferences over the sequences of the systems’ visual recommendations and the users’ natural-language feedback; (3) how to decouple the recommendation policy (i.e. model) optimisation and the multi-modal composition representation learning; (4) how to effectively incorporate the users’ long-term and short-term interests for both cold-start and warm-start users; (5) how to ensure the realism of simulated conversations, such as positive/negative natural-language feedback. To address these five challenges, we propose to leverage recent advanced techniques (including multi-modal learning, deep learning, and reinforcement learning) for re-framing and developing more effective multi-modal conversational recommender systems. In particular, we introduce the framework of the multi-modal conversational recommendation task with cold-start or warm-start users, as well as how to measure the success of the tasks. Note that we also refer to multi-modal conversational recommendation as dialog-based interactive recommendation or multi-modal interactive recommendation throughout this thesis.

The first challenge refers to the partial observability in natural-language feedback. For example, the users’ feedback, which takes the form of natural-language critiques about the displayed recommendation at each iteration, can only allow the recommender system to obtain a partial portrayal of the users’ preferences. To alleviate such a partial observation issue, we propose a novel dialog-based recommendation model, the Estimator-Generator-Evaluator (EGE) model, which uses Q-learning for a partially observable Markov decision process (POMDP), to effectively incorporate the users’ preferences over time. Specifically, we leverage an Estimator to track and estimate users’ preferences, a Generator to match the estimated preferences with the candidate items to rank the next recommendations, and an Evaluator to judge the quality of the estimated preferences considering the users’ historical feedback.
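The division of roles described above can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the EGE model from the thesis: it omits the Q-learning and POMDP machinery entirely, and the function names (`estimate`, `generate`, `evaluate`), the moving-average update, the cosine scoring, and the 2-d item vectors are all hypothetical choices made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def estimate(state, feedback, rate=0.5):
    """Estimator (toy): move the preference estimate toward the latest feedback vector."""
    return [(1 - rate) * s + rate * f for s, f in zip(state, feedback)]

def generate(state, candidates, k=1):
    """Generator (toy): rank candidate items by similarity to the estimated preferences."""
    ranked = sorted(candidates, key=lambda name: cosine(state, candidates[name]), reverse=True)
    return ranked[:k]

def evaluate(state, history):
    """Evaluator (toy): mean agreement between the estimate and all past feedback."""
    if not history:
        return 0.0
    return sum(cosine(state, f) for f in history) / len(history)

# Hypothetical 2-d item embeddings and per-turn feedback vectors (assumed, not from the thesis).
candidates = {
    "red_dress": [1.0, 0.0],
    "blue_dress": [0.0, 1.0],
    "purple_dress": [0.7, 0.7],
}
feedback_turns = [[0.0, 1.0], [0.2, 1.0]]  # stands in for encoded critiques such as "more blue"

state, history = [0.5, 0.5], []
for feedback in feedback_turns:
    history.append(feedback)                # each turn reveals only part of the preference
    state = estimate(state, feedback)       # track the preference over time
    top = generate(state, candidates)       # next recommendation
    quality = evaluate(state, history)      # check the estimate against all feedback so far

print(top, round(quality, 3))
```

Each turn contributes only one feedback vector, so the estimate is built up across turns rather than read off a single critique; that is the partial-observation aspect the paragraph refers to.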

The second challenge refers to the multi-modal sequence dependency issue in multi-modal dialog state tracking. For instance, multi-modal dialog sequences (i.e. turns consisting of the system’s visual recommendations and the user’s natural-language feedback) make it challenging to correctly incorporate the users’ preferences across multiple turns. Indeed, the existing formulations of interactive recommender systems suffer from their inability to capture the multi-modal sequential dependencies of textual feedback and visual recommendations because of their use of recurrent neural network-based (i.e., RNN-based) or transformer-based models. To alleviate the multi-modal sequence dependency issue, we propose a novel multi-modal recurrent attention network (MMRAN) model to effectively incorporate the users’ preferences over the long visual dialog sequences of the users’ natural-language feedback and the system’s visual recommendations.

The third challenge refers to the coupling issue of policy (i.e. recommendation model) optimisation and representation learning. For example, it is typically challenging and unstable to optimise a recommendation agent to improve the recommendation quality associated with implicit learning of multi-modal representations in an end-to-end fashion in deep reinforcement learning (DRL). To address this coupling issue, we propose a novel goal-oriented multi-modal interactive recommendation model (GOMMIR) that uses both verbal and non-verbal relevance feedback to effectively incorporate the users’ preferences over time. Specifically, our GOMMIR model employs a multi-task learning approach (using goal-oriented reinforcement learning (GORL)) to explicitly learn the multi-modal representations using a multi-modal composition network when optimising the recommendation agent.

The fourth challenge refers to the personalisation for cold-start and warm-start users. For instance, it can be challenging to make satisfactory personalised recommendations across multiple interactions due to the difficulty in balancing the users’ past interests and the current needs for generating the users’ state (i.e. current preferences) representations over time. To perform the personalisation for cold-start and warm-start users, we propose a novel personalised multi-modal interactive recommendation model (PMMIR) using hierarchical reinforcement learning (HRL) to more effectively incorporate the users’ preferences from both their past and real-time interactions.

The final challenge refers to the realism of simulated conversations. In a real-world shopping scenario, users can express their natural-language feedback when communicating with a shopping assistant by stating their satisfaction positively with “I like” or negatively with “I dislike” according to the quality of the recommended fashion products. A multi-modal conversational recommender system (using text and images in particular) aims to replicate this process by eliciting the dynamic preferences of users from their natural-language feedback and updating the visual recommendations so as to satisfy the users’ current needs through multi-turn interactions. However, the impact of positive and negative natural-language feedback on the effectiveness of multi-modal conversational recommendation has not yet been fully explored. To further explore the multi-modal conversational recommendation with positive and negative natural-language feedback, we investigate the effectiveness of the recent multi-modal conversational recommendation models for incorporating the users’ preferences over time from both positively and negatively oriented natural-language feedback corresponding to the visual recommendations.

Overall, we contribute an effective multi-modal conversational recommendation framework that makes accurate recommendations by leveraging visual and textual information. This framework includes models for tracking users’ preferences with partial observations, mitigating the multi-modal sequence dependency issue, decoupling the composition representation learning from policy optimisation, incorporating both the users’ long-term preferences and short-term needs for personalisation, and ensuring the realism of simulated conversations. These contributions make progress in the development of multi-modal conversational recommendation techniques and could inspire future directions of research in recommendation systems.

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
T Technology > T Technology (General)
Colleges/Schools: College of Science and Engineering > School of Computing Science
Supervisor's Name: Macdonald, Professor Craig and Ounis, Professor Iadh
Date of Award: 2024
Depositing User: Theses Team
Unique ID: glathesis:2024-84149
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 26 Mar 2024 11:28
Last Modified: 26 Mar 2024 11:30
Thesis DOI: 10.5525/gla.thesis.84149
URI: https://theses.gla.ac.uk/id/eprint/84149
