Vlachou, Maria (2025) Predicting retrieval failures in conversational recommendation systems. PhD thesis, University of Glasgow.
Abstract
In recent years, the spread of dialogue systems and voice assistants in smart devices has shifted users’ interest towards online shopping. In turn, online shopping platforms are gaining popularity and moving towards interactive dialogue with users that more closely resembles a real shopping setting. In this regard, Conversational Image Recommendation is the state-of-the-art task for conversational recommendation in the fashion domain: a user has a specific fashion item in mind and gives natural language feedback on recommended image items, which guides the system towards the imagined item in the next turn. Such systems are trained and evaluated with user simulators as a plentiful surrogate for human users. A practical problem is that the performance of a Conversational Recommendation System (CRS) is primarily evaluated in terms of successes, so the system is assumed to return the item of interest within a pre-defined number of turns. In practice, the item is often not returned by the end of a conversation, which leads to conversational failures; this is our particular setting of interest.
In this thesis, we argue that the performance of a Conversational Recommendation System can be predicted to detect when a conversation fails, under different scenarios and across different turns of a conversation. In this regard, Query Performance Prediction (QPP) techniques predict the effectiveness of a ranked result list returned in response to a query, without access to relevance judgments. We predict the performance of CRS models by treating them as dense retrieval processes, where both the retrieved image items and the textual feedback can be represented with dense embedded representations. In particular, we propose a set of coherence-based dense QPPs specifically designed for single-representation dense retrieval models (ANCE and TCT-ColBERT) and show that examining the relations among the dense embedded representations already contained in the document list is sufficient to provide effective predictions for dense retrieval models. At the same time, by using a multi-level perspective that jointly considers QPPs and types of queries, we explain why some QPPs are better for certain types of queries, thus explaining discrepancies among different evaluation metrics.
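As a rough illustration of the coherence idea described above, the following minimal sketch scores a ranked list by the mean pairwise cosine similarity of its top-k document embeddings. This is a generic stand-in and not the thesis's actual predictors; the function names, the cutoff `k`, and the toy vectors are assumptions made for illustration only.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_coherence(embeddings, k=10):
    """Hypothetical coherence score: average cosine similarity over all
    pairs among the top-k embeddings of a ranked list.
    No relevance judgments are used, only the retrieved representations."""
    top = embeddings[:k]
    pairs = list(combinations(top, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy ranked lists (assumed data): one tightly clustered, one scattered.
tight_list = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
scattered_list = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
tight_score = mean_pairwise_coherence(tight_list)
scattered_score = mean_pairwise_coherence(scattered_list)
```

Under this sketch, a tightly clustered list receives a higher score than a scattered one; whether higher coherence corresponds to better retrieval effectiveness is an empirical question that the thesis addresses.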
At the next stage, we predict the effectiveness of a ranking of image items in Conversational Image Recommendation models, which are also based on learned embedded representations of images, and where user feedback takes the place of a textual query. Indeed, we create a novel task, which we call Conversational Performance Prediction (CPP), that predicts success at the conversation level while taking into account the multi-turn nature of the task. CPP can differentiate between success predicted over a short-term and a long-term horizon, thereby predicting either current user satisfaction or overall satisfaction with a conversation. First, we examine the set of unsupervised predictors developed for dense retrieval models, applied to state-of-the-art Conversational Image Recommendation models: a GRU-based model, which mainly considers the feedback of the previous turn, and an EGE model, which considers the entire dialogue history. Our results show that using correlations is not an optimal evaluation strategy for predicting conversational failures: while correlations are low to medium, mainly for short-term predictions, many inconsistencies are observed in the performance of different predictors across metrics and datasets (similarly to dense retrieval models). Consequently, we propose a supervised CPP approach, which treats CPP as a binary classification task that predicts whether a target item is returned by a given turn. In this way, we show that by learning from the embedded representations already contained in the CRS models, we can accurately predict conversation success using the retrieved items of both single and multiple turns.
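The binary classification framing above can be sketched with a small logistic regression trained by stochastic gradient descent. This is only an illustrative stand-in for the supervised CPP predictors: the two-dimensional feature vectors (imagined as summaries of the retrieved items in a turn), the labels, and all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.5, epochs=500):
    """Fit weights and bias with plain per-example gradient descent.
    labels: 1 if the target was returned by the given turn, else 0."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict_success(weights, bias, x):
    """True if the conversation is predicted to return the target by the turn."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5


# Assumed toy data: each row is a hypothetical feature summary of one turn.
toy_features = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.3]]
toy_labels = [1, 1, 0, 0]
weights, bias = train_logistic(toy_features, toy_labels)
```

In a multi-turn variant, the feature vector could concatenate summaries from several turns; that design choice is likewise an assumption here and not taken from the thesis.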
In addition, state-of-the-art CRS models are trained using user simulators that have a single target item in mind and are assumed to be infinitely patient. These settings do not reflect a real shopping scenario, where a user might change their mind according to what a shopping assistant suggests. For this purpose, we enhance the evaluation completeness of CRS models by obtaining real user opinions in a user study, using pooling similar to that of information retrieval tasks, thus identifying alternative relevance labels for several target items and, in turn, informing the user simulator with an extended target space. This increases the completeness of CRS evaluation and therefore creates a more realistic prediction setting for CRS, which leads to improved predictions of user preferences. Indeed, when we reevaluate the CRS models using the updated simulator with the identified alternatives as part of the target space, we show that the single-target setting previously used to evaluate CRS models for a maximum of 10 turns underestimated the effectiveness of CRS models.
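The effect of an extended target space on evaluation can be illustrated with a minimal sketch that checks at which turn any acceptable item first appears in the top-k of a turn's ranking. The function name, the cutoff `k`, and the item identifiers are assumptions; only the 10-turn limit comes from the text above.

```python
def first_success_turn(rankings_per_turn, acceptable_items, k=10, max_turns=10):
    """Return the 1-based turn at which any acceptable item first appears
    in the top-k of that turn's ranking, or None if the conversation fails."""
    for turn, ranking in enumerate(rankings_per_turn[:max_turns], start=1):
        if any(item in acceptable_items for item in ranking[:k]):
            return turn
    return None


# Assumed toy conversation: one ranked list of item ids per turn.
toy_rankings = [["a", "b"], ["c", "d"], ["target", "e"]]

single_target = first_success_turn(toy_rankings, {"target"})
extended_target = first_success_turn(toy_rankings, {"target", "d"})
cut_short = first_success_turn(toy_rankings, {"target"}, max_turns=2)
```

With only the original target, this toy conversation succeeds at turn 3 and fails under a 2-turn limit; once item `d` is accepted as an alternative, it succeeds at turn 2, which mirrors how a single-target setting can understate effectiveness.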
As a final step, we account for the fact that CRS models assume only one type of recommendation failure, namely the inability of the system to retrieve the target item. In this regard, we introduce the concept of recommendation scenarios, and specifically, we adapt our CPP framework to different types of conversational failures, which are determined by whether the user’s need is clearly defined and whether the target item is available. Therefore, we propose the removed target scenario (the target is not available in the catalogue) and the alternative scenario (a user has a more flexible need, which can be satisfied by either the original target or any of the identified alternatives in the collected datasets). Consequently, we detect different types of conversational failure, such as when a user cannot find an item versus when the system’s catalogue does not contain the relevant item. By examining the supervised CPP predictors under these two novel scenarios, we find that in both cases there is a marked difference from the original scenario, and that conversational performance can indeed be predicted for different recommendation scenarios.
Item Type: | Thesis (PhD) |
---|---|
Qualification Level: | Doctoral |
Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
Colleges/Schools: | College of Science and Engineering > School of Computing Science |
Funder's Name: | Engineering and Physical Sciences Research Council (EPSRC) |
Supervisor's Name: | Macdonald, Professor Craig |
Date of Award: | 2025 |
Depositing User: | Theses Team |
Unique ID: | glathesis:2025-85203 |
Copyright: | Copyright of this thesis is held by the author. |
Date Deposited: | 17 Jun 2025 12:49 |
Last Modified: | 17 Jun 2025 13:04 |
Thesis DOI: | 10.5525/gla.thesis.85203 |
URI: | https://theses.gla.ac.uk/id/eprint/85203 |