Neural pseudo-relevance feedback models for information retrieval

Wang, Xiao (2024) Neural pseudo-relevance feedback models for information retrieval. PhD thesis, University of Glasgow.

Full text available as:
PDF: 2023WangXiaoPhD.pdf (3MB)

Abstract

Verbatim queries submitted to search engines often do not sufficiently describe the user’s search intent. Moreover, even with well-formed user queries, retrieval failures can still occur, caused by lexical or semantic mismatches, or both, between the language of the user’s query and that used in the relevant documents. Pseudo-relevance feedback (PRF) techniques, which modify a query’s representation using top-ranked documents, have been shown to overcome such inadequacies and improve retrieval effectiveness.

In this thesis, we argue that pseudo-relevance feedback information can be used in neural-based models to improve retrieval effectiveness, for both the sparse retrieval and dense retrieval paradigms. Indeed, recent advancements in pretrained generative language models, such as T5 and FlanT5, have demonstrated their ability to generate textual responses that are relevant to a given prompt. In light of this success, we study the capacity of such models to perform query reformulation and how they compare with long-standing query reformulation methods that use pseudo-relevance feedback. In particular, we investigate two representative query reformulation frameworks, GenQR and GenPRF. Specifically, GenQR directly reformulates the user’s input query, while GenPRF provides additional context for the query by making use of the pseudo-relevance feedback information in the top-ranked documents. For each reformulation method, we leverage different techniques, including fine-tuning and direct prompting, to harness the knowledge of the language models. The reformulated queries produced by the generative models are demonstrated to markedly benefit the effectiveness of sparse retrieval on various TREC test collections.
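The distinction between the two frameworks can be illustrated with a minimal prompt-construction sketch. The function names and prompt wording below are illustrative assumptions, not the thesis's actual templates; the resulting prompt would be passed to a generative PLM such as T5 or FlanT5, either fine-tuned or via direct (zero-shot) prompting.

```python
# Illustrative sketch of prompt construction in the spirit of GenQR and
# GenPRF. The prompt wording is a hypothetical example, not the exact
# template used in the thesis.

def genqr_prompt(query: str) -> str:
    """GenQR: the model sees only the user's input query."""
    return f"Improve this search query: {query}"

def genprf_prompt(query: str, feedback_docs: list[str], k: int = 3) -> str:
    """GenPRF: the top-k pseudo-relevant passages supply extra context."""
    context = " ".join(feedback_docs[:k])
    return f"Context: {context}\nImprove this search query: {query}"
```

The key difference is that GenPRF conditions the generative model on evidence from the top-ranked documents, whereas GenQR relies solely on the model's parametric knowledge.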

In addition, dense retrieval models, in both the single representation and multiple representation dense retrieval paradigms, have shown higher effectiveness than traditional sparse retrieval by mitigating, to some extent, the lexical and semantic mismatch issues. However, underrepresented queries can still cause retrieval failures. In particular, in this thesis, we investigate the potential for multiple representation dense retrieval (exemplified by ColBERT) to be enhanced using pseudo-relevance feedback, and thereby present our proposed approach, ColBERT-PRF. More specifically, ColBERT-PRF extracts representative feedback embeddings from the document embeddings of the pseudo-relevant set and uses the corresponding token statistics to identify good expansion embeddings among the representative embeddings. These expansion embeddings are then appended to the original query representation to form a refined query representation. We show that these additional expansion embeddings benefit the effectiveness of both a reranking of the initial query results and an additional dense retrieval operation. Evaluation experiments conducted on the MSMARCO passage and document ranking tasks, as well as the TREC Robust04 document ranking task, demonstrate the effectiveness of our proposed ColBERT-PRF technique. In addition, we study the effectiveness of variants of the ColBERT-PRF model with different weighting methods. Finally, we show that ColBERT-PRF can be made more efficient, with little impact on effectiveness, through the application of approximate scoring and different clustering methods.
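The expansion step described above can be sketched as follows. This is a minimal numpy/scikit-learn illustration of the idea: cluster the token embeddings of the pseudo-relevant documents, score each cluster centroid by the IDF of its most similar vocabulary token, and append the highest-scoring centroids to the query embeddings. All array sizes, parameter names, and the toy IDF table are assumptions for exposition, not the thesis's actual configuration.

```python
# Minimal sketch of ColBERT-PRF-style dense query expansion (illustrative,
# not the thesis's exact implementation).
import numpy as np
from sklearn.cluster import KMeans

def colbert_prf(query_embs, feedback_embs, token_embs, token_idf,
                n_clusters=8, fb_embs=3):
    # 1. Cluster the token embeddings of the pseudo-relevant documents
    #    to obtain representative feedback embeddings (the centroids).
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(feedback_embs)
    centroids = km.cluster_centers_                # (n_clusters, dim)
    # 2. Map each centroid to its most similar vocabulary token and
    #    score the centroid by that token's IDF (token statistics).
    sims = centroids @ token_embs.T                # cosine if normalised
    nearest = sims.argmax(axis=1)
    scores = token_idf[nearest]
    # 3. Keep the fb_embs most informative centroids as expansion embeddings.
    top = np.argsort(-scores)[:fb_embs]
    expansion = centroids[top]
    # 4. Refined query = original query embeddings + expansion embeddings.
    return np.vstack([query_embs, expansion])
```

The refined query representation can then be used either to rerank the initial results or to perform a further dense retrieval pass, matching the two modes evaluated in the thesis.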

While PRF techniques are effective in closing the vocabulary gap between the user’s query formulations and the relevant documents, they are typically applied on the same target corpus as the final retrieval. In the past, external expansion techniques have sometimes been applied to obtain a high-quality pseudo-relevant feedback set from a high-quality external corpus. However, such external expansion approaches have only been studied for sparse retrieval, and their effectiveness for recent dense retrieval methods remains under investigation. Moreover, dense retrieval approaches such as ANCE and ColBERT have been shown to face challenges in out-of-domain evaluations, due to the knowledge shift between different domains.

Therefore, in this thesis, we propose a dense external expansion technique to improve the zero-shot retrieval effectiveness of both single and multiple representation dense retrieval. In particular, we employ the MSMARCO passage collection as the external corpus. Experiments performed on two TREC datasets indicate the effectiveness of our proposed external dense query expansion techniques for both sparse retrieval and (single or multiple representation) dense retrieval.

Furthermore, we note that ColBERT has only been instantiated with the BERT pretrained language model (PLM) and its corresponding WordPiece tokeniser. However, the effect of the pretrained model and the tokenisation method on the contextualised late interaction mechanism used by ColBERT is not well understood. Therefore, in this thesis, we extend ColBERT to Col⋆ and ColBERT-PRF to Col⋆-PRF, by generalising the de-facto standard BERT PLM to various different PLMs. As different tokenisation methods can directly impact the matching behaviour within the late interaction mechanism, we study the nature of matches occurring in different Col⋆ and Col⋆-PRF models, and further quantify the contribution of lexical and semantic matching to retrieval effectiveness.

Finally, both the ColBERT-PRF and the Col⋆-PRF models perform dense query expansion in an unsupervised manner and may be affected by heuristic techniques such as clustering and IDF statistics. Therefore, in this thesis, we propose a contrastive solution that learns to select the most useful embeddings for expansion. More specifically, a deep language model-based contrastive weighting model, called CWPRF, is trained to discriminate between relevant and non-relevant documents for semantic search. Our experimental results show that our contrastive weighting model can aid in selecting useful expansion embeddings and outperforms various baselines. In particular, CWPRF can further improve nDCG@10 by up to 4.1% compared to our proposed ColBERT-PRF approach while maintaining its efficiency.
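The contrastive objective behind such a weighting model can be sketched in numpy. In this illustrative example (the scoring function and loss shape are assumptions for exposition, not CWPRF's exact formulation), learned weights on candidate expansion embeddings should make the refined query score a relevant document above non-relevant ones, which a softmax cross-entropy over the candidates' scores encourages.

```python
# Illustrative sketch of a contrastive weighting objective in the spirit
# of CWPRF (hypothetical formulation, for exposition only).
import numpy as np

def maxsim_score(query_embs, doc_embs):
    # ColBERT-style late interaction: for each query embedding, take the
    # maximum similarity with any document embedding, then sum.
    return (query_embs @ doc_embs.T).max(axis=1).sum()

def contrastive_loss(query_embs, candidates, weights, pos_doc, neg_docs):
    # Weight each candidate expansion embedding and append to the query.
    refined = np.vstack([query_embs, weights[:, None] * candidates])
    pos = maxsim_score(refined, pos_doc)
    negs = np.array([maxsim_score(refined, d) for d in neg_docs])
    # Softmax cross-entropy over (positive, negatives), computed stably:
    # minimising this pushes the relevant document's score above the rest.
    logits = np.concatenate([[pos], negs])
    m = logits.max()  # numerical stability for the log-sum-exp
    return -pos + m + np.log(np.exp(logits - m).sum())
```

In the actual model, the weights would be produced by a deep language model and trained end-to-end by gradient descent on such a loss; the sketch only shows how a set of expansion weights is scored against one relevance judgement.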

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Colleges/Schools: College of Science and Engineering > School of Computing Science
Supervisor's Name: Macdonald, Professor Craig and Ounis, Professor Iadh
Date of Award: 2024
Depositing User: Theses Team
Unique ID: glathesis:2024-84093
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 21 Feb 2024 11:34
Last Modified: 21 Feb 2024 14:10
URI: https://theses.gla.ac.uk/id/eprint/84093