Mackie, Iain (2025) Integrating pre-trained language models into novel query expansion pipelines. PhD thesis, University of Glasgow.
Full text available as: PDF (7MB)
Abstract
In our increasingly digital society, proficiency in finding valuable and useful information is crucial within everyday personal and professional life. Information Retrieval (IR) is the academic field that focuses on discovering useful information (documents) that fulfil a user’s information need (query). In particular, search systems process a user query and return a ranked list of documents determined by the query-specific relevance calculated by a retrieval model. Retrieval models have traditionally been based on query-document term overlap; however, dense embedding models are becoming increasingly prevalent with the emergence of Pre-Trained Language Models (PLMs).
Lexical mismatch is a classic problem within information retrieval, whereby a user query fails to capture their complete information need, leading to retrieval models failing to find relevant documents. A common approach to this issue is query expansion, which augments the query with supplementary information to enhance the retrieval of relevant documents. This process is usually done automatically through pseudo-relevance feedback (PRF), where a set of documents from a first-pass retrieval algorithm is assumed relevant and used to expand the query with useful context. This approach has proven beneficial for both sparse and dense retrieval models. In this thesis, I hypothesise that integrating PLMs into multi-stage query expansion pipelines can improve performance over current sparse and dense expansion methods. Leveraging the capabilities of PLMs to generate and rank relevant content should yield better expansion models, particularly for queries that require reasoning or contextualisation.
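As a rough illustration of the pseudo-relevance feedback idea described above, the following minimal sketch expands a query with frequent terms from the top first-pass documents. It is not the thesis's implementation: the toy corpus, the overlap scorer, and the names `retrieve`, `expand_query`, `fb_docs`, and `fb_terms` are all assumptions made for the example, and real systems typically use weighted models such as BM25 with RM3.

```python
from collections import Counter

# Toy corpus (hypothetical); d4 is relevant but never mentions "jaguar".
CORPUS = {
    "d1": "jaguar speed cat hunting cat",
    "d2": "jaguar cat habitat rainforest",
    "d3": "car engine speed",
    "d4": "big cat habitat conservation",
}

def tokenize(text):
    return text.lower().split()

def overlap_score(query_terms, doc_terms):
    """Toy lexical scorer: count document tokens that match a query term."""
    query_set = set(query_terms)
    return sum(1 for t in doc_terms if t in query_set)

def retrieve(query_terms, corpus, k):
    """Return ids of the top-k documents with a non-zero overlap score."""
    scored = [(overlap_score(query_terms, tokenize(text)), doc_id)
              for doc_id, text in corpus.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by id
    return [doc_id for score, doc_id in scored[:k] if score > 0]

def expand_query(query, corpus, fb_docs=2, fb_terms=2):
    """Assume the top fb_docs documents are relevant and add their
    most frequent non-query terms to the query."""
    query_terms = tokenize(query)
    counts = Counter()
    for doc_id in retrieve(query_terms, corpus, fb_docs):
        counts.update(t for t in tokenize(corpus[doc_id]) if t not in query_terms)
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return query_terms + [term for term, _ in ranked[:fb_terms]]

expanded = expand_query("jaguar", CORPUS)
print(expanded)                       # ['jaguar', 'cat', 'habitat']
print(retrieve(expanded, CORPUS, 3))  # ['d1', 'd2', 'd4']
```

In this toy setting the original query cannot match `d4`, while the expanded query retrieves it, which is the lexical-mismatch effect that expansion is meant to address.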
This thesis examines existing retrieval and expansion models, identifying their shortcomings to focus the contributions. This includes constructing a formal definition of complex queries and constructing new datasets, such as CODEC, designed to evaluate the effectiveness of our proposed retrieval models on this query type. I first show that simply using PLMs to re-rank the first-pass candidate set of documents before query expansion improves sparse and dense retrieval by 5–8%. This motivates the development of a new fine-grained expansion model, Latent Entity Expansion (LEE), that achieves a further 2–8% gain in NDCG by explicitly modelling knowledge using terms and entities. Furthermore, I introduce a novel expansion pipeline, termed “adaptive expansion”, that iterates between retrieving new batches of documents and updating the expansion model through PLM re-ranking. Adaptive expansion leads to state-of-the-art effectiveness gains without requiring any additional re-ranking computation.
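The iterate-retrieve-re-rank-update loop behind adaptive expansion can be sketched loosely as follows. This is only an illustration under stated assumptions, not the pipeline from the thesis: the function `adaptive_expansion`, its parameters, the toy corpus, and `toy_rerank` (a stand-in for a PLM re-ranker such as a cross-encoder) are all hypothetical, and the expansion model here is a plain term-count vector.

```python
from collections import Counter

# Toy corpus (hypothetical).
CORPUS = {
    "d1": "jaguar speed cat hunting cat",
    "d2": "jaguar cat habitat rainforest",
    "d3": "car engine speed",
    "d4": "big cat habitat conservation",
}
# Stand-in notion of relevance used only by the toy re-ranker.
TOPIC = {"jaguar", "cat", "habitat", "rainforest"}

def tokenize(text):
    return text.lower().split()

def toy_rerank(query, text):
    """Placeholder for a PLM re-ranker: counts distinct on-topic tokens."""
    return len(set(tokenize(text)) & TOPIC)

def lexical_score(weights, text):
    """Sum of expansion-model weights over the document's tokens."""
    return sum(weights.get(t, 0) for t in tokenize(text))

def adaptive_expansion(query, corpus, rerank, batch_size=1, rounds=3, fb_docs=2):
    weights = Counter(tokenize(query))  # expansion model starts as the raw query
    judged = {}                         # doc_id -> re-ranker score (scored once each)
    for _ in range(rounds):
        # Retrieve the next batch of unseen documents with the current model.
        batch = sorted(
            (d for d in corpus if d not in judged),
            key=lambda d: (-lexical_score(weights, corpus[d]), d),
        )[:batch_size]
        if not batch:
            break
        for d in batch:
            judged[d] = rerank(query, corpus[d])
        # Rebuild the expansion model from the best re-ranked documents so far.
        top = sorted(judged, key=lambda d: (-judged[d], d))[:fb_docs]
        weights = Counter(tokenize(query))
        for d in top:
            weights.update(tokenize(corpus[d]))
    return sorted(judged, key=lambda d: (-judged[d], d))

print(adaptive_expansion("jaguar", CORPUS, toy_rerank))  # ['d2', 'd1', 'd4']
```

In this sketch each document is passed to the re-ranker at most once, and the updated expansion model steers the third batch towards `d4`, which the raw query alone scores no higher than the off-topic `d3`.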
This thesis’s second central research thread explores dispensing with pseudo-relevance feedback altogether and instead leveraging the generative capabilities of PLMs to build our query expansion models directly. I introduce Generative Relevance Feedback (GRF), which shows that sparse expansion using PLM-generated content improves MAP by 5–19% over traditional PRF. I also show that GRF is highly effective when combined with dense and learned sparse retrieval. Furthermore, I increase the retrieval effectiveness of these pipelines by incorporating Generative Relevance Modelling (GRM), which mitigates hallucination by scaling the weight of generated documents in our expansion model. I also propose Relevance-Aware Sample Estimation (RASE) to ground the generated documents to the target corpus and use PLMs to estimate relevance.
Overall, this body of work demonstrates the potential of query expansion when combined with novel pipelines that leverage the capabilities of PLMs. This paradigm shift provides a foundation for future advancements, promising more efficient and effective search systems.
| Item Type: | Thesis (PhD) |
|---|---|
| Qualification Level: | Doctoral |
| Subjects: | T Technology > T Technology (General) |
| Colleges/Schools: | College of Science and Engineering > School of Engineering |
| Supervisor's Name: | Dalton, Dr. Jeffrey and McCreadie, Dr. Richard |
| Date of Award: | 2025 |
| Depositing User: | Theses Team |
| Unique ID: | glathesis:2025-85661 |
| Copyright: | Copyright of this thesis is held by the author. |
| Date Deposited: | 07 Jan 2026 09:28 |
| Last Modified: | 07 Jan 2026 09:28 |
| Thesis DOI: | 10.5525/gla.thesis.85661 |
| URI: | https://theses.gla.ac.uk/id/eprint/85661 |