Enlighten Theses

Vision-language models for chest X-ray radiology

Dalla Serra, Francesco (2025) Vision-language models for chest X-ray radiology. EngD thesis, University of Glasgow.

Abstract

Chest X-ray (CXR) is a widely requested imaging test used as a quick and non-invasive procedure to examine various pathologies in the chest cavity. When radiologists interpret CXR scans, they typically consult additional clinical information about the patient under examination and document the relevant findings visualised in the CXR into free-text radiology reports. Therefore, in clinical practice, CXRs are often accompanied by supplementary textual documents that provide important context for accurate diagnosis.

This thesis explores the potential of Vision-Language Models (VLMs), AI systems designed to process and integrate both visual and textual information, to develop flexible autonomous decision support tools for CXR analysis. We investigate several multimodal tasks, including medical finding classification, Automated Radiology Reporting (ARR), and medical Visual Question Answering (VQA). Medical finding classification involves identifying and categorising specific pathologies or abnormalities present in the CXR images. ARR corresponds to the task of generating free-text radiology reports for each scan, providing comprehensive descriptions and diagnoses. Medical VQA focuses on answering questions about the visual content of medical scans, facilitating deeper interaction with the imaging data.

We address these tasks by improving the visual representation of the CXR scans and providing the VLM with additional relevant textual information. This includes utilising patients’ medical history and the reasons for the scans, as detailed in the indication field of the radiology report, available at the time of imaging. We leverage expert-written radiology reports to supervise ARR models and guide VQA model responses through specific textual queries. By integrating both textual and visual data, we aim to improve the models’ ability to accurately interpret and interact with the imaging data. This thesis is organised as follows.

We start by addressing the medical finding classification task. In particular, we investigate how different pre-training strategies for the image encoder affect the performance of a multimodal model and how that performance degrades when labelled data is limited. We demonstrate the impact of self-supervised pre-training strategies on this task.

Second, we focus on the ARR task. We start by exploring the effect of extracting structured information from each scan, expressed as triples (entity1, relation, entity2). Triple extraction serves as the intermediate task in a two-step pipeline for ARR, and we show that it improves results. Additionally, we propose extracting more fine-grained visual representations, each specific to an anatomical region of the CXR, and using them as the visual input for ARR. This approach offers an effective solution for encoding detailed information about abnormalities within each anatomical region. Next, we demonstrate how to manipulate these region-specific representations to model the evolution of findings over time (e.g., by examining longitudinal scans) and to enable controllable partial reporting, the task of generating the radiology report for a selected set of anatomical regions. We then integrate all the proposed solutions for ARR into a single model and provide a human evaluation of its performance to assess its accuracy and clinical utility.
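
As a purely illustrative sketch of the (entity1, relation, entity2) triple format mentioned above, the snippet below shows one way extracted triples could be held in memory and flattened into a text string for a second-stage report generator. The class name, function name, relation labels, and example findings are all hypothetical and are not taken from the thesis; the actual representation used there may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Triple:
    """One structured finding: (entity1, relation, entity2). Hypothetical schema."""
    entity1: str
    relation: str
    entity2: str


def linearise(triples: Iterable[Triple]) -> str:
    """Flatten triples into a single string that a text decoder could consume.

    The separator and bracket style are assumptions chosen for readability.
    """
    return " ; ".join(f"({t.entity1}, {t.relation}, {t.entity2})" for t in triples)


# Illustrative findings only; not drawn from any dataset or from the thesis.
example = [
    Triple("opacity", "located_at", "left lower lobe"),
    Triple("effusion", "located_at", "right pleural space"),
]

print(linearise(example))
```

In a two-step setup of this kind, the first stage would predict such triples from the image and the second stage would generate the report conditioned on the flattened string.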

Finally, we focus on the medical VQA task by exploring how ARR and VQA can be integrated into a unified pipeline. We show that grounding the VQA model on the predicted radiology reports improves its ability to answer queries related to the CXR images.

This work demonstrates how to effectively tackle visio-linguistic tasks specific to CXR scans, addressing the unique challenges of each task. The research presented here offers valuable insights that may guide future studies, ultimately contributing to the successful integration of VLMs into radiologists’ workflows for improved clinical outcomes.

Item Type:	Thesis (EngD)
Qualification Level:	Doctoral
Additional Information:	Supported by funding from Canon Medical Research Europe Limited and the UKRI EPSRC Centre for Doctoral Training in Applied Photonics [EP/S022821/1].
Subjects:	Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Colleges/Schools:	College of Science and Engineering > School of Computing Science
Funder's Name:	Engineering and Physical Sciences Research Council (EPSRC)
Supervisor's Name:	Deligianni, Dr. Fani and O'Neil, Dr. Alison Q.
Date of Award:	2025
Depositing User:	Theses Team
Unique ID:	glathesis:2025-85025
Copyright:	Copyright of this thesis is held by the author.
Date Deposited:	09 Apr 2025 12:40
Last Modified:	18 May 2026 15:10
Thesis DOI:	10.5525/gla.thesis.85025
URI:	https://theses.gla.ac.uk/id/eprint/85025
Related URLs:	Conference proceeding Enlighten Publications Record Conference proceeding Conference proceeding Conference proceeding
