Transformers and contrastive semi-supervised learning for medical image segmentation

Liu, Qianying (2026) Transformers and contrastive semi-supervised learning for medical image segmentation. PhD thesis, University of Glasgow.

Full text available as: PDF (2025LiuPhD.pdf, 6MB)

Abstract

Medical Image Semantic Segmentation (MISS), the process of assigning a semantic label to each pixel in an image, is a foundational task in computational medicine, critical for quantitative diagnostics and treatment planning. However, developing robust MISS models faces two intertwined challenges. First, there is an architectural dilemma: Convolutional Neural Networks (CNNs), like U-Net, excel at learning local features but are limited by their receptive fields, failing to capture the global context essential for segmenting organs that exhibit large deformations. Conversely, Vision Transformers (ViTs) effectively model long-range dependencies but lack the inductive biases of CNNs, leading to poor generalization on the small datasets typical of medicine without extensive pre-training. Second, the prohibitive cost and expertise required to create pixel-level annotations create a severe data scarcity bottleneck. While Semi-Supervised Learning (SSL) aims to mitigate this by leveraging unlabeled data, existing methods often fail to learn high-level semantic relations and are susceptible to confirmation bias from noisy pseudo-labels, class imbalance, and suboptimal contrastive sample selection.

This thesis presents a comprehensive investigation to systematically address these challenges, delivering a cohesive suite of novel deep learning frameworks. The contributions are four-fold:

First, to resolve the architectural trade-off, this work introduces CS-Unet, a pure Transformer network built upon a U-Net-like architecture. Its core innovation is the Convolutional Swin Transformer (CST) block, which integrates convolutions directly within the Multi-Head Self-Attention and Feed-Forward Network modules. This design imbues the Transformer with inherent localized spatial context and strong inductive biases, enabling it to efficiently learn both local and global features. Without pre-training, CS-Unet outperforms existing Transformer and CNN-based models on multi-organ and cardiac datasets, achieving state-of-the-art performance with fewer parameters.
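The core idea of the CST block, adding a convolutional branch inside the attention module so each token also receives local spatial context, can be illustrated with a minimal NumPy sketch. This is not the thesis code: the shapes, the random projections, and the fusion by addition are illustrative assumptions, and the real CST block operates on shifted windows with learned depthwise convolutions inside both the attention and feed-forward paths.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def depthwise_conv3x3(x):
    """Depthwise 3x3 convolution (stride 1, zero padding) on an (H, W, C) map."""
    H, W, C = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    rng = np.random.default_rng(0)
    k = rng.standard_normal((3, 3, C)) * 0.1   # one kernel per channel
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = (padded[i:i+3, j:j+3] * k).sum(axis=(0, 1))
    return out

def conv_enhanced_attention(x, d_head=8):
    """Self-attention over all H*W tokens, plus a depthwise-conv branch
    that injects local spatial context into the output (the gist of a
    convolution-augmented attention module)."""
    H, W, C = x.shape
    tokens = x.reshape(H * W, C)
    rng = np.random.default_rng(1)
    Wq, Wk, Wv = (rng.standard_normal((C, d_head)) * 0.1 for _ in range(3))
    q, kk, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ kk.T / np.sqrt(d_head))
    global_feat = attn @ v                                    # long-range mixing
    local_feat = depthwise_conv3x3(x).reshape(H * W, C)[:, :d_head]
    return global_feat + local_feat                           # fuse global + local

feat = np.random.default_rng(2).standard_normal((4, 4, 8))
out = conv_enhanced_attention(feat)
print(out.shape)  # (16, 8)
```

The fusion-by-addition here stands in for the tighter integration described above; the point is that the attention output alone has no notion of pixel adjacency, while the conv branch supplies exactly that inductive bias.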

Second, to address data scarcity, a novel Multi-Scale Cross Supervised Contrastive Learning (MCSC) framework for SSL is developed. MCSC jointly trains CNN and Transformer models, using a cross-teaching paradigm where each network provides pseudo-labels for the other. Crucially, it moves beyond simple output consistency by applying a contrastive loss to feature maps at multiple scales, enforcing hierarchical semantic consistency. To handle the class imbalance endemic to medical imaging, a class-prevalence-aware loss is used to ensure features for infrequent classes are learned robustly.
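The cross-teaching and class-prevalence-aware weighting ideas can be sketched as follows. This is a simplified illustration, not the MCSC implementation: it shows only the output-level exchange of hard pseudo-labels between two branches with inverse-prevalence weights, omitting the multi-scale contrastive terms; all shapes and the weight normalisation are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def class_weights(pseudo_labels, n_classes, eps=1e-6):
    """Inverse-prevalence weights so infrequent classes contribute more."""
    freq = np.bincount(pseudo_labels.ravel(), minlength=n_classes) + eps
    w = 1.0 / freq
    return w / w.sum() * n_classes   # normalise to mean ~1

def cross_teaching_loss(logits_a, logits_b, n_classes):
    """Each network is supervised by the other's hard pseudo-labels,
    weighted by inverse class prevalence (a sketch of the idea)."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    ya, yb = logits_a.argmax(-1), logits_b.argmax(-1)      # hard pseudo-labels
    wa, wb = class_weights(yb, n_classes), class_weights(ya, n_classes)
    # A learns from B's labels and vice versa (weighted cross-entropy).
    pa_y = np.take_along_axis(pa, yb[..., None], -1)[..., 0]
    pb_y = np.take_along_axis(pb, ya[..., None], -1)[..., 0]
    loss_a = -(wa[yb] * np.log(pa_y + 1e-12)).mean()
    loss_b = -(wb[ya] * np.log(pb_y + 1e-12)).mean()
    return loss_a + loss_b

rng = np.random.default_rng(0)
la = rng.standard_normal((32, 4))   # 32 pixels, 4 classes (CNN branch)
lb = rng.standard_normal((32, 4))   # Transformer branch
loss = cross_teaching_loss(la, lb, n_classes=4)
print(loss > 0)  # True
```

Because the CNN and Transformer make different kinds of errors, each branch's pseudo-labels act as a regulariser on the other, which is the rationale for cross-teaching over self-training.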

Third, to fortify SSL against noisy pseudo-labels, a certainty-guided contrastive learning strategy is proposed. This approach mitigates the impact of inaccurate pseudo-labels by using a certainty metric to guide the selection of samples for contrastive learning. The framework’s computational efficiency is enhanced through novel sampling strategies that select a few representative samples for contrasting, and a negative memory bank is used to increase sample diversity and eliminate dependence on batch size.
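Two pieces of this strategy lend themselves to a small sketch: selecting anchors by a certainty score, and a fixed-capacity negative memory bank. The code below uses prediction entropy as the certainty metric and a FIFO bank; both are plausible stand-ins for the thesis's actual choices, not its implementation, and the shapes are illustrative.

```python
import numpy as np
from collections import deque

def entropy(p, eps=1e-12):
    """Per-pixel entropy of a class-probability vector (low = certain)."""
    return -(p * np.log(p + eps)).sum(-1)

def select_certain_anchors(probs, feats, k=4):
    """Pick the k pixels with the most certain predictions as anchors,
    so contrastive learning is guided away from noisy pseudo-labels."""
    idx = np.argsort(entropy(probs))[:k]
    return feats[idx]

class NegativeBank:
    """FIFO memory bank of negative features, decoupling the number of
    available negatives from the batch size (a sketch)."""
    def __init__(self, capacity=64):
        self.bank = deque(maxlen=capacity)
    def push(self, feats):
        self.bank.extend(feats)          # oldest entries evicted past capacity
    def negatives(self):
        return np.stack(self.bank) if self.bank else None

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=100)   # per-pixel class probabilities
feats = rng.standard_normal((100, 16))        # per-pixel embeddings
anchors = select_certain_anchors(probs, feats, k=4)
bank = NegativeBank(capacity=8)
bank.push(feats[:10])
print(anchors.shape, bank.negatives().shape)  # (4, 16) (8, 16)
```

Sampling a few representative anchors rather than contrasting every pixel is what keeps the per-step cost low, while the bank supplies diverse negatives accumulated across iterations.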

Fourth, this thesis introduces a new paradigm for SSL by leveraging external anatomical priors through the Contrastive Cross-Teaching with Registration (CCT-R) framework. CCT-R is the first method to integrate spatial registration transforms into the learning process. It features two novel modules: a Registration Supervision Loss (RSL), which uses transforms between labeled and unlabeled volumes to generate an additional, highly reliable source of pseudo-labels, and Registration-Enhanced Positive Sampling (REPS), which uses registration to identify anatomically corresponding positive pairs across volumes for contrastive learning.
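The mechanism behind RSL, propagating a labeled volume's segmentation through a precomputed registration transform to produce pseudo-labels for an unlabeled volume, can be sketched in 2D with a dense displacement field and nearest-neighbour lookup. This is an illustrative reduction, not the thesis code: real registration is volumetric and the transform would come from a registration algorithm rather than being hand-set as below.

```python
import numpy as np

def warp_labels(labels, disp):
    """Propagate a labeled slice's segmentation to an unlabeled slice by
    applying a registration displacement field with nearest-neighbour
    lookup - the gist of using registration as a pseudo-label source."""
    H, W = labels.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # disp[..., 0] / disp[..., 1]: where each target pixel maps from (dy, dx).
    src_y = np.clip(np.rint(ys + disp[..., 0]).astype(int), 0, H - 1)
    src_x = np.clip(np.rint(xs + disp[..., 1]).astype(int), 0, W - 1)
    return labels[src_y, src_x]

labels = np.zeros((6, 6), dtype=int)
labels[2:4, 2:4] = 1                  # a small labeled structure
disp = np.zeros((6, 6, 2))
disp[..., 1] = -1.0                   # registration: structure sits 1px right
pseudo = warp_labels(labels, disp)
print(pseudo[2:4, 3:5])               # structure appears shifted right by one
```

The same correspondence map supports REPS: feature vectors at a pixel and at its registered location in another volume depict the same anatomy, making them natural positive pairs for the contrastive loss.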

Overall, these contributions provide a powerful toolkit that significantly alleviates the annotation bottleneck in medical AI. The proposed methods demonstrate state-of-the-art performance on challenging segmentation benchmarks, delivering a pathway to develop accurate, data-efficient models for real-world clinical applications and opening new avenues for research into fusing geometric priors with semantic segmentation.

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Colleges/Schools: College of Science and Engineering > School of Computing Science
Supervisor's Name: Deligianni, Dr. Fani
Date of Award: 2026
Depositing User: Theses Team
Unique ID: glathesis:2026-85734
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 02 Feb 2026 15:08
Last Modified: 03 Feb 2026 16:00
Thesis DOI: 10.5525/gla.thesis.85734
URI: https://theses.gla.ac.uk/id/eprint/85734
