Deep learning for ultrasound tongue imaging: towards robust, interpretable, and deployable assessment of speech sound disorders

Al Ani, Saja (2026) Deep learning for ultrasound tongue imaging: towards robust, interpretable, and deployable assessment of speech sound disorders. PhD thesis, University of Glasgow.

Abstract

Speech sound disorders (SSDs) are among the most common developmental communication difficulties in childhood, with long-term consequences for intelligibility, literacy, and psychosocial well-being. Clinical assessment relies primarily on auditory–perceptual judgement, which, although effective, is subjective and provides limited insight into underlying articulatory mechanisms. Ultrasound tongue imaging (UTI) offers a safe, non-invasive method for visualising tongue movement during speech; however, its interpretation remains challenging due to image noise, speaker and acquisition variability, and the high cost of expert annotation. This thesis examines the systematic adaptation of deep learning (DL) to address three core challenges in automated UTI analysis: (C1) data variability and generalisability limitations, (C2) data scarcity and annotation inefficiency, and (C3) lack of interpretability and clinical usability.

Reproducible baseline deep neural network (DNN) models are first established for phonetic classification from raw UTI, quantifying generalisation limitations under speaker-independent evaluation. A novel multi-input FusionNet architecture is then introduced, combining raw ultrasound frames with texture-based representations to improve cross-speaker robustness. A two-stage conditional generative adversarial framework is proposed for field-of-view (FoV) standardisation and tongue region enhancement, improving image consistency and classification performance across domains. To address data scarcity, a cost-focused framework integrates statistical power-curve modelling with active learning to optimise annotation effort, achieving substantial reductions in required labelled data while maintaining clinically meaningful accuracy. Model interpretability is examined using an explainable AI (XAI) technique to assess how image representation and standardisation influence network attention and anatomical relevance. Finally, the feasibility of clinical translation is demonstrated through a prototype web-based deployment system for real-time inference and visualisation.

Collectively, this work presents an integrated DL framework that advances robustness, data efficiency, interpretability, and deployment for ultrasound-based speech assessment, contributing toward objective and scalable clinical decision-support tools for SSDs.
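
The speaker-independent evaluation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the thesis's actual splitting protocol, so the function name `speaker_independent_split`, the `test_fraction` parameter, and the `(speaker_id, item)` data layout below are all hypothetical. The sketch shows only the general idea: every frame from a given speaker falls entirely in the training set or entirely in the test set, so test accuracy reflects generalisation to unseen speakers.

```python
import random
from collections import defaultdict


def speaker_independent_split(samples, test_fraction=0.2, seed=0):
    """Split (speaker_id, item) pairs so that no speaker is in both sets.

    Hypothetical helper for illustration; not the thesis's procedure.
    """
    # Group items by the speaker who produced them.
    by_speaker = defaultdict(list)
    for speaker_id, item in samples:
        by_speaker[speaker_id].append(item)

    # Shuffle speakers (not individual frames) reproducibly.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    # Hold out a fraction of speakers, at least one.
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = speakers[:n_test]
    train_speakers = speakers[n_test:]

    train = [(s, x) for s in train_speakers for x in by_speaker[s]]
    test = [(s, x) for s in test_speakers for x in by_speaker[s]]
    return train, test


# Toy data: 10 assumed speakers with 5 placeholder frames each.
samples = [(f"spk{i}", f"frame{i}_{j}") for i in range(10) for j in range(5)]
train, test = speaker_independent_split(samples)
```

Splitting at the speaker level rather than the frame level avoids leakage: frames from one speaker are highly correlated, so a frame-level random split would tend to overstate cross-speaker performance.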

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Keywords: Ultrasound tongue imaging, speech sound disorders, deep learning, explainable artificial intelligence, generative adversarial networks, speech assessment.
Colleges/Schools: College of Science and Engineering > School of Engineering
Supervisor's Name: Zoha, Dr Ahmed and Cleland, Dr Joanne
Date of Award: 2026
Depositing User: Theses Team
Unique ID: glathesis:2026-86062
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 24 Jun 2026 11:15
Last Modified: 24 Jun 2026 11:22
Thesis DOI: 10.5525/gla.thesis.86062
URI: https://theses.gla.ac.uk/id/eprint/86062
