Multilevel models for the analysis of linguistic data

Alexander, Craig (2019) Multilevel models for the analysis of linguistic data. PhD thesis, University of Glasgow.

Describing the numerous factors that constrain and promote particular aspects of linguistic behaviour in interaction is very difficult. The recent adoption of more advanced quantitative methods has enhanced this modelling, leading to a greater understanding of linguistic patterns. At the same time, the increasing availability of digital recordings, and of storage capacity for them, is producing increasingly large corpora of complex linguistic data for such investigations. The Sounds of the City corpus is one such example and is the corpus we model throughout this thesis. It is an electronic real-time corpus of Glaswegian vernacular, consisting of a searchable, multi-layered database of 58 hours of recordings from 136 speakers, recorded between 1970 and 2010, with orthographic transcripts and automatically phonemically segmented waveforms that are amenable to automatic acoustic analyses of durational and resonance characteristics of speech.

Vowel formant measurements provide a numeric representation of a spoken vowel and are commonly used to measure linguistic variation and change. Each vowel has multiple formant measures, which correspond to the resonances of the vocal tract. The first three vowel formants are important perceptual cues for the successful recognition of vowel qualities. Current quantitative modelling methods consider each formant separately, drawing inferences about each formant measurement under the assumption that the formants are independent of one another. For most vowels this assumption seems misplaced, as formant measures are often correlated with one another.

In this thesis, we extend current modelling techniques applied to sociolinguistic corpora by introducing a Bayesian hierarchical model which models the first three formant measures for each vowel simultaneously, taking into consideration the correlation present between such measures. We also implement reparameterisation methods to alleviate issues caused by highly correlated samples, which are often observed in MCMC output for models applied to datasets with nested structures, a common feature of sociolinguistic corpora. Like classical mixed effects models, these models account for the complex nested structure of the data and uncover the underlying dynamics of language, but they additionally account for the correlation between formants, providing a more accurate representation of the factors driving linguistic variation and change.
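
As a rough illustration of the kind of model described above, a joint model for the three formants with a speaker-level random effect could be written as follows. This is only a sketch under stated assumptions: the abstract does not give the model specification, so every symbol here is hypothetical (token i of speaker j, formant vector y_ij, covariates x_ij, coefficient matrix B, residual covariance Σ, random-effect covariance Ω with Cholesky factor L). The non-centred form shown last is one commonly used reparameterisation for reducing correlation in MCMC samples; the abstract does not say which reparameterisation the thesis adopts.

```latex
\begin{align*}
\mathbf{y}_{ij} &= B^{\top}\mathbf{x}_{ij} + \mathbf{b}_{j} + \boldsymbol{\varepsilon}_{ij},
 & \boldsymbol{\varepsilon}_{ij} &\sim \mathcal{N}_3(\mathbf{0}, \Sigma), \\
\mathbf{b}_{j} &\sim \mathcal{N}_3(\mathbf{0}, \Omega)
 && \text{(centred form)}, \\
\mathbf{b}_{j} &= L\,\mathbf{z}_{j}, \quad \mathbf{z}_{j} \sim \mathcal{N}_3(\mathbf{0}, I_3), \quad \Omega = L L^{\top}
 && \text{(non-centred form)}.
\end{align*}
```

Here the off-diagonal entries of Σ and Ω are what allow the three formants to be correlated, in contrast to fitting a separate model to each formant.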

The output from the Bayesian hierarchical model is visualised as a graphical model. Graphical models provide a visual representation of the conditional dependence between variables, making them an attractive inference tool. We combine the two approaches and jointly infer the relationship between vowel formant measurements by using the precision estimates from the hierarchical model as input to a Bayesian Gaussian graphical model. The resulting graph has a chain-graph-like structure which shows the user which factors have a significant effect on vowel variation for each formant, as well as the relationships present between the first three formants. This novel inference tool aids the understanding of complex model output such as that from the models fitted to the Sounds of the City corpus, and it can easily be applied to numerous other modelling problems.
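
To make the link between precision estimates and conditional dependence concrete, the sketch below converts a precision matrix into partial correlations and lists the implied graph edges. It is a simplified illustration, not the thesis's method: the thesis uses a Bayesian Gaussian graphical model, whereas this example simply thresholds a single point estimate. The precision matrix, the labels F1 to F3, the function names, and the threshold are all assumptions made for the example.

```python
import numpy as np


def partial_correlations(precision):
    """Convert a precision matrix K into partial correlations.

    The partial correlation between variables i and j, given all others,
    is -K[i, j] / sqrt(K[i, i] * K[j, j]).
    """
    precision = np.asarray(precision, dtype=float)
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor


def graph_edges(precision, labels, threshold=0.1):
    """List pairs whose absolute partial correlation exceeds the threshold."""
    pcor = partial_correlations(precision)
    edges = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if abs(pcor[i, j]) > threshold:
                edges.append((labels[i], labels[j], float(pcor[i, j])))
    return edges


# Hypothetical precision matrix for (F1, F2, F3); the zero entry encodes
# conditional independence of F1 and F3 given F2.
precision = np.array([
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 2.0],
])

pcor = partial_correlations(precision)
edges = graph_edges(precision, ["F1", "F2", "F3"])
for a, b, value in edges:
    print(f"{a} -- {b}: partial correlation {value:.2f}")
```

A zero in the precision matrix corresponds to a missing edge, so in this toy example the graph links F1 to F2 and F2 to F3 but not F1 to F3. A fully Bayesian treatment would instead report posterior uncertainty about each edge rather than applying a fixed cut-off.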

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Keywords: Linguistics, Bayesian statistics, Glaswegian dialect, graphical models, hierarchical models.
Subjects: P Language and Literature > PE English
Q Science > Q Science (General)
Q Science > QA Mathematics
Colleges/Schools: College of Science and Engineering > School of Mathematics and Statistics > Statistics
Supervisor's Name: Evers, Dr. Ludger
Date of Award: 2019
Depositing User: Craig Alexander
Unique ID: glathesis:2019-41168
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 03 Jun 2019 11:58
Last Modified: 05 Mar 2020 22:46
Thesis DOI: 10.5525/gla.thesis.41168
