Rennie, Gordon (2025) Automatic detection of laughter in spontaneous conversations. PhD thesis, University of Glasgow.
Full text available as: PDF, Download (3MB)
Abstract
Laughter is an expression used to communicate in a variety of important ways. It is used to signal enjoyment and humour, to control and maintain the flow of conversation, to help mediate the discussion of controversial topics and to help speakers bond. Given laughter’s wide range of uses, it is vital for computers to be able to detect laughter if they are to engage in effective human-computer interaction. However, laughter is not homogeneous. There are two broad types of laughter: voiced and unvoiced. In addition, many individuals have their own distinctive ways of laughing; the pitch, volume, length and frequency of laughter vary widely across speakers. Furthermore, laughter occurs infrequently. These factors make it difficult to apply machine learning approaches to the automatic detection of laughter.
This thesis initially shows, through a literature review, that the task of laughter detection has been widely addressed. However, the field has placed constraints upon the task, and these constraints split the work into three broad types of task. Type 1 tasks classify short clips of between 1 and 3 seconds, each containing only a single kind of speech event (e.g., laughter, speech, a sigh or a filler). Type 2 tasks use medium-length clips of between 3 and 11 seconds; each clip contains multiple speech events, but laughter can constitute a large proportion of the total audio, between 10% and 30%. Finally, type 3 tasks employ long-form conversations of between 10 minutes and an hour, in which laughter makes up less than 10% of the audio and there is no guarantee that any laughter is present. It is first shown that these three types of task vary in difficulty: the F1 score achieved by the same methodology varies from 80-100% on type 1 tasks, to around 50% on type 2 tasks, to 25% on type 3 tasks. Furthermore, a disparity is found between the effectiveness of laughter detection methods as estimated by different evaluation metrics. This is shown to lead to an over-estimation of the effectiveness of state-of-the-art methods on type 2 and type 3 laughter detection tasks.
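To make the metric disparity concrete, the sketch below contrasts frame-level F1 (computed over fixed-length frame labels) with one common event-level convention, in which a reference laughter event counts as detected if any predicted event overlaps it. The function names and the any-overlap matching rule are illustrative assumptions, not the thesis's exact evaluation protocol.

```python
import numpy as np

def f1(tp, fp, fn):
    """Standard F1 from true-positive, false-positive and false-negative counts."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def frame_level_f1(y_true, y_pred):
    """F1 over per-frame binary label arrays (1 = laughter frame)."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return f1(tp, fp, fn)

def to_events(frames):
    """Collapse a binary frame sequence into contiguous (start, end) runs."""
    events, start = [], None
    for i, v in enumerate(frames):
        if v and start is None:
            start = i
        if not v and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(frames)))
    return events

def event_level_f1(y_true, y_pred):
    """F1 over laughter events: a reference event counts as detected
    if any predicted event overlaps it (one common convention)."""
    ref, hyp = to_events(y_true), to_events(y_pred)
    overlaps = lambda a, b: a[0] < b[1] and b[0] < a[1]
    tp = sum(any(overlaps(r, h) for h in hyp) for r in ref)
    fp = sum(not any(overlaps(h, r) for r in ref) for h in hyp)
    fn = len(ref) - tp
    return f1(tp, fp, fn)
```

Because the two scores weight short, sparse laughter events very differently, reporting only one of them can overstate a detector's effectiveness, which is the over-estimation the abstract refers to.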
This thesis replicates the state-of-the-art research on a publicly available type 2 corpus, achieving a frame-level F1 of 40% and an event-level F1 of 52%. It then applies these methods to the SSPNet Mobile Corpus, a private type 3 dataset, and shows that the same methods achieve a frame-level F1 of 15% and an event-level F1 of 26%. An extensive performance analysis illustrates that the longer audio introduces a large number of false laughter detections centred on speech. It is then demonstrated that methods which specifically target the removal of these false detections, by leveraging automatic speech recognition, achieve a frame-level F1 of 30% and an event-level F1 of 45%, an almost two-fold increase in performance over the state-of-the-art approaches for type 3 tasks.
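A minimal sketch of the ASR-based filtering idea: discard candidate laughter segments that a speech recogniser transcribes as words. The `transcribe` callable and the zero-word threshold are hypothetical stand-ins; the thesis's actual pipeline may differ.

```python
def filter_speech_false_positives(audio, candidates, transcribe, max_words=0):
    """Drop candidate laughter detections that an ASR system hears as speech.

    audio       -- waveform as a 1-D array of samples
    candidates  -- list of (start_sample, end_sample) laughter detections
    transcribe  -- callable mapping an audio slice to a list of words
                   (hypothetical; stands in for any ASR front end)
    max_words   -- keep a segment only if ASR finds at most this many
                   words in it (assumed threshold)
    """
    kept = []
    for start, end in candidates:
        words = transcribe(audio[start:end])
        if len(words) <= max_words:  # no recognisable speech: keep it
            kept.append((start, end))
    return kept
```

With `max_words=0`, a segment survives only if the recogniser finds no words in it at all, the strictest form of the speech-centred false-positive removal described above.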
Transformers are then applied to the task. It is demonstrated that transformers pretrained on audio tasks such as automatic speech recognition can be used to extract attention embeddings that act as low-level descriptors of the audio data, and that such embeddings are more effective than hand-crafted features for training laughter detectors. This method achieves a frame-level F1 of 60% and an event-level F1 of 80%, the best results achieved in type 3 laughter detection. The effectiveness of the approach is then replicated on the SSPNet Vocalisation Corpus, where it achieves a frame-level F1 of 77% and an event-level F1 of 88%. Furthermore, it is shown to be equally effective at automatic filler detection, achieving a frame-level F1 of 70% and an event-level F1 of 80%. The final section applies a selection of the laughter detection systems to detect differences in laughter behaviour due to the gender composition of the speakers in a conversation, demonstrating an initial use-case of automatic speaker information extraction. Overall, this thesis accomplishes effective laughter detection in a type 3 task.
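As a sketch of the embedding approach, the snippet below extracts per-frame embeddings from an ASR-pretrained audio transformer and trains a simple frame classifier on them. The wav2vec 2.0 checkpoint and the logistic-regression head are illustrative assumptions; the abstract does not name the thesis's specific model or classifier.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import LogisticRegression

# Any ASR-pretrained audio transformer could play this role; this
# checkpoint is an illustrative choice, not the thesis's model.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
encoder.eval()

def embed(waveform, sr=16000):
    """Return one embedding vector per ~20 ms frame of audio."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # shape (1, frames, 768)
    return hidden.squeeze(0).numpy()

# Hypothetical training data: waveforms with per-frame laughter labels
# aligned to the encoder's frame rate.
# X = np.vstack([embed(w) for w in train_waveforms])
# y = np.concatenate(train_frame_labels)
# clf = LogisticRegression(max_iter=1000).fit(X, y)
# frame_probs = clf.predict_proba(embed(test_waveform))[:, 1]
```

The design point the abstract makes is that these learned frame embeddings replace hand-crafted acoustic features; the downstream classifier itself can remain simple.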
| Item Type: | Thesis (PhD) |
| --- | --- |
| Qualification Level: | Doctoral |
| Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
| Colleges/Schools: | College of Science and Engineering > School of Engineering |
| Supervisor's Name: | Vinciarelli, Professor Alessandro and Perepelkina, Dr. Olga |
| Date of Award: | 2025 |
| Depositing User: | Theses Team |
| Unique ID: | glathesis:2025-84936 |
| Copyright: | Copyright of this thesis is held by the author. |
| Date Deposited: | 04 Apr 2025 10:34 |
| Last Modified: | 04 Apr 2025 10:39 |
| Thesis DOI: | 10.5525/gla.thesis.84936 |
| URI: | https://theses.gla.ac.uk/id/eprint/84936 |
| Related URLs: | |