Shi, Tong (2026) Interpreting and synthesising human faces and articulated animals from video data. PhD thesis, University of Glasgow.
Full text available as: PDF (13MB)
Abstract
This thesis studies Human Emotion Recognition, 3D Human Head Reconstruction and Articulated Animal Reconstruction from in-the-wild videos. The aim is to enable interpretable analysis of human emotions, and to facilitate controllable synthesis of human faces and reconstruction of articulated animals, by jointly considering appearance (e.g., colour, opacity, and scale), audio, and dynamic motion cues. Our approaches to all three tasks share a common theme: they incorporate explicit representations of 2D and 3D motion and geometry.
When interpreting human emotions from a video, it is natural to draw on different modalities, that is, to fuse features from 2D images and audio segments. Here lies our first contribution: understanding human emotions from talking videos by training a multi-modal neural network to predict various emotion categories in a principled fashion. In particular, this is done by jointly modelling 2D visual features, optical flow features, audio signals, and motion representations through an intra- and inter-modal interaction pipeline. We show that it achieves state-of-the-art performance in the multi-modal emotion recognition setting.
Beyond interpreting 2D images and audio features from a talking portrait video, we further estimate a 3D shape and learn how to reconstruct the 3D portrait and deform its shape so that it can talk, i.e. synthesising talking portrait videos. Our second contribution is a regression approach to synthesising talking portrait videos, which supports training purely from 2D images, without 3D supervision and without using pre-defined 3D shapes from face-specific priors such as 3D morphable models, landmarks and depth maps. Moreover, this model is generic, so it allows sampling a new portrait and animating it conditioned on an arbitrary audio chunk.
Going beyond human heads, our third contribution addresses more complex articulated and deformable objects, articulated animals in particular, whose motion and articulated structure are challenging to reason about. In this task, we reconstruct an articulated 3D animal model from a monocular video of an animal. The model jointly captures complex animal pose variation and canonical appearance, and optimises an implicit opacity–colour texture that is supported on a mesh scaffold.
| Item Type: | Thesis (PhD) |
|---|---|
| Qualification Level: | Doctoral |
| Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
| Colleges/Schools: | College of Science and Engineering > School of Computing Science |
| Supervisor's Name: | Henderson, Professor Paul and Pugeault, Dr. Nicolas |
| Date of Award: | 2026 |
| Depositing User: | Theses Team |
| Unique ID: | glathesis:2026-85937 |
| Copyright: | Copyright of this thesis is held by the author. |
| Date Deposited: | 21 May 2026 12:55 |
| Last Modified: | 21 May 2026 13:00 |
| Thesis DOI: | 10.5525/gla.thesis.85937 |
| URI: | https://theses.gla.ac.uk/id/eprint/85937 |