Enlighten Theses

Interpreting and synthesising human faces and articulated animals from video data

Shi, Tong (2026) Interpreting and synthesising human faces and articulated animals from video data. PhD thesis, University of Glasgow.

Full text available as: PDF (13MB)

Abstract

This thesis studies human emotion recognition, 3D human head reconstruction and articulated animal reconstruction from in-the-wild videos. The aim is to enable interpretable analysis of human emotions, and to facilitate controllable synthesis of human faces and reconstruction of articulated animals, by jointly considering appearance (e.g., colour, opacity, and scale), audio, and dynamic motion cues. Our approaches to all three tasks share a common theme: they incorporate explicit representations of 2D and 3D motion and geometry.

When interpreting human emotions from a video, it is natural to draw on different modalities, that is, to fuse features from 2D images and audio segments. Here lies our first contribution: understanding human emotions from talking videos by training a multi-modal neural network to predict various emotion categories in a principled fashion. In particular, this is done by jointly modelling 2D visual features, optical flow features, audio signals, and motion representations through an intra- and inter-modal interaction pipeline. We show that it achieves state-of-the-art performance in the multi-modal emotion recognition setting.

Beyond interpreting 2D images and audio features from a talking portrait video, we further estimate a 3D shape and learn how to reconstruct the 3D portrait and deform its shape so that it can talk, i.e. synthesising talking portrait videos. Our second contribution is a regression approach to synthesising talking portrait videos, which supports training purely from 2D images, without 3D supervision and without using pre-defined 3D shapes from face-specific priors such as 3D morphable models, landmarks and depth maps. Moreover, this model is generic, so it allows sampling a new portrait and animating it conditioned on an arbitrary audio chunk.

Going beyond human heads, our third contribution addresses more complex articulated and deformable objects, articulated animals in particular, whose motion and articulated structure are challenging to reason about. In this task, we reconstruct an articulated 3D animal model from a monocular video of an animal. The method jointly models complex animal pose variation and canonical appearance, and optimises an implicit opacity–colour texture that is supported on a mesh scaffold.

Item Type:	Thesis (PhD)
Qualification Level:	Doctoral
Subjects:	Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Colleges/Schools:	College of Science and Engineering > School of Computing Science
Supervisor's Name:	Henderson, Professor Paul and Pugeault, Dr. Nicolas
Date of Award:	2026
Depositing User:	Theses Team
Unique ID:	glathesis:2026-85937
Copyright:	Copyright of this thesis is held by the author.
Date Deposited:	21 May 2026 12:55
Last Modified:	21 May 2026 13:00
Thesis DOI:	10.5525/gla.thesis.85937
URI:	https://theses.gla.ac.uk/id/eprint/85937
