Explicit object-centric video prediction with deep learning models

Sulaiman, Yiliyasi (2026) Explicit object-centric video prediction with deep learning models. PhD thesis, University of Glasgow.

Full text available as: PDF (5MB)

Abstract

Video prediction is a crucial task for intelligent agents such as robots and autonomous vehicles, as it enables them to anticipate and react early to time-critical incidents. Many state-of-the-art video prediction methods model the dynamics of a scene jointly and implicitly, treating the scene as a single entity without any explicit decomposition into separate objects. This is sub-optimal: every object in a dynamic scene has its own pattern of movement, typically somewhat independent of the others. We therefore hypothesize that explicit modelling of moving objects is crucial for video prediction in limited-data and limited-compute scenarios.

We first investigate video prediction with multiple moving and interacting objects in a static-camera setting, using a latent transformer as the video predictor. We conduct detailed, carefully controlled experiments on both synthetic and real-world datasets; our results show that decomposing a dynamic scene leads to higher-quality predictions than models of similar capacity that lack such decomposition. We then investigate trajectory prediction for occluded objects and for scenes with background motion, a common phenomenon in real-world scenarios. We introduce explicit motion information, in the form of depth maps and point flow, to assist the prediction model proposed previously, and evaluate this approach in both synthetic and real-world scenarios. The experimental results show that integrating explicit motion information yields more accurate predicted trajectories for dynamic objects. Finally, we investigate fully deformable objects, such as the scenes encountered in garment manipulation tasks. We introduce a diffusion variant of our video prediction model, whose continuous nature makes it better suited than transformer-based architectures to predicting the motion of fully deformable objects. Testing on a garment manipulation dataset, we find that the diffusion-based variant outperforms our transformer-based models.
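To make the setup concrete, below is a minimal PyTorch sketch of an object-centric latent-transformer predictor of the kind described above: each object occupies its own latent slot, a shared transformer models dynamics and interactions across slots, and optional embeddings of explicit motion cues (depth, point flow) can be fused in. All names, shapes, and hyperparameters here are illustrative assumptions, not taken from the thesis.

import torch
import torch.nn as nn

class ObjectCentricPredictor(nn.Module):
    # Hypothetical sketch: per-object latent "slots" plus a shared transformer
    # that models each object's dynamics and their interactions, rather than
    # treating the scene as a single entity.
    def __init__(self, num_slots=6, slot_dim=64, num_layers=4, num_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=slot_dim, nhead=num_heads, batch_first=True)
        self.dynamics = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Projects embeddings of explicit motion cues (e.g. depth maps,
        # point flow) so they can be added to the object latents (assumed).
        self.motion_proj = nn.Linear(slot_dim, slot_dim)
        self.head = nn.Linear(slot_dim, slot_dim)  # next-step latent per slot

    def forward(self, slot_history, motion_feats=None):
        # slot_history: (batch, time, num_slots, slot_dim) object latents.
        # motion_feats: optional tensor of the same shape holding motion-cue
        # embeddings; fused as a residual before the dynamics model.
        if motion_feats is not None:
            slot_history = slot_history + self.motion_proj(motion_feats)
        b, t, s, d = slot_history.shape
        tokens = slot_history.reshape(b, t * s, d)  # one token per (frame, slot)
        h = self.dynamics(tokens).reshape(b, t, s, d)
        # Predict each object's latent at the next time step from the
        # representation of the last observed frame's slots.
        return self.head(h[:, -1])  # (batch, num_slots, slot_dim)

# Example: predict the next latents for 6 objects from 8 observed frames.
predictor = ObjectCentricPredictor()
next_slots = predictor(torch.randn(2, 8, 6, 64))

The diffusion-based variant mentioned above would replace the deterministic prediction head with an iterative denoising step over the object latents; that is omitted here for brevity.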

Our findings suggest that, for video prediction models to accurately capture the motion patterns inside a dynamic scene, scaling up holistic models is inefficient and resource-consuming. In contrast, decomposing the scene into objects and modelling them with explicit motion information is a more efficient alternative to monolithic models of the same capacity. This approach is likely to be most useful in closed-world settings, such as robotic manipulation tasks, where only a limited number of objects appear in the scene.

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Additional Information: Supported by funding from the School of Computing Science, University of Glasgow.
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Colleges/Schools: College of Science and Engineering > School of Computing Science
Supervisor's Name: Pugeault, Dr. Nicolas and Henderson, Dr. Paul
Date of Award: 2026
Depositing User: Theses Team
Unique ID: glathesis:2026-85705
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 23 Jan 2026 10:35
Last Modified: 25 Jan 2026 09:04
Thesis DOI: 10.5525/gla.thesis.85705
URI: https://theses.gla.ac.uk/id/eprint/85705
