Halliday, Alba (2025) Bayesian hierarchical modelling frameworks for correcting reporting delays in disease surveillance. PhD thesis, University of Glasgow.
Abstract
Accurate and timely surveillance of infectious diseases is critical for effective public health responses. Up-to-date quantitative indicators for the prevalence of diseases in a population, e.g. case or death counts, can provide early warning of outbreaks, empowering public health bodies to develop targeted interventions, allocate limited resources, and communicate risks to influence public behaviour. However, data collection for such indicators often suffers from delays, for example due to administrative protocols, testing processes, or resource limitations. These delays mean that available information on outbreaks lags behind reality; delays also vary randomly and systematically in space and time, making it difficult to confidently detect disease outbreaks and provide timely, effective interventions.
From a statistical perspective, correcting delayed reporting is a compositional count data prediction problem. Compositional data take the form of parts of some whole: in this case, a set of non-negative counts reported after each delay that sum to a total count, such as the number of disease cases. In a nowcasting setting, the total count is not yet observed and we aim to predict it given the observed parts of the total for delays that have already elapsed. Applying appropriate statistical methodology for count data with this structure can yield models that learn about the properties of the delay distribution and so provide nowcasting predictions. At the same time, this means that methodological advancements in the field of correcting delayed reporting can potentially lead to innovation in the general field of modelling compositional counts, relevant to a wide range of research fields beyond disease epidemics.
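The compositional structure described above can be illustrated with a small sketch. It is not the thesis's method: it is a deliberately naive nowcast that scales each partially observed total by the share of counts expected to have been reported so far, with that share estimated from fully reported weeks. The toy data, the function name `naive_nowcast`, and the use of `NaN` to mark parts not yet reported are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data: rows are weeks, columns are reporting delays (0, 1, 2 weeks).
# np.nan marks parts of the total that have not yet been reported.
z = np.array([
    [50.0, 30.0, 20.0],
    [40.0, 24.0, 16.0],
    [60.0, 36.0, np.nan],
    [45.0, np.nan, np.nan],
])


def naive_nowcast(z):
    """Predict each week's total count from its observed parts.

    Assumes the delay distribution is constant over time, which the
    abstract notes is generally not true of real surveillance data.
    """
    # Weeks for which every delay has already been reported.
    complete = ~np.isnan(z).any(axis=1)
    # Share of the total reported at each delay, pooled over complete weeks.
    props = z[complete].sum(axis=0) / z[complete].sum()
    observed = ~np.isnan(z)
    # Sum of the parts observed so far for each week.
    partial_total = np.where(observed, z, 0.0).sum(axis=1)
    # Expected share of the total that those observed parts represent.
    reported_share = (observed * props).sum(axis=1)
    return partial_total / reported_share


totals = naive_nowcast(z)
print(totals)
```

With this toy data the pooled shares are 0.5, 0.3 and 0.2, so the sketch returns roughly 100, 80, 120 and 90. It gives point predictions only and ignores the random and systematic variation in delays that motivates a hierarchical model.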
Research carried out prior to this project developed a general multivariate Bayesian hierarchical framework, based on the Generalized-Dirichlet-Multinomial (GDM) family of distributions, that can flexibly account for the different sources of variability in count data suffering from delayed reporting. The framework was developed into a model for a time series of an individual disease in one geographic region. The model demonstrated theoretical and practical potential for the GDM method to provide more accurate and precise predictions, compared to alternative methods.
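As a rough illustration of the GDM family mentioned above, the sketch below simulates delay counts using the standard sequential construction, in which each part is a beta-binomial draw from the counts that remain unreported. The mean and dispersion parameterisation (`nu`, `phi`), the function name `sample_gdm`, and the example values are assumptions for illustration and are not taken from the thesis.

```python
import numpy as np


def sample_gdm(total, nu, phi, rng):
    """Draw one vector of delay counts that sums to `total`.

    nu[d]  : assumed expected share of the still-unreported counts
             that is reported at delay d (each value in (0, 1)).
    phi[d] : assumed dispersion; larger values mean less overdispersion.
    """
    remaining = total
    parts = []
    for nu_d, phi_d in zip(nu, phi):
        # Beta-distributed reporting probability, then a binomial draw:
        # together these form a beta-binomial draw on the remainder.
        p = rng.beta(nu_d * phi_d, (1.0 - nu_d) * phi_d)
        z_d = rng.binomial(remaining, p)
        parts.append(z_d)
        remaining -= z_d
    # Whatever is left is reported at the final delay.
    parts.append(remaining)
    return np.array(parts)


rng = np.random.default_rng(1)
parts = sample_gdm(100, nu=[0.5, 0.6, 0.7], phi=[20.0, 20.0, 20.0], rng=rng)
print(parts, parts.sum())
```

The separate dispersion parameter at each delay is what lets this family represent different amounts of variability at different stages of the reporting process.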
The work presented here is underpinned by two broad aims: to make the GDM approach more practical for real-time public health applications and to develop novel extensions to the methodology to account for more complex data challenges and features. For the first aim, we developed improvements in computational efficiency and in streamlining applications to real data. Then, we demonstrated the efficacy of the improved GDM model as a solution for nowcasting COVID-19 hospital deaths in different regions of England. Through an unprecedented rolling prediction experiment, we assessed the performance of the GDM against a cohort of competing methods representing the current state-of-the-art, finding that predictions from the GDM were the most accurate and most precise.
For the second aim, our work was informed by a collaboration with experts at Brazil’s leading public health institute, the Oswaldo Cruz Foundation (Fiocruz). This offered unique insights into the specific data challenges affecting Brazil’s current operational disease warning systems, while also supporting our understanding of more general issues in correcting delayed reporting. One component of this work was motivated by the challenge of nowcasting COVID-positive severe acute respiratory illness (SARI) cases, as an indicator of COVID outbreaks in Brazil. Here, we developed a joint modelling framework for nowcasting total SARI and COVID-positive SARI cases. The framework addressed the novel challenge of correcting delayed reporting of disease counts where information on the length of the reporting delay was not recorded. Applied to data spanning the whole of Brazil, our approach allowed for predictions of COVID-positive cases, which suffer from this data challenge, by leveraging the more timely and complete data for the total SARI cases. A rolling prediction experiment demonstrated improvements in predictive performance from incorporating links between overall SARI incidence and COVID-positive rates, as well as from accounting for patient age distributions.
The last major piece of work of the thesis explored potential effects of the level of a disease in the population on the severity of reporting delays. We investigated this issue in data for different diseases, offering new insights into potential capacity limitations or elasticity within the respective reporting processes. We proposed a framework that flexibly models the effect of the prevalence of the disease on the delay distribution. Through a simulation study aiming to imitate real data, we demonstrated the framework’s ability to disentangle the various sources of variability in the data, including the prevalence-delay interaction, and improve overall prediction accuracy. Since the existing statistical and biostatistical literature on correcting delayed reporting does not assume an explicit effect of disease prevalence on reporting delays, this work could represent the first step towards a new paradigm of nowcasting frameworks.
Overall, the work in this thesis provides substantial methodological advancements in correcting reporting delays for disease surveillance, taking the initial proof-of-concept of the GDM framework and greatly enhancing its practicality and versatility. All aspects of the work were driven by and demonstrated using real-world data challenges, employing realistic prediction experiments to develop a robust evidence base for the potential of advanced methods based on the GDM framework to enhance public health responses and policy decisions.
Item Type: | Thesis (PhD) |
---|---|
Qualification Level: | Doctoral |
Subjects: | H Social Sciences > HA Statistics; Q Science > QA Mathematics |
Colleges/Schools: | College of Science and Engineering > School of Mathematics and Statistics |
Supervisor's Name: | Stoner, Dr. Oliver and Lee, Professor Duncan |
Date of Award: | 2025 |
Depositing User: | Theses Team |
Unique ID: | glathesis:2025-84997 |
Copyright: | Copyright of this thesis is held by the author. |
Date Deposited: | 08 Apr 2025 07:43 |
Last Modified: | 08 Apr 2025 08:16 |
Thesis DOI: | 10.5525/gla.thesis.84997 |
URI: | https://theses.gla.ac.uk/id/eprint/84997 |