A two-stage Bayesian modelling framework with applications in spatial epidemiology

Villejo, Stephen Jun Vecera (2025) A two-stage Bayesian modelling framework with applications in spatial epidemiology. PhD thesis, University of Glasgow.

Abstract

This thesis proposes a framework for two-stage modelling in spatial epidemiology, whose main goal is to understand the association between a covariate of interest, which is modelled in the first stage, and health outcomes, which are modelled in the second stage using the first-stage model predictions as inputs. A two-stage modelling framework is more computationally efficient than a joint modelling approach when the first-stage model is already complex in itself, and it avoids the potential problem of unwanted feedback effects, which occur when the second-stage data affect first-stage model inference. Chapter 1 discusses the motivation behind this research. The specific data application of this thesis links dengue incidence and climate variables, particularly temperature, relative humidity, and rainfall, in the Philippines. Dengue is an infectious disease transmitted by Aedes mosquitoes, and it poses a significant socioeconomic and disease burden in many tropical and subtropical regions of the world.

In a two-stage modelling framework, the first stage fits the model for the main covariate of interest, whose association with the health outcome is investigated. In the data application, the first stage fits climate models, which are then used to predict the true climate field over the entire spatial domain. The main data limitation, which poses challenges for the accuracy of model inference and predictions, is the sparsity of the weather station data. This problem is overcome by incorporating additional data sources (referred to as proxy data), which are more biased but have wider spatial coverage, and then combining the different data sources in a process called data fusion, whose main goal is to improve model accuracy. Chapter 3 presents an initial exploration of a Bayesian data fusion model estimated using the integrated nested Laplace approximation (INLA). Chapter 4 presents a flexible specification of the data fusion model, which is shown to outperform benchmark approaches in terms of the accuracy of model predictions and parameter estimates. The proposed model specifies both a time-varying random field to account for the additive bias and a constant multiplicative bias parameter in the proxy data. Chapter 4 also presents the results from applying the proposed data fusion model to meteorological data from the Philippines. The results of leave-group-out cross-validation show that the data fusion model outperforms benchmark approaches.
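As a rough illustration of the bias structure described above, one possible way to write such a data fusion model is sketched below. The notation is assumed for illustration and is not taken from the thesis: x is the latent true climate field, w the weather station observations, p the proxy data, alpha_0 the time-varying additive bias field, and alpha_1 the constant multiplicative bias. The Gaussian error terms and the treatment of station data as unbiased are likewise assumptions.

```latex
% Illustrative sketch only; all symbols are assumed notation.
% x(s,t): latent true climate field at location s and time t
% w: station observations (assumed unbiased), p: proxy data
\begin{align}
w(\mathbf{s}_i, t) &= x(\mathbf{s}_i, t) + e(\mathbf{s}_i, t),
  & e(\mathbf{s}_i, t) &\sim \mathcal{N}(0, \sigma_e^2), \\
p(\mathbf{s}, t) &= \alpha_0(\mathbf{s}, t) + \alpha_1\, x(\mathbf{s}, t) + \delta(\mathbf{s}, t),
  & \delta(\mathbf{s}, t) &\sim \mathcal{N}(0, \sigma_\delta^2).
\end{align}
% alpha_0(s,t): time-varying random field for the additive bias
% alpha_1: constant multiplicative bias parameter
```

Under this sketch, both data sources inform the same latent field x, so the proxy data can fill spatial gaps between stations once its additive and multiplicative biases are accounted for.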

Chapter 5 presents results from an extensive analysis of the link between climate and dengue occurrence in the Philippines. The predicted climate fields from Chapter 4 are used as inputs to the health model. To account for the uncertainty in the predictions from the climate models, a resampling approach is used: samples are generated from the first-stage model posteriors, and each sample is used as an input to the second-stage model. The final posterior estimates of the second-stage model parameters are then computed using Bayesian model averaging. The results show that temperature has a non-linear relationship with dengue occurrence. In particular, temperature is generally positively related to dengue, but very hot conditions tend to have a negative impact. Moreover, the relationship between rainfall and dengue varies in space, depending on the climate type of the area. For areas with uniform rainfall and low variation in its amount all year round, rainfall is negatively associated with dengue, while for areas with pronounced dry and wet seasons, rainfall is positively related to dengue. This is potentially explained by the fact that consistent rainfall tends to wash away mosquito breeding sites, while sporadic rainfall during the dry season tends to create more breeding sites.
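The resampling idea described above can be sketched on a toy problem. The example below is a minimal illustration under strong simplifying assumptions, not the thesis implementation: the second stage is a simple linear regression fitted by least squares (standing in for a Bayesian health model fitted with INLA), the first-stage posterior is an independent Gaussian per location, and all fits receive equal weight in the averaging step. The function names `fit_second_stage` and `resample_and_average` are hypothetical.

```python
import numpy as np


def fit_second_stage(x, y):
    """Least-squares slope and its variance (a flat-prior Gaussian stand-in
    for the second-stage posterior of the covariate effect)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1], cov[1, 1]


def resample_and_average(x_mean, x_sd, y, n_samples, rng):
    """Refit the second stage on draws from an assumed Gaussian first-stage
    posterior, then pool the fits as an equally weighted mixture."""
    slopes = np.empty(n_samples)
    variances = np.empty(n_samples)
    for j in range(n_samples):
        x_draw = rng.normal(x_mean, x_sd)  # one first-stage posterior sample
        slopes[j], variances[j] = fit_second_stage(x_draw, y)
    pooled_mean = slopes.mean()
    # Mixture variance: average within-fit variance plus between-fit variance.
    pooled_var = variances.mean() + slopes.var()
    return pooled_mean, pooled_var


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_mean = np.linspace(-2.0, 2.0, 50)  # first-stage posterior means
    y = 1.0 + 2.0 * x_mean + rng.normal(scale=0.1, size=50)
    plug_in = fit_second_stage(x_mean, y)  # ignores first-stage uncertainty
    pooled = resample_and_average(x_mean, 0.5, y, 200, rng)
    print("plug-in slope and variance:", plug_in)
    print("resampled slope and variance:", pooled)
```

The pooled variance adds the spread of the estimates across first-stage draws to the average within-fit variance, which is how first-stage uncertainty reaches the second-stage summary in this sketch. When the first-stage standard deviation is zero, the result reduces to the plug-in fit.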

Chapter 6 investigates the correctness of the two approaches to two-stage modelling used in Chapter 5: the crude plug-in approach, which simply plugs the posterior mean of the first-stage (climate) model parameters into the second-stage (health) model, and the resampling approach. I used the simulation-based calibration (SBC) approach, which tests the self-consistency property of Bayesian models, to validate the correctness of these approaches. Results show that the crude plug-in method indeed underestimates the posterior uncertainty in the second-stage model parameters, while the resampling approach is correct. This chapter also proposes a new approach to uncertainty propagation, called the Q uncertainty method, which introduces a new model component called the error component. The Q^{-1} matrix essentially encodes the uncertainty in the first-stage latent parameters. In addition, I proposed a low-rank approximation of the Q matrix, which can be useful for large spatio-temporal applications. I also used the SBC method to validate the correctness of the proposed method. Results of model validation on toy spatial models show that the Q method can be correct, but the accuracy of the posterior approximations and the computational benefits of the method depend on the coarseness of the mesh for the error component and the dimension of the first-stage model latent parameters. The main reason for the computational bottleneck with the proposed method is that the predictor expression of the Q method involves non-linear model components, which do not fit conveniently in the INLA framework.
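To make the SBC idea concrete, the sketch below runs it on a toy conjugate model, assuming a standard normal prior on a mean parameter and unit-variance Gaussian observations. It is an illustration only and is unrelated to the spatial models in the thesis. The hypothetical `sd_scale` argument shrinks the posterior standard deviation to mimic a method that underestimates uncertainty, and rank uniformity is summarised with a simple chi-square statistic.

```python
import numpy as np


def sbc_ranks(n_reps, n_obs, n_draws, sd_scale, rng):
    """Rank of the prior-drawn parameter among posterior draws, per replicate.

    Toy model: theta ~ N(0, 1), y_i | theta ~ N(theta, 1).
    sd_scale = 1 uses the exact posterior; sd_scale < 1 is overconfident.
    """
    ranks = np.empty(n_reps, dtype=int)
    for r in range(n_reps):
        theta = rng.normal(0.0, 1.0)               # draw from the prior
        y = rng.normal(theta, 1.0, size=n_obs)     # simulate data
        post_mean = y.sum() / (n_obs + 1)          # exact conjugate posterior
        post_sd = np.sqrt(1.0 / (n_obs + 1))
        draws = rng.normal(post_mean, sd_scale * post_sd, size=n_draws)
        ranks[r] = np.sum(draws < theta)           # rank in 0..n_draws
    return ranks


def chi2_uniformity(ranks, n_draws):
    """Chi-square statistic of rank counts against a uniform distribution."""
    counts = np.bincount(ranks, minlength=n_draws + 1)
    expected = len(ranks) / (n_draws + 1)
    return float(np.sum((counts - expected) ** 2 / expected))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    exact = sbc_ranks(2000, 10, 19, 1.0, rng)
    shrunk = sbc_ranks(2000, 10, 19, 0.5, rng)
    print("chi-square, exact posterior:", chi2_uniformity(exact, 19))
    print("chi-square, overconfident posterior:", chi2_uniformity(shrunk, 19))
```

If the posterior computation is self-consistent, the ranks are uniformly distributed and the statistic stays close to its degrees of freedom (here 19). An overconfident posterior piles ranks at the two extremes, which inflates the statistic.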

Finally, Chapter 7 concludes by highlighting the main contributions of this thesis and outlining potential directions for future work. In addition, I revisit current approaches for fitting conditional latent Gaussian models and provide ideas on a new approach for fitting such models. Whereas the previous chapters highlight the problem of spatial misalignment, the final chapter discusses the issue of temporal misalignment. I provide ideas and initial results from using INLA to fit Mixed-Data-Sampling (MIDAS) models, which provide a framework for fitting a regression model to time series data observed at different frequencies.

Item Type: Thesis (PhD)
Qualification Level: Doctoral
Subjects: H Social Sciences > HA Statistics
Colleges/Schools: College of Science and Engineering > School of Mathematics and Statistics
Supervisor's Name: Illian, Professor Janine and Castro-Camilo, Dr. Daniela
Date of Award: 2025
Depositing User: Theses Team
Unique ID: glathesis:2025-85492
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 02 Oct 2025 14:37
Last Modified: 02 Oct 2025 14:40
Thesis DOI: 10.5525/gla.thesis.85492
URI: https://theses.gla.ac.uk/id/eprint/85492