Welcome to our research page featuring recent publications in the field of biostatistics and epidemiology! These fields play a crucial role in advancing our understanding of the causes, prevention, and treatment of various health conditions. Our team is dedicated to advancing the field through innovative studies and cutting-edge statistical analyses. On this page, you will find our collection of research publications describing the development of new statistical methods and their application to real-world data. Please feel free to contact us with any questions or comments.
External validation of the discriminative ability of prediction models is of key importance. However, the interpretation of such evaluations is challenging, as the ability to discriminate depends on both the sample characteristics (ie, case-mix) and the generalizability of predictor coefficients, but most discrimination indices do not provide any insight into their respective contributions. To disentangle differences in discriminative ability across external validation samples due to a lack of model generalizability from differences in sample characteristics, we propose propensity-weighted measures of discrimination. These weighted metrics, which are derived from propensity scores for sample membership, are standardized for case-mix differences between the model development and validation samples, allowing for a fair comparison of discriminative ability in terms of model characteristics in a target population of interest. We illustrate our methods with the validation of eight prediction models for deep vein thrombosis in 12 external validation data sets and assess our methods in a simulation study. In the illustrative example, propensity score standardization reduced between-study heterogeneity of discrimination, indicating that between-study variability was partially attributable to case-mix. The simulation study showed that only flexible propensity-score methods (allowing for non-linear effects) produced unbiased estimates of model discrimination in the target population, and only when the positivity assumption was met. Propensity score-based standardization may facilitate the interpretation of (heterogeneity in) discriminative ability of a prediction model as observed across multiple studies, and may guide model updating strategies for a particular target population. Careful propensity score modeling with attention to non-linear relations is recommended.
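To make the idea of a case-mix standardized discrimination measure concrete, the sketch below computes a weighted concordance (C) statistic in plain Python. It is an illustration only and not necessarily the estimator used in the publication: the function name `weighted_c_statistic` is hypothetical, and the weights are assumed to have been derived beforehand from a propensity model for sample membership (for example, as the odds of belonging to the target sample). Each pair of a case and a non-case contributes the product of the two individuals' weights.

```python
def weighted_c_statistic(risks, outcomes, weights):
    """Weighted probability that a case has a higher predicted risk than a non-case.

    risks    : predicted risks from the prediction model being validated
    outcomes : observed binary outcomes (1 = case, 0 = non-case)
    weights  : non-negative weights, assumed to come from a propensity model
               for sample membership (hypothetical input, estimated elsewhere)
    """
    cases = [(r, w) for r, y, w in zip(risks, outcomes, weights) if y == 1]
    non_cases = [(r, w) for r, y, w in zip(risks, outcomes, weights) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")

    concordant = 0.0
    total = 0.0
    for risk_case, w_case in cases:
        for risk_non_case, w_non_case in non_cases:
            pair_weight = w_case * w_non_case
            total += pair_weight
            if risk_case > risk_non_case:
                concordant += pair_weight        # correctly ranked pair
            elif risk_case == risk_non_case:
                concordant += 0.5 * pair_weight  # ties count as half
    if total == 0:
        raise ValueError("all pair weights are zero")
    return concordant / total


risks = [0.10, 0.40, 0.35, 0.80]
outcomes = [0, 0, 1, 1]
unweighted = weighted_c_statistic(risks, outcomes, [1, 1, 1, 1])
reweighted = weighted_c_statistic(risks, outcomes, [1, 3, 1, 1])
```

With all weights equal to 1 the function reduces to the usual C statistic. Up-weighting an individual who is more typical of the target population changes the contribution of every pair that individual belongs to, which is how case-mix differences enter the estimate.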
Real-world data sources offer opportunities to compare the effectiveness of treatments in practical clinical settings. However, relevant outcomes are often recorded selectively and collected at irregular measurement times. It is therefore common to convert the available visits to a standardized schedule with equally spaced visits. Although more advanced imputation methods exist, they are not designed to recover longitudinal outcome trajectories and typically assume that missingness is non-informative. We, therefore, propose an extension of multilevel multiple imputation methods to facilitate the analysis of real-world outcome data that is collected at irregular observation times. We illustrate multilevel multiple imputation in a case study evaluating two disease-modifying therapies for multiple sclerosis in terms of time to confirmed disability progression. This survival outcome is derived from repeated measurements of the Expanded Disability Status Scale, which is collected when patients come to the healthcare center for a clinical visit and for which longitudinal trajectories can be estimated. Subsequently, we perform a simulation study to compare the performance of multilevel multiple imputation to commonly used single imputation methods. Results indicate that multilevel multiple imputation leads to less biased treatment effect estimates and improves the coverage of confidence intervals, even when outcomes are missing not at random.
Most clinical specialties have a plethora of studies that develop or validate one or more prediction models, for example, to inform diagnosis or prognosis. Having many prediction model studies in a particular clinical field motivates the need for systematic reviews and meta-analyses, to evaluate and summarise the overall evidence available from prediction model studies, in particular about the predictive performance of existing models. Such reviews are fast emerging, and should be reported completely, transparently, and accurately. To help ensure this type of reporting, this article describes a new reporting guideline for systematic reviews and meta-analyses of prediction model research.
When data are available from individual patients receiving either a treatment or a control intervention in a randomized trial, various statistical and machine learning methods can be used to develop models for predicting future outcomes under the two conditions, and thus to predict treatment effect at the patient level. These predictions can subsequently guide personalized treatment choices. Although several methods for validating prediction models are available, little attention has been given to measuring the performance of predictions of personalized treatment effect. In this article, we propose a range of measures that can be used to this end. We start by defining two dimensions of model accuracy for treatment effects, for a single outcome: discrimination for benefit and calibration for benefit. We then amalgamate these two dimensions into an additional concept, decision accuracy, which quantifies the model's ability to identify patients for whom the benefit from treatment exceeds a given threshold. Subsequently, we propose a series of performance measures related to these dimensions and discuss estimating procedures, focusing on randomized data. Our methods are applicable for continuous or binary outcomes, for any type of prediction model, as long as it uses baseline covariates to predict outcomes under treatment and control. We illustrate all methods using two simulated datasets and a real dataset from a trial in depression. We implement all methods in the R package predieval. Results suggest that the proposed measures can be useful in evaluating and comparing the performance of competing models in predicting individualized treatment effect.
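As a rough illustration of what "calibration for benefit" can mean in randomized data, the sketch below groups patients by predicted treatment benefit and compares the mean predicted benefit in each group with the observed difference in mean outcome between arms. This is a simplified sketch under stated assumptions, not the interface or the estimators of the predieval package: the function name `calibration_for_benefit` and the equal-size grouping are hypothetical choices, and benefit is taken to be the outcome under treatment minus the outcome under control.

```python
from statistics import mean


def calibration_for_benefit(predicted_benefit, treated, outcome, n_groups=2):
    """Compare predicted and observed treatment effect within groups.

    predicted_benefit : predicted outcome under treatment minus under control
    treated           : 1 if the patient was randomized to treatment, else 0
    outcome           : observed outcome (continuous or binary)
    Returns a list of (mean predicted benefit, observed benefit) per group.
    """
    # Order patients from lowest to highest predicted benefit.
    order = sorted(range(len(predicted_benefit)), key=lambda i: predicted_benefit[i])
    size = len(order) // n_groups
    rows = []
    for g in range(n_groups):
        start = g * size
        stop = (g + 1) * size if g < n_groups - 1 else len(order)
        idx = order[start:stop]
        y_treated = [outcome[i] for i in idx if treated[i] == 1]
        y_control = [outcome[i] for i in idx if treated[i] == 0]
        if not y_treated or not y_control:
            continue  # observed effect is undefined without both arms
        observed = mean(y_treated) - mean(y_control)
        rows.append((mean(predicted_benefit[i] for i in idx), observed))
    return rows


predicted = [0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 2.0]
arm = [1, 0, 1, 0, 1, 0, 1, 0]
y = [5, 5, 6, 6, 8, 6, 9, 7]
table = calibration_for_benefit(predicted, arm, y, n_groups=2)
```

Because treatment is randomized, the difference in mean outcome between arms within a group estimates the average treatment effect in that group. A well calibrated model would show observed benefits close to the predicted ones across groups.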
Background: Many children with pulmonary tuberculosis remain undiagnosed and untreated with related high morbidity and mortality. Diagnostic challenges in children include low bacterial burden, challenges around specimen collection, and limited access to diagnostic expertise. Algorithms that guide decisions to initiate tuberculosis treatment at primary healthcare centres in resource-limited settings could help to close the persistent childhood tuberculosis treatment gap. Recent advances in childhood tuberculosis algorithm development have incorporated prediction modelling, but studies conducted to date have been small and localised, with limited generalisability. We assembled individual participant data (IPD) from children being investigated for pulmonary tuberculosis in high-tuberculosis incidence settings, which we leveraged to 1) evaluate the performance of currently used diagnostic algorithms and 2) develop evidence-based algorithms to assist in tuberculosis treatment decision-making for children presenting to primary healthcare settings.
Methods: We collated IPD including clinical, bacteriological, and radiologic information from prospective diagnostic studies in high-tuberculosis incidence settings enrolling children <10 years with presumptive pulmonary tuberculosis. Using this dataset, we first retrospectively evaluated the performance of several existing treatment-decision algorithms. We then developed multivariable prediction models and investigated model generalisability using an internal-external cross-validation framework. A team of experts provided input to adapt the models into scoring systems with pre-determined sensitivity thresholds of 85% to be incorporated into pragmatic treatment-decision algorithms for use in resource-limited, primary healthcare settings.
Findings: Of 4,718 children from 13 studies from 12 countries, 1,811 (38.4%) were classified as having pulmonary tuberculosis; 541 (29.9%) bacteriologically confirmed and 1,270 (70.1%) unconfirmed. Existing treatment-decision algorithms had highly variable diagnostic performance. The scoring system derived from the prediction model that included clinical features and features from chest x-ray had a combined sensitivity of 86% [95% confidence interval (CI): 68-94%] and specificity of 37% [95% CI: 15-66%] against a composite reference standard. The scoring system derived from the model that included only clinical features had a combined sensitivity of 84% [95% CI: 66-93%] and specificity of 30% [95% CI: 13-56%] against a composite reference standard.
Interpretation: We adopted an evidence-based approach to develop pragmatic algorithms to guide tuberculosis treatment decisions in children, irrespective of the resources locally available. This approach will empower health workers in resource-limited, primary healthcare settings to initiate tuberculosis treatment in children in order to improve access to care and reduce tuberculosis-related mortality. These algorithms have been included in the operational handbook accompanying the latest WHO guidelines on the management of tuberculosis in children and adolescents. Future prospective evaluation of algorithms, including those developed in this work, is necessary to investigate clinical performance.
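The notion of a scoring system with a pre-determined sensitivity threshold can be sketched as follows. The example picks the highest score cut-off whose sensitivity still reaches a target of 85% and reports the specificity obtained at that cut-off. It is a minimal illustration on made-up scores, not the procedure or data of the study; the function name `highest_cutoff_reaching_sensitivity` is hypothetical.

```python
def highest_cutoff_reaching_sensitivity(scores, has_disease, target=0.85):
    """Return (cutoff, sensitivity, specificity) for the highest cut-off
    at which 'score >= cutoff' classifies at least `target` of cases as positive."""
    cases = [s for s, d in zip(scores, has_disease) if d]
    non_cases = [s for s, d in zip(scores, has_disease) if not d]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")

    # Try cut-offs from strictest to most lenient; sensitivity can only increase.
    for cutoff in sorted(set(scores), reverse=True):
        sensitivity = sum(s >= cutoff for s in cases) / len(cases)
        if sensitivity >= target:
            specificity = sum(s < cutoff for s in non_cases) / len(non_cases)
            return cutoff, sensitivity, specificity
    return None  # not reached in practice: the lowest cut-off gives sensitivity 1


# Made-up scores: ten children with tuberculosis, five without.
scores = [2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 6]
has_disease = [True] * 10 + [False] * 5
cutoff, sens, spec = highest_cutoff_reaching_sensitivity(scores, has_disease)
```

Fixing sensitivity first and accepting whatever specificity follows reflects a setting in which missing a child with tuberculosis is considered more harmful than treating a child without it.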
The increasing availability of large combined datasets (or big data), such as those from electronic health records and from individual participant data meta-analyses, provides new opportunities and challenges for researchers developing and validating (including updating) prediction models. These datasets typically include individuals from multiple clusters (such as multiple centres, geographical locations, or different studies). Accounting for clustering is important to avoid misleading conclusions and enables researchers to explore heterogeneity in prediction model performance across multiple centres, regions, or countries, to better tailor or match them to these different clusters, and thus to develop prediction models that are more generalisable. However, this requires prediction model researchers to adopt more specific design, analysis, and reporting methods than standard prediction model studies that do not have any inherent substantial clustering. Therefore, prediction model studies based on clustered data need to be reported differently so that readers can appraise the study methods and findings, further increasing the use and implementation of such prediction models developed or validated from clustered datasets.
A common problem in the analysis of multiple data sources, including individual participant data meta-analysis (IPD-MA), is the misclassification of binary variables. Misclassification may lead to biased estimators of model parameters, even when the misclassification is entirely random. We aimed to develop statistical methods that facilitate unbiased estimation of adjusted and unadjusted exposure-outcome associations and between-study heterogeneity in IPD-MA, where the extent and nature of exposure misclassification may vary across studies.
We present Bayesian methods that allow misclassification of binary exposure variables to depend on study- and participant-level characteristics. In an example of the differential diagnosis of dengue using two variables, where the gold standard measurement for the exposure variable was unavailable for some studies which only measured a surrogate prone to misclassification, our methods yielded more accurate estimates than analyses naive with regard to misclassification or based on gold standard measurements alone. In a simulation study, the evaluated misclassification model yielded valid estimates of the exposure-outcome association, and was more accurate than analyses restricted to gold standard measurements.
Our proposed framework can appropriately account for the presence of binary exposure misclassification in IPD-MA. It requires that some studies supply IPD for the surrogate and gold standard exposure, and allows misclassification to follow a random effects distribution across studies conditional on observed covariates (and outcome). The proposed methods are most beneficial when few large studies that measured the gold standard are available, and when misclassification is frequent.
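The statement that even entirely random misclassification can bias estimates can be checked with simple arithmetic. The sketch below takes a hypothetical 2x2 table with a true odds ratio of 4, applies the same sensitivity and specificity to the exposure measurement in cases and controls (non-differential misclassification), and computes the odds ratio expected from the misclassified data. All numbers and function names are illustrative assumptions; this is not the Bayesian model proposed in the publication.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio for exposure comparing cases with controls."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


def expected_observed_counts(exposed, unexposed, sensitivity, specificity):
    """Expected counts classified as exposed/unexposed by an imperfect measurement."""
    observed_exposed = sensitivity * exposed + (1 - specificity) * unexposed
    observed_unexposed = (1 - sensitivity) * exposed + specificity * unexposed
    return observed_exposed, observed_unexposed


# Hypothetical true counts.
true_or = odds_ratio(200, 100, 100, 200)

# Same error rates in cases and controls: misclassification unrelated to outcome.
cases_exp, cases_unexp = expected_observed_counts(200, 100, sensitivity=0.8, specificity=0.9)
ctrl_exp, ctrl_unexp = expected_observed_counts(100, 200, sensitivity=0.8, specificity=0.9)
observed_or = odds_ratio(cases_exp, cases_unexp, ctrl_exp, ctrl_unexp)
```

In this example the expected observed odds ratio is about 2.6 rather than 4, so the association is attenuated even though the measurement error is unrelated to the outcome.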
Objective: To externally validate various prognostic models and scoring rules for predicting short term mortality in patients admitted to hospital for covid-19.
Design: Two stage individual participant data meta-analysis.
Setting: Secondary and tertiary care.
Participants: 46 914 patients across 18 countries, admitted to a hospital with polymerase chain reaction confirmed covid-19 from November 2019 to April 2021.
Data sources: Multiple (clustered) cohorts in Brazil, Belgium, China, Czech Republic, Egypt, France, Iran, Israel, Italy, Mexico, Netherlands, Portugal, Russia, Saudi Arabia, Spain, Sweden, United Kingdom, and United States previously identified by a living systematic review of covid-19 prediction models published in The BMJ, and through PROSPERO, reference checking, and expert knowledge.
Model selection and eligibility criteria: Prognostic models identified by the living systematic review and through contacting experts. Models were excluded a priori if they had a high risk of bias in the participant domain of PROBAST (prediction model study risk of bias assessment tool) or if their applicability was deemed poor.
Methods: Eight prognostic models with diverse predictors were identified and validated. A two stage individual participant data meta-analysis was performed of the estimated model concordance (C) statistic, calibration slope, calibration-in-the-large, and observed to expected ratio (O:E) across the included clusters.
Main outcome measures: 30 day mortality or in-hospital mortality.
Results: Datasets included 27 clusters from 18 different countries and contained data on 46 914 patients. The pooled estimates ranged from 0.67 to 0.80 (C statistic), 0.22 to 1.22 (calibration slope), and 0.18 to 2.59 (O:E ratio) and were prone to substantial between study heterogeneity. The 4C Mortality Score by Knight et al (pooled C statistic 0.80, 95% confidence interval 0.75 to 0.84, 95% prediction interval 0.72 to 0.86) and clinical model by Wang et al (0.77, 0.73 to 0.80, 0.63 to 0.87) had the highest discriminative ability. On average, 29% fewer deaths were observed than predicted by the 4C Mortality Score (pooled O:E 0.71, 95% confidence interval 0.45 to 1.11, 95% prediction interval 0.21 to 2.39), 35% fewer than predicted by the Wang clinical model (0.65, 0.52 to 0.82, 0.23 to 1.89), and 4% fewer than predicted by Xie et al's model (0.96, 0.59 to 1.55, 0.21 to 4.28).
Conclusion: The prognostic value of the included models varied greatly between the data sources. Although the Knight 4C Mortality Score and Wang clinical model appeared most promising, recalibration (intercept and slope updates) is needed before implementation in routine care.
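The reported percentages follow directly from the pooled observed to expected ratios: an O:E of 0.71 means that observed deaths were 71% of those predicted, that is, 29% fewer. The sketch below reproduces that arithmetic and, as an illustration of the second stage of a two stage meta-analysis, pools cluster-level O:E ratios on the log scale with a random effects model. The DerSimonian-Laird estimator is used here only because it is compact; the abstract does not state which estimator the authors used, and the cluster values in the example are invented.

```python
import math


def percent_fewer_than_predicted(oe_ratio):
    """Percentage by which observed events fall short of predicted events."""
    return (1 - oe_ratio) * 100


def pool_log_oe(oe_ratios, log_variances):
    """Random effects pooling of O:E ratios on the log scale (DerSimonian-Laird).

    oe_ratios     : cluster-specific O:E ratios
    log_variances : variances of the cluster-specific log(O:E) estimates
    Returns (pooled O:E, between-cluster variance tau^2 on the log scale).
    """
    y = [math.log(oe) for oe in oe_ratios]
    w = [1 / v for v in log_variances]
    fixed_mean = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0
    w_random = [1 / (v + tau2) for v in log_variances]
    pooled_log = sum(wi * yi for wi, yi in zip(w_random, y)) / sum(w_random)
    return math.exp(pooled_log), tau2


# Pooled O:E ratios reported in the abstract.
reported = {"4C Mortality Score": 0.71, "Wang clinical model": 0.65, "Xie model": 0.96}
fewer = {name: round(percent_fewer_than_predicted(oe)) for name, oe in reported.items()}

# Invented cluster-level values, for illustration only.
pooled_oe, tau2 = pool_log_oe([0.5, 1.0, 2.0], [0.01, 0.01, 0.01])
```

Pooling on the log scale treats an O:E of 0.5 and an O:E of 2.0 as equally far from perfect calibration, which is why the invented example pools to a ratio close to 1 despite large heterogeneity between clusters.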