Better predictions using big data sets

Thomas Debray

Clinical prediction models are an important tool in contemporary medical decision making and are abundant in the medical literature. Prediction models estimate the probability (risk) that a certain condition is present or will occur in the future by combining information from multiple variables (predictors) measured in an individual, e.g. predictors from patient history, physical examination or medical testing. They are used to decide on referral of patients for further testing, to guide lifestyle or therapeutic decisions, and to risk-stratify participants in therapeutic clinical trials.

Probability estimates provided by prediction models should be sufficiently accurate; otherwise, incorrect management decisions may be made, leading to suboptimal outcomes for individuals and unnecessary health care costs. Unfortunately, many prediction models perform much worse in practice than anticipated during their development. A major reason for this unsatisfactory accuracy, and for their limited use in clinical practice, is that models are typically developed from relatively small datasets and subsequently applied in populations or settings that differ too much from the original development population or setting, without proper validation and adaptation to the new situation.

To improve the accuracy and generalizability of prediction models, their development and subsequent validation should be based on larger datasets. This strategy is becoming increasingly feasible through the sharing of research data. Currently, however, there is a lack of statistical approaches to properly develop, validate and adapt prediction models when predictor effects vary across individuals due to differences in predictor/outcome burden or in measurement techniques across studies, populations, settings or time periods.



Assessment of heterogeneity in an individual participant data meta-analysis of prediction models: An overview and illustration. Stat Med 2019.

Evidence synthesis in prognosis research. Diagnostic and Prognostic Research 2019.

Development and validation of a novel prediction model to identify patients in need of specialized trauma care during field triage: design and rationale of the GOAT study. Diagnostic and Prognostic Research 2019.

Understanding the relation between Zika virus infection during pregnancy and adverse fetal, infant and child outcomes: a protocol for a systematic review and individual participant data meta-analysis of longitudinal studies of pregnant women and their infants and children. BMJ Open 2019.

Performance of the Framingham risk models and pooled cohort equations for predicting 10-year risk of cardiovascular disease: a systematic review and meta-analysis. BMC Medicine 2019.

Empirical evidence of the impact of study characteristics on the performance of prediction models: a meta-epidemiological study. BMJ Open 2019.

A guide to systematic review and meta-analysis of prognostic factor studies. BMJ 2019.

The use of Prognostic Scores for Causal Inference with General Treatment Regimes. Stat Med 2019.

Cardiovascular risk prediction models for women in the general population: A systematic review. PLoS One 2019.

A framework for meta-analysis of prediction model studies with binary and time-to-event outcomes. Stat Methods Med Res 2018.

Prediction models for delayed graft function: external validation on The Dutch Prospective Renal Transplantation Registry. Nephrology Dialysis Transplantation 2018.

Multiple imputation for multilevel data with continuous and binary variables. Stat Sci 2018.

Prediction of personalised prognosis in patients with amyotrophic lateral sclerosis: development and validation of a prediction model. Lancet Neurology 2018.

The development of CHAMP: a checklist for the appraisal of moderators and predictors. BMC Med Res Methodol 2017.

Detecting small-study effects and funnel plot asymmetry in meta-analysis of survival data: a comparison of new and existing tests. Res Synth Methods 2018.

Meta-analysis of prediction model performance across multiple studies: Which scale helps ensure between-study normality for the C-statistic and calibration measures? Stat Methods Med Res 2017.

Reporting of Bayesian analysis in epidemiologic research should become more transparent. J Clin Epidemiol 2017.

Predictive performance of the CHA2DS2-VASc rule in atrial fibrillation: a systematic review and meta-analysis. J Thromb Haemost 2017.

A guide to systematic review and meta-analysis of prediction model performance. BMJ 2017.

The life expectancy of Stephen Hawking, according to the ENCALS model. Lancet Neurology 2018.

The Netherlands Organisation for Health Research and Development

Source of Funding

The Netherlands Organisation for Scientific Research supports a strong system of sciences in the Netherlands by encouraging quality and innovation in science. Our conviction is that scientific research contributes to our prosperity and well-being and that it provides for our growing need for knowledge: for facing societal challenges, for economic development and to better understand ourselves and the world.
