“Big data” and precision prevention: opportunities and challenges for biostatisticians

Giovanni Veronesi1 and Antonella Zambon2

1. Research Center in Epidemiology and Preventive Medicine, Department of Medicine and Surgery, University of Insubria, Varese, Italy
2. Department of Statistics and Quantitative Methods, University of Milano Bicocca, Milano, Italy


Non-communicable diseases (NCDs) represent the major global health challenge of the 21st century. Although NCDs are largely preventable, or their onset can be delayed until late in life, 76% of Italian healthcare costs in 2016 were allocated to disease management and only 4% to prevention. Whether this healthcare model will remain affordable in the near future, given population ageing, is questionable. “Precision public health”, or “precision prevention”, has been defined as the possibility to provide the right intervention to the right population at the right time, before disease manifestation. One example is the identification of high-risk sub-populations (e.g., frail individuals, those with low socioeconomic status, or those with genetic susceptibility such as a family history of disease) to which personalized interventions can be delivered. To date, preventive strategies are designed for the “average individual” in a population, although standard recommendations may not be beneficial, or may even be harmful, in specific subgroups. The availability of large epidemiological studies, healthcare databases and biobanks is crucial to the development of evidence-based recommendations that apply to subsets of the population, and it may eventually fuel the paradigm shift from disease management to prevention.
From a methodological viewpoint, the use of “big data” for precision prevention poses new challenges for biostatisticians. First, data can be prone to missingness not at random or, in more epidemiological terms, selection bias. An example is provided by studies linking physical activity and sleep to cardiovascular health using data from mobile health technologies. In this respect, the empirical calibration of p-values is a promising method to take such bias into account. Second, when dealing with large datasets, there is a need to distinguish between a highly precise finding and a finding of potential clinical or interventional importance, which deserves a “call for action” in the preventive agenda. At present, the metrics commonly adopted in the preventive setting to move from association to individual outcome prediction suffer from major limitations. Finally, replicability is an issue in the preventive field: for instance, the external validation of individual risk scores is seldom performed. Analytic approaches to big data, often based upon data-driven techniques, may further undermine replicability. Therefore, careful planning of external validation is required.
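To give a concrete flavour of the first point, the sketch below illustrates one possible form of empirical p-value calibration, in the spirit of the negative-control approach of Schuemie and colleagues: estimates from exposure-outcome pairs with a known null effect are used to estimate an empirical null distribution that absorbs systematic error, and the study p-value is then recomputed against that distribution. This is an illustration only, not the analysis described in the abstract; the data are simulated and all function names are hypothetical.

```python
"""Minimal sketch of empirical calibration of p-values with negative controls.
Simulated data; illustrative only."""
import numpy as np
from scipy import optimize, stats


def fit_empirical_null(log_estimates, std_errors):
    """Fit the systematic-error distribution N(mu, sigma^2) by maximum
    likelihood, treating each negative-control estimate as drawn from
    N(mu, sigma^2 + se_i^2)."""
    log_estimates = np.asarray(log_estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)

    def neg_log_lik(params):
        mu, log_sigma = params
        var = np.exp(log_sigma) ** 2 + std_errors ** 2
        return 0.5 * np.sum(np.log(2 * np.pi * var)
                            + (log_estimates - mu) ** 2 / var)

    res = optimize.minimize(neg_log_lik, x0=[0.0, np.log(0.1)],
                            method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])  # mu_hat, sigma_hat


def calibrated_p_value(log_estimate, std_error, mu, sigma):
    """Two-sided p-value against the empirical null, which reflects both
    random error (std_error) and estimated systematic error (sigma)."""
    z = (log_estimate - mu) / np.sqrt(sigma ** 2 + std_error ** 2)
    return 2 * stats.norm.sf(abs(z))


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Simulated negative controls: true log(HR) = 0, but estimates are
    # shifted by systematic error (e.g., selection bias in app-based cohorts).
    n_controls = 50
    se = rng.uniform(0.05, 0.20, size=n_controls)
    bias = rng.normal(loc=0.10, scale=0.10, size=n_controls)
    log_hr_controls = bias + rng.normal(scale=se)

    mu, sigma = fit_empirical_null(log_hr_controls, se)

    # Hypothetical study estimate: HR = 1.20 with SE(log HR) = 0.05.
    log_hr, se_hr = np.log(1.20), 0.05
    p_naive = 2 * stats.norm.sf(abs(log_hr / se_hr))
    p_calibrated = calibrated_p_value(log_hr, se_hr, mu, sigma)
    print(f"traditional p = {p_naive:.4f}, calibrated p = {p_calibrated:.4f}")
```

In this toy example the calibrated p-value is larger than the traditional one, because part of the apparent signal is attributed to the same systematic error observed in the negative controls.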