Clinical data are rarely as clean, compact or convenient as the examples we often use when teaching statistics or R. Real hospital datasets are usually distributed across several tables, include missing values, contain repeated structures, and require careful documentation before they can be reused.
The new R package DIVINE is interesting precisely because it brings that reality into the R ecosystem in an accessible way. Available on CRAN, DIVINE provides a curated collection of datasets from a multicentre cohort of hospitalized COVID-19 patients in the south metropolitan area of Barcelona. The package is accompanied by a recent publication in Scientific Data, which describes the database, its structure, the data collection process and its potential reuse for clinical epidemiology, teaching and methodological research.
A clinical dataset packaged for R
The package includes 14 datasets covering different clinical domains, such as demographics, comorbidities, symptoms, vital signs, severity scores, ICU information, treatments, complications, vaccination and end-of-follow-up data.
This multi-table, relational structure is one of the most valuable aspects of the package. Instead of providing a single pre-merged analysis file, DIVINE preserves the logic of a real clinical database, where information is distributed across several linked tables. This makes it especially useful for applied teaching and for demonstrating realistic data-management workflows in R.
For example, to install the package and list the datasets it ships with:
install.packages("DIVINE")  # install from CRAN (only needed once)
library(DIVINE)
data(package = "DIVINE")    # list the datasets shipped with the package
The datasets can then be loaded in the usual way:
data("demographic")
data("vital_signs")
data("scores")
The common identifiers allow users to combine information across tables and build analysis datasets depending on the research question.
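To make the idea of combining linked tables concrete, here is a minimal base R sketch. It does not use DIVINE itself: the three toy tables, their values, and the `temperature` and `severity_score` columns are invented for illustration, and only the key names are taken from the workflow shown in this post. It is one way to chain left joins, not a description of how the package does it internally.

```r
# Toy tables that mimic a linked-table layout; all values are made up.
demographic_toy <- data.frame(
  record_id  = c(1, 2, 3),
  covid_wave = c("Wave 1", "Wave 1", "Wave 2"),
  center     = c("A", "B", "A"),
  age        = c(64, 71, 58)
)
vital_signs_toy <- data.frame(
  record_id   = c(1, 2),
  covid_wave  = c("Wave 1", "Wave 1"),
  center      = c("A", "B"),
  temperature = c(37.8, 38.4)   # hypothetical column name
)
scores_toy <- data.frame(
  record_id      = c(1, 3),
  covid_wave     = c("Wave 1", "Wave 2"),
  center         = c("A", "A"),
  severity_score = c(2, 4)      # hypothetical column name
)

# Identifiers shared by every table.
keys <- c("record_id", "covid_wave", "center")

# Left-join each table onto the running result, keeping every demographic row.
analysis_toy <- Reduce(
  function(left, right) merge(left, right, by = keys, all.x = TRUE),
  list(demographic_toy, vital_signs_toy, scores_toy)
)

# With unique keys, a left join should not change the number of patients.
stopifnot(nrow(analysis_toy) == nrow(demographic_toy))
print(analysis_toy)
```

Patients without a matching row in a secondary table keep their demographic data and receive `NA` in the added columns, which is usually what you want when building a baseline analysis dataset.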
More than a data package
Although the datasets are the main contribution, DIVINE also includes helper functions for common epidemiological data workflows. These include:
data_overview()
multi_join()
stats_table()
multi_plot()
impute_missing()
export_data()
These functions are not intended to replace the broader R ecosystem, but they make the package easier to use in teaching, exploratory analysis and reproducible examples.
A minimal workflow might look like this:
library(DIVINE)

data("demographic")
data("vital_signs")
data("scores")

baseline <- multi_join(
  list(demographic, vital_signs, scores),
  key = c("record_id", "covid_wave", "center"),
  join_type = "left"
)

data_overview(baseline)

stats_table(
  baseline,
  vars = c("age", "sex"),
  by = "covid_wave",
  statistic_type = "median_iqr",
  pvalue = TRUE
)
This example already illustrates several important aspects of clinical data analysis: understanding table structure, joining related datasets, checking variables, and producing descriptive summaries.
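For readers who want to see what a grouped median and interquartile range summary involves, the following base R sketch computes one by hand. It is an illustration under stated assumptions rather than the package's own method: the toy data frame and the helper `median_iqr()` are hypothetical, and the quartiles use R's default `quantile()` definition.

```r
# Toy data; values are invented for illustration only.
toy <- data.frame(
  covid_wave = c("Wave 1", "Wave 1", "Wave 1", "Wave 2", "Wave 2", "Wave 2"),
  age        = c(50, 60, 70, 40, NA, 80)
)

# Hypothetical helper: quartiles and median of a numeric vector,
# ignoring missing values but reporting how many there were.
median_iqr <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  c(q1 = q[1], median = q[2], q3 = q[3], n_missing = sum(is.na(x)))
}

# One row per wave, one column per statistic.
summary_by_wave <- t(sapply(split(toy$age, toy$covid_wave), median_iqr))
print(summary_by_wave)
```

Reporting the number of missing values alongside each summary is a small habit that pays off with clinical data, where variable availability often differs between groups.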
Why it is useful for R users
For a specialised R audience, the value of DIVINE is not only that it provides COVID-19 data. Its main interest is that it offers a realistic, documented and reusable clinical database within a familiar R workflow.
The package may be useful for:
- teaching data management with relational clinical datasets;
- preparing examples for biostatistics or epidemiology courses;
- demonstrating descriptive clinical analyses;
- exploring missing data and variable availability;
- developing prognostic modelling examples;
- validating prediction models;
- creating reproducible workflows using real-world health data.
This makes DIVINE particularly attractive for applied biostatisticians, epidemiologists, clinical researchers and R instructors who want to move beyond toy datasets.