DICOM Parsing with R

Abstract

This blog post describes how to parse medically relevant non-image meta information from DICOM files using the programming language R. The result of the parsing process is an R data frame in the form of a name–value table that is both easy to handle and flexible.

We first describe the general structure of DICOM files and the kind of information they contain. We then explain our DicomParseR module in R in detail. The package was developed for practical DICOM parsing from our hospital’s cardiac magnetic resonance (CMR) modalities in order to populate a scientific database. The given examples therefore refer to CMR information parsing; due to its generic nature, however, DicomParseR may be used to parse information from any type of DICOM file.

The following graph illustrates the use of DicomParseR in our use case as an example:

Structure of CMR DICOM files

At the top level, a DICOM file generated by a CMR modality consists of a header (hdr) section and the image (img) information. In between, an XML part can be found.

The hdr section mainly contains baseline information about the patient, including name, birth date and system ID. It also contains contextual information about the observation, such as date and time, ID of the modality and the observation protocol.

The XML section contains quantified information generated by the modality’s embedded AI, e.g. regarding myocardial blood flow (MBF). All information is stored between specifically named sub-tags. These tags serve as anchors when searching for specific information. For further information on DICOM files, please refer to dicomstandard.org.

The heterogeneous structure of DICOM files described above requires distinct submodules to compose a technically harmonized name–value table: the information from the XML section is extended by information from the hdr section. The key benefit of our DicomParseR module is its ability to parse these syntactically distinct sections, as described in the following.

Technical Approach

To extract information from the sub-tags in the XML section and any additional relevant meta information from the hdr section of the DICOM file, the following steps are performed:

    1. Check if a DICOM file contains desired XML tag
    2. If the desired tag is present, extract and transform baseline information from hdr part
    3. If step 1 applied, extract and transform desired tag information from XML part
    4. Combine the two sets of information into an integrated R data frame
    5. Write the data frame into a suitable database for future scientific analysis

These steps are explained in detail below.

Step 1: Check if a DICOM file contains the desired XML tag

At the beginning of processing, DicomParseR checks whether a certain tag is present in the DICOM file, in our case <ismrmrdMeta>. If that tag exists, quantified medical information is stored there. Please refer to ISMRMRD for further information about the ismrmrd data format.

For this purpose, DicomParseR offers the function file_has_content(), which expects the file and a search tag as parameters. The function uses base::readLines() to read in the file and stringr::str_detect() to check whether the given tag occurs in it. Performance tests with the package microbenchmark have shown stringr’s excellent processing speed in this context. If the given tag is found, TRUE is returned, otherwise FALSE.

Any surrounding application may hence call

if (DicomParseR::file_has_content(file, "ismrmrdMeta")) {…}

to only continue parsing DICOM files that contain the desired tag.
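For illustration, a minimal sketch of such a check could look as follows. This is not the exact DicomParseR implementation; the UTF-16LE encoding mirrors the extraction step shown in step 3:

file_has_content <- function(file, tag) {
  # read the raw file content and test whether any line contains the tag
  file_content <- readLines(file, encoding = "UTF-16LE", skipNul = TRUE)
  any(stringr::str_detect(file_content, tag))
}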

It is important to note that the information generated by the CMR modality is actually not a single DICOM file but a composition of a multitude of files. These files may or may not contain the desired XML tag(s). If step 1 were omitted, our parsing module would import many more files than necessary.
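A surrounding script might therefore pre-filter the files of one observation before parsing them; a minimal sketch, in which the directory path is only a placeholder:

# Hypothetical pre-filtering of an observation's files; the path is a placeholder.
files <- list.files("path/to/observation", full.names = TRUE)
relevant <- files[vapply(files, DicomParseR::file_has_content,
                         logical(1), tag = "ismrmrdMeta")]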

Step 2: Extract and transform baseline hdr information

Step 2 extracts the hdr information from the file. For this purpose, DicomParseR uses the function readDICOMFile() provided by the package oro.dicom. By calling

oro.dicom::readDICOMFile(dicom_file)[["hdr"]]

the XML and image parts are discarded. The hdr section contains information such as the patient’s name, sex and birth date as well as meta information about the observation, such as date, time and contrast bolus. In this step, DicomParseR saves the hdr part as a data frame (called df_hdr in the following) and later appends it to the data frame generated in the next step.

Note that the oro.dicom package provides functionality to extract header and image data from a DICOM file, as shown in the code snippet. However, it does not provide an out-of-the-box solution to extract the XML section and return it as an R data frame. For this purpose, DicomParseR wraps the required extra functionality around existing packages for DICOM processing.
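For illustration, a wrapper around this call could look like the following sketch; the function name is hypothetical and the column selection assumes oro.dicom’s standard hdr data frame with name and value columns:

extract_hdr_from_file <- function(dicom_file) {
  # keep only the header table, dropping the image part
  hdr <- oro.dicom::readDICOMFile(dicom_file)[["hdr"]]
  # reduce the full header table to the name-value pairs used later as df_hdr
  data.frame(name = hdr$name, value = hdr$value, stringsAsFactors = FALSE)
}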

Step 3: Extract and transform information from XML part

In this step, the data within the provided XML tag is extracted and transformed into a data frame.

The following snippet shows an example of how myocardial blood flow numbers are stored in the respective DICOM files (values modified for data privacy):

<ismrmrdMeta>
  …
  <meta>
    <name>GADGETRON_FLOW_ENDO_S_1</name>
    <value>1.95</value>
    <value>0.37</value>
    <value>1.29</value>
    <value>3.72</value>
    <value>1.89</value>
    <value>182</value>
  </meta>
  …
</ismrmrdMeta>

Within each meta tag, “name” specifies the context of the observation and “value” stores the myocardial blood flow data. The different data points between the value tags correspond to different descriptive metrics, such as mean, median, minimum and maximum values. Other meta tags may be structured differently. In order to stay flexible, the final extraction of a concrete value is done in the last step of data processing, see step 5.

Now, to extract and transform the desired information from the DICOM file, DicomParseR first uses its function extract_xml_from_file() for extraction and subsequently convert_xml2df() for transformation. With

extract_xml_from_file <- function(file, tag) {
  # read the raw file content; skipNul avoids problems with embedded NUL bytes
  file_content <- readLines(file, encoding = "UTF-16LE", skipNul = TRUE)
  # positions of the lines containing the opening and closing tag
  indices <- grep(tag, file_content)
  # everything between (and including) the two tag lines is the XML section
  xml_part <- paste(file_content[indices[[1]]:indices[[2]]], collapse = "\n")
  return(xml_part)
}

and “ismrmrdMeta” as tag, the function returns a string in XML structure. That string is then converted into an R data frame in the form of a name–value table by convert_xml2df(). Based on the example above, the resulting data frame looks like this:

name                     value  [index]
GADGETRON_FLOW_ENDO_S_1  1.95   [1]
GADGETRON_FLOW_ENDO_S_1  0.37   [2]

This data frame is called df_ismrmrdMeta in the following. A specific value can be accessed by the combination of name and index, see the example in step 5.
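The conversion itself could look roughly like the following sketch, under the assumption that the xml2 package is used; this is not the actual DicomParseR implementation:

convert_xml2df <- function(xml_string) {
  doc <- xml2::read_xml(xml_string)
  meta_nodes <- xml2::xml_find_all(doc, ".//meta")
  # one row per <value>, indexed by its position within the <meta> tag
  do.call(rbind, lapply(meta_nodes, function(node) {
    name   <- xml2::xml_text(xml2::xml_find_first(node, "./name"))
    values <- xml2::xml_text(xml2::xml_find_all(node, "./value"))
    data.frame(name = name, value = values, index = seq_along(values),
               stringsAsFactors = FALSE)
  }))
}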

Step 4: Integrate hdr and XML data frames

At this point in time, two data frames have resulted from processing the original DICOM file: df_hdr and df_ismrmrdMeta.

In this step, those two data frames are combined into one single data frame called df_filtered. This is done by using base::rbind().

For example, executing

df_filtered <- rbind(c("Pat_Weight", df_hdr$value[df_hdr$name=="PatientsWeight"][1]), df_ismrmrdMeta)

will extend the data frame df_ismrmrdMeta by the patient’s weight. The result is the target data frame df_filtered. Like df_ismrmrdMeta, df_filtered is a name–value table. This design was chosen in order to stay as flexible as possible for subsequent data analysis.

Step 5: Populate scientific database

The data frame df_filtered contains all relevant information from the DICOM file as a name–value table. In this final step, df_filtered may be split again as required to match the use-case-specific schema of the scientific database.

For example, in our use case, the table “cmr_data” in the scientific database is dedicated to persisting MBF values. An external program (in this case, an R Shiny application providing a GUI for end-user interaction) calls the function transform_input_to_cmr_data() to generate a data frame in the format of the “cmr_data” table. By calling

transform_input_to_cmr_data <- function(df) {
  mbf_endo_s1 = as.double(df$value[df$name=="GADGETRON_FLOW_ENDO_S_1"][1])
  mbf_endo_s2 = ...
}

with df_filtered as parameter, the mean MBF values of the heart segments are extracted and can then be sent to the database. A further sub-step would be to call transform_input_to_baseline_data() to persist the baseline information in the database.
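To complete the picture, the final write into the scientific database could look like the following sketch; the database driver, connection details and the return value of transform_input_to_cmr_data() are assumptions and not part of DicomParseR:

# Hypothetical write of the transformed data; driver, connection details and
# table layout are assumptions.
con    <- DBI::dbConnect(RPostgres::Postgres(), dbname = "science_db")
df_cmr <- transform_input_to_cmr_data(df_filtered)  # assumed to return a data frame
DBI::dbWriteTable(con, "cmr_data", df_cmr, append = TRUE)
DBI::dbDisconnect(con)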

Summary and Outlook

This blog post has described how DICOM files from CMR observations can be processed with R in order to extract quantified myocardial blood flow values for scientific analysis. Apart from R, approaches by other institutes have been discussed publicly as well, e.g. using MATLAB. Interested readers may refer to the NIH, among others, for further information.

The chosen approach tries to respect both the properties of DICOM files, that is, their heterogeneous inner structure, the different types of information and their file size, and the specific data requirements of our institute’s use cases. With a single R data frame in the form of a name–value table, the result of the process is easy to handle for further data analysis. At the same time, due to its flexible setup, DicomParseR may serve as a module in any kind of DICOM-related use case.

Thomas Schröder
Centrum für medizinische Datenintegration BHC Stuttgart

Benchmarking API Performance: R-Native and Plumber in Data Extraction and Sending

R, a language best known for its prowess in statistical analysis and data science, might not be the first choice that comes to mind when thinking about building APIs. However, rapid prototyping, scalability, seamless integration with data analysis, and ease of debugging are reasons for me to encapsulate API functionality within R packages. In doing so, I like to distinguish between two approaches:



The R-native approach (blue): API interaction happens through functions of the relevant R packages, which are installed directly into the R session.

The Plumber approach (green): the Plumber package allows R code to be exposed as a RESTful web service through special decorator comments. As a result, API functionality is accessed by sending REST calls (GET, POST, …) rather than by calling functions of R packages directly.

#* This endpoint is a dummy for loading data
#* @get /load_from_db
#* @serializer unboxedJSON
#* @param row_limit :int
#* @response 200 OK with result
#* @response 201 OK without result
#* @response 401 Client Error: Invalid Request - Class
#* @response 402 Client Error: Invalid Request - Type
#* @response 403 Client Error: Missing or invalid parameter
#* @response 500 Server Error: Unspecified Server error
#* @response 501 Server Error: Database writing failed
#* @response 502 Server Error: Database reading failed
#* @tag demo
function(res, row_limit = 10000) {
  # load data from db (dummy: an empty tibble stands in for the real query)
  data <- dplyr::tibble()
  res$status <- 200
  res$body <- as.list(data)
  return(res$body)
}
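A client then interacts with such an endpoint via HTTP rather than through a direct function call. In the following sketch, host, port and path are assumptions that depend on how the API is deployed:

# Hypothetical client-side call to the endpoint above.
resp <- httr::GET("http://localhost:8000/load_from_db",
                  query = list(row_limit = 1000))
data <- jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))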

Which approach to use?

Certainly, both approaches have their strengths and limitations. It should be no surprise that, in terms of execution time, CPU utilization and format consistency, the R-native approach is likely to be the first choice, as code and data are processed within one context. Furthermore, this approach offers flexibility for complex data manipulations, but it can be challenging when it comes to maintenance, especially propagating new releases of packages to all relevant processes and managing credentials. To the best of my knowledge, there is no automated way of re-installing new releases directly into all related R packages, even when using the POSIT package manager, so this easily becomes tedious.

In contrast, the Plumber API encourages a modular design that enhances code organization and facilitates integration with a wide array of platforms and systems. It streamlines package updates while ensuring a consistent interface: interacting with a Plumber API remains separate from the underlying code logic provided by the endpoint. This not only improves version management but also introduces a clear separation between client and server. In general, decoupling functionality through a RESTful API makes it easier to divide tasks between separate development teams and thus offers a higher degree of flexibility and external support. Additionally, I found distributing a Plumber API notably more straightforward than handing over a raw R package.

The primary goal of this blog post is to quantify the performance difference between the two approaches when it comes to getting data in and out of a database. Such a benchmark can be particularly valuable for ETL (Extract, Transform, Load) processes, shedding light on the threshold at which the advantages of the Plumber approach cease to justify its constraints. In doing so, we hope to provide guidance to developers who face the decision of whether it makes sense to provide or access R functionality via Plumber APIs.

Experimental Setup
The experimental setup comprised a virtual machine equipped with 64 GB RAM and an Intel(R) Xeon(R) Gold 6152 CPU clocked at 2.1 GHz with 8 cores, running Ubuntu 22.04 LTS, hosting the POSIT Workbench and Connect servers (the latter for hosting the Plumber API), and using R version 4.2.1. Both POSIT services were granted identical access permissions to the virtual machine’s computational resources.

Both approaches are evaluated in terms of execution time, measured simply with system.time(), and the maximum observed CPU load, the latter being an especially important indicator of how much data can be extracted and sent at once. For each fixed number of data rows, ranging from 10^4 to 10^7, 10 trials were conducted and the results plotted as a jittered beeswarm plot. To assess the CPU load during the benchmark, I built a separate function that returns a new session object, within which the output of NCmisc::top(CPU = FALSE) is recorded to a file every 10 seconds.

get_cpu_load <- function(interval = 10, root, name, nrow) {
  # spawn a background R session so monitoring does not block the benchmark
  rs_cpu <- callr::r_session$new()
  rs_cpu$call(function(nrow, root, name, interval) {
    # number the log file so that repeated trials do not overwrite each other
    files <- list.files(root)
    n_files <- sum(stringr::str_detect(files, sprintf("%s_%s_", name, format(nrow, scientific = FALSE))))
    l <- c()
    while (TRUE) {
      # query system usage via top and record the reported memory consumption
      ret <- NCmisc::top(CPU = FALSE)
      l <- c(l, ret$RAM$used * 1000000)
      save(l, file = sprintf("%s/%s_%s_%s.rda", root, name, format(nrow, scientific = FALSE), n_files + 1))
      Sys.sleep(interval)
    }
  }, args = list(nrow, root, name, interval))
  return(rs_cpu)
}
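A hedged usage sketch of this helper; the log directory and the call being measured are placeholders, not part of the actual benchmark code:

# Hypothetical benchmark trial: start the monitoring session, run the call to
# be measured, then stop the background session again.
dir.create("benchmark_logs", showWarnings = FALSE)
rs_cpu <- get_cpu_load(interval = 10, root = "benchmark_logs",
                       name = "extract", nrow = 1000000)
elapsed <- system.time({
  # ... run the data extraction for 1,000,000 rows here ...
})
rs_cpu$close()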


Result
Execution time: In the following figure A), the data extraction process is observed to be approximately 10 times slower when utilizing the Plumber API compared to the R-native approach, across all dataset sizes.


(y-axis in logarithmic scale)

Both approaches display an approximately linear increase on the logarithmic time scale across the tested row counts, i.e., execution time grows roughly in proportion to the number of rows. Specifically, the mean execution times for R-native and Plumber start at 0.00078 and 0.00456 minutes, respectively, and escalate to 0.286 and 2.61 minutes at 10 million rows. It is reasonable to assume that this trend persists for larger datasets, potentially resulting in execution times exceeding half an hour for very large tables (> 100 million rows) when using Plumber.

Conversely, subfigure B) shows the execution time for sending data and illustrates that both approaches offer rather comparable performance, particularly for larger numbers of rows. While for 10,000 rows the R-native approach is still twice as fast (an average of 0.0023 minutes) compared to Plumber (0.00418 minutes), the advantage of staying in one context diminishes as the number of rows increases. At 10 million rows, the Plumber approach, averaging 1.7 minutes, is even faster than the R-native approach (1.88 minutes). Once again, execution time grows steadily with the number of rows.

CPU Load: In examining maximum observable CPU load during data receiving and sending, notable differences emerge between the Plumber API and the R-native approach. 

(y-axis in logarithmic scale)

A) For data extraction of up to 1 million rows, CPU utilization remains below 10% for both approaches. However, the utilization patterns diverge as row counts increase. Notably, the R-native approach maintains relatively consistent CPU usage (averaging 5.53%, 5.48%, 5.47%) up to 1 million rows, whereas the Plumber approach already shows a noticeable increase (5.97%, 6.05%, 8.6%). When extracting 10 million rows, CPU usage surpasses 30% for Plumber, while R-native extraction incurs approximately five times less computational overhead.

B) In contrast to execution time, a clear difference in CPU utilization also becomes evident when sending data. The R-native approach consistently requires at most half the CPU of Plumber across all data row sizes. For 10,000,000 rows, the Plumber approach even consumes over three times more CPU power (13.1% vs. 43.2%), which amounts to almost 30 GB in absolute terms.

Conclusion

The Plumber approach, while offering several advantages, encounters clear limitations when dealing with large datasets, be it tables with a substantial number of rows or extensive columns. As a result, data extraction becomes roughly ten times slower, with CPU utilization being up to five and three times higher during data extraction and sending, respectively. Digging deeper reveals that this gap likely results from the necessity of converting data into JSON format when using a web-based architecture. Plumber cannot handle R data frames directly, which is why serializers have to be used before sending and retrieving data from an endpoint. Even with plenty of RAM capacity, this conversion process can lead to execution errors in practice, as JSON representations may exceed the maximum size of the R character type:

> jsonlite::toJSON(dataframe)
Error in collapse_object(objnames, tmp, indent):
R character strings are limited to 2^31-1 bytes

The only viable workaround in such scenarios involves breaking down tables into smaller chunks based on certain identifiers.
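For illustration, a simple client-side variant of this idea splits the table into fixed-size chunks before sending; chunk size and the receiving endpoint are assumptions:

# Hypothetical chunk-wise sending; chunk size and endpoint are placeholders.
chunk_size <- 100000
chunks <- split(dataframe, ceiling(seq_len(nrow(dataframe)) / chunk_size))
for (chunk in chunks) {
  httr::POST("http://localhost:8000/write_to_db",
             body = jsonlite::toJSON(chunk, dataframe = "rows"),
             httr::content_type_json())
}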

Providing a precise table size limit up to which the Plumber approach remains suitable proves challenging, as it hinges on a multitude of factors, including the number of rows and columns and the cell content of the dataset. Personally, I will stick to using the Plumber API for scenarios with limited data traffic, such as querying a terminology or a statistical summary, as I generally prioritize code encapsulation and ease of maintenance over maximizing performance.

Micha Christ
Bosch Health Campus Centrum für Medizinische Datenintegration