GitLab CI for R-package development

— A basic R-phile introduction to continuous integration on GitLab

I have been using GitLab for some time, mainly for two reasons: I can have private projects at no monetary cost (I later came to realise that, as an academic, I can have the same on GitHub), and most importantly, GitLab has so far gone under the radar of our IT department, meaning I can access it from my work computer. GitHub, on the other hand, is flagged as file sharing.

A simple CI config

Most of my time with R is spent trying to make heads or tails of various kinds of data, and I have so far authored just one R-package. While I can see the benefits of a continuous integration (CI) workflow, I just never bothered to actually set it up. Now that I am putting together code in smaller packages for internal use, it seemed like the right time to learn a little.

The Internet gives a few pointers on how to go about setting up CI on GitLab; one of the resources is the blog post Docker, GitLab CI and Developing R Packages by Mustafa Hasanbulli, who gives a simple .gitlab-ci.yml for testing packages. Mustafa’s solution makes use of the rocker/tidyverse Docker image and installs the dependency packages before running check() from devtools. It’s a good solution, and by combining it with the .gitlab-ci.yml shared as a gist on GitHub by Artem Klevtsov, I managed to get the coverage badge I thought would be nice to have. The .gitlab-ci.yml for a smaller package can be along the lines of:

image: rocker/tidyverse

stages:
  - check
  - coverage

check_pkg:
  stage: check
  script:
    # List the package's dependencies inside c()
    - R -e 'install.packages(c())'
    - R -e 'devtools::check()'

coverage:
  stage: coverage
  script:
    - R -e 'covr::package_coverage(type = c("tests", "examples"))'

To extract the coverage for the coverage badge, add Coverage: \d+\.\d+%$ to the section ‘Test coverage parsing’ in Settings -> CI/CD -> General pipelines.
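As a quick sanity check of the pattern, here is a minimal R sketch that tests it against a line shaped like the summary covr prints. The package name and percentage are made up for illustration, and the dot is escaped so that it only matches a literal decimal point.

```r
# Hypothetical summary line, shaped like the one covr prints for a package
covr_line <- "mypkg Coverage: 85.71%"

# The coverage pattern, with backslashes doubled because this is an R string
coverage_pattern <- "Coverage: \\d+\\.\\d+%$"

# TRUE if the pattern matches the line
matches <- grepl(coverage_pattern, covr_line, perl = TRUE)

# Extract just the number, i.e. the part a badge would display
coverage_value <- as.numeric(
  sub(".*Coverage: (\\d+\\.\\d+)%$", "\\1", covr_line, perl = TRUE)
)

print(matches)
print(coverage_value)
```

If the pattern does not match the job log, the badge will most likely just show the coverage as unknown, so it is worth checking the pattern against a real log line from your own pipeline.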

Introducing cache

For my package, each of the two stages took about 45 minutes to complete, and I realized that the vast majority of the time was spent on downloading and especially installing packages. This was mainly due to the Bioconductor packages I rely on.

If only there were a way to pass the installed packages between the stages, or even between runs of the CI pipeline. There is: GitLab 9.0 saw the option to specify a cache. The next problem is that the cached path must be a directory inside the cloned project directory. Since R prefers to install packages in /usr/lib/R/library in the Docker images, the library path must be changed with .libPaths(). In addition, you would have to remember to add any new package to the .gitlab-ci.yml, which I for one would always forget, and I would then have to painstakingly figure out which packages to add.
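To illustrate the manual route, here is a minimal sketch of pointing R at a library inside the project directory. The directory name ci-lib is a made-up example, not something GitLab or R prescribes, and the same path would then have to be listed under the cache paths in the .gitlab-ci.yml.

```r
# Hypothetical library directory inside the cloned project, so it can be cached
ci_lib <- file.path(getwd(), "ci-lib")
dir.create(ci_lib, showWarnings = FALSE, recursive = TRUE)

# Put the project-local library first on the search path;
# install.packages() installs into the first library by default
.libPaths(c(ci_lib, .libPaths()))

print(.libPaths()[1])
```

This has to run at the start of every job, before any package is installed or loaded, which is part of what makes the approach easy to get wrong.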

A much simpler solution is to use packrat, something you should consider using anyway. It also allows you to use the rocker/r-base image and install just the packages actually required for your CI. How much of a win in terms of traffic rocker/r-base is over rocker/tidyverse probably depends on the packages you have to add. The .gitlab-ci.yml caching packages could look like this:

image: rocker/r-base

stages:
  - setup
  - test

cache:
  # Omit key to use the same cache across all pipelines and branches
  key: "$CI_COMMIT_REF_SLUG"
  paths:
    - packrat/lib/

setup:
  stage: setup
  script:
    - R -e 'source("ci.R"); ci_setup()'

check:
  stage: test
  dependencies:
    - setup
  when: on_success
  script:
    - R -e 'source("ci.R"); ci_check()'

coverage:
  stage: test
  dependencies:
    - setup
  when: on_success
  only:
    - master
  script:
    - R -e 'source("ci.R"); ci_coverage()'

with the ci.R looking like this:

install_if_needed <- function(package_to_install){
  package_path <- find.package(package_to_install, quiet = TRUE)

  if(length(package_path) == 0){
    # Only install if not present
    install.packages(package_to_install)
  }
}

ci_setup <- function(){
  install_if_needed("packrat")
  packrat::restore()
}

ci_check <- function(){
  install_if_needed("devtools")
  devtools::check()
}

ci_coverage <- function(){
  install_if_needed("covr")
  covr::package_coverage(type = c("tests", "examples"))
}

The cache key $CI_COMMIT_REF_SLUG gives you a separate cache for each branch. Using $CI_COMMIT_SHA would give you a separate cache for each commit.

Adding the packrat subdirectories src and lib* to the .gitignore will keep your repository small, and I find it quite useful to commit just the packrat.lock whenever I add or remove a package. But then again, I am the only one working with my repositories, and committing more might have advantages I don’t know of.
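If you prefer to script the .gitignore step, a small base R helper can do it. This is only a sketch: the function name ignore_packrat_dirs is made up, and the two patterns assume packrat's default directory layout.

```r
# Hypothetical helper: append packrat's library and source directories
# to a .gitignore file, skipping patterns that are already listed.
ignore_packrat_dirs <- function(gitignore = ".gitignore") {
  patterns <- c("packrat/lib*/", "packrat/src/")

  # Read the current entries, if the file exists at all
  existing <- if (file.exists(gitignore)) {
    readLines(gitignore, warn = FALSE)
  } else {
    character(0)
  }

  # Only add what is not there yet, so repeated calls change nothing
  to_add <- setdiff(patterns, existing)
  if (length(to_add) > 0) {
    writeLines(c(existing, to_add), gitignore)
  }

  invisible(to_add)
}
```

Calling ignore_packrat_dirs() from the project root once is enough; packrat/packrat.lock is not matched by either pattern, so it stays under version control.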

I have noticed that the stages after the setup stage sometimes fail on the first run. If this happens because of the cache, rerunning the failed stage fixes it.

Using the above for my package, the first run of the pipeline took about 45 minutes, but the second run only about 8 minutes. A considerable reduction in time.

I hope the .gitlab-ci.yml and ci.R outlined here will help you get started on caching your R-packages in your CI. The two files are quite simple, and if you are looking for something more sophisticated, I can recommend looking at Matt Dowle’s work on data.table and of course the GitLab Runner help pages.