Packaging R: repositories

In this post I want to collect some thoughts about R repositories, of which there are several that store packages for users. I will cover their purpose, functionality, benefits and drawbacks, and how packages are managed. The goal is to summarize what I’ve learnt about them over the last few years. I’ll also collect information from various sources to make it easier for me to find later on.

I am writing this because I am worried about the future of CRAN and R. Due to multiple circumstances, the current situation is not sustainable as it is. I hope that this post will help me understand the past and the present, and come up with some concrete steps to take.

History

I was not there, but the first repository started around April 1997. This repository is CRAN: the Comprehensive R Archive Network. The first mention I found is already about changes to it, and it was not until the end of that month that it was officially announced.

CRAN was created by a few volunteers, some of whom are still maintaining it 25 years later. The current team is listed on their website. From the beginning it was “a collection of sites which carry identical material, consisting of the R&R R distribution(s), the contributed extensions, documentation for R, and binaries.”

Omegahat was another repository created shortly after CRAN:

The Omega project began in July, 1998, with discussions among designers responsible for three current statistical languages (S, R, and Lisp-Stat), with the idea of working together on new directions with special emphasis on web-based software, Java, the Java virtual machine, and distributed computing.

Many Omegahat developers were in the R Core or CRAN team. It was available as a repository from the R source code but was definitively removed in R 4.1, in 2021 [1].

Bioconductor was the next major repository to appear. It was founded by Robert Gentleman and others in 2004 (when its mailing list started). A paper describing it appeared in late 2004:

an initiative for the collaborative creation of extensible software for computational biology and bioinformatics.

Throughout their history, repositories have evolved with R, and R with them. For example, R was initially released twice a year, and so was Bioconductor. But when R moved to one release per year (in 2013, with version 3.0), Bioconductor kept two releases a year. This introduced some problems when installing packages from Bioconductor, because a single R release can be compatible with two Bioconductor releases [2].
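
To make the mismatch concrete, here is a minimal sketch of the pairing for a few recent R versions. The lookup table is written by hand for illustration, and the helper name bioc_releases_for is hypothetical; in practice the BiocManager package works this out for you.

```r
# Hand-written lookup table: each annual R release (major.minor) pairs with
# two Bioconductor releases, one in spring and one in autumn.
bioc_by_r <- list(
  "4.1" = c("3.13", "3.14"),
  "4.2" = c("3.15", "3.16"),
  "4.3" = c("3.17", "3.18")
)

# Hypothetical helper: which Bioconductor releases match a given R version?
bioc_releases_for <- function(r_version) {
  # Keep only the major.minor part, e.g. "4.3.1" becomes "4.3".
  key <- sub("^([0-9]+\\.[0-9]+).*$", "\\1", as.character(r_version))
  releases <- bioc_by_r[[key]]
  if (is.null(releases)) {
    stop("R version not in this illustrative table: ", key)
  }
  releases
}

bioc_releases_for("4.3.1")
```

Because the answer is a vector of two releases rather than a single one, a plain install.packages() call cannot know which Bioconductor release the user wants, which is the gap a dedicated installer has to fill.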

In other cases, checks have evolved. For instance, Solaris was used to test packages on CRAN until 2021, if I recall correctly, because it made it possible to test with a proprietary C or C++ compiler. This led to the discovery of bugs, but also to more distress among R package developers, who had difficulties checking their packages in that environment.

Other checks evolve with R, becoming stricter with time. In the early versions of R the use of NAMESPACE was not regulated, but since R version 2.15 (2012) it has been compulsory, even for data-only packages [3]. This was synchronized with the repositories’ checks.

Last, some goals or desires of CRAN have not been fulfilled (or were abandoned). For example, from the start CRAN aimed to have packages authenticated (see the bottom of the announcement). This might be due to a lack of time or resources, or the plans may still be in progress but require (volunteer) time.

With time, different repositories arose:

  • MRAN, which was available from September 17th, 2014 to July 1st, 2023.

  • The RStudio Public Package Manager, later renamed Posit Public Package Manager, has provided binaries for several operating systems since 2019.

  • There is the R4pi repository with binaries for Raspberry Pi.

  • I remember a proteomics repository being available at some point.

  • rOpenSci started its own repository which later evolved into the r-universe. The r-universe currently can provide binaries of packages that are hosted in a git repository.
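
From R’s point of view, each of these repositories is just a URL listed in the repos option. The following is a minimal sketch of how several of them can be combined in one session; the labels PPM and ropensci are arbitrary names I picked, and the URLs are the ones I believe are current, so check them before relying on this.

```r
# Set several repositories at once; options() returns the previous values,
# which we keep so the setting can be restored afterwards.
old <- options(repos = c(
  CRAN     = "https://cloud.r-project.org",
  PPM      = "https://packagemanager.posit.co/cran/latest",
  ropensci = "https://ropensci.r-universe.dev"
))

# install.packages() and available.packages() look at all of these URLs.
repos <- getOption("repos")
print(names(repos))

# Restore the previous configuration.
options(old)
```

Nothing is downloaded here; the option only records where R should look the next time a package is requested.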

Literature

The role and prominence of the repositories have led to many articles being written about them. I wanted to link and collect some of them for easier retrieval.

I was wondering how CRAN is described by the volunteers that built it. From the announcing email:

CRAN is a collection of sites which carry identical material, consisting of the R&R R distribution(s), the contributed extensions, documentation for R, and binaries.

From the website (as of 2023/12/09):

CRAN is a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R.

Initially there was R NEWS, with an article dedicated to CRAN and one to Omegahat too. These articles usually describe new package additions but sometimes they also provide information about changes:

Later it became the R Journal:

In addition, several articles and blog posts have appeared. From those I found it is worth mentioning the following:

And my own posts:

Characteristics

The predominance of CRAN and its role as the primary and default R repository have led to some special treatment of the repository.

CRAN checks are in the R source code itself, while other repositories implement their own checks in separate tools. In addition, the environment variables used by CRAN are documented in the R Internals manual (they are more or less accessible in the svn repository too).

Others who know more have also described the benefits of CRAN. The following text is copied from Henrik Bengtsson in the Bioconductor Slack:

FOREVER ARCHIVE:

The first one is that it publishes packages and versions of them until the end of time. When a package has been published on CRAN, it takes a lot for it to be removed from there. I don’t know if it ever happened, but I can imagine a package can be fully removed if it was illegally published in the first place (e.g. copyright, illegal content, ...) or malicious.

INSTALLATION SERVICE:

Then CRAN also provides a R package repository service for installing packages on CRAN using built-in R functions. The set of packages in the package repo is a subset of all packages on CRAN. The CRAN package repo makes a promise that all packages listed in PACKAGES can be installed. If they cannot make that promise, they’ll archive the package (=remove it from PACKAGES). I should also say, install.packages(url) can be used to install from the set of packages that are archived. Technically, old package versions are always archived.
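
As an aside to the quote: the PACKAGES file it mentions is a plain-text index in DCF format, one record per package. Below is a minimal sketch of reading such an index and checking whether a package is listed. The index content and the helper is_listed are made up for illustration; a real index would come from a repository, for example through available.packages().

```r
# Hypothetical excerpt of a PACKAGES index; the entries are invented.
index_text <- "Package: pkgA
Version: 1.0.2
Depends: R (>= 3.5.0)
License: GPL-3

Package: pkgB
Version: 0.3.0
Imports: pkgA
License: MIT + file LICENSE
"

# read.dcf() accepts a connection and returns a character matrix with one
# row per record; fields missing from a record are filled with NA.
con <- textConnection(index_text)
index <- read.dcf(con)
close(con)

# Hypothetical helper: is the package part of the installable set?
is_listed <- function(pkg, index) pkg %in% index[, "Package"]

is_listed("pkgA", index)
is_listed("pkgC", index)

# Installing an archived version needs network access, so it is left commented
# out; <pkg> and <version> are placeholders, not a real package.
# install.packages(
#   "https://cran.r-project.org/src/contrib/Archive/<pkg>/<pkg>_<version>.tar.gz",
#   repos = NULL, type = "source"
# )
```

A package that is archived disappears from this index, but its source tarballs remain reachable by URL, which is why the last call still works for archived packages.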

CHECK SERVICE:

The content of the R package repository is guided by the CRAN package checks that run on R-oldrel, R-release, and R-devel across multiple platforms. The minimal requirement is that no package should remain in the package repository if the checks detects ERRORs (and those errors are not due to recently introduced bugs in R-devel). WARNINGs can also cause a package to be archived, but that process often takes longer. AFAIK, NOTEs are not a cause for a package being archived (but I could be wrong). The CRAN incoming checks, which you have to pass when you submit a new package, or an updated version, will make sure that the published package pass with all OKs. (It’s possible to argue for NOTEs being false positives, or for them not to be fixed, but that requires a manual approval by the CRAN Team).
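
The rules in the quote can be summarized in a few lines of code. This is only a simplified reading of the text above, not CRAN’s actual policy or tooling; the helper names worst_status and archival_outlook and the check results are made up.

```r
# Check statuses ordered from least to most severe.
severity <- c("OK", "NOTE", "WARNING", "ERROR")

# Hypothetical helper: worst status across all check flavors of one package.
worst_status <- function(statuses) {
  stopifnot(all(statuses %in% severity))
  severity[max(match(statuses, severity))]
}

# Hypothetical helper: map the worst status to the consequence in the quote.
archival_outlook <- function(statuses) {
  switch(worst_status(statuses),
    ERROR   = "archived unless fixed (or the error is an R-devel bug)",
    WARNING = "may be archived, usually after a longer period",
    NOTE    = "normally stays in the repository",
    OK      = "stays in the repository"
  )
}

# Made-up results for one package on three flavors.
checks <- c("r-devel" = "WARNING", "r-release" = "OK", "r-oldrel" = "NOTE")
archival_outlook(checks)
```

The point of the sketch is that a single bad flavor decides the outcome: a package is judged by its worst result, not by its average.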

I think there are many more resources discussing R repositories. If you know more I’ll be happy to update this post.

Before I dwell too long on the next steps, I’ll post this and keep collecting articles I might have missed.

Last, Uwe Ligges gave a presentation about CRAN at useR! 2017; thanks to Tim Taylor for sharing it. The video includes an explanation of why Solaris was used.

It has come to my attention that there is an article by G. Brooke Anderson and Dirk Eddelbuettel about the structure of R package repositories (among other things): Hosting Data Packages via drat: A Case Study with Hurricane Exposure Data

Reproducibility

## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.3.1 (2023-06-16)
##  os       Ubuntu 22.04.3 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Europe/Madrid
##  date     2024-01-15
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.18    2023-06-19 [1] CRAN (R 4.3.1)
##  bookdown      0.37    2023-12-01 [1] CRAN (R 4.3.1)
##  bslib         0.6.1   2023-11-28 [1] CRAN (R 4.3.1)
##  cachem        1.0.8   2023-05-01 [1] CRAN (R 4.3.1)
##  cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.1)
##  digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
##  evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.2)
##  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.1)
##  htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.3.2)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.3.1)
##  jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.3.1)
##  knitr         1.45    2023-10-30 [1] CRAN (R 4.3.2)
##  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.2)
##  R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.1)
##  rlang         1.1.3   2024-01-10 [1] CRAN (R 4.3.1)
##  rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.1)
##  rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.1)
##  sass          0.4.8   2023-12-06 [1] CRAN (R 4.3.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.1)
##  xfun          0.41    2023-11-01 [1] CRAN (R 4.3.2)
##  yaml          2.3.8   2023-12-11 [1] CRAN (R 4.3.1)
## 
##  [1] /home/lluis/bin/R/4.3.1
##  [2] /opt/R/4.3.1/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

  1. In version 3.1.2 Omegahat didn’t provide Windows binaries, and in 4.1 it was removed from the default repositories (see 4.1 in NEWS(.4)).

  2. This led to the need for a special function to install packages from Bioconductor: initially the biocLite function, and later the BiocManager package.

  3. See the 2.15 section of NEWS.

Lluís Revilla Sancho
Bioinformatician

Bioinformatician with interests in functional enrichment, data integration and transcriptomics.
