Exploring CRAN's files: part 1

Using CRAN’s files to look at the evolution of CRAN

Introduction

There are many great things in base R, and one of them is the tools package. This package has the functions used to build, check and create packages, documentation and manuals.

As I wanted to know how CRAN works and how it has changed, I looked into the source code of tools. I found some functions, most of them internal, that access freely available files with information about CRAN packages. These functions are in the CRANtools.R file.

packages <- tools::CRAN_package_db()
# current <- tools:::CRAN_current_db()
# archive <- tools:::CRAN_archive_db()
# issues <- tools::CRAN_check_issues()
# alias <- tools:::CRAN_aliases_db()
# rdxrefs <- tools:::CRAN_rdxrefs_db()

As I was not sure what information these files contain, I asked on R-devel but did not receive an answer. The files seem to be quite obscure, and the private functions might be removed without notice, so they shouldn’t be used in any dependency. However, as the files contain information about CRAN they might provide interesting clues about the history of CRAN and how it is operated.

In this post I will focus on the first file. I’ll explore a couple of fields, and in future posts I will use the other files to explore more of CRAN’s history.

packages file

First of all a very brief exploration of what is in this file:

##    Package Version Priority                        Depends
## 1       A3   1.0.0     <NA> R (>= 2.15.0), xtable, pbapply
## 2 AATtools   0.0.1     <NA>                   R (>= 3.6.0)
## 3   ABACUS   1.0.0     <NA>                   R (>= 3.1.0)
##                                 Imports LinkingTo
## 1                                  <NA>      <NA>
## 2  magrittr, dplyr, doParallel, foreach      <NA>
## 3 ggplot2 (>= 3.1.0), shiny (>= 1.3.1),      <NA>
##                               Suggests Enhances    License License_is_FOSS
## 1                  randomForest, e1071     <NA> GPL (>= 2)            <NA>
## 2                                 <NA>     <NA>      GPL-3            <NA>
## 3 rmarkdown (>= 1.13), knitr (>= 1.22)     <NA>      GPL-3            <NA>

The packages table has similar information to available.packages(), but with many more columns: published date, reverse dependencies, X-CRAN-Comment, who packaged it… Also note that these packages are not filtered to match R version, OS_type or subarch, and there are near duplicates (I learned about this filtering while reading the great documentation of available.packages() and from some mentions online).
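Because the table is not filtered, the same package name can show up in more than one row. As a minimal sketch of how one might spot and drop such near duplicates, here is a small made-up data frame (`pkgs_toy` and its contents are hypothetical; the real table would come from tools::CRAN_package_db()):

```r
# Toy stand-in for the table returned by tools::CRAN_package_db();
# package names and versions are made up for illustration.
pkgs_toy <- data.frame(
  Package = c("pkgA", "pkgB", "pkgB", "pkgC"),
  Version = c("1.0.0", "0.2.0", "0.2.0", "2.1.0"),
  stringsAsFactors = FALSE
)

# Names that appear in more than one row
dup_names <- unique(pkgs_toy$Package[duplicated(pkgs_toy$Package)])

# Keep only the first row of each package
pkgs_unique <- pkgs_toy[!duplicated(pkgs_toy$Package), , drop = FALSE]
```

Keeping the first row is only one possible choice; which row is the right one to keep depends on why the rows differ.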

As we have data from several years I’ll sometimes show the release dates of different R versions to provide some context. Without further delay let’s explore the data!

Published packages

CRAN started some time ago (in 1997) but it hasn’t remained frozen. The package archive (the A in CRAN) has been updated continuously since then. For instance, the current packages do not include packages that were removed or archived, nor versions that were replaced by updates.

First packages are submitted to CRAN and once accepted they are published. As acceptance and publication are usually almost simultaneous, I might use the two terms as synonyms. Looking at the currently available packages and their publication date, we can see the following:

ggplot2 plot of date vs packages accepted on a given day. Until 2020 fewer than 10 packages were accepted daily. Lately more than 30 are added to CRAN. The plot also displays the R release versions from 2.12 in 2010 to 4.2.0 in 2022.

Figure 1: Packages accepted on CRAN by the publication date.

The oldest package still available in its current version was published in 2010. This means a package without issues, dependency changes or bugs detected by the automatic checks for 12 years!
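The counts behind a plot like this can be obtained in a few lines of base R. This is only a sketch with made-up dates (the vector `published` is hypothetical and stands in for the Published column of the packages table):

```r
# Hypothetical publication dates, one per package
published <- as.Date(c("2010-03-01", "2021-05-10", "2021-05-10", "2022-01-15"))

# Number of packages published on each day
daily <- as.data.frame(table(published), stringsAsFactors = FALSE)
names(daily) <- c("date", "n")
daily$date <- as.Date(daily$date)

# Publication date of the oldest current version
oldest <- min(published)
```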

The daily rate of acceptance has increased from fewer than 10 a day until 2020 to more than 30 this year, 2022. If we summarize that information by month we see the same trend; the little bump in 2020 disappears, but other patterns show up:

ggplot figure with the monthly published packages. Until 2015 it rises very slowly, then it is around 50 monthly packages and there are some wobbles. In 2022 it rose to over 800 packages.

Figure 2: Monthly packages published to CRAN. Some monthly variance is observed.

Instead of just one bump we see some waves, with fewer packages accepted on CRAN late in the year and an increase in the first months of the year.
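Summarizing by month only requires truncating each date to its year and month before counting. A minimal sketch, again with made-up dates rather than the real Published column:

```r
# Hypothetical publication dates
published <- as.Date(c("2021-11-03", "2021-12-20", "2022-01-05",
                       "2022-01-19", "2022-02-02"))

# Label each date with its year and month, then count per label
month <- format(published, "%Y-%m")
monthly <- table(month)

# Counting by calendar month alone helps to see seasonal waves
seasonal <- table(format(published, "%m"))
```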

If we look at the accumulated packages on CRAN we see an exponential growth:

Plot with the cumulative number of packages in CRAN. Rising from a few tens to currently more than 18000.

Figure 3: Accumulation of packages. Most of the packages have been published in the last 2 years.

In fact, more of the packages currently on CRAN were added since March 2021 than in all the previous years.

Line with percentages of packages in CRAN by date. Close to 50% of current packages were published between 2010 and 2021.

Figure 4: Percentage of current packages on CRAN according to their date of publication. Most of them were published/updated in the last year and a half.

This is a good time to remember that the date being used is the publication date of the current version of each package. Many had previous versions on CRAN.
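The cumulative curve and the date at which half of the current packages had been published can be computed from the sorted dates. The following is a sketch on made-up dates, not the real data:

```r
# Hypothetical publication dates, sorted from oldest to newest
published <- sort(as.Date(c("2010-03-01", "2019-06-01", "2021-04-01",
                            "2021-09-15", "2022-02-01", "2022-06-30")))

# Running count of packages and its share of the total
cumulative <- seq_along(published)
share <- cumulative / length(published)

# First date at which at least half of the packages were published
halfway <- published[which(share >= 0.5)[1]]

# Share of packages published on or after a given date
recent_share <- mean(published >= as.Date("2021-03-01"))
```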

Processing time

Previously I found that CRAN submissions of new packages and of already published packages differ in some key ways, which affects how long they need to wait to be published on CRAN. With the existing data we can check how fast the process is by comparing the published date with the build date.

The build date is added to the tar.gz file automatically when the developer builds the package via R CMD build. The published date, in contrast, is set by CRAN once the package is accepted.
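As a rough sketch of that comparison, the code below computes the days between build and publication on a made-up table. It assumes that the build date is stored in a Packaged column formatted as "date time UTC; user" and the publication date in a Published column; the rows themselves are hypothetical:

```r
# Toy rows mimicking the assumed format of the Packaged and Published fields
pkgs_toy <- data.frame(
  Package   = c("pkgA", "pkgB"),
  Packaged  = c("2022-07-01 09:30:12 UTC; alice",
                "2022-06-20 18:02:45 UTC; bob"),
  Published = c("2022-07-02", "2022-06-27"),
  stringsAsFactors = FALSE
)

# The first 10 characters of Packaged hold the build date (YYYY-MM-DD)
build_date <- as.Date(substr(pkgs_toy$Packaged, 1, 10), format = "%Y-%m-%d")
publish_date <- as.Date(pkgs_toy$Published, format = "%Y-%m-%d")

# Subtracting two Date vectors gives a difference in days
pkgs_toy$delay_days <- as.numeric(publish_date - build_date)
```

Working with dates rather than date-times matches the one-day precision of the publication date.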

To visualize this I will also check whether new packages differ from those that were already on CRAN:

Histogram of packages and the time between build and publication. They take less than 50 days usually.

Figure 5: Histogram of time difference between building and publishing a package. Color indicates if the package is new to CRAN or not. Most of the published packages take more or less the same time regardless of whether it is the first time or not.
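One possible way to label a package as new, which is an assumption on my part and only a sketch, is to check whether its name appears among the packages with archived versions. Here `archive_toy` is a hypothetical stand-in for the archive information, with one named entry per package that has older versions:

```r
# Names of the packages currently on CRAN (made up)
current <- c("pkgA", "pkgB", "pkgC")

# Hypothetical archive: one named entry per package with older versions
archive_toy <- list(
  pkgB = c("pkgB_0.1.0.tar.gz"),
  pkgC = c("pkgC_1.0.0.tar.gz", "pkgC_1.1.0.tar.gz")
)

# A package with no archived version is on its first release
is_new <- !current %in% names(archive_toy)
names(is_new) <- current
```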

There doesn’t seem to be much difference in the time between build and publication depending on whether it is the first release or not. The precision is just a day, and this is usually a fast process, well below 50 days. Few packages spend longer than that between build and publication, and they are too few to be noticeable at this scale. Since 2016/05/02 there is a check that raises an issue if the build is older than a month.

Note that one might need to build the package multiple times before it is accepted. Packages published for the first time on CRAN might have been submitted previously, but once they finally build and pass the checks and the manual review they are handled as fast as packages already on CRAN.

However, this time between build and acceptance might have changed with time:

Smoothed lines of published packages with different linetype and color depending on if it is the first time they are on CRAN or not. New packages currently take less than 4 days and old packages less than 2. This is down from 2018 to 2021, when new packages took above 4 days to be published on CRAN.

Figure 6: Processing time between building the package and being published by date. There is a large difference between new packages and old ones. New packages usually take more time while existing packages take less than a day currently.

We clearly see a difference in processing time between packages already on CRAN and those that are not. Keep in mind that for the few packages from before 2016 the estimate might not be accurate. At the same time, this is consistent with the manual review process (for more information see my previous post about the review process of CRAN or my talk at useR! 2021). It also means that there is a huge variation in how long packages take to be handled. However, this seems to be shrinking: while in 2010 it took around 2 weeks, nowadays new packages take less than a week and are getting closer to the median of 1 day between being built and appearing on CRAN that existing packages take.
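A comparison like this boils down to a median per year and per group. The sketch below uses made-up delays (the numbers are illustrative only and do not reproduce the CRAN data):

```r
# Hypothetical delays in days, by year and by whether the package is new
delays <- data.frame(
  year   = c(2020, 2020, 2020, 2020, 2022, 2022, 2022, 2022),
  is_new = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE),
  delay  = c(6, 8, 1, 3, 2, 4, 0, 2)
)

# Median delay for each combination of year and status
med <- aggregate(delay ~ year + is_new, data = delays, FUN = median)
```

The median is less sensitive than the mean to the few packages that wait a long time between build and publication.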

This difference might be explained by experience: authors and maintainers whose package(s) are already on CRAN know better how to submit a new version without problems in the checks.

It could also be that new packages need more time from the CRAN team. In 2020 we see it took longer than in previous years for packages to be added to CRAN. Maybe the increase in processing time in 2020 was due to the huge volume of submissions CRAN received, or to more checks on the developer side before submitting to CRAN.

The two explanations are not mutually exclusive.

Do more packages published on the same day mean more processing time? It doesn’t look like it.

ggplot graphic with the processing time and the number of packages accepted the same day. New packages have less delay than already published packages, but the more packages are accepted, the less delay there is.

Figure 7: Packages accepted the same day and processing time. New packages are accepted sooner than packages already on CRAN with respect to the build date.

Surprisingly, we see a lot of variation in the delay of packages already accepted on CRAN. In addition, the more new packages are accepted on the same day, the less delay there is. I think this just means that when reviewers work through the submission queue several packages might be approved at once.
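To relate the delay to the number of packages accepted on the same day, each row first needs the count of packages sharing its publication date. A minimal sketch on made-up values (the data frame and its numbers are hypothetical):

```r
# Hypothetical publication dates and delays in days
pkgs_toy <- data.frame(
  published = as.Date(c("2022-07-01", "2022-07-01", "2022-07-01",
                        "2022-07-02", "2022-07-03", "2022-07-03")),
  delay     = c(1, 0, 2, 5, 3, 1)
)

# Number of packages published on the same day, repeated on every row
pkgs_toy$same_day <- ave(pkgs_toy$delay, as.character(pkgs_toy$published),
                         FUN = length)

# Median delay for each same-day count
by_count <- aggregate(delay ~ same_day, data = pkgs_toy, FUN = median)
```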

This might also mean that packages have already been built several times before finally being accepted, and by then the errors, warnings and notes have been solved. Lastly, this could indicate that developers with a package already on CRAN wait a bit between building and submitting it, as they might be taking some time to double check before submission (dependencies, several machines, other?), or that there is a time zone difference (submitting at noon in one region but during the reviewers’ night).

Conclusion

  • There are packages that have been working for 12 years without problems despite several major changes in R (See figure 1). This speaks volumes about the packages’ quality, and about the backward compatibility that R core aims for and CRAN checks.

  • CRAN accepts an incredible number of packages daily and monthly. The system and the team are doing incredible work, mostly in their free time (See figure 2). Many thanks!

  • Accepted packages are handled very fast, usually in less than a week (See figure 7). But from these data alone it is not possible to distinguish time in the submission system from time on the developer’s computer.

Future parts

We’ve explored a snapshot of current packages and a brief window of all the history of CRAN. There is much more that can be done with all the other files.

In future posts I’ll explore:

  • patterns in accepting packages and in package updates.

  • who handled the packages.

  • the size of packages.

  • the relation between dependencies, initial release and updates.

Other suggestions?

Edit: Many thanks to Maëlle Salmon and Dirk Eddelbuettel for their feedback on an initial version of this series of posts.

Reproducibility

## - Session info -------------------------------------------------------------------------------------------------------
##  setting  value
##  version  R version 4.2.1 (2022-06-23)
##  os       Ubuntu 20.04.4 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  C
##  ctype    C
##  tz       Europe/Madrid
##  date     2022-07-23
##  pandoc   2.18 @ /usr/lib/rstudio/bin/quarto/bin/tools/ (via rmarkdown)
## 
## - Packages -----------------------------------------------------------------------------------------------------------
##  package      * version    date (UTC) lib source
##  assertthat     0.2.1      2019-03-21 [2] RSPM (R 4.2.0)
##  base64enc      0.1-3      2015-07-28 [2] CRAN (R 4.0.0)
##  blogdown       1.10       2022-05-10 [2] RSPM (R 4.2.0)
##  bookdown       0.27       2022-06-14 [2] RSPM (R 4.2.0)
##  bslib          0.4.0      2022-07-16 [2] RSPM (R 4.2.0)
##  cachem         1.0.6      2021-08-19 [2] RSPM (R 4.2.0)
##  cli            3.3.0      2022-04-25 [2] RSPM (R 4.2.0)
##  codetools      0.2-18     2020-11-04 [2] RSPM (R 4.2.0)
##  colorspace     2.0-3      2022-02-21 [2] RSPM (R 4.2.0)
##  crayon         1.5.1      2022-03-26 [2] RSPM (R 4.2.0)
##  curl           4.3.2      2021-06-23 [2] RSPM (R 4.2.0)
##  data.table     1.14.2     2021-09-27 [2] RSPM (R 4.2.0)
##  DBI            1.1.3      2022-06-18 [2] RSPM (R 4.2.0)
##  digest         0.6.29     2021-12-01 [2] RSPM (R 4.2.0)
##  dplyr        * 1.0.9      2022-04-28 [2] RSPM (R 4.2.0)
##  ellipsis       0.3.2      2021-04-29 [2] RSPM (R 4.2.0)
##  evaluate       0.15       2022-02-18 [2] RSPM (R 4.2.0)
##  fansi          1.0.3      2022-03-24 [2] RSPM (R 4.2.0)
##  farver         2.1.1      2022-07-06 [2] RSPM (R 4.2.0)
##  fastmap        1.1.0      2021-01-25 [2] RSPM (R 4.2.0)
##  flextable    * 0.7.2      2022-06-12 [2] RSPM (R 4.2.0)
##  forcats      * 0.5.1      2021-01-27 [2] RSPM (R 4.2.0)
##  gdtools        0.2.4      2022-02-14 [2] RSPM (R 4.2.0)
##  generics       0.1.3      2022-07-05 [2] RSPM (R 4.2.0)
##  geomtextpath * 0.1.0      2022-01-24 [2] CRAN (R 4.2.1)
##  ggplot2      * 3.3.6.9000 2022-06-29 [2] Github (tidyverse/ggplot2@7571122)
##  ggrepel      * 0.9.1      2021-01-15 [2] RSPM (R 4.2.0)
##  glue           1.6.2      2022-02-24 [2] RSPM (R 4.2.0)
##  gtable         0.3.0      2019-03-25 [2] CRAN (R 4.0.0)
##  highr          0.9        2021-04-16 [2] RSPM (R 4.2.0)
##  htmltools      0.5.3      2022-07-18 [2] RSPM (R 4.2.0)
##  jquerylib      0.1.4      2021-04-26 [2] RSPM (R 4.2.0)
##  jsonlite       1.8.0      2022-02-22 [2] RSPM (R 4.2.0)
##  knitr          1.39       2022-04-26 [2] RSPM (R 4.2.0)
##  labeling       0.4.2      2020-10-20 [2] RSPM (R 4.2.0)
##  lattice        0.20-45    2021-09-22 [3] CRAN (R 4.2.0)
##  lifecycle      1.0.1      2021-09-24 [2] RSPM (R 4.2.0)
##  lubridate    * 1.8.0      2021-10-07 [2] RSPM (R 4.2.0)
##  magrittr       2.0.3      2022-03-30 [2] RSPM (R 4.2.0)
##  Matrix         1.4-1      2022-03-23 [2] RSPM (R 4.2.0)
##  mgcv           1.8-40     2022-03-29 [2] RSPM (R 4.2.0)
##  munsell        0.5.0      2018-06-12 [2] RSPM (R 4.2.0)
##  nlme           3.1-158    2022-06-15 [2] RSPM (R 4.2.0)
##  officer        0.4.3      2022-06-12 [2] RSPM (R 4.2.0)
##  pillar         1.8.0      2022-07-18 [2] RSPM (R 4.2.0)
##  pkgconfig      2.0.3      2019-09-22 [2] RSPM (R 4.2.0)
##  purrr          0.3.4      2020-04-17 [2] RSPM (R 4.2.0)
##  R6             2.5.1      2021-08-19 [2] RSPM (R 4.2.0)
##  Rcpp           1.0.9      2022-07-08 [2] RSPM (R 4.2.0)
##  rlang          1.0.4      2022-07-12 [2] RSPM (R 4.2.0)
##  rmarkdown      2.14       2022-04-25 [2] RSPM (R 4.2.0)
##  rstudioapi     0.13       2020-11-12 [2] RSPM (R 4.2.0)
##  rversions    * 2.1.1      2021-05-31 [2] RSPM (R 4.2.0)
##  sass           0.4.2      2022-07-16 [2] RSPM (R 4.2.0)
##  scales         1.2.0      2022-04-13 [2] RSPM (R 4.2.0)
##  sessioninfo    1.2.2      2021-12-06 [2] RSPM (R 4.2.0)
##  stringi        1.7.8      2022-07-11 [2] RSPM (R 4.2.0)
##  stringr        1.4.0      2019-02-10 [2] RSPM (R 4.2.0)
##  systemfonts    1.0.4      2022-02-11 [2] RSPM (R 4.2.0)
##  textshaping    0.3.6      2021-10-13 [2] RSPM (R 4.2.0)
##  tibble         3.1.7      2022-05-03 [2] RSPM (R 4.2.0)
##  tidyr        * 1.2.0      2022-02-01 [2] RSPM (R 4.2.0)
##  tidyselect     1.1.2      2022-02-21 [2] RSPM (R 4.2.0)
##  utf8           1.2.2      2021-07-24 [2] RSPM (R 4.2.0)
##  uuid           1.1-0      2022-04-19 [2] RSPM (R 4.2.0)
##  vctrs          0.4.1      2022-04-13 [2] RSPM (R 4.2.0)
##  withr          2.5.0      2022-03-03 [2] RSPM (R 4.2.0)
##  xfun           0.31       2022-05-10 [2] RSPM (R 4.2.0)
##  xml2           1.3.3      2021-11-30 [2] RSPM (R 4.2.0)
##  yaml           2.3.5      2022-02-21 [2] RSPM (R 4.2.0)
##  zip            2.2.0      2021-05-31 [2] RSPM (R 4.2.0)
## 
##  [1] /home/lluis/bin/R/4.2.1
##  [2] /usr/lib/R/site-library
##  [3] /usr/lib/R/library
## 
## ----------------------------------------------------------------------------------------------------------------------
Lluís Revilla Sancho
Bioinformatician

Bioinformatician with interests in functional enrichment, data integration and transcriptomics.
