Exploring CRAN's files: part 1
Using CRAN’s files to look into the evolution of CRAN
Introduction
There are many great things in base R; one of them is the tools package. This package has the functions that are used to build, check and create packages, documentation and manuals.
As I wanted to know how CRAN works and how it has changed, I looked into the source code of tools. I found some functions, most of them internal, that access freely available files with information about CRAN packages. These functions are in the CRANtools.R file.
packages <- tools::CRAN_package_db()
# current <- tools:::CRAN_current_db()
# archive <- tools:::CRAN_archive_db()
# issues <- tools::CRAN_check_issues()
# alias <- tools:::CRAN_aliases_db()
# rdxrefs <- tools:::CRAN_rdxrefs_db()
As I was not sure what information these files contain, I asked on R-devel but did not receive an answer. They seem to be quite obscure and, as the functions are private, they might be removed without notice and shouldn’t be used in any dependency. However, as the files contain information about CRAN, they might provide interesting clues about the history of CRAN and how it is operated.
In this post I will focus on the first file. I’ll explore a couple of fields, and in future posts I will use the other files to explore more of CRAN’s history.
packages file
First of all a very brief exploration of what is in this file:
## Package Version Priority Depends
## 1 A3 1.0.0 <NA> R (>= 2.15.0), xtable, pbapply
## 2 AATtools 0.0.1 <NA> R (>= 3.6.0)
## 3 ABACUS 1.0.0 <NA> R (>= 3.1.0)
## Imports LinkingTo
## 1 <NA> <NA>
## 2 magrittr, dplyr, doParallel, foreach <NA>
## 3 ggplot2 (>= 3.1.0), shiny (>= 1.3.1), <NA>
## Suggests Enhances License License_is_FOSS
## 1 randomForest, e1071 <NA> GPL (>= 2) <NA>
## 2 <NA> <NA> GPL-3 <NA>
## 3 rmarkdown (>= 1.13), knitr (>= 1.22) <NA> GPL-3 <NA>
The packages object has similar information to available.packages(), but with many more columns: publication date, reverse dependencies, X-CRAN-Comment, who packaged it…
Also note that these packages are not filtered to match R version, OS_type or subarch, and there are near duplicates (I learned about this filtering while reading the great documentation of available.packages() and also from some mentions online).
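The near duplicates mentioned above can be spotted by counting how often a package name appears. This is only a minimal sketch on a made-up data frame: the column names follow the output shown earlier, but the rows are invented, and the real data would come from tools::CRAN_package_db().

```r
# Toy stand-in for the data frame returned by tools::CRAN_package_db();
# the rows are invented for illustration only.
pkgs <- data.frame(
  Package = c("pkgA", "pkgB", "pkgB", "pkgC"),
  Version = c("1.0.0", "0.2.1", "0.2.1", "2.3.0"),
  Depends = c("R (>= 3.5.0)", "R (>= 4.0.0)", "R (>= 3.6.0)", NA),
  stringsAsFactors = FALSE
)

# Package names that appear in more than one row
dup_names <- unique(pkgs$Package[duplicated(pkgs$Package)])
dup_names

# All rows for those names, to inspect how the entries differ
pkgs[pkgs$Package %in% dup_names, ]

# Keep only the first row for each package name
pkgs_unique <- pkgs[!duplicated(pkgs$Package), ]
nrow(pkgs_unique)
```

Keeping the first row is an arbitrary choice for this sketch; which entry to keep depends on the R version and platform of interest.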
As we have data from several years I’ll sometimes show the release dates of different R versions to provide some context. Without further delay let’s explore the data!
Published packages
CRAN started some time ago (in 1997) but it hasn’t remained frozen. The package archive (the A in CRAN) has been updated ever since. For instance, the current packages do not include packages that were removed or archived, or versions replaced by updates.
Packages are first submitted to CRAN and, once accepted, they are published. As publication usually follows acceptance almost instantaneously, I might use the two terms as synonyms. Looking at the currently available packages and their publication date, we can see the following:
The oldest package currently available was published in 2010. This means a package without issues, dependency changes or bugs detected by the automatic checks for 12 years!
The daily rate of acceptance has increased from fewer than 10 a day until 2020 to more than 30 this year, 2022. If we summarize that information by month we see the same trend; the little bump in 2020 disappears, but other patterns show up:
Instead of just one bump we see waves, with fewer packages accepted on CRAN late in the year and an increase in the first months of the year.
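The daily and monthly counts described above can be obtained with table(). The sketch below uses invented dates in place of the Published column, so the numbers are purely illustrative.

```r
# Invented publication dates standing in for packages$Published
published <- as.Date(c("2022-01-03", "2022-01-03", "2022-01-15",
                       "2022-02-01", "2022-02-01", "2022-02-01"))

# Number of packages published on each day
daily <- table(published)
daily

# Number of packages published in each month (labelled "YYYY-MM")
monthly <- table(format(published, "%Y-%m"))
monthly
```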
If we look at the accumulated packages on CRAN we see an exponential growth:
In fact, more of the packages currently on CRAN were added since March 2021 than in all the previous years.
This is a good time to remember that the date being used is the publication date of the current version of each package. Many had previous versions on CRAN:
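The accumulated count, and the date by which half of the current packages had been published, can be sketched with cumsum(). The dates below are invented; with the real data they would come from the Published column.

```r
# Invented publication dates standing in for packages$Published
published <- as.Date(c("2015-06-01", "2019-03-10", "2021-04-02",
                       "2021-11-20", "2022-05-05", "2022-06-30"))

# Packages per day (table() orders the dates), then the running total
per_day <- table(published)
growth <- data.frame(
  date = as.Date(names(per_day)),
  total = as.vector(cumsum(per_day))
)
growth

# First date at which at least half of the packages were already published
half_date <- growth$date[which(growth$total >= max(growth$total) / 2)[1]]
half_date
```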
First release | Packages |
No | 14,294 |
Yes | 4,113 |
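One way to build such a table is to check whether a package has older versions in the archive. The sketch below assumes the archive is a named list with one element per package that has archived versions (which is how I understand the result of tools:::CRAN_archive_db() to be organized); both objects here are invented.

```r
# Toy stand-ins: current packages and a hypothetical archive list
pkgs <- data.frame(Package = c("pkgA", "pkgB", "pkgC"),
                   stringsAsFactors = FALSE)
archive <- list(pkgB = data.frame(size = c(1000, 1200)))  # invented content

# A package with no archived versions is on its first release
pkgs$first_release <- ifelse(pkgs$Package %in% names(archive), "No", "Yes")
table(pkgs$first_release)
```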
Processing time
Previously I found that CRAN submissions present some key differences between new packages and already published packages, which affect how long they need to wait to be published on CRAN. With the existing data we can see how fast the process is by comparing the published date with the build date.
The build date is added to the tar.gz file automatically when the developer builds the package via R CMD build. However, the published date is set by CRAN once the package is accepted on CRAN.
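The time between build and publication can be computed by subtracting the two dates. The sketch below assumes the build date is stored in a Packaged field that starts with a "YYYY-MM-DD" date followed by the time and the user name; the rows are invented.

```r
# Toy rows; the Packaged format (date, time, time zone, user) is an assumption
pkgs <- data.frame(
  Package = c("pkgA", "pkgB"),
  Packaged = c("2022-03-01 10:15:00 UTC; alice",
               "2022-05-20 08:00:00 UTC; bob"),
  Published = c("2022-03-02", "2022-05-27"),
  stringsAsFactors = FALSE
)

# The first 10 characters of Packaged hold the build date
build_date <- as.Date(substr(pkgs$Packaged, 1, 10))
pub_date <- as.Date(pkgs$Published)

# Difference in whole days between build and publication
pkgs$delay_days <- as.numeric(pub_date - build_date)
pkgs[, c("Package", "delay_days")]
```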
To visualize the differences I will also check whether new packages differ from those that were already on CRAN:
There doesn’t seem to be much difference between the build date and the publication date depending on whether it is the first release or not. The precision is just a day, and this is usually a fast process, well below 50 days. Few packages spend that long between build and publication, and they are too few to be noticeable at this scale. Since 2016/05/02 there is a check that raises an issue if the build is older than a month.
Note that one might need to build the package multiple times before it is accepted. Packages published for the first time on CRAN might have been submitted previously, but when they are finally built and pass the checks and manual review they are handled as fast as packages already on CRAN.
However, the time between build and acceptance might have changed over time:
We clearly see a difference in processing time between packages already on CRAN and those that are not. Keep in mind that for the few packages from before 2016 the estimate might not be accurate. At the same time, this is consistent with the manual review process (for more information see my previous post about the review process of CRAN or my talk at useR! 2021). It also means that there is a huge variation in how long packages take to be handled. However, this seems to be decreasing: while in 2010 it took around 2 weeks, nowadays it takes less than a week and is getting closer to the median of about 1 day between build and appearance on CRAN that existing packages take.
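A comparison like this one can be summarized as the median delay per year and release type. The following sketch uses aggregate() on invented values; the column names year, first_release and delay_days are hypothetical.

```r
# Invented delays (in days) by publication year and release type
delays <- data.frame(
  year = c(2010, 2010, 2010, 2022, 2022, 2022, 2022),
  first_release = c("Yes", "Yes", "No", "Yes", "Yes", "No", "No"),
  delay_days = c(14, 16, 3, 5, 7, 1, 1),
  stringsAsFactors = FALSE
)

# Median delay for each combination of year and release type
med <- aggregate(delay_days ~ year + first_release, data = delays,
                 FUN = median)
med
```

The median is used here because a few very long delays would dominate the mean.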
This difference might be explained by experience: authors and maintainers whose package(s) are already on CRAN know better how to submit a new version without problems in the checks.
It could also be that new packages need more time from the CRAN team. In 2020 we see it took longer than in previous years for packages to be added to CRAN. Maybe the increase in processing time in 2020 was due to the huge volume of submissions CRAN received, or to more checks on the developer side before submitting to CRAN.
The two explanations are not mutually exclusive.
Do more packages published the same day mean more processing time? It doesn’t look like it.
Surprisingly, we see a lot of variation in the delay of packages already accepted on CRAN. In addition, the more new packages accepted the same day, the shorter the delay. I think this just means that when reviewers work on the submission queue, several packages might be approved at once.
This might also mean that packages have already been built several times before finally being accepted, once the errors, warnings and notes have been solved. Last, this could indicate that developers with a package already on CRAN wait a bit between building and submitting, as they might be taking some time to double check before submission (dependencies, on several machines, other?), or it could reflect a time zone difference (submitting at noon in one region but during the reviewers’ night).
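The relation between the number of packages published on a day and their delay can be tabulated per day. This is a minimal sketch with invented values, using tapply() to get both the count and the median delay for each publication date.

```r
# Invented publication dates and delays (in days)
daily <- data.frame(
  published = as.Date(c("2022-06-01", "2022-06-01", "2022-06-01",
                        "2022-06-02", "2022-06-03", "2022-06-03")),
  delay_days = c(1, 2, 3, 10, 4, 6)
)

# Packages per day and median delay per day
n_per_day <- tapply(daily$delay_days, daily$published, length)
median_per_day <- tapply(daily$delay_days, daily$published, median)

per_day <- data.frame(
  date = as.Date(names(n_per_day)),
  n = as.vector(n_per_day),
  median_delay = as.vector(median_per_day)
)
per_day
```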
Conclusion
There are packages that have been working without problems for 12 years despite several major changes in R (see figure 1). This speaks volumes about the packages’ quality, and about the backward compatibility that R core aims for and CRAN checks.
CRAN accepts an incredible number of packages daily and monthly. The system and the team are doing incredible work, mostly in their free time (see figure 2). Many thanks!
Accepted packages are handled very fast, usually in less than a week (see figure 7). But with this data alone it is not possible to distinguish time spent in the submission system from time spent on the developer’s computer.
Future parts
We’ve explored a snapshot of current packages and a brief window into the history of CRAN. There is much more that can be done with all the other files.
In future posts I’ll explore:
Patterns in accepting packages and in package updates.
Who handled the packages.
Size of packages.
The relation between dependencies, initial release and updates.
Other suggestions?
Edit: Many thanks to Maëlle Salmon and Dirk Eddelbuettel for their feedback on an initial version of this series of posts.
Reproducibility
## - Session info -------------------------------------------------------------------------------------------------------
## setting value
## version R version 4.2.1 (2022-06-23)
## os Ubuntu 20.04.4 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate C
## ctype C
## tz Europe/Madrid
## date 2022-07-23
## pandoc 2.18 @ /usr/lib/rstudio/bin/quarto/bin/tools/ (via rmarkdown)
##
## - Packages -----------------------------------------------------------------------------------------------------------
## package * version date (UTC) lib source
## assertthat 0.2.1 2019-03-21 [2] RSPM (R 4.2.0)
## base64enc 0.1-3 2015-07-28 [2] CRAN (R 4.0.0)
## blogdown 1.10 2022-05-10 [2] RSPM (R 4.2.0)
## bookdown 0.27 2022-06-14 [2] RSPM (R 4.2.0)
## bslib 0.4.0 2022-07-16 [2] RSPM (R 4.2.0)
## cachem 1.0.6 2021-08-19 [2] RSPM (R 4.2.0)
## cli 3.3.0 2022-04-25 [2] RSPM (R 4.2.0)
## codetools 0.2-18 2020-11-04 [2] RSPM (R 4.2.0)
## colorspace 2.0-3 2022-02-21 [2] RSPM (R 4.2.0)
## crayon 1.5.1 2022-03-26 [2] RSPM (R 4.2.0)
## curl 4.3.2 2021-06-23 [2] RSPM (R 4.2.0)
## data.table 1.14.2 2021-09-27 [2] RSPM (R 4.2.0)
## DBI 1.1.3 2022-06-18 [2] RSPM (R 4.2.0)
## digest 0.6.29 2021-12-01 [2] RSPM (R 4.2.0)
## dplyr * 1.0.9 2022-04-28 [2] RSPM (R 4.2.0)
## ellipsis 0.3.2 2021-04-29 [2] RSPM (R 4.2.0)
## evaluate 0.15 2022-02-18 [2] RSPM (R 4.2.0)
## fansi 1.0.3 2022-03-24 [2] RSPM (R 4.2.0)
## farver 2.1.1 2022-07-06 [2] RSPM (R 4.2.0)
## fastmap 1.1.0 2021-01-25 [2] RSPM (R 4.2.0)
## flextable * 0.7.2 2022-06-12 [2] RSPM (R 4.2.0)
## forcats * 0.5.1 2021-01-27 [2] RSPM (R 4.2.0)
## gdtools 0.2.4 2022-02-14 [2] RSPM (R 4.2.0)
## generics 0.1.3 2022-07-05 [2] RSPM (R 4.2.0)
## geomtextpath * 0.1.0 2022-01-24 [2] CRAN (R 4.2.1)
## ggplot2 * 3.3.6.9000 2022-06-29 [2] Github (tidyverse/ggplot2@7571122)
## ggrepel * 0.9.1 2021-01-15 [2] RSPM (R 4.2.0)
## glue 1.6.2 2022-02-24 [2] RSPM (R 4.2.0)
## gtable 0.3.0 2019-03-25 [2] CRAN (R 4.0.0)
## highr 0.9 2021-04-16 [2] RSPM (R 4.2.0)
## htmltools 0.5.3 2022-07-18 [2] RSPM (R 4.2.0)
## jquerylib 0.1.4 2021-04-26 [2] RSPM (R 4.2.0)
## jsonlite 1.8.0 2022-02-22 [2] RSPM (R 4.2.0)
## knitr 1.39 2022-04-26 [2] RSPM (R 4.2.0)
## labeling 0.4.2 2020-10-20 [2] RSPM (R 4.2.0)
## lattice 0.20-45 2021-09-22 [3] CRAN (R 4.2.0)
## lifecycle 1.0.1 2021-09-24 [2] RSPM (R 4.2.0)
## lubridate * 1.8.0 2021-10-07 [2] RSPM (R 4.2.0)
## magrittr 2.0.3 2022-03-30 [2] RSPM (R 4.2.0)
## Matrix 1.4-1 2022-03-23 [2] RSPM (R 4.2.0)
## mgcv 1.8-40 2022-03-29 [2] RSPM (R 4.2.0)
## munsell 0.5.0 2018-06-12 [2] RSPM (R 4.2.0)
## nlme 3.1-158 2022-06-15 [2] RSPM (R 4.2.0)
## officer 0.4.3 2022-06-12 [2] RSPM (R 4.2.0)
## pillar 1.8.0 2022-07-18 [2] RSPM (R 4.2.0)
## pkgconfig 2.0.3 2019-09-22 [2] RSPM (R 4.2.0)
## purrr 0.3.4 2020-04-17 [2] RSPM (R 4.2.0)
## R6 2.5.1 2021-08-19 [2] RSPM (R 4.2.0)
## Rcpp 1.0.9 2022-07-08 [2] RSPM (R 4.2.0)
## rlang 1.0.4 2022-07-12 [2] RSPM (R 4.2.0)
## rmarkdown 2.14 2022-04-25 [2] RSPM (R 4.2.0)
## rstudioapi 0.13 2020-11-12 [2] RSPM (R 4.2.0)
## rversions * 2.1.1 2021-05-31 [2] RSPM (R 4.2.0)
## sass 0.4.2 2022-07-16 [2] RSPM (R 4.2.0)
## scales 1.2.0 2022-04-13 [2] RSPM (R 4.2.0)
## sessioninfo 1.2.2 2021-12-06 [2] RSPM (R 4.2.0)
## stringi 1.7.8 2022-07-11 [2] RSPM (R 4.2.0)
## stringr 1.4.0 2019-02-10 [2] RSPM (R 4.2.0)
## systemfonts 1.0.4 2022-02-11 [2] RSPM (R 4.2.0)
## textshaping 0.3.6 2021-10-13 [2] RSPM (R 4.2.0)
## tibble 3.1.7 2022-05-03 [2] RSPM (R 4.2.0)
## tidyr * 1.2.0 2022-02-01 [2] RSPM (R 4.2.0)
## tidyselect 1.1.2 2022-02-21 [2] RSPM (R 4.2.0)
## utf8 1.2.2 2021-07-24 [2] RSPM (R 4.2.0)
## uuid 1.1-0 2022-04-19 [2] RSPM (R 4.2.0)
## vctrs 0.4.1 2022-04-13 [2] RSPM (R 4.2.0)
## withr 2.5.0 2022-03-03 [2] RSPM (R 4.2.0)
## xfun 0.31 2022-05-10 [2] RSPM (R 4.2.0)
## xml2 1.3.3 2021-11-30 [2] RSPM (R 4.2.0)
## yaml 2.3.5 2022-02-21 [2] RSPM (R 4.2.0)
## zip 2.2.0 2021-05-31 [2] RSPM (R 4.2.0)
##
## [1] /home/lluis/bin/R/4.2.1
## [2] /usr/lib/R/site-library
## [3] /usr/lib/R/library
##
## ----------------------------------------------------------------------------------------------------------------------