Reasons why packages are archived on CRAN
In the Repositories working group of the R Consortium, Rich FitzJohn posted a comment pointing to a file that seems to be where the CRAN team stores, and checks, the history of packages.
The structure is not defined anywhere I could find (I haven’t looked much to be honest).
Package: <package name>
X-CRAN-Comment: Archived on YYYY-MM-DD as <reason>.
X-CRAN-History: Archived on YYYY-MM-DD as <reason>.
Unarchived on YYYY-MM-DD.
.
<Optional clarification of archival reason>
<Optional fields like License_restricts_use, Replaced_by, Maintainer: ORPHANED, OS_type: unix>
I think the X-CRAN-Comment is what appears on the website of an archived package, as on the page of the radix package. However, other comments shown on the website do not appear in that file.
In addition, the file is missing records of the archiving and unarchiving of some packages, although it holds records from 2013 or earlier up to now. Still, we can use it to understand the reasons why packages are archived, which seems to be the main purpose of the file.
The data
The first step is to read the records. The file has a key: value structure similar to the DESCRIPTION file of packages, so it appears to be in DCF (Debian Control File) format, which is easy to read with the read.dcf function.
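As a minimal sketch of this step, the snippet below reads a couple of made-up records with `read.dcf()`. The package names and comments are invented for illustration, and the temporary file is only a stand-in for the real file.

```r
# Hypothetical records that imitate the structure described above;
# they are NOT real CRAN data.
sample_dcf <- c(
  "Package: pkgA",
  "X-CRAN-Comment: Archived on 2021-03-04 as check problems were not",
  "  corrected in time.",
  "",
  "Package: pkgB",
  "X-CRAN-History: Archived on 2019-01-02 as requires archived package 'pkgA'.",
  "  Unarchived on 2019-02-03.",
  "Replaced_by: pkgC"
)

# A temporary file stands in for the real file.
path <- tempfile(fileext = ".dcf")
writeLines(sample_dcf, path)

# read.dcf() returns a character matrix: one row per record, one column per
# field seen in the file, with NA where a record lacks that field.
records <- read.dcf(path)

dim(records)
colnames(records)
records[, "Package"]
is.na(records[, "Replaced_by"])
```

Keeping the result as a matrix avoids any renaming of fields such as `X-CRAN-Comment`, whose hyphens are not valid in syntactic R names.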
Exploring
A brief exploration of the data:
comment | history | packages |
---|---|---|
yes | no | 3612 |
no | yes | 2345 |
yes | yes | 434 |
no | no | 70 |
Many packages have either a comment or a history, but relatively few have both. I’m not sure when each of them is used, as I would expect every package with a history to also have a comment.
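A cross-tabulation like the one above can be sketched as follows. The matrix is a hypothetical stand-in for what `read.dcf()` returns, and `has_field()` is a helper introduced only for this example.

```r
# Hypothetical stand-in for the matrix returned by read.dcf(); NA marks a
# missing field. None of these rows are real CRAN records.
records <- matrix(
  c("pkgA", "pkgB", "pkgC", "pkgD",
    "Archived on 2021-03-04.", NA, "Archived on 2020-01-01.", NA,
    NA, "Archived on 2019-01-02.", "Archived on 2018-05-06.", NA),
  ncol = 3,
  dimnames = list(NULL, c("Package", "X-CRAN-Comment", "X-CRAN-History"))
)

# Illustrative helper: label a field as present ("yes") or absent ("no").
has_field <- function(x) ifelse(is.na(x), "no", "yes")

# Count packages by presence of a comment and of a history.
counts <- table(
  comment = has_field(records[, "X-CRAN-Comment"]),
  history = has_field(records[, "X-CRAN-History"])
)

# Long format, one row per combination, similar to the table above.
as.data.frame(counts, responseName = "packages")
```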
Replaced_by | packages |
---|---|
no | 6360 |
yes | 101 |
About a hundred packages are simply replaced by some other package.
Maintainer | packages |
---|---|
no | 6366 |
yes | 95 |
Most of the packages that have a Maintainer field are orphaned/archived. Does it mean that all the others are not orphaned?
Extracting reasons
Now that the data is in an R data structure, we can extract the relevant information for each archival event: the date, the type of action and the reason. I use strcapture for this task, with a regex that extracts the action, the date and the explanation it might have.
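The following is a minimal sketch of that idea, not the exact expression used for this post. The event strings are made up, and the pattern is an assumption that only covers the simple "<Action> on <date> as <reason>." shape.

```r
# Hypothetical event strings in the "<Action> on <date> as <reason>." shape
# shown earlier; they are not taken from the real file.
events <- c(
  "Archived on 2021-03-04 as check problems were not corrected in time.",
  "Unarchived on 2021-05-06.",
  "Orphaned on 2020-01-02 as maintainer address bounced."
)

# Three capture groups: action, ISO date, and whatever follows the date.
# The last group may be empty, so every group always takes part in the match.
pattern <- "^([A-Za-z]+) on ([0-9]{4}-[0-9]{2}-[0-9]{2})(.*)$"

# The prototype names the output columns and fixes their types.
proto <- data.frame(
  action = character(),
  date = character(),
  reason = character(),
  stringsAsFactors = FALSE
)

parsed <- utils::strcapture(pattern, events, proto)

# Tidy up: lower-case actions, real dates, reasons without " as " or final dot.
parsed$action <- tolower(parsed$action)
parsed$date <- as.Date(parsed$date)
parsed$reason <- sub("\\.$", "", sub("^ as ", "", parsed$reason))
parsed
```

Strings that do not match the pattern come back as a row of `NA`, which makes it easy to spot entries that need a different or hand-written rule.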
I don’t know how the file is written; it is probably a mix of automated tools and manual editing, so there isn’t a simple way to collect all the information in a structured way. The structure and the level of detail stored have changed over the years, and some events are missing. However, the extracted information should be enough for our purposes.
Action | Events |
---|---|
archived | 7096 |
orphaned | 341 |
removed | 113 |
renamed | 2 |
replaced | 4 |
unarchived | 2973 |
As expected, the most common recorded events are archivals, but there are also some orphaned packages and even some removed packages. Also note that the number of orphaning events (341) is greater than the number of packages with a Maintainer field (95), which supports my theory that the format has changed and that this shouldn’t be taken as an exhaustive and complete analysis of archivals.
How are they distributed over time?
Even though some events are recorded from 2009, the file seems to have been used more in recent years (the last commit related to this was in 2015). I know that some old events are not recorded in the file, because some packages currently on CRAN have an archived action but no matching unarchived action; the converse could happen too. So this doesn’t necessarily mean that more packages are being archived from CRAN now. But it is a clear indication that the file now keeps a more accurate record of archived packages.
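The yearly counts behind such a view over time can be tabulated in a few lines. This is only a sketch on hypothetical parsed events; the column names `package`, `action` and `date` are assumptions made for the example.

```r
# Hypothetical parsed events (one row per event); not real CRAN data.
events <- data.frame(
  package = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgC"),
  action  = c("archived", "unarchived", "archived", "archived", "orphaned"),
  date    = as.Date(c("2019-03-01", "2019-06-15", "2020-02-10",
                      "2020-11-30", "2021-01-05")),
  stringsAsFactors = FALSE
)

# Count events per calendar year and action.
events$year <- format(events$date, "%Y")
per_year <- table(year = events$year, action = events$action)
per_year
```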
Another source of records of archived packages might be cranberries. It would be nice to compare this file with the records on the database there.
Now that most of the package events are collected and we have the reasons for the actions, we can explore and classify those reasons. Using some simple regular expressions, I search for keywords or phrases.
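A keyword-based classification of this kind might look like the sketch below. The reasons are invented, and the regular expressions are illustrative assumptions rather than the patterns actually used for this post.

```r
# Hypothetical reasons, written in the style of the file; not real records.
reasons <- c(
  "check problems were not corrected in time",
  "requires archived package 'pkgA'",
  "at the request of the maintainer",
  "issues were not corrected despite reminders",
  "maintainer address bounced"
)

# Illustrative patterns only: the category names mirror columns used later in
# this post, but the regular expressions themselves are assumptions.
patterns <- c(
  package_not_corrected = "not corrected",
  request_maintainer    = "request of the maintainer",
  dependencies          = "requires archived package|depends on"
)

# One logical column per category; a reason may match several or none.
flags <- vapply(patterns, grepl, logical(length(reasons)),
                x = reasons, ignore.case = TRUE)

# Extra column marking reasons that no pattern caught.
flags <- cbind(flags, unmatched = rowSums(flags) == 0)

colSums(flags)
```

Keeping one logical column per category, instead of a single label, lets an event carry several reasons at once.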
We can look at the most frequent reasons for archiving packages, using the patterns I found with more than 100 cases:
The most frequent reason is that errors or check problems are not corrected, even after reminders.
Next come the packages archived because they depend on other packages that are no longer on CRAN.
Some packages are replaced by others, and some maintainers might not want to continue supporting a package when they receive a message from CRAN about fixing an error.
Policy violations make it into the top 5, but with fewer than 500 events. Dependency problems are the sixth cause, followed by email errors (bouncing, incorrect email…) and then very sporadic problems with licenses, not fixing issues after R updates, authorship problems or requests from authors.
Some of these reasons occur together in the same event. Grouping them together, we get a similar table:
package_not_corrected | request_maintainer | dependencies | other | events |
---|---|---|---|---|
yes | no | no | no | 4366 |
no | no | no | no | 1530 |
no | no | yes | no | 767 |
no | no | no | yes | 374 |
yes | no | no | yes | 15 |
yes | no | yes | no | 13 |
no | no | yes | yes | 2 |
yes | no | yes | yes | 2 |
yes | yes | no | no | 2 |
yes | yes | no | yes | 1 |
Surprisingly, the second most frequent group of archiving actions is due to many different reasons. This is probably Pareto’s principle in action: they account for roughly a fifth of the archiving events in the table above (1530 of 7072), but their causes are very diverse.
However, if we look at the packages that were archived (not at the request of maintainers), most of them were archived just once:
Events | packages |
---|---|
1 | 5304 |
2 | 594 |
3 | 115 |
4 | 31 |
5 | 8 |
This suggests that once a package is archived, maintainers do not make the effort to put it back on CRAN, except in a few cases where there are multiple attempts. To check, we can look at the currently available packages and see how many of those are still present on CRAN:
CRAN | Packages | Proportion |
---|---|---|
no | 3869 | 64% |
yes | 2183 | 36% |
Many packages are currently on CRAN despite having been archived in the past, but close to 64% are currently not on CRAN.
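The comparison against the packages currently available can be sketched like this. Both vectors are hypothetical; with network access the second one could come from `rownames(utils::available.packages())` instead.

```r
# Hypothetical inputs: packages with at least one archival event, and the
# packages currently available. Neither vector is real CRAN data.
archived_once <- c("pkgA", "pkgB", "pkgC", "pkgD", "pkgE")
on_cran_now   <- c("pkgB", "pkgE", "pkgZ")

# Flag which previously archived packages are available today.
still_on_cran <- ifelse(archived_once %in% on_cran_now, "yes", "no")
counts <- table(CRAN = still_on_cran)

# Counts next to rounded proportions.
data.frame(
  CRAN = names(counts),
  Packages = as.vector(counts),
  Proportion = sprintf("%.0f%%", 100 * as.vector(prop.table(counts)))
)
```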
Almost all of those that are on CRAN now have no X-CRAN-Comment, except for a few:
Package | X-CRAN-Comment |
---|---|
geiger | Orphaned and corrected on 2022-05-09. Repeated notifications about USE_FC_LEN_T were ignored. |
alphahull | Versions up to 2.3 have been removed for mirepresentation of authorship. |
udunits2 | Orphaned on 2022-01-06 as installation problems were not corrected. |
bibtex | Orphaned and corrected on 2020-09-19 as check problems were not corrected in time. |
The CRAN team might have missed these few packages and not moved the comments to X-CRAN-History.
It also happens that some packages that are not archived don’t have an X-CRAN-History, but they usually have other fields changed.
Discussion
Most packages are archived on CRAN because their maintainers did not correct errors found by the CRAN checks. It is clear that the checks CRAN runs help packages keep a high quality, but they come at a high cost to maintainers and especially to the CRAN team. Maintainers don’t seem to have enough time to fix the issues in time, while the CRAN team sends personalized reminders to maintainers and sometimes even patches for the packages.
Although having packages corrected and free of issues is the common goal, there are a few options in light of this:
Be more restrictive
Prevent a package from being accepted if it breaks dependencies, or archive packages as soon as they fail checks. This would make it harder to keep packages on CRAN but would lift some pressure from the CRAN team. It would go against the trend in other languages’ repositories, which often don’t check the packages/modules and even have fewer restrictions on dependencies (so it might be an unpopular decision).
Be more permissive:
One option would be to allow more time for maintainers to fix issues. I haven’t found any report of how long it takes for a package to go from an error to a fix on CRAN, but often it is quite long. I have seen packages with a warning for months, if not years, that weren’t archived from CRAN.
Another would be to warn users, when they install packages, that a package or one of its dependencies does not pass all CRAN checks cleanly (without errors or warnings). This might make users more conscious of their dependencies, but it might also add pressure on maintainers who already don’t have enough time to fix the problems of their packages.
Provide more help or tools to maintainers
Another option is to provide a mechanism for maintainers to receive help fixing their package. Currently CRAN requires the maintainers of new packages that break dependencies to give other maintainers enough notice in advance to fix their packages. On the R-pkg-devel mailing list there are often requests for help with submissions and with fixing errors detected by CRAN checks, which often result in other maintainers sharing their solutions to the same problem.
The high percentage of packages that do not come back to CRAN once archived might be a good place to start helping maintainers, and an opportunity for users to step in and help the maintainers of packages they have been using. Is there a need for something else? How would that work?
At the same time, it is admirable that after so many years there are so few errors in the data. However, the archival process might be a good candidate for automation: providing the reason on the webpage, adding it to X-CRAN-Comment, and moving the comment to X-CRAN-History once the package is unarchived. Knowing more about how these actions are performed by the CRAN team and how the community could help with the process would be beneficial to all.
Note: This post was updated on 2022/01/02 to improve the parsing of actions and dates of packages. This changed the first plot, which now includes unarchived events, and slightly modified the second plot of reasons why packages are archived. Overall, it only affected the numbers in the plots, not the conclusions or the discussion.
Reproducibility
## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.0 (2022-04-22)
## os Ubuntu 20.04.4 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Europe/Madrid
## date 2022-05-09
## pandoc 2.17.1.1 @ /usr/lib/rstudio/bin/quarto/bin/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
## blogdown 1.9 2022-03-28 [1] CRAN (R 4.2.0)
## bookdown 0.26 2022-04-15 [1] CRAN (R 4.2.0)
## bslib 0.3.1 2021-10-06 [1] CRAN (R 4.2.0)
## cli 3.3.0 2022-04-25 [1] CRAN (R 4.2.0)
## colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.0)
## ComplexUpset * 1.3.3 2021-12-11 [1] CRAN (R 4.2.0)
## crayon 1.5.1 2022-03-26 [1] CRAN (R 4.2.0)
## DBI 1.1.2 2021-12-20 [1] CRAN (R 4.2.0)
## digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.0)
## dplyr * 1.0.9 2022-04-28 [1] CRAN (R 4.2.0)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
## evaluate 0.15 2022-02-18 [1] CRAN (R 4.2.0)
## fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0)
## farver 2.1.0 2021-02-28 [1] CRAN (R 4.2.0)
## fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
## generics 0.1.2 2022-01-31 [1] CRAN (R 4.2.0)
## ggplot2 * 3.3.6 2022-05-03 [1] CRAN (R 4.2.0)
## glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 4.2.0)
## highr 0.9 2021-04-16 [1] CRAN (R 4.2.0)
## htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.2.0)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.2.0)
## jsonlite 1.8.0 2022-02-22 [1] CRAN (R 4.2.0)
## knitr 1.39 2022-04-26 [1] CRAN (R 4.2.0)
## labeling 0.4.2 2020-10-20 [1] CRAN (R 4.2.0)
## lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.2.0)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
## patchwork 1.1.1 2020-12-17 [1] CRAN (R 4.2.0)
## pillar 1.7.0 2022-02-01 [1] CRAN (R 4.2.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
## purrr 0.3.4 2020-04-17 [1] CRAN (R 4.2.0)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
## rlang 1.0.2 2022-03-04 [1] CRAN (R 4.2.0)
## rmarkdown 2.14 2022-04-25 [1] CRAN (R 4.2.0)
## rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.2.0)
## sass 0.4.1 2022-03-23 [1] CRAN (R 4.2.0)
## scales 1.2.0 2022-04-13 [1] CRAN (R 4.2.0)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
## stringi 1.7.6 2021-11-29 [1] CRAN (R 4.2.0)
## stringr 1.4.0 2019-02-10 [1] CRAN (R 4.2.0)
## tibble 3.1.7 2022-05-03 [1] CRAN (R 4.2.0)
## tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.2.0)
## utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0)
## vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.2.0)
## withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
## xfun 0.30 2022-03-02 [1] CRAN (R 4.2.0)
## yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.0)
##
## [1] /home/lluis/bin/R/4.2.0/lib/R/library
##
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────