CRAN review

I’ve been analyzing the review submissions of several projects in the R ecosystem. Until recently, however, I couldn’t analyze CRAN submissions. The cransays package checks package submissions and its online documentation provides a dashboard that updates every hour. Since 2020/09/12 the status of the submission queue and its folders has been saved on a branch of the repository. Using this information, and building on a script provided by Tim Taylor, I’ll check how submissions to CRAN are handled.

I’ll look at the CRAN queue, explore some time patterns, and check the meaning of those subfolders. Then I’ll move on to more practical information for people submitting a package. Lastly, we’ll get a sense of how hard the CRAN team’s job is by looking at the reliability of the GitHub Action used to collect the data.

Before all this we need some preliminary work to download and clean the data:

# Downloading the cransays repository branch history
download.file("https://github.com/lockedata/cransays/archive/history.zip", 
              destfile = "static/cransays-history.zip")
path_zip <- here::here("static", "cransays-history.zip") 
# We unzip the files to read them
dat <- unzip(path_zip, exdir = "static")
csv <- dat[grepl("*.csv$", x = dat)]
f <- lapply(csv, read.csv)
m <- function(x, y) {
  merge(x, y, sort = FALSE, all = TRUE)
}
updates <- Reduce(m, f) # Merge all files (Because the file format changed)
write.csv(updates, file = "static/cran_till_now.csv",  row.names = FALSE)
# Clean up
unlink("static/cransays-history/", recursive = TRUE)
unlink("static/cransays-history.zip", recursive = TRUE)
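
The `Reduce(merge)` call with `all = TRUE` is what makes the changing file format harmless; a minimal sketch with hypothetical rows:

```r
# Toy illustration of the Reduce(merge) step above: merge(all = TRUE) keeps
# rows and columns even when the column sets of the CSV files changed over time.
a <- data.frame(package = "pkg1", folder = "inspect")
b <- data.frame(package = "pkg2", folder = "newbies", subfolder = "human")
m <- function(x, y) merge(x, y, sort = FALSE, all = TRUE)
merged <- Reduce(m, list(a, b))
merged  # 2 rows, 3 columns; subfolder is NA for the file that lacked it
```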

Once we have the data we can load it, along with the libraries used for the analysis:

library("tidyverse")
library("lubridate")
library("hms")
path_file <- here::here("static", "cran_till_now.csv")
cran_submissions <- read.csv(path_file)
theme_set(theme_minimal()) # For plotting
col_names <- c("package", "version", "snapshot_time", "folder", "subfolder")
cran_submissions <- cran_submissions[, col_names]

The period we are going to analyze runs from the beginning of the records until 2021-01-30. It includes some well-earned holiday time for the CRAN team, during which submissions were not possible.

I’ve read some comments on inconsistencies in where the CRAN team’s holidays are announced, and I couldn’t find the dates for previous years.

For the four months analyzed, which include only one holiday period, I used a screenshot found on Twitter.

holidays <- data.frame(
  start = as.POSIXct("18/12/2020", format = "%d/%m/%Y", tz = "UTC"), 
  end = as.POSIXct("04/01/2021", format = "%d/%m/%Y", tz = "UTC")
)

Now that we know the holidays and have the data in a single data.frame, it’s time to explore and clean it:

Cleaning the data

After combining the files into one big file we can load and work with it. First step: check the data and prepare it for what we want:

# Use appropriate classes
cran_submissions$snapshot_time <- as.POSIXct(cran_submissions$snapshot_time,
                                             tz = "UTC")
# Fix subfolders structure
cran_submissions$subfolder[cran_submissions$subfolder %in% c("", "/")] <- NA
# Remove files or submissions without version number
cran_submissions <- cran_submissions[!is.na(cran_submissions$version), ]
cran_submissions <- distinct(cran_submissions, 
                             snapshot_time, folder, package, version, subfolder,
                             .keep_all = TRUE)

After this preliminary cleanup (setting the date format, homogenizing the folder format, removing submissions that are not packages, since there are PDFs and other files in the queue, and removing duplicates) we can start.

As always, we start with some checks of the data. Note: I should follow this advice more often myself, as this is the last section I’m writing for the post.

packages_multiple_versions <- cran_submissions %>% 
  group_by(package, snapshot_time) %>% 
  summarize(n = n_distinct(version)) %>% 
  filter(n != 1) %>% 
  distinct(package) %>% 
  pull(package)

There are 92 packages with multiple versions in the CRAN queue at the same time.

Perhaps this is because packages are left in different folders (2 or even 3) at the same time:

package_multiple <- cran_submissions %>% 
  group_by(snapshot_time, package) %>% 
  count() %>% 
  group_by(snapshot_time) %>% 
  count(n) %>% 
  filter(n != 1) %>% 
  summarise(n = sum(nn)) %>% 
  ungroup()
ggplot(package_multiple) +
  geom_point(aes(snapshot_time, n), size = 1) +
  geom_rect(data = holidays, aes(xmin = start, xmax = end, ymin = 0, ymax = 6),
            alpha = 0.25, fill = "red") +
  annotate("text", x = holidays$start + (holidays$end - holidays$start)/2, 
           y = 3.5, label = "CRAN holidays") +
  scale_x_datetime(date_labels = "%Y/%m/%d", date_breaks = "2 weeks", 
                   expand = expansion()) +
  scale_y_continuous(expand = expansion()) +
  labs(title = "Packages in multiple folders and subfolders", 
       x = element_blank(), y = element_blank())

This happens in 1915 of 3260 snapshots, probably due to the manual work of the CRAN reviews. I don’t really know the cause: it could be an error in the script recording the data, or files being copied around the server. But perhaps it indicates that the process could be further improved and automated.

cran_submissions <- cran_submissions %>% 
  arrange(package, snapshot_time, version, folder) %>% 
  group_by(package, snapshot_time) %>% 
  mutate(n = 1:n()) %>% 
  filter(n == n()) %>% 
  ungroup() %>% 
  select(-n)

Now we have removed ~3500 records of packages with two versions in the queue at once. Next we check packages in multiple folders with the same version and remove duplicates until we are left with a single record (under the assumption that there are no parallel steps in the review):

cran_submissions <- cran_submissions %>% 
  arrange(package, snapshot_time, folder) %>% 
  group_by(package, snapshot_time) %>% 
  mutate(n = 1:n()) %>% 
  filter(n == n()) %>% 
  ungroup() %>% 
  select(-n)

Last, we want to know how many submissions each package made in this period:

diff0 <- structure(0, class = "difftime", units = "hours")
cran_submissions <- cran_submissions %>% 
  arrange(package, version, snapshot_time) %>% 
  group_by(package) %>% 
  # Packages last seen in the queue less than 24 hours ago are the same submission
  mutate(diff_time = difftime(snapshot_time,  lag(snapshot_time), units = "hour"),
         diff_time = if_else(is.na(diff_time), diff0, diff_time), # Fill NAs
         diff_v = version != lag(version),
         diff_v = ifelse(is.na(diff_v), TRUE, diff_v), # Fill NAs
         new_version = !near(diff_time, 1, tol = 24) & diff_v, # gap > ~24h and version changed
         new_version = if_else(new_version == FALSE & diff_time == 0, 
                               TRUE, new_version),
         submission_n = cumsum(as.numeric(new_version))) %>%
  ungroup() %>% 
  select(-diff_time, -diff_v, -new_version)

As a fast bug-fix update often follows soon after a release, a package not seen in the queue for 24 hours is considered a new submission. A change of package version after such a gap also counts, but a version change while addressing reviewer feedback does not.
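
A simplified sketch of this rule on hypothetical snapshots (only the 24-hour gap is modelled here, not the version comparison):

```r
# Hypothetical hourly snapshots of one package: three appearances, a 28-hour
# gap, two more appearances, then a 169-hour gap.
snaps <- data.frame(
  time = as.POSIXct("2020-10-01 00:00", tz = "UTC") +
    3600 * c(0, 1, 2, 30, 31, 200),
  version = c("1.0", "1.0", "1.1", "1.1", "1.1", "1.2")
)
gap_h <- c(0, diff(as.numeric(snaps$time)) / 3600)
# A gap of more than 24 hours in the queue starts a new submission
snaps$submission_n <- cumsum(gap_h > 24) + 1
snaps$submission_n
# 1 1 1 2 2 3: the version bump from 1.0 to 1.1 within the same stay
# does not count as a new submission
```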

Now we have the data ready for further analysis.

CRAN load

We all know that CRAN is busy with updates to fix bugs or improve packages, and with requests to include new packages in the repository.

A first plot we can make shows the number of distinct packages in the queue at each moment:

cran_queue <- cran_submissions %>% 
  group_by(snapshot_time) %>% 
  summarize(n = n_distinct(package))
ggplot(cran_queue) +
  geom_rect(aes(xmin = start, xmax = end, ymin = 0, ymax = 230),
            alpha = 0.5, fill = "red", data = holidays) +
  annotate("text", x = holidays$start + (holidays$end - holidays$start)/2, 
           y = 150, label = "CRAN holidays") +
  geom_path(aes(snapshot_time, n)) +
  scale_x_datetime(date_labels = "%Y/%m/%d", date_breaks = "2 weeks", 
                   expand = expansion()) +
  scale_y_continuous(expand = expansion()) +
  labs(x = element_blank(), y = element_blank(), 
       title = "Packages on CRAN review process")


We can see some ups and downs, with the queue ranging between 50 and 200 packages. (Note: several versions of the same package can be present at the same time, which makes this count wrong in those cases.)

There are some instances where the number of packages in the queue drops suddenly and then recovers to previous levels. This might be an error in the CRAN system, or in the cransays script. We can also see that people do not tend to rush to push their packages before the holidays. But there is clearly a build-up of submissions after the holidays, as the highest number of packages in the queue is reached after them.

In the CRAN review process, classifying packages into folders seems to be part of the workflow:

man_colors <- RColorBrewer::brewer.pal(8, "Dark2")
names(man_colors) <- unique(cran_submissions$folder)
cran_submissions %>% 
  group_by(folder, snapshot_time) %>% 
  summarize(packages = n_distinct(package)) %>% 
  ggplot() +
  geom_rect(data = holidays, aes(xmin = start, xmax = end, ymin = 0, ymax = 200),
            alpha = 0.25, fill = "red") +
  annotate("text", x = holidays$start + (holidays$end - holidays$start)/2, 
           y = 105, label = "CRAN holidays") +
  geom_path(aes(snapshot_time, packages, col = folder)) +
  scale_x_datetime(date_labels = "%Y/%m/%d", date_breaks = "2 weeks", 
                   expand = expansion()) +
  scale_y_continuous(expand = expansion()) +
  scale_color_manual(values = man_colors) +
  labs(x = element_blank(), y = element_blank(),
       title = "Packages by folder", col = "Folder") +
  theme(legend.position = c(0.6, 0.7))

The queue trend is mostly driven by the newbies folder (which ranges between 25 and 150) and, after the holidays, by the pretest folder.

Surprisingly, when the queue is split by folder we don’t see those sudden drops. This might indicate that there is a clean-up in some of the folders1. What we clearly see is a clean-up of all the folders during the holidays, when almost everything was cleared.

Also, the pretest folder might indicate that the packages there are just updates, as there isn’t any later increase in the other folders. We can see this if we focus on the holidays and the period after them:

cran_submissions %>% 
  group_by(folder, snapshot_time) %>% 
  summarize(packages = n_distinct(package)) %>% 
  filter(snapshot_time >= holidays$start) %>% 
  ggplot() +
  geom_path(aes(snapshot_time, packages, col = folder)) +
  geom_rect(data = holidays, aes(xmin = start, xmax = end, ymin = 0, ymax = 200),
            alpha = 0.25, fill = "red") +
  annotate("text", x = holidays$start + (holidays$end - holidays$start)/2, 
           y = 105, label = "CRAN holidays") +
  scale_x_datetime(date_labels = "%Y/%m/%d", date_breaks = "1 day", 
                   expand = expansion()) +
  scale_y_continuous(expand = expansion(), limits = c(0, NA)) +
  scale_color_manual(values = man_colors) +
  labs(x = element_blank(), y = element_blank(),
       title = "Holidays", col = "Folder") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = c(0.8, 0.7))

It seems that on the 31st there was a clean-up of some packages on the waiting list. And we can see the increase in submissions during the first week of January.

Time patterns

Some people have expressed the view that it’s best to submit to CRAN when there are few packages in the queue. Thus, knowing when these low moments happen could be relevant. We can look for patterns in the queue:

Note: I have little to no experience with time series, so the following plots just use the defaults of `geom_smooth`, omitting the holidays.

By day of the month

Looking by folder:

cran_times <- cran_submissions %>% 
  mutate(seconds = seconds(snapshot_time),
         month = month(snapshot_time),
         mday = mday(snapshot_time),
         wday = wday(snapshot_time, locale = "en_GB.UTF-8"),
         week = week(snapshot_time),
         date = as_date(snapshot_time))
cran_times %>% 
  arrange(folder, date, mday) %>% 
  filter(snapshot_time < holidays$start | snapshot_time  > holidays$end) %>% 
  group_by(folder, date, mday) %>% 
  summarize(packages = n_distinct(package),
            week = unique(week)) %>% 
  group_by(folder, mday) %>% 
  ggplot() +
  # geom_point(aes(mday, packages, col = fct_reorder2(folder, mday, packages, sum))) +
  geom_smooth(aes(mday, packages, col = folder)) +
  labs(x = "Day of the month", y = element_blank(), col = "Folder",
       title = "Evolution by month day") +
  scale_color_manual(values = man_colors) +
  coord_cartesian(ylim = c(0, NA), xlim = c(1, NA)) +
  scale_x_continuous(expand = expansion()) +
  scale_y_continuous(expand = expansion()) 

At the beginning and end of the month there is more variation in several folders (this could also be because there is no information for the end of December and the beginning of January). There seems to be an increase in new package submissions towards the middle of the month, and an increase in the human folder at the beginning of the month.

By day of the week

I first thought about this because I was curious whether there are more submissions on weekends (when aficionados and open-source developers might have more time) than during the rest of the week.

cran_times %>% 
  filter(snapshot_time < holidays$start | snapshot_time  > holidays$end) %>% 
  group_by(folder, date, wday) %>% 
  summarize(packages = n_distinct(package),
            week = unique(week)) %>% 
  group_by(folder, wday) %>% 
  ggplot() +
  # geom_point(aes(wday, packages, col = fct_reorder2(folder, wday, packages, sum))) +
  geom_smooth(aes(wday, packages, col = folder)) +
  labs(x = "Day of the week", y = "Packages", col = "Folder",
       title = "Evolution by week day") +
  scale_color_manual(values = man_colors) +
  scale_x_continuous(breaks = 1:7, expand = expansion()) +
  scale_y_continuous(expand = expansion(), limits = c(0, NA))

There doesn’t seem to be any special pattern; only in the human folder is there an increase on weekdays and a decrease on the weekend. What we do see is a rise towards the middle of the week in the pretest folder, indicating new package submissions.

Other folders

There are some subfolders that seem to belong to individual CRAN team members. We see that some packages are in these subfolders:

cran_members <- c("LH", "GS", "JH")
cran_times %>% 
  filter(subfolder %in% cran_members) %>% 
  group_by(subfolder, snapshot_time) %>% 
  summarize(packages = n_distinct(package)) %>% 
  ggplot() +
  geom_smooth(aes(snapshot_time, packages, col = subfolder)) +
    labs(x = element_blank(), y = element_blank(), col = "Folder",
       title = "Packages on folders") +
  scale_y_continuous(expand = expansion(), breaks = 0:10) +
  coord_cartesian(y = c(0, NA))  +
  scale_x_datetime(date_labels = "%Y/%m/%d", date_breaks = "2 weeks", 
               expand = expansion(add = 2))

There doesn’t seem to be any rule about using those folders, or the work was quick enough not to be recorded.

Looking for a temporal pattern isn’t worth it:

cran_times %>% 
  filter(subfolder %in% cran_members) %>% 
  group_by(subfolder, mday) %>% 
  summarize(packages = n_distinct(package)) %>% 
  ungroup() %>% 
  ggplot() +
  geom_smooth(aes(mday, packages, col = subfolder)) +
  labs(x = "Day of the month", y = element_blank(), col = "Subfolder",
       title = "Packages on subfolders by day of the month") +
  scale_y_continuous(expand = expansion()) +
  scale_x_continuous(expand = expansion(), breaks = c(1,7,14,21,29)) +
  coord_cartesian(ylim = c(0, NA))

There is a low number of packages and great variability (except for those subfolders that just have one package) by day of the month.

cran_times %>% 
  filter(subfolder %in% cran_members) %>% 
  group_by(subfolder, wday) %>% 
  summarize(packages = n_distinct(package)) %>% 
  ungroup() %>% 
  ggplot() +
  geom_smooth(aes(wday, packages, col = subfolder)) +
  labs(x = "Day of the week", y = element_blank(), col = "Subfolder",
       title = "Evolution by week day") +
  scale_y_continuous(expand = expansion()) +
  scale_x_continuous(breaks = 1:7, expand = expansion()) +
  coord_cartesian(ylim =  c(0, NA))

There seem to be only 2 people usually working with their own folders. I suppose there isn’t a common set of rules that the reviewers follow.

Information for submitters

I’ve read lots of comments recently about CRAN submissions. However, with the little data available, compared to the open reviews of Bioconductor and rOpenSci, it is hard to answer them (see those related posts). On Bioconductor and rOpenSci it is possible to see the people involved, the messages from the reviewers and other interested parties, the steps taken before acceptance, and more.

One of the big questions we can provide information about with the available data is how long a package will be in the queue:

subm <- cran_times %>%
  group_by(package, submission_n) %>% 
  arrange(snapshot_time) %>% 
  select(package, version, submission_n, snapshot_time) %>% 
  filter(row_number() %in% c(1, last(row_number()))) %>% 
  arrange(package, submission_n)

There are 429 packages that are only seen once. This might mean an abandoned, delayed or rejected submission; others might be acceptances in less than an hour2.

If we look at the package submissions by date we can see the quick increase in packages:

rsubm <- subm %>% 
  filter(n_distinct(snapshot_time) %% 2 == 0) %>%
  select(-version) %>% 
  mutate(time = c("start", "end")) %>% 
  pivot_wider(values_from = snapshot_time, names_from = time) %>% 
  ungroup() %>% 
  mutate(r = row_number(), 
         time  =  round(difftime(end, start, units = "hour"), 0)) %>% 
  group_by(package) %>% 
  mutate(n_submission = 1:n()) %>% 
  ungroup()
lv <- levels(fct_reorder(rsubm$package, rsubm$start, .fun = min, .desc = FALSE))
ggplot(rsubm) +
  geom_rect(data = holidays, aes(xmin = start, xmax = end), 
            ymin = first(lv), ymax = last(lv), alpha = 0.5, fill = "red") +
  geom_linerange(aes(y = fct_reorder(package, start, .fun = min, .desc = FALSE),
                      x = start, xmin = start, xmax = end, 
                     col = as.factor(n_submission))) + 
  labs(x = element_blank(), y = element_blank(), title = 
         "Packages on the queue", col = "Submissions") +
  scale_x_datetime(date_labels = "%Y/%m/%d", date_breaks = "2 weeks", 
                   expand = expansion(add = 2)) +
  scale_colour_viridis_d() +
  theme_minimal() +
  theme(panel.grid.major.y = element_blank(),
        axis.text.y = element_blank(),
        legend.position = c(0.15, 0.7))

We can see the rapid growth in packages, and that some people submitted their packages more than 5 times in this period.

Also note that the definition used for a submission is a package with a different version number. Some authors change the version number when CRAN reviewers require changes before accepting the package on CRAN, while others do not and only change the version number according to their release cycle.

rsubm %>% 
  arrange(start) %>% 
  filter(start < holidays$start, # Look only before the holidays
    submission_n == 1,# Use only the first submission
    start > min(start)) %>%   # Just new submissions
  mutate(r = row_number(),
         start1 = as.numeric(seconds(start))) %>% 
  lm(start1 ~ r, data = .) %>% 
  broom::tidy() %>%  
  mutate(estimate = estimate/(60*60)) # Hours
## # A tibble: 2 x 5
##   term         estimate std.error statistic p.value
##   <chr>           <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) 444325.     6321.     253047.       0
## 2 r                1.03      4.76      779.       0

More or less there is a new package submission every hour.
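
As a sanity check of that reading of the slope (synthetic data, not the CRAN records): if submissions arrive one per hour, regressing the start time in seconds on the submission rank recovers a slope of about 3600 seconds.

```r
set.seed(42)
r <- 1:100
# Hypothetical hourly arrivals with one minute of timing noise
start_s <- 3600 * r + rnorm(100, sd = 60)
fit <- lm(start_s ~ r)
unname(coef(fit)["r"]) / 3600  # close to 1 hour per new submission
```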

Despite this submission rate we can see that most submissions are in the queue for a short time:

library("patchwork")
p1 <- rsubm %>% 
  group_by(package) %>% 
  summarize(time = sum(time)) %>% 
  ggplot() +
  geom_histogram(aes(time), bins = 100) +
  labs(title = "Packages total time on queue", x = "Hours", 
       y = element_blank()) +
  scale_x_continuous(expand = expansion()) +
  scale_y_continuous(expand = expansion())
p2 <- rsubm %>% 
  group_by(package) %>% 
  summarize(time = sum(time)) %>% 
  ggplot() +
  geom_histogram(aes(time), binwidth = 24) +
  coord_cartesian(xlim = c(0, 24*7)) +
  labs(subtitle = "Zoom", x = "Hours", y = element_blank()) +
  scale_x_continuous(expand = expansion(), breaks = seq(0, 24*7, by = 24)) +
  scale_y_continuous(expand = expansion()) +
  theme(panel.background = element_rect(colour = "white"))
p1 + inset_element(p2, 0.2, 0.2, 1, 1)

Also note that I found some packages that remained in the submission queue, and thus were picked up by cransays, even after acceptance.

p1 <- rsubm %>% 
  group_by(package, submission_n) %>% 
  summarize(time = sum(time)) %>% 
  ggplot() +
  geom_histogram(aes(time), bins = 100) +
  labs(title = "Submission time on queue", x = "Hours", 
       y = element_blank()) +
  scale_x_continuous(expand = expansion()) +
  scale_y_continuous(expand = expansion())
p2 <- rsubm %>% 
  group_by(package, submission_n) %>% 
  summarize(time = sum(time)) %>%  
  ggplot() +
  geom_histogram(aes(time), binwidth = 24) +
  coord_cartesian(xlim = c(0, 24*7)) +
  labs(subtitle = "Zoom", x = "Hours", y = element_blank()) +
  scale_x_continuous(expand = expansion(), breaks = seq(0, 24*7, by = 24)) +
  scale_y_continuous(expand = expansion()) +
  theme(panel.background = element_rect(colour = "white"))
p1 + inset_element(p2, 0.2, 0.2, 1, 1)

By submission, many more are short-lived. Perhaps this hints that more testing should be done beforehand, or that what to expect from the review should be clearer to authors, or that they are approved very fast, or…

subm2 <- cran_times %>%
  group_by(package, submission_n, folder) %>% 
  arrange(snapshot_time) %>% 
  select(package, version, submission_n, snapshot_time, folder) %>% 
  filter(row_number() %in% c(1, last(row_number()))) %>% 
  arrange(submission_n)
rsubm2 <- subm2 %>% 
  filter(n_distinct(snapshot_time) %% 2 == 0) %>%
  mutate(time = c("start", "end")) %>% 
  pivot_wider(values_from = snapshot_time, names_from = time) %>% 
  ungroup() %>% 
  mutate(r = row_number(), 
         time  =  round(difftime(end, start, units = "hour"), 0)) %>% 
  ungroup() %>% 
  filter(!is.na(start), !is.na(end))
lv <- levels(fct_reorder(rsubm2$package, rsubm2$start, .fun = min, .desc = FALSE))
ggplot(rsubm2) +
  geom_rect(data = holidays, aes(xmin = start, xmax = end), 
            ymin = first(lv), ymax = last(lv), alpha = 0.5, fill = "red") +
  geom_linerange(aes(y = fct_reorder(package, start, .fun = min, .desc = FALSE),
                      x = start, xmin = start, xmax = end, col = folder)) + 
  labs(x = element_blank(), y = element_blank(), title = 
         "Packages on the queue") +
  scale_color_manual(values = man_colors) +
  scale_x_datetime(date_labels = "%Y/%m/%d", date_breaks = "2 weeks", 
               expand = expansion(add = 2)) +
  labs(col = "Folder") +
  theme_minimal() +
  theme(panel.grid.major.y = element_blank(),
        axis.text.y = element_blank(),
        legend.position = c(0.2, 0.7))

Looking at the submissions by the folder they are in, it looks like some packages take a long time to change folder. Some packages are recorded in just one folder and some others go through multiple folders:

rsubm2 %>% 
  group_by(package, submission_n) %>% 
  summarize(n_folder = n_distinct(folder)) %>% 
  ggplot() + 
  geom_histogram(aes(n_folder)) +
  labs(title = "Folders by submission", x = element_blank(), 
       y = element_blank())

Most submissions end up in one folder, but some go up to 5 folders.

subm_folder <- cran_times %>% 
  group_by(package, submission_n) %>% 
  distinct(folder) %>% 
  summarise(folder = list(folder)) %>% 
  ungroup() %>% 
  count(folder, sort = TRUE)

The 5 most common folder sequences for submissions are: pretest; pretest, inspect; pretest, newbies; inspect; and pretest, newbies, human.

Another way of deciding whether it is the right moment to submit your package, aside from how many packages are in the queue, is looking at how much activity there is:

subm3 <- cran_times %>%
  arrange(snapshot_time) %>% 
  group_by(package) %>% 
  mutate(autor_change = submission_n != lag(submission_n),
         cran_change = folder != lag(folder)) %>% 
  mutate(autor_change = ifelse(is.na(autor_change), TRUE, autor_change),
         cran_change = ifelse(is.na(cran_change), FALSE, cran_change)) %>% 
  mutate(cran_change = case_when(subfolder != lag(subfolder) ~ TRUE,
                                 TRUE ~ cran_change)) %>% 
  ungroup()
subm3 %>% 
  group_by(snapshot_time) %>% 
  summarize(autor_change = sum(autor_change), cran_change = sum(cran_change)) %>% 
  filter(row_number() != 1) %>% 
  filter(autor_change != 0 | cran_change != 0) %>% 
  ggplot() +
  geom_rect(data = holidays, aes(xmin = start, xmax = end), 
            ymin = -26, ymax = 26, alpha = 0.5, fill = "grey") +
  geom_point(aes(snapshot_time, autor_change), fill = "blue", size = 0) +
  geom_area(aes(snapshot_time, autor_change), fill = "blue") +
  geom_point(aes(snapshot_time, -cran_change), fill = "red", size = 0) +
  geom_area(aes(snapshot_time, -cran_change), fill = "red") +
  scale_x_datetime(date_labels = "%Y/%m/%d", date_breaks = "2 weeks", 
                   expand = expansion(add = 2)) +
  scale_y_continuous(expand = expansion(add = c(0, 0))) + 
  coord_cartesian(ylim = c(-26, 26)) +
  annotate("text", label = "Author's", y = 20, x = as_datetime("2020/11/02")) +
  annotate("text", label = "CRAN's", y = -20, x = as_datetime("2020/11/02")) +
  labs(y = "Changes", x = element_blank(), title = "Activity on CRAN:")

In this plot we can see that folder changes and new submissions are not simultaneous.

Review process

There is a diagram of how the review process works. However, it has been pointed out that it needs an update.

We’ve seen which folders come before which, but we haven’t looked at the last folder in which packages appear:

last_seen <- cran_times %>% 
  ungroup() %>% 
  group_by(package, submission_n) %>% 
  arrange(snapshot_time) %>% 
  filter(1:n() == last(n())) %>% 
  ungroup()
count(last_seen, folder, sort = TRUE) %>% 
  knitr::kable(col.names = c("Last folder", "Submissions"))
Last folder   Submissions
pretest              1653
newbies               981
inspect               890
recheck               555
publish               469
waiting               441
human                 332
pending               225

We can see that many submissions were last seen in pretest, and just a minority in the human or publish folders.

Time it takes to disappear from the system

One of the motivations for this post was a question on R-pkg-devel about how long it usually takes for a package to be accepted on CRAN. We can look at how long each submission takes until it is removed from the queue:

package_submissions <- cran_times %>% 
  group_by(package, submission_n) %>% 
  summarise(submission_period = difftime(max(snapshot_time), 
                                         min(snapshot_time), 
                                         units = "hour"),
            submission_time = min(snapshot_time)) %>% 
  ungroup() %>% 
  filter(submission_period != 0)

This is a good approximation of how long it takes a package to be accepted or rejected, but some packages remain in the queue after they are accepted and appear on CRAN. Joining this data with data from metacran we could find out how often this happens, but I leave that for the reader or for another post. Let’s go back to time spent in the queue:

package_submissions %>% 
  # filter(submission_time < holidays$start) %>% 
  ggplot() +
  geom_point(aes(submission_time, submission_period, col = submission_n)) +
  geom_rect(data = holidays, aes(xmin = start, xmax = end),
            ymin = 0, ymax = 3500, alpha = 0.5, fill = "red") + 
  scale_x_datetime(date_labels = "%Y/%m/%d", date_breaks = "2 weeks",
                   expand = expansion(add = 10)) +
  scale_y_continuous(expand = expansion(add = 10)) +
  labs(title = "Time on the queue according to the submission",
       x = "Submission", y = "Time (hours)", col = "Submission") +
  theme(legend.position = c(0.5, 0.8))

These diagonals suggest that the work is done in batches within a day or an afternoon. The prominent diagonal after the holidays corresponds to packages that were still in the queue.

package_submissions %>% 
  ungroup() %>%
  mutate(d = as.Date(submission_time)) %>%
  group_by(d) %>% 
  summarize(m = median(submission_period)) %>% 
  ggplot() +
  geom_rect(data = holidays, aes(xmin = as.Date(start), xmax = as.Date(end)),
            ymin = 0, ymax = 80, alpha = 0.5, fill = "red") + 
  geom_smooth(aes(d, m)) +
  coord_cartesian(ylim = c(0, NA)) +
  labs(x = "Submission", y = "Daily median time in queue (hours)", 
       title = "Submission time")

package_submissions %>% 
  group_by(submission_n) %>% 
  mutate(submission_n = as.character(submission_n)) %>% 
  ggplot() +
  geom_jitter(aes(submission_n, submission_period), height = 0) +
  labs(title = "Submission time in queue", y = "Hours", x = element_blank()) +
  scale_y_continuous(limits = c(1, NA), expand = expansion(add = c(0, 2)))
## Warning: Removed 142 rows containing missing values (geom_point).

Surprisingly, sometimes a submission goes missing from the folders for some days (I checked with one package I submitted: it didn’t appear for 7 days although it was in the queue). This might affect the analysis, as such packages will be counted as new submissions when some of them aren’t.

package_submissions %>% 
  filter(submission_period != 0) %>% 
  group_by(submission_n) %>% 
  mutate(submission_n = as.character(submission_n)) %>% 
  filter(n() > 5) %>% 
  summarize(median(submission_period)) %>% 
  knitr::kable(col.names = c("Submission", "Median time (hours)"))
Submission   Median time (hours)
1                       36.12750
2                       18.27264
3                       16.46528
4                       11.27389
5                       13.37028
6                       38.08417

So it usually takes more than a day for a first submission, and around 12 to 18 hours for later ones.

GitHub Actions reliability

The data for this post was collected by cransays using GitHub Actions. At the same time, I set up a bot/package to scan all the available packages on Bioconductor and CRAN, to track which packages are accepted and when. I left it running for 3 months without checking whether I could recover the data; when I started writing this post I found the data is not usable in its current format.

Instead, I’ll use the cransays data to test how reliable GitHub Actions are.

download.file("https://github.com/llrs/cranis/archive/history.zip",
              destfile = "static/cranis-history.zip")
path_zip <- here::here("static", "cranis-history.zip")
dat <- unzip(path_zip, list = TRUE)
csv <- dat$Name[grepl("*.csv$", x = dat$Name)]
dates <- gsub("available-packages-(.+)\\.csv", "\\1", basename(csv))
date <- as.POSIXct(dates, format = "%Y%m%dT%H%M", "UTC")
gha <- data.frame(
  month = month(date), 
  mday = mday(date), 
  wday = wday(date), 
  week = week(date),
  minute = minute(date), 
  hour = hour(date), 
  type = "cranis")
# Combine the cranis and cransays snapshot times
gha <- rbind(gha,
             cbind(cran_times[, c("month", "mday", "wday", "week")],
                   minute = minute(cran_times$snapshot_time),
                   hour = hour(cran_times$snapshot_time),
                   type = "cransays")) %>%
  distinct()
gha %>% 
  ggplot() +
  geom_violin(aes(as.factor(hour), minute)) +
  scale_y_continuous(expand = expansion(add = 0.5), 
                     breaks = c(0, 15, 30, 45, 60), limits = c(0, 60)) +
  scale_x_discrete(expand = expansion())  +
  labs(x = "Hour", y = "Minute", title = "Daily variation")

There seems to be a lower limit of around 10 minutes, except for some builds that I think were manually triggered. Aside from this, there is usually low variation and the process ends around ~15 minutes, but it can end much later. And this is just one simple script scraping a site; compared to thousands of package builds and checks it is much simpler.

And lastly, how reliable is it?

We can compare how many hours elapsed between the first and the last report with how many reports we actually have recorded. If we have fewer than expected, this indicates failed runs on GitHub Actions.
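A minimal sketch of this check, assuming `date` holds the snapshot timestamps parsed above (the names `expected` and `recorded` are mine, not from the original script):

```r
# Hours that should have a snapshot: one per hour from first to last, inclusive
expected <- as.numeric(difftime(max(date), min(date), units = "hours")) + 1
# Snapshots actually recorded
recorded <- length(unique(date))
# Share of hourly runs that succeeded; fewer recorded than expected means failed runs
recorded / expected
```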

So the script and GitHub Actions worked around 96.96% of the time.

These numbers are great, but on CRAN and Bioconductor all packages are checked daily, on several operating systems, consistently.

Conclusions

One big difference between CRAN and Bioconductor or rOpenSci is that on CRAN, even if your package is already included, each submission to fix something gets reviewed by someone. This ensures high package quality, but also increases the work for the reviewers.

The next big difference is the lack of transparency in the review process itself. Perhaps this is because CRAN started earlier (1997), while Bioconductor started in 2002 and rOpenSci much later. A public review could help reduce the burden on CRAN reviewers, as outsiders could help solve errors (although this role is somewhat already fulfilled by the R-package-devel mailing list), and would make it easier to notice inconsistencies between reviews and find a compromise. As anecdotal evidence, I submitted two packages, one shortly after the other; on the second package I was asked to change some URLs that on the first I wasn't required to change.

Another difference is the manual pages. It seems that the CRAN repository is equated with R itself, so the Writing R Extensions manual is hosted under cran.r-project.org, as are the CRAN policies.

The policies change without notice to existing developers. Sending an email to the maintainers or to the R-announce mailing list would help developers notice policy changes. Instead, developers have had to create a policy watch and other resources to keep up with CRAN changes.

The CRAN reviewers juggle multiple demanding tasks: their regular jobs, commitments outside work (family, friends, other interests), and then R development and maintenance, CRAN reviews and maintenance, and the R Journal3. One possible way to reduce their burden is to increase the number of reviewers. Perhaps a mentorship program for reviewing packages, or a guideline on what to check, would help train new reviewers and reduce the pressure on the current volunteers.

Many thanks to all the volunteers who maintain it, to those who donate to the R Foundation, and to the employers of those volunteers, who make CRAN and R possible.

Reproducibility

## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.1 (2020-06-06)
##  os       Ubuntu 20.04.1 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Europe/Madrid               
##  date     2021-01-31                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package      * version date       lib source                           
##  assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.0.1)                   
##  backports      1.2.1   2020-12-09 [1] CRAN (R 4.0.1)                   
##  blogdown       1.0.3   2021-01-15 [1] Github (rstudio/blogdown@876543d)
##  bookdown       0.21    2020-10-13 [1] CRAN (R 4.0.1)                   
##  broom          0.7.3   2020-12-16 [1] CRAN (R 4.0.1)                   
##  cellranger     1.1.0   2016-07-27 [1] CRAN (R 4.0.1)                   
##  cli            2.2.0   2020-11-20 [1] CRAN (R 4.0.1)                   
##  colorspace     2.0-0   2020-11-11 [1] CRAN (R 4.0.1)                   
##  crayon         1.3.4   2017-09-16 [1] CRAN (R 4.0.1)                   
##  DBI            1.1.0   2019-12-15 [1] CRAN (R 4.0.1)                   
##  dbplyr         2.0.0   2020-11-03 [1] CRAN (R 4.0.1)                   
##  digest         0.6.27  2020-10-24 [1] CRAN (R 4.0.1)                   
##  dplyr        * 1.0.3   2021-01-15 [1] CRAN (R 4.0.1)                   
##  ellipsis       0.3.1   2020-05-15 [1] CRAN (R 4.0.1)                   
##  evaluate       0.14    2019-05-28 [1] CRAN (R 4.0.1)                   
##  fansi          0.4.2   2021-01-15 [1] CRAN (R 4.0.1)                   
##  farver         2.0.3   2020-01-16 [1] CRAN (R 4.0.1)                   
##  forcats      * 0.5.0   2020-03-01 [1] CRAN (R 4.0.1)                   
##  fs             1.5.0   2020-07-31 [1] CRAN (R 4.0.1)                   
##  generics       0.1.0   2020-10-31 [1] CRAN (R 4.0.1)                   
##  ggplot2      * 3.3.3   2020-12-30 [1] CRAN (R 4.0.1)                   
##  glue           1.4.2   2020-08-27 [1] CRAN (R 4.0.1)                   
##  gtable         0.3.0   2019-03-25 [1] CRAN (R 4.0.1)                   
##  haven          2.3.1   2020-06-01 [1] CRAN (R 4.0.1)                   
##  here           1.0.1   2020-12-13 [1] CRAN (R 4.0.1)                   
##  highr          0.8     2019-03-20 [1] CRAN (R 4.0.1)                   
##  hms          * 0.5.3   2020-01-08 [1] CRAN (R 4.0.1)                   
##  htmltools      0.5.1   2021-01-12 [1] CRAN (R 4.0.1)                   
##  httr           1.4.2   2020-07-20 [1] CRAN (R 4.0.1)                   
##  jsonlite       1.7.2   2020-12-09 [1] CRAN (R 4.0.1)                   
##  knitr          1.30    2020-09-22 [1] CRAN (R 4.0.1)                   
##  labeling       0.4.2   2020-10-20 [1] CRAN (R 4.0.1)                   
##  lattice        0.20-41 2020-04-02 [1] CRAN (R 4.0.1)                   
##  lifecycle      0.2.0   2020-03-06 [1] CRAN (R 4.0.1)                   
##  lubridate    * 1.7.9.2 2020-11-13 [1] CRAN (R 4.0.1)                   
##  magrittr       2.0.1   2020-11-17 [1] CRAN (R 4.0.1)                   
##  Matrix         1.3-2   2021-01-06 [1] CRAN (R 4.0.1)                   
##  mgcv           1.8-33  2020-08-27 [1] CRAN (R 4.0.1)                   
##  modelr         0.1.8   2020-05-19 [1] CRAN (R 4.0.1)                   
##  munsell        0.5.0   2018-06-12 [1] CRAN (R 4.0.1)                   
##  nlme           3.1-151 2020-12-10 [1] CRAN (R 4.0.1)                   
##  patchwork    * 1.1.1   2020-12-17 [1] CRAN (R 4.0.1)                   
##  pillar         1.4.7   2020-11-20 [1] CRAN (R 4.0.1)                   
##  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.0.1)                   
##  purrr        * 0.3.4   2020-04-17 [1] CRAN (R 4.0.1)                   
##  R6             2.5.0   2020-10-28 [1] CRAN (R 4.0.1)                   
##  RColorBrewer   1.1-2   2014-12-07 [1] CRAN (R 4.0.1)                   
##  Rcpp           1.0.5   2020-07-06 [1] CRAN (R 4.0.1)                   
##  readr        * 1.4.0   2020-10-05 [1] CRAN (R 4.0.1)                   
##  readxl         1.3.1   2019-03-13 [1] CRAN (R 4.0.1)                   
##  reprex         0.3.0   2019-05-16 [1] CRAN (R 4.0.1)                   
##  rlang          0.4.10  2020-12-30 [1] CRAN (R 4.0.1)                   
##  rmarkdown      2.6     2020-12-14 [1] CRAN (R 4.0.1)                   
##  rprojroot      2.0.2   2020-11-15 [1] CRAN (R 4.0.1)                   
##  rstudioapi     0.13    2020-11-12 [1] CRAN (R 4.0.1)                   
##  rvest          0.3.6   2020-07-25 [1] CRAN (R 4.0.1)                   
##  scales         1.1.1   2020-05-11 [1] CRAN (R 4.0.1)                   
##  sessioninfo    1.1.1   2018-11-05 [1] CRAN (R 4.0.1)                   
##  stringi        1.5.3   2020-09-09 [1] CRAN (R 4.0.1)                   
##  stringr      * 1.4.0   2019-02-10 [1] CRAN (R 4.0.1)                   
##  tibble       * 3.0.5   2021-01-15 [1] CRAN (R 4.0.1)                   
##  tidyr        * 1.1.2   2020-08-27 [1] CRAN (R 4.0.1)                   
##  tidyselect     1.1.0   2020-05-11 [1] CRAN (R 4.0.1)                   
##  tidyverse    * 1.3.0   2019-11-21 [1] CRAN (R 4.0.1)                   
##  utf8           1.1.4   2018-05-24 [1] CRAN (R 4.0.1)                   
##  vctrs          0.3.6   2020-12-17 [1] CRAN (R 4.0.1)                   
##  viridisLite    0.3.0   2018-02-01 [1] CRAN (R 4.0.1)                   
##  withr          2.4.0   2021-01-16 [1] CRAN (R 4.0.1)                   
##  xfun           0.20    2021-01-06 [1] CRAN (R 4.0.1)                   
##  xml2           1.3.2   2020-04-23 [1] CRAN (R 4.0.1)                   
##  yaml           2.2.1   2020-02-01 [1] CRAN (R 4.0.1)                   
## 
## [1] /home/lluis/bin/R/4.0.1/lib/R/library

  1. Or a problem with ggplot2 representing a sudden value that is much different from those around it.↩︎

  2. I cannot now find the evidence to link to; if anyone finds the tweet, I would appreciate it.↩︎

  3. I’m not aware of anyone whose full job is just R reviewing.↩︎

Lluís Revilla Sancho
Bioinformatician

Bioinformatician with interests in functional enrichment, data integration and transcriptomics.
