Bioconductor submissions
The other day I was curious about how many submissions Bioconductor receives and how they work: is there any pattern related to the release cycle? In this brief post I’ll explore some data available from the GitHub repository where submissions are made as issues. I mostly focus on what happens when someone submits a package, so that it can help future submissions.
There are around 1826 issues. From each one I extract the interesting information: date of creation, who created it, where the linked package is, when it was closed, its current labels, its reviewers…
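As a rough illustration of this extraction step, the sketch below flattens one issue, in the JSON shape returned by the GitHub REST API, into the fields listed above. It is written in Python and is not the code behind this post; the function name `parse_issue` and the sample issue are made up.

```python
from datetime import datetime


def parse_issue(issue):
    """Flatten one GitHub issue (REST API JSON) into the fields of interest."""
    def to_date(stamp):
        # GitHub timestamps look like "2020-11-01T09:00:00Z"; open issues have None
        return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ") if stamp else None

    return {
        "id": issue["number"],
        "created": to_date(issue["created_at"]),
        "closed": to_date(issue.get("closed_at")),
        "submitter": issue["user"]["login"],
        "reviewers": [a["login"] for a in issue.get("assignees", [])],
        "labels": [lab["name"] for lab in issue.get("labels", [])],
        "n_comments": issue.get("comments", 0),
    }


# Made-up issue for illustration, not real data
example = {
    "number": 1,
    "created_at": "2020-11-01T09:00:00Z",
    "closed_at": "2020-11-15T12:00:00Z",
    "user": {"login": "someuser"},
    "assignees": [{"login": "somereviewer"}],
    "labels": [{"name": "OK"}],
    "comments": 12,
}
record = parse_issue(example)
days_open = (record["closed"] - record["created"]).days
print(record["submitter"], days_open)
```

Keeping the dates as `datetime` objects makes the later questions (time open, distance to a release) simple subtractions.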
We can further mark as approved those packages that are on Bioconductor but whose issue lacks the approved label (sometimes a package gets accepted but the labels on the issue are never updated).
I didn’t apply this rule to packages that were submitted several times (184 packages have been submitted more than once). Some of them were on the old tracker and were not approved on this repository.
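The marking rule described here could look roughly like the following Python sketch. The label name `"accepted"`, the record layout and the package names are assumptions for illustration; the real label text on the tracker may differ.

```python
from collections import Counter


def mark_approved(submissions, bioconductor_packages, accepted_label="accepted"):
    """Return copies of the submissions with an added boolean 'approved' field."""
    times_submitted = Counter(s["package"] for s in submissions)
    marked = []
    for s in submissions:
        has_label = accepted_label in s["labels"]
        # Fallback: the package is on Bioconductor and was submitted only once
        on_bioc = (s["package"] in bioconductor_packages
                   and times_submitted[s["package"]] == 1)
        marked.append({**s, "approved": has_label or on_bioc})
    return marked


# Made-up records for illustration
submissions = [
    {"package": "pkgA", "labels": ["accepted"]},
    {"package": "pkgB", "labels": []},  # on Bioconductor, label missing
    {"package": "pkgC", "labels": []},  # submitted twice, left unmarked
    {"package": "pkgC", "labels": []},
]
marked = mark_approved(submissions, bioconductor_packages={"pkgB", "pkgC"})
print([m["approved"] for m in marked])
```

Packages submitted more than once are deliberately left out of the fallback, because it is not possible to tell which of the issues led to the acceptance.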
Using the CRAN repository as well, we could find which packages were submitted to Bioconductor but ended up on CRAN. As I expect a low number of these, I won’t check them.
Exploring the submissions
First of all a general overview of the data available:
On this plot we can see almost everything: the date of creation, the time open, when each issue was closed, in which release the package was included if accepted, and the rate of submissions to Bioconductor. What is missing is information about the authors submitting the packages and about the packages themselves.
Reviewers
If your package fits Bioconductor, a bot will assign a reviewer to check the package. Many submissions (~33%) do not get a reviewer assigned and the package is not included in Bioconductor’s repository; in most of these cases (~90%) this happens on the same day as the submission. But once a submission gets a reviewer, ~82% of them are approved.
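The two percentages above are simple proportions. A minimal Python sketch of how they could be computed, assuming each submission is a record with a list of `reviewers` and an `approved` flag (both field names are hypothetical, and the records are made up):

```python
def review_summary(submissions):
    """Share of submissions without a reviewer, and approval rate once reviewed."""
    total = len(submissions)
    reviewed = [s for s in submissions if s["reviewers"]]
    approved = [s for s in reviewed if s["approved"]]
    return {
        "share_without_reviewer": (total - len(reviewed)) / total,
        "approval_rate_if_reviewed": len(approved) / len(reviewed),
    }


# Made-up records: 2 of 6 never get a reviewer, 3 of the 4 reviewed are approved
submissions = [
    {"reviewers": [], "approved": False},
    {"reviewers": [], "approved": False},
    {"reviewers": ["reviewer_a"], "approved": True},
    {"reviewers": ["reviewer_a"], "approved": True},
    {"reviewers": ["reviewer_b"], "approved": True},
    {"reviewers": ["reviewer_b"], "approved": False},
]
summary = review_summary(submissions)
print(summary)
```

The function assumes at least one submission and at least one reviewed submission; an empty input would need a guard against division by zero.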
There are 9 usual reviewers, all core members of Bioconductor:
You might well be assigned the project leader, Martin Morgan, but you can work with any other reviewer, as the workload is quite evenly distributed: mtmorgan, lshep, LiNk-NY, nturaga, dvantwisk, hpages, Kayla-Morrell, Liubuntu, vobencha.
The reviews are shared fairly evenly among all the members. We can see that each of them approves more than 75% of the packages assigned.
Most reviewers take a few days until the issue is closed.
Looking at these plots, more time under review does not mean that the issue is closed without being accepted.
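The per-reviewer approval rate can be sketched in Python as below. The record layout and the reviewer names are made up for illustration; a submission with several reviewers counts once for each of them.

```python
from collections import defaultdict


def approval_rate_by_reviewer(submissions):
    """Return {reviewer: fraction of assigned submissions that were approved}."""
    assigned = defaultdict(int)
    approved = defaultdict(int)
    for sub in submissions:
        for reviewer in sub["reviewers"]:
            assigned[reviewer] += 1
            if sub["approved"]:
                approved[reviewer] += 1
    return {r: approved[r] / assigned[r] for r in assigned}


# Made-up records for illustration
submissions = [
    {"reviewers": ["reviewer_a"], "approved": True},
    {"reviewers": ["reviewer_a"], "approved": True},
    {"reviewers": ["reviewer_a", "reviewer_b"], "approved": False},
    {"reviewers": ["reviewer_b"], "approved": True},
]
rates = approval_rate_by_reviewer(submissions)
print(rates)
```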
Correction! If we split by whether the submission was approved, some reviewers take longer to close submissions that are not approved. They might wait longer before closing the issues without approving them, or maybe they prefer to suggest modifications and a second submission of the same package. Others close unapproved submissions faster than they accept packages, but as you can see from the error bars this is quite variable. In general you can expect a review of more than 40 days.
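A comparison like this boils down to a grouped mean with some measure of spread. The Python sketch below groups the days an issue stayed open by reviewer and approval status and reports the mean and its standard error. Whether the error bars in the plot are standard errors is an assumption, and the records are made up.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def days_open_summary(submissions):
    """Return {(reviewer, approved): (mean days open, standard error or None)}."""
    groups = defaultdict(list)
    for sub in submissions:
        for reviewer in sub["reviewers"]:
            groups[(reviewer, sub["approved"])].append(sub["days_open"])
    summary = {}
    for key, days in groups.items():
        # The standard error needs at least two observations
        se = stdev(days) / sqrt(len(days)) if len(days) > 1 else None
        summary[key] = (mean(days), se)
    return summary


# Made-up records for illustration
submissions = [
    {"reviewers": ["reviewer_a"], "approved": True, "days_open": 30},
    {"reviewers": ["reviewer_a"], "approved": True, "days_open": 50},
    {"reviewers": ["reviewer_a"], "approved": False, "days_open": 90},
]
summary = days_open_summary(submissions)
print(summary)
```

Groups with a single issue get `None` instead of an error estimate, which is a reminder that some bars in such a plot rest on very few issues.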
They might accept packages faster, but is that because they provide faster feedback with more comments?
I suspect many of the comments are automatic messages from the bot about builds and received commits. But apparently there isn’t a difference between them.
Here almost all the reviewers have more comments on approved packages. Also, there is less variability in the number of comments than in the time the issues stay open. Answering doubts and exchanging feedback between reviewers and contributors is associated with higher acceptance, even if part of this increase in comments is due to automatic messages.
Usually more than 20 comments on the issue will be enough, but the more feedback, the better.
Taking both information together we can see the pattern for all the reviewers:
Most of the issues that were not approved have fewer comments and usually remain open for fewer days.
Rushing?
Reviewers take their time to accept or reject the packages. But are contributors in a rush to get them accepted at the end of an academic year, before a Bioconductor release or before a publication? While I cannot check if there is an article accompanying the package, I can check other time trends.
So most days there is a new package submission, and on the busiest day there were 8. There doesn’t seem to be any seasonality towards the end of the academic year, maybe because the contributors are international and do not share the same schedule or holidays.
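Counting submissions per day is a plain tally of creation dates. A small Python sketch, with made-up dates and a hypothetical function name:

```python
from collections import Counter
from datetime import date


def daily_activity(created_dates):
    """Busiest day and share of days in the observed span with a submission."""
    per_day = Counter(created_dates)
    span_days = (max(per_day) - min(per_day)).days + 1
    return {
        "busiest_day": per_day.most_common(1)[0],  # (date, count)
        "share_of_days_with_submission": len(per_day) / span_days,
    }


# Made-up creation dates for illustration
created = [date(2020, 1, 1), date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 4)]
activity = daily_activity(created)
print(activity)
```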
Using the dates of new Bioconductor releases we can check if there is some rush to submit packages closer to the new release:
The vertical line indicates the usual date after which submissions are no longer accepted for that release (around 30 days before the release day). So it seems that people do submit close to that date, but submissions are fairly consistent throughout the year, as previously seen.
Those who submit closer to the release date have higher rates of not being included in Bioconductor.
Very few packages went through the worst scenario: being submitted right before a deadline and then not being accepted until many releases later. However, there is a curious effect: the closer the submission is to the release day, the shorter the reviews of the accepted packages are.
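The distance between a submission and the next release can be computed by looking up the first release date on or after the submission date. In the Python sketch below the release dates are placeholders standing in for a real release calendar, and the 30-day cutoff is the approximate figure mentioned above; both should be replaced with the official values.

```python
from bisect import bisect_left
from datetime import date

# Placeholder release calendar, sorted ascending; replace with the official dates
RELEASES = [date(2019, 10, 30), date(2020, 4, 28), date(2020, 10, 28)]


def days_to_next_release(submitted, releases=RELEASES):
    """Days from the submission to the first release on or after it."""
    i = bisect_left(releases, submitted)
    if i == len(releases):
        return None  # no later release known
    return (releases[i] - submitted).days


def before_cutoff(submitted, cutoff_days=30, releases=RELEASES):
    """True if the submission arrives at least cutoff_days before the release."""
    remaining = days_to_next_release(submitted, releases)
    return remaining is not None and remaining >= cutoff_days


print(days_to_next_release(date(2020, 4, 10)), before_cutoff(date(2020, 4, 10)))
print(days_to_next_release(date(2020, 3, 1)), before_cutoff(date(2020, 3, 1)))
```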
Here we can observe this effect on the latest release.
I expected to see longer review time for the issues submitted closer to the release but we don’t see it. This indicates a sustained effort to review the submissions.
Pitfalls
We have seen mostly the path to success, but in order to improve the process we must look at what leads to packages not being approved.
We can check if packages
When the user who submits the package is the owner of the repository, the package is more likely to be accepted. Probably because it is easier to manage the version control system that way.
Most issues are closed fast, but some reviews take longer than a year! Those closed fast are closed automatically by the bot, for several reasons. Among these reasons:
Repository links in the issue | Submissions |
---|---|
0 | 13 |
2 | 11 |

Repository name | Submissions |
---|---|
yourpackagename | 68 |
cytofkit2 | 8 |
test-package | 7 |
PoTRA | 6 |
PFP | 6 |
Summix | 6 |
It is about as common to provide no link as to provide two. However, many times the template is used as is and the repository name in the link remains “yourpackagename”, which is detected by the bot, and the issue is closed.
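These checks are easy to run on your own issue text before submitting. The Python sketch below extracts GitHub repository links from an issue body and flags the cases above. The regular expression, the function names and the messages are assumptions for illustration and do not reproduce the bot's actual rules.

```python
import re

# Matches https://github.com/<owner>/<repository>
REPO_LINK = re.compile(r"https?://github\.com/([\w.-]+)/([\w.-]+)")


def repository_names(body):
    """Unique repository names linked in the issue body, in order of appearance."""
    names = [m.group(2) for m in REPO_LINK.finditer(body)]
    return list(dict.fromkeys(names))


def check_submission(body):
    """Describe the first problem found with the repository link, or 'ok'."""
    names = repository_names(body)
    if len(names) != 1:
        return f"expected exactly one repository link, found {len(names)}"
    if names[0] == "yourpackagename":
        return "template placeholder was not replaced"
    return "ok"


# Made-up issue bodies for illustration
print(check_submission("Repository: https://github.com/yourusername/yourpackagename"))
print(check_submission("Repository: https://github.com/someuser/mypkg"))
print(check_submission("No link here"))
```

Because the pattern allows dots in names, a link followed directly by a full stop would capture the trailing dot, so a stricter pattern is advisable for real use.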
Comments | Submissions |
---|---|
1 | 413 |
2 | 42 |
3 | 26 |
4 | 13 |
5 | 11 |
6 | 9 |
12 | 9 |
10 | 8 |
0 | 7 |
7 | 7 |
Most of the rejected submissions are closed automatically by the “bioc-issue-bot”, with a brief message. In general, few messages are written before these submissions are closed.
Ending | Number of issues |
---|---|
.git | 20 |
.R | 4 |
.data | 3 |
.Gmax.EnsemblPlants.Gmv2 | 3 |
.data.oma.All.Jan2020 | 2 |
.2 | 1 |
Sometimes the user provides a link to a compressed file, or the link incorrectly ends with .git, as in the URL needed to clone a repository.
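A quick way to spot such endings is to look at whatever follows the first dot in the repository name. The Python sketch below does that and also removes a trailing `.git`. Defining the ending as everything after the first dot is an assumption about how the table above was built, and the URLs are made up.

```python
def repo_name_ending(url):
    """Everything from the first dot of the repository name, or '' if no dot."""
    name = url.rstrip("/").rsplit("/", 1)[-1]
    _, dot, ending = name.partition(".")
    return dot + ending


def strip_git_suffix(url):
    """Turn a clone URL into the plain repository URL."""
    url = url.rstrip("/")
    return url[:-len(".git")] if url.endswith(".git") else url


# Made-up URLs for illustration
print(repo_name_ending("https://github.com/someuser/mypkg.git"))
print(repo_name_ending("https://github.com/someuser/mypkg"))
print(strip_git_suffix("https://github.com/someuser/mypkg.git"))
```

Packages whose names legitimately contain dots also show up with an ending, so a non-empty result is a hint to look closer rather than an error by itself.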
Number of labels | Number of issues |
---|---|
0 | 455 |
1 | 81 |
2 | 75 |
3 | 41 |
4 | 8 |
5 | 2 |
The bot automatically labels the issues to provide information about the build and check of the package and about the stage of the review. Issues without labels usually were not assigned a reviewer or built on the Bioconductor server.
Looking at which labels go together, once the review starts the next halting point is an error in the build process; probably the error is not fixed and the submission is not approved.
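Which labels go together can be tallied by counting every pair of labels that appears on the same issue. A minimal Python sketch follows; the label names are illustrative stand-ins, not necessarily the exact labels used on the tracker.

```python
from collections import Counter
from itertools import combinations


def label_pairs(issues_labels):
    """Count how often each pair of labels appears on the same issue."""
    pairs = Counter()
    for labels in issues_labels:
        # Sort so that (a, b) and (b, a) are counted as the same pair
        for pair in combinations(sorted(set(labels)), 2):
            pairs[pair] += 1
    return pairs


# Made-up label sets for illustration, one list per closed issue
issues_labels = [
    ["review in progress", "ERROR"],
    ["review in progress", "ERROR"],
    ["review in progress", "WARNINGS"],
    [],
]
pairs = label_pairs(issues_labels)
print(pairs.most_common(1))
```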
Conclusion
Most of the packages submitted to Bioconductor are accepted. Those that are not accepted usually fail because of submission formatting issues detected automatically by the bioc-issue-bot. Reviewers provide a lot of feedback that, if followed, leads to acceptance of the package; when this feedback is ignored, the submission is rejected. Some packages are submitted right before a release and miss it, while others undergo a long review and miss several releases. If you want your package included in the next release, aim to submit at least 40 days before the release date (releases usually happen at the end of April and October).
Reproducibility
## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.1 (2020-06-06)
## os Ubuntu 20.04.1 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Europe/Madrid
## date 2021-01-08
##
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.1)
## backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.1)
## beeswarm 0.2.3 2016-04-25 [1] CRAN (R 4.0.1)
## BiocManager * 1.30.10 2019-11-16 [1] CRAN (R 4.0.1)
## blogdown 0.21.84 2021-01-07 [1] Github (rstudio/blogdown@c4fbb58)
## bookdown 0.21 2020-10-13 [1] CRAN (R 4.0.1)
## broom 0.7.3 2020-12-16 [1] CRAN (R 4.0.1)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.0.1)
## cli 2.2.0 2020-11-20 [1] CRAN (R 4.0.1)
## codetools 0.2-18 2020-11-04 [1] CRAN (R 4.0.1)
## colorspace 2.0-0 2020-11-11 [1] CRAN (R 4.0.1)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.1)
## curl 4.3 2019-12-02 [1] CRAN (R 4.0.1)
## DBI 1.1.0 2019-12-15 [1] CRAN (R 4.0.1)
## dbplyr 2.0.0 2020-11-03 [1] CRAN (R 4.0.1)
## digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.1)
## dplyr * 1.0.2 2020-08-18 [1] CRAN (R 4.0.1)
## ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.1)
## evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.1)
## fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.1)
## farver 2.0.3 2020-01-16 [1] CRAN (R 4.0.1)
## forcats * 0.5.0 2020-03-01 [1] CRAN (R 4.0.1)
## fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.1)
## generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.1)
## ggbeeswarm 0.6.0 2017-08-07 [1] CRAN (R 4.0.1)
## ggplot2 * 3.3.2 2020-06-19 [1] CRAN (R 4.0.1)
## gh * 1.2.0 2020-11-27 [1] CRAN (R 4.0.1)
## gitcreds 0.1.1 2020-12-04 [1] CRAN (R 4.0.1)
## glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.1)
## gridExtra 2.3 2017-09-09 [1] CRAN (R 4.0.1)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.1)
## haven 2.3.1 2020-06-01 [1] CRAN (R 4.0.1)
## highr 0.8 2019-03-20 [1] CRAN (R 4.0.1)
## hms 0.5.3 2020-01-08 [1] CRAN (R 4.0.1)
## htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.1)
## httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.1)
## jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.0.1)
## knitr 1.30 2020-09-22 [1] CRAN (R 4.0.1)
## labeling 0.4.2 2020-10-20 [1] CRAN (R 4.0.1)
## lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.1)
## lubridate * 1.7.9.2 2020-11-13 [1] CRAN (R 4.0.1)
## magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.1)
## modelr 0.1.8 2020-05-19 [1] CRAN (R 4.0.1)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.1)
## pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.1)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.1)
## plyr 1.8.6 2020-03-03 [1] CRAN (R 4.0.1)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.0.1)
## R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.1)
## Rcpp 1.0.5 2020-07-06 [1] CRAN (R 4.0.1)
## readr * 1.4.0 2020-10-05 [1] CRAN (R 4.0.1)
## readxl 1.3.1 2019-03-13 [1] CRAN (R 4.0.1)
## reprex 0.3.0 2019-05-16 [1] CRAN (R 4.0.1)
## rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.1)
## rmarkdown 2.6 2020-12-14 [1] CRAN (R 4.0.1)
## rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.1)
## rvest 0.3.6 2020-07-25 [1] CRAN (R 4.0.1)
## scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.1)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.1)
## stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.1)
## stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.0.1)
## tibble * 3.0.4 2020-10-12 [1] CRAN (R 4.0.1)
## tidyr * 1.1.2 2020-08-27 [1] CRAN (R 4.0.1)
## tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.1)
## tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 4.0.1)
## UpSetR * 1.4.0 2019-05-22 [1] CRAN (R 4.0.1)
## utf8 1.1.4 2018-05-24 [1] CRAN (R 4.0.1)
## vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.1)
## vipor 0.4.5 2017-03-22 [1] CRAN (R 4.0.1)
## withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.1)
## xfun 0.20 2021-01-06 [1] CRAN (R 4.0.1)
## xml2 1.3.2 2020-04-23 [1] CRAN (R 4.0.1)
## yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.1)
##
## [1] /home/lluis/bin/R/4.0.1/lib/R/library