Exploring CRAN's files: part 2
Introduction
In the first post of the series we briefly explored the packages available on CRAN. Now I’ll focus on the history of the packages and their size, using the following files:
packages <- tools::CRAN_package_db()
current <- tools:::CRAN_current_db()
archive <- tools:::CRAN_archive_db()
In this part we will use two of these files, current and archive; let’s see why.
current file
The current database has the package size, the dates of modification (which I assume correspond to the date the package was added to CRAN) and the user name of whoever last modified it. This is the same information returned by file.info.
current[1, 1:10]
## size isdir mode mtime ctime atime
## A3 42810 FALSE 664 2015-08-16 23:05:54 2022-09-03 12:02:27 2022-09-03 14:00:19
## uid gid uname grname
## A3 1001 1001 hornik cranadmin
archive file
The archive database returns the same information but, as you might guess from the name, not for current packages: it covers the versions that sit in the archive and are no longer available by default.
archive[[1]]
## size isdir mode mtime ctime
## A3/A3_0.9.1.tar.gz 45252 FALSE 664 2013-02-07 10:00:29 2022-08-22 18:14:53
## A3/A3_0.9.2.tar.gz 45907 FALSE 664 2013-03-26 19:58:40 2022-08-22 18:14:53
## atime uid gid uname grname
## A3/A3_0.9.1.tar.gz 2022-08-22 17:39:50 1001 1001 hornik cranadmin
## A3/A3_0.9.2.tar.gz 2022-08-22 17:39:50 1010 1001 ligges cranadmin
The date matches the one shown on the web page of old sources, so we can be confident of its meaning.
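Since the archive comes as a list with one data frame per package, it is handy to flatten it into a single table before doing any counting. The following is a minimal sketch, not necessarily how the plots in this post were produced: archive_toy is a small stand-in built from the A3 rows printed above, and flatten_archive() is a hypothetical helper name. With network access you could pass the real tools:::CRAN_archive_db() result instead.

```r
# Toy stand-in for tools:::CRAN_archive_db(): a named list with one data
# frame per package, whose row names are the archived tarball paths.
archive_toy <- list(
  A3 = data.frame(
    size = c(45252, 45907),
    mtime = as.POSIXct(c("2013-02-07 10:00:29", "2013-03-26 19:58:40"),
                       tz = "UTC"),
    uname = c("hornik", "ligges"),
    row.names = c("A3/A3_0.9.1.tar.gz", "A3/A3_0.9.2.tar.gz"),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: one row per archived release, with the version
# parsed from the tarball name (<package>_<version>.tar.gz).
flatten_archive <- function(db) {
  pieces <- lapply(names(db), function(pkg) {
    x <- db[[pkg]]
    file <- basename(rownames(x))
    version <- sub("\\.tar\\.gz$", "", sub("^[^_]*_", "", file))
    data.frame(
      package = rep(pkg, nrow(x)),
      version = version,
      size = x$size,
      mtime = x$mtime,
      uname = x$uname,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

releases <- flatten_archive(archive_toy)
releases
```

Parsing the version from the file name relies on package names not containing underscores, which holds for valid R package names.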
CRAN history
As we have seen, there are some files about the CRAN archives. They include the date of modification (moving/editing), the user who did it and, of course, the name and sometimes the version of the package. These archives are the great treasure of CRAN because they help to reproduce experiments or analyses that were run a long time ago.
Note that I’m not totally sure that this archive contains the full record of packages; some early packages might be missing. I’m also aware of some packages removed by CRAN which no longer appear in these records.
Nevertheless, this should provide an accurate picture of the packages available through time. As these files carry no information about when a package was archived (that is recorded in PACKAGES.in instead), I might overestimate the packages available at any given moment.
Remember the plot about the acceptance of packages on CRAN? That plot only looked at the packages currently available; let’s check it with the whole archive:
These releases come both from packages with few releases and from packages with many. If we look at which packages had the most releases:
package | Releases |
spatstat | 206 |
Matrix | 204 |
mgcv | 162 |
RcppArmadillo | 150 |
rgdal | 146 |
nlme | 143 |
caret | 139 |
spdep | 139 |
lattice | 137 |
plotrix | 131 |
sp | 128 |
XML | 126 |
Rcmdr | 123 |
lme4 | 122 |
gstat | 121 |
arm | 119 |
foreign | 117 |
party | 117 |
maptools | 113 |
raster | 108 |
Surprisingly, there are packages with more than 200 versions on CRAN!
The most common number of releases is 1 and a typical package has about 3, but the mean is around 6, pulled up by the few packages with many releases.
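To make those three numbers concrete, here is a small sketch of how the release counts can be summarised. The data are invented (pkg1 to pkg7 are hypothetical packages chosen so the result mirrors the figures above), and I am assuming that the "typical" value refers to the median.

```r
# Hypothetical release table: one row per released version of a package.
releases_toy <- data.frame(
  package = rep(paste0("pkg", 1:7), times = c(1, 1, 1, 3, 4, 5, 27)),
  stringsAsFactors = FALSE
)

# Number of releases per package.
n_releases <- table(releases_toy$package)
counts <- as.vector(n_releases)

# Most common count (mode), median and mean of releases per package.
most_common <- as.integer(names(which.max(table(counts))))
c(mode = most_common, median = median(counts), mean = mean(counts))

# Packages with the most releases, as in the table above.
top <- head(sort(n_releases, decreasing = TRUE), 3)
top
```

A single package with many releases is enough to pull the mean well above the median, which is why the three summaries disagree.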
Given all these different versions of packages, how big are all the packages on CRAN?
CRAN size
Have you ever wondered how big CRAN is? Going by the file sizes of the source packages, all CRAN source packages add up to approximately 96.8 GB.
This doesn’t include binaries for multiple architectures and operating systems. The package size might indicate whether a package contains a considerable amount of data.
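A total like this can be obtained by adding up the size column of both databases. The sketch below uses only the A3 rows shown earlier as toy input, so the result is tiny; current_toy and archive_toy are stand-ins for the real objects, and dividing by 1e9 assumes decimal gigabytes (use 1024^3 for GiB).

```r
# Toy stand-ins for the two databases, using the A3 sizes printed above.
current_toy <- data.frame(size = 42810, row.names = "A3")
archive_toy <- list(
  A3 = data.frame(
    size = c(45252, 45907),
    row.names = c("A3/A3_0.9.1.tar.gz", "A3/A3_0.9.2.tar.gz")
  )
)

# Sizes are in bytes, as in file.info().
archived_bytes <- sum(vapply(archive_toy, function(x) sum(x$size), numeric(1)))
total_bytes <- sum(current_toy$size) + archived_bytes

# Convert to decimal gigabytes (an assumption about the unit).
total_gb <- total_bytes / 1e9
total_gb
```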
Looking back at the size of the packages over time we can see this pattern:
Packages still available on CRAN are smaller than those no longer on CRAN. Among the packages still on CRAN, the archived versions are usually bigger than the current ones. The median package size is increasing (quickly).
Typically packages grow with each new release until they reach 50 releases. Beyond that, the plot depends on very few packages and might not be representative.
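One way to relate size to release number is to order each package's releases by date, number them, and summarise the size per release number. This is only a sketch on invented data (pkgA and pkgB are hypothetical), using the median as the summary, which is my assumption.

```r
# Hypothetical releases: package, tarball size in bytes, modification date.
releases_toy <- data.frame(
  package = c("pkgA", "pkgB", "pkgA", "pkgB", "pkgA"),
  size = c(100, 300, 150, 320, 210),
  mtime = as.POSIXct(c("2010-01-10", "2010-06-15", "2011-05-01",
                       "2012-03-03", "2013-07-07"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Order releases within each package by date, then number them 1, 2, 3, ...
releases_toy <- releases_toy[order(releases_toy$package, releases_toy$mtime), ]
releases_toy$release <- ave(seq_len(nrow(releases_toy)),
                            releases_toy$package, FUN = seq_along)

# Median size for each release number across packages.
size_by_release <- aggregate(size ~ release, data = releases_toy, FUN = median)
size_by_release
```

With real data the higher release numbers are backed by very few packages, so it helps to keep the number of packages per release number next to the median.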
Here we can better appreciate how packages tend to stay below the CRAN size threshold. There isn’t much of a difference between packages available on CRAN and those archived.
If we look at the size of the first release of each package over time we’ll get a more representative view:
Package size tends to increase, except for the brief period 2010-2014. Currently it increases more slowly than before that period, but it is close to its maximum.
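The first release of each package can be picked out by ordering the releases by date and keeping the earliest row per package. Below is a minimal sketch on invented data (pkgA to pkgC are hypothetical), summarising by the median size per year, which may differ from the summary used in the plot.

```r
# Hypothetical releases: package, tarball size in bytes, modification date.
releases_toy <- data.frame(
  package = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgC"),
  size = c(100, 150, 300, 200, 250),
  mtime = as.POSIXct(c("2010-01-10", "2011-05-01", "2010-06-15",
                       "2012-03-03", "2013-07-07"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Keep the earliest release of each package.
ordered <- releases_toy[order(releases_toy$package, releases_toy$mtime), ]
first_release <- ordered[!duplicated(ordered$package), ]

# Median size of first releases per year of first appearance.
first_release$year <- as.integer(format(first_release$mtime, "%Y"))
median_by_year <- aggregate(size ~ year, data = first_release, FUN = median)
median_by_year
```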
Conclusions
Most packages are not updated much, between 1 and 3 times. But some packages are updated quite a lot; this might mean they are data packages rather than software packages, or that they have frequent minor and major updates.
Most current packages are smaller than those archived. Packages no longer available were usually bigger than the packages still on CRAN.
Surprisingly, packages increase their size a lot until the 25th release. They also grow over time, except for a period between 2010 and 2014. This decrease might be due to a change in CRAN policy.
Future parts
In future posts I’ll explore:
patterns in the acceptance of packages and of package updates.
the relation between dependencies, initial release and updates.
who handled the packages.