Exploring CRAN's files: part 2

Introduction

In the first post of the series we briefly explored the packages available on CRAN. Now I’ll focus on the history of the packages and their size, using the following files:

# CRAN_package_db() is exported by tools; the other two are internal, hence `:::`
packages <- tools::CRAN_package_db()
current <- tools:::CRAN_current_db()
archive <- tools:::CRAN_archive_db()

In this part we will use two of these files, the current and the archive. Let’s see why.

current file

The current database has the package size, the dates of modification (which I assume correspond to the date the package was added to CRAN) and the user name of whoever last modified it. This is the same information returned by file.info.

current[1, 1:10]
##     size isdir mode               mtime               ctime               atime
## A3 42810 FALSE  664 2015-08-16 23:05:54 2022-09-03 12:02:27 2022-09-03 14:00:19
##     uid  gid  uname    grname
## A3 1001 1001 hornik cranadmin
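
To see where these columns come from, here is a small sketch that calls file.info on a temporary file. The file and its contents are made up for illustration. On Windows the user and group columns are replaced by an exe column, so only the first six names are shared across platforms.

```r
# Create a throwaway file so the example does not depend on CRAN.
tmp <- tempfile(fileext = ".txt")
writeLines("hello CRAN", tmp)

# file.info() returns one row per path, with the path as row name.
info <- file.info(tmp)
names(info)

# mtime is a date-time (POSIXct), the same kind of value shown above.
class(info$mtime)

# Clean up the temporary file.
unlink(tmp)
```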

archive file

The archive database returns the same information but, as you might guess from the name, it doesn’t cover the current packages: it covers the package versions in the archive, which are no longer available by default.

archive[[1]]
##                     size isdir mode               mtime               ctime
## A3/A3_0.9.1.tar.gz 45252 FALSE  664 2013-02-07 10:00:29 2022-08-22 18:14:53
## A3/A3_0.9.2.tar.gz 45907 FALSE  664 2013-03-26 19:58:40 2022-08-22 18:14:53
##                                  atime  uid  gid  uname    grname
## A3/A3_0.9.1.tar.gz 2022-08-22 17:39:50 1001 1001 hornik cranadmin
## A3/A3_0.9.2.tar.gz 2022-08-22 17:39:50 1010 1001 ligges cranadmin

The date matches the one available in the old sources on the web, so we can be confident of its meaning.
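
Since the archive is a list with one data frame per package, it is often easier to work with a single table. The following is a minimal sketch of one way to flatten it, using a toy list instead of the real download. The A3 rows copy the output above, while toyPkg and its values are invented for illustration.

```r
# Toy stand-in for the archive: a named list of data frames, one per
# package, with one row per archived tarball (row name = file path).
archive_toy <- list(
  A3 = data.frame(
    size = c(45252, 45907),
    mtime = as.POSIXct(c("2013-02-07 10:00:29", "2013-03-26 19:58:40"),
                       tz = "UTC"),
    row.names = c("A3/A3_0.9.1.tar.gz", "A3/A3_0.9.2.tar.gz")
  ),
  toyPkg = data.frame(          # hypothetical package, invented values
    size = 10000,
    mtime = as.POSIXct("2012-01-01 00:00:00", tz = "UTC"),
    row.names = "toyPkg/toyPkg_1.0.tar.gz"
  )
)

# Collect the file paths, then stack the data frames in the same order.
files <- unlist(lapply(archive_toy, rownames), use.names = FALSE)
arch <- do.call(rbind, unname(archive_toy))
arch$file <- files
rownames(arch) <- NULL

# Derive package name and version from "pkg/pkg_version.tar.gz".
tarball <- basename(arch$file)
arch$package <- sub("_.*$", "", tarball)
arch$version <- sub("^[^_]+_(.*)\\.tar\\.gz$", "\\1", tarball)

arch[, c("package", "version", "size", "mtime")]
```

Splitting on the first underscore is safe here because package names cannot contain underscores.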

CRAN history

As we have seen, there are some files describing the archives of CRAN. They include the date of modification (moving or editing), the user who did it and, of course, the name and sometimes the version of the package. These archives are the great treasure of CRAN because they help to reproduce experiments or analyses that were run a long time ago.

Note that I’m not totally sure that this archive contains the full record of packages; some early packages might be missing. I’m also aware of some packages removed by CRAN which no longer appear in these records.

Nevertheless, this should provide a fairly accurate picture of the packages available through time. These files have no information about when a package was archived (PACKAGES.in does have it), so I might overestimate the packages available at any given moment.
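
A minimal sketch of this kind of count, on invented data: each package is counted as available from the year of its first file onward. Because there is no archival date, no package is ever subtracted, which is exactly where the overestimate comes from. The package names and dates are hypothetical.

```r
# Toy table of releases: one row per tarball (invented values).
releases <- data.frame(
  package = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgC", "pkgC"),
  mtime = as.Date(c("2010-03-01", "2012-06-15", "2011-01-20",
                    "2012-02-02", "2013-07-07", "2014-09-09"))
)

# Year of each release, then the first year each package was seen.
releases$year <- as.integer(format(releases$mtime, "%Y"))
first_year <- tapply(releases$year, releases$package, min)

# New packages per year and their running total.
new_per_year <- table(first_year)
cumulative <- cumsum(new_per_year)
cumulative
```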

Remember the plot about the acceptance of packages on CRAN? That plot only looked at the packages currently available. Let’s check it with the whole archive:

Figure 1: Packages on CRAN archive by their addition to it. There are over 125000 archives on CRAN.

These archives come both from packages with few releases and from packages with many releases. If we look at which packages had the most releases:

Surprisingly, there are packages with more than 200 versions on CRAN!

Figure 2: Distribution of releases. Packages and their number of releases.

The most common number of releases is 1 and a typical package has 3, but the mean is around 6 because a few packages have a very large number of releases.
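
The gap between these three summaries is what a right-skewed distribution looks like. Here is a small sketch on invented data, chosen only so that it shows the same shape (one package with many releases pulling the mean up). The package names and counts are hypothetical, not CRAN data.

```r
# Toy table with one row per release (invented packages and counts).
releases <- data.frame(
  package = c("pkgA", "pkgB", rep("pkgC", 3), rep("pkgD", 4),
              rep("pkgE", 21))
)

# Number of releases per package.
counts <- as.vector(table(releases$package))

# Most frequent number of releases, median and mean.
mode_releases <- as.integer(names(which.max(table(counts))))
median_releases <- median(counts)
mean_releases <- mean(counts)

c(mode = mode_releases, median = median_releases, mean = mean_releases)
```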

Given all these different versions of packages, how big are all the packages on CRAN?

CRAN size

Have you ever wondered how big CRAN is? Going by the file size of the source packages, all CRAN source packages together take approximately 96.8 GB.

This doesn’t include binaries for multiple architectures and operating systems. The package size might indicate whether the package has a considerable amount of data.
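
The total is just the sum of the size column (in bytes) over the current files and every archived file. A minimal sketch with invented sizes follows. Whether the result is reported in decimal gigabytes or binary gibibytes is an assumption to state, since the two differ by about 7%.

```r
# Toy sizes in bytes (invented): current tarballs and archived tarballs.
current_size <- c(42810, 1200000, 5000000)
archive_size <- list(A3 = c(45252, 45907), toyPkg = 3000000)

# Sum everything, flattening the per-package list first.
total_bytes <- sum(current_size) + sum(unlist(archive_size))

# Decimal gigabytes (10^9 bytes) versus binary gibibytes (2^30 bytes).
total_gb <- total_bytes / 1e9
total_gib <- total_bytes / 1024^3

c(bytes = total_bytes, GB = total_gb, GiB = total_gib)
```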

Looking at the size of the packages over time we can see this pattern:

Figure 3: Packages and their median size. Archived packages have become bigger since 2014. Packages on CRAN have been getting bigger since 2017.

Packages available on CRAN are smaller than those no longer on CRAN. However, the archived versions of packages that are still on CRAN are usually bigger than their current versions. The median size of packages is increasing (quickly).
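
A comparison like this boils down to a median per year and per status. Below is a minimal sketch with invented sizes. The status labels and the numbers are hypothetical and only show the grouping step, not the real trend.

```r
# Toy data: one row per tarball, with year, status and size (invented).
sizes <- data.frame(
  year = c(2015, 2015, 2015, 2016, 2016, 2015, 2015, 2016, 2016, 2016),
  status = c(rep("archived", 5), rep("current", 5)),
  size = c(100, 300, 200, 400, 600, 50, 150, 250, 350, 300)
)

# Median size for every combination of year and status.
med <- aggregate(size ~ year + status, data = sizes, FUN = median)
med
```

The median is used rather than the mean because a few very large packages would otherwise dominate each year.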

Figure 4: Package size by number of releases. Packages are usually small but seem to gain weight when updating.

Typically packages increase their size with each new release until they reach 50 releases. Beyond that, the plot depends on very few packages and might not be representative.
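
To relate size to the number of releases, each file first needs a release number within its package. This is a sketch of one way to do it, assuming that ordering by modification date reflects the release order. The data are invented.

```r
# Toy releases for two hypothetical packages (invented values).
releases <- data.frame(
  package = c("pkgA", "pkgB", "pkgA", "pkgB", "pkgA"),
  mtime = as.Date(c("2012-01-01", "2012-03-01", "2013-01-01",
                    "2013-05-01", "2014-01-01")),
  size = c(100, 200, 150, 260, 180)
)

# Sort by package and date so that the row order is the release order.
releases <- releases[order(releases$package, releases$mtime), ]

# Number the releases within each package: 1, 2, 3, ...
releases$release <- ave(seq_along(releases$package), releases$package,
                        FUN = seq_along)

# Median size at each release number, across packages.
size_by_release <- tapply(releases$size, releases$release, median)
size_by_release
```

With real data the later release numbers are backed by very few packages, so counting the rows per release number alongside the median helps to judge how far to trust it.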

Figure 5: Package size by number of releases and by availability. Packages no longer on CRAN are usually smaller than those on it. The continuous black line is CRAN’s current threshold, while the discontinuous black line is the current median size.

Here we can better appreciate how packages tend to stay below the CRAN threshold. There isn’t much of a difference between packages available on CRAN and those archived.

If we look at the size of each package’s first release over time, we’ll get a representative view:

Figure 6: Size of the first release by time. Package size increases with time, with a peak around 2010, and has been increasing again since 2014 but still hasn’t surpassed the previous record.

Package size tends to increase, except for the brief period 2010-2014. Currently it increases more slowly than before that period, but it is close to its maximum.
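
A minimal sketch of how the first release of each package can be picked out and summarised by year, again on invented data. It assumes that the earliest modification date of a package marks its first release.

```r
# Toy releases (invented packages, dates and sizes).
releases <- data.frame(
  package = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgC", "pkgD"),
  mtime = as.Date(c("2009-05-01", "2011-01-01", "2009-08-01",
                    "2012-02-01", "2013-01-01", "2012-06-01")),
  size = c(300, 400, 500, 200, 250, 100)
)

# Sort by date; the first row seen for each package is its first release.
releases <- releases[order(releases$mtime), ]
first <- releases[!duplicated(releases$package), ]

# Median size of first releases per year.
first$year <- as.integer(format(first$mtime, "%Y"))
first_size <- tapply(first$size, first$year, median)
first_size
```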

Conclusions

  • Most packages are not updated much, between 1 and 3 times. But some packages are updated quite a lot; this might mean that they are data packages rather than software packages, or that they have frequent minor and major updates.

  • Most current packages are smaller than those archived. Packages no longer available were usually bigger than the packages still on CRAN.

  • Surprisingly, packages increase their size a lot until the 25th release. They also grow with time, except for the period between 2010 and 2014. This decrease might be due to a change in CRAN policy.

Future parts

In future posts I’ll explore:

  • patterns in the acceptance of packages and in package updates.

  • the relation between dependencies, initial release and updates.

  • who handled the packages.

Lluís Revilla Sancho
Data scientist

Data scientist with interests in software quality, mostly R.
