Exploring CRAN's files: part 3
References: links between pages, packages and base R
In the first blog post about CRAN files, I explored the package database (mostly around CRAN_package_db()). In the second post I moved on to analyzing the archive around CRAN_archive_db().
Recently at useR!2024, Kurt Hornik offered to export data from CRAN. After an email, several new functions were exported and introduced to base R.
In that presentation and in a previous exchange with Kurt, he explained his project about providing the HTML manual pages of all CRAN packages. One of the problems with this project is providing links to the right pages of the packages (another is making them accessible to all users).
In R-devel there are now tests that ensure links to other pages are in the right format 1. In this post we’ll explore links between documentation pages, what the manuals say about them, and the state of base R and the CRAN packages.
Introduction
Help pages are defined in R documentation files, commonly with the .Rd extension.
Each documentation file can have multiple topics.
A topic is defined in R documentation format with \alias{topic}, so one can use alias and topic interchangeably.
Links, or cross-references (xref for short), should point to an alias, not to a file (which allows moving topics between help pages)2.
One can define links with \link{alias} or \link[=alias]{name} if the link is to a different place than the name 3.
In the second form the content within the square brackets is known as the anchor.
There are two other accepted forms: \link[pkg]{alias} and \link[pkg:alias]{name}, which do not use the = inside the anchor 4.
These last two forms are only used in the HTML format.
Packages referred to in these square brackets should be declared in the DESCRIPTION file, in the ‘Depends’, ‘Imports’, ‘Suggests’ or ‘Enhances’ fields.
One important thing is that links are case sensitive: ?Complex leads to a different help page than ?complex.
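To make the four link forms concrete, here is a minimal sketch in base R that resolves a link into the package and topic it points to. The helper name parse_anchor() is hypothetical, and the rules only mirror the description above; this is not the code used by the tools package.

```r
# Hypothetical helper: given the text inside {...} (target) and the text
# inside [...] (anchor, "" when absent), return the package and topic.
parse_anchor <- function(target, anchor = "") {
  if (anchor == "") {
    # \link{alias}: no anchor, the target is the alias
    c(package = NA_character_, topic = target)
  } else if (startsWith(anchor, "=")) {
    # \link[=alias]{name}: the anchor without "=" is the alias
    c(package = NA_character_, topic = substring(anchor, 2))
  } else if (grepl(":", anchor, fixed = TRUE)) {
    # \link[pkg:alias]{name}: split on the first colon
    pos <- as.integer(regexpr(":", anchor, fixed = TRUE))
    c(package = substring(anchor, 1, pos - 1),
      topic = substring(anchor, pos + 1))
  } else {
    # \link[pkg]{alias}: the anchor is the package, the target is the alias
    c(package = anchor, topic = target)
  }
}

parse_anchor("mean")                        # \link{mean}
parse_anchor("the mean", "=mean")           # \link[=mean]{the mean}
parse_anchor("filter", "dplyr")             # \link[dplyr]{filter}
parse_anchor("the filter", "stats:filter")  # \link[stats:filter]{the filter}
```

Keeping the package as NA for the first two forms makes it explicit that the help system still has to find out which package provides the topic.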
With this in mind we can check which packages have the most manual pages, have the most cross-references, or have an overview help page (those with pkgname-package.Rd). We will also check which documentation pages have the most links, are the most linked to, or need their links fixed.
Base R
Before looking at CRAN packages we can start by looking at the base R packages.
Previously there wasn’t a way to know the links and aliases in base R. Since around 2024/08/20 we can download them (in R-devel, until it is released as R 4.5) with tools::base_aliases_db() and tools::base_rdxrefs_db().
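A rough sketch of how such a database could be flattened for analysis follows. It assumes the alias database is a nested list (package, then Rd file, then a character vector of aliases); that shape, the helper name aliases_to_df() and the toy values are assumptions for illustration, so check the structure of the real object before relying on it.

```r
# Flatten a nested alias list (package -> Rd file -> aliases) into a data frame.
# The nested shape is an assumption about what tools::base_aliases_db() returns.
aliases_to_df <- function(db) {
  rows <- lapply(names(db), function(pkg) {
    files <- db[[pkg]]
    n <- lengths(files)  # number of aliases per Rd file
    data.frame(
      Package = rep(pkg, sum(n)),
      Source = as.character(rep(names(files), n)),
      Target = as.character(unlist(files, use.names = FALSE)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy input with the assumed shape (values are illustrative only)
toy <- list(
  base = list("Complex.Rd" = c("complex", "Re", "Im"),
              "groupGeneric.Rd" = c("Complex", "Math")),
  methods = list("S4groupGeneric.Rd" = c("Complex", "Math"))
)
toy_df <- aliases_to_df(toy)
toy_df

# With R >= 4.5 and network access one could try (not run here):
# base_df <- aliases_to_df(tools::base_aliases_db())
```

One row per alias makes the later counts (help pages, aliases, duplicates) simple aggregations.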
Aliases
In base R there are 1458 help pages with 4676 topics/aliases.
There are some help files with the same name in base R that can be accessed via their targets (via ?Target).
Those files are dev.Rd, format.Rd and units.Rd.
To distinguish between those files one needs to check the title(s) and content of the pages.
Similarly there are multiple help pages of different packages with the same topics:
Alias | Packages |
---|---|
Complex | base, methods |
GSC | base, tools |
Logic | base, methods |
Math | base, methods |
Ops | base, methods |
R_GSCMD | base, tools |
S4 | base, methods |
Summary | base, methods |
clipboard | base, utils |
hat | grDevices, stats |
matrixOps | base, methods |
plot | base, graphics |
show-methods | methods, stats4 |
symbol | base, grDevices |
Most of them are in base and one other package: methods, tools, grDevices, graphics or utils.
To access one of them directly one needs to use help(), as in help("clipboard", package = "utils"); otherwise one has to choose from the menu that pops up when using ?clipboard.
This works to access any other help page too.
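A table like the one above can be derived from a flat alias table. The sketch below uses toy data with hypothetical file names and column names (Package, Source, Target); only the aliases and packages are taken from the table above.

```r
# Toy alias table: one row per alias (file names are made up)
alias_df <- data.frame(
  Package = c("base", "utils", "base", "graphics", "stats"),
  Source  = c("f1.Rd", "f2.Rd", "f3.Rd", "f4.Rd", "f5.Rd"),
  Target  = c("clipboard", "clipboard", "plot", "plot", "lm"),
  stringsAsFactors = FALSE
)

# Group packages by alias and keep aliases found in more than one package
by_alias <- split(alias_df$Package, alias_df$Target)
n_pkgs <- lengths(lapply(by_alias, unique))
shared <- by_alias[n_pkgs > 1]

shared_df <- data.frame(
  Alias = names(shared),
  Packages = unname(vapply(shared,
                           function(p) paste(sort(unique(p)), collapse = ", "),
                           character(1))),
  stringsAsFactors = FALSE
)
shared_df
```

For any alias listed there, help("alias", package = "pkg") picks the page directly instead of going through the menu.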
There are 665 help pages that only have one alias.
For example, .BasicFunsList, .Platform or .Last.value.
Some of these are datasets or very specific help pages.
Last, we can check the topics for the packages’ help pages.
While packages usually have at least two aliases for it, pkgname and pkgname-package, there are two packages that do not: grid and methods.
These packages do not have the pkgname alias pointing to their own package help page.
?grid could be confused with graphics::grid and ?methods could be confused with utils::methods.
Cross references
Having explored the topics on the help pages, it is time to move to the cross-references in R.
We can check where these help pages link to, starting with how many link directly to an alias and how many use an anchor.
We can further explore where they link to if we resolve the package and topic, instead of using just the topic.
We can check for pages that link to themselves. The cross-reference system doesn’t provide a way to link to a specific section (yet?). When users follow those links they are redirected to the top of the same help page, while they were probably expecting to go to another help page. At least this is something I have felt in the past. In my opinion these links could be deleted and the text rewritten to explain which section the user should read:
Package | Help page |
---|---|
base | double.Rd |
base | numeric.Rd |
base | parse.Rd |
base | call.Rd |
base | is.unsorted.Rd |
base | showConnections.Rd |
base | taskCallback.Rd |
grDevices | axisTicks.Rd |
grid | grid.add.Rd |
grid | grid.get.Rd |
methods | setIs.Rd |
parallel | makeCluster.Rd |
parallel | unix/children.Rd |
stats | uniroot.Rd |
stats | NLSstRtAsymptote.Rd |
stats | dist.Rd |
stats | dendrogram.Rd |
stats | stepfun.Rd |
utils | update.packages.Rd |
There are 19 help pages with links to themselves that could be removed.
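One way to detect such self links is to resolve every link target to the help page that defines it and compare that page with the page the link comes from. This is a minimal sketch on toy data; the column names are assumptions, and it only covers links without an anchor inside a single package.

```r
# Toy alias table: which help page defines which topic
aliases <- data.frame(
  Package = "base",
  Source = c("double.Rd", "double.Rd", "integer.Rd"),
  Target = c("double", "as.double", "integer"),
  stringsAsFactors = FALSE
)
# Toy cross-references: which help page links to which topic
xrefs <- data.frame(
  Package = "base",
  Source = c("double.Rd", "double.Rd", "integer.Rd"),
  Target = c("as.double", "integer", "double"),
  stringsAsFactors = FALSE
)

# Resolve each link target to the help page defining that alias
resolved <- merge(xrefs, aliases, by = c("Package", "Target"),
                  suffixes = c("_from", "_to"))

# A self link starts and ends on the same help page
self_links <- resolved[resolved$Source_from == resolved$Source_to, ]
self_links
```

In this toy example the only self link is the one from double.Rd to as.double, which is defined on the same page.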
R documentation files with references to other help pages:
We can clearly see that packages mostly link within themselves, but there are some heavily connected pairs of packages like utils and base or stats and base. Surprisingly, the compiler package doesn’t have any reference to other packages.
There were help pages that used file names instead of topics to link pages5. I raised this issue on the mailing list and later submitted a patch (18815) that got them fixed before this post was published.
We also observed that some packages do not provide links. How frequent is this and how does it affect base R?
Link from | Reference to | Help pages | perc |
---|---|---|---|
TRUE | TRUE | 1093 | 74.8% |
TRUE | FALSE | 211 | 14.4% |
FALSE | FALSE | 113 | 7.7% |
FALSE | TRUE | 44 | 3.0% |
Almost all help pages have links and most of them receive references from other help pages (compare this with Table 9). Only 113 help pages neither have links nor receive references, so unless they are documented somewhere else, like R Internals, Writing R Extensions or R Installation and Administration, it will be hard to find them.
This seems a good opportunity to patch the documentation and make it easier to find and use those help pages.
For example, there is currently no CRAN package or base package that links to Rcmd().
This makes it harder to discover and use.
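The classification in the table above can be sketched as follows, assuming links have already been resolved to help pages. The page names and the from/to columns are hypothetical toy data.

```r
# All help pages, and resolved links between them (toy data)
pages <- c("a.Rd", "b.Rd", "c.Rd", "d.Rd")
links <- data.frame(
  from = c("a.Rd", "b.Rd", "a.Rd"),
  to   = c("b.Rd", "a.Rd", "c.Rd"),
  stringsAsFactors = FALSE
)

# Does the page link to something? Is the page linked from somewhere?
status <- data.frame(
  page = pages,
  link_from = pages %in% links$from,
  reference_to = pages %in% links$to
)

# Count each combination and express it as a percentage
counts <- as.data.frame(table(link_from = status$link_from,
                              reference_to = status$reference_to))
counts$perc <- round(100 * counts$Freq / sum(counts$Freq), 1)
counts
```

Here d.Rd is the kind of page discussed above: it neither links nor is linked, so it can only be found by knowing its name.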
So far we have focused on links; if we compare them with the existing help pages, counting per package the help pages that no other help page links to, we might find some interesting patterns:
Package | Help pages |
---|---|
datasets | 67 |
grid | 50 |
base | 46 |
stats | 40 |
utils | 32 |
tools | 28 |
methods | 20 |
graphics | 10 |
stats4 | 10 |
grDevices | 7 |
tcltk | 6 |
splines | 5 |
parallel | 3 |
As one could expect, many help pages from datasets are not linked from other documentation pages. Surprisingly, grid is the next package with the most help pages without links from other base packages, followed by base itself. It would be interesting to see which help pages these are, as connecting them to other topics might help users find out what R is capable of.
Package | Help pages |
---|---|
datasets | 71 |
base | 20 |
grid | 13 |
stats | 12 |
tools | 12 |
utils | 11 |
grDevices | 6 |
methods | 2 |
parallel | 2 |
splines | 2 |
stats4 | 2 |
tcltk | 2 |
compiler | 1 |
graphics | 1 |
Similar to the table above, there are many help pages that do not link to other help pages.
But having links to and from help pages is not enough for discovering them. They could form a closed graph, where some pages are unreachable from the others.
Apparently there are some help pages in base R that don’t have links to navigate to the rest of the documentation. There are 4 isolated clusters of help pages. These are tools:toHTML.Rd, tools:HTMLheader.Rd, grid:absolute.size.Rd, grid:widthDetails.Rd, grid:gridCoords.Rd, grid:grobCoords.Rd. These help pages would be easier to find if they were at least linked from other help pages in base R.
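Isolated clusters are the connected components of the link graph when the direction of the links is ignored. A graph package would be the usual tool, but the idea can be sketched in base R with simple label propagation; the function name components() and the toy pages are hypothetical.

```r
# Label connected components of an undirected graph given as an edge list.
# Every node starts with its own label; each pass copies the smaller label
# across every edge until nothing changes.
components <- function(nodes, from, to) {
  label <- seq_along(nodes)
  names(label) <- nodes
  repeat {
    changed <- FALSE
    for (i in seq_along(from)) {
      m <- min(label[[from[i]]], label[[to[i]]])
      if (label[[from[i]]] != m || label[[to[i]]] != m) {
        label[[from[i]]] <- m
        label[[to[i]]] <- m
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  split(names(label), label)
}

# Toy graph: two linked groups and one page without any link
nodes <- c("a.Rd", "b.Rd", "c.Rd", "d.Rd", "e.Rd", "f.Rd")
comps <- components(nodes,
                    from = c("a.Rd", "c.Rd", "d.Rd"),
                    to   = c("b.Rd", "d.Rd", "e.Rd"))
comps
```

On the real data the largest component would hold most of the documentation, and the small ones would be the isolated clusters.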
To CRAN
We can also ask which packages are linked from R that are not part of R.
Package | links | Priority | Maintainer | Core member |
---|---|---|---|---|
MASS | 23 | recommended | Brian Ripley | TRUE |
Matrix | 10 | recommended | Martin Maechler | TRUE |
KernSmooth | 9 | recommended | Brian Ripley | TRUE |
lattice | 8 | recommended | Deepayan Sarkar | TRUE |
nlme | 7 | recommended | R Core Team | TRUE |
cluster | 6 | recommended | Martin Maechler | TRUE |
survival | 3 | recommended | Terry M Therneau | FALSE |
SuppDists | 3 | | Thorsten Pohlert | FALSE |
coin | 3 | | Torsten Hothorn | FALSE |
knitr | 2 | | Yihui Xie | FALSE |
mathjaxr | 2 | | Wolfgang Viechtbauer | FALSE |
vcd | 2 | | David Meyer | FALSE |
nnet | 1 | recommended | Brian Ripley | TRUE |
rpart | 1 | recommended | Beth Atkinson | FALSE |
chron | 1 | | Kurt Hornik | TRUE |
date | 1 | | Kurt Hornik | TRUE |
robustbase | 1 | | Martin Maechler | TRUE |
round | 1 | | Martin Maechler | TRUE |
Kendall | 1 | | A.I. McLeod | FALSE |
multcomp | 1 | | Torsten Hothorn | FALSE |
pcaPP | 1 | | Valentin Todorov | FALSE |
pspearman | 1 | | Petr Savicky | FALSE |
We can see that some packages linked from base R are recommended packages, and many are maintained by R core members or by the whole R core group. The other links go to some popular packages and to others that are not so popular.
CRAN packages
Now that we have explored the base R packages we can also explore CRAN’s help pages.
Aliases
As before, we start by exploring CRAN aliases. Which packages have the most aliases?
Package | Aliases |
---|---|
aroma.affymetrix | 2158 |
Matrix | 1997 |
VGAM | 1948 |
CVXR | 1723 |
paws.management | 1580 |
paws.security.identity | 1580 |
aroma.core | 1566 |
paws.compute | 1429 |
paws.analytics | 1349 |
photobiology | 1319 |
The package with the most aliases has more than 2000! On the other side there are more than 800 packages (835) with only one alias.
I’m surprised to see so many packages with just one alias. This might indicate packages that try to do just one thing, while more complex packages will have more aliases and maybe also more help pages.
We can also find those targets that are duplicated on CRAN:
The most repeated topic is the magrittr pipe %>% followed by reexports (curious, as usually the pipe is also re-exported).
We can see some common methods like plot, print and show, and some that are related to dplyr and shiny, mixed with some methods used for statistics.
Help pages
Similar to the aliases on CRAN, we can check what happens with help pages:
Help page |
---|
1573 |
1573 |
1422 |
1342 |
1283 |
1091 |
970 |
911 |
901 |
874 |
The packages with the most help pages have more than 800 each! On the other side there are more than 1000 packages (1086) with only one help page.
Overall there doesn’t seem to be a distribution pattern. Packages end up with as many aliases as maintainers see fit.
We can check the ratio of alias per help page and number of help pages by package:
It is clear that more help pages tend to come with more topics, but there are some packages with a low number of help pages and a high number of aliases (those in green on the plot above). These packages tend to have 5 or more aliases per help page.
However, this is not the norm: the median number of help pages is 11 and the median number of topics is 14. So CRAN packages usually tend to be much smaller.
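The ratio and the medians can be computed from a flat alias table with one row per alias. The sketch below uses toy packages and hypothetical column names, not the real CRAN data.

```r
# Toy alias table: one row per alias
alias_df <- data.frame(
  Package = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB", "pkgC"),
  Source  = c("a.Rd", "a.Rd", "b.Rd", "x.Rd", "x.Rd", "only.Rd"),
  Target  = c("a", "a2", "b", "x", "x-methods", "only"),
  stringsAsFactors = FALSE
)

# Aliases and distinct help pages per package
n_alias <- tapply(alias_df$Target, alias_df$Package, length)
n_pages <- tapply(alias_df$Source, alias_df$Package,
                  function(s) length(unique(s)))

summary_df <- data.frame(
  Package = names(n_alias),
  aliases = as.vector(n_alias),
  help_pages = as.vector(n_pages)
)
summary_df$ratio <- summary_df$aliases / summary_df$help_pages
summary_df

# Typical package size
median(summary_df$help_pages)
median(summary_df$aliases)
```

Packages with a high ratio are the ones that document many topics on few pages.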
Cross references
So far we have analyzed CRAN help pages by the number of aliases and help pages they have. While this is interesting, I think checking what they link to can be more revealing.
Similar to what we did with R cross-references we first start finding links to base packages and then we’ll search for links across CRAN packages.
Now that I have corrected many issues I’ll check if it matches CRAN’s output.
Comparing links with CRAN’s checks
We can compare the data with CRAN’s package details of issues found related to cross-references.
There are 1876 packages with issues in “Rd cross-references” across all flavors. These can be found online too, for example for r-devel on Debian (pick your flavor).
There are 176 packages that currently have cross-reference notes and could be improved by using topics in the cross-references (instead of file names). There are 1159 more, currently not detected by CRAN, that could be improved the same way: by using the topics of other CRAN packages instead of their file names.
Further analysis
Pages with links
Link from | References to | Help pages | perc |
---|---|---|---|
FALSE | FALSE | 220340 | 47.5% |
TRUE | TRUE | 136230 | 29.4% |
TRUE | FALSE | 73727 | 15.9% |
FALSE | TRUE | 33351 | 7.2% |
CRAN help pages, in contrast with base help pages, have far fewer references to other help pages (see Table 9). Almost 50% neither receive references nor link to other help pages, and the next biggest group is those that have links and receive references. Surprisingly, there are some help pages that receive references but do not link to other help pages.
Most of CRAN’s help pages are not linked, not even from their own package, and do not have links. They are followed by those with links to and from the help page, then by those that have links but are not linked from elsewhere. Last come those that are linked but do not have links themselves.
Links to base
First we can check the number of links to base R:
There are many packages that link to base R a lot, but most of them link to it only a few times.
From the above plot, I find it surprising that there are so few links to the utils package. Perhaps looking at which files are linked could help.
As previously reported, there are many links to different help pages across packages.
Many of the most popular pages are also linked from many different packages.
Surprisingly, all links to the compiler package are to a single page: help("compile", package = "compiler").
CRAN packages
Now it is time to check for CRAN packages.
There are 1387 packages with links that lack a destination R documentation page. There are also 11684 that miss anchors; the majority of these links are to their own R documentation help pages. There are 695 packages that link to a package that is not among their dependencies: these could be packages using “LinkingTo”, but they could also be packages linking to archived packages. There are 1335 packages that link to a file instead of a topic, although this is not forbidden (yet).
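The check for links to undeclared packages can be sketched as below. The declared dependencies, the xrefs table and its column names are toy assumptions; in practice the declared packages would come from the ‘Depends’, ‘Imports’, ‘Suggests’ and ‘Enhances’ fields of each DESCRIPTION file.

```r
# Packages shipped with R, which can always be linked
base_pkgs <- c("base", "compiler", "datasets", "graphics", "grDevices",
               "grid", "methods", "parallel", "splines", "stats", "stats4",
               "tcltk", "tools", "utils")

# Toy declared dependencies per package
declared <- list(pkgA = c("stats", "dplyr"), pkgB = character(0))

# Toy cross-references: the package holding the link and the anchor used
xrefs <- data.frame(
  Package = c("pkgA", "pkgA", "pkgB"),
  Anchor  = c("dplyr:filter", "ggplot2", "=foo"),
  stringsAsFactors = FALSE
)

# Only anchors of the form pkg or pkg:alias name a package
has_pkg <- xrefs$Anchor != "" & !startsWith(xrefs$Anchor, "=")
anchor_pkg <- ifelse(has_pkg, sub(":.*$", "", xrefs$Anchor), NA_character_)

# A link is fine if it targets the package itself, base R or a declared package
allowed <- mapply(
  function(pkg, dep) dep %in% c(pkg, base_pkgs, declared[[pkg]]),
  xrefs$Package, anchor_pkg, USE.NAMES = FALSE
)
undeclared <- xrefs[has_pkg & !allowed, ]
undeclared
```

A package found this way may still be legitimate, for instance if it is listed under “LinkingTo”, so the result is a list of candidates to review rather than a list of errors.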
We can look at which packages CRAN packages link to (excluding themselves):
There are a few packages tightly linked together (those that start with paws.*) and then there are the base packages, linked by many different packages.
If we focus more on the most common packages we might find some other common practices:
But who is linking to these packages?
There are many packages that link to few packages but from many help pages. There are some that link to many packages from relatively few help pages. There is broom that links to many distinct packages from many help pages.
Putting this together we can explore which packages are more connected: those that provide links, those that receive them, and those that do both or neither.
Receives_links | Provides_links | Packages |
---|---|---|
FALSE | FALSE | 11056 |
FALSE | TRUE | 7466 |
TRUE | TRUE | 2578 |
TRUE | FALSE | 743 |
Most packages neither provide links to external packages nor receive them. Next come those that link to other packages but are not linked back, followed by those that both link to other packages and are linked to. Last, there are those packages that receive links but do not link to other packages.
broom is a package that receives few links from other packages but links to multiple packages.
CRAN’s packages tend to have more links to other packages than base R.
Conclusions
As a summary of the analysis of the links and aliases, about base:
- Help page most linked to from base: options.Rd
- Help page most linked to from CRAN: character.Rd
- Help page with the most links: par.Rd
- Package most linked to from CRAN: base
- Page linked from the most different packages: par.Rd
- Package with the most links: base
- Package with the fewest links: compiler
About CRAN packages:
- Package with the most help pages: paws.management
- Package most linked to by other packages: ggplot2
- Package help page with the most links: spatstat-package.Rd
- Most repeated help page: reexports
- Help page most linked to: R6Class.Rd
- Package with the most links to other packages: paws
- Package with the most links: keras3
- Most repeated topic: %>%
- Links I couldn’t resolve: 5358
- Packages with link related problems:
- Packages linking to Bioconductor:
Wrong assumptions I had about cross-references:
- All packages will have one link to other help pages.
- One cannot link to a file.
- Base packages wouldn’t link to CRAN packages.
I don’t think I was ever aware of the compiler package.
Reproducibility
Session info
## - Session info -------------------------------------------------------------------------------------------------------
## setting value
## version R Under development (unstable) (2024-12-20 r87452)
## os Ubuntu 24.04.1 LTS
## system x86_64, linux-gnu
## ui X11
## language en
## collate C
## ctype C
## tz Europe/Madrid
## date 2025-01-01
## pandoc 3.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
## quarto 1.4.553 @ /usr/local/bin/quarto
##
## - Packages -----------------------------------------------------------------------------------------------------------
## package * version date (UTC) lib source
## BiocManager 1.30.25 2024-08-28 [1] CRAN (R 4.5.0)
## blogdown 1.19 2024-02-01 [1] CRAN (R 4.5.0)
## bookdown 0.41 2024-10-16 [1] CRAN (R 4.5.0)
## bslib 0.8.0 2024-07-29 [1] CRAN (R 4.5.0)
## cachem 1.1.0 2024-05-16 [1] CRAN (R 4.5.0)
## cli 3.6.3 2024-06-21 [1] CRAN (R 4.5.0)
## codetools 0.2-20 2024-03-31 [2] CRAN (R 4.5.0)
## colorspace 2.1-1 2024-07-26 [1] CRAN (R 4.5.0)
## digest 0.6.37 2024-08-19 [1] CRAN (R 4.5.0)
## dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.5.0)
## evaluate 1.0.1 2024-10-10 [1] CRAN (R 4.5.0)
## fansi 1.0.6 2023-12-08 [1] CRAN (R 4.5.0)
## farver 2.1.2 2024-05-13 [1] CRAN (R 4.5.0)
## fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.5.0)
## forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.5.0)
## generics 0.1.3 2022-07-05 [1] CRAN (R 4.5.0)
## ggforce 0.4.2 2024-02-19 [1] CRAN (R 4.5.0)
## ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.5.0)
## ggraph * 2.2.1 2024-03-07 [1] CRAN (R 4.5.0)
## ggrepel * 0.9.6 2024-09-07 [1] CRAN (R 4.5.0)
## glue 1.8.0 2024-09-30 [1] CRAN (R 4.5.0)
## graphlayouts 1.2.1 2024-11-18 [1] CRAN (R 4.5.0)
## gridExtra 2.3 2017-09-09 [1] CRAN (R 4.5.0)
## gtable 0.3.6 2024-10-25 [1] CRAN (R 4.5.0)
## htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.5.0)
## igraph 2.1.1 2024-10-19 [1] CRAN (R 4.5.0)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.5.0)
## jsonlite 1.8.9 2024-09-20 [1] CRAN (R 4.5.0)
## knitr * 1.49 2024-11-08 [1] CRAN (R 4.5.0)
## labeling 0.4.3 2023-08-29 [1] CRAN (R 4.5.0)
## lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.5.0)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.5.0)
## MASS 7.3-61 2024-06-13 [2] CRAN (R 4.5.0)
## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.5.0)
## munsell 0.5.1 2024-04-01 [1] CRAN (R 4.5.0)
## pillar 1.9.0 2023-03-22 [1] CRAN (R 4.5.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.5.0)
## polyclip 1.10-7 2024-07-23 [1] CRAN (R 4.5.0)
## purrr 1.0.2 2023-08-10 [1] CRAN (R 4.5.0)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.5.0)
## Rcpp 1.0.13-1 2024-11-02 [1] CRAN (R 4.5.0)
## rlang 1.1.4 2024-06-04 [1] CRAN (R 4.5.0)
## rmarkdown 2.29 2024-11-04 [1] CRAN (R 4.5.0)
## rstudioapi 0.17.1 2024-10-22 [1] CRAN (R 4.5.0)
## sass 0.4.9 2024-03-15 [1] CRAN (R 4.5.0)
## scales 1.3.0 2023-11-28 [1] CRAN (R 4.5.0)
## sessioninfo 1.2.2.9000 2024-09-15 [1] Github (r-lib/sessioninfo@442a686)
## tibble 3.2.1 2023-03-20 [1] CRAN (R 4.5.0)
## tidygraph * 1.3.1 2024-01-30 [1] CRAN (R 4.5.0)
## tidyr 1.3.1 2024-01-24 [1] CRAN (R 4.5.0)
## tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.5.0)
## tweenr 2.0.3 2024-02-26 [1] CRAN (R 4.5.0)
## utf8 1.2.4 2023-10-22 [1] CRAN (R 4.5.0)
## vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.5.0)
## viridis 0.6.5 2024-01-29 [1] CRAN (R 4.5.0)
## viridisLite 0.4.2 2023-05-02 [1] CRAN (R 4.5.0)
## withr 3.0.2 2024-10-28 [1] CRAN (R 4.5.0)
## xfun 0.49 2024-10-31 [1] CRAN (R 4.5.0)
## yaml 2.3.10 2024-07-26 [1] CRAN (R 4.5.0)
##
## [1] /home/lluis/bin/R/4.5
## [2] /opt/R/devel/lib/R/library
## * -- Packages attached to the search path.
##
## ----------------------------------------------------------------------------------------------------------------------
1. By setting _R_CHECK_XREFS_NOTE_MISSING_PACKAGE_ANCHORS_=true as an environment variable.
2. But using links to files is still fine.
3. There is also \linkS4class{abc}, which expands to \link[=abc-class]{abc} for S4 classes.
4. So there are three names for (almost) the same thing: topic, alias and anchor. Anchors do not need to start with = but should be resolvable.
5. Yes, even though the documentation makes it clear that this shouldn’t be done, the help system works around this issue.