Fast correlations


Correlations are among the most commonly used methods. I found several implementations of the Pearson correlation, and I was curious to know which one is the fastest.
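As a reference point for what each implementation computes, here is a minimal base R sketch of the Pearson coefficient written out from its definition. The helper name `pearson_manual` is hypothetical and used only for illustration; it does not come from any of the packages discussed here.

```r
# Hypothetical helper: Pearson correlation from its definition,
# the covariance divided by the product of the standard deviations.
pearson_manual <- function(x, y) {
  dx <- x - mean(x)  # deviations from the mean
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

set.seed(1)
x <- runif(50)
y <- runif(50)
# Should agree with stats::cor up to floating point error
all.equal(pearson_manual(x, y), stats::cor(x, y, method = "pearson"))
```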


So far I have found implementations in the following packages: stats, WGCNA, miRcomb, coop, and HiClimR.


Most of them are on CRAN and one is on GitHub.


If we are only interested in the correlation, most of the dependencies that a package brings aren’t relevant. So how many dependencies does each of those packages have?

ip <- installed.packages()
dp <- tools::package_dependencies(c("stats", "WGCNA", "miRcomb", "coop", "HiClimR"), 
                            db = ip, which = "Imports")
sort(lengths(dp), decreasing = TRUE)
##   WGCNA HiClimR   stats    coop miRcomb 
##      15       5       3       0       0
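
The counts above cover direct imports only. One possible variation is to set `recursive = TRUE` so that the whole import tree is counted. It is sketched here with base packages so that it runs on any installation; `stats` and `utils` are just stand-ins for the packages of interest.

```r
ip <- installed.packages()
pkgs <- c("stats", "utils")  # stand-ins; swap in the packages of interest
direct <- tools::package_dependencies(pkgs, db = ip, which = "Imports")
# recursive = TRUE also follows the imports of the imports
full <- tools::package_dependencies(pkgs, db = ip, which = "Imports",
                                    recursive = TRUE)
data.frame(direct = lengths(direct), recursive = lengths(full))
```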


library("bench")  # provides mark()
x <- runif(50)
y <- runif(50)
stats_cor <- function() {
  stats::cor(x, y, method = "pearson")
}

WGCNA_cor <- function() {
  WGCNA::cor(x, y, method = "pearson")[1, 1]
}

coop_cor <- function() {
  coop::pcor(x, y)
}

HiClimR_cor <- function() {
  # fastCor works on matrices, so bind x and y as columns and pick their correlation
  HiClimR::fastCor(matrix(c(x, y), nrow = 50, ncol = 2), upperTri = FALSE, verbose = FALSE)[2, 1]
}

bm <- mark(stats_cor(), WGCNA_cor(), coop_cor(), HiClimR_cor(), iterations = 10000)
## # A tibble: 4 x 6
##   expression         min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 stats_cor()     30.1µs   32.4µs    24351.    88.6KB     4.87
## 2 WGCNA_cor()     66.3µs   83.7µs     7315.   207.3MB     1.46
## 3 coop_cor()      31.5µs   33.2µs    19694.    82.7KB     1.97
## 4 HiClimR_cor()  115.6µs  126.9µs     5425.   171.1KB     2.17
plot(bm) + ggplot2::theme_bw()
## Loading required namespace: tidyr

So we can see that, for this basic comparison of two vectors, stats is king, followed by coop, WGCNA, and HiClimR.
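
The HiClimR wrapper above binds the two vectors into a two-column matrix and then picks one element of the result. A small sketch of why that is equivalent to the vector call, shown with `stats::cor` only so that it runs without extra packages (the variable names are illustrative):

```r
set.seed(42)
x <- runif(50)
y <- runif(50)
m <- matrix(c(x, y), nrow = 50, ncol = 2)  # x and y as columns
cm <- stats::cor(m)  # 2 x 2 matrix of column correlations
# The off-diagonal entry is the correlation between x and y
all.equal(cm[2, 1], stats::cor(x, y))
```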

It might be that these other functions are optimized for matrices, so let's run the comparison again with a matrix as input:

x <- matrix(runif(100), ncol = 10, nrow = 10)
# y is not used below; every wrapper correlates the columns of x with themselves
y <- matrix(runif(100), ncol = 10, nrow = 10)
stats_cor2 <- function() {
  stats::cor(x, x, method = "pearson")
}

WGCNA_cor2 <- function() {
  WGCNA::cor(x, x, method = "pearson")
}

coop_cor2 <- function() {
  # Assumption: correlate the columns of x, as the other wrappers do
  coop::pcor(x)
}

HiClimR_cor2 <- function() {
  HiClimR::fastCor(x, upperTri = FALSE, verbose = FALSE)
}

bm2 <- mark(stats_cor2(), WGCNA_cor2(), coop_cor2(), HiClimR_cor2(), iterations = 10000)
## # A tibble: 4 x 6
##   expression          min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 stats_cor2()     33.4µs   35.4µs    25493.      848B     2.55
## 2 WGCNA_cor2()     64.2µs   69.4µs     9988.    3.36KB     2.00
## 3 coop_cor2()      30.9µs   33.8µs    18887.    49.8KB     1.89
## 4 HiClimR_cor2()  193.3µs  214.1µs     3466.    5.02KB     2.43
plot(bm2) + ggplot2::theme_bw()

Here we can see that coop takes the lead by median time (33.8µs versus 35.4µs for stats), although stats still completes more iterations per second. WGCNA and HiClimR follow, in that order.
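
The wrappers above mix two calling styles: some pass the matrix twice and some pass it once. A quick check, shown with `stats::cor` only, that both styles return the same column-by-column correlation matrix:

```r
set.seed(123)
x <- matrix(runif(100), ncol = 10, nrow = 10)
two_args <- stats::cor(x, x, method = "pearson")
one_arg <- stats::cor(x, method = "pearson")
all.equal(two_args, one_arg)  # compares the two 10 x 10 matrices
dim(one_arg)
```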



Lluís Revilla Sancho

Bioinformatician with interests in software quality, mostly R.
