Social activities on GitHub

In my last post I explored the Bioconductor submissions, using {gh} to retrieve some data. After some feedback from the Bioconductor community, I realized I should download other kinds of data to improve my analysis of the reviews.

To do this, I developed a new package to retrieve information from GitHub.

socialGH

This package, based on {gh}, retrieves data from GitHub.

You can install it with:

remotes::install_github("llrs/socialGH")

Basically, it pulls the data in list format and transforms it into a data.frame, so that you can filter and analyze it.
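To give an idea of what that transformation involves, here is a minimal sketch in base R. It is not the package's actual code: `issues_raw` is a made-up, heavily simplified stand-in for a parsed API response, and the column names only mimic the ones shown later in this post.

```r
# Hypothetical, simplified stand-in for a parsed GitHub API response:
# one list element per issue (made-up values).
issues_raw <- list(
  list(number = 1L, state = "closed", comments = 2L,
       labels = list(list(name = "Post"), list(name = "CRAN"))),
  list(number = 2L, state = "open", comments = 0L,
       labels = list())
)

# Atomic fields become regular columns.
issues_df <- data.frame(
  id = vapply(issues_raw, function(x) x$number, integer(1)),
  state = vapply(issues_raw, function(x) x$state, character(1)),
  n_comments = vapply(issues_raw, function(x) x$comments, integer(1)),
  stringsAsFactors = FALSE
)

# Fields with a variable number of values (labels) are kept as a list-column:
# one character vector per issue, possibly empty.
issues_df$label <- lapply(issues_raw, function(x) {
  vapply(x$labels, function(l) l$name, character(1))
})

# Count how often each label is used across issues.
table(unlist(issues_df$label))
```

Keeping the labels as a list-column means one row per issue is preserved, while the labels can still be flattened with `unlist()` when needed.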

library("socialGH")
library("tidyverse")
## ── Attaching packages ───────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.1     ✓ purrr   0.3.4
## ✓ tibble  3.0.1     ✓ dplyr   1.0.0
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

It allows you to selectively download comments, pull requests, issues, events, labels and the timeline of an issue.

With the issues we can see the labels, the number of comments and much more information:

issues_blog <- get_issues("llrs/blogR")
dim(issues_blog)
## [1] 34 15
colnames(issues_blog)
##  [1] "assignees"   "assignee"    "label"       "state"       "locked"     
##  [6] "milestone"   "n_comments"  "title"       "created"     "updated"    
## [11] "association" "text"        "id"          "closer"      "poster"
# Labels used
issues_blog %>% 
  pull(label) %>% 
  unlist(FALSE, FALSE) %>% 
  table()
## .
##  b101nfo.blogspot.com          Bioconductor                  CRAN 
##                     1                     3                     2 
##           help wanted               invalid               package 
##                     1                     1                     1 
##                  Post todo 🗒 
##                    23                     3
count(issues_blog, state)
##    state  n
## 1 closed 19
## 2   open 15
count(issues_blog, n_comments)
##   n_comments  n
## 1          0 20
## 2          1  9
## 3          2  2
## 4          3  3

However, it doesn’t retrieve the individual comments of each issue. For those we need get_comments():

# Issues with comments
issues_blog %>% 
  filter(n_comments > 0) %>% 
  pull(id)
##  [1] 34 29 28 25 23 16 10  9  8  6  5  4  3  2

comments <- get_comments("llrs/blogR")
dim(comments)
## [1] 22  6
colnames(comments)
## [1] "text"        "created"     "updated"     "association" "id"         
## [6] "commenter"
count(comments, association)
##   association  n
## 1       OWNER 22

We can see that I was the only one commenting on the issues, and that the text of the comments has already been retrieved.

We can also look for events on issues:

events <- get_events("llrs/blogR", 23)
count(events, event)
##     event n
## 1 labeled 1
## 2  closed 1

In all the functions you can provide an issue number to retrieve the information just for that issue. If you don’t provide one, it will search the whole repository:

events <- get_events("llrs/blogR")
count(events, event)
##        event  n
## 1     closed 19
## 2   assigned  6
## 3 subscribed  3
## 4  mentioned  3
## 5    labeled 36
## 6  unlabeled  1
## 7    renamed  3

However, it is better to look at the timeline of an issue, which also downloads each comment of the issue:

gt <- get_timelines("llrs/blogR", 23)
gt[, c("label", "event", "created", "association", "poster")]
##   label     event             created association            poster
## 1  NULL commented 2020-02-14 00:39:47       OWNER llrs, User, FALSE
## 2  NULL commented 2020-02-14 09:44:26       OWNER llrs, User, FALSE
## 3  Post   labeled 2020-02-18 10:10:35        <NA>              NULL
## 4  NULL commented 2020-02-29 17:58:51       OWNER llrs, User, FALSE
## 5  NULL    closed 2020-02-29 17:58:51        <NA>              NULL

With the timeline we don’t get the initial information about when the issue was created, so we’ll need to call get_issues("llrs/blogR", 23) to find that out. Here I omitted the text of the comments to keep the output readable, but we can see what has been happening, who did it and who is affected.
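As a rough sketch of how the two pieces could be combined, the snippet below prepends an "opened" row to a timeline and measures how long each event took. The data frames are made-up stand-ins (not real output from the package), and only the `event` and `created` column names are borrowed from the output above.

```r
# Hypothetical timeline of one issue (made-up values).
timeline <- data.frame(
  event = c("commented", "labeled", "closed"),
  created = as.POSIXct(
    c("2020-01-02 10:00:00", "2020-01-03 09:30:00", "2020-01-05 18:00:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical creation time, as it would come from the issue itself.
opened <- data.frame(
  event = "opened",
  created = as.POSIXct("2020-01-01 08:00:00", tz = "UTC"),
  stringsAsFactors = FALSE
)

# Put the opening first and keep everything in chronological order.
full <- rbind(opened, timeline)
full <- full[order(full$created), ]

# Days elapsed since the issue was opened, for every event.
full$days_since_opened <- as.numeric(
  difftime(full$created, full$created[1], units = "days")
)
full
```

With the opening included, questions such as how long an issue stayed open can be answered from a single table.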

Learning

While developing this package I learned more about the {gh} package (in the previous post I manually wrote the calls to the different pages, which I later discovered {gh} handles automatically). I also learned that the accept header influences how much information is returned (and that you cannot pass several accept headers at the same time).
I hope to learn more about the R community that is using GitHub as a way to help each other and to improve packages and processes.
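
For illustration, this is roughly how both points look when calling {gh} directly. It is only a sketch and not necessarily how socialGH does it: `fetch_timeline()` is a hypothetical helper, and the preview media type is my assumption of what the timeline endpoint required at the time of writing. Running it needs the {gh} package, network access and preferably a GitHub token.

```r
# Hypothetical helper, not part of socialGH.
fetch_timeline <- function(owner, repo, issue) {
  gh::gh(
    "GET /repos/{owner}/{repo}/issues/{issue_number}/timeline",
    owner = owner,
    repo = repo,
    issue_number = issue,
    # Let {gh} follow the pagination links instead of requesting pages by hand.
    .limit = Inf,
    # A single accept header per request (assumed preview media type).
    .accept = "application/vnd.github.mockingbird-preview+json"
  )
}

# Example call (not run here, it requires network access):
# tl <- fetch_timeline("llrs", "blogR", 23)
# length(tl)
```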

Reproducibility

## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.6.3 (2020-02-29)
##  os       Ubuntu 20.04 LTS            
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Europe/Madrid               
##  date     2020-07-09                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
##  backports     1.1.8   2020-06-17 [1] CRAN (R 3.6.3)
##  blob          1.2.1   2020-01-20 [1] CRAN (R 3.6.1)
##  blogdown      0.19    2020-05-22 [1] CRAN (R 3.6.3)
##  bookdown      0.19    2020-05-15 [1] CRAN (R 3.6.3)
##  broom         0.5.6   2020-04-20 [1] CRAN (R 3.6.1)
##  cellranger    1.1.0   2016-07-27 [1] CRAN (R 3.6.0)
##  cli           2.0.2   2020-02-28 [1] CRAN (R 3.6.1)
##  colorspace    1.4-1   2019-03-18 [1] CRAN (R 3.6.0)
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
##  curl          4.3     2019-12-02 [1] CRAN (R 3.6.1)
##  DBI           1.1.0   2019-12-15 [1] CRAN (R 3.6.1)
##  dbplyr        1.4.4   2020-05-27 [1] CRAN (R 3.6.3)
##  digest        0.6.25  2020-02-23 [1] CRAN (R 3.6.1)
##  dplyr       * 1.0.0   2020-05-29 [1] CRAN (R 3.6.3)
##  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 3.6.3)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)
##  fansi         0.4.1   2020-01-08 [1] CRAN (R 3.6.1)
##  forcats     * 0.5.0   2020-03-01 [1] CRAN (R 3.6.1)
##  fs            1.4.1   2020-04-04 [1] CRAN (R 3.6.1)
##  generics      0.0.2   2018-11-29 [1] CRAN (R 3.6.0)
##  ggplot2     * 3.3.1   2020-05-28 [1] CRAN (R 3.6.3)
##  gh            1.1.0   2020-01-24 [1] CRAN (R 3.6.1)
##  glue          1.4.1   2020-05-13 [1] CRAN (R 3.6.3)
##  gtable        0.3.0   2019-03-25 [1] CRAN (R 3.6.0)
##  haven         2.3.1   2020-06-01 [1] CRAN (R 3.6.3)
##  hms           0.5.3   2020-01-08 [1] CRAN (R 3.6.1)
##  htmltools     0.5.0   2020-06-16 [1] CRAN (R 3.6.3)
##  httr          1.4.1   2019-08-05 [1] CRAN (R 3.6.0)
##  jsonlite      1.7.0   2020-06-25 [1] CRAN (R 3.6.3)
##  knitr         1.28    2020-02-06 [1] CRAN (R 3.6.1)
##  lattice       0.20-41 2020-04-02 [1] CRAN (R 3.6.1)
##  lifecycle     0.2.0   2020-03-06 [1] CRAN (R 3.6.1)
##  lubridate     1.7.9   2020-06-08 [1] CRAN (R 3.6.3)
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)
##  modelr        0.1.8   2020-05-19 [1] CRAN (R 3.6.3)
##  munsell       0.5.0   2018-06-12 [1] CRAN (R 3.6.0)
##  nlme          3.1-148 2020-05-24 [1] CRAN (R 3.6.3)
##  pillar        1.4.4   2020-05-05 [1] CRAN (R 3.6.3)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 3.6.0)
##  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 3.6.1)
##  R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.1)
##  Rcpp          1.0.4.6 2020-04-09 [1] CRAN (R 3.6.3)
##  readr       * 1.3.1   2018-12-21 [1] CRAN (R 3.6.0)
##  readxl        1.3.1   2019-03-13 [1] CRAN (R 3.6.0)
##  reprex        0.3.0   2019-05-16 [1] CRAN (R 3.6.0)
##  rlang         0.4.6   2020-05-02 [1] CRAN (R 3.6.3)
##  rmarkdown     2.2     2020-05-31 [1] CRAN (R 3.6.3)
##  rstudioapi    0.11    2020-02-07 [1] CRAN (R 3.6.1)
##  rvest         0.3.5   2019-11-08 [1] CRAN (R 3.6.1)
##  scales        1.1.1   2020-05-11 [1] CRAN (R 3.6.3)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
##  socialGH    * 0.0.1   2020-07-08 [1] local         
##  stringi       1.4.6   2020-02-17 [1] CRAN (R 3.6.3)
##  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 3.6.0)
##  tibble      * 3.0.1   2020-04-20 [1] CRAN (R 3.6.1)
##  tidyr       * 1.1.0   2020-05-20 [1] CRAN (R 3.6.3)
##  tidyselect    1.1.0   2020-05-11 [1] CRAN (R 3.6.3)
##  tidyverse   * 1.3.0   2019-11-21 [1] CRAN (R 3.6.1)
##  vctrs         0.3.1   2020-06-05 [1] CRAN (R 3.6.3)
##  withr         2.2.0   2020-04-20 [1] CRAN (R 3.6.1)
##  xfun          0.14    2020-05-20 [1] CRAN (R 3.6.3)
##  xml2          1.3.2   2020-04-23 [1] CRAN (R 3.6.1)
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 3.6.1)
## 
## [1] /home/lluis/R/x86_64-pc-linux-gnu-library/3.6
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library
Lluís Revilla Sancho
Bioinformatician

Bioinformatician with interests in functional enrichment, data integration and transcriptomics.
