Most Starred R Packages on GitHub

Published
11 minute read

It seems like all the best R packages proudly use GitHub and have a README adorned with badges across the top. The recent Microsoft acquisition of GitHub got me wondering: What proportion of current R packages use GitHub? Or at least refer to it in the URL of the package description. Also, what is the relationship between the number of CRAN downloads and the number of stars on a repository? My curiosity got the best of me so I hastily wrote a script to pull the data. Click here to go straight to the full script and data included at the bottom of this post. I acknowledge there are more elegant ways to have coded this, but let’s press on.

Pulling List of Packages & their Details

CORRECTION: I orginally posted this article with code that scraped CRAN to get the list of packages and their metadata. This could violate CRAN Terms and Conditions and generally isn’t responsible. Thanks to Maëlle Salmon (@ma_salmon) I was reminded of how to scrape responsibly and that there is a function, tools::CRAN_package_db(), that returns all the metadata for the current packages on CRAN. I have written another blog post entitled Scraping Responsibly with R that details how to use the robotstxt package from rOpenSci to check a domain’s directive on bots. Without scraping CRAN here is how you get the package metadata:

pkgs <- tools::CRAN_package_db()
# remove duplicate MD5sum column since tibbles can't handle duplicate column names
pkgs <- pkgs[,unique(names(pkgs))]
pkgs %>%
  select(Package, Version, Author, BugReports, URL) %>%
  as.tibble()
#> Warning: `as.tibble()` is deprecated as of tibble 2.0.0.
#> Please use `as_tibble()` instead.
#> The signature and semantics have changed, see `?as_tibble`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
#> # A tibble: 16,070 x 5
#>    Package  Version Author                  BugReports          URL             
#>    <chr>    <chr>   <chr>                   <chr>               <chr>           
#>  1 A3       1.0.0   "Scott Fortmann-Roe"    <NA>                <NA>            
#>  2 aaSEA    1.1.0   "Raja Sekhara Reddy D.… <NA>                <NA>            
#>  3 AATtools 0.0.1   "Sercan Kahveci [aut, … https://github.com… <NA>            
#>  4 ABACUS   1.0.0   "Mintu Nath [aut, cre]" <NA>                https://shiny.a…
#>  5 abbyyR   0.5.5   "Gaurav Sood [aut, cre… http://github.com/… http://github.c…
#>  6 abc      2.1     "Csillery Katalin [aut… <NA>                <NA>            
#>  7 abc.data 1.0     "Csillery Katalin [aut… <NA>                <NA>            
#>  8 ABC.RAP  0.9.0   "Abdulmonem Alsaleh [c… <NA>                <NA>            
#>  9 abcADM   1.0     "Zongjun Liu [aut],\n … <NA>                <NA>            
#> 10 ABCanal… 1.2.1   "Michael Thrun, Jorn L… <NA>                https://www.uni…
#> # … with 16,060 more rows

From there I looked at the package description fields "URL" and "BugReports" to see if either contained “github.com”. It turns out that 3,718 of the packages (29.3% of the total) referenced GitHub. After retrieving the package metadata I pinged the GitHub API to see if I could get the number of stars for the repository. Currently, GitHub allows 5,000 authenticated requests per hour (link), but out of all the packages only 3,718 referenced GitHub, so I could make all the requests at once. Here is the function I used to take a cleaned up version of the package’s URL then form a request to the GitHub API to get star counts:

# get the star count from a clean version of the package's URL
gh_star_count <- function(url){
  stars  <- tryCatch({
    this_url <- gsub("https://github.com/", "https://api.github.com/repos/", url)
    req <- GET(this_url, gtoken)
    stop_for_status(req)
    cont <- content(req)
    cont$stargazers_count  
  }, error = function(e){
    return(NA_integer_)
  })
  return(stars)
}

Analyzing the Data

Once I had all the package detail data, I found that R packages, on average, have 35.7 GitHub stars, but the median number of stars is only 6! ggplot2 has the most stars with 3,174. In my analysis I removed the xgboost, h2o, and feather packages which point to the repository of their implementations in many languages, not just R.

What I really found interesting was comparing CRAN downloads to GitHub repo stars. Using the cranlogs package I was able to get the total package downloads dating back to January 1, 2014. In contrast with the low star counts, the median downloads for R packages is 8,975. Combining stars and downloads data I found that the median R package has 903 downloads per star. Only 38.7% of packages had more than 10 stars, which shows how hard stars are to get even if you’ve written a great package. I’m not sure what proportion of R users frequently reference and contribute to GitHub, but it would be interesting to compare that with the high ratios of downloads to stars.

There are some real outliers in the data. For example, the Rcpp package, perhaps the most downloaded package of all-time, has 15.8M downloads and only 377 stars. Similarly, Hadley’s scales package has 9.4M downloads and only 115 stars. These support/helper packages just don’t get the same star love as the headliners like ggplot2, shiny, and dplyr.

Of course, I could not help but check out the stats for some of the most prolific package authors. After parsing out individuals with the ["aut", "cre"] roles I came to the not so surprising conclusion that Hadley has the most stars of any author with 12,408 stars. In contrast, Dirk Eddelbuettel had one of the lowest star-to-download ratios. For every ~38K downloads Dirk’s repositories will receive one star. Pay no attention to the man behind the curtain since his Rcpp package underpins a whole host of packages without all the GitHub fanfare. Here is a list of popular R package authors and their stats:

Author Notable Packages Downloads Stars Downloads Per Star
Hadley Wickham ggplot2, dplyr, httr 113,160,314 12,408 9,119.9
Dirk Eddelbuettel Rcpp, BH 28,433,586 745 38,165.9
Yihui Xie knitr, rmarkdown, bookdown 42,472,860 6,315 6,725.7
Winston Chang R6, shiny 17,161,005 4,027 4,261.5
Jennifer Bryan readxl, gapminder, googlesheets 6,055,774 1,714 3,533.1
JJ Allaire rstudioapi, reticulate, tensorflow 8,882,553 2,798 3,174.6
Jeroen Ooms jsonlite, curl, openssl 25,907,868 1,483 17,469.9
Scott Chamberlain geojsonio, taxize 1,770,664 2,528 700.4
Jim Hester devtools, memoise, readr 22,867,071 4,332 5,278.6
Kirill Müller tibble, DBI 36,159,009 1,077 33,573.8

I’m sure you could create mixed models to determine the unique download to star relationship for individuals. Also, you could use other package attributes to predict stars or downloads, but I’ll leave that to another curious soul. I will include tables below regarding the top 10 most downloaded, most starred, most and least downloaded per star.

Appendix

Top 10 Most Starred Packages

Name Author Downloads Stars Downloads Per Star
ggplot2 Hadley Wickham 13,001,703 3,174 4,096.3
shiny Winston Chang 4,571,794 2,902 1,575.4
dplyr Hadley Wickham 8,276,844 2,408 3,437.2
devtools Jim Hester 5,536,730 1,645 3,365.8
knitr Yihui Xie 7,131,564 1,581 4,510.8
data.table Matt Dowle 6,005,795 1,457 4,122.0
plotly Carson Sievert 1,195,880 1,255 952.9
rmarkdown Yihui Xie 5,432,495 1,160 4,683.2
tensorflow JJ Allaire 94,856 1,033 91.8
bookdown Yihui Xie 126,586 1,009 125.5

Top 10 Most Downloaded Packages with Stars

Name Author Downloads Stars Downloads Per Star
Rcpp Dirk Eddelbuettel 15,824,781 377 41,975.5
ggplot2 Hadley Wickham 13,001,703 3,174 4,096.3
stringr Hadley Wickham 11,547,828 268 43,088.9
stringi Marek Gagolewski 11,310,113 122 92,705.8
digest Dirk Eddelbuettel with contributions by Antoine Lucas 11,233,244 42 267,458.2
plyr Hadley Wickham 10,340,396 470 22,000.8
R6 Winston Chang 9,993,128 212 47,137.4
reshape2 Hadley Wickham 9,582,245 173 55,388.7
scales Hadley Wickham 9,380,757 115 81,571.8
jsonlite Jeroen Ooms 9,112,790 176 51,777.2

Top 10 Packages by Stars per Download (frequently starred)

Name Author Downloads Stars Downloads Per Star
r2d3 Javier Luraschi 416 235 1.77
workflowr John Blischak 448 169 2.65
goodpractice Hannah Frick 523 192 2.72
xtensor Johan Mabille 2,057 664 3.10
scico Thomas Lin Pedersen 185 59 3.14
shinytest Winston Chang 418 113 3.70
furrr Davis Vaughan 724 171 4.23
pkgdown Hadley Wickham 1,589 332 4.79
rtika Sasha Goodman 168 32 5.25
mindr Peng Zhao 2,051 368 5.57

Bottom 10 Packages by Stars per Download (infrequently starred)

Name Author Downloads Stars Downloads Per Star
mime Yihui Xie 7,398,765 12 616,563.8
pkgmaker Renaud Gaujoux 1,228,173 2 614,086.5
rngtools Renaud Gaujoux 1,224,959 2 612,479.5
magic Robin K. S. Hankin 344,741 1 344,741.0
gsubfn G. Grothendieck 675,056 2 337,528.0
bindrcpp Kirill Müller 2,996,452 10 299,645.2
plogr Kirill Müller 3,343,099 12 278,591.6
digest Dirk Eddelbuettel with contributions by Antoine Lucas 11,233,244 42 267,458.2
munsell Charlotte Wickham 7,778,712 31 250,926.2
proto Hadley Wickham 2,593,246 11 235,749.6

Full Script

Below and available via gist with data at: https://gist.github.com/StevenMMortimer/1b4b626d3d91240a77f969ae04b37114

# load packages & custom functions ---------------------------------------------

library(tidyverse)
library(httr)
library(cranlogs)
library(ggrepel)
library(scales)
library(knitr)
library(stringr)

gh_from_url <- function(x){
  if(!grepl(',', x)){
    x <- strsplit(x, " ")[[1]]
    x <- trimws(x[min(which(grepl(pattern='http://github.com|https://github.com|http://www.github.com', x, ignore.case=TRUE)))])    
  } else {
    x <- strsplit(x, ",")[[1]]
    x <- trimws(x[min(which(grepl(pattern='http://github.com|https://github.com|http://www.github.com', x, ignore.case=TRUE)))])    
  }
  x <- gsub("http://", "https://", tolower(x))
  x <- gsub("www\\.github\\.com", "github.com", x)
  x <- gsub("/$", "", x)
  x <- gsub("^github.com", "https://github.com", x)
  x <- gsub("/issues", "", x)
  x <- gsub("\\.git", "", x)
  return(x)
}

aut_maintainer_from_details <- function(x){
  x <- gsub("'|\"", "", x)
  if(grepl(',', x)){
    x <- strsplit(x, "\\],")[[1]]
    aut_cre_ind <- grepl(pattern='\\[aut, cre|\\[cre, aut|\\[cre', x, ignore.case=TRUE)
    if(any(aut_cre_ind)){
      x <- x[min(which(aut_cre_ind))]
      x <- gsub("\\[aut, cre|\\[cre, aut|\\[cre", "", x)
    }
    x <- strsplit(x, ",")[[1]][1]
    x <- trimws(gsub("\\]", "", x))
    x <- trimws(gsub(" \\[aut", "", x))
  }
  return(x)
}

gh_star_count <- function(url){
  stars  <- tryCatch({
    this_url <- gsub("https://github.com/", "https://api.github.com/repos/", url)
    req <- GET(this_url, gtoken)
    stop_for_status(req)
    cont <- content(req)
    cont$stargazers_count  
  }, error = function(e){
    return(NA_integer_)
  })
  return(stars)
}

# authenticate to github -------------------------------------------------------
# use Hadley's key and secret
myapp <- oauth_app("github",
                   key = "56b637a5baffac62cad9",
                   secret = "8e107541ae1791259e9987d544ca568633da2ebf")
github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)
gtoken <- config(token = github_token)

# pull list of packages --------------------------------------------------------

# get list of currently available packages on CRAN
pkgs <- tools::CRAN_package_db()

# remove duplicate MD5sum column since tibbles can't handle duplicate column names
pkgs <- pkgs[,unique(names(pkgs))]

# filter out lines any duplicates
pkgs <- pkgs %>% 
  rename(Name = Package) %>%
  distinct(Name, .keep_all = TRUE)

# get details for each package -------------------------------------------------

all_pkg_details <- NULL

# old fashioned looping! 
# WARNING: This takes awhile to complete
for(i in 1:nrow(pkgs)){
  
  if(i %% 100 == 0){
    message(sprintf("Processing package #%s out of %s", i, nrow(pkgs)))
  }
  
  this_url <- pkgs[i,]$URL
  on_github <- FALSE
  this_github_url <- NA_character_
  gh_stars <- NA_integer_
  if(!is.null(this_url)){
    on_github <- grepl('http://github.com|https://github.com|http://www.github.com', this_url)
    if(on_github){
      this_github_url <- gh_from_url(this_url)
      gh_stars <- gh_star_count(this_github_url)
    } else {
      # check the BugReports URL as a backup (e.g. shiny package references GitHub this way)
      issues_on_github <- grepl('http://github.com|https://github.com|http://www.github.com', pkgs[i,]$BugReports)
      if(length(issues_on_github) == 0 || !issues_on_github){
        this_github_url <- NA_character_
      } else {
        this_github_url <- gh_from_url(pkgs[i,]$BugReports)
        gh_stars <- gh_star_count(this_github_url)  
        on_github <- TRUE
      }
    }
  } else {
    this_url <- NA_character_
  }
  
  downloads <- cran_downloads(pkgs[i,]$Name, from = "2014-01-01", to = "2018-06-15")
  
  all_pkg_details <- rbind(all_pkg_details, 
                           tibble(name = pkgs[i,]$Name, 
                                  published = pkgs[i,]$Published,
                                  author = aut_maintainer_from_details(pkgs[i,]$Author),
                                  url = this_url,
                                  github_ind = on_github, 
                                  github_url = this_github_url,
                                  downloads = sum(downloads$count),
                                  stars = gh_stars
                           )
  )
}

# basic summary stats ----------------------------------------------------------

# remove observations where the GitHub URL refers to a repository that 
# is not specific to R and therefore might have an inflated star count
all_pkg_details_clean <- all_pkg_details %>% 
  filter(!(name %in% c('xgboost', 'h2o', 'feather'))) %>% 
  mutate(downloads_per_star = downloads / stars, 
         downloads_per_star = ifelse(!is.finite(downloads_per_star), NA_real_, downloads_per_star))

# proportion of all packages listing github
sum(all_pkg_details$github_ind)
mean(all_pkg_details$github_ind)
# proportion of packages with stars
mean(!is.na(all_pkg_details$stars))

# typical number of stars per package
mean(all_pkg_details_clean$stars, na.rm=TRUE)
median(all_pkg_details_clean$stars, na.rm=TRUE)
max(all_pkg_details_clean$stars, na.rm=TRUE)

# typical number of downloads per package
mean(all_pkg_details_clean$downloads, na.rm=TRUE)
median(all_pkg_details_clean$downloads, na.rm=TRUE)

# percent of packages over 10 stars
mean(all_pkg_details_clean$stars > 10, na.rm=TRUE)

mean(all_pkg_details_clean$downloads_per_star, na.rm=TRUE)
median(all_pkg_details_clean$downloads_per_star, na.rm=TRUE)


# stars histogram --------------------------------------------------------------
ggplot(data=all_pkg_details_clean, mapping=aes(stars)) +
  geom_histogram(aes(fill=..count..), bins=60) +
  scale_x_continuous(trans = "log1p", breaks=c(0,1,2,3,10,100,1000,3000)) + 
  labs(x = "Stars", 
       y = "Count", 
       fill = "Count", 
       caption = "Source: api.github.com as of 6/16/18") + 
  ggtitle("Distribution of GitHub Stars on R Packages") + 
  theme_bw() + 
  theme(panel.grid.minor = element_blank(), 
        plot.caption=element_text(hjust = 0))

# stars to downloads scatterplot -----------------------------------------------
plot_dat <- all_pkg_details_clean
idx_label <- which(with(plot_dat, downloads > 10000000 | stars > 1000))
plot_dat$name2 <- plot_dat$name
plot_dat$name <- ""
plot_dat$name[idx_label] <- plot_dat$name2[idx_label]

ggplot(data=plot_dat, aes(stars, downloads, label = name)) +
  geom_point(color = ifelse(plot_dat$name == "", "grey50", "red")) +
  geom_text_repel(box.padding = .5) + 
  scale_y_continuous(labels = comma) + 
  scale_x_continuous(labels = comma) + 
  labs(x = "GitHub Stars", 
       y = "CRAN Downloads", 
       caption = "Sources:\napi.github.com as of 6/16/18\ncranlogs as of 1/1/14 - 6/15/18") + 
  ggtitle("Relationship Between CRAN Downloads and GitHub Stars") + 
  theme_bw() + 
  theme(plot.caption=element_text(hjust = 0))

# author stats -----------------------------------------------------------------

# summary by author
authors_detail <- all_pkg_details_clean %>% 
  group_by(author) %>% 
  summarize(downloads = sum(downloads, na.rm=TRUE), 
            stars = sum(stars, na.rm=TRUE)) %>% 
  mutate(downloads_per_star = downloads / stars, 
         downloads_per_star = ifelse(!is.finite(downloads_per_star), NA_real_, downloads_per_star)) %>% 
  arrange(desc(downloads))

# popular authors
pop_authors <- tibble(author = c('Hadley Wickham', 
                                 'Dirk Eddelbuettel', 
                                 'Yihui Xie', 
                                 'Winston Chang', 
                                 'Jennifer Bryan', 
                                 'JJ Allaire',
                                 'Jeroen Ooms', 
                                 'Scott Chamberlain', 
                                 'Jim Hester', 
                                 'Kirill Müller'), 
                      notable_packages = c('ggplot2, dplyr, httr',
                                           'Rcpp, BH', 
                                           'knitr, rmarkdown, bookdown', 
                                           'R6, shiny', 
                                           'readxl, gapminder, googlesheets', 
                                           'rstudioapi, reticulate, tensorflow', 
                                           'jsonlite, curl, openssl', 
                                           'geojsonio, taxize', 
                                           'devtools, memoise, readr', 
                                           'tibble, DBI')
)
author_stats <- pop_authors %>% 
  inner_join(., authors_detail, by='author') %>%
  select(author, notable_packages, downloads, stars, downloads_per_star) %>% 
  mutate(downloads_per_star = round(downloads_per_star, 1)) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)

# single author 
#all_pkg_details_clean %>% filter(author == 'Dirk Eddelbuettel') %>% arrange(desc(downloads))  

# top 10 lists -----------------------------------------------------------------

# Top 10 Most Starred Packages

top_starred <- all_pkg_details_clean %>% 
  select(name, author, downloads, stars, downloads_per_star) %>%
  arrange(desc(stars)) %>%
  slice(1:10) %>% 
  mutate(downloads_per_star = round(downloads_per_star, 1)) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)

# Top 10 Most Downloaded Packages with stars

top_downloaded <- all_pkg_details_clean %>% 
  filter(!is.na(stars)) %>%
  select(name, author, downloads, stars, downloads_per_star) %>%
  arrange(desc(downloads)) %>%
  slice(1:10) %>% 
  mutate(downloads_per_star = round(downloads_per_star, 1)) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)

# Bottom 10 Packages by Downloads per Star (frequently starred) 

frequently_starred <- all_pkg_details_clean %>% 
  filter(downloads > 100) %>% 
  select(name, author, downloads, stars, downloads_per_star) %>%
  arrange(downloads_per_star) %>%
  slice(1:10) %>% 
  mutate(downloads_per_star = round(downloads_per_star, 2)) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)

# Top 10 Packages by Downloads per Star (infrequently starred) 

infrequently_starred <- all_pkg_details_clean %>% 
  select(name, author, downloads, stars, downloads_per_star) %>%
  arrange(desc(downloads_per_star)) %>%
  slice(1:10) %>% 
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)