Case study: converting a Shiny app to async

Joe Cheng (joe@rstudio.com)

In this case study, we’ll work through an application of reasonable complexity, turning its slowest operations into futures/promises and modifying all the downstream reactive expressions and outputs to deal with promises.

Motivation

As a web service increases in popularity, so does the number of rogue scripts that abuse it for no apparent reason.

—Cheng’s Law of Why We Can’t Have Nice Things

I first noticed this in 2011, when the then-new RStudio IDE was starting to gather steam. We had a dashboard that tracked how often RStudio was being downloaded, and the numbers were generally tracking smoothly upward. But once every few months, we’d have huge spikes in the download counts, ten times greater than normal—and invariably, we’d find that all of the unexpected increase could be tracked to one or two IP addresses.

For hours or days we’d be inundated with thousands of downloads per hour, then just as suddenly, they’d cease. I didn’t know what was happening then, and I still don’t know today. Was it the world’s least competent denial-of-service attempt? Did someone write a download script with an accidental while (TRUE) around it?

Our application will let us examine downloads from CRAN for this kind of behavior. For any given day on CRAN, we’ll see what the top downloaders are and how they’re behaving.

Our source data

RStudio maintains the popular 0-Cloud CRAN mirror, and the log files it generates are freely available at http://cran-logs.rstudio.com/. Each day is a separate gzipped CSV file, and each row is a single package download. For privacy, IP addresses are anonymized by replacing each address with an integer ID that is unique within that day’s file.

Here are the first few lines of http://cran-logs.rstudio.com/2018/2018-05-26.csv.gz :

"date","time","size","r_version","r_arch","r_os","package","version","country","ip_id"
"2018-05-26","20:42:23",450377,"3.4.4","x86_64","linux-gnu","lubridate","1.7.4","NL",1
"2018-05-26","20:42:30",484348,NA,NA,NA,"homals","0.9-7","GB",2
"2018-05-26","20:42:21",98484,"3.3.1","x86_64","darwin13.4.0","miniUI","0.1.1.1","NL",1
"2018-05-26","20:42:27",518,"3.4.4","x86_64","linux-gnu","RCurl","1.95-4.10","US",3

Fortunately for our purposes, there’s no need to analyze these logs at a high level to figure out which days are affected by badly behaved download scripts. This CRAN mirror is popular enough that, according to Cheng’s Law, there should be plenty of rogue scripts hitting it every day of the year.
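
To make the log format concrete, here is a minimal sketch in base R that parses the sample rows shown above and counts downloads per ip_id. The rows are embedded as text so the example runs offline. The commented-out lines show one way a full day's file could be fetched; that part is an assumption for illustration and not necessarily how cranwhales loads its data.

```r
# The sample lines from 2018-05-26.csv.gz, embedded as text.
sample_csv <- '"date","time","size","r_version","r_arch","r_os","package","version","country","ip_id"
"2018-05-26","20:42:23",450377,"3.4.4","x86_64","linux-gnu","lubridate","1.7.4","NL",1
"2018-05-26","20:42:30",484348,NA,NA,NA,"homals","0.9-7","GB",2
"2018-05-26","20:42:21",98484,"3.3.1","x86_64","darwin13.4.0","miniUI","0.1.1.1","NL",1
"2018-05-26","20:42:27",518,"3.4.4","x86_64","linux-gnu","RCurl","1.95-4.10","US",3
'

# Each row is one package download.
downloads <- read.csv(text = sample_csv, stringsAsFactors = FALSE)

# For a full day (assumed approach): download the file, then read it.
# read.csv() can read a local gzip-compressed file directly.
# tmp <- tempfile(fileext = ".csv.gz")
# download.file("http://cran-logs.rstudio.com/2018/2018-05-26.csv.gz", tmp, mode = "wb")
# downloads <- read.csv(tmp, stringsAsFactors = FALSE)

# Number of downloads per anonymized IP, most active first.
per_ip <- sort(table(downloads$ip_id), decreasing = TRUE)
print(per_ip)
```

Even in these four rows, ip_id 1 already appears twice. On a full day's file, the same tabulation is what separates the whales from everyone else.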

A tour of the app

The app I built to explore this data, cranwhales, lets us examine the behavior of the top downloaders (“whales”) for any given day, at varying levels of detail. You can view this app live at https://gallery.shinyapps.io/cranwhales/, or download and run the code yourself at https://github.com/rstudio/cranwhales.

When the app starts, the “All traffic” tab shows you the number of package downloads per hour for all users vs. whales. In this screenshot, you can see the proportion of files downloaded by the top six downloaders on May 28, 2018. It may not look like a huge fraction at first, but keep in mind, we are only talking about six downloaders out of 52,815 total!

The “Biggest whales” tab simply shows the most prolific downloaders, with their number of downloads performed. Each anonymized IP address has been assigned an easier-to-remember name, and you can also see the country code of the original IP address.
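
The ranking behind a tab like this can be sketched in a few lines of base R. The data frame below is made up, and top_whales() is a hypothetical helper written for this illustration; the real app may compute its ranking differently and also assigns the friendly names, which this sketch omits.

```r
# Made-up download log: one row per download.
downloads <- data.frame(
  ip_id   = c(1L, 2L, 1L, 3L, 1L, 2L, 4L, 1L, 2L, 5L),
  country = c("NL", "US", "NL", "HK", "NL", "US", "GB", "NL", "US", "DE"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the n most prolific downloaders and their counts.
top_whales <- function(df, n) {
  # Count rows per ip_id, giving a named integer vector.
  counts <- vapply(split(df$ip_id, df$ip_id), length, integer(1))
  counts <- sort(counts, decreasing = TRUE)
  top <- counts[seq_len(min(n, length(counts)))]
  ids <- as.integer(names(top))
  data.frame(
    ip_id     = ids,
    # Country code of the first row seen for each ip_id.
    country   = df$country[match(ids, df$ip_id)],
    downloads = unname(top),
    stringsAsFactors = FALSE
  )
}

whales <- top_whales(downloads, 2)
print(whales)
```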

The “Whales by hour” tab shows the hourly download counts for each whale individually. In this screenshot, you can see that the Netherlands’ relieved_snake downloaded at an extremely consistent rate during the whole day, while the American curly_capabara was active only during business hours in Eastern Standard Time. Still others, like colossal_chicken out of Hong Kong, were busy all day but at varying rates.
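
An hourly breakdown like this can be derived from the time column of the logs. The following is a small sketch using made-up rows and base R; it is not the app's actual code. Using a factor with levels 0 through 23 keeps quiet hours in the result as zeros, which matters when comparing a steady downloader against one that is active only part of the day.

```r
# Made-up rows: download time ("HH:MM:SS") and anonymized IP.
downloads <- data.frame(
  time  = c("09:05:10", "09:40:00", "10:15:30", "10:20:00", "23:59:59", "00:00:01"),
  ip_id = c(1L, 1L, 1L, 2L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Hour of day (0-23), taken from the first two characters of the time string.
downloads$hour <- as.integer(substr(downloads$time, 1, 2))

# Rows are whales, columns are hours; hours with no downloads show up as 0.
hourly <- table(
  ip_id = downloads$ip_id,
  hour  = factor(downloads$hour, levels = 0:23)
)
print(hourly)
```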

The “Detail View” has perhaps the most illuminating information. It lets you view every download made by a given whale on the day in question. The x dimension is time and the y dimension is the package downloaded, so you can see at a glance exactly how many packages were downloaded, and how the various package downloads relate to each other. In this case, relieved_snake downloaded 104 different packages, in the same order, continuously, for the entire day.
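
The "same order, continuously" pattern can also be checked programmatically. The sketch below uses a made-up whale with three packages instead of 104, and a simple test of my own devising: sort the downloads by time, take the packages in first-seen order as the presumed cycle, and check whether the whole sequence is just that cycle repeated.

```r
# Made-up downloads for a single whale, deliberately out of time order.
detail <- data.frame(
  time    = c("00:00:05", "00:00:01", "00:00:09", "00:00:13",
              "00:00:17", "00:00:21", "00:00:25"),
  package = c("ggplot2", "Rcpp", "dplyr", "Rcpp", "ggplot2", "dplyr", "Rcpp"),
  stringsAsFactors = FALSE
)

# Zero-padded "HH:MM:SS" strings sort correctly as plain text.
detail <- detail[order(detail$time), ]

# Distinct packages, in the order they were first downloaded.
cycle <- unique(detail$package)
n_packages <- length(cycle)

# TRUE if the full sequence is the same cycle repeated end to end.
is_loop <- identical(
  detail$package,
  rep(cycle, length.out = nrow(detail))
)

print(n_packages)
print(is_loop)
```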