In this case study, we’ll work through an application of reasonable complexity, turning its slowest operations into futures/promises and modifying all the downstream reactive expressions and outputs to deal with promises.
As a web service increases in popularity, so does the number of rogue scripts that abuse it for no apparent reason.
—Cheng’s Law of Why We Can’t Have Nice Things
I first noticed this in 2011, when the then-new RStudio IDE was starting to gather steam. We had a dashboard that tracked how often RStudio was being downloaded, and the numbers were generally tracking smoothly upward. But once every few months, we’d have huge spikes in the download counts, ten times greater than normal—and invariably, we’d find that all of the unexpected increase could be tracked to one or two IP addresses.
For hours or days we’d be inundated with thousands of downloads per
hour, then just as suddenly, they’d cease. I didn’t know what was
happening then, and I still don’t know today. Was it the world’s least
competent denial-of-service attempt? Did someone write a download script
with an accidental while (TRUE)
around it?
Our application will let us examine downloads from CRAN for this kind of behavior. For any given day on CRAN, we’ll see what the top downloaders are and how they’re behaving.
RStudio maintains the popular 0-Cloud
CRAN mirror, and
the log files it generates are freely available at http://cran-logs.rstudio.com/. Each day is a separate
gzipped CSV file, and each row is a single package download. For
privacy, IP addresses are anonymized by substituting each day’s IP
addresses with unique integer IDs.
Here are the first few lines of http://cran-logs.rstudio.com/2018/2018-05-26.csv.gz :
"date","time","size","r_version","r_arch","r_os","package","version","country","ip_id"
"2018-05-26","20:42:23",450377,"3.4.4","x86_64","linux-gnu","lubridate","1.7.4","NL",1
"2018-05-26","20:42:30",484348,NA,NA,NA,"homals","0.9-7","GB",2
"2018-05-26","20:42:21",98484,"3.3.1","x86_64","darwin13.4.0","miniUI","0.1.1.1","NL",1
"2018-05-26","20:42:27",518,"3.4.4","x86_64","linux-gnu","RCurl","1.95-4.10","US",3
Fortunately for our purposes, there’s no need to analyze these logs at a high level to figure out which days are affected by badly behaved download scripts. These CRAN mirrors are popular enough that, according to Cheng’s Law, there should be plenty of rogue scripts hitting it every day of the year.
The app I built to explore this data, cranwhales, let us examine the behavior of the top downloaders (“whales”) for any given day, at varying levels of detail. You can view this app live at https://gallery.shinyapps.io/cranwhales/, or download and run the code yourself at https://github.com/rstudio/cranwhales.
When the app starts, the “All traffic” tab shows you the number of package downloads per hour for all users vs. whales. In this screenshot, you can see the proportion of files downloaded by the top six downloaders on May 28, 2018. It may not look like a huge fraction at first, but keep in mind, we are only talking about six downloaders out of 52,815 total!
The “Biggest whales” tab simply shows the most prolific downloaders, with their number of downloads performed. Each anonymized IP address has been assigned an easier-to-remember name, and you can also see the country code of the original IP address.
The “Whales by hour” tab shows the hourly download counts for each
whale individually. In this screenshot, you can see that the
Netherlands’ relieved_snake
downloaded at an extremely
consistent rate during the whole day, while the American
curly_capabara
was active only during business hours in
Eastern Standard Time. Still others, like colossal_chicken
out of Hong Kong, was busy all day but at varying rates.
The “Detail View” has perhaps the most illuminating information. It
lets you view every download made by a given whale on the day in
question. The x dimension is time and the y dimension is what package
they downloaded, so you can see at a glance exactly how many packages
were downloaded, and how their various package downloads relate to each
other. In this case, relieved_snake
downloaded 104
different packages, in the same order, continuously, for the entire
day.