Note

The DockerParallel package is still under development. If a package is not available on CRAN, or does not behave as this vignette describes, please consider reinstalling it from my GitHub repository.

Introduction

Parallel computing has become an important tool for analyzing large and complex data. Using the parallel package to create a local computing cluster is probably the simplest and best-known method of parallel computing in R. With the advance of cloud computing, there is a natural need to run R parallel clusters on the cloud to take advantage of its computing power. DockerParallel is a package designed for this purpose. It aims to provide an easy-to-learn, highly scalable and low-cost tool for running R parallel clusters on the cloud.
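
For reference, the local approach mentioned above can be sketched as follows. This is a minimal example that uses only the base parallel package, not DockerParallel; the worker count of 2 and the toy squaring task are arbitrary choices made for illustration.

```r
library(parallel)

# Start a local cluster with two worker processes (arbitrary choice)
cl <- makeCluster(2)

# Distribute a toy task across the workers: square each number
squares <- parLapply(cl, 1:4, function(x) x^2)

# Release the worker processes once the work is done
stopCluster(cl)

print(unlist(squares))
```

A cluster like this is limited to the cores of a single machine, which is the limitation that running the workers on the cloud is meant to remove.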

The core component of DockerParallel, as its name implies, is the Docker container. Containerization is a technique that packages code and all its dependencies into a standard unit and runs it in an environment isolated from the host OS. By containerizing R's worker nodes, DockerParallel can easily deploy hundreds of identical workers in a cloud environment regardless of the host hardware and operating system. In this vignette, we will demonstrate how to use DockerParallel to run a cluster using Amazon Elastic Container Service (ECS). The purpose of this vignette is to introduce the basic usage of the package. For more information, please see the R markdown file developer-cookbook.

The structure of DockerParallel

To understand the structure of DockerParallel, imagine that someone asks you to create an R parallel cluster on the cloud using containers. What questions would you ask before you could deploy the cluster? Generally speaking, the cluster depends on the answers to these three questions:

  1. Which container should be used?
  2. Who provides the container service?
  3. What is the cluster configuration (e.g. worker number, CPU, memory)?

DockerParallel answers these questions via three components: DockerContainer, CloudProvider and CloudConfig. These components are summarized in the following figure: