Using Bioconductor with Amazon Elastic MapReduce

Bioconductor can be used with Amazon Elastic MapReduce to facilitate robust parallelization of computational tasks.

The following tutorial shows how it can be done.

Prerequisites

You must have an Amazon Web Services account and have signed up for Elastic MapReduce, Amazon SimpleDB, and the Simple Storage Service (S3).

Key Pair

You need to generate a key pair using the AWS Console. Click “Create Key Pair” and give the key pair a name. Download the resulting file to a safe place on your hard drive.

On Mac and Unix systems you will need to alter the permissions of the key pair file as follows:

chmod 0400 mykeypair.pem

You only need to do this if you intend to ssh to your Elastic MapReduce instances (recommended for troubleshooting).

Streaming MapReduce Tutorial

In this tutorial we will use the qa() function from Bioconductor’s ShortRead package to perform quality assessment on short read data. (Note that the qa() function also supports other types of parallelization, using the R packages multicore and Rmpi. If you would like to use qa() in this way, refer to the Bioconductor Cloud AMI Documentation.)

Streaming MapReduce

A streaming MapReduce job flow is a task in which the mapper and reducer can be written in any programming language. Input data is streamed to the standard input of the mapper, which prints its output to standard output. That output is then streamed to the reducer, which again writes its output to standard output. Input and output data locations are specified as S3 “buckets”.
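To make the streaming contract concrete, here is a minimal R sketch (illustrative only, not the mapper or reducer used later in this tutorial): a streaming script reads one record per line from standard input and writes its results to standard output, which is all that Hadoop streaming requires of a mapper or reducer.

#!/usr/bin/env Rscript
# Minimal illustration of the streaming contract (not the actual mapper used below):
# read one record per line from standard input, write results to standard output.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  # A real mapper would do useful work here; this one simply echoes its input.
  cat(line, "\n", sep = "")
}
close(con)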

This paradigm is ideal for tasks in which the input and output data are textual. In our case, the data we want to examine are BAM files (though we could use any file type supported by the qa() function; see the ShortRead documentation), and the output is a QA report (a compressed tar file containing html, jpg, and pdf files). So we cheat a little bit by specifying lists of BAM files as our input data, and then let our mapper download the actual BAM files from S3.

Our input files are here and here. (You may want to use the curl utility to view these files, because a web browser will attempt to download them.) They are simply text files, each of which contains the name of a single BAM file; the BAM files themselves live in an S3 bucket called “bioconductor-bamfiles”. (Note that this bucket is a public bucket maintained by the Bioconductor team; other parts of this tutorial require that you provide your own buckets, and bucket names must be unique across all of S3. We’ll make it clear when you need to provide your own bucket names.)

If you wanted to run this tutorial on your own data, you could put more than one filename in each input file, but having a single name in each file ensures that each file will get its own mapper (assuming, of course, that you specify a matching number of instances when you start your job flow.)

(Note: To better browse S3 buckets, we recommend using the S3 Console, or a third party software tool such as s3cmd.)

Another issue is that Elastic MapReduce uses Amazon Machine Images (AMIs) that are very generic and contain outdated software. Luckily, Elastic MapReduce provides Bootstrap Actions, a feature that allows us to install the latest version of R, the ShortRead package and its dependencies, and the s3cmd program for moving files to and from S3. Our bootstrapping script is available here.
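The linked bootstrap script does the real work; purely as an illustration, the package-installation portion of such a script might boil down to R code like the following (the biocLite() installer shown here is an assumption based on the Bioconductor tooling of that era, and the real script also installs R itself and s3cmd):

# Illustrative sketch only; see the actual bootstrap script linked above.
# Install ShortRead and its dependencies on a freshly bootstrapped node.
source("http://bioconductor.org/biocLite.R")   # Bioconductor's installer (assumed)
biocLite("ShortRead")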

Starting the MapReduce Job Flow

There are two ways to launch a MapReduce job flow: through the graphical web console or from a command-line utility that can be downloaded here.

First we will discuss the graphical method, and then show how to start the same job flow from the command line.

Starting a Job Flow with the Elastic MapReduce Console

Visit the Elastic MapReduce Console and click the “Create New Job Flow” button. Fill out the screen as follows.

Click “Continue”.

Fill out the screen as follows, with the changes noted below:

Note: The output location should NOT be what is shown here, but should be the name of a bucket that does not exist yet. Bucket names must be unique across all of S3, so it can be useful to include a uniquifying string, such as your birthdate, in the bucket name. For example, “mybucket-19710223”. Bucket names should be all lowercase and contain only letters, numbers, and the dash (“-”) character. Elastic MapReduce will fail if the output bucket already exists, so use the S3 Console to delete your output bucket, if it already exists, before creating your new job flow.

Click “Continue”.

Fill out the screen as follows, with the changes noted below:

Note: For “Key Pair”, use the name of the Key Pair you created in the Prerequisites step. You can see a list of key pairs on the Key Pairs page.

Note: For “Amazon S3 Log Path”, choose the name of a bucket you own. It should exist but be empty. See above for bucket naming guidelines. You can use the S3 Console to create your bucket. When filling out the form, put “s3n://” in front of the bucket name.

Click “Continue”.

Fill out the screen as follows:

Instead of “XXXXXXXXXXXXXXXXXXXX” and “YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY”, supply your Amazon Access Key ID and Secret Access Key. These are available from the Security Credentials page. These identifiers are necessary so that the mapper and reducer can communicate with S3.

The last two arguments in the “Optional Arguments” box should be the names of S3 buckets which exist and which you own. (We said earlier that the output bucket should not exist when you create the job flow, but it will have been created by the time this part of the job flow runs.) Here we are using the same bucket for both the Streaming MapReduce output and the “real” output of our reducer (the report.tar.gz file), but you are not required to do the same.

Click “Continue”. Your job flow summary should look something like this:

If it all looks correct, click “Create Job Flow”.

Then click on “View Job Flows” to see the progress of your job flow.

If all goes well, your job will eventually complete and a file called report.tar.gz will be in your output bucket on S3. You can use the S3 Console to download this file. Then you can use the following command to unarchive the file:

tar zxf report.tar.gz

This will create some files in a “report” directory. You can open “index.html” with your web browser to see the report generated by ShortRead and Elastic MapReduce.

Starting a Job Flow with the command line

Download and install the Elastic MapReduce command-line utility. This utility is written in Ruby and requires that the Ruby language be installed on your computer. Ruby is installed by default on Mac OS X and on many Linux machines.

Note: The command-line utility works best with Ruby 1.8.6 or 1.8.7; it does not work with newer versions such as Ruby 1.9.2.

Once the command line utility is installed, you can start a job flow with a command like the following:

elastic-mapreduce --create --name "ShortRead QA" --num-instances 2 \
--slave-instance-type m1.small --master-instance-type m1.small \
--key-pair gsg-keypair --stream --input s3n://bioconductor-mapreduce-example-inputdir \
--output s3n://outdir19671025 --mapper s3n://bioconductor-mapreduce-example/mapper-emr.R \
--reducer s3n://bioconductor-mapreduce-example/reducer-emr.R \
--jobconf mapred.reduce.tasks=1 \
--bootstrap-action s3://bioconductor-emr-bootstrap-scripts/bootstrap.sh \
--bootstrap-name "Custom Bootstrapping Action" --arg XXXXXXXXXXXXXXXXXXXX \
--arg YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY --arg bioconductor-bamfiles \
--arg tempdir19671025 --arg outdir19671025 --debug -v \
--log-uri s3n://emr-debugging-19671025 --enable-debugging

It is important to change the values of the following flags:

  • --key-pair: Change this to the name of the key pair you created in the Prerequisites step.
  • --output: Change this to the name of a bucket that does not yet exist, and put “s3n://” in front of the bucket name.
  • --arg XXXXXXXXXXXXXXXXXXXX: Substitute your Amazon Access Key ID.
  • --arg YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY: Substitute your Secret Access Key.
  • --arg tempdir19671025: Supply the name of a bucket you own which already exists.
  • --arg outdir19671025: Supply the name of a bucket you own which exists, OR the same bucket name you are using with the --output flag, which should not exist.
  • --log-uri: Change this to the name of a bucket which you own, which exists, and is empty, with “s3n://” in front of the name.

How it works

The critical part of our mapper is ShortRead’s qa() function. While this function can handle multiple files, the best parallelization is achieved when each mapper handles just a single file.
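The actual mapper is the mapper-emr.R script referenced in the job flow above. As a rough sketch of its logic (the s3cmd invocations, file names, and the “my-tempdir” bucket below are illustrative placeholders, not the script’s exact contents), a mapper along these lines reads one BAM file name from standard input, fetches that file from the public bucket, runs qa() on it, and saves the resulting qa object to a temporary bucket for the reducer:

#!/usr/bin/env Rscript
# Rough sketch of the mapper's logic (illustrative; see mapper-emr.R for the real script).
suppressMessages(library(ShortRead))

con <- file("stdin", open = "r")
bamName <- readLines(con, n = 1)    # each input file names a single BAM file
close(con)

# Fetch the BAM file from the public bucket (s3cmd is installed by the bootstrap action).
system(paste0("s3cmd get s3://bioconductor-bamfiles/", bamName, " ", bamName))

# Run quality assessment on this one file.
qaResult <- qa(".", pattern = bamName, type = "BAM")

# Serialize the intermediate qa object and copy it to a temporary bucket
# ("my-tempdir" is a placeholder for the temp bucket passed in as an argument).
objName <- paste0(bamName, ".qa.rda")
save(qaResult, file = objName)
system(paste0("s3cmd put ", objName, " s3://my-tempdir/", objName))

# Emit the intermediate object's name so the reducer knows what to collect.
cat(objName, "\n", sep = "")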

Our reducer is simply R’s rbind() function. This requires that all intermediate results (the qa objects generated by each of the mappers) be in the same place. We specify a single reducer with the "--jobconf mapred.reduce.tasks=1" flag. In the reducer, all of the intermediate qa objects are read into a list, rbind() is called on the list, and the report is generated from the results of the rbind().
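The actual reducer is the reducer-emr.R script referenced above; the sketch below shows the general shape of that logic (the “my-tempdir” and “my-outdir” bucket names and the exact s3cmd calls are placeholders rather than the script’s real contents):

#!/usr/bin/env Rscript
# Rough sketch of the reducer's logic (illustrative; see reducer-emr.R for the real script).
suppressMessages(library(ShortRead))

# The mapper output streamed to standard input names the intermediate qa objects.
con <- file("stdin", open = "r")
objNames <- sub("\t.*$", "", readLines(con))   # drop any trailing tab-separated values
close(con)
objNames <- objNames[nzchar(objNames)]

# Download each intermediate qa object from the temporary bucket and load it.
qaList <- lapply(objNames, function(nm) {
  system(paste0("s3cmd get s3://my-tempdir/", nm, " ", nm))
  get(load(nm))                                # load() returns the saved object's name
})

# Combine the per-file qa objects and generate the report.
combined <- do.call(rbind, qaList)
report(combined, dest = "report")

# Archive the report and copy it to the output bucket ("my-outdir" is a placeholder).
system("tar czf report.tar.gz report")
system("s3cmd put report.tar.gz s3://my-outdir/report.tar.gz")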

Support

If you run into any issues, contact us through the Bioconductor support site.