--- title: "Rcwl: An R interface to the Common Workflow Language (CWL)" author: "Qiang Hu, Qian Liu" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true toc_float: true vignette: > %\VignetteIndexEntry{Rcwl: An R interface to the Common Workflow Language} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_knit$set(root.dir = tempdir()) ``` Here we introduce the _Bioconductor_ toolchain for usage and development of reproducible bioinformatics pipelines using packages of [Rcwl](https://bioconductor.org/packages/Rcwl/) and [RcwlPipelines](https://bioconductor.org/packages/RcwlPipelines/). `Rcwl` provides a simple way to wrap command line tools and build CWL data analysis pipelines programmatically within _R_. It increases the ease of use, development, and maintenance of CWL pipelines. `RcwlPipelines` manages a collection of more than a hundred of pre-built and tested CWL tools and pipelines, which are highly modularized with easy customization to meet different bioinformatics data analysis needs. In this vignette, we will introduce how to build and run CWL pipelines within _R/Bioconductor_ using `Rcwl`package. More details about CWL can be found at . # Installation 1. Download and install the package. ```{r getPackage, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("Rcwl") ``` The development version with most up-to-date functionalities is also available from GitHub. ```{r getDevel, eval=FALSE} BiocManager::install("rworkflow/Rcwl") ``` 2. Load the package into _R_ session. ```{r Load, message=FALSE} library(Rcwl) ``` # First Example `cwlProcess` is the main constructor function to wrap a command line tool into an _R_ tool as a `cwlProcess` object (S4 class). Let's start with a simple example to wrap the `echo` command and execute `echo hello world` in _R_. First, we need to define the input parameter for the base command `echo`, here it is a string without a prefix. An `id` argument is required here. ```{r} input1 <- InputParam(id = "sth") ``` Second, we can construct a `cwlProcess` object by specifying the `baseCommand` for the command line tool, and `InputParamList` for the input parameters. ```{r} echo <- cwlProcess(baseCommand = "echo", inputs = InputParamList(input1)) ``` Now we have converted the command line tool `echo` into an _R_ tool: an _R_ object of class `cwlProcess` with the name of `echo`. We can take a look at the this _R_ object and use some utility functions to extract specific information. ```{r} echo class(echo) cwlClass(echo) cwlVersion(echo) baseCommand(echo) inputs(echo) outputs(echo) ``` The `inputs(echo)` will show the value once it is assigned in next step. Since we didn't define the outputs for this tool, it will stream standard output to a temporary file by default. The third step is to assign values (here is "Hello World!") for the input parameters. ```{r} echo$sth <- "Hello World!" inputs(echo) ``` Now this _R_ version of command line tool `echo` is ready to be executed. The function `runCWL` runs the tools in _R_ and returns a list of: 1) actual command line that was executed, 2) filepath to the output, and 3) running logs. The output directory by default takes the working directory, but can be specified in `outdir` argument. ```{r} r1 <- runCWL(echo, outdir = tempdir()) r1 r1$command readLines(r1$output) r1$logs ``` Users can also have the log printed out by specifying `showLog = TRUE`. ```{r} r1 <- runCWL(echo, outdir = tempdir(), showLog = TRUE) ``` A utility function `writeCWL` converts the `cwlProcess` object into 2 files: a `.cwl` file for the command and `.yml` file for the inputs, which are the internal cwl files to be executed when `runCWL` is invoked. The internal execution requires a `cwl-runner` (e.g., `cwltool`), which will be installed automatically with `runCWL`. ```{r} writeCWL(echo) ``` # Wrap command line tools The package provides functions to define a CWL syntax for Command Line Tools in an intuitive way. The functions were developed based on the CWL Command Line Tool Description (v1.0). More details can be found in the official document: . ## Input Parameters ### Essential Input parameters For the input parameters, three options need to be defined usually, *id*, *type*, and *prefix*. The type can be *string*, *int*, *long*, *float*, *double*, and so on. More detail can be found at: . Here is an example from [CWL user guide](http://www.commonwl.org/user_guide/03-input/). Here we defined an `echo` with different type of input parameters by `InputParam`. The `stdout` option can be used to capture the standard output stream to a file. ```{r} e1 <- InputParam(id = "flag", type = "boolean", prefix = "-f") e2 <- InputParam(id = "string", type = "string", prefix = "-s") e3 <- InputParam(id = "int", type = "int", prefix = "-i") e4 <- InputParam(id = "file", type = "File", prefix = "--file=", separate = FALSE) echoA <- cwlProcess(baseCommand = "echo", inputs = InputParamList(e1, e2, e3, e4), stdout = "output.txt") ``` Then we give it a try by setting values for the inputs. ```{r} echoA$flag <- TRUE echoA$string <- "Hello" echoA$int <- 1 tmpfile <- tempfile() write("World", tmpfile) echoA$file <- tmpfile r2 <- runCWL(echoA, outdir = tempdir()) r2$command ``` The command shows the parameters work as we defined. The parameter order is in alphabetical by default, but the option of "position" can be used to fix the orders. ### Array Inputs A similar example to CWL user guide. We can define three different type of array as inputs. ```{r} a1 <- InputParam(id = "A", type = "string[]", prefix = "-A") a2 <- InputParam(id = "B", type = InputArrayParam(items = "string", prefix="-B=", separate = FALSE)) a3 <- InputParam(id = "C", type = "string[]", prefix = "-C=", itemSeparator = ",", separate = FALSE) echoB <- cwlProcess(baseCommand = "echo", inputs = InputParamList(a1, a2, a3)) ``` Then set values for the three inputs. ```{r} echoB$A <- letters[1:3] echoB$B <- letters[4:6] echoB$C <- letters[7:9] echoB ``` Now we can check whether the command behaves as we expected. ```{r} r3 <- runCWL(echoB, outdir = tempdir()) r3$command ``` ## Output Parameters ### Capturing Output The outputs, similar to the inputs, is a list of output parameters. Three options *id*, *type* and *glob* can be defined. The glob option is used to define a pattern to find files relative to the output directory. Here is an example to unzip a compressed `gz` file. First, we generate a compressed R script file. ```{r} zzfil <- file.path(tempdir(), "sample.R.gz") zz <- gzfile(zzfil, "w") cat("sample(1:10, 5)", file = zz, sep = "\n") close(zz) ``` We define a `cwlProcess` object to use "gzip" to uncompress a input file. ```{r} ofile <- "sample.R" z1 <- InputParam(id = "uncomp", type = "boolean", prefix = "-d") z2 <- InputParam(id = "out", type = "boolean", prefix = "-c") z3 <- InputParam(id = "zfile", type = "File") o1 <- OutputParam(id = "rfile", type = "File", glob = ofile) gz <- cwlProcess(baseCommand = "gzip", inputs = InputParamList(z1, z2, z3), outputs = OutputParamList(o1), stdout = ofile) ``` Now the `gz` object can be used to uncompress the previous generated compressed file. ```{r} gz$uncomp <- TRUE gz$out <- TRUE gz$zfile <- zzfil r4 <- runCWL(gz, outdir = tempdir()) r4$output ``` Or we can use `arguments` to set some default parameters. ```{r} z1 <- InputParam(id = "zfile", type = "File") o1 <- OutputParam(id = "rfile", type = "File", glob = ofile) Gz <- cwlProcess(baseCommand = "gzip", arguments = list("-d", "-c"), inputs = InputParamList(z1), outputs = OutputParamList(o1), stdout = ofile) Gz Gz$zfile <- zzfil r4a <- runCWL(Gz, outdir = tempdir()) ``` To make it for general usage, we can define a pattern with javascript to glob the output, which require `node` from "nodejs" to be installed in your system PATH. ```{r} pfile <- "$(inputs.zfile.path.split('/').slice(-1)[0].split('.').slice(0,-1).join('.'))" ``` Or we can use the CWL built in file property, `nameroot`, directly. ```{r} pfile <- "$(inputs.zfile.nameroot)" o2 <- OutputParam(id = "rfile", type = "File", glob = pfile) req1 <- requireJS() GZ <- cwlProcess(baseCommand = "gzip", arguments = list("-d", "-c"), requirements = list(), ## assign list(req1) if node installed. inputs = InputParamList(z1), outputs = OutputParamList(o2), stdout = pfile) GZ$zfile <- zzfil r4b <- runCWL(GZ, outdir = tempdir()) ``` ### Array Outputs We can also capture multiple output files with `glob` pattern. ```{r} a <- InputParam(id = "a", type = InputArrayParam(items = "string")) b <- OutputParam(id = "b", type = OutputArrayParam(items = "File"), glob = "*.txt") touch <- cwlProcess(baseCommand = "touch", inputs = InputParamList(a), outputs = OutputParamList(b)) touch$a <- c("a.txt", "b.log", "c.txt") r5 <- runCWL(touch, outdir = tempdir()) r5$output ``` The "touch" command generates three files, but the output only collects two files with ".txt" suffix as defined in the `OutputParam` using the "glob" option. # Running Tools in Docker The CWL can work with docker to simplify your software management and communicate files between host and container. The docker container can be defined by the `hints` or `requirements` option. ```{r} d1 <- InputParam(id = "rfile", type = "File") req1 <- requireDocker("r-base") doc <- cwlProcess(baseCommand = "Rscript", inputs = InputParamList(d1), stdout = "output.txt", hints = list(req1)) doc$rfile <- r4$output ``` ```{r, eval=FALSE} r6 <- runCWL(doc) ``` The tools defined with docker requirements can also be run locally by disabling the docker option. In case your `Rscript` depends some local libraries to run, an option from `cwltools`, "--preserve-entire-environment", can be used to pass all environment variables. ```{r} r6a <- runCWL(doc, docker = FALSE, outdir = tempdir(), cwlArgs = "--preserve-entire-environment") ``` # Running Tools in Cluster server The CWL can also work in high performance clusters with batch-queuing system, such as SGE, PBS, SLURM and so on, using the Bioconductor package `BiocParallel`. Here is an example to submit jobs with "Multicore" and "SGE". ```{r, eval=FALSE} library(BiocParallel) sth.list <- as.list(LETTERS) names(sth.list) <- LETTERS ## submit with multicore result1 <- runCWLBatch(cwl = echo, outdir = tempdir(), inputList = list(sth = sth.list), BPPARAM = MulticoreParam(26)) ## submit with SGE result2 <- runCWLBatch(cwl = echo, outdir = tempdir(), inputList = list(sth = sth.list), BPPARAM = BatchtoolsParam(workers = 26, cluster = "sge", resources = list(queue = "all.q"))) ``` # Writing Pipeline We can connect multiple tools together into a pipeline. Here is an example to uncompress an R script and execute it with `Rscript`. Here we define a simple `Rscript` tool without using docker. ```{r} d1 <- InputParam(id = "rfile", type = "File") Rs <- cwlProcess(baseCommand = "Rscript", inputs = InputParamList(d1)) Rs ``` Test run: ```{r} Rs$rfile <- r4$output tres <- runCWL(Rs, outdir = tempdir()) readLines(tres$output) ``` The pipeline includes two steps, decompressing with predefined `cwlProcess` of `GZ` and compiling with `cwlProcess` of `Rs`. The input file is a compressed file for the first "Uncomp" step. ```{r} i1 <- InputParam(id = "cwl_zfile", type = "File") s1 <- cwlStep(id = "Uncomp", run = GZ, In = list(zfile = "cwl_zfile")) s2 <- cwlStep(id = "Compile", run = Rs, In = list(rfile = "Uncomp/rfile")) ``` In step 1 ('s1'), the pipeline runs the `cwlProcess` of `GZ`, where the input `zfile` is defined in 'i1' with id of "cwl_zfile". In step 2 ('s2'), the pipeline runs the `cwlProcess` of `Rs`, where the input `rfile` is from the output of the step 1 ("Uncomp/rfile") using the format of `/`. The pipeline output will be defined as the output of the step 2 ("Compile/output") using the format of `/` as shown below. ```{r} o1 <- OutputParam(id = "cwl_cout", type = "File", outputSource = "Compile/output") ``` The `cwlWorkflow` function is used to initiate the pipeline by specifying the `inputs` and `outputs`. Then we can simply use `+` to connect all steps to build the final pipeline. ```{r} cwl <- cwlWorkflow(inputs = InputParamList(i1), outputs = OutputParamList(o1)) cwl <- cwl + s1 + s2 cwl ``` Let's run the pipeline. ```{r} cwl$cwl_zfile <- zzfil r7 <- runCWL(cwl, outdir = tempdir()) readLines(r7$output) ``` Tips: Sometimes, we need to adjust some arguments of certain tools in a pipeline besides of parameter inputs. The function `arguments` can help to modify arguments for a tool, tool in a pipeline, or even tool in a sub-workflow. For example, ```{r} arguments(cwl, step = "Uncomp") <- list("-d", "-c", "-f") runs(cwl)$Uncomp ``` ## Scattering pipeline The scattering feature can specifies the associated workflow step or subworkflow to execute separately over a list of input elements. To use this feature, `ScatterFeatureRequirement` must be specified in the workflow requirements. Different `scatter` methods can be used in the associated step to decompose the input into a discrete set of jobs. More details can be found at: https://www.commonwl.org/v1.0/Workflow.html#WorkflowStep. Here is an example to execute multiple R scripts. First, we need to set the input and output types to be array of "File", and add the requirements. In the "Compile" step, the scattering input is required to be set with the `scatter` option. ```{r} i2 <- InputParam(id = "cwl_rfiles", type = "File[]") o2 <- OutputParam(id = "cwl_couts", type = "File[]", outputSource = "Compile/output") req1 <- requireScatter() cwl2 <- cwlWorkflow(requirements = list(req1), inputs = InputParamList(i2), outputs = OutputParamList(o2)) s1 <- cwlStep(id = "Compile", run = Rs, In = list(rfile = "cwl_rfiles"), scatter = "rfile") cwl2 <- cwl2 + s1 cwl2 ``` Multiple R scripts can be assigned to the workflow inputs and executed. ```{r} cwl2$cwl_rfiles <- c(r4b$output, r4b$output) r8 <- runCWL(cwl2, outdir = tempdir()) r8$output ``` ## Pipeline plot The function `plotCWL` can be used to visualize the relationship of inputs, outputs and the analysis for a tool or pipeline. ```{r} plotCWL(cwl) ``` # Web Application ## cwlProcess example Here we build a tool with different types of input parameters. ```{r} e1 <- InputParam(id = "flag", type = "boolean", prefix = "-f", doc = "boolean flag") e2 <- InputParam(id = "string", type = "string", prefix = "-s") e3 <- InputParam(id = "option", type = "string", prefix = "-o") e4 <- InputParam(id = "int", type = "int", prefix = "-i", default = 123) e5 <- InputParam(id = "file", type = "File", prefix = "--file=", separate = FALSE) e6 <- InputParam(id = "array", type = "string[]", prefix = "-A", doc = "separated by comma") mulEcho <- cwlProcess(baseCommand = "echo", id = "mulEcho", label = "Test parameter types", inputs = InputParamList(e1, e2, e3, e4, e5, e6), stdout = "output.txt") mulEcho ``` ## cwlProcess to Shiny App Some input parameters can be predefined in a list, which will be converted to select options in the webapp. An `upload` parameter can be used to defined whether to generate an upload interface for the file type option. If FALSE, the upload field will be text input (file path) instead of file input. ```{r, eval=FALSE} inputList <- list(option = c("option1", "option2")) app <- cwlShiny(mulEcho, inputList, upload = TRUE) runApp(app) ``` ![shinyApp](cwlShiny.png) # Working with R functions We can wrap an R function to `cwlProcess` object by simply assigning the R function to `baseCommand`. This could be useful to summarize results from other tools in a pipeline. It can also be used to benchmark different parameters for a method written in R. Please note that this feature is only implemented by `Rcwl`, but not available in the common workflow language. ```{r} fun1 <- function(x)x*2 testFun <- function(a, b){ cat(fun1(a) + b^2, sep="\n") } assign("fun1", fun1, envir = .GlobalEnv) assign("testFun", testFun, envir = .GlobalEnv) p1 <- InputParam(id = "a", type = "int", prefix = "a=", separate = F) p2 <- InputParam(id = "b", type = "int", prefix = "b=", separate = F) o1 <- OutputParam(id = "o", type = "File", glob = "rout.txt") TestFun <- cwlProcess(baseCommand = testFun, inputs = InputParamList(p1, p2), outputs = OutputParamList(o1), stdout = "rout.txt") TestFun$a <- 1 TestFun$b <- 2 r1 <- runCWL(TestFun, cwlArgs = "--preserve-entire-environment") readLines(r1$output) ``` The `runCWL` function wrote the `testFun` function and its dependencies into an R script file automatically and call `Rscript` to run the script with parameters. Each parameter requires a prefix from corresponding argument in the R function with "=" and without a separator. Here we assigned the R function and its dependencies into the global environment because it will start a new environment when the vignette is compiled. # Resources ## RcwlPipelines The `Rcwl` package can be utilized to develop pipelines for best practices of reproducible research, especially for Bioinformatics study. Multiple Bioinformatics pipelines, such as RNA-seq alignment, quality control and quantification, DNA-seq alignment and variant calling, have been developed based on the tool in an R package `RcwlPipelines`, which contains the CWL recipes and the scripts to create the pipelines. Examples to analyze real data are also included. The package is currently available in GitHub. * To install the package. ```{r, eval=FALSE} BiocManager::install("rworkflow/RcwlPipelines") ``` The project website https://rcwl.org/ serves as a central hub for all related resources. It provides guidance for new users and tutorials for both users and developers. Specific resources are listed below. ## Tutorial book The [tutorial book](https://rcwl.org/RcwlBook/) provides detailed instructions for developing `Rcwl` tools/pipelines, and also includes examples of some commonly-used tools and pipelines that covers a wide range of Bioinformatics data analysis needs. ## The _R_ recipes and cwl scripts The _R_ scripts to build the CWL tools and pipelines are now residing in a dedicated [GitHub repository](https://github.com/rworkflow/RcwlRecipes), which is intended to be a community effort to collect and contribute Bioinformatics tools and pipelines using `Rcwl` and CWL. ## Tool collections in CWL format Plenty of Bioinformatics tools and workflows can be found from GitHub in CWL format. They can be imported to `cwlProcess` object by `readCWL` function, or can be used directly. * * * ## Docker for Bioinformatics tools Most of the Bioinformatics software are available in docker containers, which can be very convenient to be adopted to build portable CWL tools and pipelines. * * # SessionInfo ```{r} sessionInfo() ```