---
title: "Rcwl Pipelines"
author: "Qiang Hu"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Rcwl Pipelines}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Installation
1. Install the package from Bioconductor.

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("RcwlPipelines")
```
The development version is also available to download from Github. 
```{r getDevel, eval=FALSE}
BiocManager::install("hubentu/RcwlPipelines")
```

2. Load the package into the R session.
```{r Load, message=FALSE}
library(RcwlPipelines)
library(dplyr)
```

# Tools and pipelines scripts
The R scripts to build the CWL tools and pipelines are collected in a
github repository now (https://github.com/hubentu/RcwlRecipes), which
is community effort to collect Bioinformatics tools and pipelines
using Rcwl and CWL (Common Workflow Language).

Three functions are used to collect the `Rcwl` scripts, search tools
recipes by keywords and load the scripts to current `R` environment.

## Indexing recipe scripts

The `cwlUpdate` function can update the recipe scripts from the github
repository and collect meta data to a local cache by the
`BiocFileCache` package. By default the local cache will be created
under your home directory for the first time. Here we use temporary
directory for example.

```{r}
tools <- cwlUpdate(cachePath = tempfile())
tools
```

## Search by keyword
The function `cwlSearch` can help to search indexed recipes by
keywords. For example, here we try to find the alignment tool `bwa
mem`.

```{r}
tl <- cwlSearch(c("bwa", "mem"), tools)
data.frame(tl)
```

## Loading tools and pipelines
The function `cwlLoad` can be used to "install" to tools or
pipelines to current environment by given the script path.

```{r}
bwa <- cwlLoad(tl$rpath)
bwa
```

Or we can install the tools by its `rname` directly.
```{r}
bwa <- cwlLoad(rname = 'tl_bwa', bfc = tools)
```

That's it! The tool "bwa" is ready to use.

# Build a pipeline
We can develop a pipline by utilizing the available tools. For
example, a simple alignment pipelines with mapping and marking
duplicates can be built from the `tools`.

First, we check whether the required tools (bwa, samtools and picard
markduplicates) are available.

```{r}
tls <- cwlSearch("bwa|sam2bam|sortBam|samtools_index|markdup", tools) %>%
    filter(Type == "tool") %>%
    select(rname, rpath, Command, Container)
tls
```

To load all the tools.
```{r}
invisible(sapply(tls$rpath, cwlLoad))
```

Next, we define the input parameters.
```{r}
p1 <- InputParam(id = "threads", type = "int")
p2 <- InputParam(id = "RG", type = "string")
p3 <- InputParam(id = "Ref", type = "string")
p4 <- InputParam(id = "FQ1", type = "File")
p5 <- InputParam(id = "FQ2", type = "File?")
```

Then we define the pipeline steps, from raw fastqs to duplicates
marked alignments.

```{r}
## bwa
s1 <- Step(id = "bwa", run = bwa,
           In = list(threads = "threads",
                     RG = "RG",
                     Ref = "Ref",
                     FQ1 = "FQ1",
                     FQ2 = "FQ2"))
## sam to bam
s2 <- Step(id = "sam2bam", run = sam2bam,
           In = list(sam = "bwa/sam"))
## sort bam
s3 <- Step(id = "sortBam", run = sortBam,
           In = list(bam = "sam2bam/bam"))
## mark duplicates
s4 <- Step(id = "markdup", run = markdup,
           In = list(ibam = "sortBam/sbam",
                     obam = list(
                         valueFrom="$(inputs.ibam.nameroot).mdup.bam"),
                     matrix = list(
                         valueFrom="$(inputs.ibam.nameroot).markdup.txt")))
## index bam
s5 <- Step(id = "idxBam", run = samtools_index,
           In = list(bam = "markdup/mBam"))
```

Last, we define the outputs and connect the steps to a new pipeline.
```{r}
req1 <- list(class = "StepInputExpressionRequirement")
req2 <- list(class = "InlineJavascriptRequirement")
## outputs
o1 <- OutputParam(id = "Bam", type = "File", outputSource = "markdup/mBam")
o2 <- OutputParam(id = "Idx", type = "File", outputSource = "idxBam/idx")
## stepParam
Align <- cwlStepParam(requirements = list(req1, req2),
                      inputs = InputParamList(p1, p2, p3, p4, p5),
                      outputs = OutputParamList(o1, o2))
## build pipeline
Align <- Align + s1 + s2 + s3 + s4 + s5
```

The pipeline is ready for use. We can plot the pipeline with
`plotCWL` from the `Rcwl` package.
```{r}
plotCWL(Align)
```

# Pipelines summary
There are mainly 4 pipelines are collected in this package. Here is a
brief introduction to these pipelines. More pipelines and tools are
expected to be included in the future.

## DNASeq alignment pipeline
The pipeline can be used to preprocess DNA sequences in fastq
format. It can take paired fastqs, read groups from multiple batches
as input.

```{r}
alignMerge <- cwlLoad(rname = "pl_alignMerge", bfc = tools)
inputs(alignMerge)
```

The pipeline includes two steps and several jobs will be run in each
step.

1. ``r names(runs(alignMerge))[[1]]``: bwa alignment by read groups.
```{r}
runs(runs(alignMerge)[[1]])
```

   * `bwa`: To align fastqs and read groups to reference genome with `bwa`.
   * `sam2bam`: To convert the alignments in "sam" format to "bam"
     format with `samtools`.
   * `sortBam`: To sort the "bam" file by coordinates with `samtools`.
   * `idxBam`: To index "bam" file with `samtools`.
  
2. ``r names(runs(alignMerge))[[2]]``: Merge by samples and markduplicates.
```{r}
runs(runs(alignMerge)[[2]])
```

   * `mergeBam`: To merge bam files from multiple batches with `picard`.
   * `markdup`: To mark duplicates with `picard`.
   * `samtools_index`: To index bam file with `samtools`.
   * `samtools_flagstat`: To summarize flags in bam with `samtools`.

The final bam files after duplicates marked, bam index, duplicates
matrix, and flag statistics summary will be in the output folder.

```{r}
outputs(alignMerge)
```

Here you can find an example to run the pipeline.

<https://hubentu.github.io/others/Rcwl/application.html#dnaseq-alignment-pipeline>

## RNASeq pipeline
The pipeline was built with reads quality summary, `STAR` alignment,
quantification by `featureCounts` and `RSeQC` quality control. Here
are the inputs.

```{r}
ranseq_Sf <- cwlLoad(rname = "pl_rnaseq_Sf", bfc = tools)
inputs(rnaseq_Sf)
```

The pipeline includes 6 steps.

* `fastqc`: To run quality summary for raw fastqs with `fastqc`.
* `STAR`: To align fastqs with `STAR`.
* `samtools_index`: To index aligned bam file.
* `samtools_flagstat`: To summary alignment flags.
* `featureCounts`: To quantify gene abundances.
* `RSeQC`: Several steps included.\
    - `gtfToGenePred`: To convert GTF annotation to "genePred" format.
    - `genePredToBed`: To convert "genePred" annotation to "bed" format.
    - `r_distribution`: To run reads distribution over genome features.
    - `gCoverage`: To summarize read coverage over gene body.

The outputs and logs from alignment, quantification and QC steps are
collected together to the output folder. A final QC report could be
generated by `multiqc`, which is also available in the data package.

An example to run the pipeline.

<https://hubentu.github.io/others/Rcwl/application.html#rnaseq-pipeline>

## GATK4 germline variant calling pipeline
The GATK4 best practice pipeline for germline variant calling was
implemented with Workflow Description Language (WDL). We wrapped the
WDL pipeline into 3 steps with `Rcwl`. The details of the pipeline can
be find here:
<https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145>

1. `GAlign` GATK alignment.

The fastqs, sample information and customized json files for WDL are
required as inputs. Multiple steps will run in this step, including
`bwa` alignment, mark duplicates and base quality recalibration. GATK
ready BAM files will be collected to the output directory.

2. `hapCall` HaplotypeCaller.

The GATK ready BAM and customized json files are inputs in this
step. The local paths of GATK bundle files are required to be modified
in your json file. A "gVCF" files will be generated.

3. `jdCall` Joint variant discovery

This step will combine the "gVCF" files and then call germline
variants in all samples. The paths of the local bundle files are also
required to be changed in the json template file. The final VCF file
of germline variants will be collected.

An example to run the pipeline.\
<https://hubentu.github.io/others/Rcwl/application.html#gatk4-germline-variant-calling-pipeline>

## GATK4 Somatic short variant pipeline
The GATK4 Mutect2 pipeline for germline variant calling was also
available in WDL. The pipeline was reimplemented with `Rcwl` based on
the best practice documents.
<https://software.broadinstitute.org/gatk/best-practices/workflow?id=11146>

```{r}
GPoN <- cwlLoad(rname = "pl_GPoN", bfc = tools)
Mutect2PL <- cwlLoad(rname = "pl_Mutect2PL", bfc = tools)
```

1. Variant calling on normal samples

First, we need to run Mutect2 in tumor-only mode for each normal
sample by the tool `Mutect2`. The argument "--max-mnp-distance 0" is
required to be added because the next step, "GenpmicsDBImport", can't
handle MNPs.

```{r}
arguments(Mutect2) <- list("--max-mnp-distance", "0")
Mutect2
```

2. Panel of normals 

This step is to create a GenomicsDB and then combine to a VCF output
for the panel of normals from all the normal Mutect2 calls. A cwl
pipeline `GPoN` was built to create the panel VCF.

```{r}
runs(GPoN)
```

3. Mutect2 and variant filtering

This pipeline includes two main steps. First we call a large set of
candidate somatic variants, then filter them by estimated
contamination and orientation bias artifacts. We can plot the
`Mutect2PL` pipeline to show the details.

```{r}
plotCWL(Mutect2PL)
```

# SessionInfo
```{r}
sessionInfo()
```