--- title: "Rcwl Pipelines" author: "Qiang Hu" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true toc_float: true vignette: > %\VignetteIndexEntry{Rcwl Pipelines} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Installation 1. Install the package from Bioconductor. ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("RcwlPipelines") ``` The development version is also available to download from Github. ```{r getDevel, eval=FALSE} BiocManager::install("hubentu/RcwlPipelines") ``` 2. Load the package into the R session. ```{r Load, message=FALSE} library(RcwlPipelines) ``` # Tools and pipelines scripts The R scripts to build the CWL tools and pipelines based on the `Rcwl` package are stored in the "src" folder with "tl_" and "pl_" prefix respectively. The function `cwlTools` can be used to collect the available scripts. The `cachePath` can be your existing cache directory or a new folder. ```{r} tools <- cwlTools(cachePath = tempdir()) tools ``` The full paths can be pulled from the "fpath" column. The scripts can be viewed to demonstrate how the tools and pipelines were built. ```{r, message=FALSE} library(dplyr) bfcinfo(tools) %>% select(rname, fpath) ``` The commands and docker containers from the wrapped tools are included in the metadata. ```{r, results = 'asis'} tls <- bfcinfo(tools) %>% filter(Type == "tool") %>% select(rname, Command, Container) knitr::kable(tls) ``` Numbers of pipelines have been built on the wrapped tools. Here we show the built-in pipelines and their corresponding steps. ```{r, results = 'asis'} pls <- bfcinfo(tools) %>% filter(Type == "pipeline") %>% pull(rname) pls <- setdiff(pls, "mc3") Steps <- sapply(pls, function(x) { cwl <- get(x, envir = environment()) ss <- names(steps(cwl)) paste(ss, collapse = "+") }) knitr::kable(data.frame(Pipelines = pls, Steps), row.names = FALSE) ``` # Build a pipeline We can develop a pipline by utilizing the available tools. For example, a simple alignment pipelines with mapping and marking duplicates can be built from the `tools`. First, we check whether the required tools (bwa, samtools and picard markduplicates) are available. ```{r} bfcquery(tools, "bwa|sam2bam|sortBam|samtools_index|markdup") %>% filter(Type == "tool") %>% select(rname, Command, Container) ``` Next, we define the input parameters. ```{r} p1 <- InputParam(id = "threads", type = "int") p2 <- InputParam(id = "RG", type = "string") p3 <- InputParam(id = "Ref", type = "string") p4 <- InputParam(id = "FQ1", type = "File") p5 <- InputParam(id = "FQ2", type = "File?") ``` Then we define the pipeline steps, from raw fastqs to duplicates marked alignments. ```{r} ## bwa s1 <- Step(id = "bwa", run = bwa, In = list(threads = "threads", RG = "RG", Ref = "Ref", FQ1 = "FQ1", FQ2 = "FQ2")) ## sam to bam s2 <- Step(id = "sam2bam", run = sam2bam, In = list(sam = "bwa/sam")) ## sort bam s3 <- Step(id = "sortBam", run = sortBam, In = list(bam = "sam2bam/bam")) ## mark duplicates s4 <- Step(id = "markdup", run = markdup, In = list(ibam = "sortBam/sbam", obam = list( valueFrom="$(inputs.ibam.nameroot).mdup.bam"), matrix = list( valueFrom="$(inputs.ibam.nameroot).markdup.txt"))) ## index bam s5 <- Step(id = "idxBam", run = samtools_index, In = list(bam = "markdup/mBam")) ``` Last, we define the outputs and connect the steps to a new pipeline. ```{r} req1 <- list(class = "StepInputExpressionRequirement") req2 <- list(class = "InlineJavascriptRequirement") ## outputs o1 <- OutputParam(id = "Bam", type = "File", outputSource = "markdup/mBam") o2 <- OutputParam(id = "Idx", type = "File", outputSource = "idxBam/idx") ## stepParam Align <- cwlStepParam(requirements = list(req1, req2), inputs = InputParamList(p1, p2, p3, p4, p5), outputs = OutputParamList(o1, o2)) ## build pipeline Align <- Align + s1 + s2 + s3 + s4 + s5 ``` The pipeline is ready for use. We can plot the pipeline with `plotCWL` from the `Rcwl` package. ```{r} plotCWL(Align) ``` # Pipelines summary There are mainly 4 pipelines are collected in this package. Here is a brief introduction to these pipelines. More pipelines and tools are expected to be included in the future. ## DNASeq alignment pipeline The pipeline can be used to preprocess DNA sequences in fastq format. It can take paired fastqs, read groups from multiple batches as input. ```{r} inputs(alignMerge) ``` The pipeline includes two steps and several jobs will be run in each step. 1. ``r names(runs(alignMerge))[[1]]``: bwa alignment by read groups. ```{r} runs(runs(alignMerge)[[1]]) ``` * `bwa`: To align fastqs and read groups to reference genome with `bwa`. * `sam2bam`: To convert the alignments in "sam" format to "bam" format with `samtools`. * `sortBam`: To sort the "bam" file by coordinates with `samtools`. * `idxBam`: To index "bam" file with `samtools`. 2. ``r names(runs(alignMerge))[[2]]``: Merge by samples and markduplicates. ```{r} runs(runs(alignMerge)[[2]]) ``` * `mergeBam`: To merge bam files from multiple batches with `picard`. * `markdup`: To mark duplicates with `picard`. * `samtools_index`: To index bam file with `samtools`. * `samtools_flagstat`: To summarize flags in bam with `samtools`. The final bam files after duplicates marked, bam index, duplicates matrix, and flag statistics summary will be in the output folder. ```{r} outputs(alignMerge) ``` Here you can find an example to run the pipeline. ## RNASeq pipeline The pipeline was built with reads quality summary, `STAR` alignment, quantification by `featureCounts` and `RSeQC` quality control. Here are the inputs. ```{r} inputs(rnaseq_Sf) ``` The pipeline includes 6 steps. * `fastqc`: To run quality summary for raw fastqs with `fastqc`. * `STAR`: To align fastqs with `STAR`. * `samtools_index`: To index aligned bam file. * `samtools_flagstat`: To summary alignment flags. * `featureCounts`: To quantify gene abundances. * `RSeQC`: Several steps included.\ - `gtfToGenePred`: To convert GTF annotation to "genePred" format. - `genePredToBed`: To convert "genePred" annotation to "bed" format. - `r_distribution`: To run reads distribution over genome features. - `gCoverage`: To summarize read coverage over gene body. The outputs and logs from alignment, quantification and QC steps are collected together to the output folder. A final QC report could be generated by `multiqc`, which is also available in the data package. An example to run the pipeline. ## MC3 somatic variant calling pipeline The Multi-Center Mutation Calling in Multiple Cancers project (MC3) pipeline was developed by the TCGA group, which was implemented with CWL and available at . Two major steps, variant calling and merging VCFs to MAF, was imported to the `mc3` pipeline in this package. The steps of the pipeline was built on the CWL files from its github repository, which were also contained in the package. Thereforce, we need the load the pipleine by sourcing it from the script. The pipeline requires a paired of tumor/normal BAM files and genomic annotation files as inputs. ```{r} bfcinfo(tools) %>% filter(rname == "mc3") %>% pull(rpath) %>% source inputs(mc3) ``` Two steps are included. * `call_variants`: To call variants by 7 pipelines. * `covert`: To merge VCFs and convert to MAF The merged VCF and converted MAF files will be collected to the output folder. ```{r} outputs(mc3) ``` Here is an examples to run the pipeline.\ ## GATK4 germline variant calling pipeline The GATK4 best practice pipeline for germline variant calling was implemented with Workflow Description Language (WDL). We wrapped the WDL pipeline into 3 steps with `Rcwl`. The details of the pipeline can be find here: 1. `GAlign` GATK alignment. The fastqs, sample information and customized json files for WDL are required as inputs. Multiple steps will run in this step, including `bwa` alignment, mark duplicates and base quality recalibration. GATK ready BAM files will be collected to the output directory. 2. `hapCall` HaplotypeCaller. The GATK ready BAM and customized json files are inputs in this step. The local paths of GATK bundle files are required to be modified in your json file. A "gVCF" files will be generated. 3. `jdCall` Joint variant discovery This step will combine the "gVCF" files and then call germline variants in all samples. The paths of the local bundle files are also required to be changed in the json template file. The final VCF file of germline variants will be collected. An example to run the pipeline.\ ## GATK4 Somatic short variant pipeline The GATK4 Mutect2 pipeline for germline variant calling was also available in WDL. The pipeline was reimplemented with `Rcwl` based on the best practice documents. 1. Variant calling on normal samples First, we need to run Mutect2 in tumor-only mode for each normal sample by the tool `Mutect2`. The argument "--max-mnp-distance 0" is required to be added because the next step, "GenpmicsDBImport", can't handle MNPs. ```{r} arguments(Mutect2) <- list("--max-mnp-distance", "0") Mutect2 ``` 2. Panel of normals This step is to create a GenomicsDB and then combine to a VCF output for the panel of normals from all the normal Mutect2 calls. A cwl pipeline `GPoN` was built to create the panel VCF. ```{r} runs(GPoN) ``` 3. Mutect2 and variant filtering This pipeline includes two main steps. First we call a large set of candidate somatic variants, then filter them by estimated contamination and orientation bias artifacts. We can plot the `Mutect2PL` pipeline to show the details. ```{r} plotCWL(Mutect2PL) ``` # SessionInfo ```{r} sessionInfo() ```