---
title: "ReUseData: Workflow-based Data Recipes for Management of Reusable and Reproducible Data Resources"
author: "Qian Liu"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
vignette: >
  %\VignetteIndexEntry{ReUseDataRecipes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

The growth in the volume and complexity of genomic data resources over
the past few decades poses both opportunities and challenges for data
reuse. Presently, reusing data often involves repeating similar
preprocessing steps across research projects. The lack of a
standardized annotation strategy can lead to datasets that are
difficult to find, or even duplicated, resulting in substantial
inefficiencies and wasted computing resources, especially for research
collaborations and bioinformatics core facilities. Tools such as
`GoGetData` and `AnnotationHub` have been developed to mitigate common
problems in managing and accessing curated genomic datasets. However,
their use can be limited by software requirements (e.g., Conda,
https://conda.io), the form of data representation, or the scope of
data resources.
To respond to the FAIR (findability, accessibility, interoperability,
and reusability) data principles that are being widely adopted and
organizational requirements for Data Management Plans (DMPs), here, we
introduce `ReUseData`, an _R/Bioconductor_ software tool to provide a
systematic and versatile approach for standardized and reproducible
data management. `ReUseData` facilitates transformation of shell or
other ad hoc scripts for data preprocessing into workflow-based data
recipes. Evaluating a data recipe generates curated data files in
their generic formats (e.g., VCF, BED) with full annotations for
subsequent reuse.

This package focuses on the management of genomic data resources and
builds on classes and functions from existing _Bioconductor_ packages,
making it a natural fit for the _Bioconductor_ ecosystem.

# Installation
1. Install the package from _Bioconductor_.

```{r install, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ReUseData")
```

Or install the development version:
```{r installDevel, eval=FALSE}
BiocManager::install("ReUseData", version = "devel")
```

2. Load the package and other packages used in this vignette into the
   R session.

```{r load}
suppressPackageStartupMessages(library(Rcwl))
library(ReUseData)
```
		
# Project resources 

## `ReUseData` recipe landing pages

The project website https://rcwl.org/dataRecipes/ contains all
prebuilt data recipes for public data downloading and curation. They
are available for direct use with convenient webpage searching. Each
data recipe has a landing page including recipe description (inputs,
outputs, etc.) and user instructions. **Make sure to check the
instructions of eligible input parameter values before recipe
evaluation.** These prebuilt data recipes demonstrate the use of
software and can be taken as templates for users to create their own
recipes for protected datasets.

There are many other _R_ resources available on the main website
https://rcwl.org/, including package vignettes for `Rcwl` and
`RcwlPipelines`, the `Rcwl` tutorial e-book, and case studies of using
`RcwlPipelines` to preprocess single-cell RNA-seq data.

# `ReUseData` recipe scripts

The prebuilt data recipe scripts are included in the package and
reside in a dedicated [GitHub
repository](https://github.com/rworkflow/ReUseDataRecipes), which
demonstrates recipe construction for different situations. In the most
common case, a data recipe manages multiple data resources through
different input parameters (species, version, etc.). For example, the
`gencode_transcripts` recipe downloads the transcript fasta file from
GENCODE for human or mouse in different versions, then unzips and
indexes it. A simple download (using `wget`) of a specific file can be
written as a data recipe without any input parameters. For example,
the data recipe `gcp_broad_gatk_hg38_1000G_omni2.5` downloads the
`1000G_omni2.5.hg38.vcf.gz` file and its `tbi` index from the Google
Cloud Platform bucket of Broad GATK hg38 reference data.

If the data curation is more complicated, say, multiple command-line
tools are involved, `conda` is needed to install required packages, or
secondary files are generated and must be collected, building the
`ReUseData` recipe directly with `Rcwl` functions is recommended, as
it gives more flexibility and power to accommodate different
situations. An example is the `reference_genome` recipe, which
downloads, formats, and indexes reference genome data using
`samtools`, `picard`, and `bwa`, and manages multiple secondary files
alongside the main fasta file for later reuse.
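
As a hypothetical minimal sketch of this raw approach (this is not the
actual `reference_genome` recipe; the identifiers below are invented
for illustration), a recipe built directly from `Rcwl` classes pairs
an `InputParamList` and an `OutputParamList` on a `cwlProcess` object:

```{r, eval=FALSE}
library(Rcwl)
## Hypothetical minimal example: one string input passed positionally,
## command output captured via stdout, and the file collected by glob.
msg_in <- InputParam(id = "message", type = "string", position = 1)
msg_out <- OutputParam(id = "out", type = "File", glob = "*.txt")
rcp_raw <- cwlProcess(baseCommand = "echo",
                      inputs = InputParamList(msg_in),
                      outputs = OutputParamList(msg_out),
                      stdout = "message.txt")
rcp_raw$message <- "Hello"
```

Compared with `recipeMake`, this form exposes the full `cwlProcess`
interface (e.g., `requirements`, `hints`, and `arguments`), which is
useful when coordinating multiple tools and secondary files.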
	
# `ReUseData` core functions 

Here we show the usage of four core functions: `recipeMake`,
`recipeUpdate`, `recipeSearch`, and `recipeLoad`, for constructing,
updating, searching, and loading `ReUseData` recipes in _R_.

## Recipe construction and evaluation

One can construct a data recipe from scratch, or convert an existing
shell script for data processing into a data recipe, by specifying
input parameters and output globbing patterns with the `recipeMake`
function. The data recipe is then represented in _R_ as an S4 object
of class `cwlProcess`. Upon assigning values to the input parameters,
the recipe is ready to be evaluated to generate the data of
interest. Here are two examples:

```{r}
script <- '
input=$1
outfile=$2
echo "Print the input: $input" > $outfile.txt
'
```

Equivalently, we can load the shell script directly: 

```{r}
script <- system.file("extdata", "echo_out.sh", package = "ReUseData")
```

```{r}
rcp <- recipeMake(shscript = script,
                  paramID = c("input", "outfile"),
                  paramType = c("string", "string"),
                  outputID = "echoout",
                  outputGlob = "*.txt")
inputs(rcp)
outputs(rcp)
```

Evaluation of a data recipe is internally submitted as a CWL workflow
task, which requires a recent version of `cwltool`. Here we use
`basilisk` to initiate a conda environment and install `cwltool` into
it if it is not available (or only older versions are available) on
the system.

We can install `cwltool` first to make sure a cwl-runner is available.
```{r}
invisible(Rcwl::install_cwltool())
```

```{r} 
rcp$input <- "Hello World!"
rcp$outfile <- "outfile"
outdir <- file.path(tempdir(), "SharedData")
res <- getData(rcp,
               outdir = outdir,
               notes = c("echo", "hello", "world", "txt"))
```

Let's take a look at the output file, which was generated in the
user-specified directory and collected through the `outputGlob`
argument. For more details on the `getData` function for recipe
evaluation, see the other vignette on [reusable data management](ReUseData_data.html).

```{r}
res$out
readLines(res$out)
```

Here we show a more complex example in which the shell script requires
specific command-line tools. When such tools are needed for the data
processing, users just add their names to the `requireTools` argument
of the `recipeMake` function, and then pass `conda = TRUE` when
evaluating the recipe with `getData`. The tools are then installed
automatically into a freshly initiated conda environment, in which the
script runs.

This feature promotes data reproducibility across computing platforms
and lowers the barrier to sophisticated bioinformatics tools for less
experienced users.

The following code chunk is not evaluated here to limit package build
time, but it can be evaluated by users.

```{r, eval=FALSE}
shfile <- system.file("extdata", "gencode_transcripts.sh",
                      package = "ReUseData")
readLines(shfile)
rcp <- recipeMake(shscript = shfile,
                  paramID = c("species", "version"),
                  paramType = c("string", "string"),
                  outputID = "transcripts", 
                  outputGlob = "*.transcripts.fa*",
                  requireTools = c("wget", "gzip", "samtools")
                  )
rcp$species <- "human"
rcp$version <- "42"
res <- getData(rcp,
        outdir = outdir,
        notes = c("gencode", "transcripts", "human", "42"),
        conda = TRUE)
res$output
```

## Recipe caching and updating

`recipeUpdate()` creates a local cache for data recipes saved in a
specified GitHub repository (on first use), then syncs and updates the
data recipes from the GitHub repository to the local caching system,
so any newly added recipes can be readily accessed and loaded directly
in _R_.

**NOTE:** 

- The `cachePath` argument needs to match between the `recipeUpdate`,
  `recipeLoad`, and `recipeSearch` functions.
- Use `force=TRUE` to re-cache previously cached recipes that have
  been updated.
- Use `remote = TRUE` to sync with a remote GitHub repository. By
  default, it syncs with the [`ReUseDataRecipe` GitHub
  repository](https://github.com/rworkflow/ReUseDataRecipe) of public,
  prebuilt data recipes. `repos` can also point to a private GitHub
  repository.

```{r}
## First time use
recipeUpdate(cachePath = "ReUseDataRecipe",
             force = TRUE)
```

To sync the local recipe cache with a remote GitHub repository, use
`remote = TRUE`. Currently the remote data recipes on GitHub are the
same as the recipes shipped in the package (so the chunk is not
evaluated here to avoid duplicate messages). We will do our best to
keep the data recipes in the package development version in sync with
the remote GitHub repository.

```{r, eval=FALSE}
recipeUpdate(remote = TRUE,
             repos = "rworkflow/ReUseDataRecipe")  ## can be private repo
```

`recipeUpdate` returns a `recipeHub` object with a list of all
available recipes. One can subset the list with `[` and use the getter
function `recipeNames()` to retrieve the recipe names, which can then
be passed to `recipeSearch()` or `recipeLoad()`.

```{r}
rh <- recipeUpdate()
is(rh)
rh[1]
recipeNames(rh)
```

## Recipe searching and loading

Cached data recipes can be searched with one or multiple keywords
matched against the recipe names. `recipeSearch` returns a `recipeHub`
object listing the available recipes.

```{r}
recipeSearch()
recipeSearch("gencode")
recipeSearch(c("STAR", "index"))
```

Recipes can be loaded directly into _R_ with the `recipeLoad`
function, under either a user-assigned name or the original recipe
name. Once a recipe is successfully loaded, a message with recipe
instructions is returned.

```{r}
rcp <- recipeLoad("STAR_index")
```

**NOTE:** Use `return=FALSE` to keep the original recipe name, or when
multiple recipes are to be loaded.


```{r}
recipeLoad("STAR_index", return = FALSE)
```

```{r}
identical(rcp, STAR_index)
```

```{r}
recipeLoad(c("ensembl_liftover", "gencode_annotation"), return=FALSE)
```

It's important to check the required `inputs()` of a recipe and its
landing page for eligible input parameter values before evaluating it
to generate the data of interest.

```{r}
inputs(STAR_index)
inputs(ensembl_liftover)
inputs(gencode_annotation)
```

# SessionInfo
```{r}
sessionInfo()
```