---
title: "How to use import functions"
author: 
  - name: Giulia Pais
    affiliation: | 
     San Raffaele Telethon Institute for Gene Therapy - SR-Tiget, 
     Via Olgettina 60, 20132 Milano - Italia
    email: giuliapais1@gmail.com, calabria.andrea@hsr.it
output: 
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('ISAnalytics')`"
vignette: >
  %\VignetteIndexEntry{import_functions_howto}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r GenSetup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
```


```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE}
## Bib setup
library("RefManageR")

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitr = citation("knitr")[1],
    RefManageR = citation("RefManageR")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    testthat = citation("testthat")[1],
    ISAnalytics = citation("ISAnalytics")[1],
    VISPA2 = BibEntry(bibtype = "Article", 
         title = paste("VISPA2: a scalable pipeline for", 
                       "high-throughput identification and",
                       "annotation of vector integration sites"),
         author = "Giulio Spinozzi, Andrea Calabria, Stefano Brasca", 
         journaltitle = "BMC Bioinformatics", 
         date = "2017-11-25",
         doi = "10.1186/s12859-017-1937-9")
)
```

# Introduction to `ISAnalytics` import functions family

In this vignette we explain in more detail how the functions of the 
import family should be used and describe the most common workflows.

```{r echo=FALSE}
inst_chunk_path <- system.file("rmd", "install_and_options.Rmd", 
                               package = "ISAnalytics")
```

```{r child=inst_chunk_path}

```

```{r}
library(ISAnalytics)
```

## Designed to work with VISPA2 pipeline

The vast majority of the functions included in this package are designed to
work in combination with the VISPA2 pipeline `r Citep(bib[["VISPA2"]])`. 
If you don't know what it is, we strongly 
recommend that you take a look at these links:  

* Article: [VISPA2: Article](
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5702242/)  
* BitBucket Wiki: [VISPA2 Wiki](
https://bitbucket.org/andreacalabria/vispa2/wiki/Home)  

## File system structure generated

VISPA2 produces a standard file system structure starting from a folder you 
specify as your workbench or root. The structure always follows this schema:  

* root/
  * Optional intermediate folders
    * Projects (PROJECTID)
      * bam
      * bcmuxall
      * bed
      * iss
        * Pools (concatenatePoolIDSeqRun)
      * quality
      * quantification
        * Pools (concatenatePoolIDSeqRun)
      * report

Most of the functions implemented expect a standard file system structure
such as the one described above.
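
As a rough illustration of the schema, the folders that belong to a given
project and pool can be composed with base R. The helper `pool_folders()` and
the root name `my_root` are hypothetical and not part of `ISAnalytics`; the
sketch also ignores the optional intermediate folders.

```{r}
# Hypothetical helper (not part of ISAnalytics): builds the paths where the
# schema above places the `iss` and `quantification` folders of a pool
pool_folders <- function(root_dir, project_id, pool_id) {
    project_dir <- file.path(root_dir, project_id)
    c(
        iss = file.path(project_dir, "iss", pool_id),
        quantification = file.path(project_dir, "quantification", pool_id)
    )
}

pool_folders("my_root", "PJ01", "POOL01-1")
```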

# Notation {#notation}

We call an *"integration matrix"* a tabular structure characterized by:

* 3 mandatory columns of genomic features that characterize a viral insertion
site in the genome: `chr`, `integration_locus` and `strand`
* 2 (optional) annotation columns: `GeneName` and `GeneStrand`
* A variable number n of sample columns containing the quantification
of the corresponding integration site

```{r echo=FALSE}
sample_sparse_matrix <- tibble::tribble(
  ~ chr, ~ integration_locus, ~ strand, ~ GeneName, ~GeneStrand, 
  ~ exp1, ~ exp2, ~ exp3,
  "1", 12324, "+", "NFATC3", "+", 4553,5345,NA_integer_,
  "6", 657532, "+", "LOC100507487", "+", 76,545,5,
  "7", 657532, "+", "EDIL3", "-", NA_integer_,56,NA_integer_,
)
print(sample_sparse_matrix, width = Inf)
```

The package uses a more compact form of these matrices, limiting the amount
of NA values and optimizing time and memory consumption. 
For more info on this take a look at: 
[Tidy data](https://r4ds.had.co.nz/tidy-data.html)
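
To give an idea of what a compact form can look like, the following sketch
reshapes a small sparse matrix with base R. The column names `sample_id` and
`value` are chosen only for this illustration, and the code is not the one
used internally by the package.

```{r}
# Toy sparse matrix: one column per sample, NA where a site was not observed
sparse_example <- data.frame(
    chr = c("1", "6", "7"),
    integration_locus = c(12324, 657532, 657532),
    strand = c("+", "+", "+"),
    exp1 = c(4553, 76, NA),
    exp2 = c(5345, 545, 56),
    exp3 = c(NA, 5, NA)
)
sample_cols <- c("exp1", "exp2", "exp3")

# Stack the sample columns: one row per (integration site, sample) pair
tidy_example <- do.call(rbind, lapply(sample_cols, function(s) {
    data.frame(
        sparse_example[c("chr", "integration_locus", "strand")],
        sample_id = s,
        value = sparse_example[[s]]
    )
}))

# Dropping the missing quantifications is what makes the form compact
tidy_example <- tidy_example[!is.na(tidy_example$value), ]
tidy_example
```

Only the observed quantifications are kept as rows, so no `NA` value needs
to be stored.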

While integration matrices contain the actual data, we also need associated
sample metadata to perform the vast majority of the analyses. 
`ISAnalytics` expects the metadata to be contained in a so called 
*"association file"*, which is a simple tabular file with a set of 
standard column headers. 

To generate a blank association file you can use the function 
`generate_blank_association_file`. You can also view the standard 
column names with `association_file_columns()`.

# Importing metadata {#metadata}

To import metadata we use `import_association_file()`. This function not
only reads the file into the R environment as a data frame, but can also
perform a file system alignment operation: for each project and pool
contained in the file, it scans the file system starting from the provided
root to check whether the corresponding folders (contained in the appropriate
column) can be found. Remember that, to work properly, this operation expects
a standard folder structure, such as the one provided by VISPA2. This function
also produces an interactive HTML report; to learn more about this feature see
`vignette("report_system")`.

```{r}
fs_path <- system.file("extdata", "fs.zip", package = "ISAnalytics")
root <- unzip_file_system(fs_path, "fs")
withr::with_options(list(ISAnalytics.reports = FALSE), code = {
  af_path <- system.file("extdata", "asso.file.tsv.gz", 
                         package = "ISAnalytics")
  af <- import_association_file(af_path, root = root)
})
```

```{r echo=FALSE}
print(head(af), width = Inf)
```

## Function arguments

You can change several arguments in the function call to modify the 
behavior of the function.

* `root`
  * Set it to `NULL` if you only want to import the association file without
  file system alignment. Beware that some of the automated import 
  functionalities won't work!
  * Set it to a non-empty string (path on disk): in this case, 
  the column `PathToFolderProjectID` in the file should contain 
  **relative** file paths, so if for example your root is set to "/home" and
  your project folder in the association file is set to "/PJ01", the function
  will check that the directory exists under "/home/PJ01"
  * Set it to an empty string: ideal if you want to store paths in the 
  association file as **absolute** file paths. In this case if your project
  folder is in "/home/PJ01" you should have this path in the
  `PathToFolderProjectID` column and set `root` = ""
* `tp_padding`: this argument is used to pad the `TimePoint` column in the
association file so that all time points have the same length
* `dates_format`: a string that is useful for properly parsing dates from 
tabular formats
* `separator`: the column separator used in the file. Defaults to "\\t",
other valid separators are "," (comma), ";" (semi-colon)
* `filter_for`: you can set this argument to a **named** list of filters, 
where names are column names. For example `list(ProjectID = "PJ01")` will
return only those rows whose attribute "ProjectID" equals "PJ01"
* `import_iss`: either `TRUE` or `FALSE`. If set to `TRUE`, performs
an internal call to `import_Vispa2_stats()` (see next section), and appends
the imported files to metadata
* `convert_tp`: either `TRUE` or `FALSE`. Converts the "TimePoint" column
to months and years (with custom logic).
* `report_path`
  * Set it to `NULL` to avoid the production of a report
  * Set it to a folder (if it doesn't exist, it gets automatically created)
  * Set it to a file
* `...`: additional named arguments to pass to `import_Vispa2_stats()` if
you chose to import VISPA2 stats
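
The three ways of setting `root` can be summarized with a small sketch. The
helper `expected_project_dir()` is hypothetical: it only mirrors the behavior
described in the list above and is not the code the package runs internally.

```{r}
# Hypothetical helper (not the package's internal code): shows which
# directory ends up being checked for a given `root` and project folder
expected_project_dir <- function(root_dir, project_folder) {
    if (is.null(root_dir)) {
        return(NA_character_) # no file system alignment at all
    }
    if (identical(root_dir, "")) {
        return(project_folder) # paths in the file are already absolute
    }
    # Relative path: drop any leading slash and append it to the root
    file.path(root_dir, sub("^/+", "", project_folder))
}

expected_project_dir("/home", "/PJ01")
expected_project_dir("", "/home/PJ01")
```

Both calls point to the same directory, "/home/PJ01": what changes is whether
the association file stores relative or absolute paths.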

*NOTE*: the function supports files in various formats as long as the correct
separator is provided. It also accepts files in `*.xlsx` and `*.xls` formats 
but we do not recommend using these since the report won't include a 
detailed summary of potential parsing problems.

The interactive report includes useful information such as

* General issues: parsing problems, missing columns, NA values in important 
columns etc. This allows you to immediately spot problems and correct them 
before proceeding with the analyses
* File system alignment issues: very useful to know if all data can be imported
or folders are missing
* Info on VISPA2 stats (if `import_iss` was `TRUE`)

# Importing VISPA2 stats files

VISPA2 automatically produces summary files for each pool holding
information that can be useful for other analyses downstream,
so it is recommended to import them in the first steps of the workflow.
To do that, you can use `import_Vispa2_stats()`:

```{r results='hide'}
withr::with_options(list(ISAnalytics.reports = FALSE), {
    vispa_stats <- import_Vispa2_stats(
        association_file = af,
        join_with_af = FALSE
    )
})
```

```{r echo=FALSE}
print(head(vispa_stats))
```

The function requires as input the imported and file system aligned
association file and it will scan the `iss` folder for files that match some
known prefixes (defaults are already provided but you can change them as you
see fit). You can choose either to join the imported data frames with the
association file in input and obtain a single data frame, or to keep them
separate: just set the parameter `join_with_af` accordingly.
At the end of the process an HTML report is produced, signaling potential
problems.

You can directly call this function when you import the association file
by setting the `import_iss` argument of `import_association_file` to `TRUE`.

# Importing a single integration matrix

If you want to import a single integration matrix you can do so by using the
`import_single_Vispa2Matrix()` function.
This function reads the file and converts it into a tidy structure: several
different formats can be read, since you can specify the column separator.

```{r message=FALSE, results='hide'}
matrix_path <- fs::path(root,
                        "PJ01",
                        "quantification",
                        "POOL01-1",
                        "PJ01_POOL01-1_seqCount_matrix.no0.annotated.tsv.gz")
matrix <- import_single_Vispa2Matrix(matrix_path)
```

```{r echo=FALSE}
matrix
```

Other arguments you can pass to the function are

* `to_exclude`: a character vector that contains column names that need to
be excluded when imported. This is more targeted towards files that do have
all the columns of an integration matrix as presented in section
\@ref(notation) and other additional columns. By default this argument is set
to `NULL`
* `keep_excluded`: if set to `TRUE` all columns contained in `to_exclude` are
preserved as additional annotation columns

# Automated integration matrices import

Integration matrices import can be automated when the association file
is imported with the file system alignment option.
`ISAnalytics` provides a function, `import_parallel_Vispa2Matrices()`,
that lets you do just that in a fast and efficient way.

```{r}
withr::with_options(list(ISAnalytics.reports = FALSE), {
    matrices <- import_parallel_Vispa2Matrices(af,
        c("seqCount", "fragmentEstimate"),
        mode = "AUTO"
    )
})
```

## Function arguments

Let's see how the behavior of the function changes when we change arguments.

### `association_file` argument

You can supply a data frame object, imported via `import_association_file()` 
(see Section \@ref(metadata)), or a string (the path to the association file
on disk). In the first scenario the data frame must have been imported with
file system alignment, since the function scans the folders contained in the
column `Path_quant`. In the second case you should also provide, as an
additional **named** argument (to `...`), an appropriate `root`: the function
will internally call `import_association_file()`. If you don't have specific
needs, we recommend doing the 2 steps separately and providing the association
file as a data frame.

### `quantification_type` argument

For each pool there may be multiple available quantification types, that is,
different matrices containing the same samples
and the same genomic features but a different quantification.
A typical workflow includes `seqCount` and `fragmentEstimate`;
all the supported quantification types can be viewed with
`quantification_types()`.

### `matrix_type` argument

As we mentioned in Section \@ref(notation), annotation columns are optional 
and may not be included in some matrices. This argument lets you tell the
function to look for only a specific type of matrix, either
`annotated` or `not_annotated`. Please note that in order to do that,
for now, the function needs to assume some standard file name notation:
for `annotated` matrices, the function will look for the `.no0.annotated` 
suffix in the file name.
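
To make the naming convention more concrete, here is a minimal sketch of how
a file name can be checked for the quantification type and for the
`.no0.annotated` suffix. The second file name is made up for the example,
and the checks are plain base R, not the package's own matching code.

```{r}
# The first file name is the one imported earlier in this vignette, the
# second one is a made-up example of a matrix without annotation
file_names <- c(
    "PJ01_POOL01-1_seqCount_matrix.no0.annotated.tsv.gz",
    "PJ01_POOL01-1_seqCount_matrix.tsv.gz"
)

# Literal (non-regex) matching on the file names
is_annotated <- grepl(".no0.annotated", file_names, fixed = TRUE)
has_seq_count <- grepl("seqCount", file_names, fixed = TRUE)
data.frame(file_names, is_annotated, has_seq_count)
```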

### `workers` argument

Sets the number of parallel workers to set up. This highly depends on the
hardware configuration of your machine.
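
If you are unsure about a value, one possible heuristic (a general rule of
thumb, not a recommendation specific to `ISAnalytics`) is to leave one
physical core free for the rest of the system. The variable `n_workers` is
just a name chosen for this sketch.

```{r}
# Number of physical cores; detectCores() may return NA on some platforms
available_cores <- parallel::detectCores(logical = FALSE)
if (is.na(available_cores)) {
    available_cores <- 1L
}

# Leave one core free, but always use at least one worker
n_workers <- max(1L, available_cores - 1L)
n_workers
```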

### `multi_quant_matrix` argument

When importing more than one quantification at once, it can be very handy
to have all data in a single data frame rather than one per quantification.
If set to `TRUE` the function will internally call `comparison_matrix()` and
produce a single data frame that has a dedicated column for each
quantification. For example, for the matrices we've imported before:

```{r echo=FALSE}
print(head(matrices), width = Inf)
```
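
The idea behind a multi-quantification matrix can be sketched with base R on
made-up data: two tidy matrices are joined on the columns that identify an
integration site and a sample. All names and values below are illustrative,
and `merge()` is used here only as a stand-in; it is not necessarily how
`comparison_matrix()` is implemented.

```{r}
# Toy tidy matrices for two quantification types (made-up values)
key_cols <- c("chr", "integration_locus", "strand", "sample_id")
seq_count_example <- data.frame(
    chr = c("1", "6"),
    integration_locus = c(12324, 657532),
    strand = c("+", "+"),
    sample_id = c("exp1", "exp1"),
    seqCount = c(4553, 76)
)
fragment_estimate_example <- data.frame(
    chr = c("1", "6"),
    integration_locus = c(12324, 657532),
    strand = c("+", "+"),
    sample_id = c("exp1", "exp1"),
    fragmentEstimate = c(120.5, 10.2)
)

# Full join on the key columns: one column per quantification
multi_quant_example <- merge(
    seq_count_example, fragment_estimate_example,
    by = key_cols, all = TRUE
)
multi_quant_example
```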

### `report_path` argument

Like the other import functions, `import_parallel_Vispa2Matrices()` produces
an interactive report. Use this argument to set the path where
the report should be saved.

### `mode` argument

This argument can take one of two values, `AUTO` or `INTERACTIVE`.
The `INTERACTIVE` workflow, as the name suggests, needs user console 
input but allows a fine tuning of the import process. On the other hand,
`AUTO` allows a fully automated workflow but has of course some limitations.

**What do you want to import?**  
In a fully automated mode, the function will try to import everything that
is contained in the input association file. This means that if you need to
import only a specific set of projects/pools, you will need to filter the
association file accordingly prior to calling the function (you can easily
do that via the `filter_for` argument as explained in Section \@ref(metadata)).
In interactive mode the function will ask you to type what you want to import.
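
As a reminder of what such a filter does, here is a small base R sketch on a
toy data frame. The helper `keep_matching_rows()` is hypothetical and only
mimics the documented behavior of `filter_for`; accepting several values per
column is an assumption made for this example.

```{r}
# Hypothetical helper mimicking the documented behavior of `filter_for`
# on a plain data frame; not the code used by the package
keep_matching_rows <- function(df, filters) {
    keep <- rep(TRUE, nrow(df))
    for (col in names(filters)) {
        # A row is kept only if it satisfies every named filter
        keep <- keep & df[[col]] %in% filters[[col]]
    }
    df[keep, , drop = FALSE]
}

toy_af <- data.frame(
    ProjectID = c("PJ01", "PJ01", "PJ02"),
    PoolID = c("POOL01", "POOL02", "POOL01")
)
keep_matching_rows(toy_af, list(ProjectID = "PJ01"))
```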

**How to deal with duplicates?**  
When scanning folders for files that match a given pattern (in our case the 
function looks for matrices that match the quantification type and the 
matrix type), it is very possible that the same folder contains multiple files
for the same quantification. This is not recommended: we suggest moving
the duplicated files to a subdirectory or removing them if they're not
necessary. If it happens anyway, in interactive mode the function asks
directly which files should be considered. This is not possible in
automated mode, therefore you need to set two other arguments (described
in the next subsections) to "help" the function discriminate 
between duplicates. Please note that if such discrimination is not possible
no files are imported.

### `patterns` argument

This argument is relevant only if `mode` is set to `AUTO`. Providing a 
set of patterns (interpreted as regular expressions) helps the function to 
choose between duplicated files if any are found. If you're confident your
folders don't contain any duplicates feel free to ignore this argument.

### `matching_opt` argument

This argument is relevant only if `mode` is set to `AUTO` and `patterns` 
isn't `NULL`. It tells the function how to match the given patterns if multiple
are supplied: `ALL` means keep only those files whose name matches all the
given patterns, `ANY` means keep only those files whose name matches any of the
given patterns and `OPTIONAL` expresses a preference: try to find files that
match the patterns and, if none are found, return whatever is found.
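
The difference between the three options can be illustrated with a small
base R sketch. The helper `choose_files()` and the second file name are
hypothetical; the code is a simplified reading of the description above and
not the package's internal logic.

```{r}
# Hypothetical helper (simplified reading of the three options, not the
# package's internal code)
choose_files <- function(files, patterns,
    matching_opt = c("ANY", "ALL", "OPTIONAL")) {
    matching_opt <- match.arg(matching_opt)
    # For each file, count how many patterns its name matches
    n_hits <- Reduce(`+`, lapply(patterns, function(p) grepl(p, files)))
    keep <- if (matching_opt == "ALL") {
        n_hits == length(patterns)
    } else {
        n_hits > 0
    }
    # OPTIONAL is only a preference: fall back to everything if no match
    if (matching_opt == "OPTIONAL" && !any(keep)) {
        return(files)
    }
    files[keep]
}

# The second file name is made up to simulate a duplicate
duplicates <- c(
    "PJ01_POOL01-1_seqCount_matrix.no0.annotated.tsv.gz",
    "PJ01_POOL01-1_seqCount_matrix.no0.annotated_OLD.tsv.gz"
)
choose_files(duplicates, c("annotated", "OLD"), "ALL")
choose_files(duplicates, "missing", "OPTIONAL")
```

In this sketch `ALL` keeps only the file whose name matches both patterns,
while `OPTIONAL` returns both files because nothing matches the pattern.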

### `...` argument

Additional named arguments to supply to both `import_association_file()` and
`comparison_matrix()`.

## Notes

Earlier versions of the package featured two separate functions, 
`import_parallel_Vispa2Matrices_auto()` and 
`import_parallel_Vispa2Matrices_interactive()`. Those functions are now
officially deprecated (since `ISAnalytics 1.3.3`) and will be defunct in
the next release cycle.

# Reproducibility

`R` session information.

```{r reproduce3, echo=FALSE}
## Session info
library("sessioninfo")
options(width = 120)
session_info()
```

# Bibliography

This vignette was generated using `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])`
with `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` and `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])` running behind the scenes.

Citations made with `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])`.

```{r vignetteBiblio, results = "asis", echo = FALSE, warning = FALSE, message = FALSE}
## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))
```