---
title: "How to use import functions"
author: 
- name: Giulia Pais
  affiliation: | 
     San Raffaele Telethon Institute for Gene Therapy - SR-Tiget, Via Olgettina 60, 20132 Milano - Italia
  email: giuliapais1@gmail.com, calabria.andrea@hsr.it
output:
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('ISAnalytics')`"
vignette: >
  %\VignetteIndexEntry{import_functions_howto}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r GenSetup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL
    ## Related to
    ## https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
```

```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE}
## Bib setup
library("RefManageR")

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitr = citation("knitr")[1],
    RefManageR = citation("RefManageR")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    testthat = citation("testthat")[1],
    ISAnalytics = citation("ISAnalytics")[1],
    VISPA2 = BibEntry(
        bibtype = "Article",
        title = paste(
            "VISPA2: a scalable pipeline for",
            "high-throughput identification and",
            "annotation of vector integration sites"
        ),
        author = "Giulio Spinozzi, Andrea Calabria, Stefano Brasca",
        journaltitle = "BMC Bioinformatics",
        date = "2017-11-25",
        doi = "10.1186/s12859-017-1937-9"
    )
)
```

# Introduction to `ISAnalytics` import functions family

In this vignette we explain in more detail how the functions of the import
family should be used and which workflows are most common.

```{r echo=FALSE}
inst_chunk_path <- system.file("rmd", "install_and_options.Rmd", package = "ISAnalytics")
```

```{r child=inst_chunk_path}
```

```{r}
library(ISAnalytics)
```

## Designed to work with VISPA2 pipeline

The vast majority of the functions included in this package are designed to
work in combination with the VISPA2 pipeline `r Citep(bib[["VISPA2"]])`.
If you are not familiar with it, we strongly recommend taking a look at these
links:

* Article: [VISPA2: Article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5702242/)
* BitBucket Wiki: [VISPA2 Wiki](https://bitbucket.org/andreacalabria/vispa2/wiki/Home)

## File system structure generated

VISPA2 produces a standard file system structure starting from a folder you
specify as your workbench or root. The structure always follows this schema:

* root/
    * Optional intermediate folders
        * Projects (PROJECTID)
            * bam
            * bcmuxall
            * bed
            * iss
                * Pools (concatenatePoolIDSeqRun)
            * quality
            * quantification
                * Pools (concatenatePoolIDSeqRun)
            * report

Most of the functions implemented expect a standard file system structure
such as the one described above.
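
To make the schema above more concrete, the sketch below builds the folder
paths it implies for a single project and pool using only base R. The helper
`expected_vispa2_paths()` is hypothetical (it is not part of `ISAnalytics`),
the root `"fs"` and the identifiers `PJ01` and `POOL01-1` are just example
values, and the code assumes that pool folders sit directly under `iss` and
`quantification` with no optional intermediate folders.

```r
# Hypothetical helper, NOT an ISAnalytics function: it only illustrates
# where pool-level folders are expected to live under a VISPA2 root.
expected_vispa2_paths <- function(root, project_id, pool_id) {
    project_dir <- file.path(root, project_id)
    c(
        project = project_dir,
        # Assumption: one sub folder per pool inside "iss"
        iss = file.path(project_dir, "iss", pool_id),
        # Assumption: one sub folder per pool inside "quantification"
        quantification = file.path(project_dir, "quantification", pool_id)
    )
}

paths <- expected_vispa2_paths("fs", "PJ01", "POOL01-1")
paths
```

A quick call to `dir.exists(paths)` on your own root is a simple way to check
by hand whether a given project and pool follow the expected layout.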

# Notation {#notation}

We call an *"integration matrix"* a tabular structure characterized by:

* 3 mandatory columns of genomic features that characterize a viral insertion
site in the genome: `chr`, `integration_locus` and `strand`
* 2 (optional) annotation columns: `GeneName` and `GeneStrand`
* A variable number n of sample columns containing the quantification of the
corresponding integration site

```{r echo=FALSE}
sample_sparse_matrix <- tibble::tribble(
    ~chr, ~integration_locus, ~strand, ~GeneName, ~GeneStrand,
    ~exp1, ~exp2, ~exp3,
    "1", 12324, "+", "NFATC3", "+", 4553, 5345, NA_integer_,
    "6", 657532, "+", "LOC100507487", "+", 76, 545, 5,
    "7", 657532, "+", "EDIL3", "-", NA_integer_, 56, NA_integer_,
)
print(sample_sparse_matrix, width = Inf)
```

The package uses a more compact form of these matrices, limiting the number
of NA values and reducing time and memory consumption. For more information,
see [Tidy data](https://r4ds.had.co.nz/tidy-data.html).

While integration matrices contain the actual data, we also need the
associated sample metadata to perform the vast majority of the analyses.
`ISAnalytics` expects the metadata to be contained in a so-called
*"association file"*, which is a simple tabular file with a set of standard
column headers. To generate a blank association file you can use the function
`generate_blank_association_file`. You can also view the standard column
names with `association_file_columns()`.

# Importing metadata {#metadata}

To import metadata we use `import_association_file()`. This function not only
reads the file into the R environment as a data frame, but can also perform a
file system alignment operation: for each project and pool contained in the
file, it scans the file system starting from the provided root to check
whether the corresponding folders (contained in the appropriate column) can
be found.
Remember that, to work properly, this operation expects a standard folder
structure, such as the one provided by VISPA2. This function also produces an
interactive HTML report; to learn more about this feature see
`vignette("report_system")`.

```{r}
fs_path <- system.file("extdata", "fs.zip", package = "ISAnalytics")
root <- unzip_file_system(fs_path, "fs")
withr::with_options(list(ISAnalytics.reports = FALSE), code = {
    af_path <- system.file("extdata", "asso.file.tsv.gz",
        package = "ISAnalytics"
    )
    af <- import_association_file(af_path, root = root)
})
```

```{r echo=FALSE}
print(head(af), width = Inf)
```

## Function arguments

You can change several arguments in the function call to modify the behavior
of the function.

* `root`
    * Set it to `NULL` if you only want to import the association file without
    file system alignment. Beware that some of the automated import
    functionalities won't work!
    * Set it to a non-empty string (path on disk): in this case, the column
    `PathToFolderProjectID` in the file should contain **relative** file
    paths. For example, if your root is set to "/home" and your project folder
    in the association file is set to "/PJ01", the function will check that
    the directory exists under "/home/PJ01"
    * Set it to an empty string: ideal if you want to store paths in the
    association file as **absolute** file paths. In this case, if your project
    folder is in "/home/PJ01", you should have this path in the
    `PathToFolderProjectID` column and set `root` = ""
* `tp_padding`: this argument is used to pad the `TimePoint` column in the
association file so that all time points have the same length
* `dates_format`: a string that tells the function how to parse dates from
tabular formats
* `separator`: the column separator used in the file. Defaults to "\\t", other
valid separators are "," (comma), ";" (semi-colon)
* `filter_for`: you can set this argument to a **named** list of filters,
where names are column names.
For example `list(ProjectID = "PJ01")` will return only those rows whose
attribute "ProjectID" equals "PJ01"
* `import_iss`: either `TRUE` or `FALSE`. If set to `TRUE`, performs an
internal call to `import_Vispa2_stats()` (see next section), and appends the
imported files to metadata
* `convert_tp`: either `TRUE` or `FALSE`. Converts the "TimePoint" column
into months and years (with custom logic).
* `report_path`
    * Set it to `NULL` to avoid the production of a report
    * Set it to a folder (if it doesn't exist, it gets automatically created)
    * Set it to a file
* `...`: additional named arguments to pass to `import_Vispa2_stats()` if you
chose to import VISPA2 stats

*NOTE*: the function supports files in various formats as long as the correct
separator is provided. It also accepts files in `*.xlsx` and `*.xls` formats,
but we do not recommend using these since the report won't include a detailed
summary of potential parsing problems.

The interactive report includes useful information such as

* General issues: parsing problems, missing columns, NA values in important
columns etc. This allows you to immediately spot problems and correct them
before proceeding with the analyses
* File system alignment issues: very useful to know if all data can be
imported or folders are missing
* Info on VISPA2 stats (if `import_iss` was `TRUE`)

# Importing VISPA2 stats files

VISPA2 automatically produces summary files for each pool, holding
information that can be useful for other analyses downstream, so it is
recommended to import them in the first steps of the workflow.
To do that, you can use `import_Vispa2_stats()`:

```{r results='hide'}
withr::with_options(list(ISAnalytics.reports = FALSE), {
    vispa_stats <- import_Vispa2_stats(
        association_file = af,
        join_with_af = FALSE
    )
})
```

```{r echo=FALSE}
print(head(vispa_stats))
```

The function requires as input the imported and file system aligned
association file, and it scans the `iss` folder for files that match some
known prefixes (defaults are already provided but you can change them as you
see fit). You can either join the imported data frames with the association
file in input and obtain a single data frame, or keep them separate: just set
the parameter `join_with_af` accordingly. At the end of the process an HTML
report is produced, signaling potential problems.

You can directly call this function when you import the association file by
setting the `import_iss` argument of `import_association_file` to `TRUE`.

# Importing a single integration matrix

If you want to import a single integration matrix you can do so by using the
`import_single_Vispa2Matrix()` function. This function reads the file and
converts it into a tidy structure: several different formats can be read,
since you can specify the column separator.

```{r message=FALSE, results='hide'}
matrix_path <- fs::path(
    root,
    "PJ01",
    "quantification",
    "POOL01-1",
    "PJ01_POOL01-1_seqCount_matrix.no0.annotated.tsv.gz"
)
matrix <- import_single_Vispa2Matrix(matrix_path)
```

```{r echo=FALSE}
matrix
```

Other arguments you can pass to the function are

* `to_exclude`: a character vector that contains the names of the columns
that need to be excluded when importing. This is targeted at files that have
all the columns of an integration matrix as presented in section
\@ref(notation) plus other additional columns.
By default this argument is set to `NULL`
* `keep_excluded`: if set to `TRUE` all columns contained in `to_exclude` are
preserved as additional annotation columns

# Automated integration matrices import

The import of integration matrices can be automated when the association file
is imported with the file system alignment option. `ISAnalytics` provides a
function, `import_parallel_Vispa2Matrices()`, that lets you do just that in a
fast and efficient way.

```{r}
withr::with_options(list(ISAnalytics.reports = FALSE), {
    matrices <- import_parallel_Vispa2Matrices(af,
        c("seqCount", "fragmentEstimate"),
        mode = "AUTO"
    )
})
```

## Function arguments

Let's see how the behavior of the function changes when we change arguments.

### `association_file` argument

You can supply a data frame object, imported via `import_association_file()`
(see Section \@ref(metadata)), or a string (the path to the association file
on disk). In the first scenario it is necessary to perform file system
alignment, since the function scans the folders contained in the column
`Path_quant`. In the second case you should also provide an appropriate
`root` as an additional **named** argument (to `...`): the function will
internally call `import_association_file()`. If you don't have specific
needs, we recommend doing the 2 steps separately and providing the
association file as a data frame.

### `quantification_type` argument

For each pool there may be multiple available quantification types, that is,
different matrices containing the same samples and same genomic features but
a different quantification. A typical workflow involves `seqCount` and
`fragmentEstimate`; all the supported quantification types can be viewed with
`quantification_types()`.

### `matrix_type` argument

As we mentioned in Section \@ref(notation), annotation columns are optional
and may not be included in some matrices.
This argument lets you tell the function to look for only a specific type of
matrix, either `annotated` or `not_annotated`.

Please note that in order to do that, for now, the function needs to assume
some standard file name notation, that is, for `annotated` matrices, the
function will look for the `.no0.annotated` suffix in the file name.

### `workers` argument

Sets the number of parallel workers to set up. The appropriate value depends
heavily on the hardware configuration of your machine.

### `multi_quant_matrix` argument

When importing more than one quantification at once, it can be very handy to
have all data in a single data frame rather than two. If set to `TRUE` the
function will internally call `comparison_matrix()` and produce a single data
frame that has a dedicated column for each quantification. For example, for
the matrices we've imported before:

```{r echo=FALSE}
print(head(matrices), width = Inf)
```

### `report_path` argument

Like other import functions, `import_parallel_Vispa2Matrices()` also produces
an interactive report; use this argument to set the path where the report
should be saved.

### `mode` argument

This argument can take one of two values, `AUTO` or `INTERACTIVE`. The
`INTERACTIVE` workflow, as the name suggests, needs user console input but
allows fine tuning of the import process. On the other hand, `AUTO` allows a
fully automated workflow but has some limitations.

**What do you want to import?**

In fully automated mode, the function will try to import everything that is
contained in the input association file. This means that if you need to
import only a specific set of projects/pools, you will need to filter the
association file accordingly prior to calling the function (you can easily do
that via the `filter_for` argument as explained in Section \@ref(metadata)).
In interactive mode the function will ask you to type what you want to
import.
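
To illustrate what filtering the metadata before an automated import amounts
to, here is a minimal base R sketch on a toy table. The helper
`filter_metadata()` and the toy values are hypothetical: this is not how
`ISAnalytics` implements `filter_for`, only an illustration of the "named
list of filters, where names are column names" idea. With the real package,
the equivalent step is passing `filter_for = list(ProjectID = "PJ01")` to
`import_association_file()`.

```r
# Toy metadata table: column names resemble those of an association file,
# values are made up for illustration.
toy_af <- data.frame(
    ProjectID = c("PJ01", "PJ01", "PJ02"),
    PoolID = c("POOL01", "POOL02", "POOL01"),
    stringsAsFactors = FALSE
)

# Hypothetical helper, NOT an ISAnalytics function: keeps only the rows in
# which every named column takes one of the requested values.
filter_metadata <- function(df, filters) {
    keep <- rep(TRUE, nrow(df))
    for (col in names(filters)) {
        keep <- keep & df[[col]] %in% filters[[col]]
    }
    df[keep, , drop = FALSE]
}

filtered <- filter_metadata(toy_af, list(ProjectID = "PJ01"))
filtered
```

Only the rows left after filtering would then be considered by an import run
in `AUTO` mode.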

**How to deal with duplicates?**

When scanning folders for files that match a given pattern (in our case the
function looks for matrices that match the quantification type and the matrix
type), it is quite possible that the same folder contains multiple files for
the same quantification. This is not recommended: we suggest moving the
duplicated files to a subdirectory, or removing them if they're not
necessary. If duplicates are found in interactive mode, the function asks
directly which files should be considered. This is not possible in automated
mode, therefore you need to set two other arguments (described in the next
subsections) to "help" the function discriminate between duplicates.
Please note that if such discrimination is not possible no files are
imported.

### `patterns` argument

This argument is relevant only if `mode` is set to `AUTO`. Providing a set of
patterns (interpreted as regular expressions) helps the function choose
between duplicated files if any are found. If you're confident your folders
don't contain any duplicates, feel free to ignore this argument.

### `matching_opt` argument

This argument is relevant only if `mode` is set to `AUTO` and `patterns`
isn't `NULL`. It tells the function how to match the given patterns if
multiple are supplied: `ALL` means keep only those files whose name matches
all the given patterns, `ANY` means keep only those files whose name matches
any of the given patterns, and `OPTIONAL` expresses a preference: the
function tries to find files that match the patterns and, if none are found,
returns whatever it finds.

### `...` argument

Additional named arguments to supply to both `import_association_file()` and
`comparison_matrix()`.

## Notes

Earlier versions of the package featured two separate functions,
`import_parallel_Vispa2Matrices_auto()` and
`import_parallel_Vispa2Matrices_interactive()`.
Those functions are now officially deprecated (since `ISAnalytics 1.3.3`) and
will be defunct in the next release cycle.

# Reproducibility

`R` session information.

```{r reproduce3, echo=FALSE}
## Session info
library("sessioninfo")
options(width = 120)
session_info()
```

# Bibliography

This vignette was generated using `r Biocpkg("BiocStyle")`
`r Citep(bib[["BiocStyle"]])` with `r CRANpkg("knitr")`
`r Citep(bib[["knitr"]])` and `r CRANpkg("rmarkdown")`
`r Citep(bib[["rmarkdown"]])` running behind the scenes.

Citations made with `r CRANpkg("RefManageR")`
`r Citep(bib[["RefManageR"]])`.

```{r vignetteBiblio, results = "asis", echo = FALSE, warning = FALSE, message = FALSE}
## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))
```