--- title: "Vignette for FLAMES on Windows" author: "Oliver Voogd" package: FLAMES output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Vignette for FLAMES on Windows} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Windows FLAMES Pipeline Windows is capable of running the FLAMES pipeline however, as the genome alignment in FLAMES is handled using minimap2, these functions are not able to be run on a Windows OS. This vignette provides a workflow for running the package primarily on Windows, with the required non-Windows functions being handled by a supported system - either MacOS or a Linux distrubution. In order to run the FLAMES pipeline on Windows, ensure that the arguments do_genome_align and do_read_realign are set to FALSE. This will run the FLAMES pipeline, without the initial read alignment using Minimap2 step, and the later read realignment using Minimap2 step. ```{r eval=FALSE} bulk_long_pipeline(..., do_genome_align=FALSE, do_read_realign=FALSE) # OR sc_long_pipeline(..., do_genome_align=FALSE, do_read_realign=FALSE) ``` Alternatively, the FLAMES pipeline can be manually executed using exported functions from the FLAMES package, and minimap2 alignment and realignment can be performed externally to complete the full pipeline. This process for manual execution with external alignment is outlined below. ## Manual execution of FLAMES pipeline ### Environment setup Begin by storing the required variables `annot`, `fastq`, `outdir`, `genome_fa`, in the current workspace. Ensure `outdir` exists, and is writable. More information on the values of these variables can be obtained by executing `?FLAMES::bulk_long_pipeline` or `?FLAMES::sc_long_pipeline`. The optional `config_file` variable can also be stored, or the default configuration file, located in the `extdata/` folder in the base level of this packaged can be used. This vignette uses `BiocFileCache` to download the required variables from an example data set. Below is the process for creating the BiocFileCache and downloading the required files. The code below is included for demonstration and is not required in order to run the Flames pipeline on Windows. This pipeline runs the bulk version of the FLAMES pipeline, however code is provided to run the single cell pipeline instead. Running the single cell pipeline requires little modification to this process beyond the initial pipeline setup. The downloaded files are stored in a temporary directory. ```{r eval=TRUE, echo=TRUE} # download required files using BiocFileCache temp_path <- tempdir() bfc <- BiocFileCache::BiocFileCache(temp_path, ask=FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" annot <- bfc[[names(BiocFileCache::bfcadd(bfc, "Annotation", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep="/")))]] # [[ notation is used to get the local file path of the downloaded file genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "Genomefa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep="/")))]] # download the two fastq files, move them to a folder to be merged together fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep="/")))]] fastq2 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq2", paste(file_url, "fastq/sample2.fastq.gz", sep="/")))]] fastq_dir <- paste(temp_path, "fastq_dir", sep="/") # the downloaded fastq files need to be in a directory to be merged together dir.create(fastq_dir) file.copy(c(fastq1, fastq2), fastq_dir) unlink(c(fastq1, fastq2)) # the original files can be deleted # setup other environment variables config_file <- system.file("extdata/SIRV_config_default.json", package="FLAMES") # the configuration file is included with the FLAMES package outdir <- tempdir() # create a temporary output directory if (!dir.exists(outdir)) { dir.create(outdir) } ``` ```{r eval=TRUE, echo=FALSE, results='hide'} # download files generated using minimap2 (these files need to be generated by the user on a system with access to Minimap2. More information is in the section [Genome Alignment](#genome-alignment-using-minimap2)) # files are renamed so that FLAMES can find related .bam.bai files using the same file name. BiocFileCache automatically gives the files a random prefix which affects this. genome_bam <- paste0(temp_path, "/align2genome.bam") file.rename(bfc[[names(BiocFileCache::bfcadd(bfc, "Align BAM", paste(file_url, "align2genome.bam", sep="/")))]], genome_bam) genome_index <- paste0(temp_path, "/align2genome.bam.bai") file.rename(bfc[[names(BiocFileCache::bfcadd(bfc, "Align BAM Index", paste0(file_url, "/align2genome.bam.bai")))]], genome_index) realign_bam <- paste0(temp_path, "/realign2transcript.bam") file.rename(bfc[[names(BiocFileCache::bfcadd(bfc, "Realign BAM", paste(file_url, "realign2transcript.bam", sep="/")))]], realign_bam) realign_index <- paste0(temp_path, "/realign2transcript.bam.bai") file.rename(bfc[[names(BiocFileCache::bfcadd(bfc, "Realign BAM Index", paste(file_url, "realign2transcript.bam.bai", sep="/")))]], realign_index) ``` ### FLAMES Execution #### Pipeline Setup The FLAMES windows pipeline is broken into three steps, and three functions: `bulk_windows_pipeline_setup` or `sc_windows_pipeline_setup`, to setup the environment variables for the pipeline, and those specific for the bulk or single cell pipeline, `windows_pipeline_isoforms`, which is run after read alignment, in order to identify isoforms, and finally `windows_pipeline_quantification`, run after read realignment to quantify the results and output a SummarizedExperiemnt or SingleCellExperiment object. `bulk_windows_pipeline_setup` or `sc_windows_pipeline_setup` **must** be run before the other steps are begun. ```{r eval=TRUE} library(FLAMES) # If running the FLAMES single cell pipeline: # pipeline_variables <- sc_windows_pipeline_setup(annot=annot, fastq=fastq_dir, outdir=outdir, # genome_fa=genome_fa, config_file=config_file) # or, if running the bulk pipeline: pipeline_variables <- bulk_windows_pipeline_setup(annot=annot, fastq=fastq_dir, outdir=outdir, genome_fa=genome_fa, config_file=config_file) ``` The return list `pipeline_variables` contains all the required variables to run the future pipeline steps. This initial function sets up the pipeline, and validates input. After it is run, minimap2 alignment is required using the files given in `pipeline_variables$return_files`. This should be undertaken on a system with access to minimap2. ##### Minimap2 alignment To run the minimap2 alignment part of the FLAMES pipeline, this section describes the minimap2 commands used in the FLAMES pipeline, which need to be undertaken on an external machine with minimap2 installed. First, if `pipeline_variables$config$alignment_parameters$use_junctions` is `TRUE`, the following needs to be run, to produce `pipeline_variables$tmp_bed`. ```{r, eval=TRUE} gff3_to_bed12(pipeline_variables$annot, pipeline_variables$tmp_bed) ``` After this, the files `pipeline_variables$genome_fa`, `pipeline_variables$tmp_bed` and `pipeline_variables$fastq`, need to be transfered to a system with minimap2 installed, where the main minimap2 alignment step will take place After transferred, the following should be run on this new system. ```{r, eval=FALSE} {_prog} -ax splice -t 12 {_others} -k14 --secondary=no {_index} {_fq} -o {_out} ``` For this command: + `{_prog}` denotes the path to the minimap2 executable (eg, `/~/minimap2/minimap2`). If minimap2 is in PATH, `{_prog}` is only required to be `minimap2`. + `{_index}` denotes the path to the fasta index file, `pipeline_variables$genome_fa` + `{_fq}` denotes the input fastq file (for bulk, the merged fastq file), `pipeline_variables$fastq`. + `{_out}` denotes the temporary output sam file, and an approriate name should be given. + `{_others}` denotes the the command to run minimap2 with some optional arguments. + If `pipeline_variables$config$alignment_parameters$use_junctions$` is `TRUE`, `{_others}` should include `"--junc-bed {_bed} --junc-bonus 1"` where `{_bed}` is `pipeline_variables$tmp_bed`. + If `pipeline_variables$config$alignment_parameters$no_flank` is `TRUE`, `{_others}` should include `"--splice-flank=no"`. + If both extra command are present, they should be space separated. After this has been completed, the resulting output file given as the `{_out}` argument should be returned to the original system, and stored with the same address as `pipeline_variables$tmp_sam`. After this, the following code can be run on the original system, to complete the minimap2 alignment process. ```{r, eval=FALSE} samtools_as_bam(pipeline_variables$tmp_sam, pipeline_variables$tmp_bam) samtools_sort_index(pipeline_variables$tmp_bam, pipeline_variables$genome_bam) file.remove(pipeline_variables$tmp_sam) file.remove(pipeline_variables$tmp_bam) ``` #### Find Isoforms After read alignment, the pipeline can resume by calling `windows_pipeline_isoforms`, and passing in the `pipeline_variables` list returned from the previous step. ```{r eval=TRUE, echo=FALSE, results='hide'} pipeline_variables$genome_bam = genome_bam ``` ```{r eval=FALSE} pipeline_variables <- windows_pipeline_isoforms(pipeline_variables) ``` ```{r eval=TRUE, echo=FALSE, results='hide'} assembly <- paste0(temp_path, "/transcript_assembly.fa") file.rename(bfc[[names(BiocFileCache::bfcadd(bfc, "Transcript Assembly Fasta", paste(file_url, "transcript_assembly.fa", sep="/")))]], assembly) assembly_index <- paste0(temp_path, "/transcript_assembly.fa.fai") file.rename(bfc[[names(BiocFileCache::bfcadd(bfc, "Transcript Assembly Index", paste0(file_url, "/transcript_assembly.fa.fai")))]], assembly_index) isoforms_annot <- paste0(temp_path, "/isoform_annotated.gff3") file.rename(bfc[[names(BiocFileCache::bfcadd(bfc, "Annotated Isoform gff3", paste(file_url, "isoform_annotated.gff3", sep="/")))]], isoforms_annot) tss_tes <- paste0(temp_path, "/tss_tes.bedgraph") file.rename(bfc[[names(BiocFileCache::bfcadd(bfc, "TSS TES enrichment", paste0(file_url, "/tss_tes.bedgraph")))]], tss_tes) pipeline_variables$isoform_gff3 <- isoforms_annot pipeline_variables$transcript_fa <- assembly pipeline_variables$transcript_fa_idx <- assembly_index pipeline_variables$tss_tes_stat <- tss_tes g <- parse_gff_tree(annot) pipeline_variables$transcript_dict <- g$transcript_dict i <- parse_gff_tree(isoforms_annot) pipeline_variables$transcript_dict_i <- i$transcript_dict ``` Following this step, read realignment needs to be undertaken as before, on a system with access to minimap2. ##### Minimap2 read realignment Like the original minimap2 read alignment step, this process requires the transfer of files between the starting FLAMES system and one with access to minimap2. For this step, `pipeline_variables$transcript_fa` and `pipeline_variables$fastq` need to be transfered to the system with access to minimap2. Once this has been done, the minimap2 realignment step can be undertaken by running: ```{r, eval=FALSE} {_prog} -ax map-ont -p 0.9 --end-bonus 10 -N 3 -t 12 {_index} {_fq} -o {_out} ``` Where, as before: + `{_prog}` denotes the path to the minimap2 executable (eg, `/~/minimap2/minimap2`). If minimap2 is in PATH, `{_prog}` is only required to be `minimap2`. + `{_index}` denotes the path to the fasta index file, `pipeline_variables$genome_fa` + `{_fq}` denotes the input fastq file (for bulk, the merged fastq file), `pipeline_variables$fastq`. + `{_out}` denotes the temporary output sam file, and an approriate name should be given. The resulting sam file, `{_out}` should be transfered back to the original system, and stored as `pipeline_variables$tmp_sam`. Once this is done, the following can be run to complete the minimap2 realignment process. ```{r, eval=FALSE} samtools_as_bam(pipeline_variables$tmp_sam, pipeline_variables$tmp_bam) samtools_sort_index(pipeline_variables$tmp_bam, pipeline_variables$realign_bam) file.remove(pipeline_variables$tmp_sam) file.remove(pipeline_variables$tmp_bam) ``` #### Transcript Quantification Finally, transcript quantification can be performed, as required: ```{r eval=TRUE, echo=FALSE, results='hide'} pipeline_variables$realign_bam = realign_bam ``` ```{r eval=TRUE} se <- windows_pipeline_quantification(pipeline_variables) ``` The directory `outdir` now contains several output files returned from this pipeline. The output files generated by this pipeline are: \itemize{ \item{transcript_count.csv.gz}{ - a transcript count matrix (also contained in the SummarizedExperiment)} \item{isoform_annotated.filtered.gff3}{ - isoforms in gff3 format (also contained in the SummarizedExperiment)} \item{transcript_assembly.fa}{ - transcript sequence from the isoforms} \item{align2genome.bam}{ - sorted BAM file with reads aligned to genome} \item{realign2transcript.bam}{ - sorted realigned BAM file using the transcript_assembly.fa as reference} \item{tss_tes.bedgraph}{ - TSS TES enrichment for all reads (for QC)} } The pipeline also returns a SummarizedExperiment or SingleCellExperiment object, depending on the pipeline run, containing the data from `transcript_count.csv.gz`and `isoform_annotated.filtered.gff3`. ```{r eval=TRUE} se ``` #### Session Info ```{r echo=FALSE} utils::sessionInfo() ```