--- title: 'SpliceWiz: the cookbook' author: "Alex CH Wong" date: "`r format(Sys.Date(), '%m/%d/%Y')`" output: rmarkdown::html_document: highlight: pygments toc: true toc_float: true abstract: This vignette is a guide containing example code for performing real-life tasks. Importantly, it covers some functionality that were not covered in the Quick-Start vignette (because they are too computationally intensive to be reproducible in a vignette). Version `r packageVersion("SpliceWiz")` vignette: > %\VignetteIndexEntry{SpliceWiz: the cookbook} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` \ # Loading SpliceWiz For instructions on installing and configuring SpliceWiz, please see the Quick-Start vignette. ```{r} library(SpliceWiz) ``` \ # Reference Generation First, define the path to the directory in which the reference should be stored. This directory will be made by SpliceWiz, but its parent directory must exist, otherwise an error will be returned. ```{r eval = FALSE} ref_path <- "./Reference" ``` \ ### Create a SpliceWiz reference from user-defined FASTA and GTF files locally Note that setting `genome_path = "hg38"` will prompt SpliceWiz to use the default files for nonPolyA and Mappability exclusion references in the generation of its reference. Valid options for `genome_path` are "hg38", "hg19", "mm10" and "mm9". ```{r eval=FALSE} buildRef( reference_path = ref_path, fasta = "genome.fa", gtf = "transcripts.gtf", genome_type = "hg38" ) ``` \ ### Prepare genome resources and building the reference as separate steps `buildRef()` first runs `getResources()`, which prepares the genome and gene annotations by storing a compressed local copy in the `resources` subdirectory of the given reference path. Specifically, a binary compressed version of the FASTA file (a.k.a. TwoBitFile), and a gzipped GTF file. If `fasta` and/or `gtf` are https or ftp links, the resources will be downloaded from the internet (which may take a while). After local compressed versions of the genome and gene annotations are prepared, `buildRef()` will proceed to generate the SpliceWiz reference. Note that these two steps can be run separately. `getResources()` will prepare local compressed copies of the FASTA / GTF resources without generating the SpliceWiz reference. Running `buildRef()`, with `reference_path` specifying where the resources were prepared previously with `getResources()`, will perform the 2nd step (SpliceWiz reference generation) without needing to prepare the genome resources (in this case, set the parameters `fasta = ""` and `gtf = ""`). As an example, the below steps: ```{r eval=FALSE} getResources( reference_path = ref_path, fasta = "genome.fa", gtf = "transcripts.gtf" ) buildRef( reference_path = ref_path, fasta = "", gtf = "", genome_type = "hg38" ) ``` is equivalent to this: ```{r eval=FALSE} buildRef( reference_path = ref_path, fasta = "genome.fa", gtf = "transcripts.gtf", genome_type = "hg38" ) ``` \ ### Overwriting an existing reference, but using the same annotations To re-build and overwrite an existing reference, using the same resource annotations, set `overwrite = TRUE` ```{r eval=FALSE} # Assuming hg38 genome: buildRef( reference_path = ref_path, genome_type = "hg38", overwrite = TRUE ) ``` \ If `buildRef()` is run without setting `overwrite = TRUE`, it will terminate if the file `SpliceWiz.ref.gz` is found within the reference directory. \ ### Create a SpliceWiz reference using web resources from Ensembl's FTP The following will first download the genome and gene annotation files from the online resource and store a local copy of it in a file cache, facilitated by BiocFileCache. Then, it uses the downloaded resource to create the SpliceWiz reference. ```{r eval=FALSE} FTP <- "ftp://ftp.ensembl.org/pub/release-94/" buildRef( reference_path = ref_path, fasta = paste0(FTP, "fasta/homo_sapiens/dna/", "Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"), gtf = paste0(FTP, "gtf/homo_sapiens/", "Homo_sapiens.GRCh38.94.chr.gtf.gz"), genome_type = "hg38" ) ``` \ ### Create a SpliceWiz reference using AnnotationHub resources AnnotationHub contains Ensembl references for many genomes. To browse what is available: ```{r} require(AnnotationHub) ah <- AnnotationHub() query(ah, "Ensembl") ``` For a more specific query: ```{r} query(ah, c("Homo Sapiens", "release-94")) ``` We wish to fetch "AH65745" and "AH64631" which contains the desired FASTA and GTF files, respectively. To build a reference using these resources: ```{r eval=FALSE} buildRef( reference_path = ref_path, fasta = "AH65745", gtf = "AH64631", genome_type = "hg38" ) ``` `Build-Reference-methods` will recognise the inputs of `fasta` and `gtf` as AnnotationHub resources if they begin with "AH". \ ### Create a SpliceWiz reference from species other than human or mouse For human and mouse genomes, we highly recommend specifying `genome_type` as the default mappability file is used to exclude intronic regions with repeat sequences from intron retention analysis. For other species, one could generate a SpliceWiz reference without this reference: ```{r eval=FALSE} buildRef( reference_path = ref_path, fasta = "genome.fa", gtf = "transcripts.gtf", genome_type = "" ) ``` \ If one wishes to prepare a Mappability Exclusion for species other than human or mouse, please see the `Calculating Mappability Exclusions using STAR` section below. \ ### (NEW) Gene ontology annotations For human and mouse genomes, gene ontology annotations are automatically generated. This is inferred by specifying `genome_type` to the human or mouse genome. For other species, or to specify human/mouse, this should be specified in the `ontologySpecies` parameter of `buildRef()`. Only Ensembl/orgDB resources are supported (for now). For a list of available species: ```{r} getAvailableGO() ``` For example, to specify arabidopsis: ```{r eval=FALSE} buildRef( reference_path = ref_path, fasta = "genome.fa", gtf = "transcripts.gtf", genome_type = "", ontologySpecies = "Arabidopsis thaliana" ) ``` \ # STAR reference generation (using SpliceWiz wrappers) ### Checking if STAR is installed To use `STAR` to align FASTQ files, one must be using a system with `STAR` installed. This software is not available in Windows. To check if `STAR` is available: ```{r} STAR_version() ``` \ ### Building a STAR reference ```{r eval = FALSE} ref_path = "./Reference" # Ensure genome resources are prepared from genome FASTA and GTF file: if(!dir.exists(file.path(ref_path, "resource"))) { getResources( reference_path = ref_path, fasta = "genome.fa", gtf = "transcripts.gtf" ) } # Generate a STAR genome reference: STAR_BuildRef( reference_path = ref_path, n_threads = 8 ) ``` \ Note that, by default, `STAR_BuildRef` will store the STAR genome reference in the `STAR` subdirectory within `reference_path`. To override this setting, set the `STAR_ref_path` parameter to a directory path of your choice, e.g.: ```{r, eval = FALSE} STAR_BuildRef( reference_path = ref_path, STAR_ref_path = "/path/to/another/directory", n_threads = 8 ) ``` \ ### Building a STAR genome without specifying gene annotations Sometimes, one might wish to build a genome annotation without first specifying the gene annotations. Reasons one might want to do this include: * Making a STAR reference is computationally intensive, so one might wish to use the same STAR reference for all projects involving the same species * Reducing any potential bias for annotated splice junctions during alignment. We can use `STAR_buildGenome` to do this: ```{r eval = FALSE} # Generate a STAR genome reference: STAR_buildGenome( reference_path = ref_path, STAR_ref_path = "/path/to/hg38" n_threads = 8 ) ``` This STAR reference is derived from the genome FASTA file but not the gene annotation GTF file. Prior to alignment, additional parameters need to be supplied (which should take 5 minutes). These include: * gene annotation (GTF) file, which is automatically generated by setting the SpliceWiz reference path to the `reference_path` parameter * sjdbOverhang (default 100), which is ideally the read length (minus 1) * sequences for any spike-in standards, such as ERCC FASTA files To generate an on-the-fly (i.e., alignment-ready) STAR reference from a genome-derived reference: ```{r, eval = FALSE} STAR_new_ref <- STAR_loadGenomeGTF( reference_path = ref_path, STAR_ref_path = "/path/to/hg38", STARgenome_output = file.path(tempdir(), "STAR"), n_threads = 8, sjdbOverhang = 100, extraFASTA = "./ercc.fasta" ) ``` The path to the on-the-fly reference is specified by the return value (`STAR_new_ref` in the above example). As already explained, this step allows a single STAR reference to be built for each species, which can be adapted for different projects based on their specific technical specifications (e.g. different read length can be adapted by setting different `sjdbOverhang`, or any spike-ins by setting the spike-in FASTA using `extraFASTA`). ### Calculating Mappability Exclusions using STAR (optional) Genomes contain regions of low mappability (i.e. areas which are difficult for reads or fragments to align to). A common computational cause of low mappability include repeat sequences. IRFinder uses an empirical method to determine regions of low mappability, which we adopted in SpliceWiz. These resources are used automatically when generating the SpliceWiz reference and setting the `genome_type` to supported genomes (hg38, hg19, mm10, mm9). For other species, one may wish to generate their own annotations of low mappability regions using the STAR aligner. The `STAR_mappability` wrapper function will use the STAR aligner to calculate regions of low mappability within the given genome. ```{r eval = FALSE} STAR_mappability( reference_path = ref_path, STAR_ref_path = file.path(ref_path, "STAR"), map_depth_threshold = 4, n_threads = 8, read_len = 70, read_stride = 10, error_pos = 35 ) ``` In the above example, `STAR_mappability()` will use the given STAR reference (inside the `STAR_ref_path` directory), and the genome found within the `reference_path` SpliceWiz reference, to generate synthetic reads. * `read_len` specifies the length of these synthetic reads (default `70`) * `read_stride` specifies the nucleotide distance between adjacent synthetic reads (default `10`). These will be generated with alternate `+` / `-` strand * `error_pos` introduces a single nucleotide error at the specified position (default `35`), which will generate an SNP at the center of the 70-nt synthetic read. These synthetic reads will then be aligned back to the STAR genome to create a BAM file, which is later processed to measure the coverage depth of the genome by these synthetic reads. Finally, regions with coverage depth of `map_depth_threshold` or below will be defined as regions of "low mappability". In the above example, 70-nt reads of 10-nt stride will produce synthetic reads such that each nucleotide is expected to have a coverage of `70 / 10 = 7` nucleotides. A coverage of `4` nucleotides or less equates to a coverage of < ~60% of expected depth. \ ### Building BOTH STAR and SpliceWiz references together If `STAR` is available on the same computer or server where R/RStudio is being run, we can use the one-line function `buildFullRef`. This function will: * Prepare the resources from the given FASTA and GTF files (runs `getResources`) * Generate a STAR genome (runs `STAR_BuildRef`) * Use the STAR genome and the FASTA file to *de-novo* calculate and define low mappability regions (runs `STAR_mappability`) * Build the SpliceWiz reference using the genome resources and mappability file (runs `buildRef`) This step is recommended when one wishes to build a non-human/mouse genome in a single step, including generating low-mappability regions to exclude measuring IR events with low mappability. ```{r eval=FALSE} buildFullRef( reference_path = ref_path, fasta = "genome.fa", gtf = "transcripts.gtf", genome_type = "", use_STAR_mappability = TRUE, n_threads = 8 ) ``` `n_threads` specify how many threads should be used to build the STAR reference and to calculate the low mappability regions \ ### Mappability exclusion generation using Rsubread If `STAR` is not available, `Rsubread` is available on Bioconductor for alignment and can be used to perform mappability calculations. The example code in the manual is displayed here for convenience, to demonstrate how this would be done: ```{r eval = FALSE} require(Rsubread) # (1a) Creates genome resource files ref_path <- file.path(tempdir(), "Reference") getResources( reference_path = ref_path, fasta = chrZ_genome(), gtf = chrZ_gtf() ) # (1b) Systematically generate reads based on the SpliceWiz example genome: generateSyntheticReads( reference_path = ref_path ) # (2) Align the generated reads using Rsubread: # (2a) Build the Rsubread genome index: subreadIndexPath <- file.path(ref_path, "Rsubread") if(!dir.exists(subreadIndexPath)) dir.create(subreadIndexPath) Rsubread::buildindex( basename = file.path(subreadIndexPath, "reference_index"), reference = chrZ_genome() ) # (2b) Align the synthetic reads using Rsubread::subjunc() Rsubread::subjunc( index = file.path(subreadIndexPath, "reference_index"), readfile1 = file.path(ref_path, "Mappability", "Reads.fa"), output_file = file.path(ref_path, "Mappability", "AlignedReads.bam"), useAnnotation = TRUE, annot.ext = chrZ_gtf(), isGTF = TRUE ) # (3) Analyse the aligned reads in the BAM file for low-mappability regions: calculateMappability( reference_path = ref_path, aligned_bam = file.path(ref_path, "Mappability", "AlignedReads.bam") ) # (4) Build the SpliceWiz reference using the calculated Mappability Exclusions buildRef(ref_path) ``` \ Note that the default output file for `calculateMappability()` (step 3) is `Mappability/MappabilityExclusion.bed.gz` found within the `reference_path` directory. Then `buildRef()` (step 4) will automatically use this file, regardless of the `genome_type` parameter. The exception is if `MappabilityRef` parameter is set to a different file. This conveniences users to generate their own human/mouse mappability files but use the default non-polyA reference, e.g.: ```{r, eval = FALSE} buildRef(ref_path, genome_type = "hg38") ``` \ # STAR alignment (using SpliceWiz wrappers) First, remember to check that STAR is available via command line: ```{r} STAR_version() ``` ### Aligning a single sample using STAR ```{r eval = FALSE} STAR_alignReads( fastq_1 = "sample1_1.fastq", fastq_2 = "sample1_2.fastq", STAR_ref_path = file.path(ref_path, "STAR"), BAM_output_path = "./bams/sample1", n_threads = 8, trim_adaptor = "AGATCGGAAG" ) ``` \ Note that by default, `STAR_alignReads()` will "trim" Illumina adapters (in fact they will be soft-clipped using STAR's `--clip3pAdapterSeq` option). To disable this feature, set `trim_adapter = ""` in the `STAR_alignReads()` function. ### Aligning multiple samples using STAR ```{r eval = FALSE} Experiment <- data.frame( sample = c("sample_A", "sample_B"), forward = file.path("raw_data", c("sample_A", "sample_B"), c("sample_A_1.fastq", "sample_B_1.fastq")), reverse = file.path("raw_data", c("sample_A", "sample_B"), c("sample_A_2.fastq", "sample_B_2.fastq")) ) STAR_alignExperiment( Experiment = Experiment, STAR_ref_path = file.path("Reference_FTP", "STAR"), BAM_output_path = "./bams", n_threads = 8, two_pass = FALSE ) ``` To use two-pass mapping, set `two_pass = TRUE`. We recommend disabling this feature, as one-pass mapping is adequate in typical-use cases. Two-pass mapping is recommended if one expects a large number of novel splicing events or if the gene annotations (of transcript isoforms) is likely to be incomplete. Additionally, two-pass mapping is highly memory intensive and should be reserved for systems with high memory resources. \ \ ### Finding FASTQ files recursively from a given directory SpliceWiz can identify sequencing FASTQ files recursively from a given directory. It assumes that forward and reverse reads are suffixed as `_1` and `_2`, respectively. Users can choose to identify such files using a specified file extension. For example, to recursively identify FASTQ files of the format `{sample}_1.fq.gz` and `{sample}_2.fq.gz`, use the following: ```{r eval = FALSE} # Assuming sequencing files are named by their respective sample names fastq_files <- findFASTQ( sample_path = "./sequencing_files", paired = TRUE, fastq_suffix = ".fq.gz", level = 0 ) ``` For gzipped fastq files, `fastq_suffix` should be `".fq.gz"` or `".fastq.gz"`. For uncompressed fastq files, it should be `".fq"` or `".fastq"`. Please check your files in order to correctly set this option. `findFASTQ()` will return a 2- or 3-column data frame (depending if `paired` was set to `FALSE` or `TRUE`, respectively). The first column is the sample name (the file name, if `level = 0`, or the parent directory name, if `level = 1`). The subsequent columns are the paths of the forward and reverse reads. The data.frame returned by the `findFASTQ()` function can be parsed into the `STAR_alignExperiment` function. This will align all samples contained in the data.frame parsed via the `Experiment` parameter. ```{r eval = FALSE} STAR_alignExperiment( Experiment = fastq_files, STAR_ref_path = file.path("Reference_FTP", "STAR"), BAM_output_path = "./bams", n_threads = 8, two_pass = FALSE ) ``` \ Note that, if a directory contains multiple forward and reverse FASTQ files, they will be aligned to the same BAM file. This can be done by setting `level = 1` in the `findFASTQ()` function, resulting in multiple rows with the same sample name. \ # Processing BAM files To conveniently find all BAM files recursively in a given path: ```{r eval=FALSE} bams <- findBAMS("./bams", level = 1) ``` This convenience function returns the putative sample names, either from BAM file names themselves (`level = 0`), or from the names of their parent directories (`level = 1`). First, ensure that a SpliceWiz reference has been generated using the `buildRef()` function. This reference should be parsed into the `reference_path` parameter of the `processBAM()` function. To run `processBAM()` using 4 OpenMP threads: ```{r eval=FALSE} # assume SpliceWiz reference has been generated in `ref_path` using the # `buildRef()` function. processBAM( bamfiles = bams$path, sample_names = bams$sample, reference_path = ref_path, output_path = "./pb_output", n_threads = 4, useOpenMP = TRUE ) ``` \ ### Creating COV files from BAM files without running processBAM Sometimes one may wish to create a COV file from a BAM file without running `processBAM()`. One reason might be because a SpliceWiz reference is not available. To convert a list of BAM files, run `BAM2COV()`. This is a function structurally similar to `processBAM()` but without the need to give the path to the SpliceWiz reference: ```{r eval=FALSE} BAM2COV( bamfiles = bams$path, sample_names = bams$sample, output_path = "./cov_output", n_threads = 4, useOpenMP = TRUE ) ``` \ ### Converting COV files to BigWig Sometimes, users may wish to convert COV files to BigWig. One common reason may be to generate strand-specific coverage to compare with BigWig files on IGV. For example, to generate a BigWig file containing reads on the negative strand: ```{r} se <- SpliceWiz_example_NxtSE() cov_file <- covfile(se)[1] cov_negstrand <- getCoverage(cov_file, strand = "-") bw_file <- file.path(tempdir(), "sample_negstrand.bw") rtracklayer::export(cov_negstrand, bw_file, "bw") ``` \ ### The OpenMP parameter explained SpliceWiz processes BAM files using OpenMP-based parallelisation (multi-threading), using our ompBAM C++ library (available via the ompBAM Bioconductor package). The advantage of using this approach (instead of processing multiple BAM files each using a single thread) is that the latter approach uses a lot more memory. Our OpenMP-based approach processes BAM files one at a time, avoiding the memory cost when analysing multiple BAM files simultaneously. Note that, by default, `processBAM` and `BAM2COV` will use OpenMP where available (which is natively supported on Windows and Linux). For MacOS, if OpenMP is not available, these functions will use BiocParallel's `MulticoreParam` to multi-thread process BAM files (1 BAM per thread). Beware that this may take a lot of RAM! (Typically 5-10 Gb per BAM file). We highly suggest considering installing OpenMP libraries on MacOS, as this will lower RAM usage. \ # Collating the experiment Assuming the SpliceWiz reference is in `ref_path`, after running `processBAM()` as shown in the previous section, use the convenience function `findSpliceWizOutput()` to tabulate a list of samples and their corresponding `processBAM()` outputs: ```{r eval=FALSE} expr <- findSpliceWizOutput("./pb_output") ``` This data.frame can be directly used to run `collateData`: ```{r eval = FALSE} collateData( Experiment = expr, reference_path = ref_path, output_path = "./NxtSE_output" ) ``` * NB: Novel splicing detection can be enabled by setting `novelSplicing = TRUE`. See the Quick-Start vignette for more details about the various parameters associated with novel splicing detection. ```{r eval = FALSE} collateData( Experiment = expr, reference_path = ref_path, output_path = "./NxtSE_output_novelSplicing", novelSplicing = TRUE ) ``` Then, the collated data can be imported as a `NxtSE` object, which is an object that inherits `SummarizedExperiment` and has specialized containers to hold additional data required by SpliceWiz. ```{r eval = FALSE} se <- makeSE("./NxtSE_output") ``` \ # Downstream analysis using SpliceWiz Please refer to SpliceWiz: Quick-Start vignette for worked examples using the example dataset. \ \ # SessionInfo ```{r} sessionInfo() ```