%\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{Using Branchpointer for annotation of intronic human splicing branchpoints} %\VignetteDepends{branchpointer} %\VignetteKeywords{branchpointer} %\VignettePackage{branchpointer} \documentclass{article} \usepackage[utf8]{inputenc} <>= BiocStyle::latex() @ \begin{document} <>= library(knitr) opts_chunk$set(out.width="0.7\\maxwidth",fig.align="center") @ \title{Using branchpointer for annotation of intronic human splicing branchpoints} \author{Beth Signal} \maketitle \tableofcontents \section{Introduction} The spliceosome mediates the formation of an intron lariat though inter-action between the 5’ splice site and branchpoint (Will and Luhrmann, 2011). A subsequent reaction at the 3’ SS then removes the intron lariat, producing a spliced RNA product. Mapping of branchpoints generally requires sequencing of the intron lariat following cDNA synthesis (Gao et al., 2008; Taggart et al., 2012). However, intron lariats are rapidly de-branched and degraded, and much less abundant than the spliced RNA, resulting in the poor recovery of elements using sequencing. Most recently, Mercer et al. (2015) employed a targeted sequencing approach to identify 59,359 branchpoints in 17.4\% of annotated human gene introns. Whilst this constituted the largest annotation to date, the identification of branchpoints was restricted to highly-expressed genes with sufficient sequence coverage. To address this limitation, and expand branchpoint annotations across the human genome, we have developed a machine-learning based model of branchpoints trained with this empirical annotation (Signal et al., 2016). This model requires only genomic sequence and exon annotations, and exhibits no discernible bias to gene type or expression, and can be applied using the R package, branchpointer. Aberrant splicing is known to lead to many human diseases (Singh and Cooper, 2012), however prediction of intronic variant effects have been typically limited to splice site alterations (McLaren et al., 2016; Wang et al., 2010). Therefore, in addition to annotation of branchpoints, branchpointer allows users to assess the effects of intronic mutations on branchpoint architecture. Gao,K. et al. (2008) Human branch point consensus sequence is yUnAy. Nucleic Acids Res., 36, 2257–67. McLaren,W. et al. (2016) The Ensembl Variant Effect Predictor. Genome Biol., 17, 122. Mercer,T.R. et al. (2015) Genome-wide discovery of human splicing branchpoints. Genome Res., 25, 290–303. Signal,B. et al. (2016) Machine-learning annotation of human splicing branchpoints. BioRxiv. doi: 10.1101/094003. Singh,R.K. and Cooper,T.A. (2012) Pre-mRNA splicing in disease and therapeutics. Trends Mol. Med., 18, 472–482. Taggart,A.J. et al. (2012) Large-scale mapping of branchpoints in human pre-mRNA transcripts in vivo. Nat. Struct. Mol. Biol., 19, 719–21. Wang,K. et al. (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res., 38, e164. Will,C.L. and Luhrmann,R. (2011) Spliceosome structure and function. Cold Spring Harb. Perspect. Biol., 3, a003707. \section{Preparation} \subsection{Download genome annotations} Branchpointer requires a genome annotation derived from a GTF file and the fasta sequence for this genome annotation. We will be using the GENCODE annotation (http://www.gencodegenes.org/releases/current.html) as an example, although others and custom annotations can be used. Create or move to a working directory where these files can be stored. Note that these can be large files (over 1GB) when uncompressed \begin{verbatim} wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/gencode.v24.annotation.gtf.gz gunzip gencode.v24.annotation.gtf.gz \end{verbatim} branchpointer requires either a genome .fa file, or a BSGenome object for sequence retrieval. The genome must correspond to the gtf used -- i.e. gencodev24 uses GRCh38 (p5). Download .fa: \begin{verbatim} wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/GRCh38.p5.genome.fa.gz gunzip GRCh38.p5.genome.fa.gz \end{verbatim} or load a \Biocpkg{BSGenome}: <>= library(BSgenome.Hsapiens.UCSC.hg38) genome <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38 @ \subsection{Read in exon annotations} Start by loading branchpointer. <>= library(branchpointer) @ readExonAnnotation will generate an exon annotation GRanges object from a gtf. We will load in the gtf downloaded from the preparation section. <>= exons <- gtfToExons("gencode.v24.annotation.gtf") @ We will load in a small gtf from the package data for the following examples. <>= smallExons <- system.file("extdata","gencode.v24.annotation.small.gtf", package = "branchpointer") exons <- gtfToExons(smallExons) @ \section{Branchpoint annotations in intronic regions} \subsection{Read query and calculate location attributes} Query regions must contain a branchpoint window - that is the region located at -18 to -44 from the 3' splice site. Each region given will be treated as only one query, and associated with the closest 3' exon. To cover multiple 3'exons, please provide branchpointer with separate region queries. For known regions, queries can be supplied as a table: <>= queryIntron <- system.file("extdata","intron_example.txt", package = "branchpointer") queryIntron <- readQueryFile(queryIntron,queryType = "region") head(queryIntron) @ Then location information can be retrieved using <>= queryIntron <- getQueryLoc(queryIntron,queryType = "region",exons = exons) head(queryIntron) @ For large numbers of queries (over 500), it is recommended to use parallelisation to speed up computation. This can be done by setting \begin{verbatim}use_parallel=TRUE\end{verbatim} and supplying a cores number greater than 1 to functions with this argument. Note that if the number of specified cores is greater than the number available, the maximum number available will be utilised <>= queryIntron <- getQueryLoc(queryIntron,queryType="region", exons = exons, useParallel=TRUE, cores=4) @ Alternatively, to generate branchpoint window region queries by exon annotations, the exon annotation file can be used: Note that when searching for genes, transcripts, or exons, the ids used must be in the same format as in the annotation file (i.e. ENSG00000XXXXXX, ENST00000XXXXXX, ENSE00000XXXXXX). If you are unsure of an id, aliases can typically be found through ensembl (ensembl.org), or through a biomaRt query. <>= queryIntronMake <- makeRegions("ENSE00003541068", "exon_id", exons) head(queryIntronMake) #or for multiple ids queryIntronMake <- lapply(c("ENSE00003541068", "ENSE00003461148"), makeRegions, "exon_id", exons) queryIntronMake <- do.call("c", queryIntronMake) @ \subsubsection{Using bedtools} During the prediction step, if a BSGenome object is not specified,sequences covering each site +/- 250 nt can be retrieved using bedtools. The absolute location of the bedtools binary must be provided for calls from within R. To find the location of your installed bedtools binary, using the command line type: \begin{verbatim} which bedtools \end{verbatim} If chromosome names in the .fa genome file do not match those in the query (i.e chr1 in query, 1 in .fa), the argument rm\_chr should be set to FALSE. \subsection{Predict branchpoint probabilities} Branchpoint probability scores can then be evaluated using the branchpointer model. This will generate a new GRanges object with a row for each site (of 27) in branchpoint window regions. If a SNP query type is provided (See next section), this will also perform an in silico mutation of the sequence. We recommend use of the cutoff probability 0.5 to distinguish branchpoints and non-branchpoint sites. U2 binding energy can be used as a measurement of branchpoint strength when the probability score is above the cutoff. All features required for the model to predict branchpoint probability are contained within the output object, along with the score and U2 binding energy. <>= branchpointPredictionsIntron <- predictBranchpoints(queryIntron, queryType = "region", BSgenome = genome) head(branchpointPredictionsIntron) @ The window scores can be plotted using plotBranchpointWindow(), with optional plots for gene and isoform structure. The main panel displays the probability scores of each site within the branchpoint window. The opacity of the bars is representative of relative U2 binding energy (darker = stronger), and the lower panel shows U2 binding energy for all sites above the provided probability cutoff. BRCA2 intron (ENSE00002167182.1 - ENSE00003461148.1): <>= plotBranchpointWindow(queryIntron$id[2], branchpointPredictionsIntron, probabilityCutoff = 0.5, plotMutated = FALSE, plotStructure = TRUE, exons = exons) @ \section{Effects of SNPs on branchpoint annotations} In addition to locating branchpoints in intronic windows, branchpointer can be used to evaluate the local effects of SNPs on branchpoints. The general workflow is the same as for annotation of intronic windows, however \begin{verbatim}queryType="SNP"\end{verbatim} must be used. \subsection{Read query and calculate location attributes} Query SNPs should be located nearby a branchpoint window to have any potential effects on branchpoint architecture SNP queries can be supplied as a table formatted as follows: <>= querySNP <- system.file("extdata","SNP_example.txt", package = "branchpointer") querySNP <- readQueryFile(querySNP,queryType="SNP") @ Alternatively, appropriate attributes can be pulled from biomart when a list of refsnp ids is provided: <>= library(biomaRt) mart <- useMart("ENSEMBL_MART_SNP", dataset="hsapiens_snp",host="www.ensembl.org") querySNP <- snpToQuery(c("rs17000647","rs5031002","rs998731"), mart.snp = mart) @ By default, all SNPs retrieved will be unstranded, and hence further processing will be done on both strands Location information can be retrieved using: <>= querySNP <- getQueryLoc(querySNP, queryType="SNP", exons = exons, filter = FALSE) head(querySNP) @ Each SNP will be associated with the closest 3' exon. If SNPs are distal from branchpoint windows, the max\_dist argument will remove any greater than the specified distance. Filtering prior to exon associations can speed up processing in instances where it is unknown if the majority of SNPs fall nearby branchpoint windows. Queries can be provided as stranded or unstranded. In the case of unstranded queries, any value except "+" or "-" will cause branchpointer to run on both strands. \subsection{Predict branchpoint probabilities} Using a .fa and bedtools: <>= branchpointPredictionsSNP <- getBranchpointSequence(querySNP, queryType = "SNP", genome = "GRCh38.p5.genome.fa", bedtoolsLocation="/Apps/bedtools2/bin/bedtools") @ Using a BSgenome: <>= #for query SNPs branchpointPredictionsSNP <- predictBranchpoints(querySNP, queryType = "SNP", BSgenome = genome) #to summarise effects: querySNP <- predictionsToStats(querySNP,branchpointPredictionsSNP) head(querySNP) @ The window scores in the reference and alternative sequences can be visualised using plotBranchpointWindow(). rs17000647 in C4orf26 intron 1. <>= plotBranchpointWindow(querySNP$id[1], branchpointPredictionsSNP, probabilityCutoff = 0.5,plotMutated = TRUE, plotStructure = TRUE,exons = exons) @ \section{Session info} <>= sessionInfo() @ \end{document}