--- title: "4. Intermediate Lab for R & Bioconductor " author: "Sonali Arora" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{4. Intermediate Lab for R & Bioconductor} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")), error=FALSE) ``` Author: Sonali Arora (sarora@fredhutch.org)
Date: 20-22 July, 2015
The material in this course requires R version 3.2.1 and Bioconductor version 3.2 ## Intermediate lab for Bioconductor __Exercise 1__ Find the package in Bioconductor which contains annotation databases generated from UCSC for Rat Norvegicus (assembly rn5), load it and save it in a variable called 'txdb'. Using this object , do the following - a) find all the genes contained in this assembly and save it in a varible called 'ratGenes'. b) How many sequences are contained in ratGenes ? (Hint : ?Seqinfo) c) The 'ratGenes' contains scaffolds also - how do you subsetthe object to contain sequences only from the standard chromosomes ? b) I am interested in gene 'Acsl5' (entrez gene id=94340). Does this exist in 'ratGenes'? what are its chromosome coordinates ? __Exercise 2__ Find the package in Bioconductor thats stores the Full genome sequences for Rattus norvegicus (Rat) as provided by UCSC (rn5, Mar. 2012) a) Load the package and save it in a object called 'ratSeq' b) what object is this sequence information stored in ? c) get the dna sequence information for acsl5 and store it in 'acsl5_sequence' d) calculate the GC content from this sequnence. __Exercise 3__ In the 'ratGenes' object above, you get only the entrez gene ids, can you get the gene names for each gene ? __Exercise 4__ Get the annotation databases from NCBI for Homo sapiens (assembly build GRCh38.80), create a txdb object (similar to what we saw in question 3 above) and get the genes. ( Hint - involves starting from scratch with a GTF file) __Exercise 5__ The liftOver facilities developed in conjunction with the UCSC browser track infrastructure are available for transforming data in GRanges formats. We want to transform data from rn4 to the latest asembly rn6. a) The transformation to rn6 coordinates is defined by a chain file provided by UCSC. Get the chain file which contains transformations from rn5 to rn6. b) Perform the liftover after getting the chain file. ## Solutions __Answer 1__ ```{r txdb} suppressPackageStartupMessages({ library("TxDb.Rnorvegicus.UCSC.rn5.refGene") }) txdb <- TxDb.Rnorvegicus.UCSC.rn5.refGene ## find all genes ratGenes <- genes(txdb) ## list all sequences seqinfo(ratGenes) ## subset to contain only standard chromosomes ratGenes <- keepStandardChromosomes(ratGenes) ## find gene 'Acsl5' acsl5 <- ratGenes[which(mcols(ratGenes)$gene_id==94340),] acsl5 ``` __Answer 2__ ```{r bsgenome} suppressPackageStartupMessages({ library(BSgenome.Rnorvegicus.UCSC.rn5) }) ratSeq <- BSgenome.Rnorvegicus.UCSC.rn5 class(ratSeq) ## get the sequence acsl5_sequence <- getSeq(ratSeq, acsl5) ## calculate the GC content letterFrequency(acsl5_sequence, "GC", as.prob=TRUE) ``` __Answer 3__ ```{r select-rat} library("Rattus.norvegicus") ## get a mapping between all entrex id and gene names ratGeneNames <- select(Rattus.norvegicus, ratGenes$gene_id, columns=c("SYMBOL", 'GENEID'), keytype="GENEID") ## match the entrz id with result before subsetting idx <- match(ratGeneNames$GENEID, ratGenes$gene_id) ## add mactched result to GRanges mcols(ratGenes) <- ratGeneNames[idx,] ratGenes ``` __Answer 4__ Steps include a) Getting the GTF file from NCBI for a particular build of Homo sapiens that you're interested in. ( AnnotationHub is a package inside Bioconductor which automatically gets the file for you) b) create a txdb object from this GTF file (which is read in as a GRanges) c) extract genes from the txdb object as before. These steps are beneficial if you cannot find pre-packaged genome annotations for your favourite organism as a package inside Bioconductor. ```{r gtf-to-gr, eval=FALSE} library(AnnotationHub) ah = AnnotationHub() ## find the file gtf_humans <- query(ah , c("gtf", "Homo sapiens", "grch38","80")) gtf_humans ## download the file gtfFile <- ah[["AH47066"]] ## create a txdb library(GenomicFeatures) txdb <- makeTxDbFromGRanges(gtfFile) #may take some time.. txdb ## get the genes from the taxdb object humanGenes <- genes(txdb) ``` __Answer 5__ One way of getting a chain file would be to find the file in ucsc, download it and read it in using `rtracklayer::import.chain()`. An easier solution would be to find the files via `AnnotationHub` ```{r chainfile} ## a) get the chain file ## load the package and query the files to find the file we want library(AnnotationHub) ah = AnnotationHub() query(ah , c("rattus", "rn5", "rn6")) ## learn more about the file you want ah["AH14761"] ## download the file ratChain <- ah[["AH14761"]] ratChain ## b) perform the liftover library(rtracklayer) lft <- liftOver(acsl5, ratChain) lft ``` ## References - [Basic Introduction to GenomicRanges](http://bioconductor.org/packages/devel/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.pdf) - [Manipulating GRanges](http://bioconductor.org/packages/devel/bioc/vignettes/GenomeInfoDb/inst/doc/GenomeInfoDb.pdf) - [GenomicRangesHOWTOs (Advanced)](http://bioconductor.org/packages/devel/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf) - [Introduction to Bioconductor for Sequence Data](http://www.bioconductor.org/help/workflows/sequencing/) - [Working with BSgenome Packages](http://bioconductor.org/packages/devel/bioc/vignettes/BSgenome/inst/doc/GenomeSearching.pdf) - [Utilizing TxDb objects with GenomicFeatures](http://bioconductor.org/packages/devel/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.pdf) - [AnnotationHub Introduction](http://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub.html), [Case study](http://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub-HOWTO.html) - [liftOver workflow](http://bioconductor.org/help/workflows/liftOver/) ## What to not miss at BioC2015 ! If you liked this lab and want to learn more in this area, do not miss the following labs at BioC2015 - [_Bioconductor annotation resources_](https://github.com/mrjc42/BiocAnnotRes2015) by Marc Carlson, Sonali Arora. (Wednesday, Session 3, 1:00pm-2:45pm) - _Practical introduction to Bioconductor foundational data structures for high throughput sequencing analysis_ by Herve Pages, Michael Lawrence. (Wednesday, Session 3, 3:15pm-5:00pm) ## `sessionInfo()` ```{r sessionInfo} sessionInfo() ```