19-24 June | CSAMA 2022
BiocManager::install("org.Hs.eg.db") library(org.Hs.eg.db) org.Hs.eg.db
## OrgDb object: ## | DBSCHEMAVERSION: 2.1 ## | Db type: OrgDb ## | Supporting package: AnnotationDbi ## | DBSCHEMA: HUMAN_DB ## | ORGANISM: Homo sapiens ## | SPECIES: Human ## | EGSOURCEDATE: 2022-Mar17 ## | EGSOURCENAME: Entrez Gene ## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA ## | CENTRALID: EG ## | TAXID: 9606 ## | GOSOURCENAME: Gene Ontology ## | GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo ## | GOSOURCEDATE: 2022-03-10 ## | GOEGSOURCEDATE: 2022-Mar17 ## | GOEGSOURCENAME: Entrez Gene ## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA ## | KEGGSOURCENAME: KEGG GENOME ## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes ## | KEGGSOURCEDATE: 2011-Mar15 ## | GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens) ## | GPSOURCEURL: ## | GPSOURCEDATE: 2022-Nov23 ## | ENSOURCEDATE: 2021-Dec21 ## | ENSOURCENAME: Ensembl ## | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta ## | UPSOURCENAME: Uniprot ## | UPSOURCEURL: http://www.UniProt.org/ ## | UPSOURCEDATE: Fri Apr 1 14:42:16 2022
How can we get access to the data in a Bioconductor annotation resource?
The main function is select
:
AnnotationDbi::select(anno, keys, columns, keytype)
Where
RangedSummarizedExperiment
with results from an RNA-Seq experiment.library(airway) data(airway) ids <- head(rownames(airway)) ids
## [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ## [5] "ENSG00000000460" "ENSG00000000938"
select(org.Hs.eg.db, ids, "SYMBOL", "ENSEMBL")
## 'select()' returned 1:1 mapping between keys and columns
## ENSEMBL SYMBOL ## 1 ENSG00000000003 TSPAN6 ## 2 ENSG00000000005 TNMD ## 3 ENSG00000000419 DPM1 ## 4 ENSG00000000457 SCYL3 ## 5 ENSG00000000460 C1orf112 ## 6 ENSG00000000938 FGR
columns
to list annotations available in specific annotation object.columns(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" ## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME" ## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP" ## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" ## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" ## [26] "UNIPROT"
keytypes
to list supported key types.keytypes(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" ## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME" ## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP" ## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" ## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" ## [26] "UNIPROT"
brca <- c("BRCA1", "BRCA2") select(org.Hs.eg.db, brca, c("GENENAME", "OMIM"), "SYMBOL")
## 'select()' returned 1:many mapping between keys and columns
## SYMBOL GENENAME OMIM ## 1 BRCA1 BRCA1 DNA repair associated 113705 ## 2 BRCA1 BRCA1 DNA repair associated 114480 ## 3 BRCA1 BRCA1 DNA repair associated 604370 ## 4 BRCA1 BRCA1 DNA repair associated 614320 ## 5 BRCA1 BRCA1 DNA repair associated 617883 ## 6 BRCA2 BRCA2 DNA repair associated 114480 ## 7 BRCA2 BRCA2 DNA repair associated 155255 ## 8 BRCA2 BRCA2 DNA repair associated 176807 ## 9 BRCA2 BRCA2 DNA repair associated 194070 ## 10 BRCA2 BRCA2 DNA repair associated 600185 ## 11 BRCA2 BRCA2 DNA repair associated 605724 ## 12 BRCA2 BRCA2 DNA repair associated 612555 ## 13 BRCA2 BRCA2 DNA repair associated 613029 ## 14 BRCA2 BRCA2 DNA repair associated 613347
mapIds
functionselect
(but for single annotations!).multiVals
allows to specify how to deal with 1:many mappings.mapIds(org.Hs.eg.db, brca, "OMIM", "SYMBOL", multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
## BRCA1 BRCA2 ## "113705" "114480"
multiVals
multiVals = "first"
(default): take the first element.multiVals = "asNA"
: NA
is returned for any 1:many mapping.multiVals = "list"
: multiple elements are collapsed into a list
.multiVals = "CharacterList"
: results are returned as a CharacterList
.mapIds(org.Hs.eg.db, brca, "OMIM", "SYMBOL", multiVals = "CharacterList")
## 'select()' returned 1:many mapping between keys and columns
## CharacterList of length 2 ## [["BRCA1"]] 113705 114480 604370 614320 617883 ## [["BRCA2"]] 114480 155255 176807 194070 600185 605724 612555 613029 613347
TxDb
and EnsDb
objectsTxDb
(GenomicFeatures package) and EnsDb
(ensembldb package) objects contain positional annotations.EnsDb
contain additional annotations such as gene symbols, Entrezgene IDs, GC content, gene/transcript biotypes and protein annotations.TxDb
and EnsDb
resources can be installed as packages (e.g. TxDb.Hsapiens.UCSC.hg19.knownGene, EnsDb.Hsapiens.v86).library(AnnotationHub) ah <- AnnotationHub() query(ah, c("EnsDb", "hsapiens", "102"))
## AnnotationHub with 1 record ## # snapshotDate(): 2022-05-13 ## # names(): AH89180 ## # $dataprovider: Ensembl ## # $species: Homo sapiens ## # $rdataclass: EnsDb ## # $rdatadateadded: 2020-10-27 ## # $title: Ensembl 102 EnsDb for Homo sapiens ## # $description: Gene and protein annotations for Homo sapiens based on Ensem... ## # $taxonomyid: 9606 ## # $genome: GRCh38 ## # $sourcetype: ensembl ## # $sourceurl: http://www.ensembl.org ## # $sourcesize: NA ## # $tags: c("102", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene", ## # "Protein", "Transcript") ## # retrieve record with 'object[["AH89180"]]'
edb <- ah[["AH89180"]] edb
## EnsDb for Ensembl: ## |Backend: SQLite ## |Db type: EnsDb ## |Type of Gene ID: Ensembl Gene ID ## |Supporting package: ensembldb ## |Db created by: ensembldb package from Bioconductor ## |script_version: 0.3.6 ## |Creation time: Sat Dec 19 21:24:06 2020 ## |ensembl_version: 102 ## |ensembl_host: localhost ## |Organism: Homo sapiens ## |taxonomy_id: 9606 ## |genome_build: GRCh38 ## |DBSCHEMAVERSION: 2.1 ## | No. of genes: 68001. ## | No. of transcripts: 254853. ## |Protein data available.
EnsDb
or TxDb
columns(edb)
## [1] "CANONICALTRANSCRIPT" "DESCRIPTION" "ENTREZID" ## [4] "EXONID" "EXONIDX" "EXONSEQEND" ## [7] "EXONSEQSTART" "GCCONTENT" "GENEBIOTYPE" ## [10] "GENEID" "GENEIDVERSION" "GENENAME" ## [13] "GENESEQEND" "GENESEQSTART" "INTERPROACCESSION" ## [16] "ISCIRCULAR" "PROTDOMEND" "PROTDOMSTART" ## [19] "PROTEINDOMAINID" "PROTEINDOMAINSOURCE" "PROTEINID" ## [22] "PROTEINSEQUENCE" "SEQCOORDSYSTEM" "SEQLENGTH" ## [25] "SEQNAME" "SEQSTRAND" "SYMBOL" ## [28] "TXBIOTYPE" "TXCDSSEQEND" "TXCDSSEQSTART" ## [31] "TXID" "TXIDVERSION" "TXNAME" ## [34] "TXSEQEND" "TXSEQSTART" "TXSUPPORTLEVEL" ## [37] "UNIPROTDB" "UNIPROTID" "UNIPROTMAPPINGTYPE"
select(edb, "ENSG00000139618", c("SYMBOL", "SEQNAME"), "GENEID")
## GENEID SYMBOL SEQNAME ## 1 ENSG00000139618 BRCA2 13
genes
, transcripts
,exons
to extract their genomic coordinates.GRanges
object (with additional metadata columns).genes(edb)
## GRanges object with 68001 ranges and 9 metadata columns: ## seqnames ranges strand | gene_id ## <Rle> <IRanges> <Rle> | <character> ## ENSG00000223972 1 11869-14409 + | ENSG00000223972 ## ENSG00000227232 1 14404-29570 - | ENSG00000227232 ## ... ... ... ... . ... ## ENSG00000231514 Y 26626520-26627159 - | ENSG00000231514 ## ENSG00000235857 Y 56855244-56855488 + | ENSG00000235857 ## gene_name gene_biotype seq_coord_system ## <character> <character> <character> ## ENSG00000223972 DDX11L1 transcribed_unproces.. chromosome ## ENSG00000227232 WASH7P unprocessed_pseudogene chromosome ## ... ... ... ... ## ENSG00000231514 CCNQP2 processed_pseudogene chromosome ## ENSG00000235857 CTBP2P1 processed_pseudogene chromosome ## description gene_id_version canonical_transcript ## <character> <character> <character> ## ENSG00000223972 DEAD/H-box helicase .. ENSG00000223972.5 ENST00000456328 ## ENSG00000227232 WASP family homolog .. ENSG00000227232.5 ENST00000488147 ## ... ... ... ... ## ENSG00000231514 CCNQ pseudogene 2 [S.. ENSG00000231514.1 ENST00000435741 ## ENSG00000235857 CTBP2 pseudogene 1 [.. ENSG00000235857.1 ENST00000431853 ## symbol entrezid ## <character> <list> ## ENSG00000223972 DDX11L1 84771,727856,100287102,... ## ENSG00000227232 WASH7P 653635 ## ... ... ... ## ENSG00000231514 CCNQP2 <NA> ## ENSG00000235857 CTBP2P1 <NA> ## ------- ## seqinfo: 455 sequences (1 circular) from GRCh38 genome
transcripts(edb, filter = ~ symbol == "BRCA2")
## GRanges object with 12 ranges and 10 metadata columns: ## seqnames ranges strand | tx_id ## <Rle> <IRanges> <Rle> | <character> ## ENST00000671466 13 32315086-32316527 + | ENST00000671466 ## ENST00000544455 13 32315480-32399668 + | ENST00000544455 ## ... ... ... ... . ... ## ENST00000666593 13 32380007-32394933 + | ENST00000666593 ## ENST00000533776 13 32396809-32398448 + | ENST00000533776 ## tx_biotype tx_cds_seq_start tx_cds_seq_end ## <character> <integer> <integer> ## ENST00000671466 protein_coding 32316461 32316527 ## ENST00000544455 protein_coding 32316461 32398770 ## ... ... ... ... ## ENST00000666593 nonsense_mediated_de.. 32380007 32383930 ## ENST00000533776 retained_intron <NA> <NA> ## gene_id tx_support_level tx_id_version gc_content ## <character> <integer> <character> <numeric> ## ENST00000671466 ENSG00000139618 <NA> ENST00000671466.1 39.7590 ## ENST00000544455 ENSG00000139618 1 ENST00000544455.5 36.2436 ## ... ... ... ... ... ## ENST00000666593 ENSG00000139618 <NA> ENST00000666593.1 39.5793 ## ENST00000533776 ENSG00000139618 3 ENST00000533776.1 39.9618 ## tx_name symbol ## <character> <character> ## ENST00000671466 ENST00000671466 BRCA2 ## ENST00000544455 ENST00000544455 BRCA2 ## ... ... ... ## ENST00000666593 ENST00000666593 BRCA2 ## ENST00000533776 ENST00000533776 BRCA2 ## ------- ## seqinfo: 1 sequence from GRCh38 genome
exs <- exonsBy(edb, by = "tx", filter = ~ seq_name == "Y") exs
## GRangesList object of length 816: ## $ENST00000155093 ## GRanges object with 8 ranges and 2 metadata columns: ## seqnames ranges strand | exon_id exon_rank ## <Rle> <IRanges> <Rle> | <character> <integer> ## [1] Y 2935381-2935769 + | ENSE00001648585 1 ## [2] Y 2953909-2953997 + | ENSE00003889480 2 ## ... ... ... ... . ... ... ## [7] Y 2977940-2978080 + | ENSE00003891660 7 ## [8] Y 2978810-2982506 + | ENSE00003895708 8 ## ------- ## seqinfo: 1 sequence from GRCh38 genome ## ## $ENST00000215473 ## GRanges object with 2 ranges and 2 metadata columns: ## seqnames ranges strand | exon_id exon_rank ## <Rle> <IRanges> <Rle> | <character> <integer> ## [1] Y 5056088-5057459 + | ENSE00001436852 1 ## [2] Y 5098215-5101437 + | ENSE00003741448 2 ## ------- ## seqinfo: 1 sequence from GRCh38 genome ## ## $ENST00000215479 ## GRanges object with 6 ranges and 2 metadata columns: ## seqnames ranges strand | exon_id exon_rank ## <Rle> <IRanges> <Rle> | <character> <integer> ## [1] Y 6873972-6874027 - | ENSE00001348274 1 ## [2] Y 6872555-6872620 - | ENSE00001671586 2 ## ... ... ... ... . ... ... ## [5] Y 6868037-6868462 - | ENSE00001667251 5 ## [6] Y 6865926-6866078 - | ENSE00003843155 6 ## ------- ## seqinfo: 1 sequence from GRCh38 genome ## ## ... ## <813 more elements>
query(ah, c("TwoBitFile", "GRCh38"))
## AnnotationHub with 115 records ## # snapshotDate(): 2022-05-13 ## # $dataprovider: Ensembl ## # $species: Homo sapiens, homo sapiens ## # $rdataclass: TwoBitFile ## # additional mcols(): taxonomyid, genome, description, ## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, ## # rdatapath, sourceurl, sourcetype ## # retrieve records with, e.g., 'object[["AH49722"]]' ## ## title ## AH49722 | Homo_sapiens.GRCh38.cdna.all.2bit ## AH49723 | Homo_sapiens.GRCh38.dna.primary_assembly.2bit ## ... ... ## AH103846 | Homo_sapiens.GRCh38.dna_sm.primary_assembly.2bit ## AH103847 | Homo_sapiens.GRCh38.ncrna.2bit
gn <- ah[["AH49723"]]
## loading from cache
extractTranscriptSeqs(gn, exs)
## DNAStringSet object of length 816: ## width seq names ## [1] 5336 AGTGCTCCAGAGAAAGGCCGTC...ATAAATGTTTACTTGTATATGA ENST00000155093 ## [2] 4595 CTGGTGGTCCAGTACCTCCAAA...TGAGCCCTTCAGAAGACATTCT ENST00000215473 ## ... ... ... ## [815] 615 GGAGGGAGTTGGGAGGAAAGCG...AATAAAAATATTGACATCACAA ENST00000670354 ## [816] 2156 AGTTTCAGAAGTTCGTGGGTCA...ACGAGAGAAGAAAAGCAGAAAA ENST00000670518
getSeq
takes the data object and a GRanges
defining which (sub)sequence to extract.rng <- GRanges("X:10000-15000") getSeq(gn, rng)
## DNAStringSet object of length 1: ## width seq ## [1] 5001 NCTAACCCTAACCCTAACCCTAACCCTAACCCTA...AGGGGAGAGCAAGAGTGTGCGGGCGAGGGAGGA
library(biomaRt) listMarts()
## biomart version ## 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 106 ## 2 ENSEMBL_MART_MOUSE Mouse strains 106 ## 3 ENSEMBL_MART_SNP Ensembl Variation 106 ## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 106
mart <- useMart("ENSEMBL_MART_ENSEMBL")
listDatasets(mart) |> head()
## dataset description ## 1 abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1) ## 2 acalliptera_gene_ensembl Eastern happy genes (fAstCal1.2) ## 3 acarolinensis_gene_ensembl Green anole genes (AnoCar2.0v2) ## 4 acchrysaetos_gene_ensembl Golden eagle genes (bAquChr1.2) ## 5 acitrinellus_gene_ensembl Midas cichlid genes (Midas_v5) ## 6 amelanoleuca_gene_ensembl Giant panda genes (ASM200744v2) ## version ## 1 ASM259213v1 ## 2 fAstCal1.2 ## 3 AnoCar2.0v2 ## 4 bAquChr1.2 ## 5 Midas_v5 ## 6 ASM200744v2
mart <- useMart("ENSEMBL_MART_ENSEMBL","hsapiens_gene_ensembl")
getBM(attributes, filters, values, mart)
.where
mart
object we set uplistAttributes
and listFilters
to get available annotations and supported filters.listAttributes(mart) |> head()
## name description page ## 1 ensembl_gene_id Gene stable ID feature_page ## 2 ensembl_gene_id_version Gene stable ID version feature_page ## 3 ensembl_transcript_id Transcript stable ID feature_page ## 4 ensembl_transcript_id_version Transcript stable ID version feature_page ## 5 ensembl_peptide_id Protein stable ID feature_page ## 6 ensembl_peptide_id_version Protein stable ID version feature_page
listFilters(mart) |> head()
## name description ## 1 chromosome_name Chromosome/scaffold name ## 2 start Start ## 3 end End ## 4 band_start Band Start ## 5 band_end Band End ## 6 marker_start Marker Start
sym <- c("BRCA1", "BRCA2") getBM(c("ensembl_gene_id", "hgnc_symbol", "chromosome_name"), "hgnc_symbol", sym, mart)
## ensembl_gene_id hgnc_symbol chromosome_name ## 1 ENSG00000012048 BRCA1 17 ## 2 ENSG00000139618 BRCA2 13
Notes
listEnsemblArchives()
would allow to select URLs for previous (Ensembl) releases.Thank you for your attention