Sequence logo has been widely used as a graphical representation of nucleic acid motifs and conserved amino acid (AA) usage. We have developed a R/Bioconductor package dagLogo to facilitate the identification and visualization of significant differential usage of AAs between experimental sets and various background sets, with or without grouping based on AA properties.
dagLogo 1.38.0
fetchSequence
function in biomaRt package given a list of gene identifiers and the corresponding positions of the anchoring AA.fetchSequence
function in biomaRt package given a list of gene identifiers and the corresponding peptide subsequences of interest with the anchoring AA marked such as asterisks or lower case of one or more AA letters.dagPeptides
using prepareProteome
and formatSequence
functions sequentially given a list of unaligned/aligned ungapped peptide sequences.A sequence logo has been widely used as a graphical representation of nucleic acid sequence motifs. There is a R package seqlogo(Bembom 2006) for drawing sequence logos for a single DNA motif. There is another R package motifStack(Ou et al. 2018) for depicting individual sequence motif as well as multiple motifs for amino acid (AA), DNA and RNA sequences.
IceLogo(Colaert et al. 2009) is a tool developed in Java for visualizing significantly conserved sequence patterns in a list of aligned AA sequences against a set of background sequences. Compared to webLogo(Crooks et al. 2004), which relies on information theory, IceLogo builds on probability theory. It is reported that IceLogo has a more dynamic nature and is more appropriate for analysis of conserved sequence patterns.
However, IceLogo can only compare conserved sequences to reference sequences at the individual amino acid level. As we know, some conserved sequence patterns are not conserved at the individual amino acid level, but conserved at the level of amino acid group characteristic of their physical and chemical properties, such as charge and hydrophobicity.
Here we developed a R/Bioconductor package dagLogo, for visualizing significantly conserved sequence patterns relative to a proper background set of sequences, with or without grouping amino acid residuals based on their physical and chemical properties. Figure 1 shows the flowchart of performing analysis using dagLogo. Comparing to existing tools, dagLogo allows aligned or not aligned subsequences of different length as input; Provides more options and functions to generate various background sets that can be tailored to fit the experimental design; Both significantly over- and under-represented amino acid residues can be plotted; AA residues can be grouped and statistical significance test can be performed at the group level.
.
Figure 1. Flowchart of performing analysis using dagLogo. Two ways to prepare an object of Proteome
are colored in greenish and yellowish, while two alternative ways to build an object of dagPeptides
are colored in blue and red.
library(dagLogo)
fetchSequence
function in biomaRt package given a list of gene identifiers and the corresponding positions of the anchoring AA.##just in case biomaRt server does not response
if (interactive())
{
try({
mart <- useMart("ensembl")
fly_mart <-
useDataset(mart = mart, dataset = "dmelanogaster_gene_ensembl")
dat <- read.csv(system.file("extdata", "dagLogoTestData.csv",
package = "dagLogo"))
seq <- fetchSequence(IDs = as.character(dat$entrez_geneid),
anchorPos = as.character(dat$NCBI_site),
mart = fly_mart,
upstreamOffset = 7,
downstreamOffset = 7)
head(seq@peptides)
})
}
fetchSequence
function in biomaRt package given a list of gene identifiers and the corresponding peptide subsequences of interest with the anchoring AA marked such as asterisks or lower case of one or more AA letters.if (interactive())
{
try({
mart <- useMart("ensembl")
fly_mart <-
useDataset(mart = mart, dataset = "dmelanogaster_gene_ensembl")
dat <- read.csv(system.file("extdata", "dagLogoTestData.csv",
package = "dagLogo"))
seq <- fetchSequence(IDs = as.character(dat$entrez_geneid),
anchorAA = "*",
anchorPos = as.character(dat$peptide),
mart = fly_mart,
upstreamOffset = 7,
downstreamOffset = 7)
head(seq@peptides)
})
}
In the following example, the anchoring AA is marked as lower case “s” for amino acid serine.
if(interactive()){
try({
dat <- read.csv(system.file("extdata", "peptides4dagLogo.csv",
package = "dagLogo"))
## cleanup the data
dat <- unique(cleanPeptides(dat, anchors = "s"))
mart <- useMart("ensembl")
human_mart <-
useDataset(mart = mart, dataset = "hsapiens_gene_ensembl")
seq <- fetchSequence(IDs = toupper(as.character(dat$symbol)),
type = "hgnc_symbol",
anchorAA = "s",
anchorPos = as.character(dat$peptides),
mart = human_mart,
upstreamOffset = 7,
downstreamOffset = 7)
head(seq@peptides)
})
}
The function cleanPeptides
can be used to select a subset of data to analyze
when input data contains multiple anchoring AAs, represented as lower case of
AAs.
if(interactive()){
dat <- read.csv(system.file("extdata", "peptides4dagLogo.csv",
package="dagLogo"))
dat <- unique(cleanPeptides(dat, anchors = c("s", "t")))
mart <- useMart("ensembl", "hsapiens_gene_ensembl")
seq <- fetchSequence(toupper(as.character(dat$symbol)),
type="hgnc_symbol",
anchorAA=as.character(dat$anchor),
anchorPos=as.character(dat$peptides),
mart=mart,
upstreamOffset=7,
downstreamOffset=7)
head(seq@peptides)
}
Similarly, peptide sequences can be fetched from an object of Proteome.
dagPeptides
using prepareProteome
and formatSequence
functions sequentially given a list of unaligned/aligned ungapped peptide sequences.dat <- unlist(read.delim(system.file("extdata", "grB.txt", package = "dagLogo"),
header = F, as.is = TRUE))
##prepare proteome from a fasta file,
##the fastq file is subset of human proteome for this vignette.
proteome <- prepareProteome(fasta = system.file("extdata",
"HUMAN.fasta",
package = "dagLogo"),
species = "Homo sapiens")
##prepare an object of dagPeptides
seq <- formatSequence(seq = dat, proteome = proteome, upstreamOffset = 14,
downstreamOffset = 15)
Once you have an object of dagPeptides
in hand, you can start to build a background model for statistical significance test. The background could be a set of random subsequences of a whole proteome or your inputs. To build a background model from a whole proteome, an object of Proteome is required.
Sequences provided by a fasta file or downloaded from the UniProt database can be used to prepare a Proteome
object. Case 3 in step 1 shows how to prepare a Proteome
object from a fasta file. The following code snippet shows how to prepare an object of Proteome
using the UniProt.ws package.
if(interactive()){
proteome <- prepareProteome("UniProt", species = "Homo sapiens")
}
The prepared Proteome
object can be used as a background model for the following statistical significance test using Fisher’s exact test or Z-test.
bg_fisher <- buildBackgroundModel(seq, background = "wholeProteome",
proteome = proteome, testType = "fisher")
bg_ztest <- buildBackgroundModel(seq, background = "wholeProteome",
proteome = proteome, testType = "ztest")
Statistical significance test can be performed at the AA level without making any change to the formatted and aligned amino acids. Alternatively, statistical significance test can be performed at the AA group level, where amino acids are grouped based on their physical or chemical properties. To group the AAs, the formatted and aligned AA symbols are replaced by a new set of symbols representing their corresponding groups. For example, if AA charge is of your interest, then group symbols “P”, “N” and “U” are used to replace amino acids with positive charge, negative charge and no charge respectively. A few pre-built grouping schemes have been made available for you to specify as follows.
## no grouping
t0 <- testDAU(seq, dagBackground = bg_ztest)
## grouping based on chemical properties of AAs.
t1 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest,
groupingScheme = "chemistry_property_Mahler_group")
## grouping based on the charge of AAs.
t2 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest,
groupingScheme = "charge_group")
## grouping based on the consensus similarity.
t3 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest,
groupingScheme = "consensus_similarity_SF_group")
## grouping based on the hydrophobicity.
t4 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest,
groupingScheme = "hydrophobicity_KD")
## In addition, dagLogo allows users to use their own grouping
## Scheme. The following example shows how to supply a customized
## scheme. Please note that the user-supplied grouping is named
## as "custom_group" internally.
## Add a grouping scheme based on the level 3 of BLOSUM50
color = c(LVIMC = "#33FF00", AGSTP = "#CCFF00",
FYW = '#00FF66', EDNQKRH = "#FF0066")
symbol = c(LVIMC = "L", AGSTP = "A", FYW = "F", EDNQKRH = "E")
group = list(
LVIMC = c("L", "V", "I", "M", "C"),
AGSTP = c("A", "G", "S", "T", "P"),
FYW = c("F", "Y", "W"),
EDNQKRH = c("E", "D", "N", "Q", "K", "R", "H"))
addScheme(color = color, symbol = symbol, group = group)
t5 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest,
groupingScheme = "custom_group")
We can use a heatmap or logo to display the statistical significance test results.
##Plot a heatmap to show the results
dagHeatmap(t0)
## dagLogo showing differentially used AAs without grouping
dagLogo(t0)
## dagLogo showing AA grouped based on properties of individual a amino acid.
dagLogo(t1, groupingSymbol = getGroupingSymbol(t1@group), legend = TRUE)
## grouped on the basis of charge.
dagLogo(t2, groupingSymbol = getGroupingSymbol(t2@group), legend = TRUE)
## grouped on the basis of consensus similarity.
dagLogo(t3, groupingSymbol = getGroupingSymbol(t3@group), legend = TRUE)
## grouped on the basis of hydrophobicity.
dagLogo(t4, groupingSymbol = getGroupingSymbol(t4@group), legend = TRUE)
sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 grid stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] Biostrings_2.68.0 GenomeInfoDb_1.36.0 XVector_0.40.0
## [4] IRanges_2.34.0 S4Vectors_0.38.0 motifStack_1.44.0
## [7] UniProt.ws_2.40.0 RSQLite_2.3.1 BiocGenerics_0.46.0
## [10] biomaRt_2.56.0 dagLogo_1.38.0 BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.1.3 bitops_1.0-7
## [3] rlang_1.1.0 magrittr_2.0.3
## [5] ade4_1.7-22 matrixStats_0.63.0
## [7] compiler_4.3.0 reshape2_1.4.4
## [9] png_0.1-8 vctrs_0.6.2
## [11] stringr_1.5.0 pkgconfig_2.0.3
## [13] crayon_1.5.2 fastmap_1.1.1
## [15] magick_2.7.4 dbplyr_2.3.2
## [17] caTools_1.18.2 utf8_1.2.3
## [19] Rsamtools_2.16.0 rmarkdown_2.21
## [21] pracma_2.4.2 tzdb_0.3.0
## [23] DirichletMultinomial_1.42.0 bit_4.0.5
## [25] xfun_0.39 zlibbioc_1.46.0
## [27] cachem_1.0.7 CNEr_1.36.0
## [29] jsonlite_1.8.4 progress_1.2.2
## [31] blob_1.2.4 highr_0.10
## [33] DelayedArray_0.26.0 BiocParallel_1.34.0
## [35] jpeg_0.1-10 parallel_4.3.0
## [37] prettyunits_1.1.1 R6_2.5.1
## [39] bslib_0.4.2 stringi_1.7.12
## [41] RColorBrewer_1.1-3 rtracklayer_1.60.0
## [43] GenomicRanges_1.52.0 jquerylib_0.1.4
## [45] Rcpp_1.0.10 bookdown_0.33
## [47] SummarizedExperiment_1.30.0 knitr_1.42
## [49] base64enc_0.1-3 R.utils_2.12.2
## [51] readr_2.1.4 BiocBaseUtils_1.2.0
## [53] Matrix_1.5-4 tidyselect_1.2.0
## [55] yaml_2.3.7 codetools_0.2-19
## [57] curl_5.0.0 rjsoncons_1.0.0
## [59] plyr_1.8.8 lattice_0.21-8
## [61] tibble_3.2.1 Biobase_2.60.0
## [63] KEGGREST_1.40.0 evaluate_0.20
## [65] BiocFileCache_2.8.0 xml2_1.3.3
## [67] pillar_1.9.0 BiocManager_1.30.20
## [69] filelock_1.0.2 MatrixGenerics_1.12.0
## [71] generics_0.1.3 grImport2_0.2-0
## [73] RCurl_1.98-1.12 hms_1.1.3
## [75] ggplot2_3.4.2 munsell_0.5.0
## [77] scales_1.2.1 xtable_1.8-4
## [79] gtools_3.9.4 glue_1.6.2
## [81] pheatmap_1.0.12 seqLogo_1.66.0
## [83] tools_4.3.0 TFMPvalue_0.0.9
## [85] BiocIO_1.10.0 BSgenome_1.68.0
## [87] annotate_1.78.0 GenomicAlignments_1.36.0
## [89] XML_3.99-0.14 poweRlaw_0.70.6
## [91] TFBSTools_1.38.0 AnnotationDbi_1.62.0
## [93] colorspace_2.1-0 GenomeInfoDbData_1.2.10
## [95] restfulr_0.0.15 cli_3.6.1
## [97] rappdirs_0.3.3 fansi_1.0.4
## [99] dplyr_1.1.2 gtable_0.3.3
## [101] R.methodsS3_1.8.2 sass_0.4.5
## [103] digest_0.6.31 rjson_0.2.21
## [105] htmlwidgets_1.6.2 R.oo_1.25.0
## [107] memoise_2.0.1 htmltools_0.5.5
## [109] lifecycle_1.0.3 httr_1.4.5
## [111] GO.db_3.17.0 bit64_4.0.5
## [113] MASS_7.3-59
Bembom, Oliver. 2006. “SeqLogo: Sequence Logos for Dna Sequence Alignments.” R Package Version 1.5.4.
Colaert, Niklaas, Kenny Helsens, Lennart Martens, Joel Vandekerckhove, and Kris Gevaert. 2009. “Improved Visualization of Protein Consensus Sequences by iceLogo.” Nature Methods 6 (11): 786–87.
Crooks, Gavin E., Gary Hon, John-Marc Chandonia, and Steven E. Brenner. 2004. “WebLogo: A Sequence Logo Generator.” Genome Research 14: 1188–90.
Ou, Jianhong, Scot A Wolfe, Michael H Brodsky, and Lihua Julie Zhu. 2018. “MotifStack for the Analysis of Transcription Factor Binding Site Evolution.” Nature Methods 15 (1): 8.