--- title: "Identification and classification of duplicated genes" author: - name: Fabricio Almeida-Silva affiliation: | VIB-UGent Center for Plant Systems Biology, Ghent University, Ghent, Belgium - name: Yves Van de Peer affiliation: | VIB-UGent Center for Plant Systems Biology, Ghent University, Ghent, Belgium output: BiocStyle::html_document: toc: true number_sections: yes bibliography: bibliography.bib vignette: > %\VignetteIndexEntry{Identification and classification of duplicated genes} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ) ``` # Introduction Gene and genome duplications are a source of raw genetic material for evolution [@ohno2013evolution]. However, whole-genome duplications (WGD) and small-scale duplications (SSD) contribute to genome evolution in different manners. To help you explore the different contributions of WGD and SSD to evolution, we developed `r BiocStyle::Githubpkg("doubletrouble")`, a package that can be used to identify and classify duplicated genes from whole-genome protein sequences, calculate substitution rates per substitution site (i.e., $K_a$ and $K_s$) for gene pairs, find peaks in $K_s$ distributions, and classify gene pairs by age groups. # Installation You can install `r BiocStyle::Githubpkg("doubletrouble")` from Bioconductor with the following code: ```{r installation, eval = FALSE} if(!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("doubletrouble") ## Check that you have a valid Bioconductor installation BiocManager::valid() ``` Then, you can load the package: ```{r load_package} library(doubletrouble) ``` # Data description In this vignette, we will use protein sequences (primary transcripts only) and genome annotation for the yeast species *Saccharomyces cerevisiae* and *Candida glabrata*. Data were obtained from Ensembl Fungi release 54 [@yates2022ensembl]. The example data sets are stored in the following objects: - **yeast_seq:** A list of `AAStringSet` objects with elements named *Scerevisiae* and *Cglabrata*. ```{r data1} # Load list of DIAMOND tabular output data(yeast_seq) head(yeast_seq) ``` - **yeast_annot:** A `GRangesList` object with elements named *Scerevisiae* and *Cglabrata*. ```{r ex_data} # Load annotation list processed with syntenet::process_input() data(yeast_annot) head(yeast_annot) ``` **IMPORTANT:** If you have protein sequences as FASTA files in a directory, you can read them into a list of `AAStringSet` objects with the function `fasta2AAStringSetlist()` from the Bioconductor package `r BiocStyle::Biocpkg("syntenet")`. Likewise, you can get a `GRangesList` object from GFF/GTF files with the function `gff2GRangesList()`, also from `r BiocStyle::Biocpkg("syntenet")`. Our goal here is to identify and classify duplicated genes in the *S. cerevisiae* genome. The *C. glabrata* genome will be used as an outgroup to identify transposed duplicates later in this vignette. # Data preparation First of all, we need to process the list of protein sequences and gene ranges to detect synteny with `r BiocStyle::Biocpkg("syntenet")`. We will do that using the function `process_input()` from the `r BiocStyle::Biocpkg("syntenet")` package. ```{r process_input} library(syntenet) # Process input data pdata <- process_input(yeast_seq, yeast_annot) # Inspect the output names(pdata) pdata$seq pdata$annotation ``` The processed data is represented as a list with the elements `seq` and `annotation`, each containing the protein sequences and gene ranges for each species, respectively. Finally, we need to perform pairwise sequence similarity searches to identify the whole set of paralogous gene pairs. We can do this using the function `run_diamond()` from the `r BiocStyle::Biocpkg("syntenet")` package [^1], setting `compare = "intraspecies"` to perform only intraspecies comparisons. [^1]: **Note:** you need to have DIAMOND installed in your machine to run this function. If you don't have it, you can use the `r BiocStyle::Biocpkg("Herper")` package to install DIAMOND in a Conda environment and run DIAMOND from this virtual environment. ```{r run_diamond_intraspecies} data(diamond_intra) # Run DIAMOND in sensitive mode for S. cerevisiae only if(diamond_is_installed()) { diamond_intra <- run_diamond( seq = pdata$seq["Scerevisiae"], compare = "intraspecies", outdir = file.path(tempdir(), "diamond_intra"), ... = "--sensitive" ) } # Inspect output names(diamond_intra) head(diamond_intra$Scerevisiae_Scerevisiae) ``` And voilĂ ! Now that we have the DIAMOND output and the processed annotation, you can classify the duplicated genes. # Classifying duplicated gene pairs and genes To classify duplicated gene pairs based on their modes of duplication, you will use the function `classify_gene_pairs()`. This function offers four different classification schemes, depending on how much detail you want. The classification schemes, along with the duplication modes they identify and their required input, are summarized in the table below: | Scheme | Duplication modes | Required input | |:---------|:----------------------------|:---------------------------------------------------| | binary | SD, SSD | `blast_list`, `annotation` | | standard | SD, TD, PD, DD | `blast_list`, `annotation` | | extended | SD, TD, PD, TRD, DD | `blast_list`, `annotation`, `blast_inter` | | full | SD, TD, PD, rTRD, dTRD, DD | `blast_list`, `annotation`, `blast_inter`, `intron_counts` | **Legend:** SD, segmental duplication. SSD, small-scale duplication. TD, tandem duplication. PD, proximal duplication. TRD, transposon-derived duplication. rTRD, retrotransposon-derived duplication. dTRD, DNA transposon-derived duplication. DD, dispersed duplication. As shown in the table, the minimal input objects are: - **blast_list**: A list of data frames with DIAMOND (or BLASTp, etc.) tabular output for intraspecies comparisons as returned by `syntenet::run_diamond(..., compare = 'intraspecies')`. - **annotation**: The processed annotation list (a `GRangesList` object) as returned by `syntenet::process_input()`. However, if you also want to identify transposon-derived duplicates (TRD) and further classify them as retrotransposon-derived duplicates (rTRD) or DNA transposon-derived duplicates (dTRD), you will need the following objects: - **blast_list**: A list of data frames with DIAMOND (or BLASTp, etc.) tabular output for interspecies comparisons (target species vs an outgroup) as returned by `syntenet::run_diamond(..., compare = )`. - **intron_counts**: A list of data frames with the number of introns per gene for each species, as returned by `get_intron_counts()`. Below, we demonstrate each classification scheme with examples. ## The *binary* scheme (SD vs SSD) The binary scheme classifies duplicates as derived from either segmental duplications (SD) or small-scale duplications (SSD). To identify segmental duplicates, the function `classify_gene_pairs()` performs intragenome synteny detection scans with `r BiocStyle::Biocpkg("syntenet")` and classifies any detected anchor pairs as segmental duplicates. The remaining pairs are classified as originating from small-scale duplications. This scheme can be used by specifying `scheme = "binary"` in the function `classify_gene_pairs()`. ```{r binary_classification} # Binary scheme c_binary <- classify_gene_pairs( annotation = pdata$annotation, blast_list = diamond_intra, scheme = "binary" ) # Inspecting the output names(c_binary) head(c_binary$Scerevisiae) # How many pairs are there for each duplication mode? table(c_binary$Scerevisiae$type) ``` The function returns a list of data frames, each containing the duplicated gene pairs and their modes of duplication for each species (here, because we have only one species, this is a list of length 1). ## The *standard* scheme (SSD → TD, PD, DD) Gene pairs derived from small-scale duplications can be further classified as originating from tandem duplications (TD, genes are adjacent to each other), proximal duplications (PD, genes are separated by only a few genes), or dispersed duplications (DD, duplicates that do not fit in any of the previous categories). This is the default classification scheme in `classify_gene_pairs()`, and it can be specified by setting `scheme = "standard"`. ```{r expanded_classification} # Standard scheme c_standard <- classify_gene_pairs( annotation = pdata$annotation, blast_list = diamond_intra, scheme = "standard" ) # Inspecting the output names(c_standard) head(c_standard$Scerevisiae) # How many pairs are there for each duplication mode? table(c_standard$Scerevisiae$type) ``` ## The *extended* scheme (SSD → TD, PD, TRD, DD) To find transposon-derived duplicates (TRD), the function `classify_gene_pairs()` detects syntenic regions between a target species and an outgroup species. Genes in the target species that are in syntenic regions with the outgroup are treated as *ancestral loci*. Then, if only one gene of the duplicate pair is an ancestral locus, this duplicate pair is classified as originating from transposon-derived duplications. Since finding transposon-derived duplicates requires comparing a target species with an outgroup species, you will first need to perform a similarity search of your target species against an outgroup. You can do this with `syntenet::run_diamond()`. For the parameter `compare`, you will pass a 2-column data frame specifying the comparisons to be made. [^3] [^3]: **Pro tip:** If you want to identify and classify duplicated genes for multiple species in batch, you must include the outgroup for each of them in the comparisons data frame. Here, we will identify duplicated gene pairs for *Saccharomyces cerevisiae* using *Candida glabrata* as an outgroup. ```{r blast_interspecies} data(diamond_inter) # load pre-computed output in case DIAMOND is not installed # Create data frame of comparisons to be made comparisons <- data.frame( species = "Scerevisiae", outgroup = "Cglabrata" ) comparisons # Run DIAMOND for the comparison we specified if(diamond_is_installed()) { diamond_inter <- run_diamond( seq = pdata$seq, compare = comparisons, outdir = file.path(tempdir(), "diamond_inter"), ... = "--sensitive" ) } names(diamond_inter) head(diamond_inter$Scerevisiae_Cglabrata) ``` Now, we will pass this interspecies DIAMOND output as an argument to the parameter `blast_inter` of `classify_gene_pairs()`. ```{r full_classification} # Extended scheme c_extended <- classify_gene_pairs( annotation = pdata$annotation, blast_list = diamond_intra, scheme = "extended", blast_inter = diamond_inter ) # Inspecting the output names(c_extended) head(c_extended$Scerevisiae) # How many pairs are there for each duplication mode? table(c_extended$Scerevisiae$type) ``` ## The *full* scheme (SSD → TD, PD, rTRD, dTRD, DD) Finally, the full scheme consists in classifying transposon-derived duplicates (TRD) further as originating from retrotransposons (rTRD) or DNA transposons (dTRD). To do that, the function `classify_gene_pairs()` uses the number of introns per gene to find TRD pairs for which one gene has at least 1 intron, and the other gene has no introns; if that is the case, the pair is classified as originating from the activity of retrotransposons (rTRD, i.e., the transposed gene without introns is a processed transcript that was retrotransposed back to the genome). All the other TRD pairs are classified as DNA transposon-derived duplicates (dTRD). To classify duplicates using this scheme, you will first need to create a list of data frames with the number of introns per gene for each species. This can be done with the function `get_intron_counts()`, which takes a `TxDb` object as input. `TxDb` objects store transcript annotations, and they can be created with a family of functions named `makeTxDbFrom*` from the `r BiocStyle::Biocpkg("GenomicFeatures")` package (see `?get_intron_counts()` for a summary of all functions). Here, we will create a list of `TxDb` objects from a list of `GRanges` objects using the function `MakeTxDbFromGRanges` from `r BiocStyle::Biocpkg("GenomicFeatures")`. Importantly, to create a `TxDb` from a `GRanges`, the `GRanges` object must contain genomic coordinates for all features, including transcripts, exons, etc. Because of that, we will use annotation from the example data set `yeast_annot`, which was not processed with `syntenet::process_input()`. ```{r} # Create a list of `TxDb` objects from a list of `GRanges` objects txdb_list <- lapply(yeast_annot, GenomicFeatures::makeTxDbFromGRanges) txdb_list ``` Once we have the `TxDb` objects, we can get intron counts per gene with `get_intron_counts()`. ```{r} # Get a list of data frames with intron counts per gene for each species intron_counts <- lapply(txdb_list, get_intron_counts) # Inspecting the list names(intron_counts) head(intron_counts$Scerevisiae) ``` Finally, we can use this list to classify duplicates using the full scheme as follows: ```{r} # Full scheme c_full <- classify_gene_pairs( annotation = pdata$annotation, blast_list = diamond_intra, scheme = "full", blast_inter = diamond_inter, intron_counts = intron_counts ) # Inspecting the output names(c_full) head(c_full$Scerevisiae) # How many pairs are there for each duplication mode? table(c_full$Scerevisiae$type) ``` # Classifying genes into unique modes of duplication If you look carefully at the output of `classify_gene_pairs()`, you will notice that some genes appear in more than one duplicate pair, and these pairs can have different duplication modes assigned. There's nothing wrong with it. Consider, for example, a gene that was originated by a segmental duplication some 60 million years ago, and then it underwent a tandem duplication 5 million years ago. In the output of `classify_gene_pairs()`, you'd see this gene in two pairs, one with **SD** in the `type` column, and one with **TD**. If you want to assign each gene to a unique mode of duplication, you can use the function `classify_genes()`. This function assigns duplication modes hierarchically using factor levels in column `type` as the priority order. The priority orders for each classification scheme are: 1. **Binary:** SD > SSD. 2. **Standard:** SD > TD > PD > DD. 3. **Extended:** SD > TD > PD > TRD > DD. 4. **Full:** SD > TD > PD > rTRD > dTRD > DD. The input for `classify_genes()` is the list of gene pairs returned by `classify_gene_pairs()`. ```{r classify_genes} # Classify genes into unique modes of duplication c_genes <- classify_genes(c_full) # Inspecting the output names(c_genes) head(c_genes$Scerevisiae) # Number of genes per mode table(c_genes$Scerevisiae$type) ``` # Calculating substitution rates for duplicated gene pairs You can use the function `pairs2kaks()` to calculate rates of nonsynonymous substitutions per nonsynonymous site ($K_a$), synonymouys substitutions per synonymous site ($K_s$), and their ratios ($K_a/K_s$). These rates are calculated using the Bioconductor package `r BiocStyle::Biocpkg("MSA2dist")`, which implements all codon models in KaKs_Calculator 2.0 [@wang2010kaks_calculator]. For the purpose of demonstration, we will only calculate $K_a$, $K_s$, and $K_a/K_s$ for 5 TD-derived gene pairs. The CDS for TD-derived genes were obtained from Ensembl Fungi [@yates2022ensembl], and they are stored in `cds_scerevisiae`. ```{r kaks_calculation} data(cds_scerevisiae) head(cds_scerevisiae) # Store DNAStringSet object in a list cds_list <- list(Scerevisiae = cds_scerevisiae) # Keep only top five TD-derived gene pairs for demonstration purposes td_pairs <- c_full$Scerevisiae[c_full$Scerevisiae$type == "TD", ] gene_pairs <- list(Scerevisiae = td_pairs[seq(1, 5, by = 1), ]) # Calculate Ka, Ks, and Ka/Ks kaks <- pairs2kaks(gene_pairs, cds_list) # Inspect the output head(kaks) ``` # Identifying and visualizing $K_s$ peaks Peaks in $K_s$ distributions typically indicate whole-genome duplication (WGD) events, and they can be identified by fitting Gaussian mixture models (GMMs) to $K_s$ distributions. In `r BiocStyle::Githubpkg("doubletrouble")`, this can be performed with the function `find_ks_peaks()`. However, because of saturation at higher $K_s$ values, only **recent WGD** events can be reliably identified from $K_s$ distributions [@vanneste2013inference]. Recent WGD events are commonly found in plant species, such as maize, soybean, apple, etc. Although the genomes of yeast species have signatures of WGD, these events are ancient, so it is very hard to find evidence for them using $K_s$ distributions. [^4] [^4]: **Tip:** You might be asking yourself: "How does one identify ancient WGD, then?". A common approach is to look for syntenic blocks (i.e., regions with conserved gene content and order) within genomes. This is what `classify_gene_pairs()` does under the hood to find SD-derived gene pairs. To demonstrate how you can find peaks in $K_s$ distributions with `find_ks_peaks()`, we will use a data frame containing $K_s$ values for duplicate pairs in the soybean (*Glycine max*) genome, which has undergone 2 WGDs events ~13 and ~58 million years ago [@schmutz2010genome]. Then, we will visualize $K_s$ distributions with peaks using the function `plot_ks_peaks()`. First of all, let's look at the data and have a quick look at the distribution with the function `plot_ks_distro()` (more details on this function in the data visualization section). ```{r ks_eda} # Load data and inspect it data(gmax_ks) head(gmax_ks) # Plot distribution plot_ks_distro(gmax_ks) ``` By visual inspection, we can see 2 or 3 peaks. Based on our prior knowledge, we know that 2 WGD events have occurred in the ancestral of the *Glycine* genus and in the ancestral of all Fabaceae, which seem to correspond to the peaks we see at $K_s$ values around 0.1 and 0.5, respectively. There could be a third, flattened peak at around 1.6, which would represent the WGD shared by all eudicots. Let's test which number of peaks has more support: 2 or 3. ```{r find_ks_peaks} # Find 2 and 3 peaks and test which one has more support peaks <- find_ks_peaks(gmax_ks$Ks, npeaks = c(2, 3), verbose = TRUE) names(peaks) str(peaks) # Visualize Ks distribution plot_ks_peaks(peaks) ``` As we can see, the presence of 3 peaks is more supported (lowest BIC). The function returns a list with the mean, variance and amplitude of mixture components (i.e., peaks), as well as the $K_s$ distribution itself. Now, suppose you just want to get the first 2 peaks. You can do that by explictly saying to `find_ks_peaks()` how many peaks there are. ```{r find_peaks_explicit} # Find 2 peaks ignoring Ks values > 1 peaks <- find_ks_peaks(gmax_ks$Ks, npeaks = 2, max_ks = 1) plot_ks_peaks(peaks) ``` **Important consideration on GMMs and $K_s$ distributions:** Peaks identified with GMMs should not be blindly regarded as "the truth". Using GMMs to find peaks in $K_s$ distributions can lead to problems such as overfitting and overclustering [@tiley2018assessing]. Some general recommendations are: 1. Use your prior knowledge. If you know how many peaks there are (e.g., based on literature evidence), just tell the number to `find_ks_peaks()`. Likewise, if you are not sure about how many peaks there are, but you know the maximum number of peaks is N, don't test for the presence of >N peaks. GMMs can incorrectly identify more peaks than the actual number. 2. Test the significance of each peak with SiZer (Significant ZERo crossings of derivatives) maps [@chaudhuri1999sizer]. This can be done with the function `SiZer()` from the R package `r BiocStyle::CRANpkg("feature")`. As an example of a SiZer map, let's use `feature::SiZer()` to assess the significance of the 2 peaks we found previously. ```{r sizer} # Get numeric vector of Ks values <= 1 ks <- gmax_ks$Ks[gmax_ks$Ks <= 1] # Get SiZer map feature::SiZer(ks) ``` The blue regions in the SiZer map indicate significantly increasing regions of the curve, which support the 2 peaks we found. # Classifying genes by age groups Finally, you can use the peaks you obtained before to classify gene pairs by age group. Age groups are defined based on the $K_s$ peak to which pairs belong. This is useful if you want to analyze duplicate pairs from a specific WGD event, for instance. You can do this with the function `split_pairs_by_peak()`. This function returns a list containing the classified pairs in a data frame, and a ggplot object with the age boundaries highlighted in the histogram of $K_s$ values. ```{r split_by_peak} # Gene pairs without age-based classification head(gmax_ks) # Classify gene pairs by age group pairs_age_group <- split_pairs_by_peak(gmax_ks, peaks) # Inspecting the output names(pairs_age_group) # Take a look at the classified gene pairs head(pairs_age_group$pairs) # Visualize Ks distro with age boundaries pairs_age_group$plot ``` # Data visualization Last but not least, `r BiocStyle::Biocpkg("doubletrouble")` provides users with graphical functions to produce publication-ready plots from the output of `classify_gene_pairs()`, `classify_genes()`, and `pairs2kaks()`. Let's take a look at them one by one. ## Visualizing the frequency of duplicates per mode To visualize the frequency of duplicated gene pairs or genes by duplication type (as returned by `classify_gene_pairs()` and `classify_genes()`, respectively), you will first need to create a data frame of counts with `duplicates2counts()`. To demonstrate how this works, we will use an example data set with duplicate pairs for 3 fungi species (and substitution rates, which will be ignored by `duplicates2counts()`). ```{r} # Load data set with pre-computed duplicates for 3 fungi species data(fungi_kaks) names(fungi_kaks) head(fungi_kaks$saccharomyces_cerevisiae) # Get a data frame of counts per mode in all species counts_table <- duplicates2counts(fungi_kaks |> classify_genes()) counts_table ``` Now, let's visualize the frequency of duplicate gene pairs by duplication type with the function `plot_duplicate_freqs()`. You can visualize frequencies in three different ways, as demonstrated below. ```{r} # A) Facets p1 <- plot_duplicate_freqs(counts_table) # B) Stacked barplot, absolute frequencies p2 <- plot_duplicate_freqs(counts_table, plot_type = "stack") # C) Stacked barplot, relative frequencies p3 <- plot_duplicate_freqs(counts_table, plot_type = "stack_percent") # Combine plots, one per row patchwork::wrap_plots(p1, p2, p3, nrow = 3) + patchwork::plot_annotation(tag_levels = "A") ``` If you want to visually the frequency of duplicated **genes** (not gene pairs), you'd first need to classify genes into unique modes of duplication with `classify_genes()`, and then repeat the code above. For example: ```{r fig.height = 3, fig.width = 8} # Frequency of duplicated genes by mode classify_genes(fungi_kaks) |> # classify genes into unique duplication types duplicates2counts() |> # get a data frame of counts (long format) plot_duplicate_freqs() # plot frequencies ``` ## Visualizing $K_s$ distributions As briefly demonstrated before, to plot a $K_s$ distribution for the whole paranome, you will use the function `plot_ks_distro()`. ```{r fig.height=3, fig.width=9} ks_df <- fungi_kaks$saccharomyces_cerevisiae # A) Histogram, whole paranome p1 <- plot_ks_distro(ks_df, plot_type = "histogram") # B) Density, whole paranome p2 <- plot_ks_distro(ks_df, plot_type = "density") # C) Histogram with density lines, whole paranome p3 <- plot_ks_distro(ks_df, plot_type = "density_histogram") # Combine plots side by side patchwork::wrap_plots(p1, p2, p3, nrow = 1) + patchwork::plot_annotation(tag_levels = "A") ``` However, visualizing the distribution for the whole paranome can mask patterns that only happen for duplicates originating from particular duplication types. For instance, when looking for evidence of WGD events, visualizing the $K_s$ distribution for SD-derived pairs only can reveal whether syntenic genes cluster together, suggesting the presence of WGD history. To visualize the distribution by duplication type, use `bytype = TRUE` in `plot_ks_distro()`. ```{r fig.width = 8, fig.height=4} # A) Duplicates by type, histogram p1 <- plot_ks_distro(ks_df, bytype = TRUE, plot_type = "histogram") # B) Duplicates by type, violin p2 <- plot_ks_distro(ks_df, bytype = TRUE, plot_type = "violin") # Combine plots side by side patchwork::wrap_plots(p1, p2) + patchwork::plot_annotation(tag_levels = "A") ``` ## Visualizing substitution rates by species The function `plot_rates_by_species()` can be used to show distributions of substitution rates ($K_s$, $K_a$, or their ratio $K_a/K_s$) by species. You can choose which rate you want to visualize, and whether or not to group gene pairs by duplication mode, as demonstrated below. ```{r fig.width = 6, fig.height = 4} # A) Ks for each species p1 <- plot_rates_by_species(fungi_kaks) # B) Ka/Ks by duplication type for each species p2 <- plot_rates_by_species(fungi_kaks, rate_column = "Ka_Ks", bytype = TRUE) # Combine plots - one per row patchwork::wrap_plots(p1, p2, nrow = 2) + patchwork::plot_annotation(tag_levels = "A") ``` # Session information {.unnumbered} This document was created under the following conditions: ```{r session_info} sessioninfo::session_info() ``` # References {.unnumbered}