---
title: "Assessing orthogroup inference"
author: 
  - name: Fabricio Almeida-Silva
    affiliation: VIB-UGent Center for Plant Systems Biology, Ghent, Belgium
  - name: Yves Van de Peer
    affiliation: VIB-UGent Center for Plant Systems Biology, Ghent, Belgium
output: 
  BiocStyle::html_document:
    toc: true
    number_sections: yes
bibliography: vignette_02.bib
vignette: >
  %\VignetteIndexEntry{Assessing orthogroup inference}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL,
    dpi = 100
)
```

# Introduction

The identification of groups of homologous genes within and across species is 
a powerful tool for evolutionary genomics. The most widely used tools to 
identify orthogroups (i.e., groups of orthologous genes) are OrthoFinder [@emms2019orthofinder] and OrthoMCL [@li2003orthomcl]. However, these tools 
generate different results depending on the parameters used, such 
as *mcl inflation parameter*, *E-value*, *maximum number of hits*, and others. 
Here, we propose a protein domain-aware assessment of orthogroup inference. 
The goal is to maximize the percentage of shared protein domains for genes in 
the same orthogroup.

# Installation

```{r installation, eval=FALSE}
if(!requireNamespace('BiocManager', quietly = TRUE))
  install.packages('BiocManager')
BiocManager::install("cogeqc")
```

```{r load_package, message=FALSE}
# Load package after installation
library(cogeqc)
```

# Data description

Here, we will use orthogroups from the PLAZA 5.0 database [@van2021plaza], 
inferred with OrthoFinder [@emms2019orthofinder]. For the purpose of 
demonstration, the complete dataset was filtered to only keep orthogroups for 
the Brassicaceae species *Arabidopsis thaliana* and *Brassica oleraceae*. 
Interpro domain annotations were also retrieved from PLAZA 5.0.

```{r data_description}
# Orthogroups for Arabidopsis thaliana and Brassica oleraceae
data(og)
head(og)

# Interpro domain annotations
data(interpro_ath)
data(interpro_bol)

head(interpro_ath)
head(interpro_bol)
```

If you infer orthogroups with OrthoFinder, you can read and parse the output 
file **Orthogroups.tsv** with the function `read_orthogroups()`. For example:

```{r read_orthogroups_demo}
# Path to the Orthogroups.tsv file created by OrthoFinder
og_file <- system.file("extdata", "Orthogroups.tsv.gz", package = "cogeqc") 

# Read and parse file
orthogroups <- read_orthogroups(og_file)
head(orthogroups)
```

# Assessing orthogroups

In `cogeqc`, you can assess orthogroup inference with either a protein 
domain-based approach or a reference-based approach. Both approaches are 
described below.

## Protein domain-based orthogroup assessment

The protein domain-based assessment of orthogroups is based on the
formula below:


$$
\begin{aligned}
Scores &= \frac{Homogeneity}{Dispersal}
\\
\end{aligned}
$$


The numerator, $homogeneity$, is the mean Sorensen-Dice index for all 
pairwise combinations of genes in an orthogroup. The Sorensen-Dice index
measures how similar two genes are, and it ranges from 0 to 1, with 0 meaning
that a gene pair does not share any protein domain, and 1 meaning that 
it shares all protein domains. In a formal definition:

$$
\begin{aligned}
Homogeneity &=  \frac{1}{N_{pairs}} \sum_{i=1}^{N_{pairs}} SDI_{i}
\\
\\
SDI(A,B) &= \frac{2 \left| A \cap B \right|}{ \left|A \right| + \left| B \right|}
\end{aligned}
$$

where A and B are the set of protein domains associated to genes A and B. 
This way, if all genes in an orthogroup have the same protein domains, it
will have $homogeneity = 1$. If each gene has a different protein domain,
the orthogroup will have $homogeneity = 0$. If only some gene pairs share 
the same domain, $homogeneity$ will be somewhere between 0 and 1. 


The denominator, $dispersal$, aims to correct for overclustering (i.e.,
orthogroup assignments that break "true" gene families into an artificially
large number of smaller subfamilies). It is the mean number of 
orthogroups containing the same protein domain corrected by 
the number of orthogroup. Formally:


$$
\begin{aligned}
Dispersal &=  \frac{1}{N_{domains} N_{OG}} \sum_{i=1}^{N_{domains}}D_{i} \\
\\
\end{aligned}
$$


where $N_{OG}$ is the number of orthogroups, and $D_{i}$ is the number of 
orthogroups containing the protein domain $i$. This term penalizes
orthogroup assignments where the same protein domains appears in multiple 
orthogroups. As orthogroups represent groups of genes that 
evolved from a common ancestor, a protein domain being present in multiple 
orthogroups indicates that this domain evolved multiple times in an independent
way, which is not reasonable from a phylogenetic point of view, despite
convergent evolution. 

To calculate scores for each orthogroup, you can use the 
function `assess_orthogroups()`. This function takes as input a list of 
annotation data frames[^1] and an orthogroups data frame, and returns the 
relative homogeneity scores of each orthogroup for each species. Note that
if you don't want to take the dispersal into account, you can set
`correct_overclustering = FALSE`. This will ignore the denominator of the
score formula.

[^1]: **NOTE:** The names of the list elements must match the species 
abbreviations in the column *Species* of the orthogroups data frame. 
For instance, if your orthogroups data frame contains the species **Ath** 
and **Bol**, the data frames in the annotation list must be named **Ath** and
**Bol** (not necessarily in that order, but with these exact names).


```{r assess_og}
# Create a list of annotation data frames
annotation <- list(Ath = interpro_ath, Bol = interpro_bol)
str(annotation) # This is what the list must look like

og_assessment <- assess_orthogroups(og, annotation)
head(og_assessment)
```

Now, we can calculate the mean score for this orthogroup inference.

```{r mean_h}
mean(og_assessment$Mean_score)
```

Ideally, to have a reliable orthogroup inference, you should be able to run 
OrthoFinder with multiple combinations of parameters and assess each inference
with `assess_orthogroups()`. The inference with the highest mean homonegeneity
will be the best.[^2]

[^2]: **Friendly tip:** if you want to calculate homogeneity scores using a 
single species as a proxy (your orthogroups data frame will have only one 
species), you can use the function `calculate_H()`.

## Reference-based orthogroup assessment

In some cases, you may want to compare your orthogroup inference to a 
reference orthogroup inference. To do that, you can use the function `compare_orthogroups()`. For example, let's simulate a different orthogroup 
inference by shuffling some rows of the `og` data frame and comparing it to 
the original data frame.

```{r ref-based_og_assessment}
set.seed(123)

# Subset the top 5000 rows for demonstration purposes
og_subset <- og[1:5000, ]
ref <- og_subset

# Shuffle 100 genes to simulate a test set
idx_shuffle <- sample(seq_len(nrow(og_subset)), 100, replace = FALSE)
test <- og_subset
test$Gene[idx_shuffle] <- sample(
  test$Gene[idx_shuffle], size = length(idx_shuffle), replace = FALSE
)

# Compare test set to reference set
comparison <- compare_orthogroups(ref, test)
head(comparison)

# Calculating percentage of preservation
preserved <- sum(comparison$Preserved) / length(comparison$Preserved)
preserved
```

As we can see, `r paste0(round(preserved * 100, 2), "%")` of the orthogroups 
in the reference data set are preserved in the shuffled data set.

# Visualizing summary statistics

Now that you have identified the best combination of parameters for your 
orthogroup inference, you can visually explore some of its summary statistics.
OrthoFinder automatically saves summary statistics in a directory named **Comparative_Genomics_Statistics**. You can parse this directory in a list 
of summary statistics with the function `read_orthofinder_stats()`.
To demonstrate it, let's read the output of OrthoFinder's example with 
model species.

```{r}
stats_dir <- system.file("extdata", package = "cogeqc")
ortho_stats <- read_orthofinder_stats(stats_dir)
ortho_stats
```

Now, we can use this list to visually explore summary statistics.

## Species tree

To start, one would usually want to look at the species tree to detect 
possible issues that would compromise the accuracy of orthologs detection. 
The tree file can be easily read with `treeio::read.tree()`.

```{r plot_species_tree}
data(tree)
plot_species_tree(tree)
```

You can also include the number of gene duplications in each node.

```{r plot_species_tree_with_dups}
plot_species_tree(tree, stats_list = ortho_stats)
```

## Species-specific duplications

The species tree above shows duplications per node, but it does not show
species-duplications. To visualize that, you can use the 
function `plot_duplications()`.

```{r}
plot_duplications(ortho_stats)
```


## Genes in orthogroups

Visualizing the percentage of genes in orthogroups is particularly useful 
for quality check, since one would usually expect a large percentage of
genes in orthogroups, unless there is a very distant species in OrthoFinder's 
input proteome data.

```{r plot_genes_in_ogs}
plot_genes_in_ogs(ortho_stats)
```

## Species-specific orthogroups

To visualize the number of species-specific orthogroups, use the function
`plot_species_specific_ogs()`. This plot can reveal a unique gene repertoire 
of a particular species if it has a large number of species-specific OGs as 
compared to the other ones.  

```{r ssOGs}
plot_species_specific_ogs(ortho_stats)
```

## All in one

To get a complete picture of OrthoFinder results, you can combine all plots 
together with `plot_orthofinder_stats()`, a wrapper that integrates all 
previously demonstrated plotting functions.

```{r plot_orthofinder_stats, fig.width = 12, fig.height = 7}
plot_orthofinder_stats(
  tree, 
  xlim = c(-0.1, 2),
  stats_list = ortho_stats
)
```

## Orthogroup overlap

You can also visualize a heatmap of pairwise orthogroup overlap across species 
with `plot_og_overlap()`. 

```{r plot_og_overlap}
plot_og_overlap(ortho_stats)
```

## Orthogroup size per species

If you want to take a look at the distribution of OG sizes for each species, 
you can use the function `plot_og_sizes`. If you have many extreme values and 
want to visualize the shape of the distribution in a better way, you can also 
log transform the OG sizes by setting `log = TRUE`.

```{r og_sizes}
plot_og_sizes(og) 
plot_og_sizes(og, log = TRUE) # natural logarithm scale
```

# Session information {.unnumbered}

This document was created under the following conditions:

```{r session_info}
sessionInfo()
```

# References {.unnumbered}