---
title: "orthogene: Getting Started"
author: "
Author: Brian M. Schilder
"
date: "Most recent update: `r format( Sys.Date(), '%b-%d-%Y')`
"
output:
BiocStyle::html_document:
vignette: >
%\VignetteIndexEntry{orthogene}
%\usepackage[utf8]{inputenc}
%\VignetteEngine{knitr::rmarkdown}
---
## `orthogene`: Interspecies gene mapping
`orthogene` is an R package for easy mapping of orthologous genes
across hundreds of species.
It pulls up-to-date interspecies gene ortholog mappings across 700+ organisms.
It also provides various utility functions to map common objects
(e.g. data.frames, gene expression matrices, lists)
onto 1:1 gene orthologs from any other species.
In brief, `orthogene` lets you easily:
- [**`convert_orthologs`** between any two species](https://neurogenomics.github.io/orthogene/articles/orthogene#convert-orthologs)
- [**`map_species`** names onto standard taxonomic ontologies](https://neurogenomics.github.io/orthogene/articles/orthogene#map-species)
- [**`report_orthologs`** between any two species](https://neurogenomics.github.io/orthogene/articles/orthogene#report-orthologs)
- [**`map_genes`** onto standard ontologies](https://neurogenomics.github.io/orthogene/articles/orthogene#map-genes)
- [**`aggregate_mapped_genes`** in a matrix.](https://neurogenomics.github.io/orthogene/articles/orthogene#aggregate-mapped-genes)
- [get **`all_genes`** from any species](https://neurogenomics.github.io/orthogene/articles/orthogene#get-all-genes)
# Installation
```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# orthogene is only available on Bioconductor>=3.14
if(BiocManager::version()<"3.14") BiocManager::install(version = "3.14")
BiocManager::install("orthogene")
```
```{r setup}
library(orthogene)
data("exp_mouse")
# Setting to "homologene" for the purposes of quick demonstration.
# We generally recommend using method="gprofiler" (default).
method <- "homologene"
```
# Examples
## Convert orthologs
[`convert_orthologs`](https://neurogenomics.github.io/orthogene/reference/convert_orthologs.html) is very flexible with what users can supply as `gene_df`,
and can take a `data.frame`/`data.table`/`tibble`, (sparse) `matrix`,
or `list`/`vector` containing genes.
Genes, transcripts, proteins, SNPs, or genomic ranges will be recognised
in most formats (HGNC, Ensembl, RefSeq, UniProt, etc.)
and can even be a mixture of different formats.
All genes will be mapped to gene symbols, unless specified otherwise with the
`...` arguments (see `?orthogene::convert_orthologs` or [here](https://neurogenomics.github.io/orthogene/reference/convert_orthologs.html) for details).
### Note on non-1:1 orthologs
A key feature of [`convert_orthologs`](https://neurogenomics.github.io/orthogene/reference/convert_orthologs.html) is that it handles the issue of genes with many-to-many
mappings across species. This can occur due to evolutionary divergence,
and the function of these genes tends to be less conserved and less translatable.
Users can address this using different strategies via `non121_strategy=`:
1. `"drop_both_species"` : Drop genes that have duplicate mappings in either the input_species or output_species, (*DEFAULT*).
2. `"drop_input_species"` : Only drop genes that have duplicate mappings in `input_species`.
3. `"drop_output_species"` : Only drop genes that have duplicate mappings in the `output_species`.
4. `"keep_both_species"` : Keep all genes regardless of whether they have duplicate mappings in either species.
5. `"keep_popular"` : Return only the most "popular" interspecies ortholog mappings. This procedure tends to yield a greater number of returned genes but at the cost of many of them not being true biological 1:1 orthologs.
When `gene_df` is a matrix. These strategies can be used together with `agg_fun`. This feature automatically performs both ortholog aggregation (many:1 mappings) and expansion (1:many mappings) of matrices, depending on the situation. This means that you have the option to keep non-1:1 ortholog genes, and still produce a matrix with only 1 gene per row.
Options include:
1. `"sum"`
2. `"mean"`
3. `"median"`
4. `"min"`
5. `"max"`
For more information on how `orthogene` performs matrix aggregation/expansion, see the documentation for the underlying function: `?orthogene:::many2many_rows`
```{r convert_orthologs}
gene_df <- orthogene::convert_orthologs(gene_df = exp_mouse,
gene_input = "rownames",
gene_output = "rownames",
input_species = "mouse",
output_species = "human",
non121_strategy = "drop_both_species",
method = method)
knitr::kable(as.matrix(head(gene_df)))
```
## Map species
[`map_species`](https://neurogenomics.github.io/orthogene/reference/map_species.html) lets you standardise species names from a wide variety of identifiers
(e.g. common name, taxonomy ID, Latin name, partial match).
All exposed `orthogene` functions (including [`convert_orthologs`](https://neurogenomics.github.io/orthogene/reference/convert_orthologs.html))
use `map_species` under the hood, so you don't have to worry about
getting species names exactly right.
You can check the full list of available species by simply running
`map_species()` with no arguments,
or checking [here](https://biit.cs.ut.ee/gprofiler/page/organism-list).
```{r map_species}
species <- orthogene::map_species(species = c("human",9544,"mus musculus",
"fruit fly","Celegans"),
output_format = "scientific_name")
print(species)
```
## Report orthologs
It may be helpful to know the maximum expected number of orthologous
gene mappings from one species to another.
[`ortholog_report`](https://neurogenomics.github.io/orthogene/reference/report_orthologs.html) generates a report that tells you this information
genome-wide.
```{r report_orthologs}
orth_zeb <- orthogene::report_orthologs(target_species = "zebrafish",
reference_species = "human",
method_all_genes = method,
method_convert_orthologs = method)
knitr::kable(head(orth_zeb$map))
knitr::kable(orth_zeb$report)
```
## Map genes
[`map_genes`](https://neurogenomics.github.io/orthogene/reference/map_genes.html) finds matching *within-species* synonyms across a wide variety of gene naming conventions (HGNC, Ensembl, RefSeq, UniProt, etc.) and returns a table with standardised gene symbols (or whatever output format you prefer).
```{r map_genes}
genes <- c("Klf4", "Sox2", "TSPAN12","NM_173007","Q8BKT6",9999,
"ENSMUSG00000012396","ENSMUSG00000074637")
mapped_genes <- orthogene::map_genes(genes = genes,
species = "mouse",
drop_na = FALSE)
knitr::kable(head(mapped_genes))
```
## Aggregate mapped genes
`aggregate_mapped_genes` does the following:
1. Uses `map_genes` to identify *within-species* many-to-one gene mappings (e.g. Ensembl transcript IDs ==> gene symbols). Alternatively, can map *across species* if output from `map_orthologs` is supplied to `gene_map` argument (and `gene_map_col="ortholog_gene"`).
2. Drops all non-mappable genes.
3. Aggregates the values of matrix `gene_df` using `"sum"`,`"mean"`,`"median"`,`"min"` or `"max"`.
Note, this only works when the input data (`gene_df`) is a sparse or dense matrix, and the genes are row names.
```{r}
data("exp_mouse_enst")
knitr::kable(tail(as.matrix(exp_mouse_enst)))
exp_agg <- orthogene::aggregate_mapped_genes(gene_df=exp_mouse_enst,
input_species="mouse",
agg_fun = "sum")
knitr::kable(tail(as.matrix(exp_agg)))
```
## Get all genes
You can also quickly get all known genes from the genome of a given species with [`all_genes`](https://neurogenomics.github.io/orthogene/reference/all_genes.html).
```{r all_genes}
genome_mouse <- orthogene::all_genes(species = "mouse",
method = method)
knitr::kable(head(genome_mouse))
```
# Session Info
```{r Session Info}
utils::sessionInfo()
```