---
title: "Extended examples"
author: "
Authors: Alan Murphy, Brian Schilder, and Nathan Skene
"
date: "Updated: `r format( Sys.Date(), '%b-%d-%Y')`
"
bibliography: ../inst/cit/EWCE.bib
csl: ../inst/cit/nature.csl
output:
BiocStyle::html_document:
vignette: >
%\VignetteIndexEntry{Extended examples}
%\VignetteEngine{knitr::rmarkdown}
%\usepackage[utf8]{inputenc}
---
# Setup
```{r setup, include=TRUE}
library(EWCE)
library(ggplot2)
set.seed(1234)
```
**NOTE**: This documentation is for the development version of `EWCE`. See [Bioconductor](https://bioconductor.org/packages/release/bioc/html/EWCE.html) for documentation on the current release version.
# Run cell-type enrichment tests
## Introduction
In the following vignette, we provide more a more in-depth version of the
examples provided in the [Getting started vignette](https://nathanskene.github.io/EWCE/articles/EWCE.html).
## Prepare input data
### CellTypeDataset
For this example we use a subset of the genes from the merged dataset
generated in the [Create a CellTypeDataset](#create-a-celltypedataset) section below, which is accessed using `ewceData::ctd()`.
#### CTD levels
Each level of a CTD corresponds to increasingly refined cell-type/-subtype annotations. For example, in the CTD `ewceData::ctd()` level 1 includes the cell-type "interneurons", while level 2 breaks these this group into 16 different
interneuron subtypes ("Int...").
```{r, error=TRUE}
## Load merged cortex and hypothalamus dataset generated by Karolinska institute
ctd <- ewceData::ctd() # i.e. ctd_MergedKI
try({
plt <- EWCE::plot_ctd(ctd = ctd,
level = 1,
genes = c("Apoe","Gfap","Gapdh"),
metric = "mean_exp")
})
```
**Note** - You can load `ewceData::ctd()` offline by passing
`localhub = TRUE`. This will work off a previously cached version of the
reference dataset from `ExperimentHub`.
### Gene lists
For the first demonstration of `EWCE` we will test for whether genes that are genetically associated with Alzheimer's disease are enriched in any particular cell type.
This example gene list is stored within the `ewceData` package:
```{r}
hits <- ewceData::example_genelist()
print(hits)
```
**Note** - You can load `ewceData::example_genelist()` offline by passing
`localhub = TRUE`. This will work off a previously cached version of the
reference dataset from `ExperimentHub`.
#### Gene formats and species
All gene IDs are assumed by the package to be provided in gene symbol format
(rather than Ensembl/Entrez). Symbols can be provided as any
species-specific gene symbols supported by the package `orthogene`,
though the `genelistSpecies` argument will need to be set
appropriately.
Likewise, the single-cell dataset can be from any species,
but the `sctSpecies` argument must be set accordingly.
The example gene list here stores the human genes associated with human disease,
and hence are HGNC symbols.
The next step is to determine the most suitable background set.
This can be user-supplied, but by default the background is all 1:1
ortholog genes shared by `genelistSpecies` and `sctSpecies`
that are also present in `sct_data`.
#### Notes on `orthogene`
`orthogene` substantially improves upon previous ortholog translations
that used the static `ewceData::mouse_to_human_homologs()` file as the former is
updated using the [Homologene](https://www.ncbi.nlm.nih.gov/homologene) database periodically.
```{r}
exp <- ctd[[1]]$mean_exp
#### Old conversion method ####
#if running offline pass localhub = TRUE
m2h <- ewceData::mouse_to_human_homologs()
exp_old <- exp[rownames(exp) %in% m2h$MGI.symbol,]
#### New conversion method (used by EWCE internally) ####
exp_new <- orthogene::convert_orthologs(gene_df = exp,
input_species = "mouse",
output_species = "human",
method = "homologene")
#### Report ####
message("The new method retains ",
formatC(nrow(exp_new) - nrow(exp_old), big.mark = ","),
" more genes than the old method.")
```
`orthogene` is also used internally to standardise gene lists supplied to `EWCE`
functions, such as `EWCE::bootstrap_enrichment_test(hits = )`.
Not only can it map these gene lists across species, but it can also map them
within species. For example, if you provide a list of Ensembl IDs,
it will automatically convert them to standardised HGNC gene symbols so they're
compatible with the similarly standardised CellTypeDataset.
### Setting analysis parameters
We now need to set the parameters for the analysis.
For a publishable analysis we would want to generate over 10,000 random
lists and determine their expression levels, but for computational speed
let us only use `reps=100`. We want to analyse level 1
annotations so set level to 1.
```{r}
# Use 100 bootstrap lists for speed, for publishable analysis use >=10000
reps <- 100
# Use level 1 annotations (i.e. Interneurons)
annotLevel <- 1
```
## Enrichment tests
### Default tests
We have now loaded the SCT data, prepared the gene lists and set the parameters. We run the model as follows.
**Note**: We set the seed at the top of this vignette to ensure reproducibility in the bootstrap sampling function.
#### Parallelisation
You can now speed up the bootstrapping process by parallelising across multiple cores with the parameter `no_cores` (`=1` by default).
```{r}
# Bootstrap significance test, no control for transcript length and GC content
full_results <- EWCE::bootstrap_enrichment_test(sct_data = ctd,
sctSpecies = "mouse",
genelistSpecies = "human",
hits = hits,
reps = reps,
annotLevel = annotLevel)
```
**Note** - You can run `bootstrap_enrichment_test()` offline by passing
`localhub = TRUE`. This will work off a previously cached version of the
reference dataset from `ExperimentHub`.
A note on both the background and target gene lists,
other common gene list objects can be used as inputs such as
`BiocSet::BiocSet` and `GSEABase::GeneSet`.
Below is an example of how to format each for the target gene list (`hits`):
```R
if(!"BiocSet" %in% rownames(installed.packages())) {
BiocManager::install("BiocSet")
}
if(!"GSEABase" %in% rownames(installed.packages())) {
BiocManager::install("GSEABase")
}
library(BiocSet)
library(GSEABase)
# Save both approaches as hits which will be passed to bootstrap_enrichment_test
genes <- c("Apoe","Inpp5d","Cd2ap","Nme8",
"Cass4","Mef2c","Zcwpw1","Bin1",
"Clu","Celf1","Abca7","Slc24a4",
"Ptk2b","Picalm","Fermt2","Sorl1")
#BiocSet::BiocSet, BiocSet_target contains the gene list target
BiocSet_target <- BiocSet::BiocSet(set1 = genes)
hits <- unlist(BiocSet::es_element(BiocSet_target))
#GSEABase::GeneSet, GeneSet_target contains the gene list target
GeneSet_target <- GSEABase::GeneSet(genes)
hits <- GSEABase::geneIds(GeneSet_target)
```
The main table of results is stored in `full_results$results`.
We can see the most significant results using:
```{r}
knitr::kable(full_results$results)
```
#### Plot results
The results can be visualised using another function, which shows for each cell type, the number of standard deviations from the mean the level of expression was found to be in the target gene list, relative to the bootstrapped mean:
```{r, error=TRUE}
try({
plot_list <- EWCE::ewce_plot(total_res = full_results$results,
mtc_method ="BH",
ctd = ctd)
# print(plot_list$plain)
})
```
For publications it can be useful to plot a dendrogram alongside the plot. This can be done by including the cell type data as an additional argument. The dendrogram should automatically align with the graph ticks (thanks to [Robert Gordon-Smith](https://github.com/bobGSmith) for this solution):
```R
print(plot_list$withDendro)
```
If you want to view the characteristics of enrichment for each gene within the list then the `generate_bootstrap_plots` function should be used. This saves the plots into the BootstrapPlots folder. This takes the results of a bootstrapping analysis so as to only generate plots for significant enrichments. The `listFileName` argument is used to give the generated graphs a particular file name. The `savePath` argument is used here to save the files to a temporary directory, this can be updated to your preferred location. The file path where it was saved is returned so the temporary directory
can be located if used.
```R
bt_plot_location <- EWCE::generate_bootstrap_plots(
sct_data = ctd,
hits = hits,
sctSpecies = "mouse",
genelistSpecies = "human",
reps = reps,
annotLevel = annotLevel,
full_results = full_results)
```
### Control for transcript length and GC-content
When analysing genes found through genetic association studies it is important to consider biases which might be introduced as a result of transcript length and GC-content. The package can control for these by selecting the bootstrap lists such that the *i^th^* gene in the random list has properties similar to the*i^th^* gene in the target list. To enable the algorithm to do this it needs to be passed the gene lists as HGNC symbols rather than MGI.
The bootstrapping function then takes different arguments:
```{r}
# Bootstrap significance test controlling for transcript length and GC content
#if running offline pass localhub = TRUE
cont_results <- EWCE::bootstrap_enrichment_test(
sct_data = ctd,
hits = hits,
sctSpecies = "mouse",
genelistSpecies = "human",
reps = reps,
annotLevel = annotLevel,
geneSizeControl = TRUE)
```
#### Plot results
We plot these results using `ewce_plot`:
```{r, error=TRUE}
try({
plot_list <- EWCE::ewce_plot(total_res = cont_results$results,
mtc_method = "BH")
print(plot_list$plain)
})
```
This shows that the controlled method generates enrichments that are generally comparable to the standard method.
### Test different CTD levels
Both the analyses shown above were run on level 1 annotations. It is possible to test on the level 2 cell type level annotations by changing one of the arguments.
```{r}
# Bootstrap significance test controlling for transcript length and GC content
#if running offline pass localhub = TRUE
cont_results <- EWCE::bootstrap_enrichment_test(sct_data = ctd,
hits = hits,
sctSpecies = "mouse",
genelistSpecies = "human",
reps = reps,
annotLevel = 2,
geneSizeControl = TRUE)
```
#### Plot results
```{r, error=TRUE}
try({
plot_list <- EWCE::ewce_plot(total_res = cont_results$results,
mtc_method = "BH")
print(plot_list$plain)
})
```
With the cell subtype analysis each microglial subtype was enriched and correspondingly we see here that the microglial cell type is enriched.
### Plot results from multiple sets of enrichment results
It is often useful to plot results from multiple gene list analyses together.
The `ewce_plot` function allows multiple enrichment analyses to be performed together. To achieve this the results data frames are just appended onto each other, with an additional `list` column added detailing which analysis
they relate to.
To demonstrate this we need to first generate a second analysis so let us sample thirty random genes, and run the bootstrapping analysis on it.
```{r}
#### Generate random gene list ####
#if running offline pass localhub = TRUE
human.bg <- ewceData::mouse_to_human_homologs()$HGNC.symbol
gene.list.2 <- sample(human.bg,size = 30)
#if running offline pass localhub = TRUE
second_results <- EWCE::bootstrap_enrichment_test(sct_data = ctd,
sctSpecies = "mouse",
hits = gene.list.2,
reps = reps,
annotLevel = 1)
full_res2 = data.frame(full_results$results,
list="Alzheimers")
rando_res2 = data.frame(second_results$results,
list="Random")
merged_results = rbind(full_res2, rando_res2)
```
#### Plot results
Here we apply multiple-testing correction (BH = Benjamini-Hochberg) to guard against positive results due to chance.
```{r, error=TRUE}
try({
plot_list <- EWCE::ewce_plot(total_res = merged_results,
mtc_method = "BH")
print(plot_list$plain)
})
```
As expected, the second randomly generated gene list shows no
significant enrichment results.
# Create a CellTypeDataset
## Introduction
`EWCE` was originally designed to work with the single-cell cortical
transcriptome data from the [Linnarsson Lab](http://linnarssonlab.org/) [@zeisel2015cell] (available [here](http://linnarssonlab.org/cortex/)).
Using this package it is possible to read in any single-cell transcriptome data, provided that you have a cell by gene expression matrix
(with each cell as a separate column) and a separate annotation dataframe,
with a row for each cell.
The `EWCE` process involves testing for whether the genes in a
target list have higher levels of expression in a given cell type than
can reasonably be expected by chance.
The probability distribution for this is estimated by randomly generating
gene lists of equal length from a set of background genes.
`EWCE` can be applied to any gene list. In the original paper we reported its
application to genetic and transcriptomic datasets [@skene_2016]..
Note that throughout this vignette we use the terms 'cell type'
and 'cell subtype' to refer to two levels of annotation of what a cell type is.
This is described in further detail in our paper[@skene_2016],
but relates to the two levels of annotation provided in the Linnarsson dataset [@zeisel2015cell].
In this dataset a cell is described as having a cell type (i.e. 'Interneuron')
and subcell type (i.e. 'Int11' a.k.a Interneuron type 11).
Here we provide an example of converting a scRNA-seq dataset ([Zeisel 2015](https://www.science.org/doi/10.1126/science.aaa1934))
into a CellTypeDataset (CTD) so it can used with *EWCE*.
## Loading datasets
The first step for all analyses is to load the single-cell
transcriptome (SCT) data.
For the purposes of this example we will use the dataset described in:
> [Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq](https://www.science.org/doi/10.1126/science.aaa1934), Science, 2015
This is stored and can be loaded as follows from `ewceData` package.
*EWCE* can generate CTD from two different data input types:
1. A gene x cell expression matrix
(can be dense matrix, sparse matrix, data.frame, or DelayedArray).
2. A `SingleCellExperiment` (SCE) or `SummarizedExperiment` object.
To check the data, we can quickly plot the distribution of
expression of a given gene across all the cell types.
```{r, error=TRUE}
# example datasets held in ewceData package
#if running offline pass localhub = TRUE
cortex_mrna <- ewceData::cortex_mrna()
gene <- "Necab1"
cellExpDist <- data.frame(Expression=cortex_mrna$exp[gene,],
Celltype=cortex_mrna$annot[
colnames(cortex_mrna$exp),]$level1class)
try({
ggplot(cellExpDist) +
geom_boxplot(aes(x=Celltype, y=Expression), outlier.alpha = .5) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
})
```
To note for the analysis of the `cortex_mrna` dataset,
I will proved the alternative for an SCE object under each segment of code
if there is any difference.
## Convert single-cell formats
If you would like to try running the functions
on an SCE object instead, we can convert `cortex_mrna` with the package [`scKirby`](https://github.com/neurogenomics/scKirby).
`scKirby` automatically infers the input file type and converts it
to SCE format by default.
```R
if (!"SingleCellExperiment" %in% rownames(installed.packages())){BiocManager::install("SingleCellExperiment")}
library(SingleCellExperiment)
if (!"scKirby" %in% rownames(installed.packages())){remotes::install_github("bschilder/scKirby")}
library(scKirby)
# Make SCE object
cortex_mrna <- ewceData::cortex_mrna()
cortex_mrna_sce <- scKirby::ingest_data(cortex_mrna, save_output = FALSE)
# Now to plot the SCE dataset:
gene="Necab1"
cellExpDist_sce = data.frame(e=assays(cortex_mrna_sce)[[1]][gene,],
l1=colData(cortex_mrna_sce)[
colnames(assays(cortex_mrna_sce)[[1]]),
]$level1class)
ggplot(cellExpDist_sce) + geom_boxplot(aes(x=l1,y=e)) + xlab("Cell type") +
ylab("Unique Molecule Count") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
## Correct gene symbols
It is very common for publicly available transcriptome datasets to use
incorrect gene symbols
(often some gene names will have been mangled by opening in Excel).
A function is provided to correct these where out of date aliases were used
and give a warning if appears possible that excel has mangled some of
the gene names. This function can be used with a downloaded reference
file like the
[MRK_List2 file](http://www.informatics.jax.org/downloads/reports/MRK_List2.rpt)
from MGI which lists known synonyms for MGI gene symbols.
If no file path is passed, there is a loaded reference dataset of the
MRK_List2 which will be used: `ewceData::mgi_synonym_data()`.
We recommend running this function on all input datasets.
```R
#if running offline pass localhub = TRUE
cortex_mrna$exp = fix_bad_mgi_symbols(cortex_mrna$exp)
```
**Note** - You can run `fix_bad_mgi_symbols()` offline by passing
`localhub = TRUE`. This will work off a previously cached version of the
reference dataset from `ExperimentHub`.
For the SCE version, pass the whole SCE object:
```R
#Note that data in SCE object must be in 'counts' assay
#The fix_bad_hgnc_symbols function can similarly take an SCE object as an input
cortex_mrna_sce = fix_bad_mgi_symbols(cortex_mrna_sce)
```
**Note** - You can run `ewce_expression_data()` offline by passing
`localhub = TRUE`. This will work off a previously cached version of the
reference dataset from `ExperimentHub`.
## Drop genes
The vignette takes a while to go through if you use all genes. So let's restrict to 1000 random genes.
```{r }
nKeep <- 1000
must_keep <- c("Apoe","Gfap","Gapdh")
keep_genes <- c(must_keep,sample(rownames(cortex_mrna$exp),997))
cortex_mrna$exp <- cortex_mrna$exp[keep_genes,]
dim(cortex_mrna$exp)
```
```R
#### SCE ####
nKeep = 1000
must_keep = c("Apoe","Gfap","Gapdh")
keep_genes = c(must_keep,sample(names(rowRanges(cortex_mrna_sce)),997))
cortex_mrna_sce = cortex_mrna_sce[keep_genes,]
```
## Normalization [Optional]
A different number of reads is found across each cell. We suggest using [`scTransform`](https://github.com/satijalab/sctransform) to normalise for differences due to cell size, then linearly scale.
Note that this might be slow.
```{r, style="max-height: 100px;", warning=FALSE, eval = FALSE}
cortex_mrna$exp_scT_normed <- EWCE::sct_normalize(cortex_mrna$exp)
```
### SingleCellExperiment
```R
#### SCE ####
SummarizedExperiment::assays(cortex_mrna_sce)$normcounts <- EWCE::sct_normalize(SummarizedExperiment::assays(cortex_mrna_sce)$counts)
```
Note that this is optional (and was not used in the original EWCE publication)
so by all means ignore this `scTransform` step.
## Calculate specificity matrices
Next,`drop_uninformative_genes` drops uninformative genes in order to
reduce compute time and noise in subsequent steps.
It achieves this through several steps, each of which are optional:
1. **Drop non-1:1 orthologs**: Removes genes that don't have 1:1 orthologs
with the output_species ("human" by default).
2. **Drop non-varying genes**: Removes genes that don't vary across cells
based on variance deciles.
3. **Drop non-differentially expressed genes (DEGs)**: Removes genes that are
not significantly differentially expressed across cell-types.
Multiple DEG methods are currently available
(see `?drop_uninformative_genes` for up-to-date info).
### Parallelisation
You can now speed up the DGE process by parallelising across
multiple cores with the parameter `no_cores` (`=1` by default).
```{r }
# Generate cell type data for just the cortex/hippocampus data
exp_CortexOnly_DROPPED <- EWCE::drop_uninformative_genes(
exp = cortex_mrna$exp,
input_species = "mouse",
output_species = "human",
level2annot = cortex_mrna$annot$level2class)
```
## Generate CellTypeDataset
Now we generate the CellTypeDataset with the
By default the function returns the path where the resulting file was saved.
If you want to return the CTD itself as well, set `return_ctd=TRUE` and it will
be returned in a list (along with the save path).
### Parallelisation
You can now speed up `generate_celltype_data` by parallelising across
multiple cores with the parameter `no_cores` (`=1` by default).
```{r}
annotLevels <- list(level1class=cortex_mrna$annot$level1class,
level2class=cortex_mrna$annot$level2class)
fNames_CortexOnly <- EWCE::generate_celltype_data(
exp = exp_CortexOnly_DROPPED,
annotLevels = annotLevels,
groupName = "kiCortexOnly")
ctd_CortexOnly <- EWCE::load_rdata(fNames_CortexOnly)
```
### SingleCellExperiment
```R
#### SCE ####
# Generate cell type data for just the cortex/hippocampus data
exp_CortexOnly_DROPPED <- drop_uninformative_genes(exp=cortex_mrna_sce,
drop_nonhuman_genes = T,
input_species = "mouse",
level2annot=cortex_mrna_sce$level2class)
annotLevels = list(level1class=cortex_mrna_sce$level1class,
level2class=cortex_mrna_sce$level2class)
fNames_CortexOnly <- generate_celltype_data(exp=exp_CortexOnly_DROPPED,
annotLevels=annotLevels,
groupName="kiCortexOnly",
savePath=tempdir())
```
To note on the SCE approach, these are the only differences
for handling the analysis of an SCE object.
All downstream analysis on `cortex_mrna` would be the same.
The same approach can be used for other datasets in the vignette if
converted such as `hypothalamus_mrna` or your own SCE object.
# Merge two single-cell datasets
Often it is useful to merge two single-cell datasets.
For instance, there are separate files for the cortex and
hypothalamus datasets generated by the Karolinska institute.
A subset of the resulting merged dataset is loaded using `ctd()`
but it is worth understanding the approach (outlined below) if
you want to repeat this for other single-cell datasets.
The dataset is available from the [GEO](ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE74nnn/GSE74672/suppl/GSE74672_expressed_mols_with_classes.xlsx.gz)
but is also available in the `ewceData` package: `hypothalamus_mrna()`.
See the `ewceData` for details of how this file was preprocessed after
first being downloaded from GEO, unzipped and read into R.
We can now merge the hypothalamus dataset with the cortex dataset
and then calculate specificity:
## Load hypothalamus dataset
```{r}
#if running offline pass localhub = TRUE
hypothalamus_mrna <- ewceData::hypothalamus_mrna()
```
## Fix bad MGI symbols
```{r}
#if running offline pass localhub = TRUE
hypothalamus_mrna$exp <- EWCE::fix_bad_mgi_symbols(exp = hypothalamus_mrna$exp)
cortex_mrna$exp <- EWCE::fix_bad_mgi_symbols(exp = cortex_mrna$exp)
```
## Merge CTD
Merge the CTDs - again this can be run
with SCE or other ranged SE objects as input.
```{r}
merged_KI <- EWCE::merge_two_expfiles(exp1 = hypothalamus_mrna$exp,
exp2 = cortex_mrna$exp,
annot1 = hypothalamus_mrna$annot,
annot2 = cortex_mrna$annot,
name1 = "Hypothalamus (KI)",
name2 = "Cortex/Hippo (KI)")
```
## Drop uninformative genes
Drop genes which don't vary significantly between cell types.
```{r}
exp_merged_DROPPED <- EWCE::drop_uninformative_genes(
exp = merged_KI$exp,
drop_nonhuman_genes = TRUE,
input_species = "mouse",
level2annot = merged_KI$annot$level2class)
```
## Generate CellTypeDataset
```{r}
annotLevels = list(level1class=merged_KI$annot$level1class,
level2class=merged_KI$annot$level2class)
# Update file path to where you want the cell type data files to save on disk
fNames_MergedKI <- EWCE::generate_celltype_data(exp = exp_merged_DROPPED,
annotLevels = annotLevels,
groupName = "MergedKI",
savePath = tempdir())
ctd_MergedKI <- EWCE::load_rdata(fNames_MergedKI)
```
## Understanding specificity matrices
While not required for further analyses it helps to understand
what the outputs of this function are.
Note firstly, that it is a list such that `ctd[[1]]` contains data
relating to level 1 annotations and `ctd[[2]]` relates to level 2 annotations.
Using the `ggplot2` package to visualise the data,
let us examine the expression of a few genes.
If you have not already done so you will need to first install
the ggplot2 package with `install.packages("ggplot2")`.
For this example we use a subset of the genes from the merged dataset
generated above, which is accessed using `ewceData::ctd()`.
We recommend that you use the code above to regenerate this though and
drop the first command from the below section.
```{r, eval=FALSE}
## You can also import the pre-merged cortex and hypothalamus dataset
# generated by Karolinska institute. `ewceData::ctd()` is the same file as
# ctd_MergedKI above.
#if running offline pass localhub = TRUE
ctd <- ewceData::ctd()
```
```{r, fig.width = 7, fig.height = 4, error=TRUE}
ctd <- ctd_MergedKI
try({
plt <- EWCE::plot_ctd(ctd = ctd,
level = 1,
genes = c("Apoe","Gfap","Gapdh"),
metric = "mean_exp")
})
```
This graph shows the average expression of three genes: *Apoe, Gfap*
and *Gapdh*. While there are substantial differences in which cell types
express these genes, the dominant effect seen here is the overall expression
level of the data. For the purposes of this analysis though,
we are not interested in overall expression level and only wish to know about
the proportion of a genes expression which is found in a particular cell type.
We can study this instead using the following code which examines the
data frame `ctd[[1]]$specificity`:
```{r, fig.width = 7, fig.height = 4, error=TRUE}
try({
plt <- EWCE::plot_ctd(ctd = ctd,
level = 1,
genes = c("Apoe","Gfap","Gapdh"),
metric = "specificity")
})
```
We can now see in this graph that *Gfap* is the most specific to a cell
type (Type 1 Astrocytes) of any of those three genes, with over 60% of
its expression found in that cell type.
It can also be seen that the majority of expression of Gapdh is in
neurons but because their are a greater number of neuronal subtypes,
the total expression proportion appears lower.
We can examine expression across level 2 cell type level annotations
by looking at `ctd[[2]]$specificity`:
```{r, fig.width = 7, fig.height = 4, error=TRUE}
try({
plt <- EWCE::plot_ctd(ctd = ctd,
level = 2,
genes = c("Apoe","Gfap","Gapdh"),
metric = "specificity")
})
```
# Run conditional cell-type enrichment tests
## Prepare data
```{r}
#if running offline pass localhub = TRUE
ctd <- ewceData::ctd()
hits <- ewceData::example_genelist()
## NOTE: rep=100 for demo purposes only.
## Use >=10,000 for publication-quality results.
reps <- 100
annotLevel <- 1
```
## Controlling for expression in another cell type
In a followup paper we found that an enrichment detected for Schizophrenia in Somatosensory Pyramidal neurons could be explained by accounting for expression in Hippocampal CA1 pyramidal neurons. These results are described here:
[Skene, et al. Genetic identification of brain cell types underlying schizophrenia.Nature Genetics, 2018.](https://www.nature.com/articles/s41588-018-0129-5)
Those results were generated using an alternative enrichment method designed for use with GWAS Summary Statistics rather than gene sets. The same sort of approach can be extended to EWCE as well, and we have implemented it within this package. When testing for enrichment the other gene sets that are sampled are selected to have equivalent specificity in the controlled cell type.
We demonstrate it's use below to test whether the enrichment in astrocytes is still present after controlling for the enrichment within microglia.
### Parallelisation
You can now speed up the bootstrapping process by parallelising across
multiple cores with the parameter `no_cores` (`=1` by default).
```{r }
#if running offline pass localhub = TRUE
unconditional_results <- EWCE::bootstrap_enrichment_test(
sct_data = ctd,
hits = hits,
sctSpecies = "mouse",
genelistSpecies = "human",
reps = reps,
annotLevel = annotLevel)
conditional_results_micro <- EWCE:: bootstrap_enrichment_test(
sct_data = ctd,
hits = hits,
sctSpecies = "mouse",
genelistSpecies = "human",
reps = reps,
annotLevel = annotLevel,
controlledCT = "microglia")
conditional_results_astro <- EWCE::bootstrap_enrichment_test(
sct_data = ctd,
hits = hits,
sctSpecies = "mouse",
genelistSpecies = "human",
reps = reps,
annotLevel = annotLevel,
controlledCT = "astrocytes_ependymal")
```
### Merge and plot results
```{r, error=TRUE}
merged_results <- rbind(
data.frame(unconditional_results$results,
list="Unconditional Enrichment"),
data.frame(conditional_results_micro$results,
list="Conditional Enrichment (Microglia controlled)"),
data.frame(conditional_results_astro$results,
list="Conditional Enrichment (Astrocyte controlled)")
)
try({
plot_list <- EWCE::ewce_plot(total_res = merged_results,
mtc_method = "BH")
print(plot_list$plain)
})
```
When controlling for astrocytes the enrichment in astrocytes is
totally abolished as expected, and vica versa. The enrichment in microglia
remains strongly significant however after controlling for astrocytes.,
suggesting that this enrichment is independent of that in astrocytes.
## Gene set enrichment analysis controlling for cell type expression
Traditionally the standard analysis run on all gene sets was the GO
enrichment analysis. Once you have established that a given gene list
is enriched for a given cell type, it becomes questionable whether a GO
enrichment is driven purely by the underlying cell type enrichment.
For instance, it is well established that genes associated with schizophrenia
are enriched for the human post-synaptic density genes, however,
it has also been shown that schizophrenia is enriched for specificity
in CA1 pyramidal neurons (which highly express hPSD genes).
These two enrichments can be disassociated using the following analysis:
```R
# set seed for bootstrap reproducibility
mouse_to_human_homologs <- ewceData::mouse_to_human_homologs()
m2h = unique(mouse_to_human_homologs[,c("HGNC.symbol","MGI.symbol")])
schiz_genes <- ewceData::schiz_genes()
id_genes <- ewceData::id_genes()
mouse.hits.schiz = unique(m2h[m2h$HGNC.symbol %in% schiz_genes,"MGI.symbol"])
mouse.hits.id = unique(m2h[m2h$HGNC.symbol %in% id_genes,"MGI.symbol"])
mouse.bg = unique(m2h$MGI.symbol)
hpsd_genes <- ewceData::hpsd_genes()
mouse.hpsd = unique(m2h[m2h$HGNC.symbol %in% hpsd_genes,"MGI.symbol"])
rbfox_genes <- ewceData::rbfox_genes()
res_hpsd_schiz = controlled_geneset_enrichment(disease_genes=mouse.hits.schiz,
functional_genes = mouse.hpsd,
bg_genes = mouse.bg,
sct_data = ctd,
annotLevel = 1,
reps=reps,
controlledCT="pyramidal CA1")
res_rbfox_schiz =
controlled_geneset_enrichment(disease_genes=mouse.hits.schiz,
functional_genes = rbfox_genes,
bg_genes = mouse.bg,
sct_data = ctd,
annotLevel = 1,
reps=reps,
controlledCT="pyramidal CA1")
print(res_hpsd_schiz)
print(res_rbfox_schiz)
res_hpsd_id =
controlled_geneset_enrichment(disease_genes=mouse.hits.id,
functional_genes = mouse.hpsd,
bg_genes = mouse.bg,
sct_data = ctd,
annotLevel = 1,
reps=reps,
controlledCT = "pyramidal SS")
res_rbfox_id = controlled_geneset_enrichment(disease_genes=mouse.hits.id,
functional_genes = rbfox_genes,
bg_genes = mouse.bg,
sct_data = ctd,
annotLevel = 1,
reps=reps,
controlledCT="pyramidal SS")
print(res_hpsd_id)
print(res_rbfox_id)
```
The analysis also tests for enrichment of Rbfox binding genes in
the schizophrenia susceptibility genes, as well as both hPSD and
Rbfox genes in Intellectual Disability genes. All of the enrichments
are confirmed as still being present after controlling for the associated
cell type, apart from the enrichment of PSD genes in Schizophrenia
which falls from borderline to non-significant.
## Controlling for multiple cell types
```R
controlledCTs = c("pyramidal CA1","pyramidal SS","interneurons")
res_hpsd_schiz =
controlled_geneset_enrichment(disease_genes=mouse.hits.schiz,
functional_genes = mouse.hpsd,
bg_genes = mouse.bg,
sct_data = ctd,
annotLevel = 1,
reps=reps,
controlledCT=controlledCTs)
res_rbfox_schiz =
controlled_geneset_enrichment(disease_genes=mouse.hits.schiz,
functional_genes = rbfox_genes,
bg_genes = mouse.bg, sct_data = ctd,
annotLevel = 1, reps=reps,
controlledCT=controlledCTs)
print(res_hpsd_schiz)
print(res_rbfox_schiz)
res_hpsd_id =
controlled_geneset_enrichment(disease_genes=mouse.hits.id,
functional_genes = mouse.hpsd,
bg_genes = mouse.bg,
sct_data = ctd,
annotLevel = 1,
reps=reps,
controlledCT=controlledCTs)
res_rbfox_id = controlled_geneset_enrichment(disease_genes=mouse.hits.id,
functional_genes = rbfox_genes,
bg_genes = mouse.bg,
sct_data = ctd,
annotLevel = 1,
reps=reps,
controlledCT=controlledCTs)
print(res_hpsd_id)
print(res_rbfox_id)
```
# Apply to transcriptomic data
## Prepare data
```{r}
#if running offline pass localhub = TRUE
ctd <- ewceData::ctd()
tt_alzh <- ewceData::tt_alzh()
## NOTE: rep=100 for demo purposes only.
## Use >=10,000 for publication-quality results.
reps <- 100
```
## Analysing single transcriptome study
For the prior analyses the gene lists were not associated with any numeric
values or directionality.
The methodology for extending this form of analysis to transcriptomic
studies simply involves thresholding the most
upregulated and downregulated genes.
To demonstrate this we have an example dataset `tt_alzh`.
This data frame was generated using `limma`
from a set of post-mortem tissue samples from
Brodmann area 46 which were described in a paper by the Haroutunian lab[@haroutunian2009transcriptional].
The first step is to load the data, obtain the MGI ids,
sort the rows by t-statistic and then select the most up/down-regulated genes.
The package then has a function `ewce_expression_data`
which thresholds and selects the gene sets, and calls the EWCE function.
Below we show the function call using the default settings,
but if desired different threshold values can be used,
or alternative columns used to sort the table.
```{r, results="hide"}
# ewce_expression_data calls bootstrap_enrichment_test so
tt_results <- EWCE::ewce_expression_data(sct_data = ctd,
tt = tt_alzh,
annotLevel = 1,
ttSpecies = "human",
sctSpecies = "mouse")
```
**Note** - You can run `ewce_expression_data()` offline by passing
`localhub = TRUE`. This will work off a previously cached version of the
reference dataset from `ExperimentHub`.
### Plot results
The results of this analysis can be plotted using the `ewce_plot` function.
```{r fig.width = 7, fig.height = 4, error=TRUE}
try({
plot_list <- EWCE::ewce_plot(tt_results$joint_results)
print(plot_list$plain)
})
```
As was reported in our paper, neuronal genes are found to be
downregulated while glial genes are upregulated.
## Generating bootstrap plots for transcriptomes
A common request is to explain which differentially expressed genes
are associated with a cell type...
```R
full_result_path <- EWCE::generate_bootstrap_plots_for_transcriptome(
sct_data = ctd,
tt = tt_alzh,
annotLevel = 1,
full_results = tt_results,
listFileName = "examples",
reps = reps,
ttSpecies = "human",
sctSpecies = "mouse",
sig_only = FALSE)
```
## Merging multiple transcriptome studies
Where multiple transcriptomic studies have been performed with the same purpose, i.e. seeking differential expression in dlPFC of post-mortem schizophrenics, it is common to want to determine whether they exhibit any shared signal. EWCE can be used to merge the results of multiple studies.
To demonstrate this we use a two further Alzheimer's transcriptome datasets coming from Brodmann areas 36 and 44: these area stored in `tt_alzh_BA36` and `tt_alzh_BA44`. The first step is to run EWCE on each of these individually and store the output into one list.
### Load data
```R
#if running offline pass localhub = TRUE
tt_alzh_BA36 <- ewceData::tt_alzh_BA36()
tt_alzh_BA44 <- ewceData::tt_alzh_BA44()
```
### Run EWCE analysis
```R
tt_results_36 <- EWCE::ewce_expression_data(sct_data = ctd,
tt = tt_alzh_BA36,
annotLevel = 1,
ttSpecies = "human",
sctSpecies = "mouse")
tt_results_44 <- EWCE::ewce_expression_data(sct_data = ctd,
tt = tt_alzh_BA44,
annotLevel = 1,
ttSpecies = "human",
sctSpecies = "mouse")
# Fill a list with the results
results <- EWCE::add_res_to_merging_list(tt_results)
results <- EWCE::add_res_to_merging_list(tt_results_36,results)
results <- EWCE::add_res_to_merging_list(tt_results_44,results)
# Perform the merged analysis
# For publication reps should be higher
merged_res <- EWCE::merged_ewce(results = results,
reps = 10)
print(merged_res)
```
The results can then be plotted as normal using the `ewce_plot` function.
```R
plot_list <- EWCE::ewce_plot(merged_res)
print(plot_list$plain)
```
The merged results from all three Alzheimer's brain regions are found
to be remarkably similar, as was reported in our paper.
# Session Info
```{r Session Info}
utils::sessionInfo()
```
# References