---
title: "Noncoding RNA Set Cis Annotation and Enrichment"
shorttitle: "NoRCE"
author: "Gulden Olgun "
date: "`r Sys.Date()`"
output:
  prettydoc::html_pretty:
    theme: cayman
    highlight: github
    use_bioc: true
    toc: true
    toc_depth: 4

vignette: >
  %\VignetteIndexEntry{Noncoding RNA Set Cis Annotation and Enrichment}
  %\usepackage[utf8]{inputenc}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::knitr}
editor_options: 
  chunk_output_type: inline
---

```{r style, echo=FALSE, results="asis", message=FALSE, warnings = FALSE}
knitr::opts_chunk$set(tidy = FALSE,
                      warning = FALSE,
                      message = FALSE,fig.width=6, fig.height=6 )
```

```{r set size}
knitr::opts_knit$set(width = 80)
```

# Noncoding RNA Set Cis Annotation and Enrichment

[NoRCE](https://github.com/guldenolgun) package systematically performs annotation and enrichment analysis for a set of regulatory non-coding RNA genes. NoRCE analyses are based on spatially proximal mRNAs at a certain distance for a set of non-coding RNA genes or regions of interest. Moreover, specific analyses such as biotype selection, miRNA-mRNA co-expression, miRNA-mRNA target prediction can be performed for filtering. Besides, it allows to curate the gene set according to the topologically associating domain (TAD) regions.

## Supported Assemblies and Organisms
+ *Homo Sapiens (hg19 and hg38)*
+ *Mus Musculus (mm10)*
+ *Rattus Norvegicus (rn6)*
+ *Drosophila Melanogaster (dm6)*
+ *Danio Rerio (danRer10)*
+ *Caenorhabditis Elegans (ce11)*
+ *Saccharomyces Cerevisiae (sc3)*

# Installation
To install the NoRCE

```{r Install, eval=FALSE, echo=TRUE, include=TRUE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("NoRCE")
```

```{r Load, eval=TRUE, echo=TRUE, include=TRUE}
library(NoRCE)
```


# GO Enrichment Analysis 

GO enrichment analysis can be performed based on gene neigbourhood, predicted targets, co-expression values and/or topological domain analysis. HUGO, ENSEMBL gene, ENSEMBL transcript gene, ENTREZ ID and miRBase names are supported formats for the input gene list. Moreover, NoRCE accepts a list of genomic regions. The input genomic region list should be in a .bed format. Each analysis is controlled by corresponding parameters. When related parameters are set, the gene set resulting from the intersection of those analysis will be considered for enrichment analysis (Co-expression analysis can be augmented with other analysis). GO enrichment analysis are carried out with `geneGOEnricher` and `geneRegionGOEnricher` functions. Also, miRNA gene enrichment are carried out with `mirnaGOEnricher` and `mirnaRegionGOEnricher` functions. Species assembly must be defined using the `org_assembly` parameter. NoRCE allows the user to use background gene set. The background gene set and the format of the gene set should be defined.  


### Enrichment Analysis Based on Gene Neighbourhood

When the `near` parameter is set to `TRUE`, the closest genes for the input gene list are retrieved. The gene neighbourhood taken into consideration is controlled by the `upstream` and `downstream` parameters. By default, all genes that fall into 10 kb upstream and downstream of the input genes are retrieved. Also, using `searchRegion` parameter, the analysis can be performed for only those genes whose exon or intron regions fall into the specified upstream and downstream range of the input genes.

```{r Load package, eval=TRUE, echo=TRUE, include=TRUE}
library(NoRCE)
```

```{r, Enrichment analysis based on gene neighbourhood when input set is consist of genes, eval=TRUE, echo = FALSE}
ncGO<-geneGOEnricher(gene = brain_disorder_ncRNA, org_assembly='hg19', near=TRUE, genetype = 'Ensembl_gene')

```

Moreover, [NoRCE](https://github.com/guldenolgun/NoRCE) can convert .txt file or data frame to a .bed formatted file to make it available for region based analysis (`readbed` function).

### Enrichment Analysis Based on Target Prediction
For a set of miRNA genes, target prediction is controlled by the `target` parameter. Once this parameter is set to `TRUE`, TargetScan prediction is used to curate the gene list that will be enriched. 

```{r, Intersection of the nearest genes of the input gene set and the potential target set is carries out for enrichment analysis, eval=TRUE}
mirGO<-mirnaGOEnricher(gene = brain_mirna, org_assembly='hg19', near=TRUE, target=TRUE)

```

The above example shows that the GO enrichment is performed based on neighbouring coding genes of brain miRNA targeted by the same brain miRNA gene set.   

### Enrichment Analysis Based on Topological Associating Domain Analysis
Gene annotation based on topologically associating domain regions are conducted whether ncRNAs fall into the TAD regions and coding gene assignment only those that are in the same TAD region are included in the neighborhood coding gene set. If cell-line(s) for TAD region is specified, only regions that are associated with the given cell-line(s) are considered. User defined and pre-defined TAD regions can be used to find potential gene set list for enrichment. For human, mouse and fruit fly, pre-defined TAD regions are supplied and custom TAD regions must be in a .BED format. Cell-lines are controlled by the `cellline` parameter. Cell-lines can be listed with the `listTAD` function. 

```{r, Retrieve list of cell-line , eval=TRUE}
a<-listTAD(TADName = tad_hg19)
```

```{r, Enrichment based on TAD cellline, eval=TRUE}
mirGO<-mirnaGOEnricher(gene = brain_mirna, org_assembly='hg19', near=TRUE, isTADSearch = TRUE, TAD = tad_hg19)

```

User defined TAD regions can be used as an input for the TAD regions and gene enrichment can be performed based on these custom TAD regions. `TAD` parameter is provided to input the bed formatted TAD regions .


### Enrichment Analysis Based on Correlation Analysis
Enrichment based on correlation analysis is conducted with the `express` parameter. For a given cancer, pre-calculated Pearson correlation coefficient between miRNA-mRNA and miRNA-lncRNA expressions can be used to augment or filter the results. User can define the correlation coefficient cutoff and cancer of interest with `minAbsCor` and `cancer` parameter, respectively. The path of the pre-computed correlation database called as [miRCancer.db](https://figshare.com/articles/miRCancer\_db\_gz/5576329) must be given as an input to a `databaseFile` parameter. 

Two custom defined expression data can be utilized to augment or filter the coding genes that are found using the previous analysis. Expression data must be patient by gene data and headers should be gene names. If no header is defined, `label1` and `label2` must be used to define the headers. The correlation cutoff can be defined with `minAbsCor` parameter. 

# Pathway Enrichment
As in GO enrichment analysis, pathway enrichment analysis can be performed based on gene neigbourhood, predicted targets, correlation coefficient and/or topological domain analysis. Each parameter is controlled by the related parameters and HUGO, ENSEMBL gene, ENSEMBL transcript gene, ENTREZ ID and miRNA name is supported for the input gene list. Non-coding genes can be annotated and enriched with KEGG, Reactome and Wiki pathways. Moreover, pathway enriched can be performed based on custom GMT file. GMT file supports both gene format of ENTREZ ID, Symbol and it is controlled by the `isSymbol` parameter. `genePathwayEnricher` and `geneRegionPathwayEnricher` functions fulfill the pathway enrichment for the genes and regions expect the miRNA genes and for the miRNA `mirnaPathwayEnricher` and `mirnaRegionPathwayEnricher` is used.

```{r, Pathway enrichment based on the gen sets that falls into the TAD regions, eval=TRUE}
ncRNAPathway<-genePathwayEnricher(gene = brain_disorder_ncRNA, org_assembly='hg19', isTADSearch = TRUE,TAD = tad_hg19, genetype = 'Ensembl_gene')
```

# Gene Enrichment Analysis
As a part of pathway analysis, NoRCE carries out hypergeometric test on/gene enrichment of a set of noncoding genes based on gene neigbourhood, target prediction, correlation coefficient and/or topological domain analysis for a given population coding gene set. Genes that form the population set should be provided with `gmtName` and population dataset should be data frame. 

# Pre-processing Steps
## Filter ncRNA Genes Based on Biotype Subsets
Specific biotype of RNA have different characteristics. Biotype information of each gene is provided in [GENCODE](https://www.gencodegenes.org/pages/biotypes.html) as a .GTF format for human and mouse genomes. `filterBiotype` in [NoRCE](https://github.com/guldenolgun/NoRCE) extracts the genes according to the given biotypes and curates the input list accordingly. 


## Co-expression Analysis
Pre-calculated Pearson correlation coefficient values based on TCGA data or correlation coefficient values measured from the user data can be used for filtering the gene list that are used for annotation and enrichment or applying annotation and enrichment analysis for using only co-expression analysis. Final set is determined by curating the gene list for a given p-value, correlation threshold and p-Adjusted-value. 

### Co-expression Analysis in The Cancer Genome Atlas 
`corrbased` provides pre-measured Pearson correlation between lncRNA-mRNA and miRNA-mRNA interactions. Input list can be curated with the miRNA or lncRNA gene sets whose correlation values exceed the given threshold. Gene expressions are gathered from TCGA. In order to run this part, [miRCancer.db](https://figshare.com/articles/miRCancer\_db\_gz/5576329) database must be downloaded locally. Input list can also be a set of mRNA to allow enrichment analysis of miRNAs that are correlated with those mRNAs.

### Co-expression Analysis in Custom Expression Data
Correlation coefficient values between two custom expression data can be calculated and possible interactions can be identified with a predefined threshold filtering in NoRCE. `calculateCorr` function takes two custom data and calculate the correlation between two genes by using correlation method that is defined by the `corrMethod` parameter. It curates the genes based on the cut-offs of the p-value, correlation value and p-Adjusted-value.

```{r Custom correlation analysis, eval=TRUE, echo=TRUE}
dataCor <- calculateCorr(exp1 =  mirna[,1:50], exp2 = mrna[,1:50])

```
 
## Visualization
Results can be finalized in a tabular format or with relevant graphs. Some plots are available only for the GO enrichment analysis, and some of them are available only for the pathway enrichment. 

### Tabular Format
Information about the enrichment result can be written down in a tabular format as a txt file. Results are sorted according to the p-value or p-adjusted-value and all of the enrichment results or user defined number of top enrichment can be written down. This function is suitable for both GO and pathway enrichment.

### Dot Plot
Dot plot for the given number of top enrichments can be utilized for further analysis. In the dot plot, number of overlapped genes that are annoted with the enriched GO-term and occur in the input list, p-value or p-value adjustment value for the selected correction method per GO or pathway is provided.

### GO\:mRNA Network
Relationship between top enriched GO-terms and mRNA genes are shown in an undirected network. `isNonCode` parameter checks whether list of enriched noncoding genes will be employ for the node name. Node name decision for the GO-term or GO-ID is determined by the `takeID` parameter. For the node name decision for the GO-term, parameter must set to the `FALSE`.

### GO\:ncRNA Gene Network
Relationship between top enriched GO-terms and noncoding genes are shown in an undirected network. `isNonCode` parameter checks whether list of enriched noncoding genes will be employed for the node name. Node name decision for the GO-term or GO-ID is determined by the `takeID` parameter. For the node name decision for the GO-term, parameter must set to the `FALSE`.


### GO DAG Network
Directed acyclic graph of the top enriched GO-terms can be illustareted with the `getGoDag` function. Enriched GO-terms are marked with a range of color based on the p-value or p-adjusted-value. P-value ranges can be changed with the `p_range` parameter. 


### KEGG and Reactome Pathway Map
Map of the enriched pathways can be demonstrated in the browser. Due to the limitation of the pathways, each pathway should be treated separately for the pathway map. Moreover, matching genes in the enrichment gene set for the corresponding pathway are marked with color.


# Citation 
If you use [NoRCE](https://github.com/guldenolgun/NoRCE), please cite.