--- title: "MEB Tutorial" author: "Yan Zhou, Jiadi Zhu" package: MEB date: "`r Sys.Date()`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{MEB Tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction This package provides a method to identify differential expression genes in the same or different species. Given that non-DE genes have some similarities in features, we build a normalization-invariant minimum enclosing ball (NIMEB) to cover those non-DE genes in feature space, then those DE genes, which are enormously different from non-DE genes, being regarded as outliers and rejected outside the ball. The method on this package are described in the article *A normalization-invariant minimum enclosing ball method to detect differentially expressed genes for RNA-seq data* by Yan Zhou, Bin Yang, Junhui Wang, Jiadi Zhu and Guoliang Tian (2020, pending publication). The normalization-invariant minimum enclosing ball (NIMEB) method is developed for detecting differential expression genes in the same or different species. Compared with existing methods, it is no need to normalize data in advance. Besides, the NIMEB method could be easily applied to the same or different species data and without changing too much. We have implemented the NIMEB method via an R function. We make an R package named MEB, and give a tutorial for the package. The method consists three steps. Step 1: Data Pre-processing; Step 2: Training a model for the training genes; Step 3: Discriminating a gene whether a DE gene. We employ a simulation and real dataset for the same and different species to illustrate the usage of the MEB package. The programs can run under the Linux system and windows 10 system. The R versions should larger than 3.6.0. # Preparations To install the MEB package into your R environment, start R and enter: ```{r, eval=FALSE} install.packages("BiocManager") BiocManager::install("MEB") ``` Then, the MEB package is ready to load. ```{r} library(MEB) ``` # Data format In order to show the MEB workflow, the package includes the example data sets, which includes the simulation and real data for the same and different species. The next we will show the datasets in the package. There are four datasets in the data subdirectory of MEB package, to consistent with standard Bioconductor representations, we transform the format of dataset as *SummarizedExperiment*, please refer R package * SummarizedExperiment* for details. The four datasets are **sim_data_sp**, **sim_data_dsp**, **real_data_sp** and **real_data_dsp**. **real_data_sp** is a real dataset for the same species, which comes from *RNA-seq: an assessment of technical reproducibility and comparisonwith gene expression arrays* by Marioni J.C., Mason C.E., et al. (2008). Genome Res. 18(9), 1509–1517. **real_data_dsp** is a real dataset for the different species, which comes from *The evolution of gene expression levels in mammalian organs* by Brawand, D., Soumillon, M., Necsulea, A. and Julien, P. et al. (2011). Nature, 478, 343-348. **sim_data_sp** and **sim_data_dsp** are two simulation datasets for the same and different species, respectively. Refering *A normalization-invariant minimum enclosing ball method to detect differentially expressed genes for RNA-seq data* by Yan Zhou, Bin Yang, Junhui Wang, Jiadi Zhu and Guoliang Tian (2020, pending publication) for the generation procedure. ```{r} data(sim_data_sp) sim_data_sp ``` **sim_data_sp.RData** includes 2 columns, * the first column is the RNA-seq short read counts for the first sample; * the second column is the RNA-seq short read counts for the second sample; * each row represents a gene, and the first 1000 genes are housekeeping genes. ```{r} data(real_data_sp) real_data_sp ``` **real_data_sp** includes 10 columns, * there are two samples about kidney and liver, and each with five biological replicates; * each row represents a gene, and the first 530 genes are housekeeping genes. ```{r} data(sim_data_dsp) sim_data_dsp ``` **sim_data_dsp.RData** includes 4 columns, * the first and the third columns are the gene length for two species; * the second and the fouth columns are the RNA-seq short read counts for two species; * each row represents an orthologous gene, and the first 1000 genes are the conserved genes. ```{r} data(real_data_dsp) real_data_dsp ``` **real_data_dsp.RData** includes 4 columns, * the first and the third columns are the gene length for human and mouse; * the second and the fouth columns are the RNA-seq short read counts for human and mouse; * each row represents an orthologous gene, and the first 143 genes are the conserved genes. # Training a model for the training genes Based on a part of known housekeeping and conserved genes, we can train our model for the above four datasets. The next we will show how to use the *NIMEB()* function to train a model. 1. Simulation data for the same species ```{r, message = FALSE, warning = FALSE} library(SummarizedExperiment) ``` ```{r} data(sim_data_sp) gamma <- seq(1e-06,5e-05,1e-06) sim_model_sp <- NIMEB(countsTable=assay(sim_data_sp), train_id=1:1000, gamma, nu = 0.01, reject_rate = 0.05, ds = FALSE) ``` 2. Real data for the same species ```{r} data(real_data_sp) gamma <- seq(1e-06,5e-05,1e-06) real_model_sp <- NIMEB(countsTable=assay(real_data_sp), train_id=1:530, gamma, nu = 0.01, reject_rate = 0.1, ds = FALSE) ``` 3. Simulation data for the different species ```{r} data(sim_data_dsp) gamma <- seq(1e-07,2e-05,1e-06) sim_model_dsp <- NIMEB(countsTable=assay(sim_data_dsp), train_id=1:1000, gamma, nu = 0.01, reject_rate = 0.1, ds = TRUE) ``` 4. Real data for the different species ```{r} data(real_data_dsp) gamma <- seq(5e-08,5e-07,1e-08) real_model_dsp <- NIMEB(countsTable=assay(real_data_dsp), train_id=1:143, gamma, nu = 0.01, reject_rate = 0.1, ds = TRUE) ``` The output for *NIMEB()* is "*model*", "*gamma*" and *train_error*. *model* is the model we used to discriminate a new gene, *gamma* represents the selected gamma parameters in model NIMEB, *train_error* represents the corresponding train_error when the value of gamma changed. # Discriminating a gene whether a DE gene Giving the model, we could predict a gene and find out whether DE gene. For example, in *sim_data_sp* data, we predict the discrimination results as follows: ```{r} sim_model_sp_pred <- predict(sim_model_sp$model, assay(sim_data_sp)) summary(sim_model_sp_pred) ``` Based on the model we trained, we could discriminate each genes whether DE gene, if the discrimination result is *TRUE*/*FALSE*, the gene is non-DE/DE gene.