--- title: "ImpulseDE" author: "Jil Sander" date: "June 16th, 2016" output: pdf_document vignette: > %\VignetteIndexEntry{ImpulseDE} %\VignetteEngine{knitr::rmarkdown} --- # Table of Contents 1. [Introduction](#introduction) 1.1 [Requirements for input data](#requirements-for-input-data) 1.2 [Calling ImpulseDE](#calling-impulsede) 1.3 [Output of ImpulseDE](#output-of-impulsede) 2. [Differential expression analysis for single time courses]( #differential-expression-analysis-for-single-time-courses) 2.1 [Running the analysis 1TK](#running-the-analysis-1tk) 2.2 [Plotting genes of interest 1TK](#plotting-genes-of-interest-1tk) 2.3 [Imputing values for missing time points 1TK]( #imputing-values-for-missing-time-points-1tk) 3. [Differential expression analysis between two time courses]( #differential-expression-analysis-between-two-time-courses) 3.1 [Running the analysis 2TK](#running-the-analysis-2tk) 3.2 [Plotting genes of interest 2TK](#plotting-genes-of-interest-2tk) 3.3 [Imputing values for missing time points 2TK]( #imputing-values-for-missing-time-points-2tk) 4. [References](#references) # Introduction *ImpulseDE* detects differentially expressed (DE) genes in high-throughput time course experiments. It accepts two different kinds of inputs: whether a single time course dataset (1TK) or a dataset containing a case as well as a control time course for each time point measured (2TK). For the first scenario, it identifies genes being differentially expressed across time points. For the second scenario, *ImpulseDE* reports genes being differentially expressed between both conditions. *ImpulseDE* follows a five-step workflow: ------------------------------------------------------------------------------ Step Explanation ----------------- ---------------------------------------------------------- Clustering The genes are clustered into a limited number of groups using k-means. In default modus, *ImpulseDE* prints the plots for each cluster. Fit to clusters *ImpulseDE* is based on the impulse model proposed by Chechik and Koller, which reflects a two-step behavior of genes within a cell responding to environmental changes [@chechikkoller2009]. This model is fitted to the mean expression profiles of the clusters. Fit to genes The best parameter sets obtained from the clusters are then used to fit an impulse model to each gene. Fit to random data The impulse model is fitted to a randomized dataset (bootstrap), which is essential to detect differentially expressed genes [@storey2005]. Detection of Detection of differentially expressed genes utilizing the differentially fits to the real and randomized data sets. FDR-correction expressed genes is performed to obtain adjusted p-values [@bh1995]. ------------------------------------------------------------------------------ ## Requirements for input data *ImpulseDE* requires an expression table (only numbers) as well as an annotation table (characters allowed). The requirements for the two tables are the following: -------------------------------------------------------------------------------- Feature Explanation ----------------- ------------------------------------------------------------- Expression table Genes have to be in rows and samples in columns. Both rows and columns should have unique identifiers. Annotation table Must have two columns, one carrying the timestamps as numeric numbers and the other one carrying the condition information. In the case of two time courses, two conditions are required. More than two conditions are allowed to be specified, but then two conditions (one case and one control conditions) have to be specified by the user for each run separately. The samples (row names) do not have to have the same order as in the expression table (column names), but the sample identifiers must be identical. Additional columns are allowed but will be ignored later on. Missing values Are **not supported**. Genes having missing values for at least one sample will be excluded from the analysis. Time points Since the parametric model contains six parameters, the dataset should contain at least **six** time points. Normalization Gene expression data should be **properly normalized** and filtering including **log2-transformation** and **filtered** to avoid spending time on fitting the model to non-informative genes (e.g. not expressed or not variable genes). No impulse model will be fitted to genes having a coefficient of variation less than 0.025; instead, the mean across all samples is returned as the "fit". -------------------------------------------------------------------------------- ## Calling ImpulseDE The first four input parameters have to be specified by the user. Those are the names of the two input tables (*expression\_table* and *annotation\_table*) as well as the two column names carrying the time (*colname\_time*) and condition (*colname\_condition*) information within the annotation table. Additional parameters can be set to specify the time course scenario as well as fitting and parallelization parameters. In the default modus, *ImpulseDE* expects a single time course scenario without any control data (*control\_timecourse = FALSE* and *control\_name = NULL*). In the case of two time courses, the *control\_timecourse* parameter has to be set to *TRUE* and for *control\_name* the name of the control within *colname\_condition* has to be specified. If more than two conditions are present within the annotation table, *case\_name* has to be set in addition to run *ImpulseDE* for the desired case condition. Regarding the fitting, as default *ImpulseDE* will run 100 iterations (*n\_iter = 100*) to optimize the model parameters, generate 50.000 random data points (*n\_randoms = 50.000*) to estimate bootstrapped p-values for DE analysis, and determines DE genes using an FDR-adjusted p-value cutoff (q-value) of 1% (*Q\_value = 0.01*). Furthermore, in default modus it will split the run into 4 processes (*n\_process = 4*). If parallelization is not possible on the device or is not admired, *n\_process* should be set to *1*. ## Output of ImpulseDE *ImpulseDE* returns a list consisting of three sublists: *impulse\_fit\_results*, *DE\_results* and *clustering\_results*. The first contains the fitted impulse model parameters, sum of squared fitting errors as well as the calculated impulse values for all time points. The second provides the names of the genes being called as differentially expressed according to a specified cutoff together with the adjusted p-values (*DE\_genes*) as well as the adjusted p-values, flags and results of additional tests for all genes (*pvals\_and\_flags*). The third speciFIes the clusters, to which the genes were assigned to as well as the mean expression values for the clusters. # Differential expression analysis for single time courses In the case of a single time course experiment, *ImpulseDE* will detect differentially expressed genes over time. A fitting dataset is provided within the R package *longitudinal*, where T cells were stimulated with PMA and ionomicin and harvested at 10 different time points [@rangel2004]. The dataset contains 10 measurements per time point for 58 genes: ```{r chunk1} # (Install package longitudinal) and load it library(longitudinal) # attach T cell data data(tcell) # check dimension of data matrix of interest dim(tcell.10) ``` In order to be able to apply *ImpulseDE* on this dataset, *tcell.10* has to be transposed using *t()* during the call since genes need to be in rows and samples in columns. Additionally, it is necessary to create a proper annotation table: ```{r chunk2} # generate annotation table with columns "Time" and "Condition" annot <- as.data.frame(cbind("Time" = sort(rep(get.time.repeats(tcell.10)$time,10)), "Condition" = "activated"), stringsAsFactors = FALSE) # Time columns must be numeric annot$Time <- as.numeric(annot$Time) # rownames of annotation table must appear in data table rownames(annot) = rownames(tcell.10) head(annot) ``` It is important that the *Time* column contains numeric values and that the *Condition* column is not a factor. Since the dataset contains only a sinlge time course and therefore only one condition, the *Condition* column contains only one unique value, *activated*. ## Running the analysis 1TK *ImpulseDE* provides a single function, *impulse\_DE*, which runs all the analysis steps automatically and prints the current status on the screen. To run *ImpulseDE* with all default options, only four variables need to be set for the single time course scenario: *expression\_table*, *annotation\_table*, *colname\_time* and *colname\_condition*. However, for demonstration purposes the number of iterations, randomizations as well as the number of used processors will be reduced. For real datasets, it not recommended to reduce *n\_iter* as well as *n\_randoms*. Additionally, the analysis will be limited to the first 20 genes: ```{r chunk3, fig.height = 3, fig.width = 6} # load package library(ImpulseDE) # start analysis impulse_results <- impulse_DE(t(tcell.10)[1:20,], annot, "Time", "Condition", n_iter = 10, n_randoms = 10, n_process = 1, new_device = FALSE) ``` (**Note**: *new\_device* is set to *FALSE* in all plot functions here to avoid the generation of emtpy pages within the vignette. Usually, it is recommend to keep this option *TRUE*, which will open a new device for each plot. Otherwise, all earlier plots will be overwritten.) ## Plotting genes of interest 1TK Plotting a custom list of genes can be done by using the function *plot\_impulse*. For this the fitting results are needed, which can be taken from the generated result object *impulse\_results*. As an example, some genes being called as differentially expressed are plotted. ```{r chunk4, fig.show = 'hold', fig.height = 7} genes = c("SIVA","CD69","ZNFN1A1","JUND","ITGAM","SMN1","PCNA") plot_impulse(gene_IDs = genes, data_table = t(tcell.10), data_annotation = annot, imp_fit_genes = impulse_results$impulse_fit_results, file_name_part = "four_NV_genes", new_device = FALSE) ``` For example, *JUND* and *CD69* show very typical impulse-like expression patterns, which cleary change significantly over time. ## Imputing values for missing time points 1TK To impute values for an uncovered time point for a specific gene, the following command can be used: ```{r chunk5} # impute expression value for time point 60 for gene "JUND" (imp_results <- calc_impulse(impulse_results$impulse_fit_results$impulse_parameters_case[ "JUND",1:6], 60)) ``` # Differential expression analysis between two time courses In the case of a two time course experiment, *ImpulseDE* will detect differentially expressed genes between both conditions. In order to generate two time courses out of the T cell data set, the replicates will be splitted and to the second half some random numbers are added: ```{r chunk6} # split dataset into two halfs case_data <- t(tcell.10)[,seq(1,ncol(t(tcell.10)),2)] control_data <- t(tcell.10)[,seq(2,ncol(t(tcell.10)),2)] # add some random values to "control_data" to make data different control_data <- control_data + t(apply(control_data,1,function(x) runif(length(x),0,0.5)*sample(c(-1,1),length(x), replace = TRUE) + sample(c(seq(-2,2,0.5)),1))) tcell_2tk <- cbind(case_data, control_data) ``` At last, a proper annotation table has to be generated: ```{r chunk7} annot_2tk <- annot[colnames(tcell_2tk),] annot_2tk[51:100,"Condition"] = "control" head(annot_2tk) tail(annot_2tk) ``` ## Running the analysis 2TK In contrast to the single time course scenario, six variables need to be set: *expression\_table*, *annotation\_table*, *colname\_time*, *colname\_condition*, *control\_timecourse* and *control\_name*. Here again, for demonstration purposes the number of itertions, randomizations as well as the number of used processors will be reduced. For real datasets, it not recommended to reduce *n\_iter* as well as *n\_randoms*. Again, the analysis will be reduced to the first 20 genes: ```{r chunk8, fig.height = 3, fig.width = 6} # load package library(ImpulseDE) # start analysis impulse_results <- impulse_DE(tcell_2tk[1:20,], annot_2tk, "Time", "Condition", TRUE, "control", n_iter = 10, n_randoms = 10, n_process = 1, new_device = FALSE) ``` ## Plotting genes of interest 2TK As an example, some genes being called as differentially expressed are plotted as well: ```{r chunk9, fig.show='hold', fig.height = 7} genes = c("SIVA","ZNFN1A1","IL4R","MAP2K4","ITGAM","SMN1","CASP8","E2F4","PCNA") plot_impulse(gene_IDs = genes, data_table = tcell_2tk, data_annotation = annot_2tk, imp_fit_genes = impulse_results$impulse_fit_results, control_timecourse = TRUE, control_name = "control", file_name_part = "four_NV_genes_2tk", new_device = FALSE) ``` ## Imputing values for missing time points 2TK To impute values for an uncovered time point for a specific gene, *calc\_impulse* has to be applied to both datasets separately: ```{r chunk10} # impute expression value for time point 60 for gene "JUND" # case data (imp_results <- calc_impulse(impulse_results$impulse_fit_results$impulse_parameters_case[ "JUND",1:6], 60)) # control data (imp_results <- calc_impulse(impulse_results$impulse_fit_results$impulse_parameters_control[ "JUND",1:6], 60)) ``` # References --- references: - id: chechikkoller2009 title: Timing of Gene Expression Responses to Environmental Changes. author: - family: Chechik given: Gal - family: Koller given: Daphne container-title: Journal of Computational Biology volume: 16 URL: http://online.liebertpub.com/doi/abs/10.1089/cmb.2008.13TT DOI: 10.1089/cmb.2008.13TT issue: 2 page: 279-290 type: article-journal issued: year: 2009 month: 2 - id: storey2005 title: Significance analysis of time course microarray experiments. author: - family: Storey given: John D. - family: Xiao given: Wenzhong - family: Leek given: Jeffrey T. - family: Tompkins given: Ronald G. - family: Davis given: Ronald W. container-title: Proceedings of the National Academy of Sciences volume: 102 URL: http://www.pnas.org/content/102/36/12837.full DOI: 10.1073/pnas.0504609102 issue: 36 page: 12837-12842 type: article-journal issued: year: 2005 month: 2 - id: bh1995 title: > Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. author: - family: Benjamini given: Yoav - family: Hochberg given: Yosef container-title: > Journal of the Royal Statistical Society Series B (Methodological) volume: 57 URL: http://www.jstor.org/stable/2346101 DOI: 10.2307/2346101 issue: 1 page: 289-300 type: article-journal issued: year: 1995 month: 3 - id: rangel2004 title: > Modeling T-cell activation using gene expression profiling and state-space models. author: - family: Rangel given: Claudia - family: Angus given: John - family: Ghahramani given: Zoubin - family: Lioumi given: Maria - family: Sotheran given: Elizabeth - family: Gaiba given: Alessia - family: Wild given: David L. - family: Falciani given: Francesco container-title: Bioinformatics volume: 20 URL: http://bioinformatics.oxfordjournals.org/content/20/9/1361.long DOI: 10.1093/bioinformatics/bth093 issue: 9 page: 1361-1372 type: article-journal issued: year: 2004 month: 2 ---