--- title: "Introduction to DEP" author: "Arne Smits" date: "`r doc_date()`" package: "`r pkg_ver('DEP')`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{DEP: Introduction} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r required packages, echo = FALSE, warning=FALSE, results="hide"} suppressPackageStartupMessages({ library("BiocStyle") library("DEP") library("dplyr") }) ``` \pagebreak # Overview of the analysis This package provides an integrated analysis workflow for robust and reproducible analysis of mass spectrometry proteomics data for differential protein expression or differential enrichment. It requires tabular input (e.g. txt files) as generated by quantitative analysis softwares of raw mass spectrometry data, such as [MaxQuant](http://www.nature.com/nbt/journal/v26/n12/full/nbt.1511.html) or [IsobarQuant](http://www.nature.com/nprot/journal/v10/n10/full/nprot.2015.101.html). Functions are provided for data preparation, filtering, variance normalization and imputation of missing values, as well as statistical testing of differentially enriched / expressed proteins. It also includes tools to check intermediate steps in the workflow, such as normalization and missing values imputation. Finally, visualization tools are provided to explore the results, including heatmap, volcano plot and barplot representations. For scientists with limited experience in R, the package also contains wrapper functions that entail the complete analysis workflow and generate a report (see the chapter on [Workflow functions](#workflow-functions-for-the-entire-analysis)). Even easier to use are the interactive Shiny apps that are provided by the package (see the chapter on [Shiny apps](#interactive-analysis-using-the-dep-shiny-apps)). # Installation The most recent version of R can be downloaded from [CRAN](https://cran.r-project.org). Additional system requirements are [Pandoc](http://pandoc.org/index.html), as well as [Rtools](https://cran.r-project.org/bin/windows/Rtools/) for Windows and [NetCDF](https://www.unidata.ucar.edu/software/netcdf/) for Linux. Start R and install the DEP package: ```{r install, eval = FALSE} source("https://bioconductor.org/biocLite.R") biocLite("DEP") library("DEP") ``` # Interactive analysis using the DEP Shiny apps The package contains `r CRANpkg("shiny") ` apps, which allow for the interactive analysis entailing the entire workflow described below. These apps are especially relevant for users with limited or no experience in R. Currently, there are two different apps. One for label-free quantification (LFQ)-based analysis (output from MaxQuant) and the second for tandem-mass-tags (TMT)-based analysis (output from IsobarQuant). To run the apps, simply run this single command: ``` {r run_app, eval = FALSE} # For LFQ analysis run_app("LFQ") # For TMT analysis run_app("TMT") ``` # Differential analysis ## Example dataset: Ubiquitin interactors We analyze a proteomics dataset in which Ubiquitin-protein interactors were characterized ([Zhang et al. Mol Cell 2017](http://www.cell.com/molecular-cell/fulltext/S1097-2765(17)30004-7)). The raw mass spectrometry data were first analyzed using MaxQuant ([Cox and Mann, Nat Biotech 2007](http://www.nature.com/nbt/journal/v26/n12/full/nbt.1511.html)) and the resulting "proteinGroups.txt" file is used as input for the downstream analysis. ## Loading of the data ```{r load data} # Loading a package required for data handling library("dplyr") # The data is provided with the package data <- UbiLength # We filter for contaminant proteins and decoy database hits, which are indicated by "+" in the columns "Potential.contaminants" and "Reverse", respectively. data <- filter(data, Reverse != "+", Potential.contaminant != "+") ``` The data.frame has the following dimensions: ``` {r dimension} dim(data) ``` The data.frame has the following column names: ``` {r colnames} colnames(data) ``` The "LFQ.intensity" columns will be used for subsequent analysis. ## Data preparation The dataset has unique Uniprot identifiers, however those are not immediately informative. The associated gene names are informative, however these are not always unique. ``` {r unique} # Are there any duplicated gene names? data$Gene.names %>% duplicated() %>% any() # Make a table of duplicated gene names data %>% group_by(Gene.names) %>% summarize(frequency = n()) %>% arrange(desc(frequency)) %>% filter(frequency > 1) ``` For further analysis these proteins must get unique names. Additionally, some proteins do not have an annotated gene name and for those we will use the Uniprot ID. ``` {r unique_names} # Make unique names using the annotation in the "Gene.names" column as primary names and the annotation in "Protein.IDs" as name for those that do not have an gene name. data_unique <- make_unique(data, "Gene.names", "Protein.IDs", delim = ";") # Are there any duplicated names? data$name %>% duplicated() %>% any() ``` ## Generate a SummarizedExperiment object Many [Bioconductor](http://www.bioconductor.org/) packages use `r Biocpkg("SummarizedExperiment") ` objects as input and/or output. This class of objects contains and coordinates the actual (assay) data, information on the samples as well as feature annotation. We can generate the SummarizedExperiment object from our data using two different approaches. We can extract sample information directly from the column names or we add sample information using an experimental design template. An experimental design is included in the package for our example dataset. The experimental design must contain 'label', 'condition' and 'replicate' columns. The 'label' column contains the identifiers of the different samples and they should correspond to the column names containing the assay data. The 'condition' and 'replicate' columns contain the annotation of these samples as defined by the user. ``` {r expdesign, echo = FALSE} # Display experimental design knitr::kable(UbiLength_ExpDesign) ``` ``` {r to_exprset} # Generate a SummarizedExperiment object using an experimental design LFQ_columns <- grep("LFQ.", colnames(data_unique)) # get LFQ column numbers experimental_design <- UbiLength_ExpDesign data_se <- make_se(data_unique, LFQ_columns, experimental_design) # Generate a SummarizedExperiment object by parsing condition information from the column names LFQ_columns <- grep("LFQ.", colnames(data_unique)) # get LFQ column numbers data_se_parsed <- make_se_parse(data_unique, LFQ_columns) # Let's have a look at the SummarizedExperiment object data_se ``` ## Prerequisites of the SummarizedExperiment object The _make_se_ and _make_se_parse_ functions generate a `r Biocpkg("SummarizedExperiment") ` object that has a couple of specifications. The assay data is log2-transformed and its rownames depict the protein names. The rowData contains, amongst others, the 'name' and 'ID' columns that were generated by _make_unique_. The colData contains the experimental design and thereby the sample annotation. Thereby the colData includes the 'label', 'condition' and 'replicate' columns as well as a newly generated 'ID' column. The log2-transformed assay data and the specified rowData and colData columns are prerequisites for the subsequent analysis steps. ## Filter on missing values The dataset contains proteins which are not quantified in all replicates. Some proteins are even only quantified in a single replicate. ``` {r plot_data_noFilt, fig.width = 4, fig.height = 4} # Plot a barplot of the protein identification overlap between samples plot_frequency(data_se) ``` This leaves our dataset with missing values, which need to be imputed. However, this should not be done for proteins that contain too many missing values. Therefore, we first filter out proteins that contain too many missing values. This is done by setting the threshold for the allowed number of missing values per condition in the _filter_missval_ function. ``` {r filter_missval} # Filter for proteins that are identified in all replicates of at least one condition data_filt <- filter_missval(data_se, thr = 0) # Less stringent filtering: # Filter for proteins that are identified in 2 out of 3 replicates of at least one condition data_filt2 <- filter_missval(data_se, thr = 1) ``` After filtering, the number of identified proteins per sample can be plotted as well as the overlap in identifications between samples. ``` {r plot_data, fig.width = 4, fig.height = 4} # Plot a barplot of the number of identified proteins per samples plot_numbers(data_filt) ``` ``` {r plot_data2, fig.width = 3, fig.height = 4} # Plot a barplot of the protein identification overlap between samples plot_coverage(data_filt) ``` ## Normalization The data is background corrected and normalized by variance stabilizing transformation (`r Biocpkg("vsn") `). ``` {r normalize} # Normalize the data data_norm <- normalize_vsn(data_filt) ``` The normalization can be inspected by checking the distributions of the samples before and after normalization. ``` {r plot_norm, fig.width = 4, fig.height = 5} # Visualize normalization by boxplots for all samples before and after normalization plot_normalization(data_filt, data_norm) ``` ## Impute data for missing values The remaining missing values in the dataset need to be imputed. The data can be missing at random (MAR), for example if proteins are quantified in some replicates but not in others. Data can also be missing not at random (MNAR), for example if proteins are not quantified in specific conditions (e.g. in the control samples). MNAR can indicate that proteins are below the detection limit in specific samples, which could be very well the case in proteomics experiments. For these different conditions, different imputation methods have to be used, as described in the `r Biocpkg("MSnbase") ` vignette and more specifically in the _impute_ function descriptions. To explore the pattern of missing values in the data, a heatmap is plotted indicating whether values are missing (0) or not (1). Only proteins with at least one missing value are visualized. ``` {r plot_missval, fig.height = 4, fig.width = 3} # Plot a heatmap of proteins with missing values plot_missval(data_filt) ``` This heatmap indicates that missing values are highly biased to specific samples. The example dataset is an affinity enrichment dataset of ubiquitin interactors, which is likely to have proteins which are below the detection limit in specific samples. These can be proteins that are specifically enriched in the ubiquitin purifications, but are not enriched in the controls samples, or vice versa. To check whether missing values are biased to lower intense proteins, the densities and cumulative fractions are plotted for proteins with and without missing values. ``` {r plot_detect, fig.height = 4, fig.width = 4} # Plot intensity distributions and cumulative fraction of proteins with and without missing values plot_detect(data_filt) ``` Indeed the proteins with missing values have on average low intensities. This data (MNAR and close to the detection limit) should be imputed by a left-censored imputation method, such as the quantile regression-based left-censored function ("QRILC") or random draws from a left-shifted distribution ("MinProb" and "man"). In contrast, MAR data should be imputed with methods such as k-nearest neighbor ("knn") or maximum likelihood ("MLE") functions. See the `r Biocpkg("MSnbase") ` vignette and more specifically the _impute_ function description for more information. ``` {r impute, results = "hide", message = FALSE, warning = FALSE, error = TRUE} # All possible imputation methods are printed in an error, if an invalid function name is given. impute(data_norm, fun = "") # Impute missing data using random draws from a Gaussian distribution centered around a minimal value (for MNAR) data_imp <- impute(data_norm, fun = "MinProb", q = 0.01) # Impute missing data using random draws from a manually defined left-shifted Gaussian distribution (for MNAR) data_imp_man <- impute(data_norm, fun = "man", shift = 1.8, scale = 0.3) # Impute missing data using the k-nearest neighbour approach (for MAR) data_imp_knn <- impute(data_norm, fun = "knn", rowmax = 0.9) ``` The effect of the imputation on the distributions can be visualized. ``` {r plot_imp, fig.width = 4, fig.height = 4} # Plot intensity distributions before and after imputation plot_imputation(data_norm, data_imp) ``` ## Differential enrichment analysis Protein-wise linear models combined with empirical Bayes statistics are used for the differential enrichment analysis (or differential expression analysis). The _test_diff_ function introduced here uses `r Biocpkg("limma") ` and automatically generates the contrasts to be tested. For the contrasts generation, the control sample has to be specified. Additionally, the types of contrasts to be produced need to be indicated, allowing the generation of all possible comparisons ("all") or the generation of contrasts of every sample versus control ("control"). Alternatively, the user can manually specify the contrasts to be tested (type = "manual"), which need to be specified in the argument _test_. ``` {r statistics} # Differential enrichment analysis based on linear models and empherical Bayes statistics # Test every sample versus control data_diff <- test_diff(data_imp, type = "control", control = "Ctrl") # Test all possible comparisons of samples data_diff_all_contrasts <- test_diff(data_imp, type = "all") # Test manually defined comparisons data_diff_manual <- test_diff(data_imp, type = "manual", test = c("Ubi4_vs_Ctrl", "Ubi6_vs_Ctrl")) ``` Finally, significant proteins are defined by user-defined cutoffs using _add_rejections_. ``` {r add_reject} # Denote significant proteins based on user defined cutoffs dep <- add_rejections(data_diff, alpha = 0.05, lfc = log2(1.5)) ``` ## Visualization of the results The results from the previous analysis can be easily visualized by a number of functions. These visualizations assist in the determination of the optimal cutoffs to be used, highlight the most interesting samples and contrasts, and pinpoint differentially enriched/expressed proteins. ### PCA plot The PCA plot can be used to get a high-level overview of the data. This can be very useful to observe batch effects, such as clear differences between replicates. ``` {r pca, fig.height = 3, fig.width = 4} # Plot the first and second principal components plot_pca(dep, x = 1, y = 2, n = 500, point_size = 4) ``` ### Correlation matrix A correlation matrix can be plotted as a heatmap, to visualize the Pearson correlations between the different samples. ``` {r corr, fig.height = 3, fig.width = 5} # Plot the Pearson correlation matrix plot_cor(dep, significant = TRUE, lower = 0, upper = 1, pal = "Reds") ``` ### Heatmap of all significant proteins The heatmap representation gives an overview of all significant proteins (rows) in all samples (columns). This allows to see general trends, for example if one sample or replicate is really different compared to the others. Additionally, the clustering of samples (columns) can indicate closer related samples and clustering of proteins (rows) indicates similarly behaving proteins. The proteins can be clustered by k-means clustering (_kmeans_ argument) and the number of clusters can be defined by argument _k_. ``` {r heatmap, fig.height = 5, fig.width = 3} # Plot a heatmap of all significant proteins with the data centered per protein plot_heatmap(dep, type = "centered", kmeans = TRUE, k = 6, col_limit = 4, show_row_names = FALSE, indicate = c("condition", "replicate")) ``` The heatmap shows a clustering of replicates and indicates that 4Ubi and 6Ubi enrich a similar repertoire of proteins. The k-means clustering of proteins (general clusters of rows) nicely separates protein classes with different binding behaviors. *** Alternatively, a heatmap can be plotted using the contrasts, i.e. the direct sample comparisons, as columns. Here, this emphasises the enrichment of ubiquitin interactors compared to the control sample. ``` {r heatmap2, fig.height = 5, fig.width = 3} # Plot a heatmap of all significant proteins (rows) and the tested contrasts (columns) plot_heatmap(dep, type = "contrast", kmeans = TRUE, k = 6, col_limit = 10, show_row_names = FALSE) ``` ### Volcano plots of specific contrasts Volcano plots can be used to visualize a specific contrast (comparison between two samples). This allows to inspect the enrichment of proteins between the two samples (x axis) and their corresponding adjusted p value (y axis). The _add_names_ argument can be set to FALSE if the protein labels should be omitted, for example if there are too many names. ``` {r volcano, fig.height = 5, fig.width = 5} # Plot a volcano plot for the contrast "Ubi6 vs Ctrl"" plot_volcano(dep, contrast = "Ubi6_vs_Ctrl", label_size = 2, add_names = TRUE) ``` ### Barplots of a protein of interest It can also be useful to plot the data of a single protein, for example if this protein is of special interest. ``` {r bar, fig.height = 3.5, fig.width = 3.5} # Plot a barplot for USP15 and IKBKG plot_single(dep, proteins = c("USP15", "IKBKG")) # Plot a barplot for the protein USP15 with the data centered plot_single(dep, proteins = "USP15", type = "centered") ``` ### Frequency plot of significant proteins and overlap of conditions Proteins can be differentially enriched/expressed in multiple comparisons. To visualize the distribution of significant conditions per protein and the overlap between conditions, the _plot_cond_ function can be used. ``` {r overlap, fig.height = 4, fig.width = 6} # Plot a frequency plot of significant proteins for the different conditions plot_cond(dep) ``` ## Results table To extract a table containing the essential results, the _get_results_ function can be used. ``` {r results_table} # Generate a results table data_results <- get_results(dep) # Number of significant proteins data_results %>% filter(significant) %>% nrow() ``` The resulting table contains the following columns: ``` {r results_table2} # Column names of the results table colnames(data_results) ``` Of these columns, the __p.val__ and __p.adj__ columns contain the raw and adjusted p values, respectively, for the contrast as depicted in the column name. The __ratio__ columns contain the average log2 fold changes. The __significant__ columns indicate whether the protein is differentially enriched/expressed, as defined by the chosen cutoffs. The __centered__ columns contain the average log2 fold changes scaled by protein-wise centering. ## Generate a data.frame from the resulting SummarizedExperiment object You might want to obtain an ordinary data.frame of the results. For this purpose, the package provides functions to convert SummarizedExperiment objects to data.frames. _get_df_wide_ will generate a wide table, whereas _get_df_long_ will generate a long table. ``` {r get_df} # Generate a wide data.frame df_wide <- get_df_wide(dep) # Generate a long data.frame df_long <- get_df_long(dep) ``` ## Save your results object for reuse To facilitate future analysis and/or visualization of your current data, saving your analyzed data is highly recommended. We save the final data object (__dep__) as well as intermediates of the analysis, i.e. the initial SummarizedExperiment object (__data_se__), normalized data (__data_norm__), imputed data (__data_imp__) and differentially expression analyzed data (__data_diff__). This allows us to easily change parameters in future analysis. ``` {r save_load, eval = FALSE} # Save analyzed data save(data_se, data_norm, data_imp, data_diff, dep, file = "data.RData") # These data can be loaded in future R sessions using this command load("data.RData") ``` # Workflow functions for the entire analysis The package contains workflow functions that entail the complete analysis and generate a report. ## LFQ-based DEP analysis Differential enrichment analysis of label-free proteomics data can be performed using the _LFQ_ workflow function. ``` {r LFQ, message = FALSE} # The data is provided with the package data <- UbiLength experimental_design <- UbiLength_ExpDesign # The wrapper function performs the full analysis data_results <- LFQ(data, experimental_design, fun = "MinProb", type = "control", control = "Ctrl", alpha = 0.05, lfc = 1) ``` This wrapper produces a list of objects, which can be used to create a report and/or for further analysis. The _report_ function produces two reports (pdf and html), a results table and a RData object, which are saved in a generated "Report" folder. ``` {r report2, eval = FALSE} # Make a markdown report and save the results report(data_results) ``` The results generated by the LFQ function contain, among others, the _results_ object (data.frame object) and the _dep_ object (SummarizedExperiment object). ``` {r report1} # See all objects saved within the results object names(data_results) ``` The _results_ object contains the essential results and the _dep_ object contains the full SummarizedExperiment object. The results table can be explored by selecting the _$results_ object ``` {r LFQ_results3} # Extract the results table results_table <- data_results$results # Number of significant proteins results_table %>% filter(significant) %>% nrow() ``` The full data (_dep_ object) can be used for the plotting functions as described in the chapter ["Visualization of the results"](#visualization-of-the-results), for example a heatmap. ``` {r LFQ_results4, fig.height = 5, fig.width = 3} # Extract the sign object full_data <- data_results$dep # Use the full data to generate a heatmap plot_heatmap(full_data, type = "contrast", kmeans = TRUE, k = 6, col_limit = 4, show_row_names = FALSE) ``` ## TMT-based DEP analysis Differential enrichment analysis of Tandem-Mass-Tag labeled proteomics data is also supported. Protein results tables from [IsobarQuant](http://www.nature.com/nprot/journal/v10/n10/full/nprot.2015.101.html) can be directly analyzed using the _TMT_ wrapper function. ``` {r TMT, eval = FALSE} # Need example data TMTdata <- example_data Exp_Design <- example_Exp_Design # The wrapper function performs the full analysis TMTdata_results <- TMT(TMTdata, expdesign = Exp_Design, fun = "MinProb", type = "control", control = "Control", alpha = 0.05, lfc = 1) ``` # Session information ``` {r session_info, echo = FALSE} sessionInfo() ```