--- title: "Missing value handling" author: "Arne Smits" date: "`r doc_date()`" package: "`r pkg_ver('DEP')`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{DEP: Missing value handling} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r required packages, echo = FALSE, warning=FALSE, results="hide"} suppressPackageStartupMessages({ library("BiocStyle") library("DEP") library("dplyr") library("tidyr") library("purrr") library("ggplot2") library("SummarizedExperiment") }) ``` # Goal of this vignette Proteomics data suffer from a high rate of missing values, which need to be accounted for. Different methods have been applied to deal with this issue, including multiple imputation methods (see for example [Lazar et al](http://pubs.acs.org/doi/10.1021/acs.jproteome.5b00981)). The different options to deal with missing values in __DEP__ are described in this vignette. # Simulate data To exemplify the missing value handling, we work with a simulated dataset. This is very useful, because we know the ground truth of our dataset, meaning we know which proteins are belonging to the background (null distribution) and which proteins are differentially expressed between the control and sample conditions. ## Generate intensity values We generate a dataset with 3300 proteins, of which 300 proteins are differentially expressed. The variables used to generate the data are depicted below. ```{r variables} # Variables replicates = 3 bg_proteins = 3000 DE_proteins = 300 log2_mean_bg = 27 log2_sd_bg = 2 log2_mean_DE_control = 25 log2_mean_DE_treatment = 30 log2_sd_DE = 2 ``` ```{r simulate_data} # Loading DEP and packages required for data handling library("DEP") library("dplyr") library("tidyr") library("purrr") library("ggplot2") library("SummarizedExperiment") # Background data (sampled from null distribution) sim_null <- data_frame( name = paste0("bg_", rep(1:bg_proteins, rep((2*replicates), bg_proteins))), ID = rep(1:bg_proteins, rep((2*replicates), bg_proteins)), var = rep(c("control_1", "control_2", "control_3", "treatment_1", "treatment_2", "treatment_3"), bg_proteins), val = 2^rnorm(2*replicates*bg_proteins, mean = log2_mean_bg, sd = log2_sd_bg)) # Data for DE proteins (sampled from alternative distribution) sim_diff <- rbind( data_frame( name = paste0("DE_", rep(1:DE_proteins, rep(replicates, DE_proteins))), ID = rep((bg_proteins+1):(bg_proteins+DE_proteins), rep(replicates, DE_proteins)), var = rep(c("control_1", "control_2", "control_3"), DE_proteins), val = 2^rnorm(replicates*DE_proteins, mean = log2_mean_DE_control, sd = log2_sd_DE)), data_frame( name = paste0("DE_", rep(1:DE_proteins, rep(replicates, DE_proteins))), ID = rep((bg_proteins+1):(bg_proteins+DE_proteins), rep(replicates, DE_proteins)), var = rep(c("treatment_1", "treatment_2", "treatment_3"), DE_proteins), val = 2^rnorm(replicates*DE_proteins, mean = log2_mean_DE_treatment, sd = log2_sd_DE))) # Combine null and DE data sim <- rbind(sim_null, sim_diff) %>% spread(var, val) %>% arrange(ID) # Generate experimental design experimental_design <- data_frame( label = colnames(sim)[!colnames(sim) %in% c("name", "ID")], condition = c(rep("control", replicates), rep("treatment", replicates)), replicate = rep(1:replicates, 2)) ``` ## Introduce missing values Data can be missing at random (MAR) or missing not at random (MNAR). MAR means that values are randomly missing from all samples. In the case of MNAR, values are missing in specific samples and/or for specific proteins. For example, certain proteins might not be quantified in specific conditions, because they are below the detection limit in these specific samples. To mimick these two types of missing values, we introduce missing values randomly over all data points (MAR) and we introduce missing values in the control samples of 100 differentially expressed proteins (MNAR). The variables used to introduce missing values are depicted below. ```{r missing_values_variables} # Variables MAR_fraction = 0.05 MNAR_proteins = 100 ``` ```{r add_missing_values} # Generate a MAR matrix MAR_matrix <- matrix( data = sample(c(TRUE, FALSE), size = 2*replicates*(bg_proteins+DE_proteins), replace = TRUE, prob = c(MAR_fraction, 1-MAR_fraction)), nrow = bg_proteins+DE_proteins, ncol = 2*replicates) # Introduce missing values at random (MAR) controls <- grep("control", colnames(sim)) treatments <- grep("treatment", colnames(sim)) sim[, c(controls, treatments)][MAR_matrix] <- 0 sim$MAR <- apply(MAR_matrix, 1, any) # Introduce missing values not at random (MNAR) DE_protein_IDs <- grep("DE", sim$name) sim[DE_protein_IDs[1:MNAR_proteins], controls] <- 0 sim$MNAR <- FALSE sim$MNAR[DE_protein_IDs[1:MNAR_proteins]] <- TRUE ``` ## Generate a SummarizedExperiment The data is stored in a SummarizedExperiment, as described in the 'Introduction to DEP' vignette. ```{r generate_SE} # Generate a SummarizedExperiment object sim_unique_names <- make_unique(sim, "name", "ID", delim = ";") se <- make_se(sim_unique_names, c(controls, treatments), experimental_design) ``` # Filter proteins based on missing values A first consideration with missing values is whether or not to filter out proteins with too many missing values. ## Visualize the extend of missing values The number of proteins quantified over the samples can be visualized to investigate the extend of missing values. ```{r plot_data_noFilt, fig.width = 4, fig.height = 4} # Plot a barplot of the protein quantification overlap between samples plot_frequency(se) ``` Many proteins are quantified in all six samples and only a small subset of proteins were detected in less than half of the samples. ## Filter options We can choose to not filter out any proteins at all, filter for only the proteins without missing values, filter for proteins with a certain fraction of quantified samples, and for proteins that are quantified in all replicates of at least one condition. ```{r filter_proteins} # No filtering no_filter <- se # Filter for proteins that are quantified in all replicates of at least one condition condition_filter <- filter_proteins(se, "condition", thr = 0) # Filter for proteins that have no missing values complete_cases <- filter_proteins(se, "complete") # Filter for proteins that are quantified in at least 2/3 of the samples. frac_filtered <- filter_proteins(se, "fraction", min = 0.66) ``` To check the consequences of filtering, we calculate the number of background and DE proteins left in the dataset. ```{r filter_consequences} # Function to extract number of proteins number_prots <- function(se) { names <- rownames(get(se)) data_frame(Dataset = se, bg_proteins = sum(grepl("bg", names)), DE_proteins = sum(grepl("DE", names))) } # Number of bg and DE proteins still included objects <- c("no_filter", "condition_filter", "complete_cases", "frac_filtered") map_df(objects, number_prots) ``` For our simulated dataset, filtering for complete cases drastically reduces the number of proteins, only keeping half of the DE proteins. Filtering protein to be quantified in at least 2/3 of the samples, keeps many more DE proteins and filters out all proteins with values MNAR. Filtering for proteins quantified in all replicates of at least one condition only filters out a minimal amount of DE proteins. ## Scaling and variance stabilization The data is scaled and variance stabilized using `r Biocpkg("vsn") `. ```{r scale_vst, message = FALSE} # Scale and variance stabilize no_filter <- normalize_vsn(se) condition_filter <- normalize_vsn(condition_filter) complete_cases <- normalize_vsn(complete_cases) frac_filtered <- normalize_vsn(frac_filtered) ``` ```{r plot_norm, fig.width = 4, fig.height = 5} # Mean versus Sd plot meanSdPlot(no_filter) ``` # Data imputation of missing data A second important consideration with missing values is whether or not to impute the missing values. MAR and MNAR (see [Introduce missing values](#introduce-missing-values)) require different imputation methods. See the `r Biocpkg("MSnbase") ` vignette and more specifically the _impute_ function descriptions for detailed information. ## Explore the pattern of missing values To explore the pattern of missing values in the data, a heatmap can be plotted indicating whether values are missing (0) or not (1). Only proteins with at least one missing value are visualized. ```{r plot_missval, fig.height = 4, fig.width = 3} # Plot a heatmap of proteins with missing values plot_missval(no_filter) ``` The missing values seem to be randomly distributed across the samples (MAR). However, we do note a block of values that are missing in all control samples (bottom left side of the heatmap). These proteins might have missing values not at random (MNAR). To check whether missing values are biased to lower intense proteins, the densities and cumulative fractions are plotted for proteins with and without missing values. ``` {r plot_detect, fig.height = 4, fig.width = 4} # Plot intensity distributions and cumulative fraction of proteins # with and without missing values plot_detect(no_filter) ``` In our example data, there is no clear difference between the two distributions. ## Imputation options DEP borrows the imputation functions from `r Biocpkg("MSnbase") `. See the `r Biocpkg("MSnbase") ` vignette and more specifically the _impute_ function description for more information on the imputation methods. ```{r imputation_methods, error = TRUE} # All possible imputation methods are printed in an error, if an invalid function name is given. impute(no_filter, fun = "") ``` Some examples of imputation methods. ```{r impute, results = "hide", message = FALSE, warning = FALSE} # No imputation no_imputation <- no_filter # Impute missing data using random draws from a # Gaussian distribution centered around a minimal value (for MNAR) MinProb_imputation <- impute(no_filter, fun = "MinProb", q = 0.01) # Impute missing data using random draws from a # manually defined left-shifted Gaussian distribution (for MNAR) manual_imputation <- impute(no_filter, fun = "man", shift = 1.8, scale = 0.3) # Impute missing data using the k-nearest neighbour approach (for MAR) knn_imputation <- impute(no_filter, fun = "knn", rowmax = 0.9) ``` The effect of data imputation on the distributions can be visualized. ```{r plot_imp, fig.width = 5, fig.height = 7} # Plot intensity distributions before and after imputation plot_imputation(no_filter, MinProb_imputation, manual_imputation, knn_imputation) ``` ## Advanced imputation methods ### Mixed imputation on proteins (rows) One can also perform a mixed imputation on the proteins, which uses a MAR and MNAR imputation method on different subsets of proteins. First, we have to define a logical vector defining the rows that are to be imputed with the MAR method. Here, we consider a protein to have missing values not at random (MNAR) if it has missing values in all replicates of at least one condition. ```{r mixed_impuation, results = "hide", message = FALSE, warning = FALSE} # Extract protein names with missing values # in all replicates of at least one condition proteins_MNAR <- get_df_long(no_filter) %>% group_by(name, condition) %>% summarize(NAs = all(is.na(intensity))) %>% filter(NAs) %>% pull(name) %>% unique() # Get a logical vector MNAR <- names(no_filter) %in% proteins_MNAR # Perform a mixed imputation mixed_imputation <- impute( no_filter, fun = "mixed", randna = !MNAR, # we have to define MAR which is the opposite of MNAR mar = "knn", # imputation function for MAR mnar = "zero") # imputation function for MNAR ``` ### Mixed imputation on samples (columns) Additionally, the imputation can also be performed on a subset of samples. To peform a sample specific imputation, we first need to transform our SummarizedExperiment into a MSnSet object. Subsequently, we imputed the controls using the "MinProb" method and the samples using the "knn" method. ```{r sample_specific_impuation, results = "hide", message = FALSE, warning = FALSE} # SummarizedExperiment to MSnSet object conversion sample_specific_imputation <- no_filter MSnSet <- as(sample_specific_imputation, "MSnSet") # Impute differently for two sets of samples MSnSet_imputed1 <- MSnbase::impute(MSnSet[, 1:3], method = "MinProb") MSnSet_imputed2 <- MSnbase::impute(MSnSet[, 4:6], method = "knn") # Combine into the SummarizedExperiment object assay(sample_specific_imputation) <- cbind( MSnbase::exprs(MSnSet_imputed1), MSnbase::exprs(MSnSet_imputed2)) ``` The effect of data imputation on the distributions can be visualized. ```{r plot_imp2, fig.width = 5, fig.height = 5} # Plot intensity distributions before and after imputation plot_imputation(no_filter, mixed_imputation, sample_specific_imputation) ``` # Test for differential expression We perform differential analysis on the different imputated data sets. The following datasets are compared: * No imputation * knn imputation * MinProb imputation * Mixed imputation ```{r DE_analysis, message = FALSE, warning = FALSE} # Function that wraps around test_diff, add_rejections and get_results functions DE_analysis <- function(se) { se %>% test_diff(., type = "control", control = "control") %>% add_rejections(., alpha = 0.1, lfc = 0) %>% get_results() } # DE analysis on no, knn, MinProb and mixed imputation no_imputation_results <- DE_analysis(no_imputation) knn_imputation_results <- DE_analysis(knn_imputation) MinProb_imputation_results <- DE_analysis(MinProb_imputation) mixed_imputation_results <- DE_analysis(mixed_imputation) ``` ## Number of identified differentially expressed proteins As an initial parameter we look at the number of differentially expressed proteins identified (adjusted P ≤ 0.05). ```{r rejections} # Function to extract number of DE proteins DE_prots <- function(results) { data_frame(Dataset = gsub("_results", "", results), significant_proteins = get(results) %>% filter(significant) %>% nrow()) } # Number of significant proteins objects <- c("no_imputation_results", "knn_imputation_results", "MinProb_imputation_results", "mixed_imputation_results") map_df(objects, DE_prots) ``` For our simulated dataset, no and knn imputation result in less identified differentially expressed proteins compared to the MinProb and mixed imputation. Mixed imputation results in the identification of the most differentially expressed proteins in our simulated dataset with many proteins missing values not at random. > Note that the performance of the different imputation methods is dataset-depedent. It is recommended to always carefully check the effect of filtering and data imputation on your results. ## ROC curves To further compare the results of the different imputation methods, ROC curves are plotted. ```{r plot_performance} # Function to obtain ROC data get_ROC_df <- function(results) { get(results) %>% select(name, treatment_vs_control_p.val, significant) %>% mutate( DE = grepl("DE", name), BG = grepl("bg", name)) %>% arrange(treatment_vs_control_p.val) %>% mutate( TPR = cumsum(as.numeric(DE)) / 300, FPR = cumsum(as.numeric(BG)) / 3000, method = results) } # Get ROC data for no, knn, MinProb and mixed imputation ROC_df <- map_df(objects, get_ROC_df) # Plot ROC curves ggplot(ROC_df, aes(FPR, TPR, col = method)) + geom_line() + theme_DEP1() + ggtitle("ROC-curve") # Plot ROC curves zoom ggplot(ROC_df, aes(FPR, TPR, col = method)) + geom_line() + theme_DEP1() + xlim(0, 0.1) + ggtitle("ROC-curve zoom") ``` The ROC curves also show that mixed imputation has the best performance for our simulated dataset. Because the performance of the different imputation methods is dataset-depedent, it is again recommended to carefully check the effect of filtering and imputation on your dataset. ## Differences in response To start the investigation of imputation method performance on the dataset, we ask ourselves the following questions: Which proteins are not identified as differentially expressed proteins in the datasets with no or knn imputation? And which proteins are specifically for mixed imputation? We look at both true and false positive hits as well as the missing values. ```{r differences_results} # Function to obtain summary data get_rejected_proteins <- function(results) { get(results) %>% filter(significant) %>% left_join(., select(sim, name, MAR, MNAR), by = "name") %>% mutate( DE = grepl("DE", name), BG = grepl("bg", name), method = results) } # Get summary data for no, knn, MinProb and mixed imputation objects <- c("no_imputation_results", "knn_imputation_results", "MinProb_imputation_results", "mixed_imputation_results") summary_df <- map_df(objects, get_rejected_proteins) # Plot number of DE proteins (True and False) summary_df %>% group_by(method) %>% summarize(TP = sum(DE), FP = sum(BG)) %>% gather(category, number, -method) %>% mutate(method = gsub("_results", "", method)) %>% ggplot(aes(method, number, fill = category)) + geom_col(position = position_dodge()) + theme_DEP2() + labs(title = "True and False Hits", x = "", y = "Number of DE proteins", fill = "False or True") ``` MinProb and mixed imputation identify many more truely differentially expressed proteins (TP) and only minimally increase the number of false positivies (FP) in our simulated dataset. ```{r plot_missval_histogram} # Plot number of DE proteins with missing values summary_df %>% group_by(method) %>% summarize(MNAR = sum(MNAR), MAR = sum(MAR)) %>% gather(category, number, -method) %>% mutate(method = gsub("_results", "", method)) %>% ggplot(aes(method, number, fill = category)) + geom_col(position = position_dodge()) + theme_DEP2() + labs(title = "Category of Missing Values", x = "", y = "Number of DE proteins", fill = "") ``` The gain of identification of truely differentially expressed proteins in MinProb and mixed imputation clearly comes from increased sensitivity for data MNAR. This is also expected as these imputation methods perform left-censored imputation, which is specific for data MNAR. # Session information ``` {r session_info, echo = FALSE} sessionInfo() ```