--- title: "Outlier Analysis using blacksheepr" author: "MacIntosh Cornwell" date: "`r format(Sys.Date(), '%m/%d/%Y')`" package: blacksheepr output: BiocStyle::html_document: toc: true abstract: > blacksheep is used to perform extreme value analysis in the context of differential comparison between two populations. The basic mechanism is to test the proportions of outliers between the two populations, and assess for statistical difference between the proportions of outliers. vignette: > %\VignetteIndexEntry{Outlier Analysis using blacksheepr - Phosphoprotein} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Introduction Outlier analysis is a well established method for exploring extreme values in the context of the rest of the population. In biological contexts, exploring the shift in proportion of these outliers can elucidate differences between subpopulations, suggesting potential differential characteristics that can be further explored. blacksheep is a project that was developed in an effort to refine this analysis to a functional tool for outlier analysis in the context of biological data. This data can take many forms: protein, phosphoprotein, RNA, CNA, etc. The input for blacksheep is count data of some form, and annotation data indicative of the populations for comparison. The primary function call is `deva` (Differential Extreme Value Analysis). This vignette will run through a few standard use-cases, illustrating the functionality of blacksheep and hopefully answering questions and showing its usefulness in biological exploration. ## Installation Installation is similar to all Bioconductor packages. Start R and run the following lines to install: ```{r install package, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("blacksheepr") ``` Loading the library is done by using the `library` call ```{r library call} library(blacksheepr) ``` ## Input Data ### Count data Count data should be a table of counts with samples along the x-axis, and features along the y-axis, with the labels being rownames and colnames. ```{r countdata example} data("sample_phosphodata") sample_phosphodata[1:5,1:5] ``` ### Annotation data The Annotation data should be a table with the same samples included in your count data, listed along the y-axis. Each column will then be a comparison to perform, with the values of the column being some form of binary system indicating the samples belonging to each sub group. NOTE - that these should be strings. It's more informative if your columns actually contain useful information such as "high"/"low", "mutant"/"WT", etc. as oppposed to "1/0" or any other non-descriptive binary. ```{r annotation example} data("sample_annotationdata") sample_annotationdata[1:5,] ``` ### SummarizedExperiment data blacksheepr takes in a SummarizedExperiment object into the main `deva` function call (more on this below). blacksheepr can therefore start with any SummarizedExperiment object, as long as the colData of the SummarizedExperiment contains rownames that are the same as the column names in the data as described above, and the metadata is made up of BINARY classifications. NOTE that this may take some additional formatting. The Appendix has a note on helper functions that can assist with this formatting. # Example Workflow - Phosphoprotein In the following section - we will go through an example of using outlier analysis using Phosphoprotein data. The inputted data is being supplied from [Github](https://github.com/ruggleslab/blacksheep_supp/tree/master) and is originally sourced using breast cancer data from TCGA and CPTAC. [doi:10.1038/nature18003](https://www.nature.com/articles/nature18003) ## Input Data ### Read in Annotation Read in the Annotation table. ```{r read in data - phospho, echo = TRUE} data(sample_annotationdata) comptable <- sample_annotationdata comptable[1:5,] dim(comptable) ``` ### Read in the phospho data Next we need to have the actual count data that we are analyzing. Note that this data is made up of features - proteins - that then have a number of subfeatures - phosphorylation sites. This aspect of the data will directly relate to later analysis, as blacksheep will AGGREGATE on this data and collapse the data onto the primary feature. More on this later. Note also that this data has been prenormalized. For more on the processing of data to input into `deva`, refer to the appendix. ```{r} data(sample_phosphodata) phosphotable <- sample_phosphodata phosphotable[1:5,1:5] dim(phosphotable) ``` ### Creating a SummarizedExperiment from our data. blacksheep - as a part of the Bioconductor universe - starts from a single object that contains the assay of interest, and the underlying metadata. A `SummarizedExperiment` object is easy to create - and helps ensure that there is no misalignment between the data and the metadata associated with each sample. NOTE - as demonstrated below that the count data must be formatted as a MATRIX and the annotation data must be formatted as a DATAFRAME. ```{r summarized experiment} suppressPackageStartupMessages(library(SummarizedExperiment)) blacksheep_SE <- SummarizedExperiment( assays=list(counts=as.matrix(phosphotable)), colData=DataFrame(comptable)) blacksheep_SE ``` ## Running deva (differential extreme value analysis) The `deva` function has a number of steps to it that are individually described below. Note though that the individual steps only need to be used for specific query or alteration. In the general case, the `deva` function on its own should be sufficient for the desired analysis. ```{r deva, fig.keep="none"} deva_out <- deva(se = blacksheep_SE, analyze_negative_outliers = FALSE, aggregate_features = TRUE, feature_delineator = "-", fraction_samples_cutoff = 0.3, fdrcutoffvalue = 0.1) ``` ### deva parameters * se - The SummarizedExperiment that will serve as the input for deva * analyze_negative_outliers - Is the analysis looking at positive or negative outliers. * aggregate_features - Should the analysis collapse features to a primary term. Useful for phosphosites on proteins, isotypes of proteins, etc. * feature_delineator - The character marker that delineates the primary feature when using the parameter * fraction_samples_cutoff - how many samples in a group must contain an outlier for the feature to be considered significant. This parameter is used to avoid overbiased results from single samples. (see appendix) * fdrcutoffvalue - the FDR value to be considered for significant output. ## Exploring deva output The output from blacksheep are lists with the desired results. There is an `outlier_analysis` and a `significant_heatmaps` for the desired analysis. In this example, we only ran analysis for positive outliers. So the output will be 2 items long - heatmaps and tables for the positive outlier analysis. ```{r} names(deva_out) ``` The `outlier_analysis` is a nested list of analyses - with one anaysis per comparison columns ```{r} names(deva_out$pos_outlier_analysis) ``` The `significant_heatmaps` section is a nested list of heatmap obejcts, again with one anaysis per comparison columns ```{r} names(deva_out$significant_pos_heatmaps) ``` ### The deva_results() function `deva_results()` is a utility function to help explore your data. If you use `deva_results` on the `deva_out` object, it will return a list of the performed analyses that returned significant results. ```{r} deva_results(deva_out) ``` The other parameters for deva_out are `ID` and `type`. `ID` is a keyword to grab the desired analysis. `type` is one of the following options: `table` or `heatmap` specifies which analysis object you want to grab, followed by the ID of the specific analysis. `fraction_table` is the outputted table that shows the fraction of outliers per sample per feature (NOTE that this will be the same as the outlier table if no aggregation was done). `median` and `boundary` return lists for the median value, and the outlier boundary value for each feature. ```{r} subanalysis_Her2 <- deva_results(deva_out, ID = "Her2", type = "table") head(subanalysis_Her2) ``` ### Results For each column of your comparison table that you put in (occupies `colData` in a SummarizedExperiment object), `deva` will output a table of analysis if there were any applicable results. One example of a table is as follows: ```{r} subanalysis_Her2 <- deva_results(deva_out, ID = "Her2", type = "table") head(subanalysis_Her2) ``` The output in order is: * col1 - the list of genes * col2-3 - the pvalue of the gene being significantly different, the column it appears in is indicative of the measure of the gene being higher in group1 or group2. * col4-5 - the FDR value for that pvalue * col6-9 - the raw data behind the statisical test, showing how many outliers and nonoutliers there are for each group. The heatmaps serve as a snapshot of the significant genes that met the parameterized fdr cutoff in the `deva` function call. They can be accessed in a similar manner to the analysis tables. NOTE though that there will only be a heatmap if there were ANY significant genes. If there were no genes that met the fdr cut off - then there will be no heatmap generated. ```{r, fig.width = 8, fig.height = 8} subanalysis_Her2_HM <- deva_results(deva_out, ID = "Her2", type = "heatmap") subanalysis_Her2_HM ``` The heatmap objects themselves are [ComplexHeatmap](https:// www.bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html) objects - and will be drawn when called, either in the default Rplot, or can be saved out to an external file. ```{r, eval = FALSE} ## NOT RUN ## To output separately to pdf pdf("outfile.pdf") draw(subanalysis_Her2_HM) dev.off() ``` # Piecewise analysis *THIS SECTION ASSUMES THAT THE PREVIOUS STEPS FOR READING IN THE DATA AND ANNOTATIONS HAS BEEN COMPLETED* The next section demonstrates the individual steps of deva. NOTE that the user may *never* need to call the specific steps, but if specific tweaks are needed, or if an intermediate step needs to be extracted, then this workflow can be followed to see how the analysis is generated. ## Create groupings Create the subgroups based on your metadata. Note that the `comparison_groupings` function creates groups by going through the comparison columns, and creating the first subgroup for all of the comparisons, and then creates the second subgroup for all of the comparisons. The order depends on the first subcategory encounted moving down the column. This order will matter later on when you look at comparisons, but this information will be contained in the ouput table to avoid confusion, more on this later. ```{r groupings - phospho} groupings <- comparison_groupings(comptable) ## Print out the first 6 samples in each of our first 5 groupings lapply(groupings, head)[1:5] ``` ## Make Outlier table The next function `make_outlier_table` will take in the countdata and output a table that has been converted to show outliers. A value of 0 means that it was not an outlier, 1 means it was an outlier in the positive direction, and if the parameter is set to analyze negative outliers, then -1 means an outlier in the negative direction. The output from this function is a list of objects, that depends on the input and the specified parameters. Namely - it will output a $upperboundtab only if the parameter to analyze_negative_outliers is turned OFF, and $lowerboundtab if the analyze_negative_outliers parameter is turned ON ```{r make outlier table - phospho} ## Perform the function reftable_function_out <- make_outlier_table(phosphotable, analyze_negative_outliers = FALSE) ## See the names of the outputted objects names(reftable_function_out) ## Assign them to individual variables outliertab <- reftable_function_out$outliertab upperboundtab <- reftable_function_out$upperboundtab sampmedtab <- reftable_function_out$sampmedtab ## Note we will only use the outlier table - which looks like this now outliertab[1:5,1:5] ``` ## Tabulate Outliers For each of our groups, run through the outlier table and count up the total number of outliers and nonoutliers. For phospho (And this example) we are going to use the AGGREGATION FUNCTION to aggregate our counts together on a per protein basis The output from this function is a list of objects, that depends on the input and the specified parameters. For Phosphoprotein - you can have data that has several phospho sites per protein. As a part of this analysis, one option is to aggregate data on that protein - and collapse the outlier information into a single feature. Turning `aggregate_features = TRUE` will perform this function, and the `feature_delineator` is the character string to collapse on ex) Feature1-1 and Feature1-2-1 with the delineator of "-" will collapse onto Feature 1 The output with `aggregate_features = TRUE` will contain two additional objects. It will output the normal outliertable, and boundary tables, and also the "aggoutlierstab" and "fractiontab" The "aggoutlierstab" will be the collapsed table, summing up the number of outliers per feature The "fractiontab" returns the % outliers per feature per sample, given available information - this will be IMPORTANT FOR FURTHER FILTERING later on. ex) 1,0,0 >> 1/3 ex) 1,0,NA >> 1/2 ```{r groupingtablist - phospho} count_outliers_out <- count_outliers(groupings, outliertab, aggregate_features = TRUE, feature_delineator = "-") grouptablist <- count_outliers_out$grouptablist aggoutliertab <- count_outliers_out$aggoutliertab fractiontab <- count_outliers_out$fractiontab names(grouptablist) ``` Each tabulated table has the feature counts, and the stored infor the samples that went into the count ```{r} names(grouptablist$PAM50_Her2__Her2) ``` Example of what the feature counts look like: ```{r} head(grouptablist$PAM50_Her2__Her2$feature_counts) ``` Example of what the samples are that went into the analysis: ```{r} grouptablist$PAM50_Her2__Her2$samples ``` ## Run Outlier Analysis With the tabulated tables, run the outlier analysis to look for enrichment of outliers between groups. NOTE that this function has a functionality built in to write out tables to the external file, we will not use this parameter now, but it can be set by turning the to parameter `write_out_tables = TRUE` In this outlier analysis - we will have an additional filter included. The `fractiontab` outputted in the previous step has a metric that measure the number of samples in each group that have an outlier in them. In this next step we will use this information to filter for features that only have at least `0.3` (30%) of samples with an outlier. This filter is important because our aggregation step collapsed all of the sites down from each sample, and then we counted the total. If one sample had ALL of its sites as outliers - while interesting - this does not indicate the our entire ingroup has an overrepresentation - just that one sample. This filter enables a clearer analysis by only picking out features that have multiple samples with outliers. NOTE that this can be omitted by just leaving the `fraction_table` parameter out or as a `NULL` ```{r outlier analysis - phospho} outlier_analysis_out <- outlier_analysis(grouptablist = grouptablist, fraction_table = fractiontab, fraction_samples_cutoff = 0.3) names(outlier_analysis_out) head(outlier_analysis_out$ outlieranalysis_for_PAM50_Her2__Her2_vs_PAM50_Her2__not_Her2) ``` ## Plot Results using Heatmap Generating Function After you have your results, its useful to have a snapshot of your results in a figure. blacksheepr includes a utility function to output a heatmap with custom annotations and data. Use the plotting function with the original annotation data, and populate the heatmap with whatever information you want to represent. In this case, we are going to populate the table with the fractions of outliers per feature. The outlier_analysis object is used to select the differential genes. NOTE that you can write out the plot directly in the function using the given parameter `write_out_plot` or the saved object from the function is a heatmap, so you can open your own pdf and print out using the commented out code below. NOTE that the metatable must be in the ORDER desired, and that this function will NOT order it for you. ```{r heatmap plotting - phospho, fig.keep="none"} plottable <- comptable[do.call(order, c(decreasing = TRUE, data.frame(comptable[,1:ncol(comptable)]))),] hm1 <- outlier_heatmap(outlier_analysis_out = outlier_analysis_out, counttab = fractiontab, metatable = plottable, fdrcutoffvalue = 0.1) ## To output heatmap to pdf outside of the function #pdf(paste0(outfilepath, "test_hm1.pdf")) #hm1 #junk<-dev.off() ``` ```{r hm, fig.width = 8, fig.height = 8, fig.cap = "Example outputted Heatmap"} hm1$print_outlieranalysis_for_PAM50_Her2__Her2_vs_PAM50_Her2__not_Her2 ``` # Appendix ## Formatting your annotations table Formatting your annotation table is a crucial step to making sure blacksheepr runs smoothly. The key points are that that the rownames of your annotations EXACTLY match the column names of the assay data, and that the annotation columns are BINARY and therefore have only 2 subcategories per column. ### Using the make_comparison_columns utility function There is a built in utility function `make_comparison_columns` to help turn a multifactor column into several binary columns. The user can input as many columns as they want, and the function will output a table with each multifactorial column turned into a number of binary columns ``` {r format annotation data2 - phospho} dummyannotations <- data.frame(comp1 = c(1,1,2,2,3,3), comp2 = c("red", "blue", "red", "blue", "green", "green"), row.names = paste0("sample", seq_len(6))) dummyannotations ## Use the make_comparison_columns function to create binary columns expanded_dummyannotations <- make_comparison_columns(dummyannotations) expanded_dummyannotations ``` ## Running blacksheep functions for other -omics data blacksheepr is built as a generalized suite of tools with algorithms for use with any type of -omics data in the format of the previously described assay and annotations. Along those lines there are a couple common practices for other -omics data types. ### Running deva with RNAseq data The principles are the same, the only main difference is that you will not aggregate on to a primary feature, so the `aggregate_features` parameter should be left to the default `FALSE`. Also Note that if you want to output all genes regardless of significance, this can be accomplished by setting the `FDRcutoff` to 1, and the `fraction_value_cutoff` to 0. ## Processing data for running with deva `deva` at its core explores for outliers based on the difference from the median. Heavily skewed data needs to be to insure accurate analyses. Blacksheepr includes a normalization function that will perform both a Median of Ratios normalization in addition to a log2 transform. See below for an example using the `pasilla` data set: ``` {r normalize count data - phospho} library(pasilla) pasCts <- system.file("extdata", "pasilla_gene_counts.tsv", package="pasilla") cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id")) norm_cts <- deva_normalization(cts, method = "MoR-log") ```