--- title: "Integrative analysis pipeline for pooled CRISPR functional genetic screens - MAGeCKFlute" author: "WubingZhang, Feizhen Wu, and Binbin Wang" date: "5 October 2018" package: "1.0.1" abstract: > Genome-wide CRISPR (clustered regularly interspaced short palindrome repeats) coupled with nuclease Cas9 (CRISPR/Cas9) screens represent a promising technology to systematically evaluate gene functions. Data analysis for CRISPR/Cas9 screens is a critical process that includes identifying screen hits and exploring biological functions for these hits in downstream analysis. We have previously developed two algorithms, MAGeCK and MAGeCK-VISPR, to analyze CRISPR/Cas9 screen data in various scenarios. These two algorithms allow users to perform quality control, read count generation and normalization, and calculate beta score to evaluate gene selection performance. In downstream analysis, biological functional analysis is required for understanding biological functions of these identified genes with different screening purposes. Here, We developed MAGeCKFlute for supporting downstream analysis, utilizing the data provided through MAGeCK and MAGeCK-VISPR. MAGeCKFlute provides several strategies to remove potential biases within sgRNA-level read counts and gene-level beta scores. The downstream analysis with the package includes identifying essential, non-essential, and target-associated genes, and performing biological functional category analysis and pathway enrichment analysis of these genes. The package also visualizes genes in the context of pathways to benefit users exploring screening data. Collectively, MAGeCKFlute enables accurate identification of essential, non-essential, and targeted genes, as well as their related biological functions. This vignette explains the use of the package and demonstrates typical workflows. MAGeCKFlute package version: `r packageVersion("MAGeCKFlute")` output: rmarkdown::html_document: highlight: pygments toc: true bibliography: library.bib vignette: > %\VignetteIndexEntry{MAGeCKFlute.Rmd} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, echo=FALSE, fig.height=6, fig.width=9, dpi=300} knitr::opts_chunk$set(tidy=FALSE, cache=TRUE, dev="png", message=FALSE, error=FALSE, warning=TRUE) ``` **Note:** if you use MAGeCKFlute in published research, please cite: ## How to get help for MAGeCKFlute Any and all MAGeCKFlute questions should be posted to the **Bioconductor support site**, which serves as a searchable knowledge base of questions and answers: Posting a question and tagging with "MAGeCKFlute" will automatically send an alert to the package authors to respond on the support site. See the first question in the list of [Frequently Asked Questions](#FAQ) (FAQ) for information about how to construct an informative post. You should **not** email your question to the package authors, as we will just reply that the question should be posted to the **Bioconductor support site**. ## Input data As input, the MAGeCKFlute package expects gene summary file as obtained by running commands `mageck test` or `mageck mle` in MAGeCK [@Wei2014] and MAGeCK-VISPR [@Wei2015], which are developed by our lab previously, to analyze CRISPR/Cas9 screen data in different scenarios[@Tim2014, @Hiroko2014, @Ophir2014, @Luke2014, @Silvana2015]. Both algorithms use negative binomial models to model the variances of sgRNAs, and use Robust Rank Aggregation (for MAGeCK) or maximum likelihood framework (for MAGeCK-VISPR) for a robust identification of selected genes. MAGeCK-MLE can be used to analyze CRISPR screen data from multi-conditioned experiments. MAGeCK-MLE also normalizes the data across multiple samples, making them comparable to each other. The most important ouput of MAGeCK MLE is “gene_summary” file, which includes the beta scores of multiple conditions and the associated statistics. The ‘beta score’ for each gene describes how the gene is selected: a positive beta score indicates a positive selection, and a negative beta score indicates a negative selection. MAGeCK RRA allows for the comparison between two experimental conditions. It can identify genes and sgRNAs are significantly selected between the two conditions. The most important output of MAGeCK RRA is the file “gene_summary.txt”. MAGeCK RRA will output both the negative score and positive score for each gene. A smaller score indicates higher gene importance. MAGeCK RRA will also output the statistical value for the scores of each gene. Genes that are significantly positively and negatively selected can be identified based on the p-value or FDR. ## Quick start Here we show the most basic steps for integrative analysis pipeline using `gene summary file`. Before using MAGeCKFlute, analysing CRISPR/Cas9 screen data using MAGeCK RRA (in MAGeCK [@Wei2014]) or MAGeCK MLE (in MAGeCK-VISPR [@Wei2015]) is neccessary, which result in the generation of the gene summary file. To run MAGeCKFlute pipeline, we need gene summary file generated by running MAGeCK RRA or MAGeCK MLE. MAGeCKFlute package provides two example data, one is `MLE_Data` and the other is `RRA_Data`. We will work with them in this document. **Downstream analysis pipeline for MAGeCK MLE** ```{r quickStart1, eval=FALSE} library(MAGeCKFlute) ##Load gene summary data in MAGeCK MLE results data("MLE_Data") ##Run the downstream analysis pipeline for MAGeCK MLE FluteMLE(MLE_Data, ctrlname=c("D7_R1", "D7_R2"), treatname=c("PLX7_R1","PLX7_R2"), prefix="BRAF_", organism="hsa") ``` All pipeline results are written into local directory "./BRAF_Flute_Results/", and all figures are integrated into file "BRAF_Flute.mle_summary.pdf". **Downstream analysis pipeline for MAGeCK RRA** ```{r quickStart2, eval=FALSE} ##Load gene summary data in MAGeCK RRA results data("RRA_Data") ##Run the downstream analysis pipeline for MAGeCK RRA FluteRRA(RRA_Data, prefix="BRAF", organism="hsa") ``` All pipeline results are written into local directory "./BRAF_Flute_Results/" too, and all figures are integrated into file "BRAF_Flute.rra_summary.pdf". ## Section I: Run pipeline from gene summary file generated by MAGeCK MLE ### Gene summary file As the input of MAGeCKFlute package, the gene summary file in MAGeCK-MLE results includes beta scores of all genes in multiple condition samples. Use function ‘data’ to load the dataset, and have a look at the file with a text editor to see how it is formatted. ```{r CheckMLERes} library(MAGeCKFlute) data("MLE_Data") head(MLE_Data) ``` ### Read beta scores Then, extract beta scores of control and treatment samples from the gene summary table(can be a file path of 'gene_summary' or data frame), and have a look at the beta score matrix. ```{r ReadBeta} data("MLE_Data") gene_summary = MLE_Data ctrlname = c("D7_R1", "D7_R2") treatname = c("PLX7_R1", "PLX7_R2") #Read beta scores from gene summary table in MAGeCK MLE results dd=ReadBeta(gene_summary, organism="hsa") head(dd) ``` ```{r} # Transform gene symbol to Entrez gene id dd$Gene = rownames(dd) tmp = TransGeneID(dd$Gene, "Symbol", "Entrez") idx = is.na(tmp) | duplicated(tmp) dd = dd[!idx,] rownames(dd) = tmp[!idx] head(dd) ``` ### Batch effect removal Is there batch effect? This is a common asked question before perform later analysis. In our package, we provide `HeatmapView` to ensure whether batch effect exists in data, and use `BatchRemove` to remove easily if same batch samples cluster together. ```{r BatchRemove, fig.height=6, fig.width=9} ##Before batch removal HeatmapView(dd[,c(ctrlname, treatname)]) batchMat = data.frame(samples = c(ctrlname, treatname), batch = c(1,2,1,2), cov = c(1,1,2,2)) dd1 = BatchRemove(dd[,c(ctrlname, treatname)], batchMat)$data ## After batch removal HeatmapView(dd1[,c(ctrlname, treatname)]) ``` ### Normalization of beta scores It is difficult to control all samples with a consistent cell cycle in an CRISPR screen experiment with multi conditions. Besides, beta score among different conditions with an inconsistent cell cycle are incomparable. So it is necessary to do the normalization when comparing the beta scores in different conditions. Essential genesare thosegenes that are indispensable for its survival. The effect generated by knocking out these genes in different cell types is consistent. Based on this, we developed cell cycle normalization method to shorten the gap of cell cycle in different conditions. In addition, a previous normalization method called loess normalization is available in this package.[@Laurent2004] ```{r NormalizeBeta} dd_essential = NormalizeBeta(dd, samples=c(ctrlname, treatname), method="cell_cycle") head(dd_essential) #OR dd_loess = NormalizeBeta(dd, samples=c(ctrlname, treatname), method="loess") head(dd_loess) ``` ### Distribution of all gene beta scores After normalization, the distribution of beta scores in different conditions should be similar. We can evaluate the distribution of beta scores using function ‘ViolinView’, ‘DensityView’, and ‘DensityDiffView’. ```{r DistributeBeta, fig.height=6, fig.width=9} ViolinView(dd_essential, samples=c(ctrlname, treatname), main="Cell cycle normalized") DensityView(dd_essential, samples=c(ctrlname, treatname), main="Cell cycle normalized") DensityDiffView(dd_essential, ctrlname, treatname, main="Cell cycle normalized") #we can also use function 'MAView' to evaluate the data quality of normalized #beta score profile. MAView(dd_essential, ctrlname, treatname, cex=1, main="Cell cycle normalized") ``` ### Estimate cell cycle time by linear fitting After normalization, the cell cycle time in different condition should be almost consistent. Here we use linear fitting to estimate the cell cycle time, and use function `CellCycleView` to view the cell cycle time of all samples. ```{r EstimateCellCycle, fig.height=6, fig.width=9} ##Fitting beta score of all genes CellCycleView(dd_essential, ctrlname, treatname, main="Cell cycle normalized") ``` ### Positive selection and negative selection The function `ScatterView` can group all genes into three groups, positive selection genes (GroupA), negative selection genes (GroupB), and others, and visualize these three grouped genes in scatter plot. We can also use function `RankView` to rank the beta score deviation between control and treatment and mark top selected genes in figure. ```{r selection, fig.height=6, fig.width=9} p1 = ScatterView(dd_essential, ctrlname, treatname, main="Cell cycle normalized") print(p1) ``` ```{r rank, fig.height=6, fig.width=9} ## Add column of 'diff' dd_essential$Control = rowMeans(dd_essential[,ctrlname]) dd_essential$Treatment = rowMeans(dd_essential[,treatname]) rankdata = dd_essential$Treatment - dd_essential$Control names(rankdata) = rownames(dd_essential) p2 = RankView(rankdata, main="Cell cycle normalized") print(p2) ``` ### Functional analysis of selected genes For gene set enrichment analysis, we provide five methods in this package, including "ORT"(Over-Representing Test [@Guangchuang2012]), "GSEA"(Gene Set Enrichment Analysis [@Aravind2005]), "DAVID" [@Huang2009], "GOstats" [@Falcon2007], and "HGT"(HyperGemetric test). And gene ontology(GO) term [@GO2014] and Kyoto encyclopedia of genes and genomes (KEGG) pathway [@Minoru2014] enrichment analysis are available in this package. The enrichment analysis can be done easily using function `enrichment_analysis`, which return a list containing `gridPlot` (ggplot object) and `enrichRes` (enrichResult instance). Alternatively, you can do enrichment analysis using function `enrich.ORT` for "ORT", `enrich.GSE` for GSEA, `enrich.DAVID` for DAVID, `enrich.GOstats` for GOstats, and `enrich.HGT` for "HGT", which return an enrichResult instance. Function `EnrichedView` and `EnrichedGSEView` (for `enrich.GSE`) can be used to generate `gridPlot` from `enrichRes`easily, as shown below. ```{r EnrichAB, fig.height=6, fig.width=9} ## Get information of positive and negative selection genes groupAB = p1$data ## select positive selection genes idx1=groupAB$group=="up" genes=rownames(groupAB)[idx1] geneList=groupAB$diff[idx1] names(geneList)=genes universe=rownames(groupAB) ## Do enrichment analysis using HGT method keggA = enrichment_analysis(geneList = genes, universe = universe, method = "HGT", type = "KEGG", organism = "human", plotTitle = "Positive selection") #same as kegg_A = enrich.HGT(genes, universe, type = "KEGG", organism = "human") keggA_grid = EnrichedView(kegg_A@result, plotTitle = "Positive selection") ## look at the results head(keggA$enrichRes@result) print(keggA$gridPlot) #should same as head(kegg_A@result) print(keggA_grid) ``` ```{r GSEA, fig.height=6, fig.width=9} ## Do enrichment analysis using GSEA method gseA = enrichment_analysis(geneList = geneList, method = "GSEA", type = "KEGG", organism="human", plotTitle="Positive selection") #same as gse_A = enrich.GSE(geneList, type = "KEGG", organism = "human") gseA_grid = EnrichedGSEView(gse_A@result, plotTitle = "Positive selection") head(gseA$enrichRes@result) print(gseA$gridPlot) #should same as head(gse_A@result) print(gseA_grid) ``` For enriched pathways, we can use function `KeggPathwayView` to visualize the beta score level in control and treatment on pathway map.[@Weijun2013] ```{r pathview, fig.height=10, fig.width=20} genedata = dd_essential[,c("Control","Treatment")] keggID = keggA$enrichRes@result$ID[1] #The pathway map will be located on current workspace KeggPathwayView(gene.data = genedata, pathway.id = keggID, species="hsa") ##Read the figure into R pngname=paste0(keggID, ".pathview.multi.png") grid.arrange(grid::rasterGrob(png::readPNG(pngname))) file.remove(paste0(keggID, c(".pathview.multi.png", ".png", ".xml"))) ``` ### Identify treatment-associated genes using 9-square model Considering the difference of beta scores in control and treatment sample, we developed a 9-square model, which group all genes into several subgroups. Among these subgroups, four subgroup genes are treatment-associated, which correspond to specific functions. Group1 and Group3 genes are not selected in control sample, while they are significantly selected in treatment sample, so they may related to drug resistance. Group2 and Group4 genes are selected in control, but they are not selected in treatment, so maybe these genes are associated with drug targets. In this package, use function `SquareView` can select these four treatment-associated subgroup genes and view them in 9-Square scatter plot. ```{r Square, fig.height=8, fig.width=9} p3 = SquareView(dd_essential, label = "Gene", main="Cell cycle normalized") print(p3) ``` ### Functional analysis for treatment-associated genes Same as section above. We can do enrichment analysis for treatment-associated genes. ```{r EnrichSquare, fig.height=5, fig.width=9} ##Get information of treatment-associated genes Square9 = p3$data ##==select group1 genes in 9-Square idx=Square9$group=="Group1" genes=rownames(Square9)[idx] universe=rownames(Square9) #====KEGG_enrichment===== kegg1=enrichment_analysis(geneList = genes, universe = universe, type = "KEGG", plotTitle = "KEGG: Group1") ## look at the results head(kegg1$enrichRes@result) print(kegg1$gridPlot) ``` Also, pathway visualization can be done using function `KeggPathwayView`, the same as section above. ```{r pathview2, eval=FALSE} genedata = dd_essential[, c("Control","Treatment")] keggID = kegg1$enrichRes@result$ID[1] KeggPathwayView(gene.data = genedata, pathway.id = keggID, species="hsa") ##Read the figure into R pngname=paste0(keggID, ".pathview.multi.png") grid.arrange(grid::rasterGrob(png::readPNG(pngname))) file.remove(paste0(keggID, c(".pathview.multi.png", ".png", ".xml"))) ``` ## Section II: Run pipeline from gene summary file generated by MAGeCK RRA ### Gene summary file For experiments with two experimental conditions, we recommend to use MAGeCK-RRA to identify essential genes from CRISPR/Cas9 knockout screens and tests the statistical significance of each observed change between two conditions. Gene summary file in MAGeCK-RRA results summarize the statistical significance of positive selection and negative selection. Use function ‘data’ to load the dataset, and have a look at the file with a text editor to see how it is formatted. ```{r CheckRRARes} data("RRA_Data") head(RRA_Data) ``` ### Read FDR of negative selection and positive selection Then, extract "neg.fdr" and "pos.fdr" from the gene summary table. ```{r ReadRRA} dd.rra = ReadRRA(RRA_Data, organism="hsa") head(dd.rra) ``` ### Negative selection and positive selection Take 0.05 as cutoff, get negative selection and positive selection genes and do enrichment analysis on KEGG pathway and GO BP terms. ```{r selection2, fig.height=5, fig.width=9} ##Negative selection universe=dd.rra$ENTREZID genes = dd.rra[dd.rra$neg.fdr<0.05, "ENTREZID"] kegg.neg=enrichment_analysis(geneList = genes, universe=universe, type = "KEGG", plotTitle="KEGG: neg") bp.neg=enrichment_analysis(geneList = genes, universe=universe, type = "BP", plotTitle="BP: neg") print(kegg.neg$gridPlot) print(bp.neg$gridPlot) ##Positive selection universe=dd.rra$ENTREZID genes = dd.rra[dd.rra$pos.fdr<0.05, "ENTREZID"] kegg.pos=enrichment_analysis(geneList = genes, universe=universe, type = "KEGG", plotTitle="KEGG: pos") bp.pos=enrichment_analysis(geneList = genes, universe=universe, type = "BP", plotTitle="BP: pos") print(kegg.pos$gridPlot) print(bp.pos$gridPlot) ``` # Session info ```{r sessionInfo} sessionInfo() ``` # References