---
title: "Single Cell Analysis"
author:
    - Max Mattessich
    - Joaquin Reyna
    - Edel Aron
    - Anna Konstorum
date: "Compiled: `r format(Sys.time(), '%B %d, %Y')`"
output:
    BiocStyle::html_document:
        dev: 'jpeg'
        df_print: paged
        fig_retina: 1
        number_sections: FALSE
        toc_depth: 4
        toc_float: TRUE
vignette: >
    %\VignetteIndexEntry{Single Cell Analysis}
    %\VignetteEncoding{UTF-8}
    %\VignetteEngine{knitr::rmarkdown}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction

In this vignette, we cover some of the possible sources of publicly available single-cell data, how to format the data for MCIA and process it, and how to run a basic analysis in order to annotate the data and use the results as metadata for exploring the decomposition results.

![Vignette 2 Pipelines by Major Section](https://github.com/Muunraker/nipalsMCIA/releases/download/v0.99.7/Vignette-2-Pipeline.png)

*The lines for the data_sc_sce.Rda are dotted because we do not provide this object below; however, the code describing how to generate it is still included.*

## Installation

```{r installation-github, eval = FALSE}
# install.packages("devtools")
devtools::install_github("Muunraker/nipalsMCIA", ref = "code-development",
                         force = TRUE, build_vignettes = TRUE) # devel version
```

```{r installation-bioconductor, eval = FALSE}
# after acceptance
# install.packages("BiocManager")
BiocManager::install("nipalsMCIA")
```

```{r load-packages, message = FALSE}
# note that the TENxPBMCData package is not included in this list as you may
# decide to pull data from another source or use our provided objects
library(dplyr)
library(ggplot2)
library(ggpubr)
library(nipalsMCIA)
library(piggyback)
library(Seurat)

# NIPALS starts with a random vector
set.seed(42)
```

```{r set-paths}
# if you would like to save any of the data loaded/created locally
path_data <- file.path("..", "data")

# recommended location for external data
path_inst <- tempdir()
```

# Data

We will be using the 10x Genomics "pbmc5k-CITEseq" dataset, which was published in May 2019 and processed using Cell Ranger 3.0.2. It contains 5,247 detected cells from PBMCs and 33,538 genes along with 32 cell surface markers. We chose this dataset because it contains both gene expression (GEX) and cell surface protein (ADT) data and is relatively recent, publicly available and from a widely used platform.

The following code details several ways to load this dataset. You may prefer to use data from sources other than 10x Genomics, in which case you will have to format it to work with nipalsMCIA.

## All Sources

We have made the objects described below available as data files, which you can download and load here if you do not want to run the following sections. **These files include the processed object that you will need in order to run MCIA (data_blocks_sc.Rda).**

```{r data-all-list}
# list all of the currently available files in the latest release
piggyback::pb_list(repo = "Muunraker/nipalsMCIA", tag = "latest")
```

```{r data-all-download}
# specify `tag = ` to use a release other than the latest

# files needed for running MCIA
piggyback::pb_download(file = "metadata_sc.csv",
                       dest = path_inst, repo = "Muunraker/nipalsMCIA")
piggyback::pb_download(file = "data_blocks_sc.Rda",
                       dest = path_inst, repo = "Muunraker/nipalsMCIA")

# MCIA results
piggyback::pb_download(file = "mcia_results_sc.Rds",
                       dest = path_inst, repo = "Muunraker/nipalsMCIA")

# marker genes for cell type annotation with Seurat
piggyback::pb_download(file = "marker_genes.csv",
                       dest = path_inst, repo = "Muunraker/nipalsMCIA")

# the Seurat data's metric summary file from 10x Genomics
piggyback::pb_download(file = "5k_pbmc_protein_v3_metrics_summary.csv",
                       dest = path_inst, repo = "Muunraker/nipalsMCIA")

# Seurat objects in different stages of processing
piggyback::pb_download(file = "tenx_pbmc5k_CITEseq_raw.rds",
                       dest = path_inst, repo = "Muunraker/nipalsMCIA")
piggyback::pb_download(file = "tenx_pbmc5k_CITEseq_annotated.rds",
                       dest = path_inst, repo = "Muunraker/nipalsMCIA")
```

## Bioconductor

There are several Bioconductor packages which provide single-cell data for users as part of the package. The [TENxPBMCData](https://bioconductor.org/packages/3.15/data/experiment/html/TENxPBMCData.html) package contains 9 different publicly available datasets from 10x Genomics (stored as [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) objects), including the one that we will be analyzing.

Note that the CITE-Seq information is stored as an "[alternative experiment](https://bioconductor.org/books/3.16/OSCA.intro/the-singlecellexperiment-class.html#alternative-experiments)" within the SingleCellExperiment object. Both the GEX (`"mainExpName: Gene Expression"`) and ADT (`"altExpNames(1): Antibody Capture"`) data are stored in the DelayedMatrix format since they are so large. In this object, the rows represent the features and the columns represent the cells.

```{r data-bioconductor-load, eval = FALSE}
# attach the accessors used below: counts(), altExp(), colData(), rowData()
library(SingleCellExperiment)

# read in the data as a SingleCellExperiment object
tenx_pbmc3k <- TENxPBMCData::TENxPBMCData(dataset = "pbmc5k-CITEseq")

# examine the data
tenx_pbmc3k
## class: SingleCellExperiment
## dim: 33538 5247
## metadata(0):
## assays(1): counts
## rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475 ENSG00000268674
## rowData names(4): ENSEMBL_ID Symbol_TENx Type Symbol
## colnames: NULL
## colData names(11): Sample Barcode ... Individual Date_published
## reducedDimNames(0):
## mainExpName: Gene Expression
## altExpNames(1): Antibody Capture

counts(tenx_pbmc3k)
## <33538 x 5247> sparse matrix of class DelayedMatrix and type "integer":
##                    [, 1]    [, 2]    [, 3]    [, 4] ... [, 5244] [, 5245] [, 5246] [, 5247]
## ENSG00000243485        0        0        0        0   .        0        0        0        0
## ENSG00000237613        0        0        0        0   .        0        0        0        0
## ENSG00000186092        0        0        0        0   .        0        0        0        0
## ENSG00000238009        0        0        0        0   .        0        0        0        0
## ENSG00000239945        0        0        0        0   .        0        0        0        0
##             ...        .        .        .        .   .        .        .        .        .
## ENSG00000277856        0        0        0        0   .        0        0        0        0
## ENSG00000275063        0        0        0        0   .        0        0        0        0
## ENSG00000271254        0        0        0        0   .        0        0        0        0
## ENSG00000277475        0        0        0        0   .        0        0        0        0
## ENSG00000268674        0        0        0        0   .        0        0        0        0

counts(altExp(tenx_pbmc3k))
## <32 x 5247> sparse matrix of class DelayedMatrix and type "integer":
##           [, 1]    [, 2]    [, 3]    [, 4] ... [, 5244] [, 5245] [, 5246] [, 5247]
##    CD3       25      959      942      802   .      402      401        6     1773
##    CD4      164      720     1647     1666   .     1417        1       46     1903
##   CD8a       16        8       21        5   .        8      222        3        9
##  CD11b     3011       12       11       11   .       15        7     1027        9
##   CD14      696       12       13        9   .        9       17      382        8
##    ...        .        .        .        .   .        .        .        .        .
## HLA-DR      573       15       11       19   .        6       40      184       32
##  TIGIT       10        3        3        3   .        2       15        1       12
##   IgG1        4        4        2        4   .        1        0        2        4
##  IgG2a        1        3        0        6   .        4        0        4        2
##  IgG2b        6        2        4        8   .        0        0        2        5

# examine the metadata:
head(colData(tenx_pbmc3k), n = 3)
## DataFrame with 3 rows and 11 columns
##           Sample            Barcode         Sequence Library Cell_ranger_version Tissue_status Barcode_type
##
## 1 pbmc5k-CITEseq AAACCCAAGAGACAAG-1 AAACCCAAGAGACAAG       1              v3.0.2            NA     Chromium
## 2 pbmc5k-CITEseq AAACCCAAGGCCTAGA-1 AAACCCAAGGCCTAGA       1              v3.0.2            NA     Chromium
## 3 pbmc5k-CITEseq AAACCCAGTCGTGCCA-1 AAACCCAGTCGTGCCA       1              v3.0.2            NA     Chromium
##     Chemistry Sequence_platform   Individual Date_published
##
## 1 Chromium_v3           NovaSeq HealthyDonor     2019-05-29
## 2 Chromium_v3           NovaSeq HealthyDonor     2019-05-29
## 3 Chromium_v3           NovaSeq HealthyDonor     2019-05-29

head(rowData(tenx_pbmc3k), n = 3)
## DataFrame with 3 rows and 4 columns
##                      ENSEMBL_ID Symbol_TENx            Type  Symbol
##
## ENSG00000243485 ENSG00000243485 MIR1302-2HG Gene Expression      NA
## ENSG00000237613 ENSG00000237613     FAM138A Gene Expression FAM138A
## ENSG00000186092 ENSG00000186092       OR4F5 Gene Expression   OR4F5

metadata(tenx_pbmc3k)
## list()

# change the gene names from Ensembl IDs to the 10x genes
rownames(tenx_pbmc3k) <- rowData(tenx_pbmc3k)$Symbol_TENx
```

In order to run MCIA, the format of this data must be slightly modified:

```{r data-bioconductor-formatting, eval = FALSE}
# set up the list
data_blocks_sc_sce <- list()
data_blocks_sc_sce$mrna <-
  data.frame(as.matrix(counts(tenx_pbmc3k)))
data_blocks_sc_sce$adt <-
  data.frame(as.matrix(counts(altExp(tenx_pbmc3k))))

summary(data_blocks_sc_sce)
##      Length Class      Mode
## mrna 5247   data.frame list
## adt  5247   data.frame list

# convert to a Seurat object (using `as.Seurat` won't work here)
obj_sce <- CreateSeuratObject(counts = data_blocks_sc_sce$mrna, # assay = "RNA"
                              project = "pbmc5k_CITEseq")
obj_sce[["ADT"]] <- CreateAssayObject(counts = data_blocks_sc_sce$adt)

# name the cells with their barcodes
obj_sce <- RenameCells(object = obj_sce,
                       new.names = colData(tenx_pbmc3k)$Sequence)

# add metadata from the SingleCellExperiment object
obj_sce <- AddMetaData(object = obj_sce,
                       metadata = as.data.frame(colData(tenx_pbmc3k),
                                                row.names = Cells(obj_sce)))

# this object will be slightly different than from the Seurat one down below
# e.g. 5247 rows vs. 4193 rows (since QC wasn't done) and different metadata
head(obj_sce[[]], n = 3)
##                     orig.ident nCount_RNA nFeature_RNA nCount_ADT nFeature_ADT         Sample            Barcode
## AAACCCAAGAGACAAG SeuratProject       7375         2363       5178           31 pbmc5k-CITEseq AAACCCAAGAGACAAG-1
## AAACCCAAGGCCTAGA SeuratProject       3772         1259       2893           29 pbmc5k-CITEseq AAACCCAAGGCCTAGA-1
## AAACCCAGTCGTGCCA SeuratProject       4902         1578       3635           29 pbmc5k-CITEseq AAACCCAGTCGTGCCA-1
##                          Sequence Library Cell_ranger_version Tissue_status Barcode_type   Chemistry Sequence_platform
## AAACCCAAGAGACAAG AAACCCAAGAGACAAG       1              v3.0.2                   Chromium Chromium_v3           NovaSeq
## AAACCCAAGGCCTAGA AAACCCAAGGCCTAGA       1              v3.0.2                   Chromium Chromium_v3           NovaSeq
## AAACCCAGTCGTGCCA AAACCCAGTCGTGCCA       1              v3.0.2                   Chromium Chromium_v3           NovaSeq
##                    Individual Date_published
## AAACCCAAGAGACAAG HealthyDonor     2019-05-29
## AAACCCAAGGCCTAGA HealthyDonor     2019-05-29
## AAACCCAGTCGTGCCA HealthyDonor     2019-05-29

# save the data locally if desired
save(data_blocks_sc_sce, obj_sce,
     file = file.path(path_data, "data_sc_sce.Rda"))
```

## 10x Genomics and Seurat

The original dataset can be found on the [10x Genomics Datasets
website](https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_protein_v3) and an explanation of the file types for this version can be found [here](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/3.0/output/overview). You can download them all to a directory of your choosing with their suggested terminal commands:

```{bash data-10x-download, eval = FALSE}
# Input Files
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_fastqs.tar
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_feature_ref.csv

# Output Files
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_possorted_genome_bam.bam
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_possorted_genome_bam.bam.bai
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_molecule_info.h5
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_filtered_feature_bc_matrix.h5
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_filtered_feature_bc_matrix.tar.gz
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_raw_feature_bc_matrix.h5
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_raw_feature_bc_matrix.tar.gz
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_analysis.tar.gz
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_metrics_summary.csv
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_web_summary.html
curl -O https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_cloupe.cloupe
```

If you would
like to view some basic information about the data, you can open the *web_summary.html* in your favorite web browser. It will show the estimated number of cells, mean reads per cell, median genes per cell and a variety of other sample metrics. These metrics are also available in the *metrics_summary.csv* file.

You will need the *filtered_feature_bc_matrix* directory for the following analysis; it can be extracted with `tar -xvzf 5k_pbmc_protein_v3_filtered_feature_bc_matrix.tar.gz` from within the relevant data directory.

```{r data-10x-object, eval = FALSE}
# load the data (change the file path as needed)
data <- Seurat::Read10X(data.dir = file.path(path_data, "tenx_pbmc5k_CITEseq",
                                             "filtered_feature_bc_matrix"),
                        strip.suffix = TRUE) # remove the "-1"s from barcodes
## 10X data contains more than one type and is being returned as a list
## containing matrices of each type.

# set minimum cells and/or features here if you'd like
obj <- Seurat::CreateSeuratObject(counts = data$`Gene Expression`,
                                  project = "pbmc5k_CITEseq")
obj[["ADT"]] <- Seurat::CreateAssayObject(counts = data$`Antibody Capture`)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')

# check the assays
Seurat::Assays(object = obj)
## "RNA" "ADT"

# list out the CITE-Seq surface protein markers
rownames(obj[["ADT"]])
##  [1] "CD3-TotalSeqB"           "CD4-TotalSeqB"           "CD8a-TotalSeqB"
##  [4] "CD11b-TotalSeqB"         "CD14-TotalSeqB"          "CD15-TotalSeqB"
##  [7] "CD16-TotalSeqB"          "CD19-TotalSeqB"          "CD20-TotalSeqB"
## [10] "CD25-TotalSeqB"          "CD27-TotalSeqB"          "CD28-TotalSeqB"
## [13] "CD34-TotalSeqB"          "CD45RA-TotalSeqB"        "CD45RO-TotalSeqB"
## [16] "CD56-TotalSeqB"          "CD62L-TotalSeqB"         "CD69-TotalSeqB"
## [19] "CD80-TotalSeqB"          "CD86-TotalSeqB"          "CD127-TotalSeqB"
## [22] "CD137-TotalSeqB"         "CD197-TotalSeqB"         "CD274-TotalSeqB"
## [25] "CD278-TotalSeqB"         "CD335-TotalSeqB"         "PD-1-TotalSeqB"
## [28] "HLA-DR-TotalSeqB"        "TIGIT-TotalSeqB"         "IgG1-control-TotalSeqB"
## [31] "IgG2a-control-TotalSeqB" "IgG2b-control-TotalSeqB"

# save the data locally if desired
saveRDS(obj, file.path(path_data, "tenx_pbmc5k_CITEseq_raw.rds"))
```

You can also load the Seurat object directly, such as in the [Read in and process the data] subsection in the [Deep dive: Seurat analysis] section later in the vignette.

# MCIA

## Metadata

Note that the "-1"s have been removed from the cell barcodes.

```{r mcia-metadata}
# read in the annotated cells
metadata_sc <- read.csv(file = file.path(path_inst, "metadata_sc.csv"),
                        header = TRUE, row.names = 1)

# examples
metadata_sc %>% slice_sample(n = 5)
```

## Running the decomposition

This will run on cells with the top 2000 variable features (as defined in the later analysis).

```{r mcia-decomp-load-data, eval = FALSE}
# load the object setup for running MCIA [10x Genomics & Seurat]
load(file = file.path(path_inst, "data_blocks_sc.Rda"))
```

```{r mcia-decomp-run, eval = FALSE}
# "largest_sv" results in a more balanced contribution
# from the blocks than the default "unit_var"
set.seed(42)

# convert data_blocks_sc to an MAE object using the SingleCellExperiment class
data_blocks_sc_mae <- MultiAssayExperiment::MultiAssayExperiment(
  lapply(data_blocks_sc, function(x)
    SingleCellExperiment::SingleCellExperiment(t(as.matrix(x)))),
  colData = metadata_sc)

mcia_results_sc <- nipals_multiblock(data_blocks = data_blocks_sc_mae,
                                     col_preproc_method = "colprofile",
                                     block_preproc_method = "largest_sv",
                                     num_PCs = 10, tol = 1e-9,
                                     deflationMethod = "global",
                                     plots = "none")
## Performing column-level pre-processing...
## Column pre-processing completed.
## Performing block-level preprocessing...
## Block pre-processing completed.
## Computing order 1 scores
## Computing order 2 scores
## Computing order 3 scores
## Computing order 4 scores
## Computing order 5 scores
## Computing order 6 scores
## Computing order 7 scores
## Computing order 8 scores
## Computing order 9 scores
## Computing order 10 scores

# saveRDS(mcia_results_sc, file = file.path(path_data, "mcia_results_sc.Rds"))
```

```{r mcia-decomp-load}
# load the results of the previous block (if already run and saved)
mcia_results_sc <- readRDS(file = file.path(path_inst, "mcia_results_sc.Rds"))
mcia_results_sc
```

## Visualization

This data comes from only one subject, so we will use the annotated cell types for the metadata (see the [Deep dive: Seurat analysis] section for details on their origin).

### Define colors

```{r mcia-plots-colors}
# for the projection plot
# technically you could just do color_pal_params = list(option = "D"), but saving
# the colors is useful for other plots like in the Seurat section
meta_colors_sc <- get_metadata_colors(mcia_results = mcia_results_sc,
                                      color_col = "CellType",
                                      color_pal = scales::viridis_pal,
                                      color_pal_params = list(option = "D"))

# for other plots
colors_omics_sc <- get_colors(mcia_results = mcia_results_sc)
```

### Eigenvalue scree plot

```{r mcia-plots-eigenvalue-scree, fig.dim = c(5, 4)}
global_scores_eigenvalues_plot(mcia_results = mcia_results_sc)
```

### Projection plot

```{r mcia-plots-projection, fig.dim = c(5, 5)}
projection_plot(mcia_results = mcia_results_sc,
                projection = "global", orders = c(1, 2),
                color_col = "CellType", color_pal = meta_colors_sc,
                legend_loc = "bottomright")
```

### Global scores heatmap

```{r mcia-plots-heatmap-global, fig.dim = c(7, 5)}
suppressMessages(global_scores_heatmap(mcia_results = mcia_results_sc,
                                       color_col = "CellType",
                                       color_pal = meta_colors_sc))
```

### Block weights heatmap

The block weights heatmap shows the distribution of the different block score weights among the factors.

```{r mcia-plots-heatmap-block, fig.dim = c(4, 2.5)}
block_weights_heatmap(mcia_results = mcia_results_sc)
```

### Loadings

```{r mcia-plots-loadings, fig.dim = c(6, 4.5)}
vis_load_plot(mcia_out = mcia_results_sc,
              axes = c(1, 4), colors_omics = colors_omics_sc)
```

### Top features

In a few factors of interest:

#### Factor 1

```{r mcia-plots-top-features-factor-1, fig.dim = c(10, 4)}
# define the loadings
all_pos_1 <- ord_loadings(mcia_out = mcia_results_sc, omic = "all",
                          absolute = FALSE, descending = TRUE, factor = 1)
mrna_pos_1 <- ord_loadings(mcia_out = mcia_results_sc, omic = "mrna",
                           absolute = FALSE, descending = TRUE, factor = 1)

# visualization
all_pos_1_vis <- vis_load_ord(gl_f_ord = all_pos_1, omic_name = "all",
                              colors_omics = colors_omics_sc)
mrna_pos_1_vis <- vis_load_ord(gl_f_ord = mrna_pos_1, omic_name = "mrna",
                               colors_omics = colors_omics_sc)

ggpubr::ggarrange(all_pos_1_vis, mrna_pos_1_vis, widths = c(1, 1))
```

#### Factor 4

If you want to rank the features by their signed (rather than absolute) loadings and display a larger number of them:

```{r mcia-plots-top-features-factor-4, fig.dim = c(10, 5)}
# define the loadings
all_4 <- ord_loadings(mcia_results_sc, omic = "all",
                      absolute = FALSE, descending = TRUE, factor = 4)

# visualization
vis_load_ord(gl_f_ord = all_4, omic_name = "all",
             colors_omics = colors_omics_sc, n_feat = 60)
```

As we saw in the [Block weights heatmap], factor 4 is dominated by the mRNA data and not the ADT data.

# Deep dive: Seurat analysis

This section demonstrates how to take in raw data (in this case, the output of Cell Ranger v3.0) and go through a popular analysis pipeline to ultimately cluster and annotate the data.

Some people prefer to use Cell Ranger's built-in dimension reduction and clustering analysis and to view the results with [Loupe Cell Browser](https://support.10xgenomics.com/single-cell-gene-expression/software/visualization/latest/what-is-loupe-cell-browser).

*Credit to Seurat ([RNA](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html) and [multi-modal](https://satijalab.org/seurat/articles/multimodal_vignette.html#setup-a-seurat-object-add-the-rna-and-protein-data)) for the general steps.*

## Read in and process the data

```{r seurat-load-data, message = FALSE}
# load the data
obj <- readRDS(file = file.path(path_inst, "tenx_pbmc5k_CITEseq_raw.rds"))

# add useful metadata
obj[["percent.mt"]] <- PercentageFeatureSet(object = obj, pattern = "^MT-")
```

## Quality control

### Metrics summary

```{r seurat-qc-metrics-summary}
# read in and display the summary table
(metrics_summary <- read.csv(
  file = file.path(path_inst, "5k_pbmc_protein_v3_metrics_summary.csv")
))
```

### GEX QC metrics

#### Before filtering

```{r seurat-qc-pre-filter, fig.dim = c(8, 5)}
VlnPlot(object = obj,
        features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
        pt.size = 0.01, ncol = 3) &
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5),
        axis.title.x = element_blank())
```

#### After filtering

Based on the previous plots, a minimum of 200 detected features and a maximum of 20% mitochondrial reads per cell seemed like good cutoffs.

```{r seurat-qc-post-filter, fig.dim = c(8, 5)}
# adjust cutoffs as desired
obj <- subset(x = obj, subset = nFeature_RNA > 200 & percent.mt < 20)

VlnPlot(object = obj,
        features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
        pt.size = 0.01, ncol = 3) &
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5),
        axis.title.x = element_blank())
```

## Standard Seurat pipeline

You can set `verbose = FALSE` for many of these commands if you don't want to see outputs. **Most of these are run on the RNA**.

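As a rough illustration of what the two normalization methods used in this pipeline compute, here is a minimal base R sketch on a toy count matrix. The names `toy_counts`, `log_normalize` and `clr_by_cell` are hypothetical, the values are made up, and the formulas reflect our reading of Seurat's `LogNormalize` and `CLR` (with `margin = 2`) methods rather than its actual implementation.

```r
# toy count matrix: rows are features, columns are cells (values are made up)
toy_counts <- matrix(c(10, 0, 5,
                        2, 8, 0,
                        0, 4, 6,
                        3, 3, 3),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(paste0("feature", 1:4),
                                     paste0("cell", 1:3)))

# log-normalization: rescale every cell to the same total, then take log(1 + x)
log_normalize <- function(counts, scale_factor = 10000) {
  log1p(sweep(counts, 2, colSums(counts), "/") * scale_factor)
}

# centered log-ratio per cell: divide each value by a geometric-mean-like term
# computed from the non-zero entries of that cell, then take log(1 + x)
clr_by_cell <- function(counts) {
  apply(counts, 2, function(x) {
    log1p(x / exp(sum(log1p(x[x > 0])) / length(x)))
  })
}

toy_log <- log_normalize(toy_counts)
toy_clr <- clr_by_cell(toy_counts)

# after undoing the log, every cell sums to the scale factor
colSums(expm1(toy_log))
```

Log-normalization removes differences in sequencing depth between cells, while the centered log-ratio expresses each protein relative to the other proteins measured in the same cell.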
```{r seurat-pipeline-1, eval = FALSE}
# standard log normalization for RNA and centered log for ADT
obj <- NormalizeData(object = obj, normalization.method = "LogNormalize",
                     scale.factor = 10000, assay = "RNA")
## Performing log-normalization
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|

obj <- NormalizeData(object = obj, normalization.method = "CLR",
                     margin = 2, assay = "ADT") # go across cells, not features
## Normalizing across cells
##   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s

# highly variable features
obj <- FindVariableFeatures(object = obj, selection.method = "vst",
                            nfeatures = 2000)
## Calculating gene variances
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## Calculating feature variances of standardized and clipped values
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|

# scaling so the average expression is 0 and the variance is 1
obj <- ScaleData(object = obj, features = rownames(obj))
## Centering and scaling data matrix
##   |===============================================================================================| 100%

# dimensionality reduction
obj <- RunPCA(object = obj)
## PC_ 1
## Positive:  LYZ, FCN1, CST3, MNDA, CTSS, PSAP, S100A9, FGL2, AIF1, GRN
##     NCF2, LST1, CD68, TYMP, SERPINA1, CYBB, CLEC12A, CSTA, SPI1, TNFAIP2
##     CPVL, VCAN, MPEG1, TYROBP, KLF4, FTL, S100A8, IGSF6, CD14, MS4A6A
## Negative:  CD3D, TRAC, LTB, TRBC2, IL32, CD3G, IL7R, CD69, CD247, TRBC1
##     CD2, CD7, CD27, ARL4C, ISG20, HIST1H4C, SYNE2, GZMM, ITM2A, CCR7
##     RORA, MAL, CXCR4, LEF1, TRAT1, CTSW, GZMA, KLRB1, TRABD2A, CCL5
## PC_ 2
## Positive:  CD79A, MS4A1, IGHM, BANK1, HLA-DQA1, CD79B, IGKC, LINC00926, RALGPS2, TNFRSF13C
##     VPREB3, IGHD, SPIB, CD22, FCRL1, HLA-DQB1, BLK, FAM129C, FCRLA, TCL1A
##     GNG7, TCF4, COBLL1, PAX5, SWAP70, CD40, BCL11A, P2RX5, TSPAN13, ADAM28
## Negative:  NKG7, CST7, GZMA, PRF1, KLRD1, CTSW, FGFBP2, GNLY, GZMH, CCL5
##     GZMM, CD247, KLRF1, HOPX, SPON2, ADGRG1, TRDC, MATK, GZMB, FCGR3A
##     S100A4, CCL4, CLIC3, KLRB1, IL2RB, TBX21, TTC38, ANXA1, PTGDR, PLEKHF1
## PC_ 3
## Positive:  GZMB, NKG7, GNLY, CLIC3, PRF1, KLRD1, FGFBP2, KLRF1, SPON2, CST7
##     GZMH, FCGR3A, ADGRG1, GZMA, HOPX, CTSW, TRDC, CCL4, HLA-DPB1, C12orf75
##     PLAC8, TTC38, PLEK, APOBEC3G, TBX21, PRSS23, CYBA, MATK, SYNGR1, CXXC5
## Negative:  IL7R, MAL, LEF1, TRABD2A, TRAC, CCR7, LTB, CD27, FOS, LRRN3
##     FHIT, TRAT1, RGCC, CAMK4, CD3D, RGS10, CD40LG, FOSB, AQP3, SOCS3
##     FLT3LG, CD3G, SLC2A3, TSHZ2, VIM, S100A12, S100A8, CD28, PLK3, VCAN
## PC_ 4
## Positive:  FCER1A, PLD4, SERPINF1, IL3RA, CLEC10A, GAS6, LILRA4, TPM2, CLEC4C, ENHO
##     FLT3, SMPD3, ITM2C, LGMN, CD1C, P2RY14, PPP1R14B, SCT, PROC, LAMP5
##     RUNX2, AC119428.2, PACSIN1, DNASE1L3, PTCRA, RGS10, UGCG, CLIC2, PPM1J, P2RY6
## Negative:  MS4A1, CD79A, LINC00926, BANK1, TNFRSF13C, VPREB3, CD79B, RALGPS2, IGHD, FCRL1
##     BLK, IGHM, CD22, PAX5, ARHGAP24, CD24, P2RX5, NCF1, S100A12, CD19
##     SWAP70, FCRLA, VNN2, TNFRSF13B, FCER2, IGKC, FCRL2, RBP7, CD40, S100A8
## PC_ 5
## Positive:  BATF3, C1QA, TCF7L2, CTSL, CDKN1C, HLA-DQB1, HES4, SIGLEC10, CLEC10A, ABI3
##     HLA-DQA1, RHOC, CSF1R, ENHO, CAMK1, MTSS1, IFITM3, CD1C, LY6E, FCGR3A
##     HLA-DPA1, CLIC2, HLA-DPB1, YBX1, RRAS, AC064805.1, NR4A1, GBP1, ZNF703, CXCL16
## Negative:  LILRA4, TPM2, CLEC4C, SMPD3, IL3RA, SERPINF1, DERL3, MZB1, SCT, JCHAIN
##     PACSIN1, PROC, S100A12, PTCRA, LINC00996, PADI4, ASIP, KCNK17, ITM2C, EPHB1
##     ALOX5AP, LAMP5, DNASE1L3, MAP1A, S100A8, APP, CYP1B1, VNN2, UGCG, ZFAT

# clustering (adjust dimensions and resolutions as desired)
obj <- FindNeighbors(object = obj, reduction = "pca", dims = 1:20)
## Computing nearest neighbor graph
## Computing SNN

obj <- FindClusters(object = obj, resolution = 0.4)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 4193
## Number of edges: 154360
##
## Running Louvain algorithm...
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## Maximum modularity in 10 random starts: 0.9097
## Number of communities: 10
## Elapsed time: 0 seconds

obj <- RunUMAP(object = obj, reduction = "pca", dims = 1:20)
## 14:44:35 UMAP embedding parameters a = 0.9922 b = 1.112
## 14:44:35 Read 4193 rows and found 20 numeric columns
## 14:44:35 Using Annoy for neighbor search, n_neighbors = 30
## 14:44:35 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 14:44:36 Writing NN index file to temp file /tmp/RtmpHhoQAU/filec346711dd9f93
## 14:44:36 Searching Annoy index using 1 thread, search_k = 3000
## 14:44:37 Annoy recall = 100%
## 14:44:38 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 14:44:39 Initializing from normalized Laplacian + noise (using RSpectra)
## 14:44:39 Commencing optimization for 500 epochs, with 174318 positive edges
## Using method 'umap'
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 14:44:44 Optimization finished
```

## Dimensionality reduction

### Load in the processed object

To avoid having to save multiple large objects, this file already includes the annotations defined in a later section, so we will clear them out here to proceed as normal.

```{r seurat-load-processed-object}
obj <- readRDS(file = file.path(path_inst, "tenx_pbmc5k_CITEseq_annotated.rds"))

# reset to baseline
Idents(object = obj) <- "seurat_clusters"
obj$annotated_clusters <- NULL # remove the saved annotations
```

### PCA

**Typically this section would go in between the `RunPCA()` and `FindNeighbors()` steps above**. It has been included here because of how we save and load the objects separately for processing speed.

Now that we've run PCA, we can examine an elbow plot as a simple method for selecting how many PCs to use when identifying neighbors/clusters.

```{r seurat-elbow, fig.dim = c(6, 3)}
ElbowPlot(object = obj, ndims = 30)
```

From the plot, it looks like 20 works as a cut-off for the number of PCs to include. We could have also chosen to use 10, but decided to go a little higher.

You can also examine other information which was used for the PCA, such as the top twenty most variable features:

```{r seurat-top-features}
head(VariableFeatures(object = obj), 20)
```

```{r seurat-top-features-plot, fig.dim = c(6, 6), warning = FALSE, eval = FALSE}
# these can be plotted if desired
LabelPoints(plot = VariableFeaturePlot(object = obj),
            points = head(VariableFeatures(object = obj), 20),
            repel = TRUE, xnudge = 0, ynudge = 0) +
  labs(title = "Top Twenty Variable Features") +
  theme(legend.position = "bottom")
```

### UMAP

```{r seurat-umap-initial, fig.dim = c(6, 6), message = FALSE}
# you can also use the LabelClusters function to help label individual clusters
plot_seurat_clusters <-
  UMAPPlot(object = obj, label = TRUE, label.size = 4, label.box = TRUE) +
    labs(title = "Initial Clusters") +
    scale_fill_manual(
      values = rep("white", n_distinct(obj$seurat_clusters))) +
    theme(plot.title = element_text(hjust = 0.5),
          axis.text = element_blank(),
          axis.ticks = element_blank())

plot_seurat_clusters
```

## Marker overlays

Note that the following marker genes **are not meant to be an exhaustive list.** They are included as examples of some of the cell types you could look
for.

### Load marker genes

```{r seurat-markers-load}
# sort by Cell_Type and Marker if not already sorted
markers_all <- read.csv(file = file.path(path_inst, "marker_genes.csv"))

markers_all %>% slice_sample(n = 10) # example markers
```

### Dot plots

#### GEX

Here we only demonstrate a few of the cell types included within the provided markers database. We plot those cell types on top of the dot plot to make it easier to see which markers they correspond with, but you can comment that code out if you would rather have a simple dot plot.

```{r seurat-markers-dot-gex, fig.dim = c(16, 6), warning = FALSE}
select_cell_types <- c("B", "Macrophages", "mDC", "NK", "T")

# do features = unique(markers_all$Marker) to use all possible features
p <- DotPlot(object = obj,
             features = markers_all %>%
                          dplyr::filter(Cell_Type %in% select_cell_types) %>%
                          distinct(Marker) %>% pull(),
             cols = "RdBu", col.min = -1, dot.scale = 3,
             cluster.idents = TRUE)

# add in the cell type information
# if desired, you could rename the "Cell_Type"s in the original database to be
# more informative e.g. Natural_Killers instead of NK
p$data <- left_join(p$data,
                    markers_all %>%
                      dplyr::filter(Cell_Type %in% select_cell_types) %>%
                      dplyr::rename(features.plot = "Marker"),
                    by = "features.plot", multiple = "all")
# depending on your version of dplyr, you can set `relationship = "many-to-many"`
# to suppress the warning

# plot
p +
  facet_grid(cols = vars(Cell_Type), scales = "free_x", space = "free") +
  theme(strip.text.x = element_text(size = 10)) +
  RotatedAxis()
```

#### ADT

The scaled average expression is clipped at a minimum of -1 (`col.min = -1`) to make the up-regulated features easier to see.

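To make the effect of `col.min` more concrete, the sketch below z-scores the average expression of a single feature across clusters and then clips the result. This is a simplified reading of what `DotPlot()` does before coloring the dots, not its actual code; the values and the names `avg_exp`, `clip`, `col_min` and `col_max` are hypothetical.

```r
# average expression of one feature in five clusters (made-up values)
avg_exp <- c(c0 = 0.1, c1 = 0.2, c2 = 4.0, c3 = 0.3, c4 = 0.4)

# z-score across clusters
avg_scaled <- as.numeric(scale(avg_exp))
names(avg_scaled) <- names(avg_exp)

# clip to [col_min, col_max]; values outside the range take the boundary value
clip <- function(x, col_min, col_max = 2.5) {
  pmin(pmax(x, col_min), col_max)
}

clip(avg_scaled, col_min = -1)
clip(avg_scaled, col_min = 0)
```

With a lower bound of 0, every cluster below the mean is drawn in the same color, so only above-average expression stands out.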
```{r seurat-markers-dot-adt, fig.dim = c(16, 6)}
# you have to change the default assay for the dot plot
DefaultAssay(object = obj) <- "ADT"

DotPlot(object = obj,
        features = rownames(GetAssayData(object = obj, assay = "ADT",
                                         slot = "counts")),
        cols = "RdBu", col.min = -1, dot.scale = 3,
        cluster.idents = TRUE) +
  RotatedAxis()

# reset the default
DefaultAssay(object = obj) <- "RNA"
```

### Feature plots

If you would like to generate feature plots for a cell type within your markers database (if you are using one) e.g. for B cells, you could do `features = (dplyr::filter(markers_all, Cell_Type == "B"))$Marker` in the following command. Here we just show a few characteristic markers for different cell types for simplicity.

```{r seurat-markers-feature, fig.dim = c(12, 8)}
FeaturePlot(object = obj,
            features = c("MS4A1", "CD14", # MS4A1 is another name for CD20
                         "NKG7", "IL3RA", "CD3D", "IL7R"),
            min.cutoff = 0, label = TRUE, label.size = 4,
            ncol = 3, raster = FALSE)
```

### Violin plots

Here the point size is set to zero so that you can just see the violins. We only show a few of the cell surface markers here, but you could plot them all with `features = rownames(obj[["ADT"]])`. You can also plot the standard GEX markers just like for the feature plots (just make sure to remove `assay = "ADT"`).

```{r seurat-markers-violin-adt, fig.dim = c(8, 8)}
VlnPlot(object = obj,
        features = c("CD3-TotalSeqB", "CD14-TotalSeqB",
                     "CD20-TotalSeqB", "CD335-TotalSeqB"),
        pt.size = 0, assay = "ADT", ncol = 2) &
  theme(plot.title = element_text(size = 10),
        axis.text.x = element_text(angle = 0, hjust = 0.5),
        axis.title.x = element_blank())
```

## Annotate cell clusters

These annotations are chosen here more as a proof of general concept instead of a more highly refined and verified approach that you would typically see within a single-cell publication. There are also [automated methods for annotation](https://www.sciencedirect.com/science/article/pii/S2001037021000192).

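One very simple way to turn marker overlays into a first-pass annotation is to average the expression of each cell type's markers within every cluster and pick the best-scoring cell type. The sketch below does this on made-up numbers; `cluster_avg`, `marker_sets` and `first_pass` are hypothetical names, and this is a toy illustration rather than one of the published automated methods.

```r
# average expression per cluster (rows = marker genes, columns = clusters);
# the values are made up for illustration
cluster_avg <- matrix(c(2.5, 0.1, 0.2,
                        2.0, 0.0, 0.1,
                        0.1, 3.0, 0.2,
                        0.2, 0.1, 2.8),
                      nrow = 4, byrow = TRUE,
                      dimnames = list(c("CD3D", "IL7R", "MS4A1", "NKG7"),
                                      c("0", "1", "2")))

# marker genes grouped by cell type
marker_sets <- list("T Cells" = c("CD3D", "IL7R"),
                    "B Cells" = "MS4A1",
                    "Natural Killers" = "NKG7")

# score each cluster by the mean expression of every marker set
# (result: one row per cluster, one column per cell type)
scores <- sapply(marker_sets, function(genes) {
  colMeans(cluster_avg[genes, , drop = FALSE])
})

# pick the best-scoring cell type for each cluster
best <- max.col(scores, ties.method = "first")
first_pass <- setNames(colnames(scores)[best], rownames(scores))
first_pass
```

A result like this is only a starting point; ambiguous clusters still need to be checked by eye against the dot, feature and violin plots.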

### Annotations

If you would like to further break down these clusters, you can increase the
clustering resolution in the standard pipeline in order to increase the number
of clusters.

```{r seurat-annotations}
# mDC = myeloid dendritic cells
# pDC = plasmacytoid dendritic cells
obj_annotations <- rbind(c("0", "Macrophages/mDCs"), # difficult to separate
                         c("1", "T Cells"),
                         c("2", "T Cells"),
                         c("3", "T Cells"),
                         c("4", "Natural Killers"),
                         c("5", "B Cells"),
                         c("6", "Macrophages/mDCs"), # difficult to separate
                         c("7", "Macrophages/mDCs"), # difficult to separate
                         c("8", "T Cells"), # faint signal
                         c("9", "pDCs"))

colnames(obj_annotations) <- c("Cluster", "CellType")
obj_annotations <- data.frame(obj_annotations)

# save the annotations as a csv if you'd like
# write.csv(obj_annotations, file = file.path(path_data, "obj_annotations.csv"))

# prepare the annotation information
annotations <- setNames(obj_annotations$CellType, obj_annotations$Cluster)

# relabel the Seurat clusters
# Idents(obj) <- "seurat_clusters"
obj <- RenameIdents(object = obj, annotations)

# alphabetize the cell types
Idents(obj) <- factor(Idents(object = obj), levels = sort(levels(obj)))

# useful metadata (e.g. if you want to have multiple annotation sets)
obj[["annotated_clusters"]] <- Idents(object = obj)

# save the processed and annotated Seurat object
# saveRDS(obj, file = file.path(path_data, "tenx_pbmc5k_CITEseq_annotated.rds"))

# info about the clusters
obj_annotations %>%
  group_by(CellType) %>%
  transmute(Clusters = paste0(Cluster, collapse = ", ")) %>%
  distinct() %>%
  arrange(CellType)

# metadata file for MCIA
meta <- data.frame("CellType" = Idents(object = obj))

# save the metadata file for MCIA
# write.csv(meta, file = file.path(path_data, "metadata_sc.csv"))
```

### UMAPs

Here we use the selected metadata colors from the [MCIA] section, so be sure to
change that (or just remove the `cols = ...` to use the default) if you are
running this section first.
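
If `meta_colors_sc` does not exist yet because this section is being run first,
one option is a stand-in palette. The following is a minimal sketch, assuming
the five cell type labels assigned in the annotations; the name
`meta_colors_placeholder` and the "Dark 3" palette are chosen here purely for
illustration and are not the colors used elsewhere in this vignette.

```r
# hypothetical stand-in for meta_colors_sc (not the vignette's actual colors)
cell_types <- c("B Cells", "Macrophages/mDCs", "Natural Killers",
                "pDCs", "T Cells")

# one color per cell type, named so that each color is tied to a label
meta_colors_placeholder <- setNames(
  grDevices::hcl.colors(n = length(cell_types), palette = "Dark 3"),
  cell_types)
meta_colors_placeholder
```

You could then use `meta_colors_placeholder` wherever `meta_colors_sc` appears
in the plotting code of this section.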

If you don't want the cluster labels to have the white backgrounds, remove
`label.box` and `scale_fill_manual()`.

```{r seurat-umap-both, fig.dim = c(16, 8), message = FALSE}
# change colors and themes as desired
plot_annotated_clusters <-
  UMAPPlot(object = obj, label = TRUE, label.size = 4, label.box = TRUE,
           cols = meta_colors_sc) +
  labs(title = "Annotated Clusters") +
  scale_fill_manual(values = rep("white", length(meta_colors_sc))) +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text = element_blank(),
        axis.ticks = element_blank()) +
  NoLegend() # a legend is not needed here

ggpubr::ggarrange(plot_seurat_clusters + NoLegend(), plot_annotated_clusters,
                  heights = c(1, 1), widths = c(1, 1))
```

You should change the `DefaultAssay` to/from `ADT` like in [ADT] if you don't
want to specify "adt_" before the feature name.

```{r seurat-umap-adt, fig.dim = c(6, 6)}
# plot ADT information on top as an example
FeaturePlot(object = obj, features = "adt_CD19-TotalSeqB",
            label = TRUE, label.size = 4, ncol = 1, raster = FALSE)
```

### Check the annotations

For example, if you want to visually check your T cell cluster annotations
(just like before, you can also do
`features = (dplyr::filter(markers_all, Cell_Type == "T"))$Marker` to use all of
the T cell markers in the markers database):

```{r seurat-check-annotations-violin, fig.dim = c(8, 8)}
VlnPlot(object = obj, features = c("CD3D", "CD4", "IL7R", "TRAC"),
        cols = meta_colors_sc, pt.size = 0.01, ncol = 2) &
  theme(plot.title = element_text(size = 10),
        axis.title.x = element_blank())
```

You can also obtain the following bar plot using the output of
`nipals_multiblock` e.g. with
`ggplot(data.frame(table(nmb_get_metadata(mcia_results_sc))), aes(x = CellType, y = Freq, fill = CellType)) + geom_bar(stat = "identity")`
(for the base plot).

```{r seurat-check-annotations-bar, fig.dim = c(8, 6)}
# examine the total counts
ggplot(obj[[]], aes(x = annotated_clusters, fill = annotated_clusters)) +
  geom_bar(color = "black", linewidth = 0.2) +
  labs(title = "Cell Type Counts", x = "Cell Type", y = "Count") +
  scale_fill_manual(values = meta_colors_sc) +
  theme_bw() +
  theme(plot.title = element_text(size = 14, hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")
```

## Save for MCIA

The file has to be structured in the form of a (large) list, with each omic
saved as an element within it.

If you would like to append the type of omic to the end of your feature names,
uncomment the commented lines.

```{r seurat-save-mcia, eval = FALSE}
# mrna
data_rna <- GetAssayData(object = obj, slot = "data",
                         assay = "RNA")[VariableFeatures(object = obj), ]
data_rna <- as.matrix(data_rna)
data_rna <- t(data_rna) # transpose: cells as rows, features as columns
# colnames(data_rna) <- paste(colnames(data_rna), "mrna", sep = "_")
data_rna <- log2(data_rna + 1) # log-transform because the data is heavy tailed

# adt
data_adt <- GetAssayData(object = obj, slot = "data", assay = "ADT")
data_adt <- as.matrix(data_adt)
data_adt <- t(data_adt) # transpose: cells as rows, features as columns
# colnames(data_adt) <- paste(colnames(data_adt), "adt", sep = "_")
data_adt <- log2(data_adt + 1) # log-transform because the data is heavy tailed

# combined
data_blocks_sc <- list(mrna = data_rna, adt = data_adt)

# examine the contents
data.frame(data_blocks_sc$mrna[1:5, 1:5])
data.frame(data_blocks_sc$adt[1:5, 1:5])

# save(data_blocks_sc, file = file.path(path_data, "data_blocks_sc.Rda"))
```

# Session Info

**Session Info**

```{r session-info}
sessionInfo()
```