--- title: "Single-cell transcriptomics analysis for pedestrians" author: "Simon Anders" date: "2019-03" output: html_document: code_download: true --- This document demonstrates two approaches to analyse single-cell data: First, we use Seurat, and then we perform all steps using only basic R, in order to be able to study and modify each step. ```{r include=FALSE} set.seed(13245768) ``` ## Example data Our example data: - Carter et al.: "A single-cell transcriptional atlas of the developing murine cerebellum", - [Current Biology 28:2910](https://doi.org/10.1016/j.cub.2018.07.062) (2018). - We use their sample "E13a" (cells from the cerebellum of a mouse embryo at day 13) ## Seurat analysis First we use the Seurat package (Butler et al., 2018; Stuart, Butler et al, 2019) ```{r, message=FALSE} library( Seurat ) ``` We have rerun their raw data through CellRanger: https://www.zmbh.uni-heidelberg.de/anders/div/Carter_E13a.zip Read in the data; we get a sparse matrix ```{r} seu <- Read10X("E13_A/") seu[ 550:560, 2600:2630 ] ``` Create a Seurat object ```{r} seu <- CreateSeuratObject( seu, min.cells = 3, min.features=1000 ) ``` "Normalize" the data ```{r} seu <- NormalizeData( seu ) ``` Find highly variable genes ```{r} seu <- FindVariableFeatures( seu ) ``` Scale and center the data ```{r} seu <- ScaleData( seu ) ``` Run a PCA to reduce the data ```{r} seu <- RunPCA( seu ) ``` Get an embedding. ```{r} seu <- RunUMAP( seu, dims=1:20 ) ``` Plot the embedding ```{r} DimPlot( seu ) ``` Perform Louvain clustering ```{r} seu <- FindNeighbors( seu ) seu <- FindClusters( seu ) ``` Plot the embedding, and highlight the Louvain clusters: ```{r} DimPlot( seu ) ``` Find cluster markers for, eg., cluster 7 ```{r} FindMarkers( seu, ident.1 = 7 ) ``` Inspect it with Sleepwalk ```{r eval=FALSE} library( sleepwalk ) sleepwalk( seu$umap@cell.embeddings, seu$pca@cell.embeddings ) ``` ![](sleepwalk1.png) ## Pedestrian analysis Now we perform an analysis only using standard R and tidyverse packages. ```{r message=FALSE} library( tidyverse ) ``` ### Read CellRanger output First, read the CellRanger output. We read the three files one by one. First the `genes.tsv` file. It contains two columns: the gene IDs and the gene names ```{r} genes <- read_tsv( "E13_A/genes.tsv", col_names = c( "gene_id", "gene_name" ) ) head( genes ) ``` There are a few duplicated gene names. I disambiguate these few cases by adding the Ensemble ID to the name. ```{r} duplicate_gene_names <- genes$gene_name[ duplicated(genes$gene_name) ] genes$gene_name <- ifelse( genes$gene_name %in% duplicate_gene_names, str_c( genes$gene_name, "_", genes$gene_id ), genes$gene_name ) ``` Next, the cell barcodes. This is simply a file with one barcode per line. ```{r} barcodes <- readLines( "E13_A/barcodes.tsv" ) str( barcodes ) ``` These are the unfiltered files: We have over 700,000 barcodes -- presumably most of them empty droplets. I remove the superfluous `-1` suffix. ```{r} barcodes <- str_remove( barcodes, "-1" ) ``` The third file and biggest file in the CellRanger output is the matrix of counts in coordinate-sparse form, stored according to the [MatrixMarket](https://math.nist.gov/MatrixMarket/formats.html) format. ```{r} matrixTable <- read_delim( "E13_A/matrix.mtx", delim=" ", comment = "%", col_names = c( "gene", "barcode", "count" ) ) head(matrixTable) ``` Note how the first non-comment row of a MatrixMarket file contains the total number of genes, barcodes and UMIs. The first two numbers agree with the number of rows in the `genes` and `barcodes` files. From the second row on, we have actual data: the first and second columns are indices into the gene and barcode list, the third one contains the UMI counts. The `sparseMatrix` function from the base R package `Matrix` is designed to handle such data. Actually, it turns it from triplet-sparse to column-sparse (unless we set `giveCsparse=FALSE`) but that might actually be better for performance. (Seurat, by the way, keeps it in triplet-sparse format.) The sparseMatrix functions takes all the data we have just loaded: ```{r} library( Matrix ) counts <- sparseMatrix( i = matrixTable$gene[-1], j = matrixTable$barcode[-1], x = matrixTable$count[-1], dims = c( matrixTable$gene[1], matrixTable$barcode[1] ), dimnames = list( gene = genes$gene_name, barcode = barcodes ) ) counts[1:5,1:5] ``` Remove the raw data tables to save RAM ```{r} rm( barcodes, genes, matrixTable ) ``` ### Do the knee plot This is the knee plot that CellRanger does, too: ```{r} n_detected_genes_sorted <- sort( colSums( counts>0 ), decreasing = TRUE ) head(n_detected_genes_sorted) ``` Now I plot these numbers in a log-log plot ```{r} plot( seq.int(ncol(counts)), n_detected_genes_sorted, log = "xy", type="s", xlab = "barcodes", ylab = "detected genes" ) abline( h=1000, col="gray" ) ``` As before, we call everything with at least 1000 detected genes (above the gray line) a cell. ```{r} cell_barcodes <- names(which( colSums( counts>0 ) >= 1000 )) str(cell_barcodes) ``` Are these the same barcodes as the ones in the Seurat object? ```{r} all( colnames(seu@assays$RNA@counts) == cell_barcodes ) ``` They are. Even in the same order. Remove the rest to make the matrix smaller ```{r} counts <- counts[ , cell_barcodes ] dim(counts) ``` Also remove the genes that appear in none of the remaining cells ```{r} counts <- counts[ rowSums(counts) > 0, ] dim(counts) ``` ### Normalisation For each cell, simply divide the UMI count per gene by the total UMI count over all genes: $y_{ik} = k_{ij} / \sum_{i'} k_{i'j}$ The double transposition is to get R to divide the columns and not the rows by the appropriate total. ```{r} nrm_counts <- t( t(counts) / colSums(counts) ) ``` ### Highly variable genes. A preliminary: Here is a function to efficiently calculate row variances for a column-sparse matrix. This seems to be missing in the Matrix package. ```{r} colVars_spm <- function( spm ) { stopifnot( is( spm, "dgCMatrix" ) ) ans <- sapply( seq.int(spm@Dim[2]), function(j) { mean <- sum( spm@x[ (spm@p[j]+1):spm@p[j+1] ] ) / spm@Dim[1] sum( ( spm@x[ (spm@p[j]+1):spm@p[j+1] ] - mean )^2 ) + mean^2 * ( spm@Dim[1] - ( spm@p[j+1] - spm@p[j] ) ) } ) / ( spm@Dim[1] - 1 ) names(ans) <- spm@Dimnames[[2]] ans } rowVars_spm <- function( spm ) { colVars_spm( t(spm) ) } ``` With this, get the variance-to-mean ratio for each gene and plot them against the means: ```{r} gene_means <- rowMeans( nrm_counts ) gene_vars <- rowVars_spm( nrm_counts ) plot( gene_means, gene_vars / gene_means, log = "xy", cex = .3, col = adjustcolor("black", alpha=.3), xlab = "mean", ylab = "variance / mean" ) poisson_vmr <- mean( 1 / colSums( counts ) ) abline( h = 1:2 * poisson_vmr, col="lightblue" ) ``` The lowest of the light blue line marks the Poisson noise level (`poisson_vmr`). It is placed at the mean of the reciprocals of the UMI totals per cell. This is immediately not obvious; I'll write up the proof. We chose the genes three times above the Poisson level as informative genes ```{r} informative_genes <- names(which( gene_vars / gene_means > 2 * poisson_vmr )) str( informative_genes ) ``` Are the variable genes chosen by Seurat also high in this plot? ```{r} plot( gene_means, gene_vars / gene_means, log = "xy", cex = .3, col = adjustcolor("black", alpha=.3), xlab = "mean", ylab = "variance / mean" ) points( gene_means[ seu@assays$RNA@var.features ], ( gene_vars / gene_means )[ seu@assays$RNA@var.features ], col="red", cex=.2 ) ``` ## Variance-stabilizing transformation ```{r} a <- log1p( nrm_counts * 1e4) a <- sqrt( nrm_counts ) plot( rowMeans( a ), genefilter::rowVars( a ), pch=".", col=adjustcolor("black", .5) ) ``` ## Embedding I calculate a UMAP embedding, using only the informative genes ```{r} library( uwot ) my_umap <- umap( as.matrix( t( sqrt( nrm_counts[ informative_genes, ] ) ) ) ) ``` The square root here is a varianmce-stabilizing transformation. I use it instead of the log used by Seurat, because in UMI data, the dominant noise is Poissonean. (This, too, needs to be explained in more detail, along with the calculation of the Poisson noise level.) ```{r} plot( my_umap, asp=1, pch="." ) ``` Or perhaps better a PCA first? ```{r} library( irlba ) pca <- prcomp_irlba( t( sqrt( nrm_counts[ informative_genes, ] ) ), 20 ) my_umap <- umap( pca$x ) plot( my_umap, asp=1, pch="." ) ``` Plot again, now colour with the Louvain cluster found by Seurat ```{r} plot( my_umap, asp=1, col=seu@active.ident, cex=.1 ) ``` I save the environment, so I can easily restart here later. ```{r} save.image("comparison.rda") ``` Now I can compare the two embeddings. Left the one by Seurat, right the one with my workflow. ```{r eval=FALSE} sleepwalk( list( seu@reductions$umap@cell.embeddings, my_umap ), list( seu@reductions$pca@cell.embeddings, pca$x ) ) ``` ![](sleepwalk2.png) ### What does a PCA do? The rownames in the `pca` object got lost; set them again. ```{r} rownames( pca$x ) <- colnames( counts ) rownames( pca$rotation ) <- informative_genes ``` Plot the first two PCs: ```{r} plot( pca$x[,1:2], cex=.2, asp=1 ) legend("topleft", "each point a cell", cex=.7, pch=1, pt.cex=.3 ) ``` TO DO: Fill in MSigDB query ```{r} writeLines( rownames(pca$rotation)[ order(pca$rotation[,"PC1"]) ], con="pc1_genes.lst" ) ``` See PC 1 in UMAP: ```{r} ggplot( data.frame( UMAP=my_umap, pca$x ) ) + geom_point( aes( x=UMAP.1, y=UMAP.2, col=PC1 ), size=.3 ) + coord_fixed() ``` Scree plot ```{r} plot( pca$sdev ) ``` Pick a random cell, get its coordinate along the principal component scores by looking into the PCA object, and by calculating manually ```{r} cell <- "ACGAGGGAGCAAGG" pca$x[ cell, ] t(pca$rotation) %*% ( sqrt( nrm_counts[ informative_genes, cell ] ) - pca$center ) pca$x[cell,] ``` ```{r} plot( pca$rotation[,"PC1"], ( sqrt( nrm_counts[ informative_genes, cell ] ) - pca$center ), xlab = "Loadings of PC 1", ylab = paste( "centered expression of cell", cell ) ) legend( "topright", "each point an informative gene", pch=1, cex=.7 ) ``` How well is our cell's expression profile captured by the first 20 PCs? ```{r} plot( sqrt( nrm_counts[ informative_genes, cell ] ), pca$rotation %*% pca$x[cell,] + pca$center ) ``` How well is a given gene's expression captured across cells: ```{r} gene <- "Meis2" plot( sqrt( nrm_counts[ gene, ] ), pca$x %*% pca$rotation[gene,] + pca$center[gene] ) ``` Histograms of this ```{r} hist( sqrt( nrm_counts[ gene, ] ) ) hist( pca$x %*% pca$rotation[gene,] + pca$center[gene] ) ``` UMAP coloured by expression of Meis2 ```{r} ggplot( data.frame( UMAP = my_umap, Meis2 = nrm_counts["Meis2",] ) ) + geom_point( aes( x=UMAP.1, y=UMAP.2, col=Meis2 ), size=.4 ) + coord_fixed() + scale_color_gradientn( colours=rje::cubeHelix(100), trans="sqrt" ) ``` ```{r} ggplot( data.frame( UMAP = my_umap, Meis2 = ( pca$x %*% pca$rotation["Meis2",] + pca$center["Meis2"] )^2 ) ) + geom_point( aes( x=UMAP.1, y=UMAP.2, col=Meis2 ), size=.4 ) + coord_fixed() + scale_color_gradientn( colours=rje::cubeHelix(100), trans="sqrt" ) ``` ```{r} plot( sqrt( counts[ gene, ] ), sqrt( ( pca$x %*% pca$rotation[gene,] + pca$center[gene] ) * colSums(counts) ) ) ``` ### Doublet detection ```{r} cells1 <- sample( colnames(counts), 1000 ) cells2 <- sample( colnames(counts), 1000 ) dblt_counts <- counts[,cells1] + counts[,cells2] nrm_dblt_counts <- t( t(dblt_counts) / colSums(dblt_counts) ) pca_dblts <- as.matrix( t(pca$rotation) %*% ( sqrt(nrm_dblt_counts[informative_genes,]) - pca$center ) ) nn <- RANN::nn2( rbind( t(pca_dblts), pca$x ), pca$x, 20 ) dblt_score <- rowSums( nn$nn.idx < 1000 ) table( dblt_score ) plot( my_umap, cex=.1, asp=1, col = 1 + ( dblt_score > 5) ) ``` I have selected a few points with the mouse (using Sleepwalk's lasso feature). These then appear in the environment as a variable called `selPoints`. To be able to knit this document non-interactively, I list the selected point here: ```{r include=FALSE} selPoints <- c(1L, 4L, 5L, 7L, 14L, 16L, 26L, 28L, 29L, 32L, 34L, 35L, 37L, 45L, 60L, 61L, 69L, 71L, 76L, 85L, 95L, 98L, 104L, 107L, 111L, 112L, 113L, 114L, 121L, 122L, 123L, 124L, 126L, 127L, 128L, 129L, 140L, 144L, 148L, 154L, 159L, 164L, 178L, 182L, 189L, 193L, 199L, 206L, 221L, 224L, 228L, 229L, 230L, 234L, 238L, 245L, 247L, 248L, 258L, 259L, 262L, 263L, 269L, 270L, 273L, 276L, 282L, 287L, 303L, 310L, 313L, 315L, 330L, 333L, 342L, 344L, 353L, 365L, 369L, 371L, 387L, 389L, 391L, 393L, 394L, 426L, 434L, 437L, 440L, 451L, 452L, 455L, 457L, 466L, 467L, 474L, 476L, 489L, 490L, 491L, 495L, 498L, 507L, 519L, 522L, 530L, 534L, 539L, 542L, 545L, 548L, 561L, 564L, 565L, 568L, 576L, 582L, 589L, 592L, 593L, 612L, 614L, 615L, 623L, 628L, 630L, 632L, 636L, 646L, 647L, 650L, 652L, 665L, 669L, 675L, 679L, 684L, 686L, 690L, 696L, 698L, 710L, 717L, 718L, 728L, 731L, 738L, 744L, 747L, 750L, 761L, 764L, 768L, 773L, 774L, 775L, 777L, 778L, 779L, 780L, 789L, 790L, 795L, 796L, 803L, 804L, 806L, 817L, 820L, 823L, 826L, 827L, 835L, 844L, 849L, 854L, 862L, 863L, 867L, 875L, 876L, 878L, 890L, 891L, 897L, 914L, 923L, 925L, 928L, 930L, 937L, 948L, 950L, 956L, 960L, 970L, 971L, 976L, 983L, 990L, 991L, 999L, 1008L, 1023L, 1027L, 1029L, 1033L, 1036L, 1040L, 1042L, 1047L, 1051L, 1054L, 1055L, 1056L, 1065L, 1069L, 1080L, 1081L, 1104L, 1106L, 1107L, 1112L, 1114L, 1117L, 1119L, 1123L, 1125L, 1126L, 1129L, 1130L, 1140L, 1142L, 1162L, 1164L, 1167L, 1173L, 1185L, 1198L, 1202L, 1204L, 1208L, 1209L, 1212L, 1214L, 1217L, 1223L, 1231L, 1237L, 1239L, 1240L, 1271L, 1272L, 1273L, 1277L, 1288L, 1292L, 1307L, 1312L, 1316L, 1318L, 1322L, 1328L, 1330L, 1349L, 1364L, 1374L, 1414L, 1415L, 1418L, 1422L, 1424L, 1426L, 1427L, 1433L, 1435L, 1448L, 1449L, 1454L, 1459L, 1464L, 1465L, 1468L, 1470L, 1472L, 1478L, 1483L, 1485L, 1500L, 1501L, 1505L, 1506L, 1508L, 1511L, 1512L, 1516L, 1518L, 1519L, 1523L, 1529L, 1535L, 1536L, 1543L, 1544L, 1547L, 1550L, 1553L, 1557L, 1563L, 1567L, 1570L, 1578L, 1580L, 1584L, 1585L, 1586L, 1587L, 1590L, 1591L, 1599L, 1600L, 1602L, 1607L, 1611L, 1612L, 1613L, 1652L, 1653L, 1654L, 1660L, 1664L, 1671L, 1676L, 1678L, 1679L, 1687L, 1697L, 1698L, 1700L, 1705L, 1710L, 1715L, 1717L, 1720L, 1722L, 1728L, 1735L, 1737L, 1747L, 1751L, 1758L, 1762L, 1769L, 1775L, 1781L, 1788L, 1789L, 1793L, 1797L, 1798L, 1801L, 1819L, 1823L, 1824L, 1832L, 1837L, 1840L, 1841L, 1844L, 1849L, 1856L, 1857L, 1859L, 1867L, 1877L, 1880L, 1884L, 1901L, 1904L, 1911L, 1914L, 1916L, 1917L, 1918L, 1919L, 1923L, 1924L, 1927L, 1935L, 1940L, 1941L, 1947L, 1949L, 1950L, 1951L, 1953L, 1954L, 1966L, 1974L, 1980L, 1991L, 1993L, 1998L, 2002L, 2003L, 2005L, 2006L, 2010L, 2029L, 2033L, 2044L, 2046L, 2048L, 2052L, 2057L, 2061L, 2063L, 2067L, 2080L, 2099L, 2100L, 2103L, 2104L, 2105L, 2109L, 2110L, 2112L, 2113L, 2117L, 2120L, 2122L, 2128L, 2131L, 2146L, 2147L, 2148L, 2155L, 2156L, 2158L, 2163L, 2168L, 2174L, 2176L, 2185L, 2206L, 2208L, 2219L, 2221L, 2226L, 2233L, 2236L, 2241L, 2242L, 2243L, 2247L, 2248L, 2250L, 2251L, 2255L, 2257L, 2287L, 2297L, 2300L, 2302L, 2306L, 2316L, 2322L, 2324L, 2325L, 2329L, 2339L, 2352L, 2354L, 2356L, 2360L, 2362L, 2365L, 2372L, 2381L, 2385L, 2395L, 2398L, 2399L, 2400L, 2402L, 2404L, 2408L, 2431L, 2446L, 2453L, 2455L, 2458L, 2460L, 2465L, 2474L, 2482L, 2483L, 2486L, 2491L, 2493L, 2499L, 2510L, 2512L, 2514L, 2517L, 2520L, 2521L, 2526L, 2530L, 2531L, 2534L, 2535L, 2541L, 2549L, 2555L, 2564L, 2581L, 2584L, 2585L, 2586L, 2587L, 2594L, 2599L, 2602L, 2612L, 2613L, 2615L, 2627L, 2631L, 2634L, 2637L, 2642L, 2643L, 2648L, 2655L, 2665L, 2666L, 2668L, 2675L, 2684L, 2685L, 2691L, 2694L, 2697L, 2700L, 2701L, 2702L, 2703L, 2711L, 2714L, 2715L, 2716L, 2717L, 2718L, 2722L, 2727L, 2730L, 2739L, 2742L, 2743L, 2745L, 2749L, 2763L, 2766L, 2772L, 2779L, 2782L, 2792L, 2795L, 2796L, 2820L, 2822L, 2823L, 2835L, 2838L, 2840L, 2841L, 2852L, 2854L, 2860L, 2863L, 2865L, 2867L, 2869L, 2874L, 2875L, 2877L, 2878L, 2879L, 2893L, 2894L, 2895L, 2896L, 2898L, 2900L, 2901L, 2903L, 2909L, 2910L, 2914L, 2916L, 2924L, 2928L, 2930L, 2931L, 2934L, 2937L, 2939L, 2941L, 2945L, 2947L, 2960L, 2972L, 2973L, 2974L, 2976L, 2977L, 2981L, 2982L, 2992L, 2994L, 2995L, 2998L, 2999L, 3002L, 3003L, 3013L, 3015L, 3020L, 3023L, 3026L, 3028L, 3033L, 3042L, 3049L, 3051L, 3053L, 3061L, 3064L, 3066L, 3073L, 3087L, 3095L, 3098L, 3101L, 3112L, 3115L, 3123L, 3133L, 3138L, 3141L, 3145L, 3146L, 3153L, 3156L, 3164L, 3167L, 3170L, 3171L, 3184L, 3194L, 3195L, 3199L, 3209L, 3216L, 3218L, 3219L, 3223L, 3226L, 3228L, 3231L, 3234L, 3237L, 3238L, 3240L, 3244L, 3250L, 3252L, 3253L) ``` ```r selPoints <- c(1L, 4L, 5L, 7L, 14L, 16L, 26L, 28L, 29L, 32L, 34L, 35L, 37L, 45L, 60L, 61L, 69L, 71L, 76L, 85L, 95L, 98L, 104L, 107L, 111L, ... 3228L, 3231L, 3234L, 3237L, 3238L, 3240L, 3244L, 3250L, 3252L, 3253L) ``` This is the structure that I have selected with the mouse: ```{r} selPoints_bool <-seq.int(ncol(nrm_counts)) %in% selPoints plot( my_umap, asp=1, col=1+selPoints_bool ) ``` What genes are special among these cells? I like to use AUROC for this, as it is non-parametric ```{r} library( pROC ) aucs <- sapply( informative_genes, function(g) auc( roc( selPoints_bool, nrm_counts[g,], algorithm=3 ) ) ) head( sort( aucs, decreasing = TRUE ), 40 ) ``` According to the cheat sheet I got from Kevin, Meis2, Lhx2 and Lhx9 are markers for the precursor cells for the cerebellar nuclei. Let's check out the top gene, Meis2 ```{r fig.height=6,fig.width=7} plot( colSums(counts), jitter(counts["Meis2",]), log="x", cex=.3, col = 1+selPoints_bool ) ``` In this plot, the UMI counts for the gene of interest are on the y axis and the UMI totals for the cell on the x axis. This allows me to distinguish "informative" zeroes (bottom right) from zeroes due to low coverage (bottom left). I have also added vertical jitter to reduce overplotting. The red points are the cells in the selected structure. I prefer this plot over the usual violin plot: ```{r} ggplot( data.frame( selected = selPoints_bool, expr = sqrt(nrm_counts["Meis2",]) ) ) + geom_violin(aes(x=selected, y=expr)) ``` Nevertheless, the violin plot helps understanding what the AUROC criterion does and how the ROC curver for the gene is derived. ```{r} plot( roc( selPoints_bool, nrm_counts["Meis2",] ) ) ``` I next plot the gene in the UMAP embedding. I highlight all cells with a least two UMI counts for the gene. ```{r} plot( my_umap$layout, col = 1 + ( counts["Meis2",] >= 2 ), asp=1, cex=.1 ) ``` Note how the tip of the structure does not seem to be part of whatever this gene marks. Also note that the boundary between the black tip and the red main part do not correspond to the Louvain cluster boundary. Was 2 UMI counts really a good cut-off? Our [LinkedCharts](https://anders-biostat.github.io/linked-charts/rlc/) library allows to add a sigmoid colour slider to the scatter plot. This can help here: ```{r eval=FALSE} library( rlc ) openPage( useViewer = FALSE ) lc_scatter( dat( x = my_umap$layout[,1], y = my_umap$layout[,2], colourValue = nrm_counts["Meis2",], size = 1 ), id = "scatter") lc_colourSlider( chart="scatter" ) ``` ![](rlc1.png) Next, I have selected the long arm with the Sleepwalk lasso: ```{r include=FALSE} selPoints <- c(2L, 3L, 8L, 9L, 10L, 11L, 12L, 13L, 15L, 17L, 18L, 21L, 22L, 24L, 25L, 30L, 33L, 36L, 40L, 42L, 43L, 44L, 49L, 50L, 53L, 54L, 58L, 59L, 66L, 68L, 74L, 78L, 79L, 81L, 84L, 90L, 91L, 96L, 101L, 102L, 103L, 105L, 106L, 108L, 110L, 115L, 117L, 131L, 135L, 138L, 143L, 145L, 146L, 151L, 153L, 157L, 166L, 167L, 169L, 170L, 175L, 176L, 181L, 183L, 184L, 185L, 186L, 188L, 190L, 194L, 198L, 201L, 205L, 208L, 210L, 211L, 212L, 213L, 214L, 215L, 216L, 217L, 219L, 220L, 223L, 226L, 227L, 232L, 233L, 237L, 239L, 240L, 241L, 243L, 246L, 249L, 251L, 252L, 253L, 254L, 255L, 256L, 257L, 261L, 265L, 267L, 271L, 272L, 274L, 279L, 280L, 281L, 283L, 284L, 286L, 288L, 289L, 290L, 292L, 295L, 296L, 297L, 299L, 300L, 302L, 305L, 307L, 308L, 317L, 318L, 320L, 322L, 323L, 325L, 328L, 329L, 334L, 339L, 349L, 354L, 355L, 356L, 357L, 358L, 360L, 362L, 372L, 374L, 379L, 380L, 381L, 382L, 383L, 384L, 385L, 386L, 388L, 397L, 401L, 403L, 404L, 405L, 406L, 407L, 410L, 411L, 412L, 413L, 415L, 419L, 420L, 422L, 423L, 424L, 430L, 431L, 433L, 435L, 436L, 442L, 443L, 446L, 449L, 450L, 453L, 454L, 459L, 462L, 463L, 465L, 471L, 477L, 479L, 483L, 484L, 488L, 494L, 496L, 501L, 503L, 510L, 513L, 515L, 517L, 518L, 520L, 525L, 527L, 531L, 532L, 533L, 536L, 543L, 544L, 546L, 547L, 552L, 556L, 557L, 559L, 560L, 562L, 563L, 569L, 570L, 571L, 572L, 573L, 577L, 579L, 581L, 583L, 584L, 585L, 590L, 594L, 595L, 598L, 599L, 600L, 601L, 604L, 606L, 607L, 608L, 609L, 610L, 613L, 616L, 617L, 618L, 621L, 622L, 625L, 631L, 633L, 634L, 635L, 637L, 638L, 639L, 640L, 641L, 643L, 644L, 651L, 655L, 658L, 659L, 660L, 664L, 672L, 673L, 674L, 676L, 677L, 683L, 687L, 692L, 694L, 697L, 699L, 701L, 707L, 711L, 712L, 713L, 714L, 715L, 724L, 726L, 727L, 734L, 735L, 736L, 737L, 740L, 743L, 745L, 746L, 751L, 753L, 754L, 756L, 759L, 767L, 769L, 770L, 771L, 772L, 781L, 784L, 786L, 788L, 797L, 800L, 801L, 802L, 807L, 808L, 810L, 811L, 814L, 815L, 816L, 821L, 829L, 830L, 832L, 833L, 836L, 839L, 840L, 841L, 843L, 846L, 847L, 848L, 850L, 855L, 859L, 864L, 865L, 868L, 870L, 872L, 874L, 880L, 881L, 884L, 886L, 887L, 889L, 892L, 894L, 898L, 899L, 902L, 904L, 906L, 909L, 910L, 912L, 913L, 915L, 916L, 920L, 921L, 922L, 926L, 927L, 929L, 931L, 932L, 933L, 934L, 935L, 938L, 940L, 942L, 947L, 952L, 954L, 955L, 961L, 962L, 963L, 965L, 966L, 967L, 969L, 973L, 975L, 978L, 979L, 980L, 981L, 982L, 984L, 986L, 988L, 994L, 995L, 997L, 1001L, 1004L, 1005L, 1006L, 1011L, 1012L, 1013L, 1019L, 1024L, 1026L, 1030L, 1031L, 1034L, 1037L, 1038L, 1039L, 1041L, 1049L, 1052L, 1057L, 1060L, 1063L, 1070L, 1073L, 1076L, 1082L, 1084L, 1085L, 1088L, 1090L, 1091L, 1092L, 1094L, 1096L, 1097L, 1101L, 1102L, 1103L, 1105L, 1108L, 1113L, 1116L, 1118L, 1120L, 1121L, 1124L, 1127L, 1132L, 1133L, 1134L, 1135L, 1141L, 1143L, 1145L, 1146L, 1149L, 1153L, 1154L, 1155L, 1156L, 1160L, 1161L, 1163L, 1165L, 1168L, 1170L, 1174L, 1177L, 1178L, 1182L, 1183L, 1186L, 1189L, 1190L, 1193L, 1195L, 1199L, 1201L, 1205L, 1207L, 1213L, 1216L, 1222L, 1225L, 1226L, 1227L, 1229L, 1232L, 1235L, 1238L, 1241L, 1244L, 1245L, 1246L, 1247L, 1248L, 1249L, 1250L, 1253L, 1256L, 1259L, 1260L, 1262L, 1264L, 1265L, 1267L, 1268L, 1274L, 1278L, 1281L, 1283L, 1284L, 1286L, 1287L, 1289L, 1295L, 1296L, 1297L, 1298L, 1301L, 1302L, 1304L, 1305L, 1306L, 1311L, 1314L, 1317L, 1320L, 1321L, 1323L, 1324L, 1325L, 1326L, 1331L, 1334L, 1335L, 1337L, 1339L, 1340L, 1343L, 1344L, 1351L, 1356L, 1357L, 1361L, 1363L, 1365L, 1366L, 1367L, 1369L, 1370L, 1371L, 1372L, 1376L, 1377L, 1379L, 1380L, 1382L, 1384L, 1392L, 1393L, 1396L, 1399L, 1400L, 1403L, 1404L, 1406L, 1407L, 1409L, 1411L, 1413L, 1419L, 1420L, 1421L, 1423L, 1429L, 1432L, 1434L, 1436L, 1439L, 1441L, 1443L, 1445L, 1446L, 1447L, 1450L, 1451L, 1452L, 1453L, 1458L, 1460L, 1461L, 1463L, 1466L, 1469L, 1471L, 1475L, 1476L, 1477L, 1480L, 1482L, 1487L, 1490L, 1491L, 1492L, 1499L, 1504L, 1507L, 1520L, 1521L, 1522L, 1524L, 1525L, 1526L, 1528L, 1530L, 1531L, 1534L, 1540L, 1541L, 1542L, 1545L, 1546L, 1549L, 1555L, 1556L, 1558L, 1564L, 1566L, 1568L, 1569L, 1571L, 1572L, 1575L, 1579L, 1582L, 1583L, 1588L, 1592L, 1593L, 1594L, 1595L, 1596L, 1604L, 1606L, 1609L, 1614L, 1616L, 1619L, 1620L, 1622L, 1623L, 1624L, 1628L, 1630L, 1631L, 1633L, 1634L, 1636L, 1637L, 1642L, 1643L, 1645L, 1646L, 1648L, 1649L, 1657L, 1658L, 1659L, 1661L, 1663L, 1665L, 1667L, 1670L, 1672L, 1673L, 1674L, 1681L, 1682L, 1686L, 1690L, 1691L, 1693L, 1694L, 1695L, 1701L, 1706L, 1707L, 1712L, 1713L, 1718L, 1723L, 1726L, 1732L, 1739L, 1742L, 1746L, 1750L, 1752L, 1754L, 1756L, 1761L, 1763L, 1764L, 1765L, 1767L, 1770L, 1771L, 1772L, 1773L, 1774L, 1776L, 1780L, 1782L, 1785L, 1786L, 1790L, 1794L, 1795L, 1796L, 1800L, 1802L, 1806L, 1807L, 1809L, 1811L, 1812L, 1814L, 1816L, 1817L, 1820L, 1821L, 1822L, 1825L, 1826L, 1828L, 1830L, 1831L, 1833L, 1834L, 1835L, 1838L, 1839L, 1843L, 1847L, 1850L, 1852L, 1853L, 1854L, 1860L, 1861L, 1862L, 1863L, 1864L, 1866L, 1868L, 1870L, 1873L, 1874L, 1875L, 1881L, 1886L, 1888L, 1889L, 1890L, 1891L, 1893L, 1895L, 1897L, 1900L, 1902L, 1903L, 1913L, 1922L, 1925L, 1931L, 1933L, 1934L, 1936L, 1942L, 1945L, 1946L, 1948L, 1958L, 1959L, 1962L, 1965L, 1967L, 1970L, 1971L, 1976L, 1979L, 1981L, 1982L, 1983L, 1985L, 1986L, 1987L, 1988L, 1989L, 1996L, 2000L, 2001L, 2007L, 2008L, 2011L, 2013L, 2016L, 2017L, 2023L, 2024L, 2026L, 2027L, 2028L, 2030L, 2036L, 2040L, 2042L, 2043L, 2054L, 2056L, 2060L, 2062L, 2065L, 2066L, 2071L, 2072L, 2073L, 2074L, 2075L, 2076L, 2078L, 2081L, 2084L, 2085L, 2086L, 2088L, 2089L, 2090L, 2095L, 2096L, 2097L, 2098L, 2101L, 2102L, 2108L, 2111L, 2114L, 2119L, 2121L, 2123L, 2125L, 2126L, 2127L, 2129L, 2130L, 2132L, 2133L, 2134L, 2135L, 2136L, 2138L, 2139L, 2142L, 2143L, 2144L, 2152L, 2153L, 2157L, 2162L, 2164L, 2169L, 2172L, 2175L, 2178L, 2183L, 2186L, 2187L, 2188L, 2189L, 2190L, 2193L, 2194L, 2195L, 2197L, 2200L, 2201L, 2211L, 2215L, 2222L, 2227L, 2228L, 2229L, 2232L, 2240L, 2244L, 2245L, 2253L, 2254L, 2256L, 2258L, 2259L, 2260L, 2263L, 2264L, 2265L, 2266L, 2271L, 2276L, 2277L, 2280L, 2282L, 2283L, 2288L, 2289L, 2290L, 2291L, 2292L, 2293L, 2294L, 2295L, 2296L, 2299L, 2305L, 2308L, 2309L, 2310L, 2311L, 2312L, 2318L, 2319L, 2320L, 2331L, 2332L, 2337L, 2338L, 2340L, 2343L, 2344L, 2345L, 2347L, 2349L, 2355L, 2357L, 2359L, 2364L, 2366L, 2367L, 2368L, 2371L, 2377L, 2378L, 2379L, 2380L, 2382L, 2383L, 2384L, 2387L, 2388L, 2390L, 2393L, 2394L, 2403L, 2405L, 2410L, 2411L, 2413L, 2415L, 2421L, 2424L, 2425L, 2427L, 2432L, 2433L, 2438L, 2440L, 2442L, 2443L, 2445L, 2447L, 2449L, 2451L, 2459L, 2461L, 2462L, 2463L, 2464L, 2466L, 2468L, 2470L, 2471L, 2473L, 2475L, 2476L, 2477L, 2478L, 2480L, 2481L, 2488L, 2490L, 2495L, 2497L, 2500L, 2503L, 2507L, 2511L, 2516L, 2522L, 2524L, 2525L, 2527L, 2528L, 2537L, 2538L, 2539L, 2545L, 2550L, 2551L, 2553L, 2554L, 2559L, 2561L, 2562L, 2563L, 2572L, 2573L, 2574L, 2575L, 2576L, 2577L, 2578L, 2580L, 2582L, 2583L, 2588L, 2590L, 2591L, 2592L, 2593L, 2595L, 2597L, 2598L, 2601L, 2603L, 2608L, 2610L, 2614L, 2617L, 2620L, 2622L, 2624L, 2625L, 2629L, 2630L, 2632L, 2636L, 2638L, 2639L, 2640L, 2645L, 2650L, 2651L, 2656L, 2657L, 2659L, 2667L, 2670L, 2673L, 2674L, 2677L, 2678L, 2679L, 2680L, 2681L, 2682L, 2689L, 2692L, 2696L, 2704L, 2705L, 2706L, 2710L, 2712L, 2713L, 2720L, 2721L, 2723L, 2724L, 2725L, 2728L, 2729L, 2731L, 2732L, 2733L, 2736L, 2740L, 2746L, 2748L, 2752L, 2755L, 2756L, 2757L, 2758L, 2759L, 2761L, 2762L, 2765L, 2769L, 2771L, 2774L, 2783L, 2784L, 2787L, 2789L, 2790L, 2793L, 2797L, 2798L, 2799L, 2802L, 2803L, 2805L, 2806L, 2807L, 2814L, 2817L, 2818L, 2821L, 2827L, 2828L, 2836L, 2844L, 2846L, 2853L, 2855L, 2858L, 2866L, 2871L, 2873L, 2876L, 2881L, 2882L, 2884L, 2888L, 2891L, 2899L, 2902L, 2904L, 2905L, 2907L, 2912L, 2918L, 2921L, 2925L, 2926L, 2929L, 2932L, 2935L, 2946L, 2948L, 2949L, 2950L, 2953L, 2954L, 2955L, 2957L, 2959L, 2961L, 2962L, 2964L, 2965L, 2966L, 2967L, 2968L, 2970L, 2971L, 2975L, 2980L, 2983L, 2984L, 2989L, 2991L, 2996L, 2997L, 3000L, 3004L, 3008L, 3010L, 3011L, 3012L, 3014L, 3018L, 3022L, 3024L, 3025L, 3027L, 3029L, 3030L, 3031L, 3032L, 3036L, 3037L, 3039L, 3040L, 3043L, 3044L, 3047L, 3050L, 3055L, 3059L, 3062L, 3063L, 3065L, 3068L, 3071L, 3076L, 3077L, 3079L, 3080L, 3084L, 3088L, 3090L, 3091L, 3093L, 3096L, 3097L, 3100L, 3103L, 3104L, 3106L, 3107L, 3108L, 3110L, 3113L, 3114L, 3116L, 3119L, 3121L, 3124L, 3125L, 3128L, 3131L, 3136L, 3137L, 3139L, 3140L, 3142L, 3144L, 3150L, 3151L, 3152L, 3154L, 3157L, 3158L, 3159L, 3168L, 3169L, 3174L, 3177L, 3179L, 3180L, 3181L, 3182L, 3183L, 3185L, 3196L, 3197L, 3198L, 3200L, 3202L, 3204L, 3205L, 3206L, 3208L, 3215L, 3221L, 3227L, 3230L, 3233L, 3235L, 3236L, 3239L, 3241L, 3247L, 3249L, 3251L, 3254L) ``` ```r selPoints <- c(2L, 3L, 8L, 9L, 10L, 11L, 12L, 13L, 15L, 17L, 18L, 21L, 22L, ... 3236L, 3239L, 3241L, 3247L, 3249L, 3251L, 3254L) ``` These cells: ```{r} selPoints_bool <-seq.int(ncol(nrm_counts)) %in% selPoints plot( my_umap$layout, asp=1, col=1+selPoints_bool ) ``` Check again by AUROC ```{r} aucs <- sapply( informative_genes, function(g) auc( roc( selPoints_bool, nrm_counts[g,] ) ) ) head( sort( aucs, decreasing = TRUE ), 40 ) ``` (This sapply is not well done. I should presort the counts to speed it up.) I spot Foxp2, a marker for Purkinje cells. Where else do we see that one? ```{r eval=FALSE} openPage( useViewer = FALSE ) lc_scatter( dat( x = my_umap$layout[,1], y = my_umap$layout[,2], colourValue = nrm_counts["Foxp2",], size = 1 ), id = "scatter") lc_colourSlider( chart="scatter" ) ``` ![](rlc2.png) Are there other genes that increase in expression along the length of my selected structure? As the structure is elongated in vertical direction on the UMAP embedding, I can find them by checking, for each gene, the correlation of this gene in a cell with the cell's y cordinate in the umap embedding. I calculate the correlation coefficient only using the selected counts Correlate all genes in these cells with y coordinate of UMAP: ```{r} cors <- sapply( informative_genes, function(g) suppressWarnings(cor( nrm_counts[ g, selPoints_bool ], # expression of gene 'g' in the selected cells my_umap$layout[ selPoints, 2 ], # umap y corrdinate of the selected cells method="spearman" ) )) cat("\n") head( sort( cors, decreasing = TRUE ), 40 ) ``` (The `supressWarnings` is only to remove the warnings about genes with zero SD.) Let's plot one of these genes. This time we use the [cubeHelix](https://www.mrao.cam.ac.uk/~dag/CUBEHELIX/) colour palette. ```{r} as_tibble( my_umap$layout ) %>% add_column( expr = nrm_counts[ "Cdkn1c", ] ) %>% ggplot + geom_point(aes( x=V1, y=V2, col=expr ), size=.3 ) + coord_fixed() + scale_color_gradientn( colours=rev(rje::cubeHelix(50))[15:40] ) ``` ### Further work This is just a demosntration. For a real analysis, we should, of course, check more than just one marker for each cell type, see whether they mark the same cell types, look more closely at the gradualy changing genes, and work towards some biological hypotheses. And, not to forget: There are many more samples in the Zeisel data. ### Session Info The versions of R and of all packages: ```{r} sessionInfo() ```