--- title: "Reading in data and data formatting in SPIAT" author: "Yuzhou Feng" date: "`r Sys.Date()`" output: BiocStyle::html_document: self_contained: yes toc_float: true toc_depth: 4 package: "`r pkg_ver('SPIAT')`" bibliography: "`r file.path(system.file(package='SPIAT', 'vignettes'), 'introduction.bib')`" vignette: > %\VignetteIndexEntry{Reading in data and data formatting in SPIAT} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 72 --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` First we load the SPIAT library. ```{r message=FALSE} library(SPIAT) ``` # Reading in data `format_image_to_spe()` is the main function to read in data to SPIAT. `format_image_to_spe()` creates a `SpatialExperiment` object which is used in all subsequent functions. The key data points of interest for SPIAT are cell coordinates, marker intensities and cell phenotypes for each cell. `format_image_to_spe()` has specific options to read in data generated from inForm, HALO, CODEX and cellprofiler. However, we advise pre-formatting the data before input to SPIAT so that accepted by the 'general' option (shown below). This is due to often inconsistencies in the column names or data formats across different versions or as a result of different user options when using the other platforms. ## Reading in data through the 'general' option (RECOMMENDED) Format "general" allows you to input a matrix of intensities (`intensity_matrix`), and a vector of `phenotypes`, which should be in the same order in which they appear in the `intensity_matrix`. They must be of the form of marker combinations (e.g. "CD3,CD8"), as opposed to cell names (e.g. "cytotoxic T cells"), as SPIAT does matching with the marker names. `phenotypes` is an optional parameter and can be set as `NULL` if no phenotypes are available. The user also needs to provide separate vectors with the X and Y coordinates of the cells (`coord_x` and `coord_y`). The cells must be in the same order as in the `intensity_matrix`. If you have `Xmin`, `Xmax`,`Ymin` and `Ymax` columns in the raw data, we advise calculating the average to obtain a single X and Y coordinate, which you can then use as input to `coord_x` and `coord_y`. Specifically, if `intensity_matrix` is available, please make sure the `colnames` of the intensity matrix are the cell IDs as some SPIAT functions (like `identify_bordering_cells()`) require the constructed image object to have cell IDs as `rownames` of the colData and `colnames` of the intensity matrix. If `intensity_matrix` is `NULL`, the function will automatically assign IDs to the cells. Here we use some dummy data to illustrate how to read "general" format. ```{r} # Construct a dummy marker intensity matrix ## rows are markers, columns are cells intensity_matrix <- matrix(c(14.557, 0.169, 1.655, 0.054, 17.588, 0.229, 1.188, 2.074, 21.262, 4.206, 5.924, 0.021), nrow = 4, ncol = 3) # define marker names as rownames rownames(intensity_matrix) <- c("DAPI", "CD3", "CD4", "AMACR") # define cell IDs as colnames colnames(intensity_matrix) <- c("Cell_1", "Cell_2", "Cell_3") # Construct a dummy metadata (phenotypes, x/y coordinates) # the order of the elements in these vectors correspond to the cell order # in `intensity matrix` phenotypes <- c("OTHER", "AMACR", "CD3,CD4") coord_x <- c(82, 171, 184) coord_y <- c(30, 22, 38) general_format_image <- format_image_to_spe(format = "general", intensity_matrix = intensity_matrix, phenotypes = phenotypes, coord_x = coord_x,coord_y = coord_y) ``` The formatted image now contains phenotypes, locations, and marker intensity information of 3 cells. Note that if users want to define cell IDs, the cell IDs should be defined as the colnames of the intensity matrix. The order of the rows of the metadata should correspond to the order of the colnames of the intensity matrix. The function will automatically assign rownames to the `spatialCoords()` and `colData()` of the image (now as a `spatialExperiment` object). Use the following code to inspect the formatted SpatialExperiment object. ```{r} # phenotypes and cell properties (if available) colData(general_format_image) # cell coordinates spatialCoords(general_format_image) # marker intensities assay(general_format_image) ``` ## Reading in data pre-formatted by other software If you prefer to use data directly generated from inForm, HALO, CODEX or cellprofiler, these can be specified by `format` param in `format_image_to_spe()`. We will show examples for the inForm and HALO formats. For reading in input generated with CODEX or cellprofiler see the documentations (`?format_image_to_spe`). ### Reading in data from inForm To read in data from inForm, you need the table file generated by inForm containing the cell IDs, cell locations, phenotypes (if available) and marker intensities. You also need to extract a vector of marker names and marker locations ("Nucleus", "Cytoplasm", or "Membrane"). `format_image_to_spe()` uses the "Cell X Position" and "Cell Y Position" columns and the "Phenotype" column in the inForm raw data. The phenotype of a cell can be a single marker, for example, "CD3", or a combination of markers, such as "CD3,CD4". As a convention, SPIAT assumes that cells marked as "OTHER" in "inForm" refer to cells positive for DAPI but no other marker. The phenotypes must be based on the markers (e.g. CD3,CD4), rather than names of cells (e.g. cytotoxic T cells). The names of the cells (e.g. cytotoxic T cells) can be added later using the `define_celltypes()` function. The following cell properties columns are also required to be present in the inForm input file: Entire Cell Area (pixels), Nucleus Area (pixels), Nucleus Compactness, Nucleus Axis Ratio, and Entire Cell Axis Ratio. If not present in the raw data, these can be columns with NAs. To read in inForm data, you need to specify the following parameters: - `format`: "inForm" - `path`: path to the raw inForm image data file - `markers`: names of markers used in the OPAL staining. These must be in the same order as the marker columns in the input file, and must match the marker names used in the input file. One of the markers must be "DAPI". - `locations`: locations of the markers in cells, either "Nucleus", "Cytoplasm" or "Membrane." These must be in the same order as `markers`. The locations are used to auto-detect the intensity (and dye) columns. A small example of inForm input is included in SPIAT containing dummy marker intensity values and all the other required columns (see below). This example file is just for demonstrating importing a raw data file, later in the [Inspecting the SpaitalExperiment object](#example-data) section we will load a larger preformatted dataset. Users are welcome to use this formatting option (`format = 'inForm'`) if it is closer to the format of their files. ```{r message=FALSE} raw_inform_data <- system.file("extdata", "tiny_inform.txt.gz", package = "SPIAT") markers <- c("DAPI", "CD3", "PD-L1", "CD4", "CD8", "AMACR") locations <- c("Nucleus","Cytoplasm", "Membrane","Cytoplasm","Cytoplasm", "Cytoplasm") # The order is the same as `markers`. formatted_image <- format_image_to_spe(format="inForm", path=raw_inform_data, markers=markers, locations=locations) ``` Alternatively, rather than specifying the `locations`, you can also specify the specific intensity columns with the parameter `intensity_columns_interest` as shown below. ```{r message=FALSE} raw_inform_data <- system.file("extdata", "tiny_inform.txt.gz", package = "SPIAT") markers <- c("DAPI", "CD3", "PD-L1", "CD4", "CD8", "AMACR") intensity_columns_interest <- c( "Nucleus DAPI (DAPI) Mean (Normalized Counts, Total Weighting)", "Cytoplasm CD3 (Opal 520) Mean (Normalized Counts, Total Weighting)", "Membrane PD-L1 (Opal 540) Mean (Normalized Counts, Total Weighting)", "Cytoplasm CD4 (Opal 620) Mean (Normalized Counts, Total Weighting)", "Cytoplasm CD8 (Opal 650) Mean (Normalized Counts, Total Weighting)", "Cytoplasm AMACR (Opal 690) Mean (Normalized Counts, Total Weighting)" ) # The order is the same as `markers`. formatted_image <- format_inform_to_spe(path=raw_inform_data, markers=markers, intensity_columns_interest=intensity_columns_interest) class(formatted_image) # The formatted image is a SpatialExperiment object dim(colData(formatted_image)) dim(assay(formatted_image)) ``` ### Reading in data from HALO To read in data from HALO, you need the table file generated by HALO. The biggest difference between inForm and HALO formats is the coding of the cell phenotypes. While inForm encodes phenotypes as the combination of positive markers (e.g. "CD3,CD4"), HALO uses a binary system where 1 means the cell is positive for the marker and 0 otherwise. `format_image_to_spe()` for "HALO" format collapses HALO encoded phenotypes into an inForm-like format to create the `Phenotype` column. For example, if HALO has assigned a cell a marker status of 1 for CD3 and 1 for CD4, SPIAT will give it the Phenotype "CD3,CD4". Cells that have a marker status of 1 for DAPI but no other marker are given the phenotype "OTHER". `format_image_to_spe()` takes the average of the HALO X min and X max columns for each cell to create the `Cell.X.Position` column. It takes the average of the Y min and Y max to create the `Cell.Y.Position` column. To read in HALO data, you need to specify the following parameters: - `format`: "HALO" - `path`: path to the raw HALO image data file - `markers`: names of markers used in the OPAL staining. These must be in the same order as the marker columns in the input file, and must match the marker names used in the input file. One of the markers must be DAPI. - `locations`: locations of the markers in cells, either "Nucleus", "Cytoplasm" or "Membrane." These must be in the order of the `markers`. The locations are used to auto-detect the intensity (and dye) columns. - `intensity_columns_interest` use if `locations` is not specified. Vector with the names of the columns with the level of each marker. Column names must match the order of the `markers` parameter. - `dye_columns_interest` Use if locations is not specified. Vector of names of the columns with the marker status (i.e. those indicating 1 or 0 for whether the cell is positive or negative for the marker). Column names must match the order of the `markers` parameter. Users can specify the `locations` to auto-detect the columns as shown above for inForm. Alternatively, if users want to specify the columns instead, you can do so with `intensity_columns_interest`, as shown in the example below. Note that then you also must specify `dye_columns_interest`. The following cell properties columns are also required to be present in the HALO input file: Cell Area, Nucleus Area, Cytoplasm Area. If these are not present in the user's data, we recommend adding these columns with NA values. ```{r message=FALSE} raw_halo_data <- system.file("extdata", "tiny_halo.csv.gz", package = "SPIAT") markers <- c("DAPI", "CD3", "PD-L1", "CD4", "CD8", "AMACR") intensity_columns_interest <- c("Dye 1 Nucleus Intensity", "Dye 2 Cytoplasm Intensity", "Dye 3 Membrane Intensity", "Dye 4 Cytoplasm Intensity", "Dye 5 Cytoplasm Intensity", "Dye 6 Cytoplasm Intensity") dye_columns_interest <- c("Dye 1 Positive Nucleus", "Dye 2 Positive Cytoplasm", "Dye 3 Positive Membrane", "Dye 4 Positive Cytoplasm", "Dye 5 Positive Cytoplasm", "Dye 6 Positive Cytoplasm") formatted_image <- format_halo_to_spe( path=raw_halo_data, markers=markers, intensity_columns_interest=intensity_columns_interest, dye_columns_interest=dye_columns_interest) class(formatted_image) # The formatted image is a SpatialExperiment object dim(colData(formatted_image)) dim(assay(formatted_image)) ``` # Inspecting the SpaitalExperiment object ## Structure of a SPIAT SpatialExperiment object {#example-data} In this vignette we will use an inForm data file that's already been formatted for SPIAT with `format_image_to_spe()`, which we can load with `data`. ```{r} data("simulated_image") ``` This is in `SpatialExperiment` format. ```{r} class(simulated_image) ``` This example data has 5 markers and 4951 cells. ```{r} dim(simulated_image) ``` `assay()` stores the intensity level of every marker (rows) for every cell (columns). ```{r} # take a look at first 5 columns assay(simulated_image)[, 1:5] ``` `colData()` stores the phenotype and cell properties. Note that the `sample_id` column was added by `SpatialExperiment` data structure and can be ignored here. ```{r} # take a look at first 5 rows colData(simulated_image)[1:5, ] ``` `spatialCoords()` stores cell coordinates. ```{r} # take a look at first 5 rows spatialCoords(simulated_image)[1:5, ] ``` We can check what phenotypes are there. ```{r} unique(simulated_image$Phenotype) ``` The phenotypes in this example data can be interpreted as follows: - Tumour_marker = cancer cells - Immune_marker1,Immune_marker2 = immune cell type 1 - Immune_marker1,Immune_marker3 = immune cell type 2 - Immune_marker1,Immune_marker2,Immune_marker4 = immune cell type 3 - OTHER = other cell types ## Nomenclature In SPIAT We define as **markers** proteins whose levels where queried by OPAL, CODEX or other platforms. Examples of markers are "AMACR" for prostate cancer cells, "panCK" for epithelial tumour cells, "CD3" for T cells or "CD20" for B cells. The combination of markers results in a specific **cell phenotype**. For example, a cell positive for both "CD3" and "CD4" markers has the "CD3,CD4" **cell phenotype**. **The phenotype has to be strictly formatted in such** **way where each positive marker has to be separated by a comma, with no space in** **between, and the order of the positive markers has to be the same as the order ** **in `assay()`.** Finally, we define a **cell type** as a name assigned by the user to a cell phenotype. For example, a user can name "CD3,CD4" cells as "helper T cells". We would refer to "helper T cells" therefore as a **cell type**. ## Splitting images In the case of large images, or images where there are two independent tissue sections, it is recommended to split images into sections defined by the user. This can be performed with `image_splitter()` after `format_image_to_spe()`. ```{r} split_image <- image_splitter(simulated_image, number_of_splits=3, plot = FALSE) ``` ## Predicting cell phenotypes SPIAT can predict cell phenotypes using marker intensity levels with `predict_phenotypes()`. This can be used to check the phenotypes that have been assigned by inForm and HALO. It can also potentially be used to automate the manual phenotyping performed with inForm/HALO. The underlying algorithm is based on the density distribution of marker intensities. We have found this algorithm to perform best in OPAL data. Further phenotyping methods for other data formats are under development. This algorithm does not take into account cell shape or size, so if these are required for phenotyping, using HALO or inForm or a machine-learning based method is recommended. `predict_phenotypes()` produces a density plot that shows the cutoff for calling a cell positive for a marker. If the dataset includes phenotypes obtained through another software, this function prints to the console the concordance between SPIAT's prediction and pre-defined phenotypes as the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) phenotype assignments. It returns a table containing the phenotypes predicted by SPIAT and the actual phenotypes from inForm/HALO (if available). ```{r, fig.height=6, fig.width = 4} predicted_image <- predict_phenotypes(spe_object = simulated_image, thresholds = NULL, tumour_marker = "Tumour_marker", baseline_markers = c("Immune_marker1", "Immune_marker2", "Immune_marker3", "Immune_marker4"), reference_phenotypes = TRUE) ``` We can use `marker_prediction_plot()` to plot the predicted cell phenotypes and the phenotypes generated from the platforms for comparison. ```{r, fig.width = 9, fig.height=2, out.width="100%"} marker_prediction_plot(predicted_image, marker="Immune_marker1") ``` The plot shows Immune_marker1+ cells in the tissue. On the left are the Immune_marker1+ cells defined by the simulated image and on the right are the Immune_marker1+ cells predicted using SPIAT. Since we know that the simulated phenotypes are the truth, we leave the phenotypes as they are. The next example shows how to replace the original phenotypes with the predicted ones. Note that for this tutorial, we still use the original phenotypes. ```{r, eval=FALSE} predicted_image2 <- predict_phenotypes(spe_object = simulated_image, thresholds = NULL, tumour_marker = "Tumour_marker", baseline_markers = c("Immune_marker1", "Immune_marker2", "Immune_marker3", "Immune_marker4"), reference_phenotypes = FALSE) ``` ```{r, include=FALSE} predicted_image2 <- predict_phenotypes(spe_object = simulated_image, thresholds = NULL, tumour_marker = "Tumour_marker", baseline_markers = c("Immune_marker1", "Immune_marker2", "Immune_marker3", "Immune_marker4"), reference_phenotypes = FALSE) ``` ## Specifying cell types SPIAT can define cell types with the `define_celltypes()` function. By default the new column for cell types is called `Cell.Type`. The cell types can be defined based on `Phenotype` column, as well as other columns. ```{r} formatted_image <- define_celltypes( simulated_image, categories = c("Tumour_marker","Immune_marker1,Immune_marker2", "Immune_marker1,Immune_marker3", "Immune_marker1,Immune_marker2,Immune_marker4", "OTHER"), category_colname = "Phenotype", names = c("Tumour", "Immune1", "Immune2", "Immune3", "Others"), new_colname = "Cell.Type") ``` # You can access the vignettes for other modules of SPIAT here: - [Overview of SPIAT](SPIAT.html) - [Quality control and visualisation](quality-control_visualisation.html) - [Basic analysis](basic_analysis.html) - [Cell colocalisation](cell-colocalisation.html) - [Spatial heterogeneity](spatial-heterogeneity.html) - [Tissue structure](tissue-structure.html) - [Cellular neighborhood](neighborhood.html) # Reproducibility ```{r} sessionInfo() ``` # Author Contributions AT, YF, TY, ML, JZ, VO, MD are authors of the package code. MD and YF wrote the vignette. AT, YF and TY designed the package.