%\VignetteIndexEntry{R packages: CellScore} %\VignetteDepends{Biobase, CellScore, hgu133plus2CellScore} %\VignetteEngine{knitr::knitr} %\VignetteKeywords{CellScore} %\VignettePackage{CellScore} \documentclass[a4paper, 10pt]{article} \usepackage[toc,page]{appendix} \usepackage{hyperref} % \href \usepackage{framed} % framed element/text \usepackage{rotating} % sidewaysfigure \usepackage{pdflscape} % \landscape \usepackage[labelfont=bf]{caption} % Figure label in bold \usepackage{etoolbox} % \ifboolexpr %% new commands for string formating, for consistent and %% systematic change across the entire text %% \pkg{} - for formating R package names \newcommand{\pkg}[1]{{\normalfont\fontseries{b}\selectfont #1}} %% \prog{} - for formating programming language names like R \let\prog=\texttt %% \prog{} - for formating names of R data objects in the text \newcommand{\code}[1]{\texttt{\detokenize{#1}}} %% \vars{} - for new concepts, names of classes ... \newcommand{\vars}[1]{\textit{\detokenize{#1}}} <>= #suppressPackageStartupMessages(library(Biobase)) library(Biobase) pkg.ver <- package.version("CellScore") @ \title{\pkg{CellScore} \Sexpr{pkg.ver}: Evaluation of Cell Identity} \author{Nancy Mah, Katerina Ta\v{s}kova\\ \texttt{nancy.l.mah@googlemail.com, katerina@tashkova.org}} \date{\today} \begin{document} \maketitle %% for the problem of too long words (usualy variable names) %% getting over the text margin \emergencystretch=5em <>= ## Save the current working directory dir.main <- getwd() ## Set the name of the directory in which figures will be saved (if any) dir.figures <- 'figures' ## global chunk options library(knitr) opts_chunk$set( concordance=FALSE, cashe=2, ## cache is only valid with a specific version of R and session info ## cache will be kept for at most a month (re-compute the next month) cache.extra=list(R.version, sessionInfo(), format(Sys.Date(), '%Y-%m') ), autodep=TRUE, fig.path=paste0(dir.figures,"/"), tidy=FALSE, size="small", message=FALSE, warning=FALSE ) options(width=70, promp="R> ", continue="+ ", useFancyQuotes=FALSE) @ \tableofcontents \newpage \section{Introduction} The \pkg{CellScore} package contains functions to evaluate the cell identity of a test sample undergoing a cell transition, given a starting (donor) cell type and a desired target cell type. The evaluation is based upon a scoring system, which uses a set of standard samples of known cell types as the reference set. It combines the benefits of two metrics, cosine similarity of expression profiles and fractions of expressed cell type specific genes, into a single score for the cell identity, called CellScore. The CellScore evaluation has been carried out on a large set of microarray data from one platform (Affymetrix Human Genome U133 Plus 2.0). In principle, the method could be applied to any expression dataset, provided that there are a sufficient number of standard samples and that the data are properly normalized (to account for study-specific and platform-specific effects). \section{Installation} This vignette assumes that you have already installed \prog{R} ($\geq$ \Sexpr{paste(R.version$major, R.version$minor, sep=".")}) and that you have basic working knowledge of \prog{R}. You will additionally need to install the core Bioconductor packages if these have not already been installed: <>= if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install() @ To complete this tutorial, you will to install the \pkg{hgu133plus2CellScore} and \pkg{CellScore} packages from the Bioconductor repository: <>= BiocManager::install(c("hgu133plus2CellScore", "CellScore")) @ \section{Data preparation for analysis with \pkg{CellScore}} In order to use the package functionality, you will need normalized expression data from both the reference samples (from standard cell types) and from the test samples (from engineered cell types). A pre-selected and formatted dataset of reference samples from the Affymetrix Human Genome U133 Plus 2.0 platform is provided as a separate data package (\pkg{hgu133plus2CellScore}). The reference dataset (stored as \code{eset.std}) includes 837 samples covering over 100 tissues or distinct cell types. To illustrate the usage of the package, a test dataset is also distributed with the package. For an explicit description of the format of the data and how to generate your own reference data, please see the Appendix \ref{appendix:inputdata}. \section{Example analysis with \pkg{CellScore}} In this section, we include a full analysis on the provided test dataset to demonstrate the functionality of \pkg{CellScore}. The typical workflow includes the following steps: \begin{enumerate} \item {\it Prepare the data}: A formatted expression dataset from reference cell types profiled with a specific Affymetrix platform is provided as a data package. You will also need to prepare the expression profiles of your test samples in the same way as the reference data, and provide an additional table with the information about the cell transitions that should be evaluated (described in Appendix \ref{appendix:inputdata}). \item {\it Calculate the on/off score}: The first metric of CellScore is based on present/absent probe calls between donor and target cell types. \item {\it Calculate the cosine similarity}: The second metric of CellScore is the cosine similarity between the standard and the test cell types, based on the normalized expression data. \item {\it Generate the CellScore}: This calculates the overall cell identity score of a given test sample and a given cell transition. \item {\it Generate the CellScore report}: The PDF report contains plots and heatmaps of the CellScore metrics to help you visualize the progress/direction of the evaluated cell transitions. \end{enumerate} \subsection{Load the data} <>= options(BIOCINSTALLER_ONLINE_DCF=FALSE) pc_info <- ifelse(grepl(getwd(),"kate"), "Intel Core i7-7500U CPU @ 2.70GHz", "Intel Core i7-5600U CPU @ 2.60GHz") @ Load the packages and standard dataset: <>= library(Biobase) library(CellScore) library(hgu133plus2CellScore) # loads eset.std @ For this vignette, you will need the data from two external files distributed with the \pkg{CellScore} package source: \begin{itemize} \item {\code{eset48.RData:}} An \prog{R} data file that stores \code{eset48}, an \vars{ExpressionSet} object with the normalized gene expression values of 48 samples from engineered (derived) cells representing several cell transitions. \item {\code{cell_change_test.tsv:}} A tab-delimited text file with the information on the cell transitions that should be evaluated; each row depicts a single cell transition. \end{itemize} These data can be loaded with the code below, which also combines the gene expression values of the test samples (\code{eset48}) with the ones of the standard reference samples (\code{eset.std}) in one single \vars{ExpressionSet} object \code{eset}. <>= ## Locate the external data files in the CellScore package rdata.path <- system.file("extdata", "eset48.RData", package="CellScore") tsvdata.path <- system.file("extdata", "cell_change_test.tsv", package="CellScore") if (file.exists(rdata.path) && file.exists(tsvdata.path)) { ## Load the normalized expressions of 48 test samples load(rdata.path) ## Import the cell change info for the loaded test samples cell.change <- read.delim(file=tsvdata.path, sep="\t", header=TRUE, stringsAsFactors=FALSE) print("Content of cell.change") print(cell.change) ## Combine the standards and the test data eset <- combine(eset.std, eset48) print("dim(eset) returns") print(dim(eset)) } @ The particular examples covered in this tutorial (\code{cell.change}) include two cell transitions producing three different derived cell types: induced pluripotent stem cells derived from fibroblasts or keratinocytes (iPS-FIB and iPS-KER, respectively), and partially induced pluripotent stem cells from fibroblast (piPS-FIB). \subsection{Calculate the on/off score} The first CellScore metric we will calculate is the on/off score. This score is based on the fraction of donor markers lost and the fraction of target markers gained by the derived cells in a given cell transition. It can be calculated on a sample level (for each sample individually) or on a group level (for each derived cell type, aggregating the information across all samples). The function call below calculates the on/off scores for each derived cell type listed in \code{cell.change}. The outcome is a list of two data frames. <>= group.OnOff <- OnOff(eset, cell.change, out.put="marker.list") @ <>= <> summary(group.OnOff) @ In this example, we see that as a group, the iPS-KER samples have the best on/off score (high donor marker loss and high target marker gain). <>= head(group.OnOff$scores) @ The OnOff() function also outputs a data frame of marker genes used for the calculation of the on/off score. A data frame of the marker genes can be found here: <>= head(group.OnOff$markers) @ Next, we calculate the on/off score for all samples in the expression dataset. This is provided as a separate step since it can take some time to calculate, depending on how many samples there are in the \code{eset} object. In this case, it was fast and took a few seconds of CPU time on an \Sexpr{pc_info}. <>= individ.OnOff <- OnOff(eset, cell.change, out.put="individual") @ The calculations are done and it is a good idea to save the calculated scores: <>= save(file="OnOffscore.RData", individ.OnOff, group.OnOff) @ \textbf{Things to check at this point}: Looking at the on/off score of individual samples that make up the standards, check if there are any clear outliers. For example, these outliers may be caused by wrong annotation or modifications to the cell line that make them unsuitable as standards. We can either eliminate these samples form the analysis, or keep them but score them for different cell transitions. For this purpose, the values of the \code{category} column in \code{eset@phenoData@data} (the \vars{phenotype data frame}, described in Appendix \ref{appendix:inputdata}) can be set to ``NA'' or ``unknown''. If the category is ``NA'', the corresponding sample will not be scored. If the category is ``unknown'', the corresponding sample will be scored for all available cell transitions specified in \code{cell.change}. For example, look at all samples that are embryonic stem cells (ESC) in the transition from FIB to ESC: <>= sel.transition <- individ.OnOff$scores$start == "FIB" & individ.OnOff$scores$target == "ESC" sel.esc <- grepl("embryonic stem cell", individ.OnOff$scores$cell_type) onoff.sel <- individ.OnOff$scores[sel.esc & sel.transition, c(4,6,7,12,13,19,20,21)] @ Ideally, standard ESC samples in the transition from FIB to ESC should have an on/off score close to 2. However, the scores of the ESC cells ranges from \Sexpr{round(min(onoff.sel$OnOffScore), digits=2)} to \Sexpr{round(max(onoff.sel$OnOffScore), digits=2)}: <>= summary(onoff.sel$OnOffScore) @ These samples are clear outliers and therefore already have been removed from the ESC standards by assigning them to the category ``test''. <>= onoff.sel[order(onoff.sel$OnOffScore)[1:4],] @ Finally, you can plot the group-wise on/off scores in a pyramid barplot (see Figure ~\ref{fig:pyramidbarplot}). The function \code{BarplotOnOff()} also returns the data (a list of two data frames) used for making the barplot: <>= barplot.out <- BarplotOnOff(eset, group.OnOff$scores) @ <>= <> barplot.out @ <>= <> @ The plot in Figure ~\ref{fig:pyramidbarplot} can be easily saved in a PDF format as follows. <>= pdf(file="GroupOnOffScore_Barplot.pdf") <> dev.off() @ \subsection{Calculate the cosine similarity} The second metric for cell scoring is the cosine similarity that can be calculated with the function \code{CosineSimScore()}. First, the expression matrix (containing only standard samples) is filtered for the most variable probes/genes, as defined by the interquartile range (IQR). By default we keep only the top 10\% of the genes ranked by IQR, and this cutoff can be changed with the function argument \code{iqr.cutoff}. Then the mean centroid of the expression values for each standard cell type (defined by \code{eset$general_cell_type}) is calculated. Finally, the cosine similarity is calculated between each centroid and sample. Note that this function can take few minutes to calculate, depending on how many samples there are in the \code{eset} object. So, start the execution and go for a coffee break ... <>= tmp.time <- system.time(cs <- CosineSimScore(eset, cell.change, iqr.cutoff=0.1)) tmp.time @ In our case, the calculation took approximately \Sexpr{round(tmp.time[1], digits=0)} seconds of CPU time on an \Sexpr{pc_info}. The result of the calculations is a list of five data objects: \begin{itemize} \item \code{cs$pdataSub:} A data frame with the phenotype of the reference samples. \item \code{cs$esetSub.IQR:} An \vars{ExpressionSet} object with the gene-filtered expression profiles of the reference samples. \item \code{cs$cosine.general.groups:} A matrix with the values of cosine similarity between all general groups defined by \code{eset$general_cell_type} \item \code{cs$cosine.subgroups:} A matrix with the values of cosine similarity between all sub groups defined by \code{eset$sub_cell_type1} \item \code{cs$cosine.samples:} A matrix with the values of cosine Similarity between all samples, general groups and subgroups \end{itemize} Now we can visualize the cosine similarity values in a heatmap. The function \code{PlotCosineSimHeatmap()} will generate a PDF file. For example, we can plot the cosine similarity between \begin{itemize} \item the mean centroid of all general cell types, annotated by \code{eset$general_cell_type} <>= PlotCosineSimHeatmap(cs$cosine.general.groups, "general groups", width=20, height=20, x=-20, y=3) @ \item the mean centroid of all subgroup cell types, annotated by \code{eset$sub_cell_type1} <>= PlotCosineSimHeatmap(cs$cosine.subgroups, "subgroups", width=14, height=14, x=-14, y=3) @ \item all samples and subgroups; note that this plot can be enormous, depending on the number of samples, so you need to adjust the page size to make it viewable. <>= PlotCosineSimHeatmap(cs$cosine.samples, "samples", width=50, height=50, x=-50, y=10) @ \end{itemize} Rather than plotting the cosine similarity for all samples and subgroups, you can pick which standards and test samples to plot. For example, plot the general groups (fibroblast and embryonic stem cells) for the transition from FIB to ESC and the relevant test cells for this cell transition. <>= ## Get the names (IDs) of the sample and their description samples.cs <- colnames(cs$cosine.samples) samples.eset <- sampleNames(eset) ## Select the samples of interest and their corresponding score sel.ips <- eset$category == "test" & eset$sub_cell_type1 %in% c("piPS-FIB", "iPS-FIB") sel <- samples.cs %in% c(c("FIB", "ESC"), samples.eset[sel.ips]) cs.sel <- cs$cosine.samples[sel, sel] ## Rename columns/rownames to more descriptive labels ## as cs.sel is a symetric matrix, these are identical ids <- match(colnames(cs.sel), samples.eset) ids.na <- is.na(ids) ids.rest <- na.omit(ids) new.colnames <- c(colnames(cs.sel)[ids.na], paste(eset$sub_cell_type1[ids.rest], eset$sample_id[ids.rest], sep="_") ) colnames(cs.sel) <- rownames(cs.sel) <- new.colnames @ <>= ## Plot the heatmap PlotCosineSimHeatmap(cs.sel, "piPS", width=10, height=10, x=-10, y=3) @ %% the chunk 'HeatmapCode' will generate a PDF in the working directory, %% move the figure in the figure to dir.figures %% TODO change this function to output the plot in a given folder <>= heatmap.filename <- "CosineSimilarityHeatmap_piPS.pdf" heatmap.newpath <- file.path(dir.figures, heatmap.filename) system(paste("mv", heatmap.filename, heatmap.newpath)) @ <>= <> <> @ %% include the figure in the vignette \begin{figure}[!ht] \includegraphics[width=1.2\linewidth]{{\Sexpr{heatmap.newpath}}} \caption{\textbf{Heatmap of cosine similarity.} The triangular heatmap shows the cosine similarity calculated between the centroids of the donor (fibroblasts; FIB) and desired target cells (embryonic stem cells; ESC). In addition, it shows the similarity between the individual samples of two cell transitions, that is iPS-FIB and piPS-FIB.} \label{fig:cosinesim} \end{figure} Notice that in the last plot, many of the partial iPS cell lines are more similar to FIB rather than to ESC, which is to be expected (see Figure ~\ref{fig:cosinesim}) As an additional sanity check, we can perform a Principal Component Analysis (PCA) and plot the data in the space of the the first two principal components. For example, the PCA plot of the standard reference samples (used to calculate the cosine similarity) should show that the similar cell types cluster closer together. The \code{PcaStandards()} function can be used to generate such a plot. The same allows the samples to be colored according to different properties of the samples: \begin{itemize} \item experiment ID <>= PcaStandards(cs$pdataSub$experiment_id, "Experiment ID", cs$esetSub.IQR) @ \item general cell type (see Figure ~\ref{fig:Pca}): <>= PcaStandards(cs$pdataSub$general_cell_type, "General Cell Type", cs$esetSub.IQR) @ \end{itemize} \clearpage \begin{landscape} <>= <> @ \end{landscape} To identify specific sample (points) on the PCA plot you could plot the figure including a text label for each sample. Samples with missing label (``NA'') will have no text annotation on the plot: <>= pdf(file="StandardSamples_PCA_Labels.pdf", width=28, height=14) PcaStandards(cs$pdataSub$general_cell_type, "General Cell Type", cs$esetSub.IQR) PcaStandards(cs$pdataSub$general_cell_type, "General Cell Type", cs$esetSub.IQR, text.label=cs$pdataSub$general_cell_type) dev.off() @ \subsection{Generate the CellScore} Now that we've calculated the on/off scores and cosine similarities, it's straightforward to calculate the CellScores for every sample in the expression matrix. <>= cellscore <- CellScore(eset, cell.change, individ.OnOff$scores, cs$cosine.samples) @ Finally, we can save all the scores and data in one file for later manipulations. <>= save(file="VignetteResults.RData", eset, # the combined expression dataset group.OnOff, individ.OnOff, # the on/off score values cs, # the cosine similaritiy values cellscore # the CellScore values ) @ \subsection{Generate CellScore reports} In the final step, we will generate some diagnostic plots and save all the plots in a summary report outputted to a PDF file. \subsubsection{Scatter plot of donor-like and target-like scores} In the scatter plot of all test samples, you can quickly identify samples that have not completely transitioned to the desired cell type. Ideally, samples with good transition to the target cell type are located in the upper left-hand corner. Samples with poor transition will retain more donor-like expression and tend to be located on the right-hand side of the plot (see Figure ~\ref{fig:Scatterplot}): The following will generate a two-paneled plot. <>= ScatterplotCellScoreComponents(cellscore, cell.change, FALSE) @ \begin{landscape} <>= <> @ \end{landscape} If there are too many points to plot, you could also choose which transitions to plot. For example, the code below will only plot the partial iPS cells: <>= ScatterplotCellScoreComponents(cellscore, cell.change[2,], FALSE) @ \subsubsection{Boxplot of CellScore values} The \code{BoxplotCellScore} function plots an overview of the CellScore values across all the test samples, grouped by subgroup. The minimum score is about -1.2, and the maximum score is about 1.2 (see Figure ~\ref{fig:Boxplot}): <>= BoxplotCellScore(cellscore, cell.change) @ \subsubsection{Rug plot of CellScore values} The \code{RugplotCellScore()} function generates a rug plot of the CellScore values by experiment, and the samples will be colored by a secondary property from the \vars{phenotype data frame}. In this case, we choose to color the samples by the values in the \code{transition_induction_method} column. (Figure ~\ref{fig:Rugplot}). <>= secondary.property <- "transition_induction_method" RugplotCellScore(cellscore, cell.change, secondary.property) @ \begin{landscape} <>= <> @ \end{landscape} \subsubsection{The summary report} Finally, the last plotting function will generate a CellScore report for each study and cell transition that has been defined in the \code{cell.change} data frame. As the report consists of many pages, it is best to simply plot everything to a single PDF file. <>= pdf(file="CellScoreReport_PerTransition.pdf", width=7, height=11) CellScoreReport(cellscore, cell.change, group.OnOff$markers, eset) dev.off() @ The report is composed of several plots, including \begin{itemize} \item \vars{scatter plot of the donor- and target-like scores}: Test samples, along with the standards for donor and target, are shown on the scatter plot for easy comparison (see Figure ~\ref{fig:ReportFig1}). \item \vars{density plot of the CellScore values}: The CellScore values of the test samples should be located somewhere between the CellScore values for the donor and target standards (see Figure ~\ref{fig:ReportFig2}, upper panel). \item \vars{rug plot of the CellScore values}: This plot has a closer look at the test samples and the target standards. Dashed vertical lines indicate the number of standard deviations from the mean target CellScore (see Figure ~\ref{fig:ReportFig2}, lower panel). \item \vars{heatmap of the donor and target markers}: It gives a quick overview of the number of marker genes being expressed above detection level (as defined by the present calls in the \vars{calls matrix}) in the test samples (see Figure ~\ref{fig:ReportFig3}). \end{itemize} %% GENERATE A SUBEST-TEST DATA FOR THE REPORT PLOTS <>= # get the test cellscores from valid transitions # defined by cell.change table plot.data <- extractTransitions(cellscore, cell.change) # Extract CellScores of test that should be plotted # on the same page into list plotgroup <- paste(plot.data$experiment_id, plot.data$cxkey.subcelltype, sep="_") temp <- data.frame(plot.data, plotgroup, stringsAsFactors=FALSE) # sort the table, show when the list is made, everything # is already in the right order: # o by target # o by donor # o by study ind <- order(temp$target, temp$donor_tissue, temp$experiment_id) plot.data.ordered <- temp[ind,] tg <- unique(paste(plot.data.ordered$experiment_id, plot.data.ordered$cxkey.subcelltype, sep="_") ) # get test data from plotgroup test.data <- plot.data.ordered[plot.data.ordered$plotgroup %in% tg[1], ] @ %% REPORT PLOT 1 <>= mat <- matrix(c(1,1,2,2), nrow=1, ncol=4, byrow=TRUE) layout(mat) scatterplotDonorTargetTest(test.data, cellscore, FALSE) @ %% REPORT PLOT 2 <>= mat <- matrix(c(1,1,1,2,2,2), nrow=2, ncol=3, byrow=TRUE) layout(mat) rugplotDonorTargetTest(test.data, cellscore) @ %% REPORT PLOT 3 <>= mat <- matrix(c(1,1,1,1,1,1), nrow=2, ncol=3, byrow=TRUE) layout(mat) calls <- assayDataElement(eset, "calls") rownames(calls) <- fData(eset)[, "probe_id"] heatmapOnOffMarkers(test.data, group.OnOff$markers, pData(eset), calls) @ \clearpage \subsection{R session information} <>= sessionInfo() @ \clearpage % Make an appendix \begin{appendices} \section{Specifications for the input datasets} \label{appendix:inputdata} \subsection{Reference dataset} If you have test samples from another microarray platform, you may want to build your own reference data object to suit your own needs. In this case you have to select your own reference samples. This can be as extensive as including many cell types and biological replicates, or as little as only providing some biological replicates for standard (donor and target) and test cell types. You must have at least three samples in each standard (donor/target), and at least one sample from a test cell type. The method will perform better if there are more samples, given that the quality of the samples is reasonable. The advantage of having more samples is to capture a robust signal, which should be representative for all samples of the same comparison. Limiting samples to only a handful of replicates runs the risk of focussing on study-specific effects. The expression data should be transformed into an object of the \vars{ExpressionSet} class (defined in the Biconductor package \pkg{Biobase}). For in-depth description of the \vars{ExpressionSet} class please refer to the \pkg{Biobase} package docmentation. In the following, we will describe only the relevant data structures for \pkg{CellScore}. \begin{enumerate} \item \vars{Calls matrix} with probe IDs in rows, samples in columns; 1=present, 0=absent. The probe IDs should be located in the rownames of the matrix. \item \vars{Normalized expression matrix} with probe IDs in rows, samples in columns. The probe IDs should be located in the rownames of the matrix. This matrix MUST have the same dimensions and row order as the \vars{calls matrix}. \item \vars{Annotation data frame} for the probe IDs in the \vars{calls matrix}. The same probe IDs used in the \vars{calls matrix} MUST be used as the rownames of this data frame, and their order should match the order of the \vars{calls matrix}. There should be no duplicate probe IDs. This data frame must contain the following five columns: \begin{itemize} \item \code{probe_id} \item \code{platform_id} \item \code{gene_symbol} \item \code{gene_name} \item \code{entrezgene_id} \end{itemize} Other columns will be ignored. \item \vars{Phenotype data frame} with information about the samples in the expression matrix. Samples are in rows and columns are attributes of the sample. The rownames of the data frame must be the sample IDs and must exactly match the column names of the \vars{calls matrix} and \vars{normalized expression matrix}. This data frame must contain the following columns: \begin{itemize} \item \code{experiment_id}: This should be a unique identifier for an experiment, for example a GEO experiment ID or an ArrayExpress experiment ID. \item \code{sample_id}: Sample IDs should be unique, such as the GSM accession numbers from GEO. \item \code{platform_id}: Use a unique ID for each platform. \item \code{cell_type}: Use a short text description here. \item \code{category}: Each sample must be assigned to one of ``standard''/``test''/``NA'' \item \code{general_cell_type}: Use an abbreviation to label the general cell type. For example, FIB for fibroblast. \item \code{donor_tissue}: If the sample is a derived cell type, then enter the abbreviation for the donor cell type. If the sample is standard cell type, then enter the donor tissue from which it was isolated. Otherwise, enter ``NA''. \item \code{sub_cell_type1}: The sub-cell type is a compound term of the general cell type and its donor tissue. \end{itemize} Additional columns are optional. The properties in these columns could be used to color the individual samples in the rug plots (Figure ~\ref{fig:Rugplot}). For example, the samples could be colored by \begin{itemize} \item \code{transition_induction_method}, the method used to engineer the derived cell types, or \item \code{donor_cell_body_location}, the anatomical area from which the donor cell was taken. \end{itemize} \end{enumerate} \subsection{Test dataset} The test dataset refers to the set of test samples, that is the samples of experimentally derived cell that we would like to evaluate by means of CellScore. Technically, every sample not included in the reference dataset could be supplied as a test sample. The expression profiles of the test samples should be normalized in the same manner as the reference dataset, and stored in an \vars{ExpressionSet} object in the same manner as the reference dataset. The order of the genes (i.e. probes in the rows) in both (reference and test) data objects must be the same. \subsection{Input expression matrix for \pkg{CellScore} functions} The \vars{ExpressionSet} objects for the standards and the test samples should just be combined to yield one \vars{ExpressionSet} object. \subsection{Input table of cell transitions for \pkg{CellScore} functions} A separate data frame should define the cell transitions to be evaluated, based on the donor and target cell type. This data frame must contain the following columns: \begin{itemize} \item \code{start}: the donor cell type, using the same abbreviations as in the \code{general_cell_type} column of the \vars{phenotype data frame}. \item \code{test}: the engineered cell type, using the same abbreviations as in the \code{sub_cell_type1} column of the \vars{phenotype data frame}. \item \code{target}: the target cell type, using the same abbreviations as in the \code{general_cell_type} column of the \vars{phenotype data frame}. \end{itemize} \end{appendices} \end{document}