%\VignetteIndexEntry{flowCL package}
%\VignetteKeywords{Preprocessing,statistics}

\documentclass{article}

\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{cite, hyperref}
\usepackage{Sweave}
\usepackage[nolist]{acronym}
\usepackage{url}
\usepackage{verbatim}
\usepackage{etoolbox}
\makeatletter
\preto{\@verbatim}{\topsep=5pt \partopsep=5pt }
\makeatother
\usepackage{layout}
\oddsidemargin = 15pt
\textwidth = 440pt

\title{{\it \textbf{flowCL}}: Semantic labelling of flow cytometric cell populations}
\author{Justin Meskas}

\begin{document}
\setkeys{Gin}{width=1.0\textwidth, height=1.1\textwidth}

\maketitle
\begin{center}
{\tt jmeskas@bccrc.ca}
\end{center}

\textnormal{\normalfont}

\tableofcontents
\newpage

\section{Licensing}

Under the Artistic License, you are free to use and redistribute this software.

\section{Loading the Library}

To install {\it \textbf{flowCL}} type {\it source(``http://bioconductor.org/biocLite.R'')} into R and then type \\{\it biocLite(``flowCL'')}. For more information on installation guidelines, see the Bioconductor and the CRAN websites. Once installed, to load the library, type the following into R:

<<>>=
library("flowCL")
@

\section{Running {\it \textbf{flowCL}}}

\subsection{Introduction}

Given a cell type, one can use a cell ontology to discover which markers are used to uniquely identify that particular cell type. In the reverse situation, how can one find an appropriate cell type for a given phenotype (or list of markers)? {\it \textbf{flowCL}} matches a phenotype to a cell type from the cell ontology. If the match is not unique, then the best alternatives are returned.

Each marker is entered followed by one of the tags \textbf{+}, \textbf{-}, \textbf{hi} or \textbf{lo}. The \textbf{+} stands for {\it has plasma membrane part}, the \textbf{-} stands for {\it lacks plasma membrane part}, the \textbf{hi} stands for {\it has high plasma membrane amount}, and the \textbf{lo} stands for {\it has low plasma membrane amount}.
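
As a minimal sketch of this mapping, the four tags and the ontology relations they stand for can be written as a named character vector in R. The name {\tt tag\_relation} is purely illustrative and is not an object provided by {\it \textbf{flowCL}}:
\begin{small}
<<>>=
# Illustrative lookup only: tag_relation is a hypothetical name,
# not part of the flowCL package.
tag_relation <- c("+"  = "has plasma membrane part",
                  "-"  = "lacks plasma membrane part",
                  "hi" = "has high plasma membrane amount",
                  "lo" = "has low plasma membrane amount")

# Look up the relation that a given tag stands for.
tag_relation[["hi"]]
@
\end{small}
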
For example, CD8\textbf{+} will search for cell types that have the relation {\it has plasma membrane part} with CD8. Synonyms of the tags are also accepted: \textbf{low}, \textbf{dim} and \textbf{- -} are synonyms for \textbf{lo}, and \textbf{high}, \textbf{bri}, \textbf{bright} and \textbf{++} are synonyms for \textbf{hi}.

{\it \textbf{flowCL}} executes queries against the Cell Ontology (CL), available at \url{http://cellontology.org}. The CL file is hosted on a triplestore, i.e., a database for storage and retrieval of Resource Description Framework (RDF) triples. The SPARQL endpoint at \url{http://cell.ctde.net:8080/openrdf-sesame/repositories/CL} is used to execute the SPARQL queries that retrieve the correct matches from the CL. While other SPARQL endpoints can be used, users should be aware that in our case the CL file has been reasoned upon, and the resulting inferred axioms have been added to the triplestore, providing a more complete result set.

\subsection{Archive}

A folder called ``flowCL$\_$results'' will be created in the current directory. In this folder a file called ``listPhenotypes.csv'' is created. This file lists the results of {\it \textbf{flowCL}}, and is the same as what is returned in the \$Table section of the returned data. In addition, tree diagrams are created and saved as ``.pdf'' files; these show the cell dependency of the matched cell types.

The code will gradually build an archive of information in ``$[$current directory$]$/flowCL$\_$results/'' inside the folders ``parents'', ``parents$\_$query'', ``results'' and ``reverse$\_$query''. Accessing this archive is much faster than querying the endpoint every time. However, if the ontology is updated, the code will keep using the now outdated data in these folders. To make sure the results are as up to date as possible, the archive should be reset every once in a while. To save time, a pre-loaded archive can be created from a data file in {\it \textbf{flowCL}}.
To load this archive, enter the following:
\begin{small}
<<>>=
flowCL("Archive")
@
\end{small}
This pre-loaded archive may be outdated. It will be updated by the maintainer, along with the vignette, whenever the ontology is updated. More discussion of this is in the ``Last Comments'' section. The current version was last updated on March $18^{th}$ 2014. To check that the version is up to date run:
\begin{small}
<<>>=
flowCL("Date")
@
\end{small}
If the returned date matches the one above, then the pre-loaded archive is the most up to date. Please note that the user can skip this step altogether, in which case {\it \textbf{flowCL}} builds its archive from live queries and so always reflects the current ontology. The downside is that every new query takes longer while the archive is being built.

\subsection{Simple Example}

The simplest use of {\it \textbf{flowCL}} is to query a single phenotype:
\begin{small}
<<>>=
# Query a single phenotype
Res <- flowCL("CCR7+CD45RA+")
# Summary table of the results
Res$Table
# Plotting information for this phenotype, then the tree diagram
tmp <- Res$'CCR7+CD45RA+'
plot(tmp[[1]], nodeAttrs=tmp[[2]], edgeAttrs=tmp[[3]], attrs=tmp[[4]])
@
\end{small}

\begin{figure}[htbp]
\begin{center}
\includegraphics{flowCL-CCR7+CD45RA+_with_visual}
\caption{Tree diagram of the cell hierarchy when querying with phenotype: ``CCR7+CD45RA+''}
\label{fig:one}
\end{center}
\end{figure}

A list containing $N + 5$ elements is returned into Res, where $N$ is the number of phenotypes queried; in this case $N = 1$. Each of these $N$ elements contains information for plotting the results. The other five elements hold the cell labels (\$Cell\_Labels), the matching markers in list form (\$Marker\_Groups) and in bracket form (\$Markers), the ranking scores (\$Ranking) and a table (\$Table). The cell labels element lists the cell labels in decreasing order of ranking score, in a form that is easily extracted and used by other R packages and functions. \$Marker\_Groups and \$Markers list the markers that were queried and the markers that define each matched cell type.
In \$Markers these markers are displayed in the form \{ \textbf{A} \} \textbf{B} ( \textbf{C} ) [ \textbf{D} ]. \textbf{A} and \textbf{B} together make up the markers input for the query. \textbf{B}, \textbf{C} and \textbf{D} together are the markers that make up the definition of the particular cell type. \textbf{C} lists the markers that were part of the experiment but not part of the query, while \textbf{D} lists all other markers that define the cell type and were not among the experiment markers. \textbf{A} lists the markers in the query input that are not required for that particular cell type.

The table from $Res\$Table$ shows many properties of the phenotype queried. The names of the markers in the ontology are shown, in this case ``C-C chemokine receptor type 7'' and ``receptor-type tyrosine-protein phosphatase C isoform CD45RA'' for CCR7 and CD45RA, respectively. The experiment markers are shown as well. This is a list of all the markers used in the experiment. {\it \textbf{flowCL}} informs the user which additional markers, present in the experiment markers but not in the input markers, would better define a particular cell type. Hence, the user can refine her/his cell population to obtain a more accurately defined population. The table also reports whether the match was successful, that is, whether there are cell types defined by all of the queried markers and only those markers. The entries 1) to 5) in the next six sections of the table correspond to the five cell types that are considered matches.

In Figure \ref{fig:one}, a tree diagram shows the cell hierarchy that depends on both CCR7+ and CD45RA+. There are five cell types that contain both markers. The black arrows denote the ``is a'' relation (e.g. native cell ``is a'' cell). The coloured arrows are the inverse of {\it has/lacks plasma membrane part} or {\it has high/low plasma membrane amount} (e.g. naive regulatory T cell {\it has plasma membrane part} CD45RA).
Each marker is associated with its own colour, which makes it easier for the user to tell which cell type contains which markers. The blue nodes are the {\it has plasma membrane part} markers, while the red nodes are the {\it lacks plasma membrane part} markers. The navy blue nodes are the {\it has high plasma membrane amount} markers and the grey blue nodes are the {\it has low plasma membrane amount} markers. The green nodes are the exact matches, while the beige nodes are the partial matches.

\subsection{Visual-Skip, Reset-Archive and Keep-Archive Example}

The ``VisualSkip'' argument in {\it \textbf{flowCL}} can be used if the user does not want the visual results.
\begin{small}
<<>>=
Res <- flowCL("CCR7+CD45RA+", VisualSkip = TRUE, ResetArch=FALSE, KeepArch=TRUE)
@
\end{small}
This can reduce the computational time. If ``ResetArch'' is set to TRUE, {\it \textbf{flowCL}} first deletes the entire archive and then adds data to a new one each time something is queried. The code is therefore slower immediately after the archive is reset, but becomes faster on average with every query. If ``KeepArch'' is set to FALSE, the archive is removed at the end of the run. This is useful when the user would rather not have files stored on the hard drive.

\subsection{Exact-Match with Ontology-Names and Computational Information}

An example of an exact match is shown with ``CD3+CD4+CD8-CCR7-CD45RA+''.
\begin{small}
<<>>=
Res <- flowCL("CD3+CD4+CD8-CCR7-CD45RA+", CompInfo = TRUE, OntolNamesTD = TRUE)
tmp <- Res$'CD3+CD4+CD8-CCR7-CD45RA+'
plot(tmp[[1]], nodeAttrs=tmp[[2]], edgeAttrs=tmp[[3]], attrs=tmp[[4]])
@
\end{small}

\begin{figure}[htbp]
\begin{center}
\includegraphics{flowCL-CCR7+CD45RA+CD8+}
\caption{Tree diagram of the cell hierarchy when querying with phenotype: ``CD3+CD4+CD8-CCR7-CD45RA+''}
\label{fig:two}
\end{center}
\end{figure}

The tree diagram in Figure \ref{fig:two} has only one exact match.
The ``OntolNamesTD'' option makes the marker nodes display their ontology names in the tree diagrams instead of their short names (e.g. ``T cell receptor co-receptor CD8'' is displayed instead of ``CD8''). The default for ``OntolNamesTD'' is FALSE. The ``CompInfo'' option allows the user to view computational updates while the code is running. The default for ``CompInfo'' is FALSE.

\subsection{More Examples}

The following two examples show how to use the results from {\it \textbf{flowCL}}.
\begin{small}
<<>>=
# Query two phenotypes at once
Res <- flowCL(c("CD3-CD19-CD20-CD14-HLA-DR-CD16-CD56hi",
                "CD3+CD4-CD8+CCR7-CD45RA+"))
# Tree diagram for the first phenotype
tmp <- Res$'CD3-CD19-CD20-CD14-HLA-DR-CD16-CD56hi'
plot(tmp[[1]], nodeAttrs=tmp[[2]], edgeAttrs=tmp[[3]], attrs=tmp[[4]])
# Cell labels, ranking scores and bracket-form markers for the first phenotype
Res$Cell_Labels$'CD3-CD19-CD20-CD14-HLA-DR-CD16-CD56hi'
Res$Ranking$'CD3-CD19-CD20-CD14-HLA-DR-CD16-CD56hi'
Res$Markers$'CD3-CD19-CD20-CD14-HLA-DR-CD16-CD56hi'
@
\end{small}

\begin{figure}[htbp]
\begin{center}
\includegraphics{flowCL-HIPC1}
\caption{Tree diagram of the cell hierarchy when querying with phenotype: ``CD3-CD19-CD20-CD14-HLA-DR-CD16-CD56hi''}
\label{fig:three}
\end{center}
\end{figure}

The query ``CD3-CD19-CD20-CD14-HLA-DR-CD16-CD56hi'' returned two cell types, ``CD16-\\negative, CD56-bright natural killer cell'' and ``decidual natural killer cell''. The ranking scores of these cell types are 0.8750000 and 0.5714286, respectively. Each score is calculated by counting the queried markers that are also part of the cell type, subtracting a penalty, and dividing the result by the number of markers that define that cell type. Looking at the bracket form from \$Markers, the ``CD16-negative, CD56-bright natural killer cell'' cell type has a score of $(7-0)/8$. The ``decidual natural killer cell'' has a score of $(6-2)/7$. A penalty of 2 is subtracted for every marker in the \{ \} section, that is, for every queried marker that does not define that cell type. In Figure \ref{fig:three}, neither cell type has a green node.
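
The ranking score described above can also be written out explicitly. The expression below is a reading of that description rather than a formula taken from the package source: it reuses the groups \textbf{A}, \textbf{B}, \textbf{C} and \textbf{D} of the bracket form, $|\cdot|$ denotes the number of markers in a group, and the denominator assumes that \textbf{B}, \textbf{C} and \textbf{D} together make up the full definition of the cell type.
\begin{equation*}
\text{score} = \frac{|B| - 2\,|A|}{|B| + |C| + |D|}
\end{equation*}
For the two cell types above, with one penalised marker for the second cell type (implied by its penalty of 2), this gives
\begin{align*}
\text{CD16-negative, CD56-bright natural killer cell:} \quad & \frac{7 - 2 \cdot 0}{8} = 0.875, \\
\text{decidual natural killer cell:} \quad & \frac{6 - 2 \cdot 1}{7} \approx 0.571.
\end{align*}
These are the two ranking scores behind the node colours in Figure \ref{fig:three}, where no node is green.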
This is because neither has a ranking score of 1. This example includes a ``hi'' marker, which appears as a navy blue node.
\begin{small}
<<>>=
# Tree diagram and results for the second phenotype
tmp <- Res$'CD3+CD4-CD8+CCR7-CD45RA+'
plot(tmp[[1]], nodeAttrs=tmp[[2]], edgeAttrs=tmp[[3]], attrs=tmp[[4]])
Res$Cell_Labels$'CD3+CD4-CD8+CCR7-CD45RA+'
Res$Ranking$'CD3+CD4-CD8+CCR7-CD45RA+'
Res$Markers$'CD3+CD4-CD8+CCR7-CD45RA+'
@
\end{small}

\begin{figure}[htbp]
\begin{center}
\includegraphics{flowCL-HIPC2}
\caption{Tree diagram of the cell hierarchy when querying with phenotype: ``CD3+CD4-CD8+CCR7-CD45RA+''}
\label{fig:four}
\end{center}
\end{figure}

The query ``CD3+CD4-CD8+CCR7-CD45RA+'' returned only one cell type, ``effector CD8-\\positive, alpha-beta T cell''. The ranking score for this cell type is 1. Therefore, every marker in the query was used in the definition of the cell type. This is an exact match.

\subsection{Cell Label Example}

In many cases {\it \textbf{flowCL}} will only be used to extract the best possible cell label given a set of markers. The following code shows an example of this:
\begin{small}
<<>>=
x <- "CD3+CD4+CD8-CCR7-CD45RA-"
Res <- flowCL(x)
# First (highest-ranked) cell label for phenotype x
Res$Cell_Labels[[x]][[1]]
@
\end{small}
The user should note that the returned value could be one of several equally good cell labels. To access the other candidate cell labels, the ``1'' in $Res\$Cell\_Labels[[x]][[1]]$ can be changed to a higher index.

\subsection{Last Comments}

\begin{itemize}
\item If the user wants all possible HIPC (Human Immunology Project Consortium) results and all HIPC tree diagrams, then the command ``flowCL(``'')'' or ``flowCL(``HIPC'')'' will achieve this. All the defaults are set to run the HIPC phenotypes and their individual markers. Running the full code with no archive can take up to 20 minutes.
\item There are two packages that {\it \textbf{flowCL}} depends upon. The first is {\it \textbf{SPARQL}}, which is used when retrieving cell ontology data. The other is {\it \textbf{Rgraphviz}}, which is used when creating the tree diagrams.
\item The user should note that, since the ontology can and will be updated, the R results shown in this vignette, both the figures and the static text, may not match a live run of {\it \textbf{flowCL}}. The data files containing a pre-loaded archive can also be outdated. The maintainer will try to keep the vignette and data files as up to date as possible. The data files, as well as the text and figures in the examples from this vignette, describe results from the cell ontology as last updated on March $18^{th}$ 2014.
\end{itemize}

%\bibliographystyle{plain}
%\bibliography{flowCL}

\end{document}