%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Luigi Marchionni %% May 12 2013 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % \VignetteIndexEntry{Working with the mammaPrintData package} % \VignetteDepends{Biobase, limma, gdata} % \VignetteKeywords{ExperimentData, Cancer, RNAExpressionData} % \VignettePackage{mammaPrintData} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% Begin Document \documentclass[11pt]{article} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% Preamble \usepackage{setspace} %% For graphics, fonts, figures \usepackage{latexsym} \usepackage{amsmath} \usepackage{amssymb} \usepackage{amsfonts} \usepackage{amsxtra} %% For graphics, fonts, figures \usepackage{graphicx, helvet, comment, color} \usepackage[usenames,dvipsnames]{xcolor} \usepackage{wrapfig, subfigure} \usepackage{vmargin} \usepackage{picinpar, fancyhdr} \usepackage{epsfig} \usepackage{epstopdf} \usepackage[T1]{fontenc} \usepackage[utf8]{inputenc} \usepackage{authblk} %% to control floats \usepackage{float} %% Set margins \usepackage{geometry} %% Activate to begin paragraphs with an empty line \usepackage[parfill]{parskip} %% to make Celsius \usepackage{textcomp} %% For bibliography \usepackage{natbib} %% For URLs formatting \usepackage{url} %%to make hyperlink to html \usepackage{hyperref} %\usepackage[dvips, colorlinks=true]{hyperref} %% For long table floating across pages \usepackage{longtable} %% For page orientation \usepackage{lscape} %%% additional packages \usepackage{Sweave} %%% Sweave option \SweaveOpts{keep.source=TRUE} %%% Document layout \parindent 0in \setpapersize{USletter} \setmarginsrb{0.75truein}{0.5truein}{0.75truein}{0.5truein}{16pt}{30pt}{0pt}{20truept} \setlength{\emergencystretch}{2em} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% New commands for R stuff \newcommand{\R}{{\texttt{R}}} \newcommand{\Bioc}{{\texttt{Bioconductor}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% fancy Sweave \DefineVerbatimEnvironment{Sinput}{Verbatim}{xleftmargin=1em,fontshape=sl,formatcom=\color{MidnightBlue}} %\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=1em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=1em,fontshape=sl,formatcom=\color{OliveGreen}} %\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=1em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=1em,fontshape=sl,formatcom=\color{BrickRed}} %\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=1em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% Title \title{The mammaPrintData package: annotated gene expression data from the Glas and Buyse breast cancer cohorts} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% Authors and Affiliations \author[1]{Luigi Marchionni} \affil[1]{The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% Document \begin{document} \singlespacing \maketitle \tableofcontents \newpage <>= ### setCacheDir("cacheSweave") ### if (! file.exists("./cacheSweave") ) { ### dir.create("./cacheSweave") ### } ### setCacheDir("cacheSweave") options(width=75) options(continue=" ") rm(list=ls()) myDate <- format(Sys.Date(), "%b %d, %Y") @ \section{Overview} This vignette documents the {\R} code used to retrieve from the original sources, annotate, and then assemble into separate \Rclass{RGList} instances the gene expression data of the Glas and Buyse cohorts \cite{Glas2006,Buyse2006}. Such data was used to translate the 70-gene breast cancer prognostic signature \cite{van't2002,Vijver2002} into the MammaPrint assays \cite{Glas2006}, and to independently validate it \cite{Buyse2006}. While the original studies by van't Veer and Van de Vijver and colleagues used the two-colors Agilent/Rosetta 24k array, the redeveloped assays is based on the 1.9k MammaPrint array. For a compete review of these breast cancer studies see Marchionni et al \cite{Marchionni2007b,Marchionni2008a}. Briefly, this document contains the complete {\R} code used to: \begin{enumerate} \item Download the following gene expression data from ArrayExpress \cite{Brazma2003}: \begin{enumerate} \item The Glas data set (ArrayExpress E-TABM-115 series), which combined a large proportion of cases from the original van't Veer and Van de Vijver cohorts, and was used as training set; \item The Buyse data set (ArrayExpress E-TABM-77), which analyzed a European multicenter cohort, and was used as validation set; \end{enumerate} \item Pre-process and normalize all gene expression data; \item Cross-reference the 70-gene list to the new MammaPrint microarray; \item Create the corresponding \Rpackage{limma} package \Rclass{RGList} instances. \end{enumerate} All the files, among those mentioned above, needed to compile this vignette and create the \Rpackage{mammaPrintData} \R package are included inside the \texttt{inst/extdata} directory of the source package. Finally, the {\R} code chunk below was used todownload and install from \Bioc the packages required to prepared the present vignette. if the are not available (e.g \Rpackage{gdata}). <>= ###Get the list of available packages installedPckgs <- installed.packages()[,"Package"] ###Define the list of desired libraries pckgListBIOC <- c("Biobase", "limma", "gdata") ###Install BiocManager if needed if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") ###Load the packages, install them from Bioconductor if needed for (pckg in pckgListBIOC) { if (! pckg %in% installedPckgs) BiocManager::install(pckg) require(pckg, character.only=TRUE) } @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{The 70-gene signature} The list of genes originally identified in the van't Veer study \cite{van't2002} can be obtained from the \href{http://www.nature.com/nature/journal/v415/n6871/suppinfo/415530a.html} {supplementary data section} of the original publication, at the following URL: \begin{itemize} \item \url{http://www.nature.com/nature/journal/v415/n6871/extref/415530a-s9.xls} \end{itemize} The Excel spreadsheet named \texttt{415530a-s9.xls} corresponds to Table S2, and contains the identifiers and annotation for the 231 genes othat proved to be associated with metastatic recurrence at 5 years. According to the information contained in the legend for Table S2 (\href{http://www.nature.com/nature/journal/v415/n6871/extref/415530a-s7.doc} {see``Supplementary Methods'' section}) the optimal set of genes constituting the 70-gene signature can be identified from this table by selecting the top 70 features/reporters with highest absolute correlation coefficients. Below is shown the code chunk used to download the \texttt{415530a-s9.xls} Excel file from the Nature website. This file was part of the Supplementary Information section associated with the original van't Veer's et manuscript. The following code chunk was not run to compile this vignette, which was prepared using the local copies of the data stored in the \texttt{inst/extdata} directory. <>= ###Chunk not executed: files are already included ### Shown here just to show the source of the data dir.create("../extdata/seventyGenes", showWarnings = TRUE, recursive = TRUE) ###Define the url for Supplementary data on the Nature Website url <- "http://www.nature.com/nature/journal/v415/n6871/extref/415530a-s9.xls" ###Dowload the Excel file from Nature download.file(url, destfile="../extdata/seventyGenes/415530a-s9.xls", quiet = FALSE, mode = "w", cacheOK = TRUE) @ Below is the {\R} code chunk used to directly process the \texttt{415530a-s9.xls} Excel file and create the corresponding data.frame containing the features annotation. <>= ###Load the library to process Excell files require(gdata) ###Read the Excel file with the 70-genes signature information for van't Veer dataset myFile <- system.file("./extdata/seventyGene", "415530a-s9.xls", package = "mammaPrintData") gns231 <- read.xls(myFile, skip=0, header=TRUE, stringsAsFactors=FALSE) ###Remove special characters in the colums header ###These are due to white spaces present in the Excel file colnames(gns231) <- gsub("\\.\\.", "", colnames(gns231)) ###Remove GO annotation gns231 <- gns231[, -grep("sp_xref_keyword_list", colnames(gns231))] ###Check the structure of the data.frame str(gns231) @ Below is the {\R} code chunk used to generate a list of identifiers corresponding to the 70-gene and the 231-gene lists. <>= ###Reorder the genes in decreasing order by absolute correlation gns231 <- gns231[order(abs(gns231$correlation), decreasing=TRUE),] ###Select the feature identifiers corresponding to the top 70 genes gns70 <- gns231[1:70,] ###Create a list of features and gene symbols for later use progSig <- list(gns231acc = unique(gns231$accession), gns231name = unique(gns231$gene.name), gns231any = unique(c(gns231$accession, gns231$gene.name)), gns70acc = unique(gns70$accession), gns70name = unique(gns70$gene.name), gns70any = unique(c(gns70$accession, gns70$gene.name)) ) ###Remove empty elements progSig <- lapply(progSig, function(x) x[x!=""]) ###Create a data.frame of correlations with combined accession ###and gene symbols as identifiers gns231Cors <- data.frame(stringsAsFactors=FALSE, ID=c(gns231$gene.name[ gns231$gene.name %in% progSig$gns231any ], gns231$accession), gns231Cors=c(gns231$correlation[ gns231$gene.name %in% progSig$gns231any ], gns231$correlation) ) ####Keep unique only progSig$gns231Cors <- gns231Cors[!duplicated(gns231Cors$ID), ] progSig$gns231Cors <- gns231Cors[gns231Cors$ID != "", ] ###Check the structure of the list just created str(progSig) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Glas and Buyse cohorts: \Rclass{RGList} preparation} The original data sets used to redevelop \cite{Glas2006} and validate \cite{Buyse2006} the MammaPrint assays can be obtained from the ArrayEpress database \cite{Brazma2003}: \begin{itemize} \item \url{http://www.ebi.ac.uk/arrayexpress/experiments/E-TABM-77} \item \url{http://www.ebi.ac.uk/arrayexpress/experiments/E-TABM-115} \end{itemize} These gene expression series account for clinical information, microarray feature annotation, and complete unprocessed raw gene expression data. In both studies a two-color design with dye-swap replication was applied, therefore two distinct \Rclass{RGList} objects were prepared for each data set, and included in the \Rpackage{mammaPrintData} {\R-\Bioc} package, as detailed below. The pre-processed, normalized, and summarized values used by the authors in their original analyses were not included in the submission to ArrayExpress, hence data pre-processing and normalization of the \Rclass{RGList} contained in this package is required prior any analysis. \subsection{Glas cohort - ArrayExpress E-TABM-115 Series} The Glas data set combines a large proportion of cases from the original van't Veer and Van de Vijver cohorts, and was used for the implementation of the original 70-gene signature into the MammaPrint assay \cite{Glas2006}. The following chunk of {\R} code was used to download the all the files of the ArrayExpress the E-TABM-115 series. The files needed to compile this vignette and create the \Rpackage{mammaPrintData} \R package are included inside the \texttt{inst/extdata} directory. <>= ###Chunk nor executed: files are already included in the package ###This chunk is just to show the source of the data ###Load the ArrayExpress library require(ArrayExpress) ###Create a working directory dir.create("../extdata/ETABM115", showWarnings = FALSE) ###Obtain the data package E-TABM-115 from ArrayExpress etabm115 <- getAE("E-TABM-115", type = "full", path="../extdata/ETABM115", extract=FALSE) ###Save save(etabm115, file="../extdata/ETABM115/etabm115.rda") @ Below is the {\R} code chunk used to process the phenotypic information retrieved form ArrayExpress and associated with the E-TABM-115 series on \Sexpr{print(myDate)} <>= ###Load the Biobase library require(Biobase) ###List the files obtained from ArrayExpress etabm155filesLoc <- system.file("extdata/ETABM115", package = "mammaPrintData") head(dir(etabm155filesLoc)) ###Read the phenotypic information from "E-TABM-115.sdrf.txt" targets <- read.table(dir(etabm155filesLoc, full.names=TRUE, pattern="E-TABM-115.sdrf.txt"), comment.char="", sep="\t", header=TRUE, stringsAsFactors=FALSE) ###Check and process the phenotypic information: keep non-redundant information targets <- targets[, apply(targets, 2, function(x) length(unique(x)) != 1 )] ###Reorder by file name targets <- targets[order(targets$Array.Data.File, decreasing=TRUE),] ###Remove points in colnames colnames(targets) <- gsub("\\.$","",gsub("\\.\\.",".",colnames(targets))) ###Show the available phenotypic iformation str(targets) @ The Glas cohort accounts for 162 unique cases, which were analyzed using a dye-swap two-colors design, comparing each sample against the same reference RNA. According to ArrayEspress instruction for SDRF files (available at \url{http://tab2mage.sourceforge.net/docs/magetab_docs.html}), a complete SDRF annotation file consists of a table in which each hybridization channel is represented by an individual row. Therefore the complete SDRF annotation files for all hybridizations of the Glas cohort should for 648 total rows: 162 samples for 2 channels (cy5 and Cy3) for 2 replicated hybridizations (dye-swap). As shown below, the \texttt{E-TABM-115.sdrf.txt} file currently available from ArrayExpress (\Sexpr{print(myDate)}) is incomplete, since it contains only \Sexpr{print(nrow(targets))} rows. An inspection of the information contained in the downloaded SDRF file reveales that it contains a row for each hybridization (n=\Sexpr{print(nrow(targets))}), rather than a row for each channel (n=\Sexpr{print(nrow(targets)*2)}). <>= ###Check targets data.frame dimentions dim(targets) ###Count the unique file names length(unique(targets$Array.Data.File)) ###Count the number of rows associated with each channel table(CHANNEL=targets$Label) ###Count the number of rows for the reference RNA (named "MRP") in each channel table(CHANNEL=targets$Label, REFERENCE=targets$Sample.Name == "MRP") ###Count the number of distinct RNA analyzed (excluding the refernce RNA) sum(REFERENCE=targets$Sample.Name!="MRP") ###Count the number of metastatic events table(METASTASES=targets$Characteristics.EventDistantMetastases) ###Count the number of rown for which metastatic events is missing table(MISSING=is.na(targets$Characteristics.EventDistantMetastases)) ###Check if missing clinical information is exactly associated with ###hybridizations where the "MRP" reference RNA was used in the Cy5 channel table(MISSING=is.na(targets$Characteristics.EventDistantMetastases), REFERENCE=targets$Sample.Name!="MRP") @ Therefore the following conclusions can be drawn from the evaluation of the information contained in \texttt{E-TABM-115.sdrf.txt} file shown above: \begin{itemize} \item All hybridizations that were performed have a corresponding row in the SDRF table; \item Half of the rows of the SDRF table, for which the RNA samples described were associated with the Cy5 channel, contain a complete set of clinical information; \item The other half of the rows, for which the RNA samples described were associated with the Cy3 channel, report the information for the reference RNA, and the clinical information is missing; \item Nevertheless, given the dye swap design of the experiment, the SDRF table contains the complete clinical information associated with all the 162 breast cancer patients analyzed in the Glas study; \item It is possible to use the clinical information provided only with one set of dye-swap hybridizations, since for the other set ony the ``MRP'' reference RNA information was provided. \end{itemize} In conclusion, the complete data set can be analyzed only knowing which hybridizations constitute a dye-swap pair. This information can be obtained upon request from Dr. Glas, the corrisponding author of the of the original study. Below is the {\R} code chunk used to create the \Rclass{RGList} objects using the raw gene expression data contained in the compressed archive \texttt{E-TABM-115.raw.1.zip} downloaded from ArrayExpress. <>= ###List files for E-TABM-115 experiment etabm155filesLoc <- system.file("extdata/ETABM115", package = "mammaPrintData") etabm155files <- list.files(etabm155filesLoc, pattern="^US") ###Check whether all available files correspond to the targets$Array.Data.File all(paste(targets$Array.Data.File, ".gz", sep="") %in% etabm155files) ###Check whether they are ordered in the same way all(paste(targets$Array.Data.File, ".gz", sep="") == etabm155files) ###Load the required library require(limma) ###Define the columns which will be read from the raw files selection colsList <- list(Rf="Feature Extraction Software:rMedianSignal", Rb="Feature Extraction Software:rBGMedianSignal", Gf="Feature Extraction Software:gMedianSignal", Gb="Feature Extraction Software:gBGMedianSignal", logRatio="Feature Extraction Software:LogRatio", logRatioError="Feature Extraction Software:LogRatioError") ###Subset the targets data.frame for the hybridization in which the Reference RNA was labeled with Cy3 targetsCy3info <- targets[ targets$Source.Name == "MRP" & targets$Label == "Cy3", ] dim(targetsCy3info) ###Subset the targets data.frame for the hybridization in which the Reference RNA was labeled with Cy5 targetsCy5info <- targets[ targets$Source.Name != "MRP" & targets$Label == "Cy5", ] dim(targetsCy5info) ###The two sets of hybridizations above should be mutually exclusive all(targetsCy3info$Array.Data.File != targetsCy5info$Array.Data.File) ###Create an "RGList" using specific columns for the first set of swapped hybridizations: RGcy3 <- read.maimages(paste(targetsCy3info$Array.Data.File, ".gz", sep=""), source="generic", columns=colsList, annotation="Reporter identifier", path=etabm155filesLoc, verbose=FALSE) ###Add targets information RGcy3$targets <- targetsCy3info ###Format Reporter identifiers RGcy3$genes$ID <- gsub(".+A-MEXP-318\\.", "", RGcy3$genes[, "Reporter identifier"]) ###Check dimensions dim(RGcy3) ###Create an "RGList" using specific columns for the second set of swapped hybridizations: RGcy5 <- read.maimages(paste(targetsCy3info$Array.Data.File, ".gz", sep=""), source="generic", columns=colsList, annotation="Reporter identifier", path=etabm155filesLoc, verbose=FALSE) ###Add targets information RGcy5$targets <- targetsCy5info ###Format Reporter identifiers RGcy5$genes$ID <- gsub(".+A-MEXP-318\\.", "", RGcy5$genes[, "Reporter identifier"]) ###Check dimensions dim(RGcy5) @ Below is the {\R} code chunk used to check whether all data read is of the correct mode (i.e. numeric values are of mode ``numeric'', and so on). <>= ###Check mode of read data for first set of hybridizations sapply(RGcy3, mode) ###Since RGc3$logRatio is character convert to numeric RGcy3$logRatio <- apply(RGcy3$logRatio, 2, as.numeric) ###Check mode of read data for second set of hybridizations sapply(RGcy5, mode) ###Since RGc3$logRatio is character convert to numeric RGcy5$logRatio <- apply(RGcy5$logRatio, 2, as.numeric) @ Below is the {\R} code chunk used to process the microarray feature annotation for the 1.9k MammaPrint array, as it is provided in the ArrayExpress database (file \texttt{A-MEXP-318.adf.txt}. We will then map the microarray features from thes new array to the 231-gene and the 70-gene prognostic signatures. We will finally add annotation information to the \Rclass{RGList} objects created above. <>= ###Read genes information for MammaPrint array from the A-MEXP-318.adf.txt genes <- read.table(dir(etabm155filesLoc, full.names=TRUE, pattern="A-MEXP-318.adf.txt"), skip=21, sep="\t", header=TRUE, stringsAsFactors=FALSE) ###Remove special characters like points in the colums header colnames(genes) <- gsub("\\.$","",gsub("\\.\\.",".",colnames(genes))) ###Use the annotation information previously processed str(progSig) ###Count the features mapped to the 231-gene prognostic signature by accession number apply(genes, 2, function(x, y) { sum(unique(x) %in% y) }, y = progSig$gns231acc) ###Count the features mapped to the 231-gene prognostic signature by gene symbol apply(genes, 2, function(x, y) { sum(unique(x) %in% y) }, y = progSig$gns231name) ###Count the features mapped to the 231-gene prognostic signature by gene symbol or accession number apply(genes, 2, function(x, y) { sum(unique(x) %in% y) }, y = progSig$gns231any) ###Add mapping information for 229 features using both gene symbols and accession numbers genes$genes231 <- genes$Comment.AEReporterName %in% progSig$gns231any ###Count the features mapped to the 70-gene prognostic signature by accession number apply(genes, 2, function(x, y) { sum(unique(x) %in% y) }, y = progSig$gns70acc) ###Count the features mapped to the 70-gene prognostic signature by gene symbol apply(genes, 2, function(x, y) { sum(unique(x) %in% y) }, y = progSig$gns70name) ###Count the features mapped to the 70-gene prognostic signature by gene symbol or accession number apply(genes, 2, function(x, y) { sum(unique(x) %in% y) }, y = progSig$gns70any) ###Add mapping information for 69 features using both gene symbols and accession numbers genes$genes70 <- genes$Comment.AEReporterName %in% progSig$gns70any ###The resulting gene annotation file str(genes) ###Add original correlation from the van't Veer study gns231Cors <- progSig$gns231Cors gns231Cors <- gns231Cors[gns231Cors$ID %in% genes$Comment.AEReporterName,] ###Merge and remove duplicates genes <- merge(genes, gns231Cors, all.x=TRUE, all.y=FALSE, by.x="Comment.AEReporterName", by.y="ID") genes <- genes[!duplicated(genes$Reporter.Name),] ###Check genes str(genes) @ Below is the {\R} code chunk used to add the feature annotation the \Rclass{RGList} objects containing the expression data (first set of hybridizations). <>= ###Compare genes between platform and RGList instances: first set of hybridizations if (nrow(genes)==nrow(RGcy3)) { ##Compare the identifiers' order if ( !all(genes$Reporter.Name == RGcy3$genes$ID) ) { ##Reorder if needed RGcy3 <- RGcy3[order(RGcy3$genes$ID),] genes <- genes[order(genes$Reporter.Name),] ##Check for order AGAIN if ( all(genes$Reporter.Name == RGcy3$genes$ID) ) { ##Substitute genes or stop RGcy3$genes <- genes } else { stop("Wrong gene order, check objects") } } else { print("Updating gene annotation") RGcy3$genes <- genes } } else { stop("Wrong number of features, check objects") } ###Rename the object glasRGcy3 <- RGcy3 @ Below is the {\R} code chunk used to add the feature annotation the \Rclass{RGList} objects containing the expression data (second set of hybridizations). <>= ###Compare genes between paltform and RGList instances: second set of hybridizations if (nrow(genes)==nrow(RGcy5)) { ##Compare the identifiers' order if ( !all(genes$Reporter.Name == RGcy5$genes$ID) ) { ##Reorder if needed RGcy5 <- RGcy5[order(RGcy5$genes$ID),] genes <- genes[order(genes$Reporter.Name),] ##Check for order AGAIN if ( all(genes$Reporter.Name == RGcy5$genes$ID) ) { ##Substitute genes or stop RGcy5$genes <- genes } else { stop("Wrong gene order, check objects") } } else { print("Updating gene annotation") RGcy5$genes <- genes } } else { stop("Wrong number of features, check objects") } ###Rename the object glasRGcy5 <- RGcy5 @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Buyse cohort - ArrayExpress E-TABM-77 Series} The Buyse data set (ArrayExpress E-TABM-77) accounts for the large European multicenter breast cancer cohort that was used to validate the MammaPrint assay \cite{Buyse2006}. The following chunk of {\R} code was used to download the all the files of the ArrayExpress the E-TABM-77 series. The files needed to compile this vignette and create the \Rpackage{mammaPrintData} \R package are included inside the \texttt{inst/extdata} directory. <>= ###Chunk not executed: files are already included in the package ###This chunk is just to show the source of the data dir.create("../extdata/ETABM77", showWarnings = FALSE) ################################################### ###Obtain the data package E-TABM-115 from ArrayExpress etabm77 <- getAE("E-TABM-77", type = "full", path="../extdata/ETABM77", extract=FALSE) ################################################### ###Save save(etabm77,file="../extdata/ETABM77/etabm77.rda") @ Below is the {\R} code chunk used to process the phenotypic information retrieved form ArrayExpress and associated with the E-TABM-77 series on \Sexpr{print(myDate)}. <>= ###List the files obtained from ArrayExpress etabm77filesLoc <- system.file("extdata/ETABM77", package = "mammaPrintData") head(dir(etabm77filesLoc)) ###Read the phenotypic information from "E-TABM-77.sdrf.txt" targets <- read.table(dir(etabm77filesLoc, full.names=TRUE, pattern="E-TABM-77.sdrf.txt"), comment.char="", sep="\t", header=TRUE, stringsAsFactors=FALSE) ###Check and process the phenotypic information: keep non-redundant information targets <- targets[, apply(targets, 2, function(x) length(unique(x)) != 1 )] ###Reorder by file name targets <- targets[order(targets$Array.Data.File, decreasing=TRUE),] ###Remove points in colnames colnames(targets) <- gsub("\\.$","",gsub("\\.\\.",".",colnames(targets))) ###Show the available phenotypic iformation str(targets) @ The Buyse cohort accounts for 307 unique cases, which were analyzed using a dye-swap two-colors design, comparing each sample against the same reference RNA. According to ArrayEspress instruction for SDRF files (available at \url{http://tab2mage.sourceforge.net/docs/magetab_docs.html}), a complete SDRF annotation file consists of a table in which each hybridization channel is represented by an individual row. Therefore the complete SDRF annotation files for all hybridizations of the Buse cohort should for 1228 total rows: 307 samples for 2 channels (cy5 and Cy3) for 2 replicated hybridizations (dye-swap). As shown below, the \texttt{E-TABM-77.sdrf.txt} file currently available from ArrayExpress (\Sexpr{print(myDate)}) is complete and contains all nececessary information. An inspection of the information contained in the downloaded SDRF file reveales that it contains two row for each hybridization (n=\Sexpr{print(nrow(targets))}) as expected. <>= ###Check targets data.frame dimentions dim(targets) ###Count the unique file names length(unique(targets$Array.Data.File)) ###Count the number of rows associated with each channel table(CHANNEL=targets$Label) ###Count the number of rows for the reference RNA (named "MRP") in each channel table(CHANNEL=targets$Label, REFERENCE=targets$Sample.Name == "MRP") ###Count the number of distinct RNA analyzed (excluding the refernce RNA) sum(REFERENCE=targets$Sample.Name!="MRP") ###Count the MammaPrint predictions table(MAMMAPRINT=targets$Factor.Value.MammaPrint.prediction) @ The \texttt{E-TABM-77.sdrf.txt} file is therefore valid and can be used to assemble the two \Rclass{RGList} objects corresponding to the two sets of dye-swap hybridization that were performed, as shown in the {\R} code below. <>= ###List files for E-TABM-77 experiment etabm77filesLoc <- system.file("extdata/ETABM77", package = "mammaPrintData") etabm77files <- list.files(etabm77filesLoc, pattern="^US") ###Check whether all available files correspond to the targets$Array.Data.File all(paste(targets$Array.Data.File, ".gz", sep="") %in% etabm77files) ###Check whether they are ordered in the same way all(paste(targets$Array.Data.File, ".gz", sep="") == etabm77files) ####Subset targets table extracting the rows for CANCER and the separate by channel caList <- !targets$Sample.Name=="MRP" targetsCa <- targets[caList,] ###Cy5 channel targetsCa.ch1 <- targetsCa[targetsCa$Label=="Cy5",] colnames(targetsCa.ch1) <- paste(colnames(targetsCa.ch1),"Cy5",sep=".") ###Cy3 channel targetsCa.ch2 <- targetsCa[targetsCa$Label=="Cy3",] colnames(targetsCa.ch2) <- paste(colnames(targetsCa.ch2),"Cy3", sep=".") ####Subset targets table extracting the rows for REFERENCE and the separate by channel refList <- targets$Sample.Name=="MRP" targetsRef <- targets[refList,] ###Cy5 channel targetsRef.ch1 <- targetsRef[targetsRef$Label=="Cy5",] colnames(targetsRef.ch1) <- paste(colnames(targetsRef.ch1),"Cy5",sep=".") ###Cy3 channel targetsRef.ch2 <- targetsRef[targetsRef$Label=="Cy3",] colnames(targetsRef.ch2) <- paste(colnames(targetsRef.ch2),"Cy3", sep=".") @ Below is the {\R} code chunk used to assemble the phenotypic information associated with the first set of hybridizations. <>= ###Now combine channel information to create the new targets object ###Here is for the FIRST set of swapped hybridization (keep only useful columns) if ( all(targetsCa.ch1$Array.Data.File.Cy5==targetsRef.ch2$Array.Data.File.Cy3) ) { colsSel <- apply(targetsCa.ch1==targetsRef.ch2,2,function(x) sum(x) != length(x) ) targetsSwap1 <- cbind(targetsCa.ch1,targetsRef.ch2[,colsSel],stringsAsFactors=FALSE) targetsSwap1 <- targetsSwap1[,apply(targetsSwap1,2,function(x) length(unique(x)) != 1 )] ##add Cy5 and Cy3 columns targetsSwap1$Cy5 <- targetsSwap1$Sample.Name targetsSwap1$Cy3 <- "Ref" targetsSwap1 <- targetsSwap1[,order(colnames(targetsSwap1))] ##remove dye extension to Scan name colnames(targetsSwap1) <- gsub("\\.Cy5$","",colnames(targetsSwap1)) ##reorder by sample name in Cy5 targetsSwap1 <- targetsSwap1[order(targetsSwap1$Cy5),] print("Assembling the targets object for the first set of swapped hybridizations") } else { stop("Check objects") } @ Below is the {\R} code chunk used to assemble the phenotypic information associated with the second set of hybridizations. <>= ###create new target objects for SECOND set of swapped hybs and keep useful columns if ( all(targetsCa.ch2$Array.Data.File.Cy3==targetsRef.ch1$Array.Data.File.Cy5) ) { colsSel <- apply(targetsCa.ch2==targetsRef.ch1,2,function(x) sum(x) != length(x) ) targetsSwap2 <- cbind(targetsCa.ch2,targetsRef.ch1[,colsSel],stringsAsFactors=FALSE) targetsSwap2 <- targetsSwap2[,apply(targetsSwap2,2,function(x) length(unique(x)) != 1 )] ##add Cy5 and Cy3 columns targetsSwap2$Cy5 <- "Ref" targetsSwap2$Cy3 <- targetsSwap2$Sample.Name targetsSwap2 <- targetsSwap2[,order(colnames(targetsSwap2))] ##remove dye extension to Scan name colnames(targetsSwap2) <- gsub("\\.Cy3$","",colnames(targetsSwap2)) ##reorder by sample name in Cy3 targetsSwap2 <- targetsSwap2[order(targetsSwap2$Cy3),] print("Assembling the targets object for the second set of swapped hybridizations") } else { stop("Check objects") } @ Below is the {\R} code chunk used to assemble the \Rclass{RGlist} object for the first set of hybridizations of the \texttt{ETABM-77} series. <>= ##Define the columns which will be read from the raw files selection colsList <- list(Rf="Feature Extraction Software:rMedianSignal", Rb="Feature Extraction Software:rBGMedianSignal", Gf="Feature Extraction Software:gMedianSignal", Gb="Feature Extraction Software:gBGMedianSignal", logRatio="Feature Extraction Software:LogRatio", logRatioError="Feature Extraction Software:LogRatioError") ###Swap set 1 has Reference RNA in Cy3 channel table(targetsSwap1$Cy3) ###This creates an instance of class "RGList", selecting specific columns from the raw data: RGcy3 <- read.maimages(paste(targetsSwap1$Array.Data.File, ".gz", sep=""), source="generic", columns=colsList, annotation="Reporter identifier", path=etabm77filesLoc, verbose=FALSE) ###Add targets information RGcy3$targets <- targetsSwap1 ###Add sample names as column names with prefix colnames(RGcy3) <- paste("BC",targetsSwap1$Sample.Name,sep=".") ###Format Reporter identifiers RGcy3$genes$ID <- gsub(".+A-MEXP-318\\.", "", RGcy3$genes[, "Reporter identifier"]) ###Check dimensions dim(RGcy3) @ Below is the {\R} code chunk used to assemble the \Rclass{RGlist} object for the second set of hybridizations of the \texttt{ETABM-77} series. <>= ###This creates an instance of class "RGList", selecting specific columns from the raw data: RGcy5 <- read.maimages(paste(targetsSwap1$Array.Data.File, ".gz", sep=""), source="generic", columns=colsList, annotation="Reporter identifier", path=etabm77filesLoc, verbose=FALSE) ###Swap set 2 has Reference RNA in Cy5 channel table(targetsSwap2$Cy5) ###Add targets information RGcy5$targets <- targetsSwap2 ###Format Reporter identifiers RGcy5$genes$ID <- gsub(".+A-MEXP-318\\.", "", RGcy5$genes[, "Reporter identifier"]) ###Add sample names as column names with prefix colnames(RGcy5) <- paste("BC",targetsSwap2$Sample.Name,sep=".") ###Check dimensions dim(RGcy5) @ Below is the {\R} code to check whether all data read is of teh correct mode (i.e. numeric values are of mode ``numeric'', and so on) <>= ###Check mode of read data for first set of hybridizations sapply(RGcy3, mode) ###Since RGc3$logRatio is character convert to numeric RGcy3$logRatio <- apply(RGcy3$logRatio, 2, as.numeric) ###Check mode of read data for second set of hybridizations sapply(RGcy5, mode) ###Since RGc3$logRatio is character convert to numeric RGcy5$logRatio <- apply(RGcy5$logRatio, 2, as.numeric) @ Below is the {\R} code chunkto process the microarray feature annotation for the 1.9k MammaPrint array, as it is provided in the ArrayExpress database (file \texttt{A-MEXP-318.adf.txt}. We will then map the microarray features from thes new array to the 231-gene and the 70-gene prognostic signatures. We will finally add annotation information to the \Rclass{RGList} objects created above. <>= ###Read genes information for MammaPrint array from the A-MEXP-318.adf.txt genes <- read.table(dir(etabm77filesLoc, full.names=TRUE, pattern="A-MEXP-318.adf.txt"), comment.char="", skip=21, sep="\t", header=TRUE, stringsAsFactors=FALSE) ###Remove special characters like points in the colums header colnames(genes) <- gsub("\\.$","",gsub("\\.\\.",".",colnames(genes))) ###Use the annotation information previously processed str(progSig) ###Count the features mapped to the 231-gene prognostic signature by accession number apply(genes, 2, function(x, y) { sum(unique(x) %in% y) }, y = progSig$gns231acc) ###Count the features mapped to the 231-gene prognostic signature by gene symbol apply(genes, 2, function(x, y) { sum(unique(x) %in% y) }, y = progSig$gns231name) ###Count the features mapped to the 231-gene prognostic signature by gene symbol or accession number apply(genes, 2, function(x, y) { sum(unique(x) %in% y) }, y = progSig$gns231any) ###Add mapping information for 229 features using both gene symbols and accession numbers genes$genes231 <- genes$Comment.AEReporterName %in% progSig$gns231any ###Count the features mapped to the 70-gene prognostic signature by accession number apply(genes, 2, function(x, y) { sum(unique(x) %in% y) }, y = progSig$gns70acc) ###Count the features mapped to the 70-gene prognostic signature by gene symbol apply(genes, 2, function(x, y) { sum(unique(x) %in% y) }, y = progSig$gns70name) ###Count the features mapped to the 70-gene prognostic signature by gene symbol or accession number apply(genes, 2, function(x, y) { sum(unique(x) %in% y) }, y = progSig$gns70any) ###Add mapping information for 229 features using both gene symbols and accession numbers genes$genes70 <- genes$Comment.AEReporterName %in% progSig$gns70any ###The resulting gene annotation file str(genes) ###Add original correlation from the van't Veer study gns231Cors <- progSig$gns231Cors gns231Cors <- gns231Cors[gns231Cors$ID %in% genes$Comment.AEReporterName,] ###Merge and remove duplicates genes <- merge(genes, gns231Cors, all.x=TRUE, all.y=FALSE, by.x="Comment.AEReporterName", by.y="ID") genes <- genes[!duplicated(genes$Reporter.Name),] ###Check genes str(genes) @ Below is the {\R} code chunk used to add the feature annotation the \Rclass{RGList} objects containing the expression data (first set of hybridizations). <>= ###Compare genes between paltform and RGList instances: first set of hybridizations if (nrow(genes)==nrow(RGcy3)) { ##Compare the identifiers' order if ( !all(genes$Reporter.Name == RGcy3$genes$ID) ) { ##Reorder if needed RGcy3 <- RGcy3[order(RGcy3$genes$ID),] genes <- genes[order(genes$Reporter.Name),] ##check for order AGAIN if ( all(genes$Reporter.Name == RGcy3$genes$ID) ) { ##Substitute genes or stop RGcy3$genes <- genes } else { stop("Wrong gene order, check objects") } } else { print("Updating gene annotation") RGcy3$genes <- genes } } else { stop("Wrong number of features, check objects") } ###Rename buyseRGcy3 <- RGcy3 @ Below is the {\R} code chunk used to add the feature annotation the \Rclass{RGList} objects containing the expression data (second set of hybridizations). <>= ###Compare genes between paltform and RGList instances: second set of hybridizations if (nrow(genes)==nrow(RGcy5)) { ##Compare the identifiers' order if ( !all(genes$Reporter.Name == RGcy5$genes$ID) ) { ##Reorder if needed RGcy5 <- RGcy5[order(RGcy5$genes$ID),] genes <- genes[order(genes$Reporter.Name),] ##check for order AGAIN if ( all(genes$Reporter.Name == RGcy5$genes$ID) ) { ##Substitute genes or stop RGcy5$genes <- genes } else { stop("Wrong gene order, check objects") } } else { print("Updating gene annotation") RGcy5$genes <- genes } } else { stop("Wrong number of features, check objects") } ###Rename buyseRGcy5 <- RGcy5 @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Phenotypic information processing } %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Glas cohort: patients' phenotype curation} The chunk of {\R} code below was used to process the clinical information associated with the 162 patients of the ArrayExpress ``E-TABM-115'' series, and define the final ``good'' and ``bad'' prognosis patients groups, as well as other end-points that were investigated in the our study. To this end we first identified the putative 78 patients from the van't Veer cohort \cite{van't2002}, and the 84 cases from the Van de Vijver cohort \cite{Vijver2002} that were combined to assemble the final Glas data set. Since this information was not provided with the ArrayExpress ``E-TABM-115'' series, we analyzed the raw data hybridization file names, and used the 3-digits suffix code present in exactly 84 files, as follows: <>= ###Retrieve the phenotypic information from one of the RGList intances phenoGlasCy3 <- glasRGcy3$targets phenoGlasCy5 <- glasRGcy5$targets ###Identification of the 78 samples from van't Veer and the 84 from VanDeVijver table(gsub(".+_", "", phenoGlasCy3$Scan.Name) == gsub(".+_", "", phenoGlasCy5$Scan.Name)) ################################################## ###There are 78 hybridizations with 1-digit suffixes, and 84 of them with 3-digit suffixes table(nchar(gsub(".+_", "", phenoGlasCy5$Scan.Name))) table(nchar(gsub(".+_", "", phenoGlasCy3$Scan.Name))) ###Use the hybridization file names suffixes to define the putative cohort of origin: FIRST SET phenoGlasCy5$putativeCohort <- factor(nchar(gsub(".+_", "", phenoGlasCy5$Scan.Name))) levels(phenoGlasCy5$putativeCohort) <- c("putativeVantVeer", "putativeVanDeVijver") ###Use the hybridization file names suffixes to define the putative cohort of origin: SECOND SET phenoGlasCy3$putativeCohort <- factor(nchar(gsub(".+_", "", phenoGlasCy3$Scan.Name))) levels(phenoGlasCy3$putativeCohort) <- c("putativeVantVeer", "putativeVanDeVijver") ###Show counts FIRST SET table(phenoGlasCy5$putativeCohort) ################################################## ###Show counts SECOND SET table(phenoGlasCy3$putativeCohort) @ We subsequently processed the ``Factor.Value.overall\_survival'' character strings contained in the ``E-TABM-115'' SDRF table to define the actual overall survival (OS) time, the OS event status, and the 10-year survival status for each patient, as shown below. This was possible only for the set of hybridization for which the clinical information was provided in ArrayExpress (see \Robject{glasRGcy5\$targets} below). <>= ###Process the overall survival (OS) character strings and make numeric OS <- gsub("\\s.+", "", phenoGlasCy5$Factor.Value.overall_survival) phenoGlasCy5$OS <- as.numeric(OS) ####Process OS event and make numeric: missing data, labeled with "n/a" strings are turned into NA phenoGlasCy5$OSevent <- as.numeric(phenoGlasCy5$Factor.Value.Event_Death) ###Count missing information by counting the number of NA sum(is.na(phenoGlasCy5$OSevent)) ####Create binary OS at 10 years groups phenoGlasCy5$TenYearSurv <- phenoGlasCy5$OS < 10 & phenoGlasCy5$OSevent == 1 phenoGlasCy5$TenYearSurv[is.na(phenoGlasCy5$OSevent)] <- NA @ The overall survival information summary for the Glas cohort is shown below: <>= ###Count OS events table(phenoGlasCy5$OSevent) ###Count survival at 10 years status table(phenoGlasCy5$TenYearSurv) @ We subsequently processed the ``Factor.Value.Time\_to\_development\_of\_distant\_metastases'' character strings contained in the ``E-TABM-115'' SDRF table to define the actual time to metastasis (TTM) time, the TTM event status, and the 5-year distant recurrence status for each patient, as shown below. This was possible only for the set of hybridization for which the clinical information was provided in ArrayExpress (see \Robject{glasRGcy5\$targets} below). <>= ###Process time metastases (TTM) character strings and make numeric ###Note that missing data, labeled with "n/a" strings are turned into NA TTM <- phenoGlasCy5$Factor.Value.Time_to_development_of_distant_metastases phenoGlasCy5$TTM <- as.numeric(gsub("\\s.+", "", TTM)) ####Process TTM event and make numeric phenoGlasCy5$TTMevent <- as.numeric(phenoGlasCy5$Factor.Value.Event_distant_metastases) ###Use follow-up time from OS as TTM for patients withouth metastasis event phenoGlasCy5$TTM[is.na(phenoGlasCy5$TTM)] <- phenoGlasCy5$OS[is.na(phenoGlasCy5$TTM)] ####Create binary TTM at 5 years groups phenoGlasCy5$FiveYearMetastasis <- phenoGlasCy5$TTM < 5 & phenoGlasCy5$TTMevent == 1 @ The time to metastasis information summary, and the five year distant recurrence status, which corresponds to the ``good'' and ``bad'' prognosis groups used in the Glas study, are shown below. <>= ###Count TTM events table(phenoGlasCy5$TTMevent) ###Count metastasis events by putative cohort of origin table(phenoGlasCy5$FiveYearMetastasis, phenoGlasCy5$putativeCohort) @ We finally replaced the processed phenotypic information in the corresponding \Rclass{RGList-class} instances: <>= ###Substitute in the RGLists glasRGcy3$targets <- phenoGlasCy3 glasRGcy5$targets <- phenoGlasCy5 @ Below is the {\R} code chunk used to save the final \Rclass{RGList} objects. <>= ###Save the final ExpressionSet object: not run dataDirLoc <- system.file("data", package = "mammaPrintData") save(glasRGcy5, glasRGcy3, file=paste(dataDirLoc, "/glasRG.rda", sep="")) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Buyse cohort: patients' phenotype curation} The chunk of {\R} code below was used to process the clinical information associated with the 307 analyzed patients that were retrieved from the ArrayExpress ``E-TABM-77'' series, to define the final ``good'' and ``bad'' prognosis patients groups, as well as other end-points that were investigated in the present study. We hence processed the ``Factor.Value.SurvivalTime'' character strings contained in the ``E-TABM-77'' SDRF table to define the actual overall survival (OS) time, the OS event status, and the 10-year survival group for each patient, as shown below <>= ###Retrieve the phenotypic information from one of the RGList intances phenoBuyseCy3 <- buyseRGcy3$targets phenoBuyseCy5 <- buyseRGcy5$targets ###Process overall survival (OS) character strings and make numeric: BOTH SETS OScy5 <- phenoBuyseCy5$Factor.Value.SurvivalTime phenoBuyseCy5$OS <- as.numeric(gsub("\\s.+", "", OScy5)) OScy3 <- phenoBuyseCy3$Factor.Value.SurvivalTime phenoBuyseCy3$OS <- as.numeric(gsub("\\s.+", "", OScy3)) ###Create the OS event by processing the OS character string with grep ###and the "plus" pattern making it a numeric vector: BOTH SETS phenoBuyseCy5$OSevent <- 1*( ! 1:nrow(phenoBuyseCy5) %in% grep("plus", OScy5)) phenoBuyseCy3$OSevent <- 1*( ! 1:nrow(phenoBuyseCy3) %in% grep("plus", OScy3)) ####Create binary OS at 10 years groups: BOTH SETS phenoBuyseCy5$TenYearSurv <- phenoBuyseCy5$OS < 10 & phenoBuyseCy5$OSevent == 1 phenoBuyseCy3$TenYearSurv <- phenoBuyseCy3$OS < 10 & phenoBuyseCy3$OSevent == 1 @ The overall survival information summary for the Buyse cohort is shown below: <>= ###Count OS survival events: BOTH SETS table(phenoBuyseCy5$OSevent) table(phenoBuyseCy3$OSevent) ###Count survival at 10 years status: BOTH SETS table(phenoBuyseCy5$TenYearSurv) table(phenoBuyseCy3$TenYearSurv) @ We subsequently processed the ``Factor.Value.Disease.Free.Survival'' character strings contained in the ``E-TABM-77'' SDRF table to define the actual disease free survival (DFS) time, and the DFS event status for each patient, as shown below. <>= ###Process disease free survival (DFS) character strings and make numeric: BOTH SETS DFScy5 <- phenoBuyseCy5$Factor.Value.Disease.Free.Survival phenoBuyseCy5$DFS <- as.numeric(gsub("\\s.+", "", DFScy5)) DFScy3 <- phenoBuyseCy3$Factor.Value.Disease.Free.Survival phenoBuyseCy3$DFS <- as.numeric(gsub("\\s.+", "", DFScy3)) ###Create the DFS event by processing the DFS character string with grep ###and the "plus" pattern making it a numeric vector: BOTH SETS phenoBuyseCy5$DFSevent <- 1*( ! 1:nrow(phenoBuyseCy5) %in% grep("plus", DFScy5)) phenoBuyseCy3$DFSevent <- 1*( ! 1:nrow(phenoBuyseCy3) %in% grep("plus", DFScy3)) ####Create binary DFS at 5 years groups:BOTH SETS phenoBuyseCy5$FiveYearDiseaseFree <- (phenoBuyseCy5$DFS < 5 & phenoBuyseCy5$DFSevent == 1) phenoBuyseCy3$FiveYearDiseaseFree <- (phenoBuyseCy3$DFS < 5 & phenoBuyseCy3$DFSevent == 1) @ The disease free survival information summary for the Buyse cohort is shown below: <>= ###Count DFS survival events: BOTH SETS table(phenoBuyseCy5$DFSevent) table(phenoBuyseCy3$DFSevent) ###Count recurrence at 5 years status: BOTH SETS table(phenoBuyseCy5$FiveYearDiseaseFree) table(phenoBuyseCy3$FiveYearDiseaseFree) @ We subsequently processed the ``Factor.Value.DistantMetastasis.Free.Survival'' character strings contained in the ``E-TABM-77'' SDRF table to define the actual time to metastasis (TTM) time, the TTM event status, and the 5-year recurrence status for each patient, as shown below. <>= ###Process time metastases (TTM) character strings and make numeric: BOTH SETS TTMcy5 <- phenoBuyseCy5$Factor.Value.DistantMetastasis.Free.Survival phenoBuyseCy5$TTM <- as.numeric(gsub("\\s.+", "", TTMcy5)) TTMcy3 <- phenoBuyseCy3$Factor.Value.DistantMetastasis.Free.Survival phenoBuyseCy3$TTM <- as.numeric(gsub("\\s.+", "", TTMcy3)) ###Create the TTM event by processing the TTM character string with grep ###and the "plus" pattern making it a numeric vector: BOTH SETS phenoBuyseCy5$TTMevent <- 1*( ! 1:nrow(phenoBuyseCy5) %in% grep("plus", TTMcy5)) phenoBuyseCy3$TTMevent <- 1*( ! 1:nrow(phenoBuyseCy3) %in% grep("plus", TTMcy3)) ####Create binary TTM at 5 years groups: BOTH SETS phenoBuyseCy5$FiveYearRecurrence <- (phenoBuyseCy5$TTM < 5 & phenoBuyseCy5$TTMevent == 1) phenoBuyseCy3$FiveYearRecurrence <- (phenoBuyseCy3$TTM < 5 & phenoBuyseCy3$TTMevent == 1) @ The time to metastasis information summary, and the five year distant recurrence status, which corresponds to the ``good'' and ``bad'' prognosis groups used in the Buyse study, are shown below. <>= ###Count TTM survival events: BOTH SETS table(phenoBuyseCy5$TTMevent) table(phenoBuyseCy3$TTMevent) ###Count recurrence at 5 years status: BOTH SETS table(phenoBuyseCy5$FiveYearRecurrence) table(phenoBuyseCy3$FiveYearRecurrence) ###Count metastasis events by European center: BOTH SETS table(phenoBuyseCy5$FiveYearRecurrence, phenoBuyseCy5$Characteristics.BioSourceProvider) table(phenoBuyseCy3$FiveYearRecurrence, phenoBuyseCy3$Characteristics.BioSourceProvider) @ We finally identified the patients that were excluded from the analysis in the original study by Buyse and colleagues \cite{Buyse2006} beacuse of missing clinical information. <>= ###Exclude patients with missing ER status: BOTH SETSXS phenoBuyseCy5$toExclude <- phenoBuyseCy5$Factor.Value.ER.status=="unknown" phenoBuyseCy3$toExclude <- phenoBuyseCy3$Factor.Value.ER.status=="unknown" @ We finally replaced the processed phenotypic information in the corresponding \Rclass{RGList-class} instances: <>= ###Substitute in the RGLists buyseRGcy3$targets <- phenoBuyseCy3 buyseRGcy5$targets <- phenoBuyseCy5 @ Below is the {\R} code chunk used to save the final \Rclass{RGList} objects. <>= ###Save RGList instances for later use: not run dataDirLoc <- system.file("data", package = "mammaPrintData") save(buyseRGcy5, buyseRGcy3, file=paste(dataDirLoc, "/buyseRG.rda", sep="")) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{System Information} Session information: <>= toLatex(sessionInfo()) @ \pagebreak \section{References} \bibliographystyle{plain} \bibliography{./mammaPrintData} \end{document}