\documentclass{article} \usepackage{natbib} \usepackage{graphics} \usepackage{amsmath} \usepackage{indentfirst} \usepackage[utf8]{inputenc} \usepackage{color} \usepackage{hyperref} % \VignetteIndexEntry{massiR_Example} \begin{document} \SweaveOpts{concordance=TRUE} \title{massiR: MicroArray Sample Sex Identifier} \author{Sam Buckberry} \maketitle \tableofcontents \clearpage \section{The Problem} Given that the sex of many species is an easily observable and usually unambiguous classification, it is surprising the number of microarray data sets in public repositories that lack the associated sample sex information. Sex-biased gene expression in normal and pathological tissues is a well recognized for both sex chromosome and autosomal genes. Sex biases also exist in the prevalence and severity of many common human diseases, such as cardiovascular disease and some cancers. As sex is a potential influencing factor of both pathological and non-pathological phenotypes, gene expression analyses that do not account for sex-specific effects could fail to identify a significant proportion of genes that contribute the condition under investigation. Therefore, the absence of sample sex information restricts the reuse of gene expression data sets where the researcher intends to factor the effect of sex in reanalysis or reinterpretation, or when intending to include such data sets in larger gene expression meta-analyses. This is why we developed massiR, a package for predicting the sex of samples in microarray data sets. This package allows researchers to expand their analyses to retrospectively incorporate sex as a variable, generate or confirm sex information associated with publicly available data sets, to accurately predict the sex of samples missing sex information, or as a simple sanity check for your own microarray gene expression data. \section{Importing data and beginning the analysis} The massiR analysis begins by importing standard gene expression data of normalized and log transformed probe values. The gene expression data can be in the form of a data.frame object and have the sample identifiers as the column names and the probe identifiers as the row names, or as an ExpressionSet object. The identifiers for probes corresponding to Y chromosome genes must be as a data.frame object with the probe identifiers as row.names. To load the included test massiR gene expression data: <>= library(massiR) data(massi.test.dataset) @ The included gene expression data is composed of 60 samples and 1026 probes as a data.frame object. To load the test Y chromosome probes corresponding to the included data: <>= data(massi.test.probes) @ The Included list of Y chromosome probes contains probe identifiers as row.names in the data.frame class. \section{Extracting the Y chromosome probe data} The first step of the massiR analysis involves extracting the expression values for probes that correspond to Y chromosome genes. The user has the option of using their own list of probes corresponding to Y chromosome genes or using the probe lists included with the package. The included lists correspond to popular microarray platforms and contain identifiers for probes that map uniquely to Y chromosome genes. See section 8 for detials on using the included probes and section 9 for details on obtaining Y chromosome probes easily from Ensembl Biomart. When the expression values for Y chromosome probes are extracted, the expression variance for each probe across all samples is calculated. This allows the identification of low variance probes, which are unlikely to be informative in sex classification. The user has the option of selecting a probe variation threshold, so only the most informative probes are used in the classification process. Deciding on a probe variation threshold can be informed by inspecting a probe variation plot (Figure 1) generated by the massi.y.plot function. In our experience, using the most variable 25-50\% of probes (typically 10-40 probes, depending on platform) produces good results. To extract data corresponding to Y chromosome probes from the test data set and look at a probe variation plot: <>= massi.y.out <- massi_y(massi.test.dataset, massi.test.probes) @ <>= massi_y_plot(massi.y.out) @ This plot (Figure 1) is output to the R graphics device. <>= massi.y.out <- massi_y(massi.test.dataset, massi.test.probes) @ \setkeys{Gin}{width=1\linewidth} \begin{figure} \begin{center} <>= barplot(height=massi.y.out[[2]], names.arg=massi.y.out[[1]], xpd=T, cex.names=0.5, las=2, ylab="Probe CV (%)") # Get the quantile values from the massi.y output quantiles <- massi.y.out[[3]] # add lines for the 0%, 25%, 50%, and 75% quartiles abline(h=quantiles[1:4], col=c("black", "red", "blue", "green"), lwd=2) legend("topleft",cex=0.7, title="Threshold (Quantile)", col=c("black", "red", "blue", "green"), fill= c("black", "red", "blue", "green"), legend=c("1 (0%)", "2 (25%)", "3 (50%)", "4 (75%)")) @ \end{center} \caption{Expression variation (CV) of Y chromosome probes across all samples} \label{fig:fig1} \end{figure} \clearpage After viewing the probe variation plot, a decision can be made regarding which probes to use in the clustering step. The massiR package includes methods for selecting probe variation thresholds based on quantiles. The threshold can be determined by quantiles of probe variance (CV): 1=All probes, 2=Upper 75\%, 3=Upper 50\%, 4=Upper 25\%. It is highly recommended that probe CV plot generated using the massi\_y\_plot function be inspected to inform threshold choice (Figure 1). The default threshold value is 3. Once a probe threshold has been decided upon, run the massi\_select function. This will return a data.frame with the samples as columns and the subset of selected y chromosome probes as row names. <>= massi.select.out <- massi_select(massi.test.dataset, massi.test.probes, threshold=4) @ Check the output for the first 5 samples: <>= head(massi.select.out)[,1:5] @ \section{Predicting the sex of samples} To classify samples as either male or female, clustering is performed using the values from the subset of Y chromosome probes by implementing the partitioning around medoids algorithm which performs k-medoids clustering (Hennig 2013), where samples are assigned to one of two clusters. The two clusters are then compared using the probe expression values across all samples in each cluster. Samples within the cluster featuring the highest Y chromosome probe values are classed as male and those within the cluster with the lowest Y probe values classed as female. Results such sample probe mean, standard deviation and z-scores are reported in a table together with the sex predicted for each sample. To predict the sex of the samples using massi\_cluster: <>= results <- massi_cluster(massi.select.out) @ Extract the results for each sample from the returned list: <>= sample.results <- data.frame(results[[2]]) @ <<>>= head(sample.results) @ As you can see, it is a relatively straightforward procedure to produce a table with the predicted sex of each sample with some basic metrics. \section{Visualizing the massiR analysis data} The massiR package includes a function which allows various aspects of the data used in the analysis to be visualized. These plots enable to used to inspect sample and clustering characteristics which could aid in identifying problematic samples and outliers. To run the massi.plot function with the output from the massi\_select and massi\_cluster functions: <>= massi_cluster_plot(massi.select.out, results) @ This function will generate a heat map with dendrogram of Y chromosome probes as rows and individual samples in columns (Figure 2), a bar plot of mean values and standard deviation from the subset of Y chromosome probes used in K-medoids clustering (Figure 3), with the bars colored with respect to predicted sex and a principal component plot showing clusters (Figure 4). These plots can aid the user in identifying sample outliers or probes that may not be informative in the clustering step. \clearpage \begin{figure} \begin{center} <>= ord <- order(rowSums(abs(massi.select.out)),decreasing=T) heatmap.2(x=as.matrix(massi.select.out[ord,]), keysize=2, cexRow=0.7, key=T, trace="none", dendrogram="row", col=redgreen(75), scale="row") @ \end{center} \caption{Heat map with dendrogram of Y chromosome probes as rows and individual samples in columns. Notice that the values for the probe in the fifth row are reasonably variable but do not show the same pattern seen with other probes. Therefore viewing the heatmap may help identify problematic probes.} \label{fig:fig2} \end{figure} \begin{figure} \begin{center} <>= massi.cluster.results <- data.frame(results[[2]]) massi.cluster.results.sort <- massi.cluster.results[order(massi.cluster.results$sex),] # sort data by sex probe.means <- massi.cluster.results.sort$mean_y_probes_value # samples probe mean values probe.sd <- massi.cluster.results.sort$y_probes_sd # sample probe sd values sample.names <- massi.cluster.results.sort$ID # set x-axis names plot.top <- ceiling(max(probe.means+probe.sd*1.1)) # set y-axis upper limit plot.bottom <- floor(min(probe.means-probe.sd*1.1)) # set y-axis lower limit sample.sex <- massi.cluster.results.sort$sex # set the factor for bar color # create the plot barCenters <- barplot(probe.means, xpd=F, names.arg=results$ID, cex.names=0.7, ylab="Chr.Y mean probe value +/- SD", xlab="", col=c("red", "green")[as.factor(sample.sex)], las=2, ylim=c(plot.bottom,plot.top)) segments(barCenters, probe.means-probe.sd, # add the sd bars barCenters, probe.means+probe.sd, lwd=0.8) legend("topleft", fill=c("red", "green"), title="predicted sex", ## add legend to plot legend=c("female", "male"), cex=0.5, ) @ \end{center} \caption{Mean values of the subset of Y chromosome probes used in K-medoids clustering. The bar colors represent clusters, which were assigned as female (red) and male (green)} \label{fig:fig3} \end{figure} \begin{figure} \begin{center} <>= ## generate PC plot of clusters k.medoids.results <- results[[1]] clusplot(t(massi.select.out), k.medoids.results$clustering, color=TRUE, shade=FALSE, main="",cex.txt=0.5, labels=2, lines=0) @ \end{center} \caption{Principal component plot of male and female clusters} \label{fig:fig4} \end{figure} \clearpage \section{Check for potential sex bias using the dip test} The massiR method for predicting the sex of samples is >97\% accurate for data sets with 6 or more samples and with at least of 15\% of either males or females. Outside of this range, this method still performs well in most cases. As there is no guarantee that publicly available data sets will fall within these limits, the function massi.dip can be used to test if the data set might have a male/female ratio that might affect performance. \paragraph{} \hspace{0pt} \\ The massiR method was tested using empirical data sets for five different human tissues. Individual data subsets were randomly generated for each tissue data set ranging from 6-50 samples and with a wide-range of Male/Female ratios. The results of this testing suggest for data sets with >10 samples a dip statistic >0.08 is indicative of at least 15\% of males or females in the data set. The massi\_dip function calculates z-scores for each sample and implements the dip test to test for unimodality (Maechler 2013). As a relatively balanced dataset would typically show a bi-modal distribution of the z-scores, the dip statistic is then used to predict if a dataset shows a unimodal distribution that would be expected if a vast majority of samples were of one sex. To use massi\_dip function, which calculates the dip statistic using the data output from the massi\_select function: <>= dip.result <- massi_dip(massi.select.out) @ This returns the message: {\color{red}dip test statistic is >0.08. This suggests that the proportion of male and female samples in this data set is relatively balanced} \paragraph{} \hspace{0pt} \\ Visually inspecting this distribution as a density plot (figure 5) or a histogram plot (figure 6) enables the user to see if there is the expected bi-modal distribution (as there should be distinct distributions for each sex). To produce a density plot and histogram of sample z-scores: \begin{figure} \begin{center} <>= dip.result <- massi_dip(massi.select.out) plot(dip.result[[3]]) @ \end{center} \caption{Density plot of mean y chromosome probe z-scores} \label{fig:fig5} \end{figure} \begin{figure} \begin{center} <>= dip.result <- massi_dip(massi.select.out) hist(x=dip.result[[2]], breaks=20) @ \end{center} \caption{Histogram of mean y chromosome probe z-scores} \label{fig:fig6} \end{figure} \clearpage If the data set was has a sex bias that may influence the accuracy of the massiR sex prediction, then the massi\_dip function is likely to return a dip statistic of <0.08. For example, if we are to use the massiR test data set to generate a subset to 20 samples composed of 10\% males, we will see that the dip statistic returned is <0.08. To create this female skewed bias: get the sample id's for the male and female samples: <>= male.ids <- subset(sample.results$ID, subset=sample.results$sex=="male") female.ids <- subset(sample.results$ID, subset=sample.results$sex=="female") @ Create a data subsest of 20 samples with 10\% males: <<>>= bias.subset.ids <- c(female.ids[1:18], male.ids[1:2]) bias.subset <- massi.select.out[bias.subset.ids] @ Use the massi.dip function to test for sex-biased data set: <<>>= bias.dip <- massi_dip(bias.subset) @ Please note that a dip >0.08 is a good indication that there is not a sex bias present that will affect the accuracy of the massiR method. However, and dip statistic <0.08 may still be returned for data sets with >15\% males or female or data sets that a suitable for massiR analysis, therefore the results of the massi\_dip function should be interpreted with caution and in light of the massi\_cluster results. \section{Performing massiR analysis with an ExpressionSet object} The massiR pipeline allows the input of expression data in the class ExpressionSet. Here is an example of how to use data in the ExpressionSet class in a massiR analysis and how to put the results back into the ExpressionSet: Load the example ExpressionSet data included with the massiR package: <>= data(massi.eset, massi.test.probes) @ Using massiR with an ExpressionSet is the same as using a data.frame as in the above example: <>= eset.select.out <- massi_select(massi.eset, massi.test.probes) eset.results <- massi_cluster(eset.select.out) @ Now to get the massi.cluster results and add them to the ExpressionSet: <>= # Get the sex for each sample from the massi_cluster results eset.sample.results <- data.frame(eset.results[[2]]) sexData <- data.frame(eset.sample.results[c("ID", "sex")]) # Extract the order of samples in the ExpressionSet and match with results eset.names <- colnames(exprs(massi.eset)) # match the sample order in massiR results to the same as the ExpressionSet object sexData <- sexData[match(eset.names, sexData$ID),] # create an annotatedDataFrame to add to ExpressionSet pData <- new("AnnotatedDataFrame", data = sexData) # add the annotatedDataFrame to the Expressionset as phenoData phenoData(massi.eset) <- pData @ Check the phenoData is in the ExpressionSet and double check that all sample id's from the massiR analysis match the sample identifiers in the ExpressionSet. <<>>= # check the phenodata is now within the ExpressionSet phenoData(massi.eset) @ <<>>= # check that all phenodata id's match expressionSet column names. # This must return "TRUE" all(massi.eset$ID == colnames(exprs(massi.eset))) @ \section{Using the included massiR Y chromosome probe lists} The massiR package includes lists of Y chromosome probes for widely used Illumina and Affymetrix human gene expression platforms. If you wish to use one of the included probe lists, for example the Illumina human v2 probes: Load the massiR included probe lists: <>= data(y.probes) @ Check the names of the platforms for the probe lists. <>= names(y.probes) @ To get probe list into format for massiR analysis: <>= illumina.v2.probes <- data.frame(y.probes["illumina_humanwg_6_v2"]) @ The names of the probe lists correspond to Ensembl biomart attribute names. For instructions on obtaining probe identifiers for other platforms, see the section "Using biomaRt to obtain y chromosome probe lists" \section{Using biomaRt to obtain y chromosome probe lists} Obtaining y chromosome probes lists for many microarray platforms is relatively easy using the biomaRt package (Durinik et al. 2005 and Durinik et al. 2009). This method is recommended because Ensembl have mapped probe sequences to reference genomes for many platforms and this allows ambiguous and non-specific probes to be removed. For details on probe mapping methods, see \url{} For example, you can download the probes corresponding to the massiR test data set and obtain the Entrez gene id and genomic positions and convert these into a format for a massiR analysis: Use the biomaRt package to download genomic regions and Entrez gene id's for Illumina v2 probes: <>= library(biomaRt) mart <- useMart('ensembl', dataset="hsapiens_gene_ensembl") filters <- listFilters(mart) attributes <- listAttributes(mart) gene.attributes <- getBM(mart=mart, values=TRUE, filters=c("with_illumina_humanwg_6_v2"), attributes= c("illumina_humanwg_6_v2", "entrezgene", "chromosome_name", "start_position", "end_position", "strand")) @ Remove the probes mapped to multiple genomic regions: <>= unique.probe <- subset(gene.attributes, subset=!duplicated(gene.attributes[,1])) @ Select the probes that correspond to y chromosome genes: <>= y.unique <- subset(unique.probe, subset=unique.probe$chromosome_name == "Y") @ Get the probe id's as row.names in the format for massiR analysis: <>= illumina.v2.probes <- data.frame(row.names=y.unique$illumina_humanwg_6_v2) @ This is a straightforwd way of obtaining Y chromosome probes for many microarray platforms that is independent of platform manufacturer annotations and is highly reccomended. \clearpage \section{References} \paragraph{} \hspace{0pt} \\ Christian Hennig (2013) fpc: Flexible procedures for clustering. R package version 2.1-6. http://CRAN.R-project.org/package=fpc \paragraph{} \hspace{0pt} \\ Martin Maechler (2013) diptest: Hartigan's dip test statistic for unimodality - corrected code. R package version 0.75-5. http://CRAN.R-project.org/package=diptest \paragraph{} \hspace{0pt} \\ Gregory R. Warnes, Ben Bolker, Lodewijk Bonebakker, Robert Gentleman, Wolfgang Huber, Andy Liaw, Thomas Lumley, Martin Maechler, Arni Magnusson, Steffen Moeller, Marc Schwartz and Bill Venables (2013). gplots: Various R programming tools for plotting data. R package version 2.12.1. http://CRAN.R-project.org/package=gplots \paragraph{} \hspace{0pt} \\ Steffen Durinck, Paul T. Spellman, Ewan Birney and Wolfgang Huber (2009) Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nature Protocols 4, 1184-1191. \paragraph{} \hspace{0pt} \\ Steffen Durinck, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma and Wolfgang Huber (2005) BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics 21, 3439-3440. \end{document}