\name{read.long} \alias{read.long} \title{ Read SNP genotype data in long format } \description{ This function reads SNP genotype data from a file in which each line refers to a single genotype call. Replaces the earlier function \code{read.snps.long}. } \usage{ read.long(file, samples, snps, fields = c(snp = 1, sample = 2, genotype = 3, confidence = 4, allele.A = NA, allele.B = NA), split = "\t| +", gcodes, no.call = "", threshold = NULL, lex.order = FALSE, verbose = FALSE) } %- maybe also 'usage' for other objects documented here. \arguments{ \item{file}{ Name(s) of file(s) to be read (can be gzipped) } \item{samples}{ Either a vector of sample identifiers, or the number of samples to be read. If a single file is to be read and this argument is omitted, the file will be scanned initially and all samples will be included } \item{snps}{ Either a vector of SNP identifiers, or the number of SNPs to be read. If a single file is to be read and this argument is omitted, the file will be scanned initially and all SNPs will be included } \item{fields}{ A named vector giving the locations of the required fields. See Details below } \item{split}{ A regular expression specifying how the input line will be split into fields. The default value specifies separation of fields by a TAB character, or by one or more blanks } \item{gcodes}{ When the genotype is read as a single field, this argument specifies how it is handled. See Details below. } \item{no.call}{ The string which indicates "no call" for either a genotype or (when the genotype is read as two allele fields) an allele } \item{threshold}{ A vector of length 2 giving the lower and higher acceptable limits for the confidence score } \item{lex.order}{ If \code{TRUE}, the alleles at each locus will be in lexographical order. Otherwise, ordering of alleles is arbitrary, depending on the order in which they are encountered } \item{verbose}{ If \code{TRUE}, this turns on output from the function. Otherwise only error and warning messages are produced } } \details{ Each line on the input file represents a single call and is split into fields using the function \code{strsplit}. The required fields are extracted according to the \code{fields} argument. This \emph{must} contain the locations of the sample and snp identifier fields and \emph{either} the location of a genotype field \emph{or} the locations of two allele fields. If the \code{samples} and \code{snps} arguments contain vectors of character strings, a \code{SnpMatrix} is created with these row and column names and the genotype values are "cherry-picked" from the input file. If either, or both, of these arguments are specified simply as numbers, then these numbers determine the \emph{dimensions} of the \code{SnpMatrix} created. In this case samples and/or SNPs are included in the \code{SnpMatrix} on a first-come-first-served basis. If either or both of these arguments are omitted, a preliminary scan of the input file is carried out to find the missing sample and/or SNP identifiers. In this scan, when a sample or SNP identifier differs from that in the previous line, but is identical to one previously found, then all the relevant identifiers are assumed to have been found. This implies that the file must be sorted, in some consistent order, by sample and by SNP (although either one of these may vary fastest). If the genotype is to be read as a single field, the \code{genotype} element of the \code{fields} argument must be set to the appropriate value, and the \code{allele.A} and \code{allele.B} elements should be set to \code{NA}. Its handling is controlled by the \code{gcodes} argument. If this is missing or \code{NA}, then the genotype is assumed to be represented by a two-character field, the two characters representing the two alleles. If \code{gcodes} is a single string, then it is assumed to contain a regular expression which will split the genotype field into two allele fields. Otherwise, \code{gcode} must be an array of length three, specifying the three genotype codes in the order "AA", "AB", "BB". If the two alleles of the genotype are to be read from two separate fields, the \code{genotype} element should be set to \code{NA} and the \code{allele.A} and \code{allele.B} elements set to the appropriate values. The \code{gcode} argument should be missing or set to \code{NA}. } \value{ If the genotype is read as a single field matching one of three specified codes, the function returns an object of class \code{SnpMatrix}. Otherwise it returns a list whose first element is the \code{SnpMatrix} object and whose second element is a dataframe containing the allele codes, with the SNP identifiers as row names. Note that allele codes only occur in this file if they occur in a genotype which was accepted. Thus, monomorphic SNPs have \code{allele.B} coded as \code{NA}, and SNPs which never pass confidence score filters have both alleles coded as \code{NA}. } \author{ David Clayton \email{david.clayton@cimr.cam.ac.uk} } \note{ Unlike \code{read.snps.long}, this function is written entirely in R and may not be particularly fast. However, it imposes no restrictions on the allele codes recognized. Homozygous genotypes are assumed to be represented in the input file by coding both alleles to the same value. No special provision is made to read \code{XSnpMatrix} objects; such data should first be read as a \code{SnpMatrix} and then coerced to an \code{XSnpMatrix} using \code{new} or \code{as}. } \seealso{ \code{\link{SnpMatrix-class}}, \code{\link{XSnpMatrix-class}} } \examples{ ## ## No example supplied yet ## } \keyword{manip} \keyword{IO} \keyword{file} \keyword{utilities}