\documentclass[10pt]{article} %\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{lfa Package} \usepackage{fullpage} \usepackage{hyperref} \title{Logistic Factor Analysis Vignette} \author{Wei Hao, Minsun Song, John D. Storey} \date{\today} \begin{document} \maketitle \section{Introduction} Logistic Factor Analysis (LFA)~\cite{hao_probabilistic_2016}. Briefly, LFA fits a latent variable model on categorical (i.e. SNP genotypes coded as 0, 1, and 2) data by modeling the logit transformed binomial parameters in terms of latent variables. The resulting ``logistic factors'' are analagous to principal components, but fit into a convenient likelihood based model. As a result, the logistic factors can power a number of other analyses. \section{Sample usage} We include a sample real dataset with the package as the variable \texttt{hgdp\_subset}---a small subset of the HGDP genotypes. The row names are the rsids for the SNPs and the column names are coarse geographical labels for the individuals. <>= library(lfa) dim(hgdp_subset) @ \subsection{\texttt{lfa}} The \texttt{lfa} function has two required arguments. The first is the genotype matrix, and the second is the number of logistic factors including the intercept. <>= LF <- lfa(hgdp_subset, 4) dim(LF) head(LF) @ We can plot the first two logistic factors and color by geographical information: <>= dat <- data.frame(LF[,1], LF[,2], colnames(hgdp_subset)) colnames(dat) = c("LF1", "LF2", "geo") library(ggplot2) ggplot(dat, aes(LF1, LF2, color=geo)) + geom_point() + theme_bw() + coord_fixed(ratio=(max(dat[,1])-min(dat[,1]))/(max(dat[,2])-min(dat[,2]))) @ One aspect of \texttt{lfa} is that the return value is a matrix of logistic factors, thus, an important part of subsequent analysis is to keep your matrix of logistic factors to pass as an argument. \subsection{\texttt{af}} Given a genotype matrix and logistic factors, the \texttt{af} function computes the individual-specific allele frequencies <>= allele_freqs <- af(hgdp_subset, LF) allele_freqs[1:5, 1:5] @ Since the calculation is independent at each locus, you can pass a subset of the genotype matrix as an argument if you aren't interested in all the SNPs. <>= subset <- af(hgdp_subset[15:25,], LF) subset[1:5,1:5] @ Given the allele frequencies, you can do some other interesting calculations---for example, compute the log-likelihood for each SNP. <>= ll <- function(snp, af){ -sum(snp*log(af) + (2-snp)*log(1-af)) } log_lik <- sapply(1:nrow(hgdp_subset), function(i) {ll(hgdp_subset[i,], allele_freqs[i,])}) which(max(log_lik) == log_lik) @ \section{Data Input} The best way to load genotypes is by using the function \texttt{read\_plink}, from the \texttt{genio} package, which assumes that you have binary PLINK formatted genotypes. The binary PLINK format uses files: a \texttt{.bed} for the genotypes, a \texttt{.bim} for the genotype information, and a \texttt{.fam} for the individuals information. \texttt{read\_plink} takes as an argument the prefix for your three files. A \texttt{BEDMatrix} object (from the eponymous function and package) is also supported, and can result in reduced memory usage (at a small runtime penalty). \bibliographystyle{plain} \bibliography{lfa} \end{document}