--- title: "rSWeeP: Functions to creation of low dimensional comparative matrices of Amino Acid Sequence occurrences " author: - name: Danrley Rafael Fernandes affiliation: Federal University of Paraná, Graduate in Biological Sciences, Curitiba, Paraná, Brazil. - name: Camilla Reginatto De Pierri affiliation: - Federal University of Paraná, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil. - Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil. - name: Mariane Gonçalves Kulik affiliation: Federal University of Paraná, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil. - name: Roberto Raittz affiliation: - Federal University of Paraná, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil. - Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil. output: BiocStyle::html_document package: rSWeeP abstract: | This is a package with a couple of functions to possibilite the use of the sWeeP method in R. This method was developed to favor the analizes between amino acids sequences and to assist alignment free phylogenetic studies. This method is based on the concept of sparse words, which is applied in the scan of biological sequences and its the conversion in a matrix of ocurrences. Aiming the generation of low dimensional matrices of Amino Acid Sequence occurrences. vignette: > %\VignetteIndexEntry{rSWeeP} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Overview The “Spaced Words Projection (sWeeP)” is a method for representing biological sequences using relatively, it uses the spacedwords concept by scanning sequences and generating indices to create a higherdimensional vector that is later projected into a smaller randomly oriented orthonormal base. This function is suitable for making high quality comparisons between sequences allowing analyzes that are not possible due to the computational limitation of the traditional techniques. The method is available at [sWeeP](https://sourceforge.net/projects/spacedwordsprojection/) (PIERRI, 2019). This tool has it's main speed gain in constanci processing time. The response time grows linear to the number of inputs, while in other methods it grow is exponencial. ## Functions The package has two functions: orthBase, that generates an orthonormal matrix of a chosen size, and sWeeP, a function that applies the sWeeP method # Quick Start The orthBase function can create a quasi-orthonormal matrix in any desired size. Here it is used to create a matrix to project the sWeeP method, so it must have 160.000 rows and the columns of the size wished for projection. ```{r} library(rSWeeP) baseMatrix <- orthBase(160000,10) ``` The **exdna.fas** dataset consists in a list of three strings that simulates a DNA sequence used for demonstration purposes only. ```{r} path <- system.file(package = "rSWeeP", "extdata", "exdna.fas") ``` Then the sWeeP method is applied and the returns a matrix that represents the sequences compared by a vectorial method. And then it's possible to see a graphic representation in a phylogenetic tree ```{r} return <- sWeeP(path,baseMatrix) distancia <- dist(return, method = "euclidean") tree <- hclust(distancia, method="ward.D") plot(tree, hang = -1, cex = 1) ``` # Session information ```{r label='Session information', eval=TRUE, echo=FALSE} sessionInfo() ``` # References - Pierri,C. R. *et al*. **sWeeP: Representing large biological sequences data sets in compact vectors**. Scientific Reports, accepted in December 2019.doi: 10.1038/s41598-019-55627-4.