---
title: "packFinder"
author: "Jack Gisby"
date: "`r Sys.Date()`" 
output:
    BiocStyle::html_document:
        number_sections: yes
        toc: true
vignette: >
    %\VignetteIndexEntry{packFinder}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8} 
---

```{r setup, include = FALSE}
BiocStyle::markdown()

library(packFinder)

dir.create("tempOutput")
```

# Introduction

The goal of packFinder was to implement a simple tool for the prediction of 
potential Pack-TYPE elements. packFinder uses the following prior knowledge, 
provided by the user, to detect transposons:

1. Terminal Inverted Repeat (TIR) Base Sequence
2. Length of Terminal Site Duplication (TSD)
3. Length of the Transposon

These features, shown in __Figure 1__, provide enough information to detect 
autonomous and pack-TYPE elements. For a transposon to be predicted by 
packFinder its TSD sequences must be identical to each other, its forward TIR 
sequence must match the base sequence provided and its reverse TIR sequence 
must match its reverse complement.

![**Important structural features of Pack-TYPE transposons**](tirSeq.jpg)

Transposons are therefore predicted by searching a given genome for these 
characteristics, and further analysis steps can reveal the nature of these 
elements - while the packFinder tool is sensitive for the detection of 
transposons, it does not discriminate between autonomous and Pack-TYPE 
elements. 

Autonomous elements will contain a transposase gene within the 
terminal inverted repeats and tend to be larger than their Pack-TYPE 
counterparts; pack-TYPE elements instead capture sections of host genomes. 
Following cluster analysis, BLAST can be used to discern which predicted 
elements are autonomous (transposase-containing) and with are true 
Pack-TYPE elements.

# Getting Started

Users may download packFinder and use the primary function - packSearch - 
to locate potential transposons in a given set of DNA sequences. In addition 
to R packages, the command line tool `VSEARCH` must be installed prior to use 
of clustering and alignment functions.

## R Package Dependencies

CRAN and Bioconductor packages will be automatically installed 
when downloading `packFinder`. The package may be installed 
from the development branch of Bioconductor, as long as R version 
4.0.0 is installed. 

```{r packFinderInstall, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# The following initializes usage of Bioc devel
BiocManager::install(version='devel')

BiocManager::install("packFinder")
```

## Command Line Dependencies

While this allows R to run the packSearch pipeline within packFinder, 
VSEARCH must be installed for use of clustering and alignment functions 
within packSearch. Detailed installation instructions are available from 
the README file on the VSEARCH github https://github.com/torognes/vsearch). 
The command line can be used to install VSEARCH on Linux and MacOS operating 
systems (using wget and tar) while VSEARCH can be downloaded and extracted for 
use on Windows systems. 

For Linux and MacOS systems, correct installation of VSEARCH should allow 
users to use all functions within packFinder whereas for windows users, the 
absolute path to the VSEARCH executable file must be specified when calling 
packFinder clustering and alignment functions. This vignette will assume a 
MacOS/Linux operating system - see ?packClust for examples using windows.

# Searching for Potential Transposable Elements using packFinder

## Getting Data

Both DNA to be searched and the TIR query must be in XString format 
(see Biostrings). The TIR query should be coerced to a DNAString object 
while the DNA sequence, or set of sequences, should be a DNAStringSet. 
Bioconductor's `Biostrings` provides methods for the conversion of various 
formats to `DNAString` objects, including from character vectors and 
fasta files.

```{r biostrings}
# Convert a character vector to a DNAString
Biostrings::DNAString("CATG")

# Convert a list of character vectors to a DNAStringSet
Biostrings::DNAStringSet(c(
    "CATG",
    "GTAC"
))

# Convert a FASTA file to a DNAStringSet
Biostrings::readDNAStringSet(
    system.file("extdata", "packMatches.fasta", package = "packFinder"),
    format = "fasta"
)
```


## Using packSearch

In this example we search a subset of chromosome 3, from the Arabidopsis 
thaliana reference sequence; our query will be first 8 bases of the CACTA1 
autonomous transposable element's TIR. The CACTA transposons have TSD 
sequences that are 3 bases long, and the Pack-CACTA elements tend to be 
between 300 to 3500 bases in width. Since the first 8 bases of the Pack-CACTA 
TIRs tend to be conserved between elements, the default allowable mismatch of 
0 was used. 

```{r packSearch}
data("arabidopsisThalianaRefseq")

packMatches <- packSearch(
    Biostrings::DNAString("CACTACAA"),
    arabidopsisThalianaRefseq,
    elementLength = c(300, 3500),
    tsdLength = 3
)

head(packMatches)
```

In this subset, we identified six potential Pack-TYPE elements, however at 
this stage it is unclear what genetic information can be found between the 
inverted TIR sequences __Figure 1__. These elements could contain transposase 
genes, making them autonomous elements, or may be Pack-TYPE elements that 
have captured parts of the Arabidopsis thaliana host genome. Additionally, 
of the Pack-TYPE elements detected, some may share sequences of related 
chromosomal origin between their TIRs. 

`packSearch` returns a dataframe of ranges, in the format produced by 
coercing a GRanges object to a dataframe: `data.frame(GRanges)`. 
This format is used to transfer the transposon ranges between the functions 
in `packFinder`.

# Analysing Potential Transposable Elements using packFinder

## Clustering of Transposable Elements

In order to make downstream analysis more efficient and understand the 
relations between the identified elements, we can use VSEARCH to cluster 
predicted transposons. Here we run VSEARCH with the default parameters, 
meaning:

*An identity of 60% between two elements is required to form a cluster
*The method of identity detection is the VSEARCH default

The identity and method of identity definition can be altered depending on 
analysis (see VSEARCH documentation). 

```{r packClustData, include=FALSE}
data(packMatches)
```
```{r packClust, eval=FALSE}
packMatches <- packClust(
    packMatches,
    arabidopsisThalianaRefseq,
    saveFolder = "tempOutput"
)
```
```
## Rognes T, Flouri T, Nichols B, Quince C, Mahe F (2016)
## VSEARCH: a versatile open source tool for metagenomics
## PeerJ 4:e2584 doi: 10.7717/peerj.2584 https://doi.org/10.7717/peerj.2584
##
## vsearch v2.14.1_win_x86_64, 7.9GB RAM, 4 cores
## https://github.com/torognes/vsearch
##
## vsearch v2.14.1_win_x86_64, 7.9GB RAM, 4 cores
## https://github.com/torognes/vsearch
##
## Reading file data/packMatches.fasta 100%
## 9396 nt in 6 seqs, min 713, max 2463, avg 1566
## Counting k-mers 100%
## Clustering 100%
## Sorting clusters 100%
## Writing clusters 100%
## Clusters: 5 Size min 1, max 2, avg 1.2
## Singletons: 4, 66.7% of seqs, 80.0% of clusters
## Sorting clusters by abundance 100%
```
```
##   seqnames   start     end width strand TSD cluster
## 1     Chr3  100830  102347  1518      + TGT       3
## 2     Chr3 1068802 1069514   713      + ATA       4
## 3     Chr3 2807747 2809454  1708      + GGT       2
## 4     Chr3 3487540 3488267   728      + ATA       4
## 5     Chr3 3582297 3584562  2266      + ATA       1
## 6     Chr3 3738747 3741209  2463      + TTG       0
```

Of the 6 elements identified in this data subset, only two were found to be 
in the same cluster. When a new cluster is created, the transposon is 
designated as being on the forwards strand (+); elements that are subsequently 
assigned to this cluster are given a strand designation relative to the 
original element in the cluster. 

For this data, as there are few clusters, 
all of the elements have been designated as being on the forwards strand. 
Adjusting the identity % required or changing the identity definition could 
lead to more effective clustering, or lead to false-positives. 

Note, by default `filterWildcards` is called when clustering 
or aligning sequences; this prevents low quality sequences from clustering 
together, and by default removes sequences with a proportion of wildcards 
("N") above 5%. 

## Reading VSEARCH Output Files

Based on the results of packClust, we found that the second and fourth matches 
have similar sequences. Additionally, these potential elements have a similar 
width and so it is feasible that these are elements that have been duplicated 
by a transposase. To investigate the extent of the similarities, we can read 
the more detailed VSEARCH output files:

* USEARCH cluster format `.uc` - containing a summary of the clustering done 
by VSEARCH
* BLAST output - a BLAST compatible summary containing details of the BLAST 
matches between clusters 


```{r readClust}
readUc(system.file(
    "extdata",
    "packMatches.uc",
    package = "packFinder"
))

readBlast(system.file(
    "extdata",
    "packMatches.blast6out",
    package = "packFinder"
))
```

## Clustering of TIR Sequences

Additionally, tirClust can provide a summary of the similarities between 
the TIR sequences of clustered transposons. While in this example "CACTACAA" 
has been used as the TIR search query, the CACTA TIR sequence is longer than 
8 base pairs - although the rest of the TIR sequence may be less well 
conserved. 

For each cluster, tirClust creates a consensus sequence for 
the forward and reverse TIR regions; in this case we will consider the first 
25 base pairs of each TIR. Additionally, clustering of these TIRs is carried 
out using kmer clustering before being plotted as a dendrogram for 
visualisation. 

```{r tirClust}
consensusSeqs <- tirClust(packMatches,
    arabidopsisThalianaRefseq,
    tirLength = 25
)

head(consensusSeqs)
```

As expected, the forward and reverse TIRs of each transposon are very similar; 
this is also true for the two clustered transposons. From the dendrogram, 
groups are visible that weren't found by the VSEARCH clustering; this indicates 
that, while the TIR sequences are related, these groups likely have different 
genetic material between their TIR sequences.

## Alignment of Transposable Elements

After clusters of transposable elements have been identified, it may be useful 
to produce an alignment. Since we know that the transposable elements 
identified by VSEARCH have a minimum 60% sequence similarity, it will be 
possible to produce good quality sequence alignments. This can be useful 
for downstream analysis, such as BLAST searching. In this instance, an 
alignment was done for cluster 4; so an alignment of only two sequences 
was carried out. 

```{r align, eval=FALSE}
packMatches <- packAlign(
    packMatches,
    arabidopsisThalianaRefseq,
    saveFolder = "tempOutput"
)
```
```
## Rognes T, Flouri T, Nichols B, Quince C, Mahe F (2016)
## VSEARCH: a versatile open source tool for metagenomics
## PeerJ 4:e2584 doi: 10.7717/peerj.2584 https://doi.org/10.7717/peerj.2584
##
## vsearch v2.14.1_win_x86_64, 7.9GB RAM, 4 cores
## https://github.com/torognes/vsearch
##
## vsearch v2.14.1_win_x86_64, 7.9GB RAM, 4 cores
## https://github.com/torognes/vsearch
##
## Reading file data//packMatches.fasta 100%
## 9396 nt in 6 seqs, min 713, max 2463, avg 1566
## Masking 100%
## Aligning 100%
## Matching query sequences: 5 of 6 (83.33%)
```

As with VSEARCH clustering, a `.uc` and `.blast` file are created by the 
alignment. The alignment function returns a summary of the `.uc` file, 
however the full details of each file can be recovered as before. 

## BLAST Analysis
BLAST search has not been implemented in this package, so it is expected that 
the user will define clusters of transposable elements above before BLAST 
investigation. This stage will identify the tranposable elements that are 
autonomous and confirm which matches are likely pack-TYPE elements. 

Since 
this example included only a small subset of the Arabidopsis thaliana genome, 
and there were very few elements identified, a BLASTN search was carried out 
for cluster 4. In a larger study, it would have been wise to instead select 
more interesting clusters and BLAST a subset of these. 

```{r blast}
# the packMatches dataframe is exported as a FASTA file for NCBI blast search
packsToFasta(
    packMatches,
    "tempOutput/packMatches.fasta",
    arabidopsisThalianaRefseq
)
```

After running the cluster 4 sequence using NCBI's BLASTX service, two major 
hits were found (GENBANK: AAD28056.1 and AAF06087.1) for the sequence. One 
match appeared to be similar to a putative En/Spm-like transposon protein, 
suggesting that this transposon may not be a Pack-TYPE element as it may not 
have captured genetic material from the host genome.

# Data Formats and Conversion

## packSearch Dataframe

Dataframes are used by packFinder to contain the locations of predicted 
elements and additional metadata, such as TSD sequence and cluster 
designation. The dataframe is in a similar format to that stored by 
GenomicRanges. The dataframe can be saved and restored, as below, for longer 
term storage and export of packFinder results. packMatches refers to the 
object created in previous steps, by applying packSearch followed by packClust.

```{r csvConvert}
packsToCsv(packMatches, "tempOutput/packMatches.csv")
print(getPacksFromCsv("tempOutput/packMatches.csv"))
```

## GRanges

As mentioned previously, the packFinder dataframe is in a similar format to 
that stored by GenomicRanges. The data produced by packFinder can therefore 
be quickly converted to and from a GRanges object, as below. 

```{r grangesConvert}
packsGRanges <- packsToGRanges(packMatches)
print(packsGRanges)

print(getPacksFromGRanges(packsGRanges))
```

If many elements are discovered by packFinder it may be necessary to identify 
overlapping elements, that could not be produced by a transposase, and remove 
them; this can be done using GenomicRanges. 

## FASTA

Additionally, the dataframe produced by packFinder can be exported to and 
restored from FASTA formats. While this will not save additional metadata 
columns, such as cluster designations, the core information will be
preserved in the FASTA title field. 

```{r fastaConvert}
packsToFasta(
    packMatches,
    "tempOutput/packMatches.fasta",
    arabidopsisThalianaRefseq
)

print(getPacksFromFasta("tempOutput/packMatches.fasta"))
```

# Step-wise packSearch Functions

While it is recommended to use packSearch to identify potential Pack-TYPE 
elements, it is possible to run the individual functions that make up the 
packSearch pipeline. Below are the steps used by packSearch to predict 
transposons, using the same parameters as the previous example. 

## Identifying Potential TIRs

Potential TIR sequences are first identified by pattern matching. 

```{r identifyTirMatches}
forwardMatches <- identifyTirMatches(
    Biostrings::DNAString("CACTACAA"),
    arabidopsisThalianaRefseq,
    strand = "+",
    tsdLength = 3
)
nrow(forwardMatches)

reverseMatches <- identifyTirMatches(
    Biostrings::reverseComplement(Biostrings::DNAString("CACTACAA")),
    arabidopsisThalianaRefseq,
    strand = "-",
    tsdLength = 3
)
nrow(reverseMatches)
```

## Obtaining TSD Sequences 

The function `getTsds` may be used to quickly obtain the TSD sequences for 
TIRs. This function can also be used for obtaining TSDs from the ranges of 
known full transposons, given a dataframe in the format produced by packSearch. 

```{r getTsds}
forwardMatches$TSD <- getTsds(
    forwardMatches,
    arabidopsisThalianaRefseq,
    3,
    strand = "+"
)

head(forwardMatches)

reverseMatches$TSD <- getTsds(
    reverseMatches,
    arabidopsisThalianaRefseq,
    3,
    strand = "-"
)

head(reverseMatches)
```

## Filtering TIR Matches

The main step of the algorithm matches TIR sequences together based on their 
proximity and TSD sequence. 

```{r identifyPotentialPackElements}
identifyPotentialPackElements(
    forwardMatches,
    reverseMatches,
    arabidopsisThalianaRefseq,
    c(300, 3500)
)
```

After attaching the TSD sequences to the dataframe of matches, the output 
matches that of the previous example run using the packSearch pipeline. 

## Get Transposon Sequences 

The function `getSeqs` may be used to obtain transposon sequences from a 
dataframe in the format produced by packSearch. This could additionally be 
used on other dataframes of ranges, such as that produced by 
identifyTirMatches above. 

```{r getPackSeqs}
getPackSeqs(packMatches, arabidopsisThalianaRefseq)
```

# References
Information on the properties of Pack-TYPE transposable elements 
was obtained from the following papers during the development of 
`packFinder`.

* Catoni, M. et al. (2019) ‘Mobilization of Pack-CACTA transposons in 
Arabidopsis suggests the mechanism of gene shuffling’, 
*Nucleic Acids Research*. Oxford University Press, 47(3), pp. 1311–1320. 
doi: 10.1093/nar/gky1196.
* Jiang, N. et al. (2004) ‘Pack-MULE transposable elements mediate gene 
evolution in plants’, *Nature*, 431(7008), pp. 569–573. 
doi: 10.1038/nature02953.

The Bioconductor packages `GenomicRanges` and `Biostrings` are 
used for detection of tranposable elements, and manipulation of 
DNA sequences. 

# Session Information

This vignette was compiled during the following session:

```{r SessionInfo}
sessionInfo()
```

```{r include=FALSE}
unlink("tempOutput", recursive = TRUE)
```