---
title: "Generate transcript to gene file for bustools"
output: 
  BiocStyle::html_document:
    df_print: paged
vignette: >
  %\VignetteIndexEntry{tr2g}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

Converting the BUS format into a gene count matrix requires a file that maps transcripts to genes. This file is required both by the C++ command line tool `bustools` and by this package, if you choose to use it to generate the gene count matrix. `bustools` expects the file in a specific format. This package implements utility functions that generate the transcript to gene data frame and save it in the format required by `bustools`. This vignette shows examples of generating this file with different methods.

```{r setup}
library(BUSpaRse)
library(TENxBUSData)
```

# Obtaining transcript to gene information
## From Ensembl
The transcript to gene data frame can be generated by querying Ensembl directly with biomaRt. This works not only for the vertebrate database (www.ensembl.org), but also for the Ensembl databases for other organisms, such as plants (plants.ensembl.org) and fungi (fungi.ensembl.org). By default, the most recent version of Ensembl is used, but older versions can also be specified. The Ensembl transcript ID (with version number), gene ID (with version number), and gene symbol are downloaded by default, and other attributes available on Ensembl can be requested as well. Make sure that the Ensembl version matches the Ensembl version of the transcriptome used to build the kallisto index.
```{r}
# Specify other attributes
tr2g_mm <- tr2g_ensembl("Mus musculus", ensembl_version = 94, 
                        other_attrs = "description")
```

```{r}
head(tr2g_mm)
```

```{r}
# Plants
tr2g_at <- tr2g_ensembl("Arabidopsis thaliana", type = "plant")
```

```{r}
head(tr2g_at)
```

## From FASTA file
This method requires the FASTA file of the transcriptome used to build the kallisto index. Transcriptome FASTA files from Ensembl contain gene annotation in the sequence name of each transcript, so transcript and gene information can be extracted from the sequence names. At present, only Ensembl FASTA files, or FASTA files with sequence names formatted as in Ensembl, are accepted.
```{r}
# Subset of a real Ensembl FASTA file
toy_fasta <- system.file("testdata/fasta_test.fasta", package = "BUSpaRse")
tr2g_fa <- tr2g_fasta(file = toy_fasta)
head(tr2g_fa)
```
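
To illustrate what extracting information from the sequence name means, here is a minimal base R sketch. It is not the implementation used by `tr2g_fasta`. The header lines and IDs are made up for illustration, the `gene:` and `gene_symbol:` fields follow the layout of Ensembl cDNA FASTA headers, and the output column names `transcript`, `gene`, and `gene_name` are assumptions.

```r
# Toy header lines laid out like Ensembl cDNA FASTA sequence names.
# All IDs and coordinates are invented for illustration.
headers <- c(
  ">TX0001.1 cdna chromosome:toy:1:100:900:1 gene:GENE0001.2 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:Abc1 description:toy gene one",
  ">TX0002.3 cdna chromosome:toy:2:50:700:-1 gene:GENE0002.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:Xyz2 description:toy gene two"
)

# The transcript ID is the first token after ">".
transcript <- sub("^>(\\S+).*$", "\\1", headers, perl = TRUE)
# The gene ID and gene symbol follow the "gene:" and "gene_symbol:" tags.
gene <- sub("^.* gene:(\\S+).*$", "\\1", headers, perl = TRUE)
gene_name <- sub("^.* gene_symbol:(\\S+).*$", "\\1", headers, perl = TRUE)

tr2g_fa_sketch <- data.frame(transcript, gene, gene_name,
                             stringsAsFactors = FALSE)
tr2g_fa_sketch
```

Because the rows are produced in the order of the FASTA records, a data frame built this way follows the order of the file it was parsed from.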

## From GTF and GFF3 files
If you have GTF or GFF3 files for other purposes, these can also be used to generate the transcript to gene file.
```{r}
# Subset of a real GTF file from Ensembl
toy_gtf <- system.file("testdata/gtf_test.gtf", package = "BUSpaRse")
tr2g_tg <- tr2g_gtf(toy_gtf)
head(tr2g_tg)
```
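
The transcript and gene information in a GTF file lives in the ninth, tab-separated attribute column as `key "value";` pairs. The base R sketch below shows one way such pairs could be pulled out. It is only an illustration under stated assumptions, not what `tr2g_gtf` does internally; the GTF lines and the helper `get_attr` are hypothetical.

```r
# Toy GTF lines; all values are invented for illustration.
gtf_lines <- c(
  'chr1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "GENE0001"; transcript_id "TX0001"; gene_name "Abc1";',
  'chr1\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "GENE0001"; transcript_id "TX0001"; gene_name "Abc1";',
  'chr2\ttoy\ttranscript\t50\t700\t.\t-\t.\tgene_id "GENE0002"; transcript_id "TX0002"; gene_name "Xyz2";'
)

fields <- strsplit(gtf_lines, "\t", fixed = TRUE)
feature <- vapply(fields, `[`, character(1), 3)  # feature type column
attrs <- vapply(fields, `[`, character(1), 9)    # attribute column

# Hypothetical helper: return the value of one attribute key, or NA if absent.
get_attr <- function(attrs, key) {
  pattern <- paste0('.*\\b', key, ' "([^"]*)".*')
  ifelse(grepl(pattern, attrs, perl = TRUE),
         sub(pattern, "\\1", attrs, perl = TRUE),
         NA_character_)
}

# Keep one row per transcript by using only the "transcript" features.
tx_attrs <- attrs[feature == "transcript"]
tr2g_gtf_sketch <- data.frame(
  transcript = get_attr(tx_attrs, "transcript_id"),
  gene = get_attr(tx_attrs, "gene_id"),
  gene_name = get_attr(tx_attrs, "gene_name"),
  stringsAsFactors = FALSE
)
tr2g_gtf_sketch
```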

## Sorting
Before the transcript to gene data frame is used to generate the sparse matrix, it must be sorted into the same order as the transcripts in the kallisto index. The equivalence classes from kallisto contain sets of numbers; each number is the zero-based position of a transcript in the index, so the first transcript in the index is 0, the second is 1, and so on. Note that if the same FASTA file used for kallisto indexing is used to generate the transcript to gene data frame, then the data frame should already be in the right order. Here is an example of code used to sort the data frame, where `"./out_retina"` is the directory where the kallisto output is stored.

```{r}
# Download kallisto bus example output
TENxBUSData(".", dataset = "retina")
tr2g_mm <- sort_tr2g(tr2g_mm, kallisto_out_path = "./out_retina")
```
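
The idea behind the sorting can be sketched in base R with `match()`. This is a simplified illustration rather than the code of `sort_tr2g`: the vector `index_order` stands in for the transcript order that would be read from the kallisto output directory (for example from its `transcripts.txt` file), and all IDs and column names are made up.

```r
# Toy transcript order, standing in for the order of transcripts in the index.
index_order <- c("TX0003.1", "TX0001.1", "TX0002.3")

# Toy transcript to gene data frame in a different order.
tr2g_toy <- data.frame(
  transcript = c("TX0001.1", "TX0002.3", "TX0003.1"),
  gene = c("GENE0001.2", "GENE0002.1", "GENE0003.1"),
  stringsAsFactors = FALSE
)

# Row of tr2g_toy that corresponds to each transcript in the index.
pos <- match(index_order, tr2g_toy$transcript)
# Every transcript in the index must have an entry in the data frame.
stopifnot(!anyNA(pos))

tr2g_sorted <- tr2g_toy[pos, , drop = FALSE]
rownames(tr2g_sorted) <- NULL

# Equivalence class entries are zero-based, so entry 0 refers to row 1.
ec <- c(0, 1)
genes_in_ec <- tr2g_sorted$gene[ec + 1]
genes_in_ec
```

After sorting, row `i` of the data frame describes the transcript numbered `i - 1` in the equivalence classes, which is why the lookup adds 1.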

## Mixed species
The `tr2g_*` family of functions in this package only processes one species at a time. The `transcript2gene` function processes multiple species at a time, but it only supports Ensembl queries and Ensembl FASTA files. This function also sorts the transcripts. It can be used for a single species as well, to extract the transcript and gene information and sort it in one step. Alternatively, you can simply concatenate the tr2g data frames for the individual species and then sort the result.
```{r}
# Download example mixed species output
TENxBUSData(".", dataset = "hgmm100")
tr2g_hgmm <- transcript2gene(species = c("Homo sapiens", "Mus musculus"),
                             type = "vertebrate",
                             ensembl_version = 94,
                             kallisto_out_path = "./out_hgmm100")
```
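
The alternative of concatenating single-species data frames and then sorting can be sketched in base R as follows. The data frames, IDs, and the vector `index_order_mixed` (standing in for the transcript order of a mixed species kallisto index) are all hypothetical.

```r
# Toy single-species data frames; all IDs are invented for illustration.
tr2g_hs_toy <- data.frame(
  transcript = c("HSTX0001.1", "HSTX0002.1"),
  gene = c("HSGENE0001.1", "HSGENE0002.1"),
  stringsAsFactors = FALSE
)
tr2g_mm_toy <- data.frame(
  transcript = c("MMTX0001.1", "MMTX0002.1"),
  gene = c("MMGENE0001.1", "MMGENE0002.1"),
  stringsAsFactors = FALSE
)

# Concatenate, then make sure no transcript ID appears twice.
tr2g_mixed <- rbind(tr2g_hs_toy, tr2g_mm_toy)
stopifnot(!anyDuplicated(tr2g_mixed$transcript))

# Toy transcript order of the mixed species index.
index_order_mixed <- c("HSTX0002.1", "HSTX0001.1", "MMTX0002.1", "MMTX0001.1")

# Reorder the combined data frame to follow the index.
tr2g_mixed_sorted <- tr2g_mixed[match(index_order_mixed, tr2g_mixed$transcript), ]
rownames(tr2g_mixed_sorted) <- NULL
tr2g_mixed_sorted
```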

# Save file for `bustools`
This package can save the transcript to gene data frame in the format required by `bustools`. This assumes that the data frame is already properly sorted.
```{r}
save_tr2g_bustools(tr2g_at, file_save = "./tr2g_at.tsv")
```
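
For orientation, the sketch below writes a toy data frame by hand with base R. It assumes that the format `bustools` expects is a tab-separated file without a header, with the transcript ID in the first column and the gene ID in the second; this is an illustration of that assumption, not the code of `save_tr2g_bustools`, and the IDs are made up.

```r
# Toy, already sorted transcript to gene data frame.
tr2g_toy <- data.frame(
  transcript = c("TX0001.1", "TX0002.3"),
  gene = c("GENE0001.2", "GENE0002.1"),
  gene_name = c("Abc1", "Xyz2"),
  stringsAsFactors = FALSE
)

# Write only the transcript and gene columns, tab-separated,
# without quotes, row names, or a header line.
out_file <- tempfile(fileext = ".tsv")
write.table(tr2g_toy[, c("transcript", "gene")], file = out_file,
            sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)

# Inspect the file that was written.
written <- readLines(out_file)
written
```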

```{r}
sessionInfo()
```