---
title: "Generate transcript to gene file for bustools"
output: 
  BiocStyle::html_document:
    df_print: paged
vignette: >
  %\VignetteIndexEntry{Transcript to gene}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

Originally, this package was written when the `kallisto | bustools` concept was still experimental, to test a new and fast way to generate the gene count matrix from fastq files for scRNA-seq. In the past year, `kallisto | bustools` has matured. Now there's a wrapper [kb-python](https://github.com/pachterlab/kb_python) that can download a prebuilt `kallisto` index for human and mice and call `kallisto bus` and `bustools` to get the gene count matrix. So largely, the old way of calling `kallisto bus` and `bustools`, and some functionalities of `BUSpaRse`, such as getting transcript to gene mapping, are obsolete.

So now the focus of `BUSpaRse` has shifted to finer control of the transcripts that go into the transcriptome and more options. Now all `tr2g_*` functions (except `tr2g_ensembl`) can filter transcripts for gene and transcript biotypes, only keep standard chromosomes (so no scaffolds and haplotypes), and extract the filtered transcripts from the transcriptome. GTF files from Ensembl, Ensembl fasta files, GFF3 files from Ensembl and RefSeq, TxDb, and EnsDb can all be used here. 
```{r setup}
library(BUSpaRse)
library(BSgenome.Hsapiens.UCSC.hg38)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(EnsDb.Hsapiens.v86)
```

# Downloading a transcriptome
The transcriptome can be downloaded from a specified version of Ensembl and filtered for biotypes and standard chromosomes, not only for the vertebrate database (www.ensembl.org and its mirrors), but also other Ensembl sites for plants, fungi, protists, and metazoa. The `gene_biotype_use = "cellranger"` means that the same gene biotypes [Cell Ranger uses for its reference package](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references) are used here. By default, only standard chromosomes are kept. The `dl_transcriptome` function not only downloads the transcriptome and filters it, it also output the `tr2g.tsv` file of all transcripts in the filtered transcriptome, without column names, so can be directly used for `bustools`. 

Wonder which biotypes are available? The lists of all gene and transcript biotypes from Ensembl are now provided in this package, and can be queried by `data("ensembl_gene_biotypes")` and `data("ensembl_tx_biotypes")`.

Resources for common invertebrate model organisms such as _Drosophila melanogaster_ and _C. elegans_ are actually available on the vertebrate site (www.ensembl.org).
```{r}
# For Drosophila
dl_transcriptome("Drosophila melanogaster", out_path = "fly", 
                 gene_biotype_use = "cellranger", verbose = FALSE)
list.files("fly")
```

The first file is the original fasta file. The second is the `tr2g` file without column names. The third is the filtered fasta file.

For _C. elegans_, from an archived version of Ensembl
```{r}
dl_transcriptome("Caenorhabditis elegans", out_path = "worm", verbose = FALSE,
                 gene_biotype_use = "cellranger", ensembl_version = 97)
list.files("worm")
```

For _Saccharomyces cerevisiae_. Note that the versioning of Ensembl for the plants, fungi, and etc. sites, that are actually www.ensemblgenomes.org, is different from that of the vertebrate site. 
```{r}
dl_transcriptome("Saccharomyces cerevisiae", out_path = "yeast", 
                 type = "fungus", gene_biotype_use = "cellranger", 
                 verbose = FALSE)
list.files("yeast")
```

# Obtaining transcript to gene information
## From Ensembl
The transcript to gene data frame can be generated by directly querying Ensembl with biomart. This can query not only the vertebrate database (www.ensembl.org), but also the Ensembl databases for other organisms, such as plants (plants.ensembl.org) and fungi (fungi.ensembl.org). By default, this will use the most recent version of Ensembl, but older versions can also be used. By default, Ensembl transcript ID (with version number), gene ID (with version number), and gene symbol are downloaded, but other attributes available on Ensembl can be downloaded as well. Make sure that the Ensembl version matches the Ensembl version of transcriptome used for kallisto index.
```{r}
# Specify other attributes
tr2g_mm <- tr2g_ensembl("Mus musculus", ensembl_version = 99, 
                        other_attrs = "description", 
                        gene_biotype_use = "cellranger")
```

```{r}
head(tr2g_mm)
```

```{r}
# Plants
tr2g_at <- tr2g_ensembl("Arabidopsis thaliana", type = "plant")
```

```{r}
head(tr2g_at)
```

## From FASTA file
We need a FASTA file for the transcriptome used to build kallisto index. Transcriptome FASTA files from Ensembl contains gene annotation in the sequence name of each transcript. Transcript and gene information can be extracted from the sequence name. At present, only Ensembl FASTA files or FASTA files with sequence names formatted like in Ensembl are accepted.

By default, the `tr2g.tsv` file and filtered fasta file (if filtering for biotypes and chromosomes) are written to disk, but these can be turned off so only the `tr2g` data frame is returned into the R session.
```{r}
# Subset of a real Ensembl FASTA file
toy_fasta <- system.file("testdata/fasta_test.fasta", package = "BUSpaRse")
tr2g_fa <- tr2g_fasta(file = toy_fasta, write_tr2g = FALSE, save_filtered = FALSE)
head(tr2g_fa)
```

## From GTF and GFF3 files
If you have GTF or GFF3 files for other purposes, these can also be used to generate the transcript to gene file. Now `tr2g_gtf` and `tr2g_gff3` can extract transcriptome from a genome that is either a `BSgenome` or a `DNAStringSet`. 
```{r}
# Subset of a reral GTF file from Ensembl
toy_gtf <- system.file("testdata/gtf_test.gtf", package = "BUSpaRse")
tr2g_tg <- tr2g_gtf(toy_gtf, Genome = BSgenome.Hsapiens.UCSC.hg38,
                    gene_biotype_use = "cellranger",
                    out_path = "gtf")
head(tr2g_tg)
```

A new GTF or GFF3 file after filtering biotypes and chromosomes is also written, and this can be turned off by setting `save_filtered_gtf = FALSE` or `save_filtered_gff = FALSE`. The transcriptome, with biotypes filtered and only standard chromosomes, is in `transcriptome.fa`. Use `compress_fa = TRUE` to gzip it. 
```{r}
list.files("gtf")
```

## From TxDb
`TxDb` is a class for storing transcript annotations from the Bioconductor package `GenomicFeatures`. Unfortunately, `TxDb.Hsapiens.UCSC.hg38.knownGene` does not have biotype information or gene symbols.
```{r}
tr2g_hs <- tr2g_TxDb(TxDb.Hsapiens.UCSC.hg38.knownGene, get_transcriptome = FALSE,
                     write_tr2g = FALSE)
head(tr2g_hs)
```

## From EnsDb
`EnsDb` is a class for Ensembl gene annotations, from the Bioconductor package `ensembldb`. Ensembl annotations as `EnsDb` are available on `AnnotationHub` (since version 87), and some older versions are stand alone packages (e.g. `EnsDb.Hsapiens.v86`).
```{r}
tr2g_hs86 <- tr2g_EnsDb(EnsDb.Hsapiens.v86, get_transcriptome = FALSE, 
                        write_tr2g = FALSE, gene_biotype_use = "cellranger",
                        use_gene_version = FALSE, use_transcript_version = FALSE)
head(tr2g_hs86)
```

## Deprecation
There used to be sections about `sort_tr2g` and `save_tr2g_bustools`, but these functions have been superseded by the new version of `tr2g` functions and `dl_transcriptome`, which sort the transcriptome after extracting it so the `tr2g` and the transcriptome are in the same order. The new version of `tr2g` functions and `dl_transcriptome` also by default writes the `tr2g.tsv` without column names with the first column as transcript and the second as gene to disk.

```{r}
sessionInfo()
```