---
title: "scoup: Simulate Codons with Darwinian Selection Incorporated as an
        Ornstein-Uhlenbeck Process"
author:
- name: Hassan Sadiq
  affiliation: Department of Statistics and Actuarial Science,
    Stellenbosch University, South Africa
output:
    BiocStyle::html_document:
        toc_float: true
package: scoup
bibliography: citation.bib
abstract: |
    Genetic simulations are an important part of molecular biology. They are
    useful for assessing the efficiency and the sensitivity of models of
    evolution. Despite their relevance, hardly any simulator dedicated to
    sequence generation for natural selection inference analyses exist on the
    Bioconductor platform. In the broader molecular evolution genre, existing
    genetic simulators are yet to fully appreciate the correspondence between
    the population genomic and the phylogenetic literature. `scoup`  is
    designed as a contribution toward filling these voids. With `scoup`, it
    is possible to explore the implications of the interplay between mutation
    and genetic drift on the phenomenological inferences of natural selection
    obtained from phylogenetic models. The ratio of the variance of
    non-synonymous to synonymous selection coefficient (vN/vS) is also
    presented as a reliable selection discriminant metric. Example code of
    how to use the package are presented with elaborate comments.  
vignette: |
    %\VignetteIndexEntry{scoup Tutorial}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r style, echo = FALSE, results = 'asis'}
    BiocStyle::markdown()
```

# Introduction
Statistical inference of the extent at which Darwinian natural selection had
impacted on genetic data from multiple populations commands a healthy portion
of the phylogenetic literature [@jacques2023]. Validation of these
codon-based models relies heavily on simulated data. A search of the entries
on the Bioconductor [@bioc2004] platform, on 29 July 2024, with keywords
`codon`, `mutation`, `selection`, `simulate` and `simulation` returned a total
of 72 unique (out of the 2300 available Software) packages. None of the
retrieved entries was dedicated to codon data simulation for natural selection
analyses. Given the ever increasing diversity natural selection inference
models that exist, there is indeed need for applicable packages on the
platform.

Population genomic studies provided the mathematical foundation upon which
phylogenetics thrived [@wright1931; @fisher1922; @hardy1908; @weinberg1908;
@darwin1859]. The thirst to bridge the gap between these two genres of
evolutionary biology continue to drive the invention of more complex models
of evolution [@brosou2012]. Consequently, there is need to develop codon
sequence simulators to match the growth. `scoup` is designed on the
basis of the mutation-selection (MutSel) framework [@halpern1998] as a
contribution to this quest. Only a couple of existing selection-focused
simulators in the literature used the MutSel framework [@wilke2015;
@arenas2014]. This is most probably due to the perceived complexity of the
methodology. In `scoup`, the versatility of the @uhlenbeck1930
algorithm as a framework for evolutionary analyses [@bartoszek2017] was
used to circumvent the complexity.

In a bid to identify an appropriate quantifier that permits direct comparison
between the degree of selection signatures imposed during simulation and that
inferred, the ratio of the variance of the non-synonymous to synonymous
selection coefficients (vN/vS) was discovered to be appropriate. The vN/vS
statistic is consequently posited as a quality selection discriminant metric.
`scoup` therefore represents an important contribution to the
phylogenetic modelling literature. Example code of how to successfully use
the package is presented below.

# Installation
Use the following code to install `scoup` from the Bioconductor platform.
```{r, eval=FALSE}
if(!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("scoup")
```

# Sample Code
Three primary evolutionary algorithms are available in `scoup`. These
include the frequency-dependent [@jones2017; @ayala1974], the
Ornstein-Uhlenbeck (OU) and the episodic algorithms. Example of `R`
[@cran2024] code where these functions are utilised are presented.
The homogeneous [@muse1994; @goldman1994] and heterogeneous (site-wise
and branch-wise) [@nielsen1998; @kpond2005] selection inference modelling
contexts are explored. Data quality is assessed by comparing the
maximum likelihood inferred ($\omega$) and the analytically calculated
($\mathrm{d}N/\mathrm{d}S$) estimates to the magnitude of the imposed
selection pressure (measured by vN/vS). Template code used to analyse
the output data (to obtain $\omega$) with `PAML` [@sandra2023; @yang2007]
and `FUBAR` [@murrell2013; @hyphy2005] are presented in the Appendix. The
`R` code presented subsequently, require that the user should have already
installed the `scoup` package.

## Ornstein-Uhlenbeck Sensitivity
```{r, eval=TRUE}
# Make package accessible in R session
library(scoup)

# Number of extant taxa
leaves <- 8

# Number of codon sites
sSize <- 15

# Number of data replications for each parameter combination
sims <- 1

# OU reversion parameter (Theta) value
eThta <- c(0.01)

# OU asymptotic variance value
eVary <- c(0.0001)

# OU landscape shift parameters
hbrunoStat <- hbInput(c(vNvS=1, nsynVar=0.01))

# Sequence alignment size information
seqStat <- seqDetails(c(nsite=sSize, ntaxa=leaves))

# Iterate over all listed OU variance values
for(g in seq(1,length(eVary))){

    # Iterate over all listed OU reversion parameter values
    for(h in seq(1,length(eThta))){

        # Create appropriate simulation function ("ou") object
        adaptStat <- ouInput(c(eVar=eVary[g],Theta=eThta[h]))

        # Iterate over the specified number of replicates
        for(i in seq(1,sims)){

            # Execute simulation
            simData <- alignsim(adaptStat, seqStat, hbrunoStat, NULL)
        }
    }
}
# Print simulated alignment
seqCOL(simData)
```
As expected, the correlation coefficient estimate was approximately
$0.9974$ when the means of the inferred ($\omega$) and the calculated
($\mathrm{d}N/\mathrm{d}S$) selection effects were first compared. The
correlation estimation included more iterations.


## vN/vS Sensitivity
```{r, eval=TRUE}
# Make package accessible in R session
library(scoup)

# Number of extant taxa
xtant <- 8

# Number of codon sites
siteSize <- 15

# Number of data replications for each parameter combination
simSize <- 1

# Variance of the non-synonymous selection coefficients
nsynVary <- c(0.001)

# Ratio of the variance of the non-synonymous to synonymous coeff.
vNvSvec <- c(1)

# Sequence alignment size information
seqStat <- seqDetails(c(nsite=siteSize, ntaxa=xtant))

# Iterate over all listed coefficient variance ratios
for(a in seq(1,length(vNvSvec))){

    # Iterate over all listed non-synonymous coefficients variance
    for(b in seq(1,length(nsynVary))){

        # Create appropriate simulation function ("omega") object
        adaptData <- wInput(list(vNvS=vNvSvec[a],nsynVar=nsynVary[b]))
        
        # Iterate over the specified number of replicates
        for(i in seq(1,simSize)){

            # Execute simulation
            simulateSeq <- alignsim(adaptData, seqStat, NA)
        }
    }
}
# Print simulated alignment
cseq(simulateSeq)
```
Sequences generated with the presented code (with more iterations
included) produced strongly correlated selection inferences
(correlation coefficient $\approx 0.9923$) when the average
$\mathrm{d}N/\mathrm{d}S$ and the $\omega$ values were compared.
This implementation is an example of how to execute the
frequency-dependent evolutionary technique with the package.

## Site-wise Application
```{r, eval=TRUE}
# Make package accessible in R session
library(scoup)

# Number of codon sites
sitesize<- 15

# Variance of non-synonymous selection coefficients
nsynVary <- 0.01

# Number of extant taxa
## Commented value was used for results presented in article
taxasize <- 8

# Sequence alignment size information
seqsEntry <- seqDetails(c(nsite=sitesize, ntaxa=taxasize))

# Create the applicable ("ou") object for simulation function
## eVar= OU asymptotic variance, Theta=OU reversion parameter
adaptEntry <- ouInput(c(eVar=0.1,Theta=1))

# Ratio of the variance of the non-synonymous to synonymous coeff.
sratio <- c(1e-06)

# Iterate over all listed coefficient variance ratios
for(a0 in seq(1,length(sratio))){

    # OU landscape shift parameters
    mValues <- hbInput(c(vNvS=sratio[a0], nsynVar=nsynVary))
    
    # Execute simulation
    simSeq <- alignsim(adaptEntry, seqsEntry, mValues, NA)
}
# Print simulated codon sequence
cseq(simSeq)
```
This is another example of how to call the OU shifting landscape
evolutionary approach. The results obtained yielded a pairwise
correlation coefficient estimate of approximately $0.9988$ between
the means of $\mathrm{d}N/\mathrm{d}S$ and $\omega$. The correlation
coefficient estimates were approximately $0.8123$ and $0.8305$ when
the averages were each compared to vN/vS, respectively.

## Branch-wise (Episodic) Test
```{r, eval=TRUE}
# Make package accessible in R session
library(scoup)

# Number of internal nodes on the desired balanced tree
iNode <- 3

# Number of required codon sites
siteCount <- 15

# Variance of non-synonymous selection coefficients
nsnV <- 0.01

# Number of data replications for each parameter combination
nsim <- 1

# Ratio of the variance of the non-synonymous to synonymous coeff.
vNvSvec <- c(1e-06)

# Sequence alignment size information
seqsBwise <- seqDetails(c(nsite=siteCount, blength=0.10))

# Iterate over all listed coefficient variance ratios
for(h in seq(1,length(vNvSvec))){

    # Iterate over the specified number of replicates
    for(i in seq(1,nsim)){

        # Create the parameter set applicable at each internal tree node
        scInput <- rbind(vNvS=c(rep(0,iNode-1),vNvSvec[h]),
                        nsynVar=rep(nsnV,iNode))
        
        # Create the applicable ("discrete") object for simulation function
        adaptBranch <- discreteInput(list(p02xnodes=scInput))
        
        # Execute simulation
        genSeq <- alignsim(adaptBranch, seqsBwise, NULL)
    }
}
# Print simulated sequence data
seqCOL(genSeq)
```
The correlation coefficient between the averages of the analytical
$\mathrm{d}N/\mathrm{d}S$ and the inferred $\omega$ estimates was
approximately $0.9998$, obtained from 50 independent iterations of
the code for a set of vN/vS values. The correlation coefficient
estimate was approximately $0.6349$ for vN/vS vs $\omega$ and $0.6360$
for vN/vS vs $\mathrm{d}N/\mathrm{d}S$.

# Conclusion
Reference `scoup` code are presented to facilitate use of the package.
Although not explicitly presented, it is also possible to generate data
with signatures of directional selection by setting the `aaPlus` element
of the `wInput` entry of the `alignsim` function accordingly. The
capacity of the package is expected to be extended in future versions.

# Citation
A more appropriate citation will be provided for the package 
after it has been accepted for publication. In the meantime, to cite
this package, use <em>Sadiq, H. 2024. "scoup: Simulate Codon Sequences
with Darwinian Selection Incorporated as an Ornstein-Uhlenbeck Process".
R Package. doi:10.18129/B9.bioc.scoup.</em>

# References

# Appendix: Sample Data Analyses (Non-R) Code
`CODEML` script executed in `PAML` to infer
single alignment-wide $\omega$ estimates for data sets generated from 50
independent executions of the sensitivity analyses code presented
above. The same `CODEML` script was used to analyse data (also 50
replicates) from the episodic analyses code, with the `model` entry
replaced with `2`. The `scoup` simulated sequence data and
tree are `seq.nex` and `seq.tre`, respectively.
```{bash, eval=FALSE}
    seqfile   = seq.nex
    treefile  = seq.tre
    outfile   = seq.out
    noisy     = 0
    verbose   = 0
    seqtype   = 1
    ndata     = 1
    icode     = 0
    cleandata = 0
    model     = 0
    NSsites   = 0
    CodonFreq = 3
    estFreq   = 0
    clock     = 0
    fix_omega = 0
    omega     = 1e-05
```

\vspace*{1cm}
`FUBAR` command executed with `HyPhy`
through the terminal in MacBook. Note that `HyPhy`
was already installed on the computer. The `seq.nex` input is the
`scoup` simulated codon sequence data that is saved in the same
NEXUS file with the tree data. The NEXUS file resides in the
working directory.
```{bash, eval=FALSE}
hyphy fubar --code Universal --alignment seq.nex --tree seq.nex
```

# Session info

The output of `sessionInfo()` from the computer where this file is generated
is provided below.

```{r sessionInfo, echo=FALSE}
sessionInfo()
```