---
title: "Introduction to DeProViR"
author:
- name: Matineh Rahmatbakhsh
  affiliation: ProCogia, Vancouver, BC
  email: matineh.rahmatbakhsh@procogia.com
package: DeProViR
output:
  BiocStyle::html_document:
    toc_float: true
    toc_depth: 4
    number_sections: true
    highlight: "textmate"
vignette: >
  %\VignetteIndexEntry{Introduction to DeProViR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r echo=FALSE,message=FALSE, warning=FALSE}
library("knitr")
```
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Abstract
Emerging infectious diseases, including zoonoses, pose a significant threat to
public health and the global economy, as exemplified by the COVID-19 pandemic
caused by the zoonotic severe acute respiratory syndrome coronavirus 2
(SARS-CoV-2). Understanding the protein-protein interactions (PPIs) between
host and viral proteins is crucial for identifying targets for antiviral
therapies and comprehending the mechanisms underlying pathogen
replication and immune evasion. Experimental techniques like yeast
two-hybrid screening and affinity purification mass spectrometry have
provided valuable insights into host-virus interactomes. However, these
approaches are limited by experimental noise and cost, resulting in
incomplete interaction maps. Computational models based on machine learning
have been developed to predict host-virus PPIs using sequence-derived
features. Although
these models have been successful, they often overlook the semantic information
embedded in protein sequences and require effective encoding schemes.
Here, we introduce DeProViR, a deep learning (DL) framework that predicts
interactions between viruses and human hosts using only primary amino acid
sequences. DeProViR employs a Siamese-like neural network architecture,
incorporating convolutional and bidirectional long short-term memory (Bi-LSTM)
networks to capture local and global contextual information. It utilizes
GloVe embedding to represent amino acid sequences, allowing for the
integration of semantic associations between residues. The proposed framework
addresses limitations of existing models, such as the need for feature
engineering and the dependence on the choice of encoding scheme.
DeProViR presents a promising approach for accurate and efficient prediction
of host-virus interactions and can contribute to the development of antiviral
therapies and understanding of infectious diseases.

# Proposed Framework
The DeProViR framework is composed of a two-step automated computational
workflow: (1) Learning sequence representation of both host and viral proteins
and (2) inferring host-viral PPIs through a hybrid deep learning architecture.
More specifically, in the first step, host and viral protein sequences are
separately encoded into sequences of tokens via a tokenizer and padded to a
fixed length of 1000 with a pad token. The embedding matrix *E*
of 100-dimension is then generated by applying the unsupervised GloVe
embedding model to a host or viral profile representation to
learn the implicit yet low-dimensional vector space based on the corpus of
tokens. Next, the embedding layer is fed with sequences of integers,
i.e., amino acid token indexes, and mapped to corresponding pre-trained
vectors in the GloVe embedding matrix *E*, which turns the tokens into a
dense real-valued 3D matrix *M*. In the subsequent step, DeProViR uses a
Siamese-like neural network architecture composed of two identical
sub-networks with the same configuration and weights. Each sub-network combines
convolutional and recurrent neural networks (a bidirectional LSTM, or
Bi-LSTM) to capture
amino acids' local and global contextual relatedness accurately.
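
As an illustration of this tokenise-and-pad step, the base-R sketch below encodes two toy sequences against the 20-residue vocabulary and pads them with a `0` token. This is not the package's internal code: the helper name, the residue ordering, and the short padding length used for display are assumptions (the framework itself pads to length 1000).

```r
# Illustrative sketch of the tokenise-and-pad step (not DeProViR's internals).
amino_acids <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

tokenize_and_pad <- function(seqs, vocab, max_len = 1000) {
  t(vapply(seqs, function(s) {
    idx <- match(strsplit(s, "")[[1]], vocab)  # map residues to indices 1..20
    length(idx) <- max_len                     # truncate or extend with NA
    idx[is.na(idx)] <- 0                       # 0 acts as the pad token
    idx
  }, numeric(max_len)))
}

# Two toy sequences, padded to length 10 for readability
tokens <- tokenize_and_pad(c("MKV", "ACDE"), amino_acids, max_len = 10)
tokens
```

Each row of `tokens` is then a fixed-length integer sequence ready to be looked up in the GloVe embedding matrix *E*.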
To achieve the best-performing DL architecture, we fine-tuned the
hyper-parameters for each block on the validation set by random search
employing auROC as the performance metric. We determined the number of epochs
through an early stopping strategy on the validation set, with a patience
threshold set to 3. The optimized DL architecture achieved an auROC of 0.96
using 5-fold cross-validation and 0.90 on the test set. This architecture
includes 32 filters (1-D kernel with size 16) in the first CNN layer to
generate a feature map from the input layer (i.e., embedding matrix *M*)
through convolution operation and non-linear transformation of its input with
the ReLU activation function. Next, the hidden features generated by the first
convolution layer are transformed to the second CNN layer with 64 filters
(1-D kernel with size seven) in the same way. After the convolutional layers,
the k-max pooling layer is added to perform max pooling, where k is set to
30. Subsequently, the flattened pooling output is fed into a
bidirectional LSTM consisting of 64 hidden neurons, which finally connects to a
fully dense layer of 8 neurons that connects to the output layer with the
sigmoid activation function to output the predicted probability score.
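
One sub-network of the architecture described above could be sketched with the keras R API roughly as follows. This is an illustrative reconstruction from the text, not the package's source: the ReLU activation on the 8-unit dense layer is an assumption, and the Siamese pairing of two weight-sharing sub-networks is collapsed here into a single sequential stack.

```r
library(keras)

# Illustrative reconstruction of one sub-network, using the stated defaults
# (32 filters/kernel 16, 64 filters/kernel 7, pool size 30, Bi-LSTM 64,
# dense 8, sigmoid output). embedding_matrix holds the pre-trained GloVe
# vectors (rows = vocabulary, columns = 100 embedding dimensions).
build_subnetwork <- function(embedding_matrix, seq_len = 1000) {
  keras_model_sequential() %>%
    layer_embedding(
      input_dim = nrow(embedding_matrix),   # vocabulary size
      output_dim = ncol(embedding_matrix),  # 100-d GloVe vectors
      weights = list(embedding_matrix),
      input_length = seq_len,
      trainable = FALSE) %>%
    layer_conv_1d(filters = 32, kernel_size = 16, activation = "relu") %>%
    layer_conv_1d(filters = 64, kernel_size = 7, activation = "relu") %>%
    layer_max_pooling_1d(pool_size = 30) %>%
    bidirectional(layer = layer_lstm(units = 64)) %>%
    layer_dense(units = 8, activation = "relu") %>%
    layer_dense(units = 1, activation = "sigmoid")
}
```

In the full Siamese setting, the host and viral inputs would each pass through the same stack (shared weights) before the joined representation reaches the sigmoid output.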

# Vignette Overview
The modular structure of this package is designed in a way that allows users
the flexibility to either utilize their own training set or
load a fine-tuned pre-trained model that was constructed previously (see the
previous
section). This dual capability empowers researchers to tailor their model
development approach to their specific needs and preferences.
In the first approach, users can use their own training data to train a model
tailored to their specific needs and subsequently apply the trained model to
make predictions on uncharted interactions. This capability proves particularly
valuable when users wish to undertake diverse tasks, such as predicting
interactions between host and bacterial pathogens, drug-target interactions,
or protein-protein interactions, etc.
Alternatively, the second approach streamlines the process by allowing users to
leverage a fine-tuned pre-trained model. This model has undergone training on a
comprehensive dataset, as detailed in the [accompanying paper](https://www.sciencedirect.com/science/article/pii/S2001037019304295),
achieving an auROC above 0.90 in both cross-validation and the external test
set.
In this scenario, users simply upload the pre-trained model and initiate
predictions without the need for additional training. This approach offers
the advantage of speed and convenience since it bypasses the time-consuming
training phase. By employing a pre-trained model, users can swiftly obtain
predictions and insights, making it a time-efficient option for their
research needs.
It's important to note that for the second approach, a random search strategy
has been employed to meticulously tune all possible hyperparameters of the
pre-trained model. This tuning process ensures the acquisition of the
best-performing model for the given training set. However, if you intend to
alter
the training input, we strongly recommend that you exercise caution and
take the time to carefully fine-tune the hyperparameters using
[tfruns](https://tensorflow.rstudio.com/reference/tfruns/) to achieve
optimal results.
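
Such a tuning run can be sketched with `tfruns` as below. The `train.R` script and the candidate flag values are hypothetical: the script would be expected to read each flag via `tfruns::flags()`, fit the model, and report a validation metric.

```r
library(tfruns)

# Hypothetical random search over a few of the hyper-parameters discussed
# above; "train.R" is a placeholder training script.
runs <- tuning_run(
  "train.R",
  sample = 0.3,  # evaluate a random 30% of the flag combinations
  flags = list(
    filters_layer1CNN = c(16, 32, 64),
    kernel_size_layer1CNN = c(8, 16),
    layer_lstm = c(32, 64, 128),
    batch_size = c(64, 128)
  )
)

# Rank runs by the tracked validation metric (the column name depends on
# what the training script reports, here assumed to be validation AUC)
head(runs[order(runs$metric_val_auc, decreasing = TRUE), ])
```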

# First Approach
The `modelTraining` function included in this package allows users to update
the training dataset. It begins by converting protein sequences into amino
acid tokens, where tokens are mapped to positive integers. Next, it represents
each amino acid token using pre-trained co-occurrence embedding vectors
acquired from GloVe. Following this, it utilizes an embedding layer to convert
a sequence of amino acid token indices into dense vectors based on the GloVe
token vectors. Finally, it leverages a Siamese-like neural network architecture
for model training, employing a k-fold cross-validation strategy. Please ensure
that the newly imported training set adheres to the format of the sample
training set stored in the **inst/extdata/trainingSet** directory of
the DeProViR package.
The `modelTraining` function takes the following parameters:
- `url_path` URL path to the GloVe embedding. Defaults to
"https://nlp.stanford.edu/data". See `gloveImport()`.
- `training_dir` Directory containing the viral-host training set.
Defaults to `inst/extdata/training_Set`. See `loadTrainingSet()`.
- `input_dim` Integer. Size of the vocabulary, i.e., the number of amino acid
tokens. Defaults to 20. See `keras`.
- `output_dim` Integer. Dimension of the dense embedding,
i.e., GloVe. Defaults to 100. See `keras`.
- `filters_layer1CNN` Integer, the dimensionality of the output space
(i.e., the number of output filters in the first convolution).
Defaults to 32. See `keras`.
- `kernel_size_layer1CNN` An integer or tuple/list of 2 integers,
specifying the height and width of the convolution window in the first
layer. Can be a single integer to specify the same value for all
spatial dimensions. Defaults to 16. See `keras`.
- `filters_layer2CNN` Integer, the dimensionality of the output space
(i.e., the number of output filters in the second convolution).
Defaults to 64. See `keras`.
- `kernel_size_layer2CNN` An integer or tuple/list of 2 integers,
specifying the height and width of the convolution window in the second
layer. Can be a single integer to specify the same value for all spatial
dimensions. Defaults to 7. See `keras`.
- `pool_size` Down-samples the input representation by taking the
maximum value over a spatial window of size `pool_size`.
Defaults to 30. See `keras`.
- `layer_lstm` Number of units in the Bi-LSTM layer. Defaults to 64.
See `keras`.
- `units` Number of units in the MLP layer. Defaults to 8. See `keras`.
- `metrics` Vector of metric names to be evaluated by the model
during training and testing. Defaults to "AUC". See `keras`.
- `cv_fold` Number of partitions for cross-validation. Defaults to 10.
- `epochs` Number of epochs to train the model. Defaults to 100.
See `keras`.
- `batch_size` Number of samples per gradient update. Defaults to 128.
See `keras`.
- `plots` If TRUE, generates a PDF file containing plots of the predictive
learning performance achieved via cross-validation. Defaults to TRUE.
See `ModelPerformance_evalPlots()`.
- `tpath` A character string indicating the path to the project
directory. If the directory is missing, the PDF file containing performance
measures will be stored in the temp directory.
See `ModelPerformance_evalPlots()`.
- `save_model_weights` If TRUE, allows users to save the trained weights.
Defaults to TRUE. See `keras`.
- `filepath` A character string indicating the path where the model weights
are saved after training. Defaults to `tempdir()`. See `keras`.
To run `modelTraining`, we can use the following commands:
```{r, message=FALSE, warning=FALSE, eval=TRUE}
options(timeout=240)
library(tensorflow)
library(data.table)
library(DeProViR)
tensorflow::set_random_seed(101)
model_training <- modelTraining(
url_path = "https://nlp.stanford.edu/data",
training_dir = system.file("extdata", "training_Set",
package = "DeProViR"),
input_dim = 20,
output_dim = 100,
filters_layer1CNN = 32,
kernel_size_layer1CNN = 16,
filters_layer2CNN = 64,
kernel_size_layer2CNN = 7,
pool_size = 30,
layer_lstm = 64,
units = 8,
metrics = "AUC",
cv_fold = 2,
epochs = 5, # for the sake of this example
batch_size = 128,
plots = FALSE,
tpath = tempdir(),
save_model_weights = FALSE,
filepath = tempdir())
```
When the `plots` argument is set to TRUE, the `modelTraining` function
generates one PDF file containing three figures that summarize the
performance of the DL model under k-fold cross-validation.

# Second Approach

In this context, users have the option to employ the `loadPreTrainedModel`
function to load the fine-tuned pre-trained model for predictive purposes.

```{r message=FALSE, warning=FALSE}
options(timeout = 240)
library(tensorflow)
library(data.table)
library(DeProViR)
pre_trainedmodel <- loadPreTrainedModel()
```

# Viral-Host Interaction Prediction

The models that have undergone training can subsequently be leveraged to
generate predictions on unlabeled data, specifically on interactions that
are yet to be identified. This can be achieved by executing the following
commands:

```{r}
# load the demo test set (unknown interactions)
testing_set <- fread(
    system.file("extdata", "test_Set",
                "test_set_unknownInteraction.csv",
                package = "DeProViR"))
scoredPPIs <- predInteractions(
    url_path = "https://nlp.stanford.edu/data",
    testing_set,
    trainedModel = pre_trainedmodel)
scoredPPIs
```

```{r warning=FALSE, message=FALSE, eval=TRUE}
# or using the newly trained model
predInteractions(url_path = "https://nlp.stanford.edu/data",
                 testing_set,
                 trainedModel = model_training)
```

# Session information

```{r, eval=TRUE}
sessionInfo()
```