Tutorial for RNASeq analysis using pathwrap package

Eliza Dhungel

Installation

To install the pathwrap package, start R (version "4.3") and enter:

if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("pathwrap")

Depending on the species you want to analyse, you can install the corresponding Bioconductor annotation and genome packages. For example, for human:

library(pathwrap)
data(anntpkglist, package = "pathwrap")
# For mouse data, replace "Homo sapiens" with the scientific name "Mus musculus"
genomepkg <- anntpkglist$genome[which(anntpkglist$species == "Homo sapiens")]
anntpkg <- anntpkglist$annotate[which(anntpkglist$species == "Homo sapiens")]
#BiocManager::install(genomepkg)
#BiocManager::install(anntpkg)
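Rather than uncommenting the install lines by hand, one option is to install only the packages that are not yet present. The sketch below uses base R only; `missing_packages` is a hypothetical helper name, not part of pathwrap, and the install call is left commented out as above.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
    installed <- vapply(
        pkgs,
        function(p) requireNamespace(p, quietly = TRUE),
        logical(1)
    )
    pkgs[!installed]
}

# "stats" ships with R, so nothing is reported as missing here.
missing_packages("stats")

# With the objects defined above, the call could look like this:
# to_install <- missing_packages(c(genomepkg, anntpkg))
# if (length(to_install) > 0) BiocManager::install(to_install)
```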

Documentation

To view this documentation, run the following command in R:

browseVignettes("pathwrap")

Phenofile structure

Pathwrap requires an argument giving the path to a phenofile, which lists the path to each raw file and the class each sample belongs to. A phenofile can be created as follows.

# create a phenofile
phenofile <- tempfile("hellotmpphenofile.txt")
# create columns for the phenofile; this example is for SE data
# column names should be SampleName, FileName and Class for SE data
library(stringr)
# full paths of the example fastq files shipped with pathwrap
FileName <- list.files(
    file.path(system.file(package = "pathwrap"), "extdata"),
    full.names = TRUE,
    pattern = "fastq.gz"
)
# sample names are the file names without the directory and the
# "_sub.fastq.gz" suffix
SampleName <- str_replace_all(
    string = basename(FileName),
    pattern = "_sub\\.fastq\\.gz$",
    replacement = ""
)
Class <- c("A", "B", "A", "B")
# write the columns to the phenofile as a tab-delimited table
write.table(data.frame(SampleName, FileName, Class),
    file = phenofile, sep = "\t", row.names = FALSE,
    col.names = TRUE, quote = FALSE
)

The phenofile should be a tab-delimited file listing the path to the raw files and the class to which each sample belongs. It should have four columns for paired-end data and three columns for single-end data, as follows:

For single-end reads:

SampleName	FileName	Class
sample1	/Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library/pathwrap/extdata/sample1_sub.fastq.gz	A
sample3	/Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library/pathwrap/extdata/sample3_sub.fastq.gz	B
sample5	/Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library/pathwrap/extdata/sample5_sub.fastq.gz	A
sample6	/Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library/pathwrap/extdata/sample6_sub.fastq.gz	B

For paired-end reads:

SampleName	FileName1	FileName2	Class
sample1	full_path_to_read_1_sample1	full_path_to_read_2_sample1	A
sample3	full_path_to_read_1_sample3	full_path_to_read_2_sample3	B
sample5	full_path_to_read_1_sample5	full_path_to_read_2_sample5	A
sample6	full_path_to_read_1_sample6	full_path_to_read_2_sample6	B

For paired experiment design:

SampleName	FileName	Class
sample1	~/fullpathto/sample1_sub.fastq.gz	A
sample3	~/fullpathto/sample3_sub.fastq.gz	B
sample1	~/fullpathto/sample4_sub.fastq.gz	A
sample3	~/fullpathto/sample2_sub.fastq.gz	B
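Because the column layout differs between single-end and paired-end data, it can help to check a phenofile before starting a long run. The following is a minimal sketch in base R; `check_phenofile` is a hypothetical helper, not a pathwrap function, and it only compares the header against the column names listed above.

```r
# Hypothetical helper: read a phenofile and check its column layout.
# Returns "SE" or "PE", or stops with an error for any other layout.
check_phenofile <- function(path) {
    pheno <- read.table(path, header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE)
    se_cols <- c("SampleName", "FileName", "Class")
    pe_cols <- c("SampleName", "FileName1", "FileName2", "Class")
    if (identical(colnames(pheno), se_cols)) {
        "SE"
    } else if (identical(colnames(pheno), pe_cols)) {
        "PE"
    } else {
        stop("Unexpected phenofile columns: ",
             paste(colnames(pheno), collapse = ", "))
    }
}

# Toy single-end phenofile with placeholder paths
toy <- tempfile(fileext = ".txt")
write.table(
    data.frame(
        SampleName = c("sample1", "sample3"),
        FileName = c("/path/to/sample1_sub.fastq.gz",
                     "/path/to/sample3_sub.fastq.gz"),
        Class = c("A", "B")
    ),
    file = toy, sep = "\t", row.names = FALSE, quote = FALSE
)
check_phenofile(toy)
```

The check is deliberately strict about column order, since the tables above always list the columns in the same order.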

Demo run

To run the analysis, you can create a results folder and pass it to the function as the outdir argument. If outdir is not given, a results directory is created in the current working directory of R.

Also, when ref.dir = NA, the wrapper uses the reference annotation and genome packages provided by Bioconductor and installed on the system. To supply your own reference files, provide the path to the directory where they are stored.

NOTE: Make sure that no path ends with a trailing "/". For example:

ref.dir = "/Reference/mouse/" # Wrong
ref.dir = "/Reference/mouse" # Right
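The trailing-slash rule can also be enforced in code before the path is handed to the wrapper. This is a small base R sketch; `strip_trailing_slash` is a hypothetical helper name introduced here for illustration.

```r
# Hypothetical helper: drop trailing "/" characters from a directory path.
# NA is passed through unchanged so that ref.dir = NA keeps its meaning.
strip_trailing_slash <- function(path) {
    if (is.na(path)) {
        return(path)
    }
    sub("/+$", "", path)
}

strip_trailing_slash("/Reference/mouse/")
strip_trailing_slash("/Reference/mouse")
```

Both calls give the same path without the trailing slash. Note that this sketch would turn the root directory "/" into an empty string, so it is only suitable for ordinary directory paths.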

To run the wrapper:

# use a temporary directory to store results
Results <- tempdir()
csamp <- c(1, 2)
cref <- c(3, 4)
cdatapath <- file.path(
    system.file(package = "pathwrap"), "extdata",
    "example_cpd_data.tsv"
)
if (interactive()) {
    pathwrap(
        ref.dir = NA, phenofile = phenofile,
        outdir = Results, entity = "Mus musculus", corenum = 16,
        compare = "as.group", keep_tmp = TRUE, startover = TRUE,
        diff.tool = "DESeq2", aligner = "Rhisat2", mode = "combined",
        cdatapath = cdatapath, cref = cref, csamp = csamp, ccompare = "paired"
    )
}

Overview

Pathwrap is a wrapper function/package that lets users analyse RNA-Seq data from raw files through to pathway visualization. Using a single wrapper function, it performs quality control of the raw files, adapter and quality trimming, genome index building and alignment, gene counting, differential gene expression analysis, gene set enrichment testing with GAGE, and visualization of the enriched pathways with Pathview. It can resume the analysis if it is halted at any stage, and it generates quality figures and a comprehensive analysis of the data.

With one wrapper function, it runs all the steps listed below.

STEP 1 : Quality control

STEP 1a: Running fastqc

It runs the FastQC analysis in R using fastqcr. If FastQC is not available on the system, the function can download the FastQC tool before running the quality check. The results are standard HTML files in which the quality of each fastq file can be examined.

STEP 1b : Running fastp

After running FastQC, it runs fastp. The function takes the sample names and, for each sample, performs quality and adapter trimming for Illumina and long-read sequencing data. It works for both PE and SE data. An HTML file is generated for each fastq file, with information and figures on read quality before and after trimming.

STEP 2: Making the TxDb object

The wrapper then makes a TxDb object, either from an annotation file or by loading it from the annotation package. Make_txdbobj uses makeTxDbFromGFF to build the TxDb object from the transcript annotations in the GTF file in ref.dir; if ref.dir is NA, it makes the TxDb object from the annotation package.

STEP 3 : Alignment and counting

After the TxDb object is made, the wrapper runs Rhisat2 or Rbowtie for alignment in paired-end or single-end mode, depending on the data. It saves the alignment object in an RDS file that can be loaded into R for further analysis. If no reference index is found in the reference directory, it creates one before running the alignment. If the reference genome is a package, the reference index is created as an R package. It also generates a barplot of mapped and unmapped sequence reads.

STEP 4: Counting aligned sequences

After aligning the reads to the reference genome, the wrapper counts the reads per gene and stores them in a table with genes in rows and counts in columns. The table is stored as an RDS file that can be loaded into R for later analysis. If the gene IDs are Ensembl IDs, the wrapper converts them to Entrez IDs to match the gene names used for gene set analysis.
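To make the shape of the count table and the ID conversion concrete, here is a small base R sketch. The counts are made up, one count column per sample is an assumption of this sketch, and the three-gene lookup vector stands in for the annotation package that a real run would use; none of the object names come from pathwrap.

```r
# Toy count table: genes in rows, one count column per sample
# (counts are made up for illustration).
counts <- matrix(
    c(10L, 0L, 25L,
      12L, 3L, 30L),
    nrow = 3,
    dimnames = list(
        c("ENSG00000141510", "ENSG00000012048", "ENSG00000139618"),
        c("sample1", "sample3")
    )
)

# Small lookup from Ensembl gene IDs to Entrez gene IDs
# (TP53, BRCA1 and BRCA2); a real run would use the annotation package.
ens2entrez <- c(
    ENSG00000141510 = "7157",
    ENSG00000012048 = "672",
    ENSG00000139618 = "675"
)

# Replace row names, dropping genes without an Entrez match.
entrez <- ens2entrez[rownames(counts)]
counts_entrez <- counts[!is.na(entrez), , drop = FALSE]
rownames(counts_entrez) <- unname(entrez[!is.na(entrez)])

# Save and reload as RDS to show the round trip.
rds <- tempfile(fileext = ".rds")
saveRDS(counts_entrez, rds)
identical(readRDS(rds), counts_entrez)
```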

STEP 5a: Running differential gene analysis using DESeq2

The wrapper then runs standard DESeq2 for differential gene expression analysis and draws a volcano plot. The function run_deseq2 takes the counts, the list indicating reference and sample columns, and the directory where the results are stored, and performs the DESeq2 analysis. The output is a result table with genes and log2FoldChange from the DESeq2 results, and a volcano plot.
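As an illustration of how a volcano plot is derived from such a result table, the sketch below uses base R on a toy table. The column names log2FoldChange and padj follow DESeq2's result columns, but the values are made up and the cutoffs are assumptions of this sketch, not pathwrap defaults.

```r
# Toy result table with DESeq2-style column names (values are made up).
res <- data.frame(
    gene = c("g1", "g2", "g3", "g4"),
    log2FoldChange = c(2.5, -0.2, -3.1, 0.8),
    padj = c(0.001, 0.6, 0.0004, 0.04),
    stringsAsFactors = FALSE
)

# Assumed thresholds for this sketch, not taken from pathwrap.
lfc_cut <- 1
padj_cut <- 0.05

# Label each gene by direction and significance.
res$status <- "not significant"
res$status[res$padj < padj_cut & res$log2FoldChange >= lfc_cut] <- "up"
res$status[res$padj < padj_cut & res$log2FoldChange <= -lfc_cut] <- "down"

# Volcano plot: effect size on x, significance on y.
status_col <- c(down = "blue", `not significant` = "grey", up = "red")
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
plot(
    res$log2FoldChange, -log10(res$padj),
    col = status_col[res$status], pch = 19,
    xlab = "log2 fold change", ylab = "-log10(adjusted p-value)"
)
abline(v = c(-lfc_cut, lfc_cut), h = -log10(padj_cut), lty = 2)
dev.off()
```

Gene g4 passes the adjusted p-value cutoff but not the fold-change cutoff, so it stays in the "not significant" group.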

STEP 5b: Running differential gene analysis using edgeR

Alternatively, the wrapper runs standard edgeR for differential gene expression analysis and draws a volcano plot. As with DESeq2, the edgeR step takes the counts, the list indicating reference and sample columns, and the directory where the results are stored, and performs the edgeR analysis. The output is a result table with genes and log2 fold change from the edgeR results, and a volcano plot.

STEP 6 : Running pathway analysis using GAGE

After the differential gene analysis, the wrapper runs GAGE (generally applicable gene set enrichment) for pathway analysis, based on the user-supplied comparison method and the specified species. The biological process, cellular component and molecular function analyses for GO terms are done separately. KEGG disease pathways and KEGG signalling and metabolism pathways are also analysed separately. Compound sets can be analysed as well when the path to compound data with KEGG compound IDs is provided. Combined analysis of gene-based and compound-based enrichment is run when mode = "combined".

STEP 7: Visualizing the pathway using Pathview

Finally, the top enriched pathways with "q.val" < 0.1 are visualized using Pathview.
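The selection rule amounts to a simple filter on the enrichment table. The sketch below applies it to a toy table in base R; the q.val column name follows the text above, while the table itself and its values are made up for illustration.

```r
# Toy enrichment table with a q.val column (values are made up).
enrich <- data.frame(
    pathway = c("mmu04110", "mmu04010", "mmu00010"),
    q.val = c(0.002, 0.25, 0.08),
    stringsAsFactors = FALSE
)

# Keep pathways below the cutoff, most significant first.
selected <- enrich[!is.na(enrich$q.val) & enrich$q.val < 0.1, ]
selected <- selected[order(selected$q.val), ]
selected$pathway
```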

There is a check at each step, so that if the analysis fails at any stage for any reason, the results of the previous steps are saved and a new run can start from the halted step.
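One common way to get this kind of resumable behaviour is to save each step's result to disk and skip the step when the saved file already exists. The sketch below shows that pattern in base R; it is an assumption about how such a check could be written, not pathwrap's actual implementation, and `run_step` is a hypothetical helper.

```r
# Hypothetical helper: run `fun` once and cache its result as an RDS file.
# On a later run the cached result is loaded instead of recomputed.
run_step <- function(name, outdir, fun) {
    checkpoint <- file.path(outdir, paste0(name, ".rds"))
    if (file.exists(checkpoint)) {
        message("Skipping ", name, ": loading saved result")
        return(readRDS(checkpoint))
    }
    result <- fun()
    saveRDS(result, checkpoint)
    result
}

outdir <- tempfile("results")
dir.create(outdir)

# Stand-in for an expensive step; `calls` counts how often it really runs.
calls <- 0
slow_step <- function() {
    calls <<- calls + 1
    sum(1:10)
}

first <- run_step("counting", outdir, slow_step)   # computes and saves
second <- run_step("counting", outdir, slow_step)  # loads from disk
```

Deleting a step's RDS file forces that step to run again on the next call.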

Acknowledgements

The wrapper would not have been successful without the help of Steven Blanchard, Kevin Lambirth, Cory Brouwer and Richard Allen White.

Funding

This project has been funded by NSF grant 1565030.

Session info

sessionInfo()
#> R Under development (unstable) (2024-03-18 r86148)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] stringr_1.5.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_4.4.0  magrittr_2.0.3  cli_3.6.2       tools_4.4.0    
#>  [5] glue_1.7.0      vctrs_0.6.5     stringi_1.8.3   knitr_1.45     
#>  [9] xfun_0.43       lifecycle_1.0.4 rlang_1.1.3     evaluate_0.23