To install the pathwrap package, start R (version "4.3") and enter:
if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("pathwrap")
Depending on the species you want to analyse, install the corresponding Bioconductor annotation and genome packages. For example, for human data:
library(pathwrap)
data(anntpkglist, package = "pathwrap")
# For mouse data, replace "Homo sapiens" with the scientific name "Mus musculus"
genomepkg <- anntpkglist$genome[which(anntpkglist$species == "Homo sapiens")]
anntpkg <- anntpkglist$annotate[which(anntpkglist$species == "Homo sapiens")]
#BiocManager::install(genomepkg)
#BiocManager::install(anntpkg)
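The two install calls above are commented out so that nothing is installed by accident. A minimal sketch of installing only what is missing might look like the following. The helper name `missing_packages` is hypothetical and not part of pathwrap, and the actual install call is left commented out.

```r
# Hypothetical helper (not part of pathwrap): return the subset of `pkgs`
# whose namespace cannot be loaded from the current library paths.
missing_packages <- function(pkgs) {
    installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
    pkgs[!installed]
}

# "stats" ships with R, so it is never reported as missing.
missing_packages("stats")

# With the objects created above, the call would look like this:
# to_install <- missing_packages(as.character(c(genomepkg, anntpkg)))
# if (length(to_install) > 0) BiocManager::install(to_install)
```

Checking first avoids reinstalling large genome packages that are already present.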
To view this documentation, run the following command in R:
browseVignettes("pathwrap")
Pathwrap requires an argument giving the path to a phenofile, which lists the path of each raw file and the class each sample belongs to. A phenofile can be created as follows:
# create a phenofile
phenofile <- tempfile("hellotmpphenofile", fileext = ".txt")
# create columns for phenofile, this is for SE data
# col.names should be SampleName, FileName and Class for SE data
library(stringr)
FileName <- list.files(
    file.path(system.file(package = "pathwrap"), "extdata"),
    full.names = TRUE,
    pattern = "fastq.gz"
)
# sample names are the file names without the "_sub.fastq.gz" suffix
SampleName <- str_remove(basename(FileName), "_sub\\.fastq\\.gz$")
Class <- c("A", "B", "A", "B")
# write the columns in the phenofile
write.table(as.data.frame(cbind(SampleName, FileName, Class)),
file = phenofile, sep = "\t", row.names = FALSE,
col.names = TRUE, quote = FALSE
)
The phenofile should be a tab-delimited file with information about the path of the raw files and the class to which each sample belongs. The phenofile should have four columns for paired-end data and three columns for single-end data, as follows:
For single-end reads:
SampleName  FileName                                                                                                         Class
sample1     /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library/pathwrap/extdata/sample1_sub.fastq.gz  A
sample3     /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library/pathwrap/extdata/sample3_sub.fastq.gz  B
sample5     /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library/pathwrap/extdata/sample5_sub.fastq.gz  A
sample6     /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library/pathwrap/extdata/sample6_sub.fastq.gz  B
For paired-end reads:
SampleName  FileName1                    FileName2                    Class
sample1     full_path_to_read_1_sample1  full_path_to_read_2_sample1  A
sample3     full_path_to_read_1_sample3  full_path_to_read_2_sample3  B
sample5     full_path_to_read_1_sample5  full_path_to_read_2_sample5  A
sample6     full_path_to_read_1_sample6  full_path_to_read_2_sample6  B
For a paired experiment design:
SampleName  FileName                            Class
sample1     ~/fullpathto/sample1_sub.fastq.gz  A
sample3     ~/fullpathto/sample3_sub.fastq.gz  B
sample1     ~/fullpathto/sample4_sub.fastq.gz  A
sample3     ~/fullpathto/sample2_sub.fastq.gz  B
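Because the column layout differs between single-end and paired-end data, it can help to check a phenofile before starting a long run. The sketch below is one possible check in base R. The function `check_phenofile` is hypothetical and not part of pathwrap, and the toy file paths are placeholders.

```r
# Hypothetical validator (not part of pathwrap): read a tab-delimited
# phenofile and report whether it describes single-end or paired-end data.
check_phenofile <- function(path) {
    pheno <- read.delim(path, header = TRUE, stringsAsFactors = FALSE)
    se_cols <- c("SampleName", "FileName", "Class")
    pe_cols <- c("SampleName", "FileName1", "FileName2", "Class")
    if (identical(colnames(pheno), se_cols)) {
        layout <- "SE"
        files <- pheno$FileName
    } else if (identical(colnames(pheno), pe_cols)) {
        layout <- "PE"
        files <- c(pheno$FileName1, pheno$FileName2)
    } else {
        stop(
            "Unexpected phenofile columns: ",
            paste(colnames(pheno), collapse = ", ")
        )
    }
    # Warn, rather than fail, about listed raw files that do not exist.
    absent <- files[!file.exists(files)]
    if (length(absent) > 0) {
        warning(length(absent), " listed file(s) not found")
    }
    layout
}

# Toy single-end phenofile with placeholder paths.
toy <- tempfile(fileext = ".txt")
write.table(
    data.frame(
        SampleName = c("sample1", "sample3"),
        FileName = c("sample1_sub.fastq.gz", "sample3_sub.fastq.gz"),
        Class = c("A", "B")
    ),
    file = toy, sep = "\t", row.names = FALSE, quote = FALSE
)
layout <- suppressWarnings(check_phenofile(toy))
layout
```

The check is deliberately strict about column names and order, since those are what distinguish the two layouts described above.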
To run the wrapper, you can create a results folder and pass it as the outdir argument. If outdir is not given, a results directory is created in the current working directory of R.
Also, when ref.dir = NA, the wrapper uses the reference annotation and genome packages
provided by Bioconductor and installed on the system. To use your own reference files,
provide the path to the directory where they are stored.
NOTES:
Please make sure that no path has a trailing "/".
For example:
ref = "/Reference/mouse/" # Wrong
ref = "/Reference/mouse" # Right
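If paths come from user input or shell tab completion, a trailing "/" is easy to introduce. A small helper can normalize a path before it is passed on. The function name `strip_trailing_slash` is hypothetical and not part of pathwrap.

```r
# Hypothetical helper (not part of pathwrap): drop trailing "/" characters
# but leave a bare root path ("/") untouched.
strip_trailing_slash <- function(path) {
    sub("(.)/+$", "\\1", path)
}

ref <- strip_trailing_slash("/Reference/mouse/")
ref
# "/Reference/mouse"
```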
To run the wrapper:
# create directory to store results
Results <- tempdir()
csamp <- c(1, 2)
cref <- c(3, 4)
cdatapath <- file.path(
    system.file(package = "pathwrap"), "extdata",
    "example_cpd_data.tsv"
)
if (interactive()) {
    pathwrap(
        ref.dir = NA, phenofile = phenofile,
        outdir = Results, entity = "Mus musculus", corenum = 16,
        compare = "as.group", keep_tmp = TRUE, startover = TRUE,
        diff.tool = "DESeq2", aligner = "Rhisat2", mode = "combined",
        cdatapath = cdatapath, cref = cref, csamp = csamp, ccompare = "paired"
    )
}
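The call above requests corenum = 16. Assuming corenum is the number of CPU cores the wrapper may use (an assumption based on the argument name, not confirmed here), a value suited to the current machine could be derived as in this sketch.

```r
# Assumption: `corenum` is the number of CPU cores pathwrap may use.
available <- parallel::detectCores()
# detectCores() can return NA; fall back to a single core in that case,
# otherwise leave one core free for the rest of the system.
corenum <- if (is.na(available)) 1L else max(1L, available - 1L)
corenum
```

The resulting value can then be passed as corenum = corenum instead of a hard-coded number.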
Pathwrap is a wrapper function/package that lets users analyse RNA-Seq data from raw files through to pathway visualization. Using a single wrapper function, it performs quality control of the raw files, adapter and quality trimming, genome index building and alignment, gene counting, differential gene expression analysis, gene set enrichment testing with GAGE, and visualization of the enriched pathways with Pathview. It can resume an analysis that was halted at any stage, and it produces quality figures along with a comprehensive analysis of the data.
With one wrapper function, it runs all the steps listed below.
STEP 1: Quality control
STEP 1a: Running fastqc
The wrapper runs the FastQC analysis from R using fastqcr. If FastQC is not available on the system, the function can download the FastQC tool before running the quality check. The results are standard HTML files in which the quality of each FASTQ file can be examined.
STEP 1b: Running fastp
After running FastQC, the wrapper runs fastp. The function takes the sample names and, for each sample, performs quality and adapter trimming for Illumina and long-read sequencing data. It works for both PE and SE data. An HTML file is generated for each FASTQ file, with information and figures on quality before and after trimming.
STEP 2: Making the TxDb object
The wrapper then makes a TxDb object, either from an annotation file or by loading it from the annotation package. Make_txdbobj uses makeTxDbFromGFF to make the TxDb object from the transcript annotations in the GTF file in ref.dir; if ref.dir is NA, it makes the TxDb object from the annotation package.
STEP 3: Alignment and counting
After the TxDb object is created, the wrapper runs Rhisat2 or Rbowtie for alignment in paired-end or single-end mode, depending on the data. It saves the alignment object in an RDS file, which can be loaded in R for further analysis. If the reference index is not found in the reference directory, the wrapper creates it before running the alignment. If the reference genome is a package, the reference index is created as an R package. The wrapper also generates a bar plot of mapped and unmapped sequence reads.
STEP 4: Counting aligned sequences
After aligning the reads to the reference genome, the wrapper counts the reads per gene and stores the counts in a table with genes in rows and per-sample counts in columns. The table is stored as an RDS file that can be loaded into R for later analysis. If the gene IDs are Ensembl IDs, the wrapper converts them to Entrez IDs to match the gene names used for gene set analysis.
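As a reminder of how an RDS file round-trips through R, the sketch below saves and restores a toy count table. The values, gene names and file name are made up for illustration; in practice you would read the RDS file written by your own run.

```r
# Toy count table: genes in rows, samples in columns (values are made up).
counts <- matrix(
    c(10L, 0L, 25L, 12L, 3L, 30L),
    nrow = 3,
    dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample3"))
)
# The file name here is hypothetical; use the RDS file written by your run.
rds_file <- tempfile(fileext = ".RDS")
saveRDS(counts, rds_file)

# Later, or in another session, restore the table unchanged.
restored <- readRDS(rds_file)
identical(restored, counts)
```

Unlike a text export, an RDS file preserves the object's type, dimensions and dimnames.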
STEP 5a: Running differential gene analysis using DESeq2
The wrapper then runs a standard DESeq2 differential gene expression analysis and draws a volcano plot. The function run_deseq2 takes the counts, the list indicating reference and sample columns, and the directory where the results are stored, and performs the DESeq2 analysis. The output is a results table with genes and log2FoldChange from the DESeq2 analysis, plus a volcano plot.
STEP 5b: Running differential gene analysis using edgeR
Alternatively, the wrapper runs a standard edgeR differential gene expression analysis and draws a volcano plot. The corresponding function takes the counts, the list indicating reference and sample columns, and the directory where the results are stored, and performs the edgeR analysis. The output is a results table with genes and log2 fold change from the edgeR analysis, plus a volcano plot.
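To make the volcano plot concrete, here is a minimal base R sketch on a toy results table. It is not pathwrap's plotting code. The column names follow DESeq2 conventions, the values are made up, and the thresholds are illustrative rather than pathwrap defaults.

```r
# Toy results table standing in for differential expression output;
# the column names follow DESeq2 conventions and the values are made up.
res <- data.frame(
    gene = paste0("gene", 1:6),
    log2FoldChange = c(-3.1, -0.4, 0.2, 1.8, 2.6, 0.1),
    padj = c(0.001, 0.60, 0.90, 0.04, 0.0005, 0.30)
)
# Illustrative thresholds, not pathwrap defaults.
significant <- res$padj < 0.05 & abs(res$log2FoldChange) > 1

# Draw the plot into a temporary PDF file.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
plot(
    res$log2FoldChange, -log10(res$padj),
    col = ifelse(significant, "red", "grey40"), pch = 19,
    xlab = "log2 fold change", ylab = "-log10 adjusted p-value",
    main = "Volcano plot (toy data)"
)
abline(h = -log10(0.05), v = c(-1, 1), lty = 2)
invisible(dev.off())
```

Genes far from zero on the x-axis and high on the y-axis are both strongly changed and statistically supported.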
STEP 6: Running pathway analysis using GAGE
After the differential gene analysis, the wrapper runs GAGE (generally applicable gene set enrichment for pathway analysis) using the user-supplied comparison method for the specified species. The biological process, cellular component and molecular function GO terms are analysed separately. Likewise, KEGG disease pathways and KEGG signalling and metabolism pathways are analysed separately. Compound sets can also be analysed when the path to compound data with KEGG compound IDs is provided. A combined analysis of gene-based and compound-based enrichment is run when mode = "combined".
STEP 7: Visualizing the pathway using Pathview
Finally, the top enriched pathways with q.val < 0.1 are visualized using Pathview.
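The q.val cutoff can be illustrated on a toy enrichment table. This sketch only shows the filtering idea in base R and is not pathwrap's own code. The column names mimic GAGE output, and the pathway names and values are made up.

```r
# Toy enrichment table; the column names mimic GAGE output and the
# pathway names and values are made up for illustration.
enrich <- data.frame(
    pathway = c("path1", "path2", "path3", "path4"),
    stat.mean = c(4.2, 3.1, 1.0, 0.3),
    q.val = c(0.002, 0.08, 0.25, NA)
)
# Keep pathways below the 0.1 cutoff; rows with a missing q.val are dropped.
keep <- !is.na(enrich$q.val) & enrich$q.val < 0.1
top_pathways <- enrich$pathway[keep]
top_pathways
# "path1" "path2"
```

Handling NA explicitly matters because a bare comparison with NA would otherwise propagate NA into the selection.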
There is a check at each step, so if the analysis fails at any stage for any reason, the results of the previous steps are saved and a new run can start from the halted step.
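The general idea of resuming from saved results can be sketched as follows. This is only an illustration of a checkpoint pattern and not pathwrap's internal implementation. The helper `run_step` and the step name "counting" are hypothetical.

```r
# Hypothetical helper illustrating the checkpoint idea; this is not
# pathwrap's internal code.
run_step <- function(name, outdir, fun) {
    checkpoint <- file.path(outdir, paste0(name, ".RDS"))
    if (file.exists(checkpoint)) {
        message("Skipping '", name, "': loading saved result")
        return(readRDS(checkpoint))
    }
    result <- fun()
    saveRDS(result, checkpoint)
    result
}

outdir <- tempfile("checkpoints")
dir.create(outdir)

# A stand-in for an expensive step that records how often it really runs.
calls <- 0
slow_step <- function() {
    calls <<- calls + 1
    sum(1:10)
}

first <- run_step("counting", outdir, slow_step)   # computes and saves
second <- run_step("counting", outdir, slow_step)  # loads from disk
```

On the second call the saved result is reused, so the expensive function runs only once.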
The wrapper would not have been successful without the help of Steven Blanchard, Kevin Lambirth, Cory Brouwer and Richard Allen White.
Luo W, Brouwer C. Pathview: an R/Bioconductor package for pathway-based data integration and visualization. Bioinformatics. 2013;29(14):1830-1831. doi:10.1093/bioinformatics/btt285
Luo W, Friedman MS, Shedden K, et al. GAGE: generally applicable gene set enrichment for pathway analysis. BMC Bioinformatics. 2009;10:161. doi:10.1186/1471-2105-10-161
Luo W, Pant G, Bhavnasi YK, Blanchard SG Jr, Brouwer C. Pathview Web: user friendly pathway visualization and data integration. Nucleic Acids Res. 2017;45(W1):W501-W508. doi:10.1093/nar/gkx372. PMID: 28482075; PMCID: PMC5570256
This project has been funded by NSF grant 1565030.
sessionInfo()
#> R Under development (unstable) (2024-03-18 r86148)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] stringr_1.5.1
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_4.4.0 magrittr_2.0.3 cli_3.6.2 tools_4.4.0
#> [5] glue_1.7.0 vctrs_0.6.5 stringi_1.8.3 knitr_1.45
#> [9] xfun_0.43 lifecycle_1.0.4 rlang_1.1.3 evaluate_0.23