--- title: "Analyzing cell-free DNA methylation data with cfTools" author: - name: Ran Hu affiliation: - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California at Los Angeles - Institute for Quantitative and Computational Biosciences, University of California at Los Angeles - Bioinformatics Interdepartmental Graduate Program, University of California at Los Angeles email: huran@ucla.edu - name: Mary Louisa Stackpole affiliation: - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California at Los Angeles - name: Shuo Li affiliation: - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California at Los Angeles - name: Xianghong Jasmine Zhou affiliation: - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California at Los Angeles - Institute for Quantitative and Computational Biosciences, University of California at Los Angeles - name: Wenyuan Li affiliation: - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California at Los Angeles package: cfTools output: BiocStyle::html_document: toc: true toc_float: true theme: flatly highlight: pygments BiocStyle::pdf_document: default abstract: > Cell-free DNA (cfDNA) in blood has emerged as an ideal surrogate for tumor biopsy. It can be obtained noninvasively, and provides a comprehensive landscape of the heterogeneous genetic and epigenetic alterations in tumors. The `r Biocpkg("cfTools")` package provides methods for cfDNA methylation data analysis to facilitate cfDNA-based cancer studies, including (1) cancer detection: sensitively detect tumor-derived cfDNA to determine whether plasma samples are from cancer patients or normal individuals; (2) tissue deconvolution: infer the tissue composition of plasma samples. vignette: > %\VignetteIndexEntry{cfTools-vignette} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction Given the methylation sequencing data of a cell-free DNA (cfDNA) sample, for each cancer marker or tissue marker, we deconvolve the tumor-derived or tissue-specific reads from all reads falling in the marker region. Our read-based deconvolution algorithm exploits the pervasiveness of DNA methylation for signal enhancement, therefore can sensitively identify a trace amount of tumor-specific or tissue-specific cfDNA in plasma. Specifically, `r Biocpkg("cfTools")` can deconvolve different sources of cfDNA fragments (or reads) in two contexts: 1. Cancer detection [1]: separate cfDNA fragments into tumor-derived fragments and background normal fragments (2 classes), and estimate the tumor-derived cfDNA fraction $\theta$ ($0\leq \theta < 1$). 2. Tissue deconvolution [2,3]: separate cfDNA fragments from different tissues (> 2 classes), and estimate the cfDNA fraction $\theta_t$ ($0\leq \theta_t < 1$) of different tissue types (including an unknown type) $t$ for a plasma cfDNA sample. These functions can serve as foundations for more advanced cfDNA-based studies, including cancer diagnosis and disease monitoring. # Installation `r Biocpkg("cfTools")` is an `R` package available via the [Bioconductor](http://bioconductor.org) repository for packages. You can install the release version by using the following commands in your `R` session: ```{r "install", eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("cfTools") ``` Alternatively, you can install the development version from [GitHub](https://github.com/) : ```{r 'install_dev', eval = FALSE} BiocManager::install("jasminezhoulab/cfTools") ``` # Input data preparation The two main input files for `CancerDetector()` and `cfDeconvolve()` are * **Input 1**: fragment-level methylation states (methState), which are represented by a sequence of binary values (0 represents unmethylated CpG and 1 represents methylated CpG on the same fragment); * **Input 2**: methylation pattern (paired shape parameters of beta distributions) of markers. `cfSort()` mainly takes **Input 1** as the only input file. Section 3.1, 3.2, 3.3 provide an example for generating **Input 1**. We require users to provide pre-processed paired-end bisulfite sequencing reads (i.e., aligned to the reference genome). For each cfDNA sample, users need to prepare (1) the standard (sorted) BED file of the aligned reads and (2) the methylation information that *bismark* extracts from the aligned reads as input data files. Specifically, * Section 3.1 `MergePEReads()` * Input: a standard (sorted) BED file of paired-end sequencing reads * Output: fragment-level information of cfDNA reads * Section 3.2 `MergeCpGs()` * Input: a `CpG_OT*` file and a `CpG_OB*` file generated by *bismark methylation extractor* * Output: methylation information of all CpGs on the same fragment * Section 3.3 `GenerateFragMeth()` * Input: the output lists of `MergePEReads()` and `MergeCpGs()` * Output: fragment-level information about methylation states Section 3.4 provides an example for generating **Input 2**. Specifically, * Section 3.4 `GenerateMarkerParam()` * Input: a list of methylation levels (e.g., beta values) for markers, a vector of sample types (e.g., tumor or normal, tissue types) corresponding to the rows of the list, a vector of marker names corresponding to the columns of the list * Output: a list containing the paired shape parameters of beta distributions for markers Example input data files are included within the package: ```{r demo} library(cfTools) library(utils) demo.dir <- system.file("data", package="cfTools") ``` ## Merge paired-end sequencing reads to fragment-level Function `MergePEReads()` generates fragment-level information for paired-end sequencing reads. The main input file is a standard (sorted) BED file (e.g. output of *bedtools bamtobed*) of paired-end sequencing reads containing columns of chromosome name, chromosome start, chromosome end, sequence ID, mapping quality score, and strand. ```{r} PEReads <- file.path(demo.dir, "demo.sorted.bed.txt.gz") head(read.table(PEReads, header=FALSE), 2) ``` The output is a list in BED file format and/or written to an output BED file, where each line contains the information of a cfDNA fragment. ```{r} fragInfo <- MergePEReads(PEReads) head(fragInfo, 2) ``` ## Merge methylation states of CpGs on two strands to fragment-level Function `MergeCpGs()` generates fragment-level methylation states of CpGs. The main inputs of it are two output files of *bismark methylation extractor*, which is a program performing methylation calling on bisulfite treated sequencing reads. The `CpG_OT*` file contains methylation information for CpGs on the original top strand (OT); the `CpG_OB*` file contains methylation information for CpGs on the original bottom strand (OB). Both files contain columns of sequence ID, methylation state, chromosome name, chromosome start, methylation call. ```{r} CpG_OT <- file.path(demo.dir, "CpG_OT_demo.txt.gz") CpG_OB <- file.path(demo.dir, "CpG_OB_demo.txt.gz") head(read.table(CpG_OT, header=FALSE), 2) head(read.table(CpG_OB, header=FALSE), 2) ``` The output is a list in BED file format and/or written to an output BED file, where each line contains methylation states of all CpGs on the same fragment. Column *methState* is a sequence of binary values indicating the methylation states of all CpGs on the same fragment (0 represents unmethylated CpG and 1 represents methylated CpG). ```{r} methInfo <- MergeCpGs(CpG_OT, CpG_OB) head(methInfo, 2) ``` ## Generate fragment-level information about methylation states Function `GenerateFragMeth()` combines the output lists of `MergePEReads()` and `MergeCpGs()` into one list, which contains both the fragment information and the methylation states of all CpGs on each fragment. ```{r} fragMeth <- GenerateFragMeth(fragInfo, methInfo) head(fragMeth, 2) ``` ## Generate the methylation pattern of markers Function `GenerateMarkerParam()` calculates paired shape parameters of beta distributions for each marker. There are three main inputs to this function: (1) a list of methylation levels (e.g., beta values), where each row is a sample and each column is a marker; (2) a vector of sample types (e.g., tumor or normal, tissue types) corresponding to the rows of the list; (3) a vector of marker names corresponding to the columns of the list. ```{r} methLevel <- read.table(file.path(demo.dir, "beta_matrix.txt.gz"), row.names=1, header = TRUE) sampleTypes <- read.table(file.path(demo.dir, "sample_type.txt.gz"), row.names=1, header = TRUE)$sampleType markerNames <- read.table(file.path(demo.dir, "marker_index.txt.gz"), row.names=1, header = TRUE)$markerIndex print(methLevel) print(sampleTypes) print(markerNames) ``` The output is a list containing the paired shape parameters of beta distributions for markers, which are delimited by `:`. Users can save this list into a file with column names, no row names, and columns are delimited by TAB for later use. ```{r} markerParam <- GenerateMarkerParam(methLevel, sampleTypes, markerNames) print(markerParam) ``` Note that parameter value `NA:NA` or `0:0` may cause errors in the following analyses. Remove these values before using this file. # Fragments intersecting with marker regions To make the computation more efficient, users may only keep the fragments that overlap with the genomic regions of markers. Here, we provide an example of using `R` package `r Biocpkg("GenomicRanges")` to perform the intersection. First, transform the two BED files into GRanges classes. ```{r message=FALSE} library(GenomicRanges) # a BED file of fragment-level methylation information frag_bed <- read.csv(file.path(demo.dir, "demo.fragment_level.meth.bed.txt.gz"), header=TRUE, sep="\t") frag_meth.gr <- GRanges(seqnames=frag_bed$chr, ranges=IRanges(frag_bed$start, frag_bed$end), strand=frag_bed$strand, methState=as.character(frag_bed$methState)) # a BED file of genomic regions of markers markers_bed <- read.csv(file.path(demo.dir, "markers.bed.txt.gz"), header=TRUE, sep="\t") markers.gr <- GRanges(seqnames=markers_bed$chr, ranges=IRanges(markers_bed$start, markers_bed$end), markerName=markers_bed$markerName) head(frag_meth.gr, 2) head(markers.gr, 2) ``` Then, overlap two GRanges classes and get the fragment-level methylation states intersecting with the markers. ```{r} ranges <- subsetByOverlaps(frag_meth.gr, markers.gr, ignore.strand=TRUE) hits <- findOverlaps(frag_meth.gr, markers.gr,ignore.strand=TRUE) idx <- subjectHits(hits) values <- DataFrame(markerName=markers.gr$markerName[idx]) mcols(ranges) <- c(mcols(ranges), values) marker.methState <- as.data.frame(cbind(ranges$markerName, ranges$methState)) colnames(marker.methState) <- c("markerName", "methState") head(marker.methState, 4) ``` # Cancer detection with `CancerDetector()` Function `CancerDetector()` separates cfDNA into tumor-derived fragments and background normal fragments and estimates the tumor burden. The main inputs are two files: (1) the fragment-level methylation states of reads (column `methState`) that mapped to the cancer-specific markers; (2) paired shape parameters of beta distributions for cancer-specific markers. All columns are delimited by TAB, and the first line is the column names. In addition, users can tune the parameter `lambda` to adjust the relative level of tumor burden. ```{r} fragMethFile <- file.path(demo.dir, "CancerDetector.reads.txt.gz") markerParamFile <- file.path(demo.dir, "CancerDetector.markers.txt.gz") head(read.csv(fragMethFile, sep = "\t", colClasses = "character"), 4) head(read.csv(markerParamFile, sep = "\t"), 4) ``` The output is the estimated tumor burden $\theta$ and the normal cfDNA fraction $1-\theta$. ```{r} CancerDetector(fragMethFile, markerParamFile, lambda=0.5, id="test") ``` # Tissue deconvolution with `cfDeconvolve()` Function `cfDeconvolve()` estimates fractions of cfDNA fragments from different tissues (> 2 classes). The main two input files are similar to function `CancerDetector()`: (1) the fragment-level methylation states of reads (column `methState`) that mapped to the tissue-specific markers; (2) paired shape parameters of beta distributions for tissue-specific markers. All columns are delimited by TAB, and there is a header line of the column names. ```{r} fragMethFile2 <- file.path(demo.dir, "cfDeconvolve.reads.txt.gz") markerParamFile2 <- file.path(demo.dir, "cfDeconvolve.markers.txt.gz") head(read.csv(fragMethFile2, header=TRUE, sep="\t", colClasses = "character"), 4) head(read.csv(markerParamFile2, header=TRUE, sep="\t", colClasses = "character"), 4) ``` Other input parameters are: - **Number of tissue types**: a positive integer number *k*. Reads will be classified into *k* different tissue types. - **Read-based tissue deconvolution EM algorithm type**: options are `em.global.unknown` (default), `em.global.known`, `em.local.unknown`, `em.local.known`. - **Likelihood ratio threshold**: a positive float number. We suggest this number is 2 (default). All reads with likelihood ratio < cutoff (default: -1.0, meanin no reads filtering is used) will not be used. Likelihood ratio is the max(all tissues' likelihoods)/min(all tissues' likelihoods). - **EM algorithm maximum iteration number**: a positive integer number. Default is 100. - **Random seed**: a random seed that initialize the EM algorithm. Default is 0. - **Sample ID**: the unique sample ID. For example: ```{r} numTissues <- 7 emAlgorithmType <- "em.global.unknown" likelihoodRatioThreshold <- 2 emMaxIterations <- 100 randomSeed <- 0 id <- "test" ``` The output is a list containing the cfDNA fractions of different tissue types and an unknown class. ```{r, message=TRUE} tissueFraction <- cfDeconvolve(fragMethFile2, markerParamFile2, numTissues, emAlgorithmType, likelihoodRatioThreshold, emMaxIterations, randomSeed, id) tissueFraction ``` # Tissue deconvolution with `cfSort()` Function `cfSort()` estimates fractions of cfDNA fragments derived from 29 major human tissues. It is the first supervised tissue deconvolution approach with deep learning models. The main input file is similar to function `CancerDetector()` and `cfDeconvolve()`: the fragment-level methylation states of reads (column `methState`) that mapped to the tissue-specific markers. The first column is the marker name of cfSort markers. ```{r} fragMethInfo <- file.path(demo.dir, "cfsort_reads.txt.gz") head(read.csv(fragMethInfo, header=TRUE, sep="\t", colClasses = "character"), 4) ``` The output is a list containing the cfDNA fractions of 29 tissue types. ```{r, message=FALSE} tissueFraction2 <- cfSort(fragMethInfo, id="demo") tissueFraction2 ``` # Reference [1] Li W, Li Q, Kang S, Same M, Zhou Y, Sun C, Liu CC, Matsuoka L, Sher L, Wong WH, Alber F, Zhou XJ. CancerDetector: ultrasensitive and non-invasive cancer detection at the resolution of individual reads using cell-free DNA methylation sequencing data. *Nucleic Acids Res*. 2018 Sep 6;46(15):e89. doi: 10.1093/nar/gky423. PMID: 29897492; PMCID: PMC6125664. [2] Del Vecchio G, Li Q, Li W, Thamotharan S, Tosevska A, Morselli M, Sung K, Janzen C, Zhou X, Pellegrini M, Devaskar SU. Cell-free DNA methylation and transcriptomic signature prediction of pregnancies with adverse outcomes. *Epigenetics*. 2021 Jun;16(6):642-661. doi: 10.1080/15592294.2020.1816774. PMID: 33045922; PMCID: PMC8143248. [3] Li S, Zeng W, Ni X, Liu Q, Li W, Stackpole ML, Zhou Y, Gower A, Krysan K, Ahuja P, Lu DS, Raman SS, Hsu W, Aberle DR, Magyar CE, French SW, Han SB, Garon EB, Agopian VG, Wong WH, Dubinett SM, Zhou XJ. Comprehensive tissue deconvolution of cell-free DNA by deep learning for disease diagnosis and monitoring. *Proc Natl Acad Sci U S A*. 2023 Jul 11;120(28):e2305236120. doi: 10.1073/pnas.2305236120. Epub 2023 Jul 3. PMID: 37399400. # Session info {-} ```{r sessionInfo, echo=FALSE} sessionInfo() ```