The software package ZygosityPredictor allows to predict how many copies of a gene are affected by mutations, in particular small variants (single nucleotide variants, SNVs, and small insertions and deletions, Indels). In addition to the basic calculations of the affected copy number of a variant, ZygosityPredictor can phase multiple variants in a gene and ultimately make a prediction if and how many wild-type copies of the gene are left. This information proves to be of particular use in the context of translational medicine. For example, in cancer genomes, ZygosityPredictor can address whether unmutated copies of tumor-suppressor genes are present. ZygosityPredictor was developed to handle somatic and germline small variants. In addition to the small variant context, it can assess larger deletions, which may cause losses of exons or whole genes.
The following code can be used to install ZygosityPredictor. The installation needs to be done once.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ZygosityPredictor")
To demonstrate the use of ZygosityPredictor, NGS data from the Seq-C2 project were used [1]. In the following chunk, all required datalayer of the WGS_LL_T_1 sample are loaded. The variants are loaded as GRanges objects, one for somatic copy number alterations (GR_SCNA), one for germline- and one for somatic small variants (GR_GERM_SMALL_VARS and GR_SOM_SMALL_VARS). The input formats will be discussed in more detail in section 4.
library(ZygosityPredictor)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(GenomicRanges)
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
## lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
## tapply, union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
##
## first, rename
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
## Loading required package: GenomeInfoDb
# file to sequence alignment
FILE_BAM <- system.file("extdata", "ZP_example.bam",
package = "ZygosityPredictor")
# meta information of the sample
PURITY <- 0.98
PLOIDY <- 1.57
SEX <- "female"
# variants
data("GR_SCNA")
data("GR_GERM_SMALL_VARS")
data("GR_SOM_SMALL_VARS")
# used gene model
data("GR_GENE_MODEL")
Two functions are provided to calculate how many copies are affected by single small variants, based on two formulas, one for germline variants and one for somatic variants.
To calculate the affected copies for a germline variant by using
aff_germ_copies()
, the following inputs are required:
the output is a numeric value that indicates the affected copies.
## as an example we take the first variant of our prepared input data and
## extract the required information from different input data layer
## the allele frequency and the chromosome can be taken from the GRanges object
AF = elementMetadata(GR_GERM_SMALL_VARS[1])[["af"]]
CHR = seqnames(GR_GERM_SMALL_VARS[1]) %>%
as.character()
## the total copy number (tcn) can be extracted from the CNV object by selecting
## the CNV from the position of the variant
TCN = elementMetadata(
subsetByOverlaps(GR_SCNA, GR_GERM_SMALL_VARS[1])
)[["tcn"]]
## purity and sex can be taken from the global variables of the sample
## with this function call the affected copies are calculated for the variant
aff_germ_copies(af=AF,
tcn=TCN,
purity=PURITY,
chr=CHR,
sex=SEX)
## [1] 1.50733
To calculate how many copies are affected by a somatic variant by
aff_som_copies()
, the same inputs are required, but a different formula
is evaluated:
## the function for somatic variants works the same way as the germline function
AF = elementMetadata(GR_SOM_SMALL_VARS[1])[["af"]]
CHR = seqnames(GR_SOM_SMALL_VARS[1]) %>%
as.character()
TCN = elementMetadata(
subsetByOverlaps(GR_SCNA, GR_SOM_SMALL_VARS[1])
)[["tcn"]]
aff_som_copies(af=AF,
chr=CHR,
tcn=TCN,
purity=PURITY,
sex=SEX)
## [1] 1.614292
In order to apply the previously mentioned functions to a whole set of variants and calculate the affected copies, the following code can be used.
## as an example we calculate the affected copies for the somatic variants:
GR_SOM_SMALL_VARS %>%
## cnv information for every variant is required.. merge with cnv object
IRanges::mergeByOverlaps(GR_SCNA) %>%
as_tibble() %>%
## select relevant columns
select(chr=1, pos=2, gene, af, tcn) %>%
mutate_at(.vars=c("tcn", "af"), .funs=as.numeric) %>%
rowwise() %>%
mutate(
aff_copies = aff_som_copies(chr, af, tcn, PURITY, SEX),
wt_copies = tcn-aff_copies
)
## # A tibble: 6 × 7
## # Rowwise:
## chr pos gene af tcn aff_copies wt_copies
## <fct> <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 chr6 4734801 CDYL 0.327 4.90 1.61 3.29
## 2 chr6 4734837 CDYL 0.274 4.90 1.35 3.55
## 3 chr8 143805314 SCRIB 0.0608 2.92 0.180 2.74
## 4 chr8 143805342 SCRIB 0.0515 2.92 0.152 2.77
## 5 chr17 7675088 TP53 0.474 1.49 0.724 0.763
## 6 chr17 41771694 JUP 0.857 1.06 0.947 0.117
In this section, we will use the WGS_LL_T_1 dataset from the Seq-C2 project as an example to investigate whether mutations in the following genes result in total absence of wildtype copies. The genes which were selected as an example for the analysis are shown below. The example data set was reduced to these genes.
Some inputs are optional, while others are compulsory. The latter are labeled with “**”. Of note, ZygosityPredictor is applied downstream of variant calling, therefore the variant calls, including information on identified somatic copy number aberrations (sCNAs), have to be provided. The inputs can be divided into five classes:
The prediction of zygosity for a set of genes in a sample can be assessed by
the predict_zygosity()
function.
Important note: The runtime of the analysis depends strongly on the number of genes to be assessed and on the number of input variants. It is therefore recommended to reduce the number of genes to the necessary ones. Also, depending on the research question to be addressed, the variants should be filtered to the most relevant ones, not only because of runtime considerations, but also to sharpen the final result. A large number of mutations in a gene, some of which are of little biological relevance or even SNPs, will inevitably reduce the validity of the results.
full_prediction = predict_zygosity(
purity = PURITY,
ploidy = PLOIDY,
sex = SEX,
somCna = GR_SCNA,
somSmallVars = GR_SOM_SMALL_VARS,
germSmallVars = GR_GERM_SMALL_VARS,
geneModel = GR_GENE_MODEL,
bamDna = FILE_BAM,
showReadDetail = TRUE
)
## no RNA file provided: Analysis will be done without RNA reads
Of note, the results displayed here were chosen to explain and exemplify the functionality of the tool; biological and medical impact of the specific variants had not been a selection criterion. The result which is returned by the function consists of a list of three (or up to six) tibbles:
The first result of the function is the evaluation per variant. In this step
all information required for subsequent steps is annotated and the affected
copies per variant are calculated. For every variant, the function checks
whether it already affects all copies of the gene. The format of the output is
a tibble; the number of rows corresponds to the total
number of input variants. The tool annotates a few self-explanatory columns
such as the origin of the respective variant (germline/somatic) or the class
(snv/ins/del). It also appends information from the sCNA results: the total
copy number at the position of the variant and the information if a loss of
heterozygosity is present (cna_type). Also, an ID is assigned to every small
variant. Then, the genes are numbered consecutively in order to unambiguously
assign variants to genes in the following analysis. The most important results
of this step are the calculation of the affected and wildtype copies, as well
as, depending on the data, an initial check of whether a variant already affects
all copies.
Of note, there can be situations in which left wildtype copies are below 0.5,
but still this information is not sufficient to predict “all copies affected”
without doubt. Depending on the origin of the variant, further criteria must
be met (e.g., LOH). The procedure for this first check is shown in the pre_info
column.
# here the new columns are selected and viewed
full_prediction$eval_per_variant %>%
# these steps are just to have a better overview
# round numeric columns
mutate_at(.vars=c("af","tcn", "aff_cp", "wt_cp"),
.funs = round, 2) %>%
# to get a better overview, the columns which are already in the inut are
# removed
select(-chr, -pos, -alt, -ref, -af)
## # A tibble: 19 × 10
## gene mut_id origin class tcn tcn_assumed cna_type aff_cp wt_cp pre_info
## <chr> <chr> <chr> <chr> <dbl> <lgl> <chr> <dbl> <dbl> <chr>
## 1 TRANK1 m1 germline snv 1.49 FALSE LOH 1.5 -0.01 germlin…
## 2 BRCA1 m1 germline snv 1.06 FALSE HZ 1.07 -0.01 germlin…
## 3 ELP2 m1 germline del 2 FALSE HZ 1.07 0.93 germlin…
## 4 ELP2 m2 germline snv 2 FALSE HZ 0.8 1.2 germlin…
## 5 ELP2 m3 germline snv 2 FALSE HZ 0.8 1.2 germlin…
## 6 CDYL m1 somatic snv 4.9 FALSE HZ 1.61 3.29 somatic…
## 7 CDYL m2 somatic snv 4.9 FALSE HZ 1.35 3.55 somatic…
## 8 SCRIB m1 somatic snv 2.92 FALSE LOH 0.18 2.74 somatic…
## 9 SCRIB m2 somatic snv 2.92 FALSE LOH 0.15 2.77 somatic…
## 10 TP53 m1 somatic snv 1.49 FALSE LOH 0.72 0.76 somatic…
## 11 JUP m1 somatic del 1.06 FALSE LOH 0.95 0.12 somatic…
## 12 TRIM3 m1 somatic homd… 0.03 FALSE homdel 1.54 0.03 homdel …
## 13 TRIM3 m2 somatic inco… 1.06 FALSE incompl… 0.51 1.06 incompl…
## 14 TRIM3 m3 somatic inco… 0.03 FALSE incompl… 1.54 0.03 incompl…
## 15 TP53 m2 somatic inco… 1.49 FALSE incompl… 0.08 1.49 incompl…
## 16 JUP m2 somatic inco… 1.06 FALSE incompl… 0.51 1.06 incompl…
## 17 BRCA1 m2 somatic inco… 1.06 FALSE incompl… 0.51 1.06 incompl…
## 18 BRCA1 m3 somatic inco… 1.06 FALSE incompl… 0.51 1.06 incompl…
## 19 TRANK1 m2 somatic inco… 1.49 FALSE incompl… 0.08 1.49 incompl…
For example, mutation m1 in TRANK1 (first line) fulfills all criteria that are sufficient for a germline variant to say with a high degree of certainty that all copies are affected, which can be seen in the pre_info column. The same applies to the somatic JUP variant. In the case of the ELP2, CDYL and SCRIB genes, there are two or even three variants, neither of which leads to complete loss of the wildtype gene, which is why the next step is to check whether they together affect all copies. Homzygous deletions (homdels) always lead automatically to “all copies affected”, whereas incomplete deletions do not affect all copies.
The aggregation at gene level is one of the main features of ZygosityPredictor. A prediction of how many copies are affected by mutations is provided for every gene. The final prediction is stored in tibble format with three columns: gene, status - either “all copies affected” or “wt copies left” - and an info column that explains how the prediction was made for that gene.
full_prediction$eval_per_gene
## # A tibble: 8 × 3
## gene status info
## <chr> <chr> <chr>
## 1 TRANK1 all_copies_affected germline-variant -> LOH -> AF >= 0.5
## 2 BRCA1 wt_copies_left germline-variant -> no LOH
## 3 ELP2 all_copies_affected phasing: at least one of the combinations of muts …
## 4 CDYL wt_copies_left phasing: variants detected on the same reads -> sa…
## 5 SCRIB wt_copies_left phasing: variants found only on different reads ->…
## 6 TP53 wt_copies_left somatic-variant -> LOH -> left wt copies 0.76
## 7 JUP all_copies_affected somatic-variant -> LOH -> left wt copies 0.12
## 8 TRIM3 all_copies_affected homdel
Each gene appears once in the table. If there are several variants in one gene, the info always refers to the one that affects the highest number of copies. If most copies are affected by a combination of two small variants, the info refers to both of them. In general, the heterozygous deletions have the lowest priority; and if any other small variant or homdel is present in the gene, it will always be indicated as the more relevant variant. In the case of ELP2, CDYL and SCRIB, several small variants are present, which is why the tool attempts to do phasing at the present position. This is indicated by “phasing” in the info column. The exact results of the phasing can be viewed in detail in the next step.
A haplotype phasing is only performed if more than one variant occurs in a gene, each of which does not affect all copies of the gene by itself. The tool then tries to phase at this position to check if the variants are located on the same or different copies. That means that not at every prediction a phasing is done and therefore the following output is not always present.
full_prediction$phasing_info
## # A tibble: 5 × 17
## gene comb dist tcn status left_wt_cp both none mut1 mut2 dev_var
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ELP2 m1-m2 20 2 diff 0.13 0 2 1 6 0
## 2 ELP2 m2-m3 8828 2 null 2 0 0 0 0 0
## 3 ELP2 m1-m3 8848 2 null 2 0 0 0 0 0
## 4 CDYL m1-m2 36 4.9 same 3.29 25 22 0 0 0
## 5 SCRIB m1-m2 28 2.92 diff 2.59 0 38 5 7 0
## # ℹ 6 more variables: no_overlap <dbl>, none_raw <dbl>, info <chr>,
## # DNA_rds <dbl>, RNA_rds <dbl>, class_comb <chr>
In the present case, phasing was performed for three genes separately. The output contains information such as the distance between the two variants and the tcn at the position. If tcn differs at both positions, for example for breakpoint genes, the larger one is displayed here. The status column shows whether the mutations were found on the same copy (same) or on different copies (diff). This prediction is determined by whether the mutations are on the same reads / read pairs. The column DNA_rds contains the number of reads / read pairs that cover both positions. If RNA is available, the number of reads / read pairs is also displayed for RNA reads. Depending on whether the variants were found on different or the same copies, a final prediction for the remaining wildtype-copies can be calculated from the affected copies of both variants. This final calculation can be found in the column left_wt_cp. Only if the variants are found on different copies and the remaining wildtype-copies are below 0.5, it can be assumed that no wildtype copies are left. The four columns both, none, mut1, and mut2 indicate how many of the reads / read pairs belong to each category: Both variants detected, none, only the first, or only the second. RNA reads that do not contain an exon necessary for phasing, e.g., due to alternative splicing or that map somewhere entirely different in the genome, end up in a category called non-overlapping. The category dev_var contains reads / read pairs that carry a different variant than the one to be expected at the position. The output for ELP2 shows how genes are handled in which more than 2 variants occur. Each mutation is then combined once with each other to perform the phasing on each combination. In the case of ELP2 three variants are present, resulting in three possible combinations. Two of the variants are located close to each other (m1 and m2), while the third one has a distance of almost 9000 basepairs to them. Therefore, it is only possible to assign a status to the combination of the close variants. The two other combinations get the status “null” as no overlapping reads / read pairs are present. Theoretically, the tool can attempt phasing if no reads overlap both positions. For such distant variants, intermediate SNPs can be used, which then have to be provided by the input vcf.
This output is only provided if the option showReadDetail is set TRUE. It consists of a table containing every read pair that was used to perform haplotype phasing. In more detail, the read name of the read in the bam file is provided with the classification in one of the four categories mentioned before (both, none, mut1, mut2, no_overlap, dev_var).
full_prediction$readpair_info
## # A tibble: 108 × 5
## qname result origin comb gene
## <chr> <chr> <chr> <chr> <chr>
## 1 K00281:15:HM775BBXX:8:2113:22972:30398 mut2 DNA m1-m2 ELP2
## 2 K00281:21:HT52WBBXX:2:1203:24870:26652 mut2 DNA m1-m2 ELP2
## 3 K00281:21:HT52WBBXX:2:1203:24880:26424 mut2 DNA m1-m2 ELP2
## 4 K00281:15:HM775BBXX:7:2112:14925:27989 mut1 DNA m1-m2 ELP2
## 5 K00281:21:HT52WBBXX:3:1205:26230:48843 none DNA m1-m2 ELP2
## 6 K00281:15:HM775BBXX:4:2210:28767:35409 none DNA m1-m2 ELP2
## 7 K00281:15:HM775BBXX:6:2224:16102:47682 mut2 DNA m1-m2 ELP2
## 8 K00281:21:HT52WBBXX:2:2118:32157:42478 mut2 DNA m1-m2 ELP2
## 9 K00281:15:HM775BBXX:4:1213:4117:24208 mut2 DNA m1-m2 ELP2
## 10 K00281:21:HT52WBBXX:1:1117:2270:27690 both DNA m1-m2 CDYL
## # ℹ 98 more rows
For example here, the first row tells us that the read / read pair with the name K00281:15:HM775BBXX:8:2113:22972:30398 (first row) stem from a fragment of the gene ELP2 and can be used for the phasing of the combination of the variants m1 and m2. It was assigned into the category “mut2” which means that m2 is present on that read / read pair, but m1 was not detected. Depending on the number of variants of the used sample and the coverage, this table can get very big, which may affect runtime.
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] XVector_0.42.0 GenomicRanges_1.54.0 GenomeInfoDb_1.38.0
## [4] IRanges_2.36.0 S4Vectors_0.40.0 BiocGenerics_0.48.0
## [7] stringr_1.5.0 dplyr_1.1.3 ZygosityPredictor_1.2.0
## [10] BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 blob_1.2.4
## [3] filelock_1.0.2 Biostrings_2.70.0
## [5] bitops_1.0-7 fastmap_1.1.1
## [7] RCurl_1.98-1.12 BiocFileCache_2.10.0
## [9] VariantAnnotation_1.48.0 GenomicAlignments_1.38.0
## [11] XML_3.99-0.14 digest_0.6.33
## [13] lifecycle_1.0.3 KEGGREST_1.42.0
## [15] RSQLite_2.3.1 magrittr_2.0.3
## [17] compiler_4.3.1 rlang_1.1.1
## [19] sass_0.4.7 progress_1.2.2
## [21] tools_4.3.1 igraph_1.5.1
## [23] utf8_1.2.4 yaml_2.3.7
## [25] rtracklayer_1.62.0 knitr_1.44
## [27] prettyunits_1.2.0 S4Arrays_1.2.0
## [29] bit_4.0.5 curl_5.1.0
## [31] DelayedArray_0.28.0 xml2_1.3.5
## [33] abind_1.4-5 BiocParallel_1.36.0
## [35] withr_2.5.1 purrr_1.0.2
## [37] grid_4.3.1 fansi_1.0.5
## [39] biomaRt_2.58.0 SummarizedExperiment_1.32.0
## [41] cli_3.6.1 rmarkdown_2.25
## [43] crayon_1.5.2 generics_0.1.3
## [45] httr_1.4.7 rjson_0.2.21
## [47] DBI_1.1.3 cachem_1.0.8
## [49] zlibbioc_1.48.0 parallel_4.3.1
## [51] AnnotationDbi_1.64.0 BiocManager_1.30.22
## [53] restfulr_0.0.15 matrixStats_1.0.0
## [55] vctrs_0.6.4 Matrix_1.6-1.1
## [57] jsonlite_1.8.7 bookdown_0.36
## [59] hms_1.1.3 bit64_4.0.5
## [61] GenomicFeatures_1.54.0 jquerylib_0.1.4
## [63] glue_1.6.2 codetools_0.2-19
## [65] stringi_1.7.12 BiocIO_1.12.0
## [67] tibble_3.2.1 pillar_1.9.0
## [69] rappdirs_0.3.3 htmltools_0.5.6.1
## [71] BSgenome_1.70.0 GenomeInfoDbData_1.2.11
## [73] R6_2.5.1 dbplyr_2.3.4
## [75] evaluate_0.22 lattice_0.22-5
## [77] Biobase_2.62.0 png_0.1-8
## [79] Rsamtools_2.18.0 memoise_2.0.1
## [81] bslib_0.5.1 SparseArray_1.2.0
## [83] xfun_0.40 MatrixGenerics_1.14.0
## [85] pkgconfig_2.0.3