Maintainer
Zilu Zhou zhouzilu@pennmedicine.upenn.edu
This is a demo for using the iCNV package in R.
iCNV is a normalization and copy number variation
detection procedure for multiple study designs: WES only, WGS only, SNP
array only, or any combination of SNP and sequencing data.
iCNV applies platform specific normalization, utilizes
allele specific reads from sequencing and integrates matched NGS and
SNP-array data by a Hidden Markov Model (HMM). Figure below shows the
overall pipeline. This package specifically emphasizes on the steps
within the parenthesis. Below is an example on calling copy number
variation using whole-exome sequencing data and array SNPs of 46
modified samples from 1000 Genome Project. Only small portion of
chromosome 22 are analyzed for illustration purposes. We will separately
shows normalization for WES and SNP array. We will also introduce
integrated calling procedure as well as single platform procedure.
NGS | Array
BAM BED(UCSC for WES or bed_generator.R for WGS 2.2) | SNP Intensity(in standard format)
|----------------| | |
| | | |icnv_array_input(2.4)
|SAMTools(2.3) |CODEX(2.2) | |
| | | |-----------|
Variants BAF(vcf) PLR | Array LRR Array BAF
| | | | |
| | | |SVD(2.4) |
|bambaf_from_vcf | | | |
| (2.3) | | Normalized LRR |
| | | | |
-------------------------------------------------------------------------------------
|
|iCNV_detection(2.5,2.6)
|
CNV calling
|
|icnv_output_to_gb(2.5,2.6)
|
Genome Browser file
We strongly recommend combining platform when both WES data and SNP
array are available. However, for high quality WGS data, SNP information
isn’t so necessary.
2. iCNV workflow
We could separate the basic iCNV workflow into 6 sections:
1. package installation; 2. .bam file
normalization; 3. sequence variants BAF calling;
4. SNP array LRR normalization and BAF;
5. CNV detection using iCNV_detection
function. 6. CNV detection using
iCNV_detection
function with single platform. We will
illustrate them one by one in the following sessions.
2.1 Install iCNV.
Install CODEX
first
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("CODEX")
Install iCNV
from Bioconductor:
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("iCNV")
Install the current release from Github:
install.packages("devtools")
library(devtools)
install_github("zhouzilu/iCNV")
2.2 .bam file normalization using CODEX
For researchers work on WGS data, you need to generate your own BED
file for CODEX normalization. We made a function bed_generator
to help you with this.
After you have BAM file and BED file available, follow code below to
get normalized NGS PLR. Code modified from https://github.com/yuchaojiang/CODEX with permission. If
you have any additional question, please refer to CODEX
.
iCNV only adopt normalized Poisson Likelihood Ratio from
CODEX
, representing NGS ‘intensity’. In this example, we
utilize toy data from the 1000 Genomes Project for illustration.
library(CODEX)
library(WES.1KG.WUGSC)
library(iCNV)
dir <- system.file("extdata", package = "WES.1KG.WUGSC")
bamFile <- list.files(dir, pattern = '*.bam$')
bamdir <- file.path(dir, bamFile)
sampname <- as.matrix(read.table(file.path(dir, "sampname")))
bedFile <- file.path(dir, "chr22_400_to_500.bed")
chr <- 22
bambedObj <- getbambed(bamdir = bamdir, bedFile = bedFile,
sampname = sampname, projectname = "icnv.demo.", chr)
bamdir <- bambedObj$bamdir; sampname <- bambedObj$sampname
ref <- bambedObj$ref; projectname <- bambedObj$projectname; chr <- bambedObj$chr
coverageObj <- getcoverage(bambedObj, mapqthres = 20)
Y <- coverageObj$Y; readlength <- coverageObj$readlength
gc <- getgc(chr, ref)
mapp <- getmapp(chr, ref)
qcObj <- qc(Y, sampname, chr, ref, mapp, gc, cov_thresh = c(20, 4000),
length_thresh = c(20, 2000), mapp_thresh = 0.9, gc_thresh = c(20, 80))
Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc; gc_qc <- qcObj$gc_qc
mapp_qc <- qcObj$mapp_qc; ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat
normObj <- normalize(Y_qc, gc_qc, K = 1:9)
Yhat <- normObj$Yhat; AIC <- normObj$AIC; BIC <- normObj$BIC
RSS <- normObj$RSS; K <- normObj$K
choiceofK(AIC, BIC, RSS, K, filename = paste0(projectname, chr, "_choiceofK.pdf"))
save(qcObj,normObj,sampname,file=paste0(projectname,'plrObj_', chr,".rda"))
CODEX reports all three statistical metrics (AIC, BIC, percent of
Variance explained) and uses BIC as the default method to determine the
number of latent Poisson factors. Since false positives can be screened
out through a closer examination of the post-segmentation data, whereas
CNV signals removed in the normalization step cannot be recovered, CODEX
opts for a more conservative normalization that, when in doubt, uses a
smaller value of K. For example,

Here we pick optK=2
load(paste0(projectname,'plrObj_',chr,".rda"))
optK <- K[which.max(BIC)]
Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc; gc_qc <- qcObj$gc_qc
mapp_qc <- qcObj$mapp_qc; ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat
Yhat <- normObj$Yhat; AIC <- normObj$AIC; BIC <- normObj$BIC
RSS <- normObj$RSS; K <- normObj$K
ref_qc <- qcObj$ref_qc
sampname_qc <- qcObj$sampname_qc
Y_norm <- normObj$Yhat[[optK]]
plr <- log(pmax(Y_qc,0.0001)/pmax(Y_norm,0.0001))
ngs_plr <- lapply(seq_len(ncol(plr)), function(i) plr[,i])
ngs_plr.pos <- lapply(seq_len(ncol(plr)),function(x){return(cbind(start(ref_qc),end(ref_qc)))})
save(sampname_qc,ngs_plr,ngs_plr.pos,file=paste0(projectname,'plrObj_',chr,'_',optK,'.rda'))
For detailed illustration of CODEX, please visit https://github.com/yuchaojiang/CODEX
2.3 sequence variants BAF calling
For sequencing data without sophisticated pipeline and SNVs call set
in VCF format, we manually call SNVs from quality controlled BAM files
by mpileup module in samtools, and calculate B allele frequency(BAF) on
heterogeneous loci by dividing AD (Number of high-quality non-reference
bases, FORMAT) from DP (Number of high-quality bases, FORMAT). Example
code are:
cd PATH/TO/BAM
for i in *bam; do PATH/TO/SAMTOOLS/samtools mpileup -ugI -t AD -t DP -f
PATH/TO/REF/human_hg37.fasta $i | PATH/TO/BAFTOOLS/bcftools
call -cv -O z -o PATH/TO/OUTPUT/$i.vcf.gz; done
for i in *gz; do PATH/TO/BAFTOOLS/bcftools view -g het -i 'FORMAT/DP>10' -O z -o $i.filter $i; done
We could further extract variants BAF info from vcf file by function
bambaf_from_vcf.
projectname <- "icnv.demo."
dir <- system.file("extdata", package="iCNV")
bambaf_from_vcf(dir,'bam_vcf.list',chr=22,projectname)
load(paste0(projectname,'bambaf_22.rda'))
2.4 SNP array LRR normalization and BAF
Because signal intensity files varies from platform, we set a
standard signal intensity file format for each individual. The format
looks as follow:
Name,Chr,POS,sample_1.Log R Ratio,sample_1.B Allele Freq
rs1,22,15462739,-0.096390619874,0.0443964861333
rs2,22,15520991,-0.154103130102,0.963218688965
rs3,22,15780940,-0.110297381878,0.0457459762692
rs4,22,15863717,-0.21270519495,0.957377910614
rs5,22,16532045,-0.0330782271922,0.0300635993481
First row is the rsid (Name), followed by chromosome (Chr) and
position (POS), then the log R ratio (sample1.Log R Ratio) and BAF
(sample1.B Allele Freq). We could use function get_array_input
to convert into iCNV input. For example, in demo:
dir <- system.file("extdata", package="iCNV")
chr <- 22
projectname <- "icnv.demo."
pattern <- paste0('*.csv.arrayicnv$')
get_array_input(dir,pattern,chr,projectname)
load(paste0(projectname,'array_lrrbaf_',chr,'.rda'))
For some of the SNP array LRR data, we need to apply SVD
normalization to remove high dimension noisy and preserve low dimension
signal. The best way to decide data sanity is by plotting out the data
by plot_intensity
function. Noisy data has the feature of
local strip across samples. Conventional way for identifying elbow
points as the optK can also apply here. Example code for remove high
dimension noisy and plot:
library(iCNV)
projectname <- "icnv.demo."
load(paste0(projectname,'array_lrrbaf_',chr,'.rda'))
lrr.sd <- mapply(function(x){
x=pmax(pmin(x,100),-100);(x-mean(x,na.rm=TRUE))/sd(x,na.rm = TRUE)
},snp_lrr,SIMPLIFY = TRUE)
lrr.sd.dena <- apply(lrr.sd,2,function(x){x[is.na(x)]=mean(x,na.rm=TRUE);return(x)})
lrr.svd <- svd (t(lrr.sd.dena))
save(lrr.svd,file=paste0(projectname,'array_lrrbaf_svd_',chr,'.rda'))
plot(x=seq_len(10),y=(lrr.svd$d[1:10])^2/sum(lrr.svd$d^2),
xlab='rank',ylab='D square',main='Variance explained',type='l')

There is no universal rule of picking the optK. Usually we prefer
those “elbow points”. We pick optK=2
library(iCNV)
optK <- 5
D <- diag(lrr.svd$d)
D.lowrank <- diag(c(rep(0,optK),lrr.svd$d[-(seq_len(optK))]))
lrr.denoise <- t(lrr.svd$u %*% D.lowrank %*% t(lrr.svd$v))
snp_lrr <- lapply(seq_len(ncol(lrr.denoise)), function(i) lrr.denoise[,i])
save(snp_lrr,snp_baf,snp_lrr.pos,snp_baf.pos,
file=file.path(dir,paste0(projectname,'array_lrrbaf_',chr,'_svded.rda')))
2.5 Mutiple platform CNV detection using iCNV
At this step, we should already have PLR and variants BAF from
sequencing, normalized LRR and BAF from SNP array. All the input should
be in list form. This is mainly to accommodate the fact that length for
ngs_baf
and ngs_baf.pos
is sample specific
str(ngs_plr)
str(ngs_plr.pos)
str(ngs_baf)
str(ngs_baf.pos)
str(snp_lrr)
str(snp_lrr.pos)
str(snp_baf)
str(snp_baf.pos)
projectname <- 'icnv.demo.'
icnv_res0 <- iCNV_detection(ngs_plr,snp_lrr,
ngs_baf,snp_baf,
ngs_plr.pos,snp_lrr.pos,
ngs_baf.pos,snp_baf.pos,
projname=projectname,CN=1,mu=c(-3,0,2),cap=TRUE,visual = 1)
icnv.output <- output_list(icnv_res0,sampname_qc,CN=1)
head(icnv.output)
The results plot looks are follow. Each row is a sample and each
column is a hidden state. The color indicates hidden state Z-score
(large positive number prefers amplification, low negative number
prefers deletion). Black dots represent amplification detected, while
white dots show deletion detected.

We could further convert icnv.output to Genome Browser input
file:
icnv.output <- output_list(icnv_res=icnv_res0,sampleid=sampname_qc, CN=0, min_size=10000)
gb_input <- icnv_output_to_gb(22,icnv.output)
write.table(gb_input,file=paste0(projectname,'icnv_res_gb_chr',chr,'.tab'),
quote=FALSE,col.names=FALSE,row.names=FALSE)
The color code for CNV in Genome Browser is:

This is different from the color code for us plotting the data. We
are just try to make it coordinate with DGV notation. We could also plot
out the data in DGV format by setting col='DGV'
in
plotHMMscore
.

We could also plot information in a single individual using function
plotindi
. Example shows below:
I <- 24
plotindi(ngs_plr,snp_lrr,
ngs_baf,snp_baf,
ngs_plr.pos,snp_lrr.pos,
ngs_baf.pos,snp_baf.pos,
icnv_res0,I)



2.6 Single platform CNV detection using iCNV
At this step, we should already have PLR and variants BAF from
sequencing OR normalized LRR and BAF from SNP array. Please make sure
all the input are in list form.
NGS only CNV detection using iCNV. As we showed in
the paper, we highly recommend using iCNV for WGS CNV detection.
str(ngs_plr)
str(ngs_plr.pos)
str(ngs_baf)
str(ngs_baf.pos)
projectname <- 'icnv.demo.ngs.'
icnv_res0_ngs <- iCNV_detection(ngs_plr=ngs_plr, ngs_baf = ngs_baf,
ngs_plr.pos = ngs_plr.pos,ngs_baf.pos = ngs_baf.pos,
projname=projectname,CN=0,mu=c(-3,0,2),cap=TRUE,visual = 2)
icnv.output <- output_list(icnv_res0_ngs,sampname_qc,CN=0)
head(icnv.output)
plotHMMscore(icnv_res0_ngs,title=projectname)

SNP array only CNV detection using iCNV
str(snp_lrr)
str(snp_lrr.pos)
str(snp_baf)
str(snp_baf.pos)
projectname <- 'icnv.demo.snp'
icnv_res0_snp <- iCNV_detection(snp_lrr=snp_lrr, snp_baf = snp_baf,
snp_lrr.pos = snp_lrr.pos,snp_baf.pos = snp_baf.pos,
projname=projectname,CN=1,mu=c(-3,0,2),cap=TRUE,visual = 2)
icnv.output <- output_list(icnv_res0_snp,sampname_qc,CN=1)
head(icnv.output)
plotHMMscore(icnv_res0_snp,title=projectname)
