MICSQTL: Multi-omic deconvolution, Integration and Cell-type-specific Quantitative Trait Loci

Yue Pan1* and Qian Li1

1Department of Biostatistics, St. Jude Children's Research Hospital

24 October 2023

Abstract

MICSQTL (Multi-omic deconvolution, Integration and Cell-type-specific Quantitative Trait Loci) is a tool that estimates cell type proportions in bulk proteomes by leveraging shared information in matched transcriptomes. Based on the deconvoluted cell fractions, MICSQTL further integrates and visualizes multi-source profiles at bulk or cell type levels, as well as performs Quantitative Trait Loci (QTL) mapping at cellular resolution. This is a novel pipeline for multi-omic integrative data analysis that fulfills the need for comprehensive analysis of diverse data sources. This pipeline enables valuable insights into cellular composition, facilitates cell-type-specific protein QTL mapping, and streamlines multi-modal data integration and dimension reduction.

Package

MICSQTL 1.0.0

1 Introduction
2 Install MICSQTL
3 Load packages
4 Quick start
5 Licenses of the analysis methods
Session info

1 Introduction

Our pipeline, MICSQTL, integrates RNA and protein expressions to detect potential cell marker proteins and estimate cell abundance in mixed proteomes without a reference signature matrix. MICSQTL enables cell-type-specific quantitative trait loci (QTL) mapping for proteins or transcripts using bulk expression data and estimated cellular composition per molecule type, eliminating the necessity for single-cell sequencing. We use matched transcriptome-proteome from human brain frontal cortex tissue samples to demonstrate the input and output of our tool.

MICSQTL workflow.

2 Install MICSQTL

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("MICSQTL")

3 Load packages

library(MICSQTL)

Load packages for making plots.

library(reshape2)
library(GGally)
library(ggplot2)

4 Quick start

To conduct the analysis, you’ll need to start with a SummarizedExperiment object that contains bulk protein expression data in the assays slot. The row metadata (rowData slot) should contain information about the proteins. Additionally, you’ll require bulk gene expression data and a reference file (depends on the method you’re using) included as elements in the metadata slot. For more accurate cell-type fraction estimations, it’s recommended to include marker genes only.

This code chunk assumes that you already have a bulk protein matrix named “protein_data” and annotation information for proteins called “anno_protein.” The gene expression data is stored in “gene_data.”

se <- SummarizedExperiment(
    assays = list(protein = protein_data),
    rowData = anno_protein
)
metadata(se) <- list(
    gene_data = gene_data
)

Further information is necessary for visualization and csQTL (cell-type specific quantitative trait loci) analysis. Additional metadata can be incorporated later using a command like metadata(se)$new_data <- new_data if needed. For detailed instructions, please refer to the following sections and function documentation.

In this package, we provide an example SummarizedExperiment object containing the following elements:

protein_data: A subset of proteomic data PsychENCODE with 2,242 rows (protein) and 127 columns (sample).
anno_protein: A data frame with 2,242 rows and 4 columns (Chr, Start, End, Symbol) as annotations of each protein from protein_data.
ref_protein: A signature matrix with 2,242 rows (protein) and 4 columns (cell types), which serves as a reference of known cellular signatures.
gene_data: A data frame with 2,867 rows (genes) and 127 columns (sample).
ref_gene: A signature matrix with 4,872 rows (genes) and 5 columns (cell types), which serves as a reference of known cellular signatures.
prop_gene: A pre-defined deconvoluted transcriptome proportion matrix.
SNP_data: A sparse matrix with 2,000 rows (SNP), which stores the information of genetic variants at each location from one chromosome and 127 columns (sample, should match the sample in protein_data). Each matrix entry corresponds to the genotype group indicator (0, 1 or 2) for a sample at a genetic location.
anno_SNP: A data frame with 2,000 rows and 3 columns (CHROM, POS, ID), which stores Annotations of each SNP from SNP_data.
meta:A data frame with 127 rows (sample) and 2 columns (disease status and gender) as metadata.
cell_counts: A matrix containing cell counts across multiple subjects, where subjects are represented as rows and cell types as columns. Each entry (i, j) in the matrix indicates the count of cells belonging to the ith subject and jth cell type.

data(se)

4.1 Cell-type proportion deconvolution

This step estimates cell type proportions per molecule type.

In this current version, only nnls is supported as single-source deconvolution methods.

Alternatively, if there are cell-type proportion estimates results generated using other methods or obtained from other sources, just save that as an element (prop) in metadata slot and this deconvolution step can be skipped. Note that the samples in the cell-type proportion estimates must match the samples from bulk protein expression data.

4.2 Cross-source cell-type proportion deconvolution

The reference matrix for pure cell proteomics may be incomplete due to the limitations of single-cell proteomics technologies. To address this, we propose a novel cross-source cell-type fraction deconvolution method (Joint-AJ-RF) that leverages matched bulk transcriptome-proteome data. In the following example, we demonstrate how to estimate protein proportions by utilizing information from deconvoluted transcriptomes.

This method requires an external reference with cell counts from a similar tissue type (usually from small-scale single-cell or flow cytometry experiments). We provide a sample cell counts table in the metadata for illustration; however, in practice, it should be sourced from a matching tissue type.

The ajive_decomp function (more on this in the following section) with refactor_loading = TRUE should be employed to improve joint deconvolution by performing cross-source feature selection for potential protein cell markers.

se <- ajive_decomp(se, use_marker = FALSE, refactor_loading = TRUE)
se <- deconv(se, source = "cross", method = "Joint",
             use_refactor = 1000, cell_counts = se@metadata$cell_counts)

Boxplots of estimated cell proportions using cross-source.

An alternative approach involves a two-step process. First, obtain a sample-wise pre-defined deconvoluted transcriptome proportion and store as prop_gene in the metadata slot, use methods like CIBERSORT or MuSiC. Then, utilize the TCA algorithm (https://cran.r-project.org/web/packages/TCA/index.html) to calculate protein proportion estimates.

4.3 Integrative visualization

AJIVE (Angle based Joint and Individual Variation Explained) is useful when there are multiple data matrices measured on the same set of samples. It decomposes each data matrix as three parts: (1) Joint variation across data types (2) Individual structured variation for each data type and (3) Residual noise.

It is similar as principal component analysis (PCA), but principal component analysis only takes a single data set and decomposes it into modes of variation that maximize variation. AJIVE finds joint modes of variation from multiple data sources.

Common normalized scores are one of the desirable output to explore the joint behavior that is shared by different data sources. Below we show the visualization of common normalized scores. It is clear that the disease status of these samples are well separated by the first common normalized scores.

se <- ajive_decomp(se, plot = TRUE,
                   group_var = "disease",
                   scatter = TRUE, scatter_x = "cns_1", scatter_y = "cns_2")
metadata(se)$cns_plot

Sample group separation based on common normalized scores from AJIVE.

4.3.1 Comparison to PCA

pca_res <- prcomp(t(assay(se)), rank. = 3, scale. = FALSE)
pca_res_protein <- data.frame(pca_res[["x"]])
pca_res_protein <- cbind(pca_res_protein, metadata(se)$meta$disease)
colnames(pca_res_protein)[4] <- "disease"

ggpairs(pca_res_protein,
        columns = seq_len(3), aes(color = disease, alpha = 0.5),
        upper = list(continuous = "points")
) + theme_classic()


pca_res <- prcomp(t(metadata(se)$gene_data), rank. = 3, scale. = FALSE)
pca_res_gene <- data.frame(pca_res[["x"]])
pca_res_gene <- cbind(pca_res_gene, metadata(se)$meta$disease)
colnames(pca_res_gene)[4] <- "disease"
ggpairs(pca_res_gene,
        columns = seq_len(3), aes(color = disease, alpha = 0.5),
        upper = list(continuous = "points")
) + theme_classic()

Top3 principal components of proteome.

Top3 principal components of transcriptome.

4.4 Feature filtering

The feature filtering can be applied at both proteins/genes and SNPs. This step is optional but highly recommended to filter out some features that are not very informative or do not make much sense biologically. Note that this function is required to run even no filtering is expected to be done (just set filter_method = "null") to obtain a consistent object format for downstream analysis.

To apply feature filtering, annotation files for protein/gene and SNPs are required. The annotation file for proteins/genes should be stored in rowData(), where each row corresponds to a protein/gene with it’s symbol as row names. The first column should be a character vector indicating which chromosome each protein or gene is on. In addition, it should contain at least a “Start” column with numeric values indicating the start position on that chromosome, a “End” column with numeric values indicating the end position on that chromosome and a “Symbol” column as a unique name for each protein or gene.

head(rowData(se))
#> DataFrame with 6 rows and 4 columns
#>               Chr     Start       End      Symbol
#>       <character> <integer> <integer> <character>
#> AAGAB          15  67202823  67254631       AAGAB
#> AARS2           6  44300549  44313323       AARS2
#> AASS            7 122076491 122133726        AASS
#> ABAT           16   8735739   8781427        ABAT
#> ABCA1           9 104784317 104903679       ABCA1
#> ABCA2           9 137007931 137028140       ABCA2

The information from genetic variants should be stored in a P (the number of SNP) by N (the number of samples, should match the sample in counts slot) matrix contained as an element (SNP_data) in metadata slot. Each matrix entry corresponds to the genotype group indicator (0 for 0/0, 1 for 0/1 and 2 for 1/1) for a sample at a genetic location. The annotations of these SNP should be stored as an element (anno_SNP) in metadata slot. It should include at least the following columns: (1) “CHROM” (which chromosome the SNP is on); (2) “POS” (position of that SNP) and (3) “ID” (a unique identifier for each SNP, usually a combination of chromosome and its position).

The example SNP data provided here were restricted to chromosome 9 only. In practice, the SNPs may from multiple or even all chromosomes.

head(metadata(se)$anno_SNP)
#>        CHROM       POS          ID
#> 332373     9 137179658 9:137179658
#> 237392     9 104596634 9:104596634
#> 106390     9  28487163  9:28487163
#> 304108     9 126307371 9:126307371
#> 295846     9 122787821 9:122787821
#> 126055     9  33975396  9:33975396

For filtering at protein or gene level, only those symbols contained in target_SNP argument will be kept and if not provided, all SNPs will be used for further filtering.

For filtering at SNP level, there are three options: (1) filter out the SNPs that have minor allele frequency below the threshold defined by filter_allele argument (filter_method = "allele"); (2) filter out the SNPs that the fraction of samples in the smallest genotype group below the threshold defined by filter_geno argument (filter_method = "allele") and (3) restrict to cis-regulatory variants (filter_method = "distance"): the SNPs up to 1 Mb proximal to the start of the gene. Both filtering methods can be applied simultaneously by setting filter_method = c("allele", "distance").

The results after filtering will be stored as an element (choose_SNP_list) in metadata slot. It is a list with the length of the number of proteins for downstream analysis. Each element stores the index of SNPs to be tested for corresponding protein. The proteins with no SNPs correspond to it will be removed from the returned list.

To simplify the analysis, we only test 3 targeted proteins from chromosome 9 as an example.

target_protein <- rowData(se)[rowData(se)$Chr == 9, ][seq_len(3), "Symbol"]
se <- feature_filter(se,
    target_protein = target_protein,
    filter_method = c("allele", "distance"),
    filter_allele = 0.15,
    filter_geno = 0.05,
    ref_position = "TSS"
)

In this example, the number of SNPs corresponding to each protein after filtering ranges from 7 to 26.

unlist(lapply(metadata(se)$choose_SNP_list, length))
#>   ABCA1   ABCA2 AGTPBP1 
#>      26      22       7

4.5 csQTL analysis

In this step, the TOAST method is implemented for cell-type-specific differential expression analysis based on samples’ genotype.

The result will be stored as an element (TOAST_output) in metadata slot. It is a list with the same length as tested proteins or genes where each element consists of a table including protein or gene symbol, SNP ID and p-values from each cell type. A significant p-value indicates that the protein or gene expression is different among the sample from different genotype groups.

system.time(se <- csQTL(se))

We can check the results from csQTL analysis for one of target proteins:

res <- metadata(se)$TOAST_output[[2]]
head(res[order(apply(res, 1, min)), ])

5 Licenses of the analysis methods

method	citation
TCA	Rahmani, Elior, et al. “Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology.” Nature communications 10.1 (2019): 3417.
AJIVE	Feng, Qing, et al. “Angle-based joint and individual variation explained.” Journal of multivariate analysis 166 (2018): 241-265.
TOAST	Li, Ziyi, and Hao Wu. “TOAST: improving reference-free cell composition estimation by cross-cell type differential analysis.” Genome biology 20.1 (2019): 1-17.

Session info

#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] MICSQTL_1.0.0               SummarizedExperiment_1.32.0
#>  [3] Biobase_2.62.0              GenomicRanges_1.54.0       
#>  [5] GenomeInfoDb_1.38.0         IRanges_2.36.0             
#>  [7] S4Vectors_0.40.0            BiocGenerics_0.48.0        
#>  [9] MatrixGenerics_1.14.0       matrixStats_1.0.0          
#> [11] BiocStyle_2.30.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] bitops_1.0-7            pbapply_1.7-2           formatR_1.14           
#>   [4] rlang_1.1.1             magrittr_2.0.3          ggridges_0.5.4         
#>   [7] e1071_1.7-13            compiler_4.3.1          gdata_3.0.0            
#>  [10] vctrs_0.6.4             quadprog_1.5-8          stringr_1.5.0          
#>  [13] pkgconfig_2.0.3         crayon_1.5.2            fastmap_1.1.1          
#>  [16] backports_1.4.1         XVector_0.42.0          utf8_1.2.4             
#>  [19] rmarkdown_2.25          pracma_2.4.2            nloptr_2.0.3           
#>  [22] purrr_1.0.2             xfun_0.40               zlibbioc_1.48.0        
#>  [25] cachem_1.0.8            jsonlite_1.8.7          EpiDISH_2.18.0         
#>  [28] DelayedArray_0.28.0     reshape_0.8.9           BiocParallel_1.36.0    
#>  [31] gmodels_2.18.1.1        broom_1.0.5             parallel_4.3.1         
#>  [34] R6_2.5.1                bslib_0.5.1             stringi_1.7.12         
#>  [37] RColorBrewer_1.1-3      limma_3.58.0            GGally_2.1.2           
#>  [40] car_3.1-2               jquerylib_0.1.4         Rcpp_1.0.11            
#>  [43] bookdown_0.36           iterators_1.0.14        knitr_1.44             
#>  [46] Matrix_1.6-1.1          nnls_1.5                splines_4.3.1          
#>  [49] tidyselect_1.2.0        abind_1.4-5             yaml_2.3.7             
#>  [52] doParallel_1.0.17       codetools_0.2-19        lattice_0.22-5         
#>  [55] tibble_3.2.1            plyr_1.8.9              evaluate_0.22          
#>  [58] lambda.r_1.2.4          futile.logger_1.4.3     proxy_0.4-27           
#>  [61] ggpubr_0.6.0            pillar_1.9.0            BiocManager_1.30.22    
#>  [64] carData_3.0-5           foreach_1.5.2           generics_0.1.3         
#>  [67] RCurl_1.98-1.12         ggplot2_3.4.4           munsell_0.5.0          
#>  [70] TOAST_1.16.0            scales_1.2.1            gtools_3.9.4           
#>  [73] class_7.3-22            glue_1.6.2              tools_4.3.1            
#>  [76] data.table_1.14.8       ggsignif_0.6.4          grid_4.3.1             
#>  [79] dirmult_0.1.3-5         tidyr_1.3.0             matrixcalc_1.0-6       
#>  [82] colorspace_2.1-0        GenomeInfoDbData_1.2.11 rsvd_1.0.5             
#>  [85] cli_3.6.1               futile.options_1.0.1    fansi_1.0.5            
#>  [88] S4Arrays_1.2.0          dplyr_1.1.3             corpcor_1.6.10         
#>  [91] gtable_0.3.4            rstatix_0.7.2           sass_0.4.7             
#>  [94] digest_0.6.33           SparseArray_1.2.0       htmltools_0.5.6.1      
#>  [97] lifecycle_1.0.3         TCA_1.2.1               locfdr_1.1-8           
#> [100] statmod_1.5.0           MASS_7.3-60