• 1 How could I generate a manifest file with filtering of Race and Ethnicity?
• 2 How can I get the number of cases with RNA-Seq data added by date to TCGA project with GenomicDataCommons?

1 How could I generate a manifest file with filtering of Race and Ethnicity?

From https://support.bioconductor.org/p/9138939/.

library(GenomicDataCommons,quietly = TRUE)

I made a small change to the filtering expression approach based on changes to lazy evaluation best practices. There is now no need to include the ~ in the filter expression. So:

q = files() %>%
  GenomicDataCommons::filter(
    cases.project.project_id == 'TCGA-COAD' &
      data_type == 'Aligned Reads' &
      experimental_strategy == 'RNA-Seq' &
      data_format == 'BAM')

And get a count of the results:

count(q)
## [1] 1188

And retrieve the manifest:

manifest(q)
id                                    data_format  access
<chr>                                 <chr>        <chr>
b5b03243-3074-4db1-b22e-15d14e790f57  BAM          controlled
fb0ea225-1004-412e-892a-f01dc9d14581  BAM          controlled
87da2a2a-586e-4797-9d2a-4f423a4e3641  BAM          controlled
79126fea-a11b-4410-9e74-60e333eee910  BAM          controlled
c91e5d6c-5a2f-4f74-91a9-a36d5656dcb4  BAM          controlled
8cd0db1a-c53e-4f34-ab17-e1cd97477868  BAM          controlled
6f478235-d73c-45dc-935d-fc934206fa36  BAM          controlled
6d0b8cc5-da52-42b1-9b5b-0aee1dbca1ba  BAM          controlled
085a55a1-98f5-42ab-b7f7-9793a2df3991  BAM          controlled
fa292a95-d125-4c4f-bf88-a066d31fea74  BAM          controlled

Your question about race and ethnicity is a good one. To find the relevant fields, first list every field available on the files endpoint.

all_fields = available_fields(files())

We can then grep for 'race' or 'ethnic' to find candidate fields to look at.

grep('race|ethnic',all_fields,value=TRUE)
## [1] "cases.demographic.ethnicity"                                           
## [2] "cases.demographic.race"                                                
## [3] "cases.follow_ups.hormonal_contraceptive_type"                          
## [4] "cases.follow_ups.hormonal_contraceptive_use"                           
## [5] "cases.follow_ups.other_clinical_attributes.hormonal_contraceptive_type"
## [6] "cases.follow_ups.other_clinical_attributes.hormonal_contraceptive_use" 
## [7] "cases.follow_ups.scan_tracer_used"
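Several of the matches above are incidental: "contraceptive" and "tracer" both contain the substring "race". A minimal sketch of a tighter pattern follows, using only the field names printed above. The anchored regular expression is my own suggestion, not part of the original answer, and it assumes the fields of interest end in exactly `.race` or `.ethnicity`.

```r
# Field names copied from the grep output above; in a live session this
# vector would come from available_fields(files()).
matched_fields <- c(
  "cases.demographic.ethnicity",
  "cases.demographic.race",
  "cases.follow_ups.hormonal_contraceptive_type",
  "cases.follow_ups.hormonal_contraceptive_use",
  "cases.follow_ups.other_clinical_attributes.hormonal_contraceptive_type",
  "cases.follow_ups.other_clinical_attributes.hormonal_contraceptive_use",
  "cases.follow_ups.scan_tracer_used"
)

# Keep only fields whose final dot-separated component is exactly
# "race" or "ethnicity".
demographic_fields <- grep("\\.(race|ethnicity)$", matched_fields, value = TRUE)
print(demographic_fields)
```

Only the two `cases.demographic` fields survive this pattern, which are the ones used in the rest of this answer.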

Now, we can check available values for each field to determine how to complete our filter expressions.

available_values('files',"cases.demographic.ethnicity")
## [1] "not hispanic or latino" "not reported"           "hispanic or latino"    
## [4] "unknown"                "_missing"
available_values('files',"cases.demographic.race")
##  [1] "white"                                    
##  [2] "not reported"                             
##  [3] "black or african american"                
##  [4] "asian"                                    
##  [5] "unknown"                                  
##  [6] "other"                                    
##  [7] "american indian or alaska native"         
##  [8] "native hawaiian or other pacific islander"
##  [9] "not allowed to collect"                   
## [10] "_missing"
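As far as I know, available_values() reports the values seen across the whole files endpoint rather than within a particular query. To see how the files in our own query are distributed across race values before filtering, one option is to facet the query itself, mirroring the facet() and aggregations() pattern used in question 2. This is a sketch that needs network access to the GDC; the variable name `race_counts` is mine.

```r
library(GenomicDataCommons)

# Same base query as above: TCGA-COAD RNA-Seq BAM files.
q <- files() %>%
  GenomicDataCommons::filter(
    cases.project.project_id == 'TCGA-COAD' &
      data_type == 'Aligned Reads' &
      experimental_strategy == 'RNA-Seq' &
      data_format == 'BAM')

# Facet the filtered query on race. aggregations() returns a list with one
# data frame per faceted field; each has `key` and `doc_count` columns.
race_counts <- aggregations(facet(q, "cases.demographic.race"))[[1]]

print(race_counts)
```

The `doc_count` for each `key` shows how many files in the query a given filter value would keep.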

We can now complete our filter expression to limit the results to files from cases whose race is recorded as white.

q_white_only = q %>%
  GenomicDataCommons::filter(cases.demographic.race=='white')
count(q_white_only)
## [1] 695
manifest(q_white_only)
id                                    data_format  access
<chr>                                 <chr>        <chr>
fb0ea225-1004-412e-892a-f01dc9d14581  BAM          controlled
79126fea-a11b-4410-9e74-60e333eee910  BAM          controlled
8cd0db1a-c53e-4f34-ab17-e1cd97477868  BAM          controlled
fa292a95-d125-4c4f-bf88-a066d31fea74  BAM          controlled
48aab61e-8550-4698-9c0f-9db6c0c92793  BAM          controlled
d0a01deb-187c-4fd0-9e4c-c9149ac7f1b4  BAM          controlled
c2962006-6ad5-4fe6-8a5a-1c8eb9fad90c  BAM          controlled
85535fab-ba49-4b3e-b372-08c15a997042  BAM          controlled
5598442c-f6a5-4bc0-b7cb-4176f5318313  BAM          controlled
3f50c43c-1a61-4e36-a23b-aa9c9b106102  BAM          controlled
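The same pattern extends to ethnicity, which the question also asked about. Here is a minimal sketch, assuming the goal is files from cases recorded as both white and not hispanic or latino. The values come from the available_values() output above, and the particular combination is only illustrative. It needs network access to the GDC.

```r
library(GenomicDataCommons)

# Base query: TCGA-COAD RNA-Seq BAM files.
q <- files() %>%
  GenomicDataCommons::filter(
    cases.project.project_id == 'TCGA-COAD' &
      data_type == 'Aligned Reads' &
      experimental_strategy == 'RNA-Seq' &
      data_format == 'BAM')

# Add race and ethnicity conditions in a single expression. As the drop
# from 1188 to 695 above suggests, a further filter() call narrows the
# existing query rather than replacing its filter.
q_subset <- q %>%
  GenomicDataCommons::filter(
    cases.demographic.race == 'white' &
      cases.demographic.ethnicity == 'not hispanic or latino')

# Call count() with its namespace so a later library(dplyr) cannot mask it.
n_base <- GenomicDataCommons::count(q)
n_subset <- GenomicDataCommons::count(q_subset)
print(c(base = n_base, subset = n_subset))

# Manifest for the narrowed query.
subset_manifest <- manifest(q_subset)
```

Swapping in any other value listed by available_values() gives the manifest for a different subgroup.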

2 How can I get the number of cases with RNA-Seq data added by date to TCGA project with GenomicDataCommons?

I would like to get the number of cases added (created; any logical datetime would suffice here) to the TCGA project by experiment type. I attempted to get this data via the GenomicDataCommons package, but I believe it is giving me the number of files for a given experiment type rather than the number of cases. How can I get the number of cases for which there is RNA-Seq data?

library(tibble)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:GenomicDataCommons':
## 
##     count, filter, select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GenomicDataCommons)

cases() %>% 
  GenomicDataCommons::filter(~ project.program.name=='TCGA' & 
                               files.experimental_strategy=='RNA-Seq') %>% 
  facet(c("files.created_datetime")) %>% 
  aggregations() %>% 
  .[[1]] %>% 
  as_tibble() %>%
  dplyr::arrange(dplyr::desc(key))
doc_count  key
<int>      <chr>
164        2024-06-14t14:27:00.916424-05:00
412        2024-06-14t13:28:10.644120-05:00
151        2023-03-09t00:35:51.387873-06:00
79         2023-02-19t04:41:11.008116-06:00
458        2023-02-19t04:36:10.605050-06:00
80         2023-02-19t04:28:49.400023-06:00
178        2023-02-19t04:23:49.092629-06:00
516        2023-02-19t04:18:49.453628-06:00
179        2023-02-19t04:13:47.877168-06:00
290        2023-02-19t04:08:47.478925-06:00
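Each key is a full timestamp, so the counts are split across many buckets. Below is a minimal base R sketch that collapses them to calendar days, using only the ten rows shown above; the full result may contain more rows. One caveat that I am assuming rather than have verified: a case with files created at different timestamps may fall into more than one bucket, so the daily sums are probably better read as upper bounds than as distinct case counts.

```r
# The ten rows shown above, entered by hand so the example runs offline.
agg <- data.frame(
  doc_count = c(164L, 412L, 151L, 79L, 458L, 80L, 178L, 516L, 179L, 290L),
  key = c(
    "2024-06-14t14:27:00.916424-05:00",
    "2024-06-14t13:28:10.644120-05:00",
    "2023-03-09t00:35:51.387873-06:00",
    "2023-02-19t04:41:11.008116-06:00",
    "2023-02-19t04:36:10.605050-06:00",
    "2023-02-19t04:28:49.400023-06:00",
    "2023-02-19t04:23:49.092629-06:00",
    "2023-02-19t04:18:49.453628-06:00",
    "2023-02-19t04:13:47.877168-06:00",
    "2023-02-19t04:08:47.478925-06:00"
  ),
  stringsAsFactors = FALSE
)

# The first 10 characters of each key are the calendar date (YYYY-MM-DD),
# taken as written in the key without any time zone conversion.
agg$day <- as.Date(substr(agg$key, 1, 10))

# Sum the bucket counts per day and list the most recent day first.
per_day <- aggregate(doc_count ~ day, data = agg, FUN = sum)
per_day <- per_day[order(per_day$day, decreasing = TRUE), ]
print(per_day)
```

In a live session, `agg` would instead be the tibble produced by the pipeline above.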