Issue Accessing AACR Genie Data with cBioPortalData #79

grainneireland · 2025-01-08T20:28:26Z

Hi, I am trying to access data with cBioPortalData from the AACR Genie Cohort v17.0 - public. I am having trouble downloading the data.

Below is the query I am inputting:

> aacr <- cBioPortal(
+     hostname = "genie.cbioportal.org",
+     token = "   ")

This returns the following warning messages:

Warning messages:
1: In .service_validate_md5sum(api_reference_url, api_reference_md5sum,  :
  service version differs from validated version
    service url: https://genie.cbioportal.org/api/v2/api-docs
    observed md5sum: b3a87becd0fe3ae60458441dddb676f2
    expected md5sum: 7314de5c5e8056e4e07b411b3e5a0cb9
2: In readLines(url, encoding = "UTF-8") :
  incomplete final line found on '/Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library/cBioPortalData/service/cBioPortal/api.json'

I am interested in all three molecular profiles (copy number alterations, mutations and structural variants) contained within the Genie Cohort v17. I am unable to download the structural variant profile. The copy number alterations and mutations profiles download but return an error message.

I am having the below issues with all genes / gene panels. I have included ABL1 as an example.

Downloading all three molecular profiles together

Inputted query

> ABL1 <- cBioPortalData(
+     api = aacr,
+     studyId = "genie_public",
+     genes = "ABL1", by = "hugoGeneSymbol",
+     molecularProfileIds = c("genie_public_cna", "genie_public_mutations", "genie_public_structural_variants")
+ )

This returns the below message:

The build status for 'genie_public' is unknown.
  Use 'downloadStudy()' to manually obtain the data.
  Proceed anyway? [y/n]: 
y
harmonizing input:
  removing 75277 colData rownames not in sampleMap 'primary'

As well as the below output:

> ABL1
A MultiAssayExperiment object of 2 listed
 experiments with user-defined names and respective classes.
 Containing an ExperimentList class object of length 2:
 [1] genie_public_mutations: RangedSummarizedExperiment with 1600 rows and 3665 columns
 [2] genie_public_cna: SummarizedExperiment with 1 rows and 146850 columns
Functionality:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample coordination DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
 exportClass() - save data to flat files

No structural variant experiment is contained in the resulting MultiAssayExperiment.

Downloading structural variants individually

Inputted query:

> ABL1_sv <- cBioPortalData(
+     api = aacr,
+     studyId = "genie_public",
+     genes = "ABL1", by = "hugoGeneSymbol",
+     molecularProfileIds = "genie_public_structural_variants")

This returns the following error message:

The build status for 'genie_public' is unknown.
  Use 'downloadStudy()' to manually obtain the data.
  Proceed anyway? [y/n]: 
y
Error in .invoke_fun(api, name, use_cache, ...) : Not Found (HTTP 404).

No output is returned and the data does not download.

Downloading the copy number alterations individually

Inputted query:

> ABL1_cna <- cBioPortalData(
+     api = aacr,
+     studyId = "genie_public",
+     genes = "ABL1", by = "hugoGeneSymbol",
+     molecularProfileIds = "genie_public_cna")

This returns the following messages:

The build status for 'genie_public' is unknown.
  Use 'downloadStudy()' to manually obtain the data.
  Proceed anyway? [y/n]: 
y
harmonizing input:
  removing 76624 colData rownames not in sampleMap 'primary'

As well as the following output:

> ABL1_cna
A MultiAssayExperiment object of 1 listed
 experiment with a user-defined name and respective class.
 Containing an ExperimentList class object of length 1:
 [1] genie_public_cna: SummarizedExperiment with 1 rows and 146850 columns
Functionality:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample coordination DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
 exportClass() - save data to flat files

Downloading mutations data individually

Inputted query:

> ABL1_mutations <- cBioPortalData(
+     api = aacr,
+     studyId = "genie_public",
+     genes = "ABL1", by = "hugoGeneSymbol",
+     molecularProfileIds = "genie_public_mutations")

This returns the following messages:

The build status for 'genie_public' is unknown.
  Use 'downloadStudy()' to manually obtain the data.
  Proceed anyway? [y/n]: 
y
harmonizing input:
  removing 192766 colData rownames not in sampleMap 'primary'

As well as the following output:

> ABL1_mutations
A MultiAssayExperiment object of 1 listed
 experiment with a user-defined name and respective class.
 Containing an ExperimentList class object of length 1:
 [1] genie_public_mutations: RangedSummarizedExperiment with 1600 rows and 3665 columns
Functionality:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample coordination DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
 exportClass() - save data to flat files

Session Info

sessionInfo() is as follows:

> sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-apple-darwin20
Running under: macOS Monterey 12.7.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Dublin
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] cBioPortalData_2.18.1       MultiAssayExperiment_1.32.0
 [3] SummarizedExperiment_1.36.0 Biobase_2.66.0             
 [5] GenomicRanges_1.58.0        GenomeInfoDb_1.42.1        
 [7] IRanges_2.40.1              S4Vectors_0.44.0           
 [9] BiocGenerics_0.52.0         MatrixGenerics_1.18.0      
[11] matrixStats_1.4.1           AnVIL_1.18.2               
[13] AnVILBase_1.0.0             dplyr_1.1.4                

loaded via a namespace (and not attached):
 [1] DBI_1.2.3                 bitops_1.0-9             
 [3] httr2_1.0.7               formatR_1.14             
 [5] rlang_1.1.4               magrittr_2.0.3           
 [7] compiler_4.4.2            RSQLite_2.3.9            
 [9] GenomicFeatures_1.58.0    png_0.1-8                
[11] vctrs_0.6.5               rvest_1.0.4              
[13] stringr_1.5.1             pkgconfig_2.0.3          
[15] crayon_1.5.3              fastmap_1.2.0            
[17] dbplyr_2.5.0              XVector_0.46.0           
[19] Rsamtools_2.22.0          promises_1.3.2           
[21] tzdb_0.4.0                UCSC.utils_1.2.0         
[23] purrr_1.0.2               bit_4.5.0.1              
[25] zlibbioc_1.52.0           cachem_1.1.0             
[27] jsonlite_1.8.9            blob_1.2.4               
[29] later_1.4.1               DelayedArray_0.32.0      
[31] BiocParallel_1.40.0       parallel_4.4.2           
[33] R6_2.5.1                  stringi_1.8.4            
[35] rtracklayer_1.66.0        Rcpp_1.0.13-1            
[37] readr_2.1.5               BiocBaseUtils_1.8.0      
[39] httpuv_1.6.15             Matrix_1.7-1             
[41] tidyselect_1.2.1          rstudioapi_0.17.1        
[43] abind_1.4-8               yaml_2.3.10              
[45] codetools_0.2-20          miniUI_0.1.1.1           
[47] curl_6.0.1                lattice_0.22-6           
[49] tibble_3.2.1              withr_3.0.2              
[51] shiny_1.10.0              KEGGREST_1.46.0          
[53] lambda.r_1.2.4            futile.logger_1.4.3      
[55] BiocFileCache_2.14.0      xml2_1.3.6               
[57] Biostrings_2.74.1         pillar_1.10.0            
[59] filelock_1.0.3            DT_0.33                  
[61] TCGAutils_1.26.0          generics_0.1.3           
[63] RCurl_1.98-1.16           hms_1.1.3                
[65] xtable_1.8-4              RTCGAToolbox_2.36.0      
[67] glue_1.8.0                tools_4.4.2              
[69] BiocIO_1.16.0             data.table_1.16.4        
[71] GenomicAlignments_1.42.0  rapiclient_0.1.8         
[73] XML_3.99-0.18             grid_4.4.2               
[75] tidyr_1.3.1               AnnotationDbi_1.68.0     
[77] GenomeInfoDbData_1.2.13   RaggedExperiment_1.30.0  
[79] RJSONIO_1.3-1.9           restfulr_0.0.15          
[81] cli_3.6.3                 rappdirs_0.3.3           
[83] futile.options_1.0.1      GenomicDataCommons_1.30.0
[85] S4Arrays_1.6.0            digest_0.6.37            
[87] SparseArray_1.6.0         rjson_0.2.23             
[89] htmlwidgets_1.6.4         memoise_2.0.1            
[91] htmltools_0.5.8.1         lifecycle_1.0.4          
[93] httr_1.4.7                mime_0.12                
[95] bit64_4.5.2

Thank you for your help

LiNk-NY · 2025-01-09T00:42:42Z

Hi @grainneireland

Can you use markdown code chunks to format the code? It is hard to read.

Note: use triple backticks to delimit a code chunk.

```r
<R code goes here>
```

grainneireland · 2025-01-09T10:33:33Z

Thank you for the advice! I've edited the original issue with the code chunk formatting - hopefully it will be easier to read.

LiNk-NY · 2025-01-09T21:43:54Z

Hi Grainne, @grainneireland

Thank you for reporting. It looks like there is no data at the endpoint below, even though the molecularProfileId `genie_public_structural_variants` is listed for the study:

suppressPackageStartupMessages(library(cBioPortalData))
genie <- cBioPortal(hostname = "genie.cbioportal.org", token = "~/Downloads/cbioportal_data_access_token.txt")
samps <- allSamples(genie, "genie_public")[["sampleId"]]
res <- genie$getDiscreteCopyNumbersInMolecularProfileUsingGET(
    sampleListId = "genie_public_all",
    molecularProfileId = "genie_public_structural_variants"
)
res
#> Response [https://genie.cbioportal.org/api/molecular-profiles/genie_public_structural_variants/discrete-copy-number?sampleListId=genie_public_all]
#>   Date: 2025-01-09 21:42
#>   Status: 404
#>   Content-Type: application/json
#>   Size: 75 B
httr::content(res)
#> $message
#> [1] "Molecular profile not found: genie_public_structural_variants"

I would contact the maintainers of the genie repository to ensure that the data is there. It may be in another location that I am not aware of.

I have asked on the cBioPortal slack for more information. You may also consider asking on the Google groups site: https://groups.google.com/g/cbioportal
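
Since the structural-variants request fails with a 404 while the other two profiles come back, one possible workaround in the meantime is to request each profile on its own and trap errors, so that a single failing profile does not abort the rest. The following is a minimal sketch and not part of cBioPortalData: `fetch_each_profile`, `failed_profiles` and the `fetch` argument are hypothetical names, and the commented usage assumes the `aacr` object and profile IDs from the original post.

```r
# Hypothetical helper: call `fetch(id)` for every profile ID and record
# whether it succeeded, instead of stopping at the first error.
fetch_each_profile <- function(profile_ids, fetch) {
    results <- lapply(profile_ids, function(id) {
        tryCatch(
            list(ok = TRUE, value = fetch(id), error = NA_character_),
            error = function(e) {
                # Keep the error text (e.g. "Not Found (HTTP 404).") for inspection
                list(ok = FALSE, value = NULL, error = conditionMessage(e))
            }
        )
    })
    names(results) <- profile_ids
    results
}

# Hypothetical helper: names of the profiles whose request failed.
failed_profiles <- function(results) {
    ok <- vapply(results, function(x) x$ok, logical(1))
    names(results)[!ok]
}

# Usage sketch (needs network access and a valid GENIE token, so it is
# left commented out here):
# res <- fetch_each_profile(
#     c("genie_public_cna", "genie_public_mutations",
#       "genie_public_structural_variants"),
#     function(id) cBioPortalData(
#         api = aacr, studyId = "genie_public",
#         genes = "ABL1", by = "hugoGeneSymbol",
#         molecularProfileIds = id
#     )
# )
# failed_profiles(res)
```

With this pattern the profiles that do download remain available in `res`, and the error text for the failing one is kept alongside for reporting.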

grainneireland · 2025-01-09T22:55:40Z

That is very useful to know - thank you for your help.

Downloading the copy-number-alterations and mutations molecular profiles returns the below warning message:

 “The build status for 'genie_public' is unknown. 
 Use 'downloadStudy()' to manually obtain the data.
  Proceed anyway? [y/n]:”

Would this affect the ability to download the complete datasets of the copy-number-alterations and mutations molecular profiles, or the format in which they are downloaded?

LiNk-NY · 2025-01-09T23:07:10Z

Hi Grainne, @grainneireland

Based on what I see at https://genie.cbioportal.org/datasets (compared to https://cbioportal.org/datasets), bulk data download (as .tar.gz) is not available; thus, downloadStudy would not work in this case.

> Downloading the copy-number-alterations and mutations molecular profiles returns the below warning message:

That message is for the original cBioPortal hostname where we validate whether a particular studyId is building. It is not really relevant for the genie.cbioportal.org site (since we don't run validation workflows for it).
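
The idea of limiting the build-status check to the main portal could be sketched roughly as below. This only illustrates the logic described above and is not the package's actual code: `needs_build_check` is a hypothetical helper, and treating `www.cbioportal.org` as the one validated hostname is an assumption.

```r
# Hypothetical helper: should the build-status prompt be shown at all?
# Assumption: only the main portal hostname has validated study builds.
needs_build_check <- function(hostname, validated = "www.cbioportal.org") {
    # Hostnames are case-insensitive, so compare in lower case
    tolower(hostname) %in% tolower(validated)
}

needs_build_check("www.cbioportal.org")    # TRUE
needs_build_check("genie.cbioportal.org")  # FALSE
```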

grainneireland · 2025-01-10T10:47:55Z

Thank you for your help!

LiNk-NY mentioned this issue Jan 9, 2025

remove build status checks for alternative hostnames #80

Closed
