Issue Accessing AACR Genie Data with cBioPortalData #79

Open
grainneireland opened this issue Jan 8, 2025 · 6 comments

grainneireland commented Jan 8, 2025

Hi, I am trying to access data with cBioPortalData from the AACR Genie Cohort v17.0 - public. I am having trouble downloading the data.

Below is the query I am inputting:

> aacr <- cBioPortal(
+     hostname = "genie.cbioportal.org",
+     token = "   ")

Which returns the below warning messages:

Warning messages:
1: In .service_validate_md5sum(api_reference_url, api_reference_md5sum,  :
  service version differs from validated version
    service url: https://genie.cbioportal.org/api/v2/api-docs
    observed md5sum: b3a87becd0fe3ae60458441dddb676f2
    expected md5sum: 7314de5c5e8056e4e07b411b3e5a0cb9
2: In readLines(url, encoding = "UTF-8") :
  incomplete final line found on '/Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library/cBioPortalData/service/cBioPortal/api.json'

I am interested in all three molecular profiles (copy number alterations, mutations, and structural variants) contained within the Genie Cohort v17. I am unable to download the structural variant profile. The copy number alteration and mutation profiles do download, but print the messages shown below.

I am having the below issues with all genes / gene panels. I have included ABL1 as an example.

Downloading all Three Molecular Profiles Together

Inputted query

> ABL1 <- cBioPortalData(
+     api = aacr,
+     studyId = "genie_public",
+     genes = "ABL1", by = "hugoGeneSymbol",
+     molecularProfileIds = c("genie_public_cna", "genie_public_mutations", "genie_public_structural_variants")
+ )

This returns the below message:

The build status for 'genie_public' is unknown.
  Use 'downloadStudy()' to manually obtain the data.
  Proceed anyway? [y/n]: 
y
harmonizing input:
  removing 75277 colData rownames not in sampleMap 'primary'

As well as the below output:

> ABL1
A MultiAssayExperiment object of 2 listed
 experiments with user-defined names and respective classes.
 Containing an ExperimentList class object of length 2:
 [1] genie_public_mutations: RangedSummarizedExperiment with 1600 rows and 3665 columns
 [2] genie_public_cna: SummarizedExperiment with 1 rows and 146850 columns
Functionality:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample coordination DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
 exportClass() - save data to flat files

No structural variant experiment is contained within the MultiAssayExperiment.

Downloading Structural Variants individually

Inputted query:

> ABL1_sv <- cBioPortalData(
+     api = aacr,
+     studyId = "genie_public",
+     genes = "ABL1", by = "hugoGeneSymbol",
+     molecularProfileIds = "genie_public_structural_variants")

Returns the below error message:

The build status for 'genie_public' is unknown.
  Use 'downloadStudy()' to manually obtain the data.
  Proceed anyway? [y/n]: 
y
Error in .invoke_fun(api, name, use_cache, ...) : Not Found (HTTP 404).

No output is returned and the data does not download.

Downloading the copy number alterations individually

Inputted query:

> ABL1_cna <- cBioPortalData(
+     api = aacr,
+     studyId = "genie_public",
+     genes = "ABL1", by = "hugoGeneSymbol",
+     molecularProfileIds = "genie_public_cna")

Returns the following message (not an error):

The build status for 'genie_public' is unknown.
  Use 'downloadStudy()' to manually obtain the data.
  Proceed anyway? [y/n]: 
y
harmonizing input:
  removing 76624 colData rownames not in sampleMap 'primary'

As well as the following output:

> ABL1_cna
A MultiAssayExperiment object of 1 listed
 experiment with a user-defined name and respective class.
 Containing an ExperimentList class object of length 1:
 [1] genie_public_cna: SummarizedExperiment with 1 rows and 146850 columns
Functionality:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample coordination DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
 exportClass() - save data to flat files

Downloading mutations data individually

Inputted query:

> ABL1_mutations <- cBioPortalData(
+     api = aacr,
+     studyId = "genie_public",
+     genes = "ABL1", by = "hugoGeneSymbol",
+     molecularProfileIds = "genie_public_mutations")

This returns the following message (not an error):

The build status for 'genie_public' is unknown.
  Use 'downloadStudy()' to manually obtain the data.
  Proceed anyway? [y/n]: 
y
harmonizing input:
  removing 192766 colData rownames not in sampleMap 'primary'

As well as the following output:

> ABL1_mutations
A MultiAssayExperiment object of 1 listed
 experiment with a user-defined name and respective class.
 Containing an ExperimentList class object of length 1:
 [1] genie_public_mutations: RangedSummarizedExperiment with 1600 rows and 3665 columns
Functionality:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample coordination DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
 exportClass() - save data to flat files

Session Info

sessionInfo() is as follows:

> sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-apple-darwin20
Running under: macOS Monterey 12.7.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Dublin
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] cBioPortalData_2.18.1       MultiAssayExperiment_1.32.0
 [3] SummarizedExperiment_1.36.0 Biobase_2.66.0             
 [5] GenomicRanges_1.58.0        GenomeInfoDb_1.42.1        
 [7] IRanges_2.40.1              S4Vectors_0.44.0           
 [9] BiocGenerics_0.52.0         MatrixGenerics_1.18.0      
[11] matrixStats_1.4.1           AnVIL_1.18.2               
[13] AnVILBase_1.0.0             dplyr_1.1.4                

loaded via a namespace (and not attached):
 [1] DBI_1.2.3                 bitops_1.0-9             
 [3] httr2_1.0.7               formatR_1.14             
 [5] rlang_1.1.4               magrittr_2.0.3           
 [7] compiler_4.4.2            RSQLite_2.3.9            
 [9] GenomicFeatures_1.58.0    png_0.1-8                
[11] vctrs_0.6.5               rvest_1.0.4              
[13] stringr_1.5.1             pkgconfig_2.0.3          
[15] crayon_1.5.3              fastmap_1.2.0            
[17] dbplyr_2.5.0              XVector_0.46.0           
[19] Rsamtools_2.22.0          promises_1.3.2           
[21] tzdb_0.4.0                UCSC.utils_1.2.0         
[23] purrr_1.0.2               bit_4.5.0.1              
[25] zlibbioc_1.52.0           cachem_1.1.0             
[27] jsonlite_1.8.9            blob_1.2.4               
[29] later_1.4.1               DelayedArray_0.32.0      
[31] BiocParallel_1.40.0       parallel_4.4.2           
[33] R6_2.5.1                  stringi_1.8.4            
[35] rtracklayer_1.66.0        Rcpp_1.0.13-1            
[37] readr_2.1.5               BiocBaseUtils_1.8.0      
[39] httpuv_1.6.15             Matrix_1.7-1             
[41] tidyselect_1.2.1          rstudioapi_0.17.1        
[43] abind_1.4-8               yaml_2.3.10              
[45] codetools_0.2-20          miniUI_0.1.1.1           
[47] curl_6.0.1                lattice_0.22-6           
[49] tibble_3.2.1              withr_3.0.2              
[51] shiny_1.10.0              KEGGREST_1.46.0          
[53] lambda.r_1.2.4            futile.logger_1.4.3      
[55] BiocFileCache_2.14.0      xml2_1.3.6               
[57] Biostrings_2.74.1         pillar_1.10.0            
[59] filelock_1.0.3            DT_0.33                  
[61] TCGAutils_1.26.0          generics_0.1.3           
[63] RCurl_1.98-1.16           hms_1.1.3                
[65] xtable_1.8-4              RTCGAToolbox_2.36.0      
[67] glue_1.8.0                tools_4.4.2              
[69] BiocIO_1.16.0             data.table_1.16.4        
[71] GenomicAlignments_1.42.0  rapiclient_0.1.8         
[73] XML_3.99-0.18             grid_4.4.2               
[75] tidyr_1.3.1               AnnotationDbi_1.68.0     
[77] GenomeInfoDbData_1.2.13   RaggedExperiment_1.30.0  
[79] RJSONIO_1.3-1.9           restfulr_0.0.15          
[81] cli_3.6.3                 rappdirs_0.3.3           
[83] futile.options_1.0.1      GenomicDataCommons_1.30.0
[85] S4Arrays_1.6.0            digest_0.6.37            
[87] SparseArray_1.6.0         rjson_0.2.23             
[89] htmlwidgets_1.6.4         memoise_2.0.1            
[91] htmltools_0.5.8.1         lifecycle_1.0.4          
[93] httr_1.4.7                mime_0.12                
[95] bit64_4.5.2   

Thank you for your help

@LiNk-NY
Contributor

LiNk-NY commented Jan 9, 2025

Hi @grainneireland

Can you use markdown code chunks to format the code? It is hard to read.

Note. Use the triple backticks to delimit a code chunk.

```r
<R code goes here>
```

@grainneireland
Author

Thank you for the advice! I've edited the original issue with the code chunk formatting - hopefully it will be easier to read.

@LiNk-NY
Contributor

LiNk-NY commented Jan 9, 2025

Hi Grainne, @grainneireland

Thank you for reporting. It looks like there is no data at the endpoint below, even though the molecularProfileId `genie_public_structural_variants` is listed:

suppressPackageStartupMessages(library(cBioPortalData))
genie <- cBioPortal(hostname = "genie.cbioportal.org", token = "~/Downloads/cbioportal_data_access_token.txt")
samps <- allSamples(genie, "genie_public")[["sampleId"]]
res <- genie$getDiscreteCopyNumbersInMolecularProfileUsingGET(
    sampleListId = "genie_public_all",
    molecularProfileId = "genie_public_structural_variants"
)
res
#> Response [https://genie.cbioportal.org/api/molecular-profiles/genie_public_structural_variants/discrete-copy-number?sampleListId=genie_public_all]
#>   Date: 2025-01-09 21:42
#>   Status: 404
#>   Content-Type: application/json
#>   Size: 75 B
httr::content(res)
#> $message
#> [1] "Molecular profile not found: genie_public_structural_variants"

I would contact the maintainers of the genie repository to ensure that the data is there. It may be in another location that I am not aware of.

I have asked on the cBioPortal slack for more information. You may also consider asking on the Google groups site: https://groups.google.com/g/cbioportal
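
For anyone hitting the same partial download, one way to see which profile fails is to request each profile separately and record the error instead of stopping at it. The following is a minimal sketch, not part of cBioPortalData: `fetch_each_profile` is a hypothetical helper, and the commented usage assumes the `aacr` API object created in the original post.

```r
# Hypothetical helper (not part of cBioPortalData): call `fetch` once per
# profile ID and record failures instead of aborting on the first error.
fetch_each_profile <- function(profile_ids, fetch) {
    results <- list()
    failures <- character()
    for (id in profile_ids) {
        # Return the condition object rather than letting the error propagate
        outcome <- tryCatch(fetch(id), error = function(e) e)
        if (inherits(outcome, "error")) {
            failures[[id]] <- conditionMessage(outcome)
        } else {
            results[[id]] <- outcome
        }
    }
    list(results = results, failures = failures)
}

# Usage against the live server (needs network access and a GENIE token):
# out <- fetch_each_profile(
#     c("genie_public_cna", "genie_public_mutations",
#       "genie_public_structural_variants"),
#     function(id) cBioPortalData(
#         api = aacr, studyId = "genie_public",
#         genes = "ABL1", by = "hugoGeneSymbol",
#         molecularProfileIds = id
#     )
# )
# out$failures   # named character vector: profile ID -> error message
```

Keeping the download call behind a `fetch` argument means the helper can be tried with a stub function before pointing it at the server.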

@grainneireland
Author

That is very useful to know - thank you for your help.

Downloading the copy-number-alterations and mutations molecular profiles returns the below warning message:

 “The build status for 'genie_public' is unknown. 
 Use 'downloadStudy()' to manually obtain the data.
  Proceed anyway? [y/n]:”

Would this affect the ability to download the complete datasets of the copy-number-alterations and mutations molecular profiles, or the format in which they are downloaded?

@LiNk-NY
Contributor

LiNk-NY commented Jan 9, 2025

Hi Grainne, @grainneireland

Based on what I see at https://genie.cbioportal.org/datasets (compared to https://cbioportal.org/datasets), bulk data download (as .tar.gz) is not available; thus, downloadStudy would not work in this case.

> Downloading the copy-number-alterations and mutations molecular profiles returns the below warning message:

That message applies to the original cBioPortal hostname, where we validate whether a particular studyId builds successfully. It is not really relevant for the genie.cbioportal.org site, since we don't run validation workflows for it.
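
If the prompt gets in the way (for example in a script), it may be possible to skip the build-status check altogether. The sketch below assumes that the installed version of `cBioPortalData()` accepts a `check_build` argument; confirm with `args(cBioPortalData::cBioPortalData)` before relying on it. `fetch_genie_gene` is a hypothetical wrapper name, not a package function.

```r
# Hypothetical wrapper; assumes `cBioPortalData()` accepts `check_build`
# (verify with `args(cBioPortalData::cBioPortalData)` on your install).
fetch_genie_gene <- function(api, gene, profile_ids) {
    cBioPortalData::cBioPortalData(
        api = api,
        studyId = "genie_public",
        genes = gene,
        by = "hugoGeneSymbol",
        molecularProfileIds = profile_ids,
        check_build = FALSE  # assumed to skip the build-status lookup and prompt
    )
}

# Usage (needs network access and a GENIE token):
# ABL1_cna <- fetch_genie_gene(aacr, "ABL1", "genie_public_cna")
```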

@grainneireland
Author

Thank you for your help!
