
AStorage-Java

Vert.x 4.5.9 · RocksDB 9.4.0

Overview

AStorage is a standalone, Java-based data server engineered for large-scale genomics data ingestion, normalization, and querying. Leveraging an embedded RocksDB store for ultra-fast key‑value operations and an event‑driven Vert.x HTTP engine, AStorage delivers sub‑millisecond access to variant records and annotations, making it ideally suited for interactive analyses and high‑throughput pipelines.

Built to handle the full spectrum of common genomics formats, AStorage provides configurable ingestion endpoints for:

  • FASTA reference genomes

  • dbNSFP v4.3a, gnomAD v4, SpliceAI v1.3, ClinVar, dbSNP and GERP variant annotations

  • PharmGKB drug‐gene interaction data

  • GTEx v8 expression matrices

  • GTF gene models

A universal variant repository can be populated during ingestion when the normalize=true flag is set, enabling a single, consistent schema for downstream queries.

Its RESTful API (documented via OpenAPI UI on http://localhost:8080/api) supports both single‑file and batch operations, with “drop repository” endpoints to safely clear incomplete or corrupted datasets prior to re‑ingestion. All behavior is governed by a simple config.json, allowing customization of storage paths, server port (default: 8080), and ingestion parameters without recompilation.

Key Features

  • High‑performance storage: RocksDB back‑end optimized for genomics data access

  • Reactive API server: Non‑blocking, event‑driven Vert.x framework

  • Batch queries: submit large lists of genomic variants or annotation requests in a single HTTP call, with sub‑millisecond per‑variant latencies

  • Broad format support: Ingestion pipelines for all major annotation and reference datasets

  • Universal normalization: Consistent variant schema across heterogeneous sources

  • Safe re‑ingestion: Drop‑repository APIs to maintain data integrity

  • Self‑contained deployment: Packaged as a single JAR; auto‑creates storage directory on first run

  • Interactive docs: OpenAPI UI for live API exploration and testing

Supported Formats:

  • Fasta

  • dbNSFP v4.3a

  • gnomAD v4

  • SpliceAI v1.3

  • PharmGKB

  • ClinVar

  • GTEx v8

  • GTF

  • GERP

  • dbSNP

Formats mapped in the universal variant query:

  • dbNSFP v4.3a

  • gnomAD v4

  • SpliceAI v1.3

  • ClinVar

  • GERP

  • dbSNP

Setup: Building and Running [Linux/macOS]

Clone the master branch and package the application as a JAR file:

git clone git@github.com:ForomePlatform/AStorage-Java.git
cd AStorage-Java
./mvnw clean package

The JAR file will be generated inside the target directory as astorage-java-1.0.0.jar.

  • On first run, the application creates a data folder named AStorage in the user’s home directory, unless configured otherwise.

  • The service listens on port 8080, unless configured otherwise.

Note
These properties can be adjusted using a config.json file.

config.json example:

{
    "dataDirectoryPath": "/home/user/ExampleStorage",
    "serverPort": 8080
}
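
If the configuration is generated from a deployment script, the file above can be produced programmatically. The following is a minimal sketch: `write_config` is a hypothetical helper, not part of AStorage, and it only uses the two keys shown in the example (`dataDirectoryPath` and `serverPort`).

```python
import json
from pathlib import Path


def write_config(path, data_directory, server_port=8080):
    """Write a config.json containing the two keys shown in the README example."""
    if not 1 <= server_port <= 65535:
        raise ValueError(f"invalid port: {server_port}")
    config = {
        "dataDirectoryPath": str(data_directory),
        "serverPort": server_port,
    }
    # indent=4 matches the layout of the example file
    Path(path).write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config
```

Calling `write_config("config.json", "/home/user/ExampleStorage")` writes a file equivalent to the example; its path can then be passed as the optional argument when starting the JAR.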

To start the application run:

cd target
java -jar astorage-java-1.0.0.jar [config_json_path]

Note
AStorage logs (e.g. ingestion progress) are written to <dataDirectoryPath>/output_<currentTimeMillis>.log. Some output is also printed to the terminal in which the program runs.
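
Because each run writes a new timestamped log file, it can be handy to locate the most recent one when following ingestion progress. The sketch below relies only on the `output_<currentTimeMillis>.log` naming described in the note; `latest_log` is a hypothetical helper name, and the data directory is whatever you configured.

```python
from pathlib import Path


def latest_log(data_directory):
    """Return the newest output_<millis>.log in data_directory, or None."""
    candidates = []
    for path in Path(data_directory).glob("output_*.log"):
        stamp = path.stem[len("output_"):]
        if stamp.isdigit():  # ignore files that do not follow the naming scheme
            candidates.append((int(stamp), path))
    if not candidates:
        return None
    # The largest millisecond timestamp belongs to the most recent run
    return max(candidates, key=lambda item: item[0])[1]
```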

For the detailed API specification, open the OpenAPI UI at http://localhost:8080/api.

Setup: Ingestion

Important Note on Data Ingestion!

If a previous ingestion failed or encountered errors, drop the repository for that format with the Drop Repository API before ingesting again. This avoids overlaps, duplicates, and data inconsistencies.

Always ensure that any failed or corrupted repository is properly cleared before attempting another ingestion.

Note
The UniversalVariant repository stores normalized variants for supported formats when the normalize parameter is set to true during ingestion. For example, if an error occurs while ingesting ClinVar data, dropping the ClinVar repository using the provided API will automatically remove ClinVar-related data from the UniversalVariant repository. However, dropping the entire UniversalVariant repository will remove all normalized data across every format.

Fasta:

Download the reference genome (GRCh38.p14_genomic) and its assembly report (GRCh38.p14_assembly_report), then run the ingestion:

curl -X POST "http://localhost:8080/ingestion/fasta?refBuild=GRCh38&dataPath={dataPath}&metadataPath={assemblyReportPath}"

API reference: Fasta Ingestion.
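
When the call is scripted rather than typed, the file paths should be URL-encoded before they are placed in the query string. The sketch below builds the same POST request as the curl command above, using only the Python standard library. The function name and `BASE_URL` constant are illustrative; the endpoint and parameter names are those shown in the curl command.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8080"  # default port from this README


def fasta_ingestion_request(data_path, metadata_path, ref_build="GRCh38"):
    """Build the POST request for /ingestion/fasta with URL-encoded parameters."""
    query = urlencode({
        "refBuild": ref_build,
        "dataPath": data_path,
        "metadataPath": metadata_path,
    })
    return Request(f"{BASE_URL}/ingestion/fasta?{query}", method="POST")
```

The returned object can be passed to `urllib.request.urlopen` to send it. The response body format is not described here, so consult the API reference before parsing it.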

dbNSFP:

Download the entire dbNSFP database (dbNSFP4.3a), extract the downloaded content, and run the ingestion for each per-chromosome variant file, one at a time:

curl -X POST "http://localhost:8080/ingestion/dbnsfp?dataPath={chrDataPath}"

API reference: dbNSFP Ingestion.
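
Ingesting the per-chromosome files one at a time is easy to script. The sketch below is an illustration under stated assumptions: the file name pattern `dbNSFP4.3a_variant.chr{chrom}.gz` and the chromosome list are guesses about the extracted archive, so adjust them to match the files you actually have. Whether the endpoint expects compressed or decompressed files is also not stated here; check the API reference.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8080"  # default port from this README
# Assumed chromosome labels; adjust to the files present in your download
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "M"]


def dbnsfp_ingestion_urls(extracted_dir, pattern="dbNSFP4.3a_variant.chr{chrom}.gz"):
    """Return one /ingestion/dbnsfp URL per chromosome file, in order."""
    base_dir = extracted_dir.rstrip("/")
    urls = []
    for chrom in CHROMOSOMES:
        data_path = base_dir + "/" + pattern.format(chrom=chrom)
        query = urlencode({"dataPath": data_path})
        urls.append(f"{BASE_URL}/ingestion/dbnsfp?{query}")
    return urls


def ingest_all(urls):
    """POST each URL in turn; each call waits for the response before the next."""
    for url in urls:
        with urlopen(Request(url, method="POST")) as response:
            print(url, response.status)
```

A failed request raises an exception and stops the loop, which leaves a clear point at which to drop the dbNSFP repository before retrying.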

gnomAD:

Download available exomes and genomes from: gnomAD v4 and ingest the downloaded files:

Note
If you set the normalize parameter to true, the GRCh38 Fasta reference must already be ingested into AStorage.

curl -X POST "http://localhost:8080/ingestion/gnomad?dataPath={dataPath}&sourceType={sourceType}&normalize=true&refBuild=GRCh38"

API reference: gnomAD Ingestion.

SpliceAI:

Access the SpliceAI annotations at SpliceAI v1.3; this requires an Illumina account.

From the Illumina Sequence Hub Projects tab open the added project: Predicting splicing from primary sequence, then open genome_scores_v1.3, click on FILES and download spliceai_scores.raw.indel.hg38.vcf.gz and spliceai_scores.raw.snv.hg38.vcf.gz.

Run the ingestion for each data file:

Note
If you set the normalize parameter to true, the GRCh38 Fasta reference must already be ingested into AStorage.

curl -X POST "http://localhost:8080/ingestion/spliceai?dataPath={dataPath}&normalize=true&refBuild=GRCh38"

API reference: SpliceAI Ingestion.

PharmGKB:

Download the appropriate data files from: PharmGKB Downloads and ingest the downloaded files:

Note
Supported data types: CA, CAmeta, CAmeta2CA, SPA, VDA, VDA2SPA, VFA, VFA2SPA, VPA, VPA2SPA

curl -X POST "http://localhost:8080/ingestion/pharmgkb?dataType={dataType}&dataPath={dataPath}"

API reference: PharmGKB Ingestion.
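
Since the dataType parameter accepts only the values listed in the note, a script can reject typos before anything is sent to the server. The following sketch assumes nothing beyond that list and the curl command above; the function name is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080"  # default port from this README
# Data types listed in the note above
SUPPORTED_TYPES = (
    "CA", "CAmeta", "CAmeta2CA", "SPA", "VDA",
    "VDA2SPA", "VFA", "VFA2SPA", "VPA", "VPA2SPA",
)


def pharmgkb_ingestion_url(data_type, data_path):
    """Build the /ingestion/pharmgkb URL, rejecting unknown data types early."""
    if data_type not in SUPPORTED_TYPES:
        raise ValueError(
            f"unsupported dataType {data_type!r}; expected one of {SUPPORTED_TYPES}"
        )
    query = urlencode({"dataType": data_type, "dataPath": data_path})
    return f"{BASE_URL}/ingestion/pharmgkb?{query}"
```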

ClinVar:

Download the latest ClinVar release: ClinVarFullRelease_00-latest and its variant summary: variant_summary and ingest the downloaded files:

Note
If you set the normalize parameter to true, the required Fasta reference genomes must already be ingested into AStorage.

curl -X POST "http://localhost:8080/ingestion/clinvar?dataPath={dataPath}&dataSummaryPath={dataSummaryPath}&normalize=true"

API reference: ClinVar Ingestion.

GTEx:

Download the GTEx v8 bulk tissue expression data: GTEx_Analysis_2017-06-05_v8 and ingest the downloaded file:

curl -X POST "http://localhost:8080/ingestion/gtex?dataPath={dataPath}"

API reference: GTEx Ingestion.

GTF:

Download the GRCh38 GTF data file: Homo_sapiens.GRCh38.111.chr and ingest the downloaded file:

curl -X POST "http://localhost:8080/ingestion/gtf?dataPath={dataPath}"

API reference: GTF Ingestion.

GERP:

Retrieve the necessary GERP rates files for each chromosome and ingest the downloaded files one by one:

curl -X POST "http://localhost:8080/ingestion/gerp?dataPath={dataPath}"

API reference: GERP Ingestion.

dbSNP:

Download the complete dbSNP data: 00-All and ingest the downloaded file:

curl -X POST "http://localhost:8080/ingestion/dbsnp?dataPath={dataPath}"

API reference: dbSNP Ingestion.

Additional Notes

  • Batch-query parameters match single-query parameters for every format.

  • To use the normalization service, the appropriate genome reference builds (e.g. GRCh38 and GRCh37) must be ingested into Fasta first.

  • Batch normalization uses the same approach as batch queries.
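
The ordering requirement for normalization can be checked before a long ingestion run starts. The sketch below is a purely client-side illustration and not part of AStorage. It walks an ordered plan of ingestion calls and reports any step that sets normalize=true with a refBuild whose Fasta ingestion has not appeared earlier in the plan. Steps that normalize without an explicit refBuild (as in the ClinVar example) are not checked here, because the builds they require are not specified in this document.

```python
def find_missing_references(plan):
    """Return (step_index, ref_build) pairs that would normalize too early.

    `plan` is an ordered list of (format_name, params) tuples, where params
    mirrors the query parameters of the corresponding ingestion call.
    """
    ingested_builds = set()
    missing = []
    for index, (format_name, params) in enumerate(plan):
        if format_name == "fasta":
            ingested_builds.add(params["refBuild"])
            continue
        wants_normalize = str(params.get("normalize", "false")).lower() == "true"
        ref_build = params.get("refBuild")
        # Only steps that name a refBuild can be checked against the plan
        if wants_normalize and ref_build and ref_build not in ingested_builds:
            missing.append((index, ref_build))
    return missing
```

An empty result means every checked step is preceded by the Fasta ingestion for its build; it says nothing about whether those ingestions actually succeeded.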
