AStorage-Java

Overview

AStorage is a standalone, Java-based data server engineered for large-scale genomics data ingestion, normalization, and querying. Leveraging an embedded RocksDB store for ultra-fast key‑value operations and an event‑driven Vert.x HTTP engine, AStorage delivers sub‑millisecond access to variant records and annotations, making it ideally suited for interactive analyses and high‑throughput pipelines.

Built to handle the full spectrum of common genomics formats, AStorage provides configurable ingestion endpoints for:

FASTA reference genomes
dbNSFP v4.3a, gnomAD v4, SpliceAI v1.3, ClinVar, dbSNP and GERP variant annotations
PharmGKB drug‐gene interaction data
GTEx v8 expression matrices
GTF gene models

A universal variant repository can be populated during ingestion when the normalize=true flag is set, enabling a single, consistent schema for downstream queries.

Its RESTful API (documented via OpenAPI UI on http://localhost:8080/api) supports both single‑file and batch operations, with “drop repository” endpoints to safely clear incomplete or corrupted datasets prior to re‑ingestion. All behavior is governed by a simple config.json, allowing customization of storage paths, server port (default: 8080), and ingestion parameters without recompilation.

Key Features

High‑performance storage: RocksDB back‑end optimized for genomics data access
Reactive API server: Non‑blocking, event‑driven Vert.x framework
Batch queries: Submit large lists of genomic variants or annotation requests in a single HTTP call, with sub‑millisecond per‑variant latencies
Broad format support: Ingestion pipelines for all major annotation and reference datasets
Universal normalization: Consistent variant schema across heterogeneous sources
Safe re‑ingestion: Drop‑repository APIs to maintain data integrity
Self‑contained deployment: Packaged as a single JAR; auto‑creates storage directory on first run
Interactive docs: OpenAPI UI for live API exploration and testing

Supported Formats:

Fasta
dbNSFP v4.3a
gnomAD v4
SpliceAI v1.3
PharmGKB
ClinVar
GTEx v8
GTF
GERP
dbSNP

Formats mapped in the universal variant query:

dbNSFP v4.3a
gnomAD v4
SpliceAI v1.3
ClinVar
GERP
dbSNP

Setup: Building and Running [Linux/macOS]

Clone the master branch and package the application as a JAR file:

git clone git@github.com:ForomePlatform/AStorage-Java.git
cd AStorage-Java
./mvnw clean package

The JAR file will be generated inside the target directory as astorage-java-1.0.0.jar.

On first run, the application creates its data folder in the user’s home directory, named AStorage unless configured otherwise.
The service listens on port 8080 unless configured otherwise.

Note: These properties can be adjusted using a config.json file.

config.json example:

{
    "dataDirectoryPath": "/home/user/ExampleStorage",
    "serverPort": 8080
}

To start the application run:

cd target
java -jar astorage-java-1.0.0.jar [config_json_path]

Note: AStorage writes its logs (e.g. ingestion progress) to <dataDirectoryPath>/output_<currentTimeMillis>.log. Some output is also printed to the terminal where the program is running.
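Since every run writes a new output_<currentTimeMillis>.log, it can help to locate the newest one before following ingestion progress. The following is a minimal sketch, not part of AStorage: latest_log is a hypothetical helper, it assumes the default data directory ~/AStorage when no directory is given, and it relies on the timestamps having the same number of digits so that name order matches time order.

```shell
# latest_log [DATA_DIR]: print the newest output_<millis>.log in DATA_DIR.
# Hypothetical helper; DATA_DIR defaults to ~/AStorage (the documented default).
latest_log() {
  local f latest=""
  for f in "${1:-$HOME/AStorage}"/output_*.log; do
    [ -e "$f" ] || continue   # the glob matched nothing
    latest=$f                 # globs expand in sorted order, so the last match is the newest
  done
  # Print the result; return non-zero when no log file was found.
  [ -n "$latest" ] && printf '%s\n' "$latest"
}
```

To follow the current log: tail -f "$(latest_log)"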

For the detailed API specification, open the OpenAPI UI at http://localhost:8080/api.
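When ingestion is scripted, the first request can arrive before the server has finished starting. A minimal sketch of a readiness check follows. wait_for_server is a hypothetical helper, and it assumes the OpenAPI UI URL answers without an HTTP error once the server is ready, which the source does not state explicitly.

```shell
# wait_for_server [URL] [TRIES]: poll URL once per second until it answers.
# Hypothetical helper; assumes the OpenAPI UI URL responds once the server is up.
wait_for_server() {
  local url="${1:-http://localhost:8080/api}"
  local tries="${2:-30}"
  while [ "$tries" -gt 0 ]; do
    # -f: treat HTTP errors (>= 400) as failure, -s: silent, -o: discard the body
    if curl -fs -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  echo "server not reachable at $url" >&2
  return 1
}
```

For example, after starting the JAR in the background: wait_for_server && echo "AStorage is up"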

Setup: Ingestion

Important Note on Data Ingestion!

If a previous ingestion failed or reported errors, drop the repository for that format with the provided Drop Repository API before ingesting again. This avoids overlaps, duplicates, and other data inconsistencies.

Always ensure that any failed or corrupted repository is properly cleared before attempting another ingestion.

Note

The UniversalVariant repository stores normalized variants for supported formats when the normalize parameter is set to true during ingestion. For example, if an error occurs while ingesting ClinVar data, dropping the ClinVar repository using the provided API will automatically remove ClinVar-related data from the UniversalVariant repository. However, dropping the entire UniversalVariant repository will remove all normalized data across every format.

Fasta:

Download the reference genome (GRCh38.p14_genomic) and its assembly report (GRCh38.p14_assembly_report), then run ingestion:

curl -X POST "http://localhost:8080/ingestion/fasta?refBuild=GRCh38&dataPath={dataPath}&metadataPath={assemblyReportPath}"

API reference: Fasta Ingestion.
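File paths that contain spaces or characters such as & would break the query string if pasted into the URL directly. One way around this is to let curl encode the values. The sketch below is an assumption-laden wrapper, not part of AStorage: ingest_fasta, CURL and BASE_URL are hypothetical names, while the endpoint and parameter names are the ones from the command above.

```shell
# ingest_fasta REF_BUILD FASTA_PATH ASSEMBLY_REPORT_PATH
# Hypothetical wrapper around the Fasta ingestion request.
# Set CURL=echo to print the arguments instead of sending the request.
ingest_fasta() {
  # --data-urlencode percent-encodes each value; -G moves the encoded data
  # into the query string; -X POST keeps the request method as POST.
  "${CURL:-curl}" -fsS -G -X POST \
    --data-urlencode "refBuild=$1" \
    --data-urlencode "dataPath=$2" \
    --data-urlencode "metadataPath=$3" \
    "${BASE_URL:-http://localhost:8080}/ingestion/fasta"
}
```

Usage, with placeholder paths: ingest_fasta GRCh38 "{dataPath}" "{assemblyReportPath}"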

dbNSFP:

Download the entire dbNSFP database (dbNSFP4.3a), extract the archive, and run ingestion for each per-chromosome variant file, one at a time:

curl -X POST "http://localhost:8080/ingestion/dbnsfp?dataPath={chrDataPath}"

API reference: dbNSFP Ingestion.
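The per-chromosome ingestion can be scripted as a loop. The sketch below rests on several assumptions: the file pattern *_variant.chr*.gz is a guess at how the extracted files are named, each request is assumed to return only after its ingestion has finished (if the server answers earlier, check the log before sending the next file), and ingest_dbnsfp_dir, CURL and BASE_URL are hypothetical names.

```shell
# ingest_dbnsfp_dir DIR: ingest every per-chromosome file in DIR, one at a time.
# The file pattern is an assumption about the extracted archive; adjust as needed.
# Set CURL=echo to print the arguments instead of sending the requests.
ingest_dbnsfp_dir() {
  local f
  for f in "$1"/*_variant.chr*.gz; do
    [ -e "$f" ] || { echo "no chromosome files in $1" >&2; return 1; }
    echo "ingesting $f" >&2
    # Stop at the first failure so the repository can be dropped before retrying.
    "${CURL:-curl}" -fsS -G -X POST --data-urlencode "dataPath=$f" \
      "${BASE_URL:-http://localhost:8080}/ingestion/dbnsfp" || return 1
  done
}
```

Stopping at the first error keeps later files from being added on top of a partially ingested repository.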

gnomAD:

Download available exomes and genomes from: gnomAD v4 and ingest the downloaded files:

Note: If you set the normalize parameter to true, the GRCh38 Fasta reference must already be ingested into AStorage.

curl -X POST "http://localhost:8080/ingestion/gnomad?dataPath={dataPath}&sourceType={sourceType}&normalize=true&refBuild=GRCh38"

API reference: gnomAD Ingestion.

SpliceAI:

Access the SpliceAI annotations here: SpliceAI v1.3. You will need an Illumina account.

From the Illumina Sequence Hub Projects tab open the added project: Predicting splicing from primary sequence, then open genome_scores_v1.3, click on FILES and download spliceai_scores.raw.indel.hg38.vcf.gz and spliceai_scores.raw.snv.hg38.vcf.gz.

Run the ingestion for each data file:

Note: If you set the normalize parameter to true, the GRCh38 Fasta reference must already be ingested into AStorage.

curl -X POST "http://localhost:8080/ingestion/spliceai?dataPath={dataPath}&normalize=true&refBuild=GRCh38"

API reference: SpliceAI Ingestion.

PharmGKB:

Download the appropriate data files from: PharmGKB Downloads and ingest the downloaded files:

Note: Supported data types: CA, CAmeta, CAmeta2CA, SPA, VDA, VDA2SPA, VFA, VFA2SPA, VPA, VPA2SPA

curl -X POST "http://localhost:8080/ingestion/pharmgkb?dataType={dataType}&dataPath={dataPath}"

API reference: PharmGKB Ingestion.
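Because the dataType value must be one of the supported types listed above, a small check before sending the request can catch typos early. This is a sketch only: ingest_pharmgkb, CURL and BASE_URL are hypothetical names, and the list of types is copied from the note above rather than queried from the server.

```shell
# ingest_pharmgkb DATA_TYPE FILE: validate DATA_TYPE, then send the request.
# Hypothetical wrapper; set CURL=echo to print the arguments instead.
ingest_pharmgkb() {
  case "$1" in
    CA|CAmeta|CAmeta2CA|SPA|VDA|VDA2SPA|VFA|VFA2SPA|VPA|VPA2SPA) ;;
    *) echo "unsupported PharmGKB data type: $1" >&2; return 2 ;;
  esac
  # -G with --data-urlencode puts the encoded values into the query string;
  # -X POST keeps the request method as POST.
  "${CURL:-curl}" -fsS -G -X POST \
    --data-urlencode "dataType=$1" \
    --data-urlencode "dataPath=$2" \
    "${BASE_URL:-http://localhost:8080}/ingestion/pharmgkb"
}
```

Usage, with a placeholder path: ingest_pharmgkb VDA "{dataPath}"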

ClinVar:

Download the latest ClinVar release: ClinVarFullRelease_00-latest and its variant summary: variant_summary and ingest the downloaded files:

Note: If you set the normalize parameter to true, the required Fasta reference genomes must already be ingested into AStorage.

curl -X POST "http://localhost:8080/ingestion/clinvar?dataPath={dataPath}&dataSummaryPath={dataSummaryPath}&normalize=true"

API reference: ClinVar Ingestion.

GTEx:

Download the GTEx v8 bulk tissue expression data: GTEx_Analysis_2017-06-05_v8 and ingest the downloaded file:

curl -X POST "http://localhost:8080/ingestion/gtex?dataPath={dataPath}"

API reference: GTEx Ingestion.

GTF:

Download the GRCh38 GTF data file: Homo_sapiens.GRCh38.111.chr and ingest the downloaded file:

curl -X POST "http://localhost:8080/ingestion/gtf?dataPath={dataPath}"

API reference: GTF Ingestion.

GERP:

Retrieve the necessary GERP rates files for each chromosome and ingest the downloaded files one by one:

curl -X POST "http://localhost:8080/ingestion/gerp?dataPath={dataPath}"

API reference: GERP Ingestion.

dbSNP:

Download the complete dbSNP data: 00-All and ingest the downloaded file:

curl -X POST "http://localhost:8080/ingestion/dbsnp?dataPath={dataPath}"

API reference: dbSNP Ingestion.

Additional Notes

Batch-query parameters match the single-query parameters for every format.
To use the normalization service, first ingest the appropriate reference genome builds (e.g. GRCh38 and GRCh37) into Fasta.
Batch normalization uses the same approach as batch queries.
