AStorage is a standalone, Java-based data server engineered for large-scale genomics data ingestion, normalization, and querying. Leveraging an embedded RocksDB store for ultra-fast key‑value operations and an event‑driven Vert.x HTTP engine, AStorage delivers sub‑millisecond access to variant records and annotations, making it ideally suited for interactive analyses and high‑throughput pipelines.
Built to handle the full spectrum of common genomics formats, AStorage provides configurable ingestion endpoints for:
-
FASTA reference genomes
-
dbNSFP v4.3a, gnomAD v4, SpliceAI v1.3, ClinVar, dbSNP and GERP variant annotations
-
PharmGKB drug‐gene interaction data
-
GTEx v8 expression matrices
-
GTF gene models
A universal variant repository can be populated during ingestion when the normalize=true
flag is set, enabling a single, consistent schema for downstream queries.
Its RESTful API (documented via OpenAPI UI on http://localhost:8080/api
) supports both single‑file and batch operations, with “drop repository” endpoints to safely clear incomplete or corrupted datasets prior to re‑ingestion.
All behavior is governed by a simple config.json
, allowing customization of storage paths, server port (default: 8080), and ingestion parameters without recompilation.
-
High‑performance storage: RocksDB back‑end optimized for genomics data access
-
Reactive API server: Non‑blocking, event‑driven Vert.x framework
-
Batch queries: submit large lists of genomic variants or annotation requests in a single HTTP call, with sub‑millisecond per‑variant latencies
-
Broad format support: Ingestion pipelines for all major annotation and reference datasets
-
Universal normalization: Consistent variant schema across heterogeneous sources
-
Safe re‑ingestion: Drop‑repository APIs to maintain data integrity
-
Self‑contained deployment: Packaged as a single JAR; auto‑creates storage directory on first run
-
Interactive docs: OpenAPI UI for live API exploration and testing
Clone the master branch and package the application as a JAR file:
git clone git@github.com:ForomePlatform/AStorage-Java.git
cd AStorage-Java
./mvnw clean package
The JAR file will be generated inside the target directory as astorage-java-1.0.0.jar.
-
On the first run the application creates a data folder in the user’s home directory with the name AStorage by default if not specified otherwise.
-
The service is running on port 8080 by default if not specified otherwise.
Note
|
These properties can be adjusted using a config.json file. |
config.json example:
{
"dataDirectoryPath": "/home/user/ExampleStorage",
"serverPort": 8080
}
To start the application run:
cd target
java -jar astorage-java-1.0.0.jar [config_json_path]
Note
|
AStorage logs(e.g. ingestion progress) are being written in <dataDirectoryPath>/output_<currentTimeMillis>.log file. Some of the output is printed in terminal where the program is being run. |
For detailed API specification access the OpenAPI UI via: http://localhost:8080/api.
To avoid issues such as overlaps, duplicates, or data inconsistencies, it is crucial to drop the specific repository corresponding to the format being ingested using the provided Drop Repository API if the previous ingestion was unsuccessful or encountered errors.
Always ensure that any failed or corrupted repository is properly cleared before attempting another ingestion.
Note
|
The UniversalVariant repository stores normalized variants for supported formats when the normalize parameter is set to true during ingestion. For example, if an error occurs while ingesting ClinVar data, dropping the ClinVar repository using the provided API will automatically remove ClinVar-related data from the UniversalVariant repository. However, dropping the entire UniversalVariant repository will remove all normalized data across every format. |
Download the reference genome: GRCh38.p14_genomic and its assembly report: GRCh38.p14_assembly_report and run ingestion:
curl -X POST "http://localhost:8080/ingestion/fasta?refBuild=GRCh38&dataPath={dataPath}&metadataPath={assemblyReportPath}"
API reference: Fasta Ingestion.
Download the entire dbNSFP database: dbNSFP4.3a, extract the downloaded content and run ingestion for each chromosome variant one by one:
curl -X POST "http://localhost:8080/ingestion/dbnsfp?dataPath={chrDataPath}"
API reference: dbNSFP Ingestion.
Download available exomes and genomes from: gnomAD v4 and ingest the downloaded files:
Note
|
If you set the normalize parameter to true for ingestion Fasta GRCh38 should already be ingested into the AStorage. |
curl -X POST "http://localhost:8080/ingestion/gnomad?dataPath={dataPath}&sourceType={sourceType}&normalize=true&refBuild=GRCh38"
API reference: gnomAD Ingestion.
Access the SpliceAI annotations here: SpliceAI v1.3 for which you’ll need an account of Illumina.
From the Illumina Sequence Hub Projects tab open the added project: Predicting splicing from primary sequence, then open genome_scores_v1.3, click on FILES and download spliceai_scores.raw.indel.hg38.vcf.gz and spliceai_scores.raw.snv.hg38.vcf.gz.
Run the ingestion for each data file:
Note
|
If you set the normalize parameter to true for ingestion Fasta GRCh38 should already be ingested into the AStorage. |
curl -X POST "http://localhost:8080/ingestion/spliceai?dataPath={dataPath}&normalize=true&refBuild=GRCh38"
API reference: SpliceAI Ingestion.
Download the appropriate data files from: PharmGKB Downloads and ingest the downloaded files:
Note
|
Types of supported data: CA, CAmeta, CAmeta2CA, SPA, VDA, VDA2SPA, VFA, VFA2SPA, VPA, VPA2SPA |
curl -X POST "http://localhost:8080/ingestion/pharmgkb?dataType={dataType}&dataPath={dataPath}"
API reference: PhramGKB Ingestion.
Download the latest ClinVar release: ClinVarFullRelease_00-latest and its variant summary: variant_summary and ingest the downloaded files:
Note
|
If you set the normalize parameter to true for ingestion required Fasta reference genomes should already be ingested into the AStorage. |
curl -X POST "http://localhost:8080/ingestion/clinvar?dataPath={dataPath}&dataSummaryPath={dataSummaryPath}&normalize=true"
API reference: ClinVar Ingestion.
Download the GTEx v8 bulk tissue expression data: GTEx_Analysis_2017-06-05_v8 and ingest the downloaded file:
curl -X POST "http://localhost:8080/ingestion/gtex?dataPath={dataPath}"
API reference: GTEx Ingestion.
Download the GRCh38 GTF data file: Homo_sapiens.GRCh38.111.chr and ingest the downloaded file:
curl -X POST "http://localhost:8080/ingestion/gtf?dataPath={dataPath}"
API reference: GTF Ingestion.
Retrieve the necessary GERP rates files for each chromosome and ingest the downloaded files one by one:
curl -X POST "http://localhost:8080/ingestion/gerp?dataPath={dataPath}"
API reference: GERP Ingestion.
Download the complete dbSNP data: 00-All and ingest the downloaded file:
curl -X POST "http://localhost:8080/ingestion/dbsnp?dataPath={dataPath}"
API reference: dbSNP Ingestion.