This repo contains the official code of Orthrus.
If you use Orthrus in your work, please cite:

```
@inproceedings{jiang2025,
  title={{ORTHRUS: Achieving High Quality of Attribution in Provenance-based Intrusion Detection Systems}},
  author={Jiang, Baoxiang and Bilot, Tristan and El Madhoun, Nour and Al Agha, Khaldoun and Zouaoui, Anis and Iqbal, Shahrear and Han, Xueyuan and Pasquier, Thomas},
  booktitle={Security Symposium (USENIX Sec'25)},
  year={2025},
  organization={USENIX}
}
```
You can find the paper preprint here.
- clone the repo with submodules:
  ```
  git clone --recurse-submodules https://github.com/ubc-provenance/orthrus.git
  ```
- install the environment and requirements (guidelines).
- create a new folder and download all `tar.gz` files from a specific DARPA dataset (follow the link provided for DARPA E3 here and DARPA E5 here). If using the CLI, use `gdown` with the file ID taken directly from the Google Drive URL (see the download sketch after this list).
- in the same folder, download the Java binary used to build the avro files, as well as the schema folder (it can be downloaded with `gdown --folder {ID}`).
- follow the guidelines to convert the bin files to JSON files.
- create the Postgres databases (guidelines, replace `database_name` with the name of the downloaded dataset); a sketch is given after this list.
- optionally, if using a specific Postgres host/user, update the connection config by setting `DATABASE_DEFAULT_CONFIG` within `src/config.py`.
- optionally, `ROOT_ARTIFACT_DIR` within `src/config.py` can be changed. All preprocessed files and model weights are stored there when the code runs.
- go to `src/config.py`, search for `DATASET_DEFAULT_CONFIG`, and set the `raw_dir` variable of your downloaded dataset to the path of the folder containing the uncompressed JSON files.
- fill the database for the corresponding dataset by running this command, using your installed conda env:

  ```
  python src/create_database.py [CLEARSCOPE_E3 | CADETS_E3 | THEIA_E3 | CLEARSCOPE_E5 | CADETS_E5 | THEIA_E5]
  ```
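For concreteness, here is a minimal sketch of the download step. The file and folder IDs and the archive name below are placeholders, not real values; take the actual IDs from the Google Drive URLs linked above.

```
# Sketch only: IDs and archive names are placeholders.
pip install gdown

# Download one tar.gz archive by its Google Drive file ID
# (the long token in the shared URL).
gdown 1AbCdEfGhIjKlMnOpQrStUv -O cadets_e3.tar.gz

# Download the schema folder by its folder ID.
gdown --folder 1ZyXwVuTsRqPoNmLkJiHg

# Uncompress the archive before converting bin files to JSON.
tar -xzf cadets_e3.tar.gz
```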
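Similarly, a minimal sketch of the database step, assuming a local Postgres install with the default `postgres` superuser (the authoritative setup is in the linked guidelines):

```
# Sketch only: replace database_name with the name of the downloaded
# dataset, e.g. cadets_e3, as described in the guidelines.
sudo -u postgres psql -c "CREATE DATABASE database_name;"

# Then fill it from the uncompressed JSON files (raw_dir must point
# to them in src/config.py first).
python src/create_database.py CADETS_E3
```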
Note: Large storage capacity is needed to download, parse and save datasets and databases.
Note: Large storage capacity is needed to run experiments. A single run can generate more than 15GB of artifact files on E3 datasets, and much more with larger E5 datasets.
Launching Orthrus is as simple as running:
```
python src/orthrus.py [dataset] [config args...]
```
Running `orthrus.py` will by default run the `graph_construction`, `edge_featurization`, `detection`, and `attack_reconstruction` tasks configured within the `config/orthrus.yml` file. This configuration can be updated directly in the YML file or from the CLI, as shown above.
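For example, any value in `config/orthrus.yml` can be overridden from the CLI with a dotted flag whose path mirrors the YML structure (the combination below is illustrative, not a recommended setting):

```
# Override the graph-construction time window and the GNN learning
# rate for a single run, without editing the YML file.
python src/orthrus.py CADETS_E3 \
    --graph_construction.build_graphs.time_window_size=1.0 \
    --detection.gnn_training.lr=0.0001
```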
To reproduce the experimental results of Orthrus on node detection:

**CADETS_E3**
```
python src/orthrus.py CADETS_E3 --detection.gnn_training.num_epochs=20 --detection.gnn_training.encoder.graph_attention.dropout=0.25 --detection.evaluation.node_evaluation.kmeans_top_K=30
```

**THEIA_E3**
```
python src/orthrus.py THEIA_E3
```

**CLEARSCOPE_E3**
```
python src/orthrus.py CLEARSCOPE_E3 --graph_construction.build_graphs.time_window_size=1.0 --detection.gnn_training.encoder.graph_attention.dropout=0.1
```

**CADETS_E5**
```
python src/orthrus.py CADETS_E5
```

**THEIA_E5**
```
python src/orthrus.py THEIA_E5 --detection.gnn_training.lr=0.000005
```

**CLEARSCOPE_E5**
```
python src/orthrus.py CLEARSCOPE_E5 --detection.gnn_training.num_epochs=10 --detection.gnn_training.lr=0.0001 --detection.gnn_training.encoder.graph_attention.dropout=0.25
```
Once run, datasets are preprocessed and stored under the `ROOT_ARTIFACT_DIR` path within `config.py`, so there is no need to recompute them. To avoid re-computing the `graph_construction` and `edge_featurization` tasks, Orthrus can be run directly from the `detection` task using the `--run_from_training` arg:

```
python src/orthrus.py CADETS_E3 --run_from_training
```
W&B is used as the default interface to visualize and track experiments. First, log into your account from the CLI:

```
wandb login
```

Set your API key, which can be found on the website. You can then push the logs and results of experiments to the interface using the `--wandb` arg:

```
python src/orthrus.py THEIA_E3 --wandb
```

The preferred solution is to run the `run.sh` script, which directly logs the experiments to the W&B interface.
See the licence file.