This project is a digital asset processing pipeline that implements the Submission Information Package (SIP) component of the Open Archival Information System (OAIS) reference model and follows the Metadata Encoding and Transmission Standard (METS). It processes and manages digital assets within a data archive by extracting metadata from METS files and organizing it into structured SIPs.
The system uses Dagster as its core data orchestrator, providing robust workflow management for complex archiving processes. The implementation ensures:
- OAIS SIP Processing: Implements the OAIS Submission Information Package model with structured metadata handling
- METS Standard Support: Full parsing and processing of METS XML files
- Data Validation: Robust validation using Pydantic models
- Scalable Architecture: Modular design for handling complex archiving workflows
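To make the METS parsing and validation points above more concrete, here is a minimal sketch of reading file entries from a METS `fileSec`. It is not the project's code: the project validates with Pydantic, while this sketch uses a standard-library dataclass as a dependency-free stand-in. The names `FileEntry`, `parse_files`, and `SAMPLE_METS` are hypothetical, and the sample document is invented for illustration.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

# Standard METS and XLink namespace URIs.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical sample document, not taken from this project.
SAMPLE_METS = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file ID="file-001" MIMETYPE="image/tiff" SIZE="2048"
                 CHECKSUM="9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"
                 CHECKSUMTYPE="SHA-256">
        <mets:FLocat LOCTYPE="URL" xlink:href="data/page1.tif"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for a validated file model."""

    file_id: str
    path: str
    mimetype: str
    size: int
    checksum: str
    checksum_type: str

    def __post_init__(self) -> None:
        # Minimal validation; a Pydantic model would declare these as field constraints.
        if not self.file_id:
            raise ValueError("file ID must not be empty")
        if self.size < 0:
            raise ValueError("size must be non-negative")


def parse_files(mets_xml: str) -> list[FileEntry]:
    """Collect every mets:file element together with its FLocat href."""
    root = ET.fromstring(mets_xml)
    entries = []
    for file_el in root.iterfind(".//mets:file", NS):
        flocat = file_el.find("mets:FLocat", NS)
        href = ""
        if flocat is not None:
            # Namespaced attributes use the {uri}name form in ElementTree.
            href = flocat.get(f"{{{NS['xlink']}}}href", "")
        entries.append(
            FileEntry(
                file_id=file_el.get("ID", ""),
                path=href,
                mimetype=file_el.get("MIMETYPE", ""),
                size=int(file_el.get("SIZE", "0")),
                checksum=file_el.get("CHECKSUM", ""),
                checksum_type=file_el.get("CHECKSUMTYPE", ""),
            )
        )
    return entries
```

Calling `parse_files(SAMPLE_METS)` returns a list with one `FileEntry` whose `path` is `data/page1.tif`. Real METS files can carry more structure (nested file groups, multiple locations per file), which this sketch does not handle.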
We recommend the fully automatic setup with Nix Flakes and Direnv:
- Clone the repository
- Allow direnv in the project directory:
direnv allow
This will automatically:
- Create a Python 3.12 virtual environment in .venv
- Install all dependencies using the UV package manager
- Set up the development environment
If you need to manually activate the environment without direnv:
nix develop
Dependencies are managed using UV, a modern Python package manager:
- pyproject.toml: Defines project dependencies (requires Python 3.12+)
- uv.lock: Locks dependencies to specific versions
Common UV commands:
# Install or update dependencies in .venv to match the lock file (also used for manual setup)
uv sync
# Update lock file
uv lock
Launch the Dagster web interface:
dagster dev
Access the UI at http://localhost:3000
The pipeline consists of the following components:
- Assets:
  - sip_asset: Parses METS XML files into a structured SIP model
  - intellectual_entities: Extracts and processes Intellectual Entity models
  - representations: Collects and processes file representations
  - files: Extracts and processes file metadata
  - fixities: Extracts and processes file checksums
- Jobs:
  - ingest_sip_job: Orchestrates the complete SIP creation process
- Sensors:
  - xml_file_sensor: Monitors for new METS XML files and triggers processing
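As an illustration of what working with fixity information can involve, the sketch below recomputes a file's checksum and compares it with the value recorded in METS. It is an assumption about a typical fixity check, not the project's `fixities` asset: the helpers `compute_checksum` and `verify_fixity` and the `ALGORITHMS` mapping are hypothetical names.

```python
import hashlib
from pathlib import Path

# Map common METS CHECKSUMTYPE values to hashlib algorithm names.
ALGORITHMS = {
    "MD5": "md5",
    "SHA-1": "sha1",
    "SHA-256": "sha256",
    "SHA-512": "sha512",
}


def compute_checksum(path: Path, checksum_type: str = "SHA-256",
                     chunk_size: int = 65536) -> str:
    """Hash the file in chunks so large assets are not loaded into memory at once."""
    digest = hashlib.new(ALGORITHMS[checksum_type])
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: Path, expected: str, checksum_type: str = "SHA-256") -> bool:
    """Return True when the recomputed checksum matches the recorded one."""
    actual = compute_checksum(path, checksum_type)
    # Hex digests are compared case-insensitively.
    return actual == expected.strip().lower()
```

A caller would pass the path from a file's `FLocat` together with its `CHECKSUM` and `CHECKSUMTYPE` attributes, for example `verify_fixity(Path("data/page1.tif"), recorded_checksum)`, where `recorded_checksum` is a placeholder for the value read from the METS file.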
Execute the test suite:
pytest da_pipeline_tests
- flake.nix: Defines the development environment and dependencies
- .envrc: Configures direnv to use the Nix flake
- pyproject.toml: Defines Python package metadata and dependencies
- workspace.yaml: Configures Dagster code locations
- uv.lock: Locks dependencies to specific versions