Hooks & Providers
Lifecycle hooks and data download providers in Optimus.
Optimus uses Kedro Hooks to inject behavior at key points in the pipeline lifecycle. Three hooks are registered, and Kedro executes them in LIFO (last-in, first-out) order, so the last hook registered runs first:
# settings.py
HOOKS = (QualityChecksHooks(), ChecksumHooks(), OriginHooks())

Execution order: OriginHooks (first) -> ChecksumHooks -> QualityChecksHooks (last).
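The ordering can be illustrated with a small, dependency-free sketch. The three classes below are empty stand-ins for the real hooks, and `execution_order` is a hypothetical helper rather than a Kedro API; in a real project Kedro's plugin manager performs the dispatch.

```python
# Empty stand-ins for the real hook classes (assumption: only the names matter here).
class QualityChecksHooks: ...
class ChecksumHooks: ...
class OriginHooks: ...

# Same registration order as in settings.py.
HOOKS = (QualityChecksHooks(), ChecksumHooks(), OriginHooks())


def execution_order(hooks):
    """Return hook class names in the order a LIFO dispatcher would call them."""
    return [type(hook).__name__ for hook in reversed(hooks)]


print(" -> ".join(execution_order(HOOKS)))
```

Registering OriginHooks last means data is downloaded before ChecksumHooks inspects it.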
OriginHooks
Trigger: before_dataset_loaded for landing.* datasets.
OriginHooks is the automatic data acquisition system. Before any landing dataset is loaded, it reads the metadata.origin field from the catalog YAML and downloads the data if it is not already present on disk.
metadata:
  origin:
    provider: opentargets
    version: "25.06"
    dataset_name: disease

The provider field selects which download strategy to use. Optimus routes this through a Pydantic discriminated union, so each provider is a self-contained Pydantic model with its own validation and download() method.
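The routing idea can be sketched without any third-party dependency. This is a standard-library stand-in for the Pydantic discriminated union, not the Optimus implementation: the class names `HttpOrigin` and `OpenTargetsOrigin`, the `PROVIDERS` registry, and `parse_origin` are all hypothetical, and the download() method is omitted.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class HttpOrigin:  # hypothetical model for `provider: http`
    url: str


@dataclass(frozen=True)
class OpenTargetsOrigin:  # hypothetical model for `provider: opentargets`
    version: str
    dataset_name: str


# Discriminator value -> model class, mirroring the `provider` key in the catalog.
PROVIDERS = {"http": HttpOrigin, "opentargets": OpenTargetsOrigin}


def parse_origin(origin: dict):
    """Pick the model class from `provider`, then check the remaining keys."""
    config = dict(origin)
    provider = config.pop("provider", None)
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    model = PROVIDERS[provider]
    expected = {field.name for field in fields(model)}
    if set(config) != expected:
        raise ValueError(
            f"{provider}: expected keys {sorted(expected)}, got {sorted(config)}"
        )
    return model(**config)


origin = parse_origin(
    {"provider": "opentargets", "version": "25.06", "dataset_name": "disease"}
)
print(origin)
```

With Pydantic, the registry and key check would be replaced by a union of models discriminated on the `provider` field, which also gives per-field type validation.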
Fallback for Private Data
Some data sources (e.g., DrugBank) require credentials. If a landing dataset has no origin metadata and the file does not exist on disk, OriginHooks creates an empty placeholder (an empty CSV with schema headers, or a minimal XML root element). This allows the pipeline to continue with public data only, gracefully excluding private sources.
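A minimal sketch of the placeholder step, using only the standard library. The function name `create_placeholder`, its parameters, the file-suffix check, and the example column names are assumptions made for illustration; how Optimus obtains the schema headers is not shown here.

```python
import csv
import tempfile
from pathlib import Path


def create_placeholder(path: Path, columns=(), xml_root="root") -> Path:
    """Create an empty stand-in file unless one already exists on disk."""
    if path.exists():
        return path
    path.parent.mkdir(parents=True, exist_ok=True)
    if path.suffix == ".xml":
        # Minimal XML document: a declaration plus one empty root element.
        path.write_text(
            f'<?xml version="1.0" encoding="UTF-8"?>\n<{xml_root}/>\n',
            encoding="utf-8",
        )
    else:
        # Empty CSV: a header row and no data rows.
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerow(columns)
    return path


with tempfile.TemporaryDirectory() as tmp:
    csv_path = create_placeholder(Path(tmp) / "landing" / "drugs.csv", columns=["id", "name"])
    xml_path = create_placeholder(Path(tmp) / "landing" / "drugs.xml", xml_root="drugs")
    csv_text = csv_path.read_text(encoding="utf-8")
    xml_text = xml_path.read_text(encoding="utf-8")
```

Because the placeholder keeps the expected columns, downstream nodes can still run and simply produce no rows for the private source.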
Providers
Providers encapsulate the download logic for different data source types. Each provider is a Pydantic model that validates its configuration and implements a download() method.
HttpProvider
Generic HTTP/HTTPS downloads. Used for most data sources that provide direct file URLs.
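A download-if-missing step for a direct file URL might look like the sketch below. It uses only the standard library; `download_if_missing` and the `.part` temporary suffix are hypothetical choices, not the provider's actual download() method. The demo copies from a local `file://` URL so that it runs without network access.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_if_missing(url: str, target: Path) -> bool:
    """Fetch `url` into `target`; return False if the file was already there."""
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, partial.open("wb") as out:
        shutil.copyfileobj(response, out)
    # Rename only after the full body has been written, so an interrupted
    # download never leaves a truncated file at the final path.
    partial.replace(target)
    return True


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.csv"
    source.write_text("id,name\n", encoding="utf-8")
    target = Path(tmp) / "landing" / "data.csv"
    first = download_if_missing(source.as_uri(), target)   # performs the copy
    second = download_if_missing(source.as_uri(), target)  # file exists, skipped
    copied = target.read_text(encoding="utf-8")
```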
origin:
  provider: http
  url: https://example.com/data.csv

OpenTargetsProvider
FTP-based bulk downloads from the EBI Open Targets platform. Uses wget --recursive to download entire dataset directories.
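As a rough sketch, a recursive wget invocation can be assembled as an argument list. Everything beyond `wget --recursive` is an assumption: the helper name `build_wget_command`, the extra flags, the `<base>/<version>/<dataset_name>/` URL layout, and the placeholder base URL are illustrative and do not describe the real Open Targets FTP structure.

```python
def build_wget_command(base_url: str, version: str, dataset_name: str, dest_dir: str) -> list[str]:
    """Build the argv for a recursive directory download (hypothetical URL layout)."""
    url = f"{base_url.rstrip('/')}/{version}/{dataset_name}/"
    return [
        "wget",
        "--recursive",            # follow links to fetch the whole directory
        "--no-parent",            # never ascend above the dataset directory
        "--no-host-directories",  # do not create a top-level directory named after the host
        "--directory-prefix", dest_dir,
        url,
    ]


command = build_wget_command(
    "ftp://example.org/releases",  # placeholder base URL, not the real server
    version="25.06",
    dataset_name="disease",
    dest_dir="data/landing/opentargets",
)
print(" ".join(command))
```

Passing the list to `subprocess.run(command, check=True)` would run the download and raise if wget exits with a non-zero status.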
origin:
  provider: opentargets
  version: "25.06"
  dataset_name: disease

BioOntologyProvider
Downloads specific ontology versions from the BioPortal API. Resolves versioned ontology files by querying the API for the correct download URL.
origin:
  provider: bioontology
  ontology: GO
  version: "2024-06-17"

DrugBankProvider
Downloads from DrugBank release archives. Requires authentication credentials (configured locally).
origin:
  provider: drugbank
  version: "5.1.12"

ChecksumHooks
Trigger: before_dataset_loaded for landing.*, bronze.*, and silver.* datasets.
ChecksumHooks ensures data integrity across the pipeline. Before loading any dataset in the first three layers, it:
- Reads the metadata.checksum field from the catalog YAML
- Computes the BLAKE2b hash of the actual file or directory on disk
- Logs a warning if the computed checksum does not match the recorded one
metadata:
  checksum: b7a669ddabfa209e939cc0a603095f3e

Checksums serve two purposes:
- Reproducibility: Detect when upstream data sources change between pipeline runs.
- Integrity: Catch corrupted or incomplete downloads before they propagate through the pipeline.
Use uv run cli checksum <path> to compute the BLAKE2b hash of any file or directory, or uv run cli sync-catalog --dataset <name> to automatically update the checksum in the catalog YAML after a node rerun.
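The compute-and-compare step can be sketched with `hashlib`. Two details are assumptions: the 16-byte digest size (chosen because it yields 32 hex characters, the same length as the example checksum) and the directory scheme, which hashes each file's relative path and contents in sorted order. The function names are hypothetical and the real CLI may differ on all of these points.

```python
import hashlib
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


def blake2b_checksum(path: Path, digest_size: int = 16) -> str:
    """Hash a single file, or every file under a directory in sorted order."""
    hasher = hashlib.blake2b(digest_size=digest_size)
    is_dir = path.is_dir()
    files = sorted(p for p in path.rglob("*") if p.is_file()) if is_dir else [path]
    for file in files:
        if is_dir:
            # Mix in the relative path so renames change the checksum (assumed scheme).
            hasher.update(file.relative_to(path).as_posix().encode("utf-8"))
        with file.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                hasher.update(chunk)
    return hasher.hexdigest()


def verify_checksum(path: Path, recorded: str) -> bool:
    """Warn, rather than fail, when the data on disk differs from the catalog."""
    actual = blake2b_checksum(path)
    if actual != recorded:
        logger.warning("Checksum mismatch for %s: catalog=%s actual=%s", path, recorded, actual)
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "disease.csv"
    data_file.write_bytes(b"id,name\n")
    file_digest = blake2b_checksum(data_file)
    dir_digest = blake2b_checksum(Path(tmp))
    matches = verify_checksum(data_file, file_digest)
    mismatch = verify_checksum(data_file, "0" * 32)
```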
QualityChecksHooks
Trigger: after_node_run for silver and gold namespace nodes.
QualityChecksHooks enforces data quality standards on pipeline outputs. After every silver or gold node completes, it validates:
| Check | Severity | Description |
|---|---|---|
| Column naming | Error | All column names must be snake_case |
| Null IDs | Error | All id* columns must have zero null values |
| Valid relations | Error | All relation column values must be valid members of the Relation enum |
| OTHER relations | Warning | Logs the percentage of edges with OTHER as their relation type |
The relation validation is strict: if any edge has an unrecognized relation string, the hook raises a DatasetError and halts the pipeline. This ensures that all relationships in the knowledge graph use a standardized, controlled vocabulary.
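The checks in the table can be sketched over plain rows (a list of dicts) rather than a DataFrame. This is an illustration under stated assumptions: the `Relation` members are invented, `DatasetError` is a local stand-in for the exception the hook raises, "id* columns" is read as "column names starting with id", and `check_rows` is a hypothetical helper.

```python
import re
from enum import Enum


class Relation(str, Enum):  # hypothetical members; the real enum is larger
    TREATS = "treats"
    ASSOCIATED_WITH = "associated_with"
    OTHER = "other"


class DatasetError(Exception):
    """Local stand-in for the error raised by the hook."""


SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_rows(rows: list[dict]) -> float:
    """Run the error-level checks; return the share of edges typed OTHER."""
    columns = {column for row in rows for column in row}
    not_snake = sorted(c for c in columns if not SNAKE_CASE.match(c))
    if not_snake:
        raise DatasetError(f"columns are not snake_case: {not_snake}")
    for column in sorted(columns):
        if column.startswith("id") and any(row.get(column) is None for row in rows):
            raise DatasetError(f"null values in ID column {column!r}")
    relations = [row["relation"] for row in rows if "relation" in row]
    valid = {member.value for member in Relation}
    unknown = sorted(str(value) for value in set(relations) - valid)
    if unknown:
        raise DatasetError(f"unrecognized relation values: {unknown}")
    # Warning-level metric: how much of the graph falls back to OTHER.
    return relations.count(Relation.OTHER.value) / len(relations) if relations else 0.0


edges = [
    {"id_source": "drug:1", "id_target": "disease:9", "relation": "treats"},
    {"id_source": "drug:2", "id_target": "disease:9", "relation": "other"},
]
other_share = check_rows(edges)
print(f"{other_share:.0%} of edges use OTHER")
```

Returning the OTHER share instead of raising mirrors the table: it is reported as a warning, while the other three checks stop the run.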