Hooks & Providers

Lifecycle hooks and data download providers in Optimus.

Optimus uses Kedro Hooks to inject behavior at key points in the pipeline lifecycle. Three hooks are registered in settings.py. Kedro runs hooks in LIFO (last-in, first-out) order, so the last hook registered executes first:

# settings.py
HOOKS = (QualityChecksHooks(), ChecksumHooks(), OriginHooks())

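Execution order: OriginHooks (first) -> ChecksumHooks -> QualityChecksHooks (last). The order matters most for OriginHooks and ChecksumHooks, which share the before_dataset_loaded trigger: a landing dataset is downloaded before its checksum is verified.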
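
The ordering can be illustrated with a small standard-library model. This is a simplified sketch, not Kedro's actual dispatcher: the dispatch function and the calls list are hypothetical helpers, and the hook bodies only record their own name.

```python
class OriginHooks:
    def before_dataset_loaded(self, dataset_name, calls):
        calls.append("OriginHooks")


class ChecksumHooks:
    def before_dataset_loaded(self, dataset_name, calls):
        calls.append("ChecksumHooks")


class QualityChecksHooks:
    def after_node_run(self, node_name, calls):
        calls.append("QualityChecksHooks")


# Same registration order as in settings.py.
HOOKS = (QualityChecksHooks(), ChecksumHooks(), OriginHooks())


def dispatch(hooks, hook_name, *args):
    """Call hook_name on every hook that implements it, last registered first."""
    for hook in reversed(hooks):
        method = getattr(hook, hook_name, None)
        if method is not None:
            method(*args)


calls = []
dispatch(HOOKS, "before_dataset_loaded", "landing.disease", calls)
print(calls)  # ['OriginHooks', 'ChecksumHooks']
```

QualityChecksHooks does not appear in the output because it implements a different hook point (after_node_run).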

OriginHooks

Trigger: before_dataset_loaded for landing.* datasets.

OriginHooks is the automatic data acquisition system. Before any landing dataset is loaded, it reads the metadata.origin field from the catalog YAML and downloads the data if it is not already present on disk.

metadata:
  origin:
    provider: opentargets
    version: "25.06"
    dataset_name: disease

The provider field selects which download strategy to use. Optimus routes this through a Pydantic discriminated union, so each provider is a self-contained Pydantic model with its own validation and download() method.
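
The routing idea can be sketched without Pydantic. The example below swaps in plain dataclasses and a lookup table keyed on the provider field, which mimics what a discriminated union does: the tag picks the model, and the model rejects missing or unexpected fields. The class names, the parse_origin helper, and the download() return values are hypothetical and are not the Optimus implementation.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class HttpOrigin:
    url: str

    def download(self) -> str:
        # A real provider would fetch the file; the sketch only describes the action.
        return f"GET {self.url}"


@dataclass(frozen=True)
class OpenTargetsOrigin:
    version: str
    dataset_name: str

    def download(self) -> str:
        return f"fetch Open Targets {self.version}/{self.dataset_name}"


# The provider value acts as the discriminator that selects the model.
PROVIDERS = {"http": HttpOrigin, "opentargets": OpenTargetsOrigin}


def parse_origin(raw: Dict[str, Any]):
    """Build the provider model named by raw['provider']."""
    fields = dict(raw)
    tag = fields.pop("provider", None)
    if tag not in PROVIDERS:
        raise ValueError(f"unknown provider: {tag!r}")
    # Raises TypeError if fields are missing or unexpected for this provider.
    return PROVIDERS[tag](**fields)


origin = parse_origin(
    {"provider": "opentargets", "version": "25.06", "dataset_name": "disease"}
)
print(origin.download())  # fetch Open Targets 25.06/disease
```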

Fallback for Private Data

Some data sources (e.g., DrugBank) require credentials. If a landing dataset has no origin metadata and the file does not exist on disk, OriginHooks creates an empty placeholder (an empty CSV with schema headers, or a minimal XML root element). This allows the pipeline to continue with public data only, gracefully excluding private sources.
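
A minimal sketch of the placeholder step, assuming the file type is decided by its extension. The write_placeholder function, its parameters, and the default root element name are hypothetical; the source does not say how OriginHooks obtains the schema headers.

```python
import csv
from pathlib import Path
from typing import Sequence


def write_placeholder(
    path: Path, columns: Sequence[str] = (), xml_root: str = "root"
) -> None:
    """Create an empty stand-in file so downstream nodes can still load it."""
    if path.exists():
        return  # never overwrite real data
    path.parent.mkdir(parents=True, exist_ok=True)
    if path.suffix == ".xml":
        # Minimal well-formed document: a declaration plus an empty root element.
        path.write_text(
            f'<?xml version="1.0" encoding="UTF-8"?>\n<{xml_root}/>\n',
            encoding="utf-8",
        )
    else:
        # Header row only, no data rows.
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerow(columns)
```

Returning early when the file exists keeps the fallback from clobbering data that a user placed on disk manually.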

Providers

Providers encapsulate the download logic for different data source types. Each provider is a Pydantic model that validates its configuration and implements a download() method.

HttpProvider

Generic HTTP/HTTPS downloads. Used for most data sources that provide direct file URLs.

origin:
  provider: http
  url: https://example.com/data.csv
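
A download-if-missing step for such a URL might look like the sketch below. It uses only the standard library; the function name and the temporary .part suffix are assumptions, not the HttpProvider implementation.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url: str, target: Path) -> bool:
    """Fetch url into target unless target already exists.

    Returns True when a download happened, False when the file was present.
    """
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # never leaves a partial file at the final path.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, partial.open("wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(target)
    return True
```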

OpenTargetsProvider

FTP-based bulk downloads from the EBI Open Targets platform. Uses wget --recursive to download entire dataset directories.

origin:
  provider: opentargets
  version: "25.06"
  dataset_name: disease
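
The sketch below shows how version and dataset_name could be turned into a wget --recursive invocation. The base URL and the /output/ path segment are assumptions about the FTP layout, and wget_command is a hypothetical helper; check the Open Targets download documentation before relying on either.

```python
from typing import List

# Assumed FTP root; not confirmed by this page.
BASE_URL = "ftp://ftp.ebi.ac.uk/pub/databases/opentargets/platform"


def wget_command(
    version: str, dataset_name: str, target_dir: str, base_url: str = BASE_URL
) -> List[str]:
    """Build (but do not run) a recursive wget command for one dataset directory."""
    url = f"{base_url}/{version}/output/{dataset_name}/"
    return [
        "wget",
        "--recursive",            # descend into the dataset directory
        "--no-parent",            # never climb above the dataset directory
        "--no-host-directories",  # do not create a directory named after the host
        "--directory-prefix",
        target_dir,
        url,
    ]


print(" ".join(wget_command("25.06", "disease", "data/landing/opentargets")))
```

The resulting list can be passed to subprocess.run(cmd, check=True) so that a failed download raises instead of passing silently.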

BioOntologyProvider

Downloads specific ontology versions from the BioPortal API. Resolves versioned ontology files by querying the API for the correct download URL.

origin:
  provider: bioontology
  ontology: GO
  version: "2024-06-17"

DrugBankProvider

Downloads from DrugBank release archives. Requires authentication credentials (configured locally).

origin:
  provider: drugbank
  version: "5.1.12"

ChecksumHooks

Trigger: before_dataset_loaded for landing.*, bronze.*, and silver.* datasets.

ChecksumHooks ensures data integrity across the pipeline. Before loading any dataset in the first three layers, it:

  1. Reads the metadata.checksum field from the catalog YAML
  2. Computes the BLAKE2b hash of the actual file or directory on disk
  3. Logs a warning if the computed checksum does not match the recorded one

metadata:
  checksum: b7a669ddabfa209e939cc0a603095f3e
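
The three steps above can be sketched with hashlib. Two details are assumptions: the 16-byte digest size (inferred from the 32-character example checksum) and the way a directory is hashed (files in sorted order, with relative paths mixed in). The function names are hypothetical.

```python
import hashlib
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def blake2b_checksum(path: Path, digest_size: int = 16) -> str:
    """Hash a file, or every file under a directory in sorted order."""
    digest = hashlib.blake2b(digest_size=digest_size)
    if path.is_file():
        files = [path]
    else:
        files = sorted(p for p in path.rglob("*") if p.is_file())
    for file in files:
        if path.is_dir():
            # Include the relative path so renames change the checksum.
            digest.update(file.relative_to(path).as_posix().encode("utf-8"))
        with file.open("rb") as handle:
            # Read in 1 MiB chunks to keep memory use flat on large files.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: Path, expected: str) -> bool:
    """Log a warning on mismatch, mirroring the hook's non-fatal behavior."""
    actual = blake2b_checksum(path)
    if actual != expected:
        logger.warning(
            "Checksum mismatch for %s: expected %s, got %s", path, expected, actual
        )
        return False
    return True
```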

Checksums serve two purposes:

  • Reproducibility: Detect when upstream data sources change between pipeline runs.
  • Integrity: Catch corrupted or incomplete downloads before they propagate through the pipeline.

Use uv run cli checksum <path> to compute the BLAKE2b hash of any file or directory, or uv run cli sync-catalog --dataset <name> to automatically update the checksum in the catalog YAML after a node rerun.

QualityChecksHooks

Trigger: after_node_run for silver and gold namespace nodes.

QualityChecksHooks enforces data quality standards on pipeline outputs. After every silver or gold node completes, it validates:

Check            Severity  Description
Column naming    Error     All column names must be snake_case
Null IDs         Error     All id* columns must have zero null values
Valid relations  Error     All relation column values must be valid members of the Relation enum
OTHER relations  Warning   Logs the percentage of edges with OTHER as their relation type

The relation validation is strict: if any edge has an unrecognized relation string, the hook raises a DatasetError and halts the pipeline. This ensures that all relationships in the knowledge graph use standardized, controlled vocabulary.
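
The checks can be sketched over plain rows (a list of dicts standing in for a DataFrame). Everything here is illustrative: the Relation members other than OTHER are invented, QualityError stands in for the DatasetError the hook raises, and "id* columns" is read as columns whose name starts with id.

```python
import re
from enum import Enum
from typing import Dict, List, Optional


class Relation(str, Enum):
    TREATS = "TREATS"                    # hypothetical member
    ASSOCIATED_WITH = "ASSOCIATED_WITH"  # hypothetical member
    OTHER = "OTHER"


SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


class QualityError(Exception):
    """Stand-in for the DatasetError raised by the real hook."""


def check_rows(rows: List[Dict[str, Optional[str]]]) -> float:
    """Run the error-level checks; return the percentage of OTHER relations."""
    columns = list(rows[0].keys()) if rows else []
    for column in columns:
        if not SNAKE_CASE.match(column):
            raise QualityError(f"column {column!r} is not snake_case")
    for row in rows:
        for column, value in row.items():
            if column.startswith("id") and value is None:
                raise QualityError(f"null value in ID column {column!r}")
    valid = {member.value for member in Relation}
    relations = [row["relation"] for row in rows if "relation" in row]
    for value in relations:
        if value not in valid:
            raise QualityError(f"unrecognized relation {value!r}")
    if not relations:
        return 0.0
    return 100.0 * relations.count(Relation.OTHER.value) / len(relations)


edges = [
    {"id_source": "a", "id_target": "b", "relation": "TREATS"},
    {"id_source": "c", "id_target": "d", "relation": "OTHER"},
]
print(check_rows(edges))  # 50.0
```

The first three checks raise and stop processing, while the OTHER share is only returned so the caller can log it as a warning.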
