Catalog & Datasets

The catalog system and custom dataset types in Optimus.

The catalog is the single source of truth in Optimus. Every dataset in the pipeline (from landing to gold) is declaratively specified in version-controlled YAML files.

The Catalog

Catalog entries map dataset names (used as node inputs and outputs in the DAG) to their physical storage, schema, and metadata. They live under conf/base/catalog/ organized by layer:

bgee.yml
ctd.yml
drugbank.yml
opentargets.yml
...

Catalog Entry Anatomy

A typical catalog entry contains the dataset type, file path, load/save arguments with schema, and metadata for checksums, origin, and visualization:

landing.opentargets.disease:
  type: optimuskg.datasets.polars.ParquetDataset
  filepath: data/landing/opentargets/disease
  load_args:
    schema:
      id: pl.String
      code: pl.String
      name: pl.String
      description: pl.String
      dbXRefs: pl.List(pl.String)
  metadata:
    checksum: b7a669ddabfa209e939cc0a603095f3e
    origin:
      provider: opentargets
      version: "25.06"
      dataset_name: disease
    kedro-viz:
      layer: landing

Key Fields

| Field | Purpose |
| --- | --- |
| type | The dataset class to use (e.g., optimuskg.datasets.polars.ParquetDataset) |
| filepath | Path to the data file or directory on disk |
| load_args.schema | Column names and Polars types, parsed via the pl: OmegaConf resolver |
| metadata.checksum | BLAKE2b hash for integrity validation (verified by ChecksumHooks) |
| metadata.origin | Download instructions for OriginHooks (provider type, version, parameters) |
| metadata.kedro-viz.layer | Layer assignment for Kedro Viz visualization |
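As a rough illustration of the integrity check, the checksum can be computed with Python's built-in hashlib. This is a hypothetical sketch, not the ChecksumHooks implementation; digest_size=16 is an assumption inferred from the 32-hex-character checksum in the example entry above:

```python
import hashlib

def blake2b_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through BLAKE2b and return its hex digest.

    digest_size=16 (32 hex chars) is assumed to match the catalog's
    checksum format; the real hook may use different parameters.
    """
    h = hashlib.blake2b(digest_size=16)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage: compute and compare against the entry's metadata.checksum.
import tempfile, os
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"gene_id\tscore\n")
checksum = blake2b_checksum(tmp.name)
os.remove(tmp.name)
print(checksum)  # 32 hex characters
```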

The pl: Resolver

Optimus registers a custom OmegaConf resolver that parses Polars type strings from YAML. This allows schemas to be expressed declaratively:

schema:
  id: pl.String
  score: pl.Float64
  genes: pl.List(pl.String)
  metadata: pl.Struct({"name": pl.String, "version": pl.Int32})

The resolver handles nested types like pl.List(pl.Struct(...)), making it possible to specify complex schemas without writing Python code.
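The parsing step can be sketched with the standard library's ast module. This is a hypothetical illustration of how such type strings could be decoded, not the actual Optimus resolver (which presumably evaluates against the polars module itself); here the result is a nested Python representation rather than real Polars types:

```python
import ast

def parse_type(expr: str):
    """Parse a 'pl.*' type expression string into a nested structure."""
    return _convert(ast.parse(expr, mode="eval").body)

def _convert(node):
    if isinstance(node, ast.Attribute):   # pl.String -> "String"
        return node.attr
    if isinstance(node, ast.Call):        # pl.List(pl.String) -> ("List", ["String"])
        return (node.func.attr, [_convert(a) for a in node.args])
    if isinstance(node, ast.Dict):        # {"name": pl.String} -> {"name": "String"}
        return {k.value: _convert(v) for k, v in zip(node.keys, node.values)}
    if isinstance(node, ast.Constant):
        return node.value
    raise ValueError(f"unsupported type expression: {ast.dump(node)}")

print(parse_type("pl.String"))
print(parse_type("pl.List(pl.String)"))
print(parse_type('pl.Struct({"name": pl.String})'))
```

Because the string is parsed rather than eval'd, arbitrarily nested types like pl.List(pl.Struct(...)) decode safely into a tree.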

Custom Datasets

Optimus provides six custom Kedro dataset types beyond Kedro's built-in dataset library, tailored for biomedical data processing:

ParquetDataset

The primary dataset type. Reads and writes Parquet files using Polars with full schema parsing from YAML. Supports filesystem abstraction via fsspec for local and remote storage.

bronze.bgee:
  type: optimuskg.datasets.polars.ParquetDataset
  filepath: data/bronze/bgee.parquet
  load_args:
    schema:
      gene_id: pl.String
      anatomical_entity_id: pl.String
      expression_score: pl.Float64

JsonDataset

Reads and writes JSON and NDJSON files using Polars. Used for datasets that arrive in JSON format from APIs.
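A catalog entry for a JsonDataset might look like the following; the dataset name, path, and columns here are illustrative, not taken from the Optimus catalog:

```yaml
landing.example.api_response:
  type: optimuskg.datasets.polars.JsonDataset
  filepath: data/landing/example/response.ndjson
  load_args:
    schema:
      id: pl.String
      score: pl.Float64
```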

LXMLDataset

Reads and writes XML files using lxml.etree. Used for data sources that distribute data in XML format (e.g., DrugBank).

OWLDataset

Reads OWL ontology files, returning the file path for downstream processing with owlready2. Used for ontology sources (GO, HP, MONDO, UBERON).

ZipDataset

Reads a single file from within a ZIP archive, delegating to an inner dataset type for parsing. This avoids extracting entire archives when only one file is needed.

landing.ctd:
  type: optimuskg.datasets.ZipDataset
  filepath: data/landing/ctd/CTD_chemicals_diseases.csv.gz
  inner_type: optimuskg.datasets.polars.ParquetDataset
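The underlying idea (read one member without extracting the archive) can be sketched with the standard library's zipfile module. This is a minimal illustration, not the Optimus implementation, and inner_parse stands in for whatever the configured inner dataset type does:

```python
import io
import zipfile

def load_member(archive, member: str, inner_parse):
    """Read a single member from a ZIP archive and hand its bytes
    to an inner parser, without extracting the whole archive."""
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as fh:
            return inner_parse(io.BytesIO(fh.read()))

# Usage with a throwaway in-memory archive and a trivial "parser":
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.csv", "a,b\n1,2\n")
    zf.writestr("readme.txt", "ignore me")
buf.seek(0)

text = load_member(buf, "data.csv", lambda b: b.read().decode())
print(text)
```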

SQLDumpQueryDataset

Spins up a PostgreSQL container via Docker Compose, restores a SQL dump file, executes a query, and returns the result as a Polars DataFrame. Used for data sources distributed as database dumps (e.g., DrugCentral).

landing.drugcentral:
  type: optimuskg.datasets.SQLDumpQueryDataset
  filepath: data/landing/drugcentral/drugcentral.dump.sql
  load_args:
    query: "SELECT * FROM public.omop_relationship"

The SQLDumpQueryDataset requires Docker to be running, as it creates a temporary PostgreSQL container for each query.
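Conceptually, the container it manages resembles a compose service like the one below. This is a hypothetical sketch for orientation, not the actual Optimus compose file; the service name, credentials, and mount path are placeholders (a .sql dump mounted into /docker-entrypoint-initdb.d is restored by the official postgres image on first start):

```yaml
services:
  drugcentral-db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: optimus      # placeholder credentials
      POSTGRES_DB: drugcentral
    ports:
      - "5432:5432"
    volumes:
      # mounting the dump into the init directory restores it on first start
      - ./data/landing/drugcentral:/docker-entrypoint-initdb.d:ro
```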
