# Catalog & Datasets
The catalog system and custom dataset types in Optimus.
The catalog is the single source of truth in Optimus. Every dataset in the pipeline (from landing to gold) is declaratively specified in version-controlled YAML files.
## The Catalog
Catalog entries map dataset names (used as node inputs and outputs in the DAG) to their physical storage, schema, and metadata. They live under `conf/base/catalog/`, organized by layer.
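A layer-per-file layout along these lines is one plausible arrangement (the file names below are illustrative, not taken from the Optimus repository):

```
conf/base/catalog/
├── landing.yml
├── bronze.yml
└── gold.yml
```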
### Catalog Entry Anatomy
A typical catalog entry contains the dataset type, file path, load/save arguments with schema, and metadata for checksums, origin, and visualization:
```yaml
landing.opentargets.disease:
  type: optimuskg.datasets.polars.ParquetDataset
  filepath: data/landing/opentargets/disease
  load_args:
    schema:
      id: pl.String
      code: pl.String
      name: pl.String
      description: pl.String
      dbXRefs: pl.List(pl.String)
  metadata:
    checksum: b7a669ddabfa209e939cc0a603095f3e
    origin:
      provider: opentargets
      version: "25.06"
      dataset_name: disease
    kedro-viz:
      layer: landing
```

### Key Fields
| Field | Purpose |
|---|---|
| `type` | The dataset class to use (e.g., `optimuskg.datasets.polars.ParquetDataset`) |
| `filepath` | Path to the data file or directory on disk |
| `load_args.schema` | Column names and Polars types, parsed via the `pl:` OmegaConf resolver |
| `metadata.checksum` | BLAKE2b hash for integrity validation (verified by `ChecksumHooks`) |
| `metadata.origin` | Download instructions for `OriginHooks` (provider type, version, parameters) |
| `metadata.kedro-viz.layer` | Layer assignment for Kedro Viz visualization |
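For intuition, integrity validation along these lines can be sketched with the standard library's `hashlib.blake2b`. The 32-hex-character checksums in the catalog are consistent with a 16-byte BLAKE2b digest; the digest size and the function names below are assumptions for illustration, not taken from the Optimus source.

```python
import hashlib

def file_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through BLAKE2b (16-byte digest) and return the hex digest."""
    # digest_size=16 yields a 32-character hex string, matching the
    # checksum format shown in the catalog entry above (an assumption).
    h = hashlib.blake2b(digest_size=16)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def validate_checksum(path: str, expected: str) -> None:
    """Raise if the file on disk does not match the catalog's checksum."""
    actual = file_checksum(path)
    if actual != expected:
        raise ValueError(f"checksum mismatch for {path}: {actual} != {expected}")
```

A hook would typically call `validate_checksum` after load, comparing against `metadata.checksum` from the catalog entry.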
### The `pl:` Resolver
Optimus registers a custom OmegaConf resolver that parses Polars type strings from YAML. This allows schemas to be expressed declaratively:
```yaml
schema:
  id: pl.String
  score: pl.Float64
  genes: pl.List(pl.String)
  metadata: pl.Struct({"name": pl.String, "version": pl.Int32})
```

The resolver handles nested types like `pl.List(pl.Struct(...))`, making it possible to specify complex schemas without writing Python code.
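As a rough sketch of how such a resolver can work, the function below evaluates a type string in a scope that exposes only a `pl` namespace. A stub stands in for the real `polars` module so the example is self-contained; in practice the resolver would be registered with OmegaConf (e.g., via `OmegaConf.register_new_resolver`) and evaluate against `polars` itself. All names here are illustrative, not Optimus's actual implementation.

```python
from types import SimpleNamespace

def parse_polars_type(spec: str, pl) -> object:
    """Evaluate a type string like 'pl.List(pl.String)' in a restricted scope.

    Only the `pl` namespace is visible to the expression; builtins are
    disabled so arbitrary code in the YAML cannot run.
    """
    return eval(spec, {"__builtins__": {}}, {"pl": pl})

# Stub standing in for polars: String is a sentinel value, List wraps
# its inner type in a tuple so nesting is observable.
pl_stub = SimpleNamespace(
    String="String",
    List=lambda inner: ("List", inner),
)

print(parse_polars_type("pl.String", pl_stub))
print(parse_polars_type("pl.List(pl.String)", pl_stub))
```

With the real `polars` module bound to `pl`, the same expression string would produce actual Polars dtypes, nested arbitrarily deep.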
## Custom Datasets
Optimus provides six custom Kedro dataset types beyond the standard library, tailored for biomedical data processing:
### ParquetDataset
The primary dataset type. Reads and writes Parquet files using Polars with full schema parsing from YAML. Supports filesystem abstraction via fsspec for local and remote storage.
```yaml
bronze.bgee:
  type: optimuskg.datasets.polars.ParquetDataset
  filepath: data/bronze/bgee.parquet
  load_args:
    schema:
      gene_id: pl.String
      anatomical_entity_id: pl.String
      expression_score: pl.Float64
```

### JsonDataset
Reads and writes JSON and NDJSON files using Polars. Used for datasets that arrive in JSON format from APIs.
### LXMLDataset
Reads and writes XML files using `lxml.etree`. Used for data sources that distribute data in XML format (e.g., DrugBank).
### OWLDataset
Reads OWL ontology files, returning the file path for downstream processing with `owlready2`. Used for ontology sources (GO, HP, MONDO, UBERON).
### ZipDataset
Reads a single file from within a ZIP archive, delegating to an inner dataset type for parsing. This avoids extracting entire archives when only one file is needed.
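The core idea — reading one member without unpacking the whole archive — can be sketched with the standard-library `zipfile` module. The function and sample archive below are illustrative, not Optimus's actual code; the real dataset would hand the bytes to its configured `inner_type` for parsing.

```python
import io
import zipfile

def read_member(archive, member: str) -> bytes:
    """Read exactly one member from a ZIP archive without extracting the rest."""
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as f:
            return f.read()

# Build a small in-memory archive to demonstrate.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.csv", "id,name\n1,aspirin\n")
    zf.writestr("README.txt", "ignored")

print(read_member(buf, "data.csv").decode())
```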
```yaml
landing.ctd:
  type: optimuskg.datasets.ZipDataset
  filepath: data/landing/ctd/CTD_chemicals_diseases.csv.gz
  inner_type: optimuskg.datasets.polars.ParquetDataset
```

### SQLDumpQueryDataset
Spins up a PostgreSQL container via Docker Compose, restores a SQL dump file, executes a query, and returns the result as a Polars DataFrame. Used for data sources distributed as database dumps (e.g., DrugCentral).
```yaml
landing.drugcentral:
  type: optimuskg.datasets.SQLDumpQueryDataset
  filepath: data/landing/drugcentral/drugcentral.dump.sql
  load_args:
    query: "SELECT * FROM public.omop_relationship"
```

The `SQLDumpQueryDataset` requires Docker to be running, as it creates a temporary PostgreSQL container for each query.