Overview
Optimus is a production-ready data pipeline framework for constructing, validating, and maintaining biomedical knowledge graphs according to software engineering best practices.
Optimus is the framework; OptimusKG is the biomedical knowledge graph data product built with it.
Optimus is grounded in three core principles:
- Ready-to-use: Pre-built processing nodes that unify many biomedical data sources into a single knowledge graph. Run one command to build the entire graph.
- Reproducible: All transformations are deterministic, validated through checksum verification, and infrastructure-agnostic. Every dataset is declaratively specified in version-controlled YAML.
- Extensible: Built on top of the Kedro framework (hosted by the Linux Foundation) and extends it, inheriting Kedro's uniform project template, data abstraction, configuration management, and pipeline assembly.
Beyond these principles, Optimus is also AI-ready: the repository ships with a skills system, agent rules, and CLI tools that allow AI coding agents to operate within the pipeline autonomously.
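The reproducibility principle rests on declarative dataset specifications. As a sketch, a Kedro-style catalog entry for a hypothetical bronze-layer dataset might look like the following (the dataset name, file path, and checksum metadata key are illustrative assumptions, not Optimus's actual schema; the overall shape follows Kedro's `conf/base/catalog.yml` convention):

```yaml
# Hypothetical entry: names and metadata keys are illustrative only.
bronze_drug_interactions:
  type: pandas.ParquetDataset
  filepath: data/bronze/drug_interactions.parquet
  metadata:
    checksum: "<sha256-of-file>"   # verified by a lifecycle hook before use
```

Because every dataset is declared this way in version-controlled YAML, a changed upstream file surfaces as a checksum mismatch rather than a silently different graph.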
Architectural Components
Optimus organizes its pipeline around nine core components:
| Component | Description |
|---|---|
| Catalog | The single source of truth for all datasets, their schemas, checksums, and metadata |
| Dataset | Typed abstractions for reading and writing data (Parquet, JSON, OWL, ZIP, SQL dumps) |
| Node | Pure Python functions that transform data, grouped into pipelines |
| Pipeline | Directed acyclic graphs (DAGs) of nodes, organized into medallion layers |
| Layer | Medallion architecture tiers: landing, bronze, silver, gold |
| Parameters | Runtime configuration values (export formats, feature flags) |
| Provider | Data download strategies for different sources (HTTP, FTP, APIs) |
| Hook | Lifecycle interceptors for downloading, checksum validation, and quality checks |
| Conf | Configuration directory with base environments and OmegaConf integration |
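How a few of these components fit together can be sketched in plain Python. Every name below is hypothetical; in the real framework, nodes, pipelines, and hooks are delegated to Kedro. The sketch only illustrates the division of labour between the Catalog (dataset lookup), a Hook (checksum validation before a node runs), and a Node (a pure transformation):

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Checksum used by the validation hook."""
    return hashlib.sha256(data).hexdigest()

# --- Catalog: single source of truth for datasets and their checksums ---
raw_bytes = b"gene,disease\nBRCA1,breast carcinoma\n"
catalog = {
    "landing_gene_disease": {
        "data": raw_bytes,
        "checksum": sha256_of(raw_bytes),
    }
}

# --- Hook: lifecycle interceptor that runs before a node executes ---
def before_node_run(entry: dict) -> None:
    if sha256_of(entry["data"]) != entry["checksum"]:
        raise ValueError("checksum mismatch: dataset is not reproducible")

# --- Node: a pure function transforming one medallion layer into the next ---
def to_bronze(entry: dict) -> list[dict]:
    header, *rows = entry["data"].decode().splitlines()
    keys = header.split(",")
    return [dict(zip(keys, row.split(","))) for row in rows]

entry = catalog["landing_gene_disease"]
before_node_run(entry)        # hook fires first; raises on tampered data
records = to_bronze(entry)    # node transforms landing -> bronze
```

Because the node is a pure function of its input dataset, re-running the pipeline on checksum-verified data always yields the same output, which is what makes the layers composable into a deterministic DAG.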
Continue Reading
- Architecture: The medallion architecture, data flow, and pipeline layers.
- Catalog & Datasets: The catalog system and custom dataset types.
- Hooks & Providers: Lifecycle hooks and data download providers.
- CLI: Command-line tools for catalog maintenance and analysis.
- Agentic: AI-ready features: skills, agent rules, and plans.