Overview

Optimus is a production-ready data pipeline framework designed to construct, validate, and maintain biomedical knowledge graphs following software engineering best practices.

Optimus is the framework that builds knowledge graphs. OptimusKG is the biomedical knowledge graph data product built using Optimus.

Optimus is grounded in three core principles:

  • Ready-to-use: Pre-built processing nodes that unify many biomedical data sources into a single knowledge graph. Run one command to build the entire graph.
  • Reproducible: All transformations are deterministic, validated through checksum verification, and infrastructure-agnostic. Every dataset is declaratively specified in version-controlled YAML.
  • Extensible: Built as a superset of the Kedro framework (hosted by the Linux Foundation), providing a uniform project template, data abstraction, configuration management, and pipeline assembly.
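
The checksum verification mentioned under "Reproducible" can be illustrated with a minimal sketch. It is not Optimus code: the function names, the choice of SHA-256, and the sample file are assumptions made for illustration, and the source does not state which hash algorithm Optimus uses.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: Path, expected: str) -> bool:
    """Compare a file's digest with the digest recorded for that dataset."""
    return sha256_of(path) == expected.lower()


# Hypothetical usage: the file name and contents are made up for this sketch.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"id,name\n1,aspirin\n")
    recorded = hashlib.sha256(b"id,name\n1,aspirin\n").hexdigest()
    ok = verify_checksum(sample, recorded)        # digest matches the record
    tampered = verify_checksum(sample, "0" * 64)  # digest does not match
```

In a setup like this, the recorded digest would live next to the dataset's declaration, so a changed upstream file is detected before any transformation runs.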

Beyond these principles, Optimus is also AI-ready: the repository ships with a skills system, agent rules, and CLI tools that allow AI coding agents to operate within the pipeline autonomously.

Architectural Components

Optimus organizes its pipeline around nine core components:

Component    Description
Catalog      The single source of truth for all datasets, their schemas, checksums, and metadata
Dataset      Typed abstractions for reading and writing data (Parquet, JSON, OWL, ZIP, SQL dumps)
Node         Pure Python functions that transform data, grouped into pipelines
Pipeline     Directed acyclic graphs (DAGs) of nodes, organized into medallion layers
Layer        Medallion architecture tiers: landing, bronze, silver, gold
Parameters   Runtime configuration values (export formats, feature flags)
Provider     Data download strategies for different sources (HTTP, FTP, APIs)
Hook         Lifecycle interceptors for downloading, checksum validation, and quality checks
Conf         Configuration directory with base environments and OmegaConf integration
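
How nodes, pipelines, layers, and the catalog relate can be sketched with the standard library alone. This is an illustration of the idea, not the Optimus or Kedro API: the node functions, the dataset names such as "landing.pairs", and the run helper are all hypothetical.

```python
from graphlib import TopologicalSorter


def normalize(rows):
    """Bronze step: trim whitespace and normalize identifier casing."""
    return [(drug.strip().lower(), target.strip().upper()) for drug, target in rows]


def deduplicate(rows):
    """Silver step: drop duplicate pairs and return them in a fixed order."""
    return sorted(set(rows))


def build_edges(rows):
    """Gold step: turn each pair into a graph edge record."""
    return [{"source": d, "target": t, "relation": "targets"} for d, t in rows]


# Each node is (function, input dataset, output dataset), listed out of order
# on purpose: the execution order comes from the dependencies, not the list.
NODES = [
    (build_edges, "silver.pairs", "gold.edges"),
    (normalize, "landing.pairs", "bronze.pairs"),
    (deduplicate, "bronze.pairs", "silver.pairs"),
]


def run(nodes, catalog):
    """Run nodes in dependency order, reading and writing through the catalog."""
    producers = {out: fn for fn, _, out in nodes}
    inputs = {fn: inp for fn, inp, _ in nodes}
    outputs = {fn: out for fn, _, out in nodes}
    # Map each node to the set of nodes that must run before it.
    graph = {
        fn: {producers[inp]} if inp in producers else set()
        for fn, inp, _ in nodes
    }
    for fn in TopologicalSorter(graph).static_order():
        catalog[outputs[fn]] = fn(catalog[inputs[fn]])
    return catalog


catalog = run(
    NODES,
    {"landing.pairs": [(" Aspirin", "ptgs1 "), ("aspirin", "PTGS1"), ("Ibuprofen", "ptgs2")]},
)
```

Because every node is a pure function of its input, running the pipeline again on the same landing data yields the same gold output, which is what makes checksum validation of intermediate datasets meaningful.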
