# Architecture
The medallion architecture, pipeline layers, and data flow in Optimus.
Optimus implements a medallion architecture with four data layers, each providing increasing levels of data quality and standardization. The entire pipeline is expressed as a directed acyclic graph (DAG) of nodes built on Kedro.
## Medallion Layers
Data flows through four layers, from raw ingestion to a ready-to-use knowledge graph.
### Landing
The landing layer contains raw data files exactly as downloaded from external sources. No transformations are applied. Downloads are managed automatically by the OriginHooks system based on metadata in the catalog.
Files in this layer serve as the immutable foundation for reproducibility. Each landing dataset has a BLAKE2b checksum recorded in the catalog to detect upstream changes.
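As an illustration of how such a checksum can be computed (a minimal sketch using Python's standard library; the actual OriginHooks implementation may differ), BLAKE2b is available directly in `hashlib`:

```python
import hashlib


def blake2b_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through BLAKE2b so large downloads never load fully into memory."""
    digest = hashlib.blake2b()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing this digest against the value recorded in the catalog is enough to flag any upstream change to a landing file.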
Optimus ships with built-in support for a wide range of biomedical data sources, including OpenTargets, Bgee, CTD, DrugBank, DrugCentral, DisGeNET, OnSIDES, Reactome, Gene Names, Ontologies (GO, HP, MONDO, UBERON), and PPI databases, among others.
### Bronze
The bronze layer performs initial cleaning and normalization of raw data. Each bronze node reads from the landing layer and produces a cleaned dataset with:
- Standardized column names (snake_case)
- Filtered rows (removing irrelevant records)
- Normalized identifiers (e.g., prefixing NCBI gene IDs with `NCBIGene:`)
- Deduplicated and sorted records
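A bronze node covering these four steps might look like the following sketch (the `clean_genes` function and its column names are illustrative, not the actual Optimus code):

```python
import pandas as pd


def clean_genes(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative bronze node: standardize, filter, normalize, deduplicate."""
    # Standardize column names to snake_case
    df = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Filter out rows without a usable identifier
    df = df.loc[df["gene_id"].notna()].copy()
    # Normalize identifiers by adding the CURIE prefix
    df["gene_id"] = "NCBIGene:" + df["gene_id"].astype(str)
    # Deduplicate and sort for deterministic output
    return df.drop_duplicates().sort_values("gene_id").reset_index(drop=True)
```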
### Silver
The silver layer constructs the unified knowledge graph from bronze data. This is where entity harmonization, cross-source merging, and schema standardization happen. Silver is split into two groups: entity nodes and edge nodes.
#### Entity Nodes
Each entity node follows a standardized schema:
```python
{
    "id": "MONDO:0005015",  # Unique identifier with prefix
    "label": "DIS",         # 3-letter node type code
    "properties": { ... },  # Nested struct with source-specific metadata
}
```

| Node Type | Label |
|---|---|
| Anatomy | ANA |
| Biological Process | BPO |
| Cellular Component | CCO |
| Disease | DIS |
| Drug | DRG |
| Exposure | EXP |
| Gene | GEN |
| Molecular Function | MFN |
| Pathway | PWY |
| Phenotype | PHE |
| Protein | PRO |
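Enforcing this schema could be sketched as follows (the `validate_entity` helper below is illustrative and not part of Optimus; only the label codes come from the table above):

```python
VALID_NODE_LABELS = {
    "ANA", "BPO", "CCO", "DIS", "DRG", "EXP",
    "GEN", "MFN", "PWY", "PHE", "PRO",
}


def validate_entity(record: dict) -> dict:
    """Check that an entity record matches the standardized silver schema."""
    if set(record) != {"id", "label", "properties"}:
        raise ValueError(f"unexpected keys: {sorted(record)}")
    prefix, sep, local_id = record["id"].partition(":")
    if not (prefix and sep and local_id):
        raise ValueError("id must be a prefixed identifier, e.g. 'MONDO:0005015'")
    if record["label"] not in VALID_NODE_LABELS:
        raise ValueError(f"unknown node label: {record['label']}")
    return record
```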
#### Edge Nodes
Each edge follows a standardized schema:
```python
{
    "from": "DrugBank:DB00945",  # Source entity ID
    "to": "MONDO:0005015",       # Target entity ID
    "label": "DRG-DIS",          # Source-Target type code
    "relation": "indication",    # Standardized relation type
    "undirected": False,         # Directionality flag
    "properties": { ... },       # Nested struct with source-specific metadata
}
```

| Edge Type | Label |
|---|---|
| Anatomy - Anatomy | ANA-ANA |
| Anatomy - Protein | ANA-PRO |
| Biological Process - Biological Process | BPO-BPO |
| Biological Process - Protein | BPO-PRO |
| Cellular Component - Cellular Component | CCO-CCO |
| Cellular Component - Protein | CCO-PRO |
| Disease - Disease | DIS-DIS |
| Disease - Phenotype | DIS-PHE |
| Disease - Protein | DIS-PRO |
| Drug - Disease | DRG-DIS |
| Drug - Drug | DRG-DRG |
| Drug - Phenotype | DRG-PHE |
| Drug - Protein | DRG-PRO |
| Exposure - Biological Process | EXP-BPO |
| Exposure - Cellular Component | EXP-CCO |
| Exposure - Disease | EXP-DIS |
| Exposure - Exposure | EXP-EXP |
| Exposure - Molecular Function | EXP-MFN |
| Exposure - Protein | EXP-PRO |
| Molecular Function - Molecular Function | MFN-MFN |
| Molecular Function - Protein | MFN-PRO |
| Pathway - Pathway | PWY-PWY |
| Pathway - Protein | PWY-PRO |
| Phenotype - Phenotype | PHE-PHE |
| Phenotype - Protein | PHE-PRO |
| Protein - Protein | PRO-PRO |
The pipeline enforces that all node and edge properties contain a `sources: (direct, indirect)` struct, allowing every record in the final graph to be traced back to its originating data sources.
### Gold
The gold layer exports the finalized knowledge graph to multiple formats. A single export_kg node takes all silver datasets and produces:
- CSV: Individual files per node/edge type, plus consolidated `nodes.csv` and `edges.csv`. Properties are JSON-encoded strings.
- Parquet: Individual files per node/edge type with native struct properties, plus consolidated files.
- Neo4j: Uses BioCypher to generate CSV files compatible with `neo4j-admin import`, then bulk imports into a Neo4j container.
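Because the CSV export serializes properties as JSON strings, consumers need to decode that column on read. A minimal sketch (the `load_nodes_csv` helper is assumed for illustration; column names follow the schema above):

```python
import csv
import json


def load_nodes_csv(path: str) -> list[dict]:
    """Read a consolidated nodes.csv and decode the JSON-encoded properties column."""
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            row["properties"] = json.loads(row["properties"])
            rows.append(row)
    return rows
```

The Parquet export avoids this round-trip entirely, since struct columns are stored natively.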
## Pipeline Execution
The entire pipeline can be run with a single command:
```shell
uv run kedro run --to-nodes gold.export_kg --runner=ParallelRunner --async
```

Kedro's DAG resolver automatically determines the execution order. The `ParallelRunner` with `--async` enables concurrent node execution where dependencies allow.
To run a subset of the pipeline, use Kedro's node selection:
```shell
# Run only the bronze layer
uv run kedro run --namespace bronze

# Run a specific node
uv run kedro run --nodes bronze.bgee

# Run up to a specific node
uv run kedro run --to-nodes silver.drug_disease

# Run from a specific node to the end
uv run kedro run --from-nodes silver.drug_disease
```

All pipelines are auto-discovered via Kedro's `find_pipelines()` mechanism. A `__default__` pipeline is registered as the union of all sub-pipelines, so `kedro run` without filters builds everything.