# Architecture
The medallion architecture, pipeline layers, and data flow in Optimus.
Optimus implements a medallion architecture with four data layers, each providing increasing levels of data quality and standardization. The entire pipeline is expressed as a directed acyclic graph (DAG) of nodes built on Kedro.
## Medallion Layers
Data flows through four layers, from raw ingestion to a ready-to-use knowledge graph.
### Landing
The landing layer contains raw data files exactly as downloaded from external sources. No transformations are applied. Downloads are managed automatically by the OriginHooks system based on metadata in the catalog.
Files in this layer serve as the immutable foundation for reproducibility. Each landing dataset has a BLAKE2b checksum recorded in the catalog to detect upstream changes.
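As an illustration of how such a checksum can be computed (a minimal sketch using Python's standard library; the actual OriginHooks implementation may differ), BLAKE2b is available directly in `hashlib`:

```python
import hashlib


def blake2b_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through BLAKE2b so large downloads never load fully into memory."""
    digest = hashlib.blake2b()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing this digest against the value recorded in the catalog is enough to flag any upstream change to a landing file.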
Optimus ships with built-in support for a wide range of biomedical data sources, including OpenTargets, Bgee, CTD, DrugBank, DrugCentral, DisGeNET, OnSIDES, Reactome, Gene Names, Ontologies (GO, HP, MONDO, UBERON), and PPI databases, among others.
### Bronze
The bronze layer performs initial cleaning and normalization of raw data. Each bronze node reads from the landing layer and produces a cleaned dataset with:
- Standardized column names (snake_case)
- Filtered rows (removing irrelevant records)
- Normalized identifiers (e.g., prefixing NCBI gene IDs with `NCBIGene:`)
- Deduplicated and sorted records
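A bronze node covering these four steps might look like the following sketch (the `clean_genes` function and its column names are illustrative, not the actual Optimus code):

```python
import pandas as pd


def clean_genes(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative bronze node: standardize, filter, normalize, deduplicate."""
    # Standardize column names to snake_case
    df = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Filter out rows without a usable identifier
    df = df.loc[df["gene_id"].notna()].copy()
    # Normalize identifiers by adding the CURIE prefix
    df["gene_id"] = "NCBIGene:" + df["gene_id"].astype(str)
    # Deduplicate and sort for deterministic output
    return df.drop_duplicates().sort_values("gene_id").reset_index(drop=True)
```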
### Silver
The silver layer constructs the unified knowledge graph from bronze data. This is where entity harmonization, cross-source merging, and schema standardization happen. Silver is split into two groups: entity nodes and edge nodes.
#### Entity Nodes
Each entity node follows a standardized schema:
```python
{
    "id": "MONDO:0005015",  # Unique identifier with prefix
    "label": "DIS",         # 3-letter node type code
    "properties": { ... },  # Nested struct with source-specific metadata
}
```

| Node Type | Label |
|---|---|
| Anatomy | ANA |
| Biological Process | BPO |
| Cellular Component | CCO |
| Disease | DIS |
| Drug | DRG |
| Exposure | EXP |
| Gene | GEN |
| Molecular Function | MFN |
| Pathway | PWY |
| Phenotype | PHE |
| Protein | PRO |
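Enforcing this schema could be sketched as follows (the `validate_entity` helper below is illustrative and not part of Optimus; only the label codes come from the table above):

```python
VALID_NODE_LABELS = {
    "ANA", "BPO", "CCO", "DIS", "DRG", "EXP",
    "GEN", "MFN", "PWY", "PHE", "PRO",
}


def validate_entity(record: dict) -> dict:
    """Check that an entity record matches the standardized silver schema."""
    if set(record) != {"id", "label", "properties"}:
        raise ValueError(f"unexpected keys: {sorted(record)}")
    prefix, sep, local_id = record["id"].partition(":")
    if not (prefix and sep and local_id):
        raise ValueError("id must be a prefixed identifier, e.g. 'MONDO:0005015'")
    if record["label"] not in VALID_NODE_LABELS:
        raise ValueError(f"unknown node label: {record['label']}")
    return record
```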
#### Edge Nodes
Each edge follows a standardized schema:
```python
{
    "from": "DrugBank:DB00945",  # Source entity ID
    "to": "MONDO:0005015",       # Target entity ID
    "label": "DRG-DIS",          # Source-Target type code
    "relation": "indication",    # Standardized relation type
    "undirected": False,         # Directionality flag
    "properties": { ... },       # Nested struct with source-specific metadata
}
```

| Edge Type | Label |
|---|---|
| Anatomy - Anatomy | ANA-ANA |
| Anatomy - Protein | ANA-PRO |
| Biological Process - Biological Process | BPO-BPO |
| Biological Process - Protein | BPO-PRO |
| Cellular Component - Cellular Component | CCO-CCO |
| Cellular Component - Protein | CCO-PRO |
| Disease - Disease | DIS-DIS |
| Disease - Phenotype | DIS-PHE |
| Disease - Protein | DIS-PRO |
| Drug - Disease | DRG-DIS |
| Drug - Drug | DRG-DRG |
| Drug - Phenotype | DRG-PHE |
| Drug - Protein | DRG-PRO |
| Exposure - Biological Process | EXP-BPO |
| Exposure - Cellular Component | EXP-CCO |
| Exposure - Disease | EXP-DIS |
| Exposure - Exposure | EXP-EXP |
| Exposure - Molecular Function | EXP-MFN |
| Exposure - Protein | EXP-PRO |
| Molecular Function - Molecular Function | MFN-MFN |
| Molecular Function - Protein | MFN-PRO |
| Pathway - Pathway | PWY-PWY |
| Pathway - Protein | PWY-PRO |
| Phenotype - Phenotype | PHE-PHE |
| Phenotype - Protein | PHE-PRO |
| Protein - Protein | PRO-PRO |
The pipeline enforces that all node and edge properties contain a `sources: (direct, indirect)` struct, allowing every record in the final graph to be traced back to its originating data sources.
### Gold
The gold layer exports the finalized knowledge graph to multiple formats. A single export_kg node takes all silver datasets and produces:
- CSV: Individual files per node/edge type, plus consolidated `nodes.csv` and `edges.csv`. Properties are JSON-encoded strings.
- Parquet: Individual files per node/edge type with native struct properties, plus consolidated files.
- Neo4j: Uses BioCypher to generate CSV files compatible with `neo4j-admin import`, then bulk imports into a Neo4j container.
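Because the CSV export serializes properties as JSON strings, consumers need to decode that column on read. A minimal sketch (the `load_nodes_csv` helper is assumed for illustration; column names follow the schema above):

```python
import csv
import json


def load_nodes_csv(path: str) -> list[dict]:
    """Read a consolidated nodes.csv and decode the JSON-encoded properties column."""
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            row["properties"] = json.loads(row["properties"])
            rows.append(row)
    return rows
```

The Parquet export avoids this round-trip entirely, since struct columns are stored natively.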
## Pipeline Execution
The entire pipeline can be run with a single command:
```shell
uv run kedro run --to-nodes gold.export_kg --runner=ParallelRunner --async
```

Kedro's DAG resolver automatically determines the execution order. The `ParallelRunner` with `--async` enables concurrent node execution where dependencies allow.
To run a subset of the pipeline, use Kedro's node selection:
```shell
# Run only the bronze layer
uv run kedro run --namespace bronze

# Run a specific node
uv run kedro run --nodes bronze.bgee

# Run up to a specific node
uv run kedro run --to-nodes silver.drug_disease

# Run from a specific node to the end
uv run kedro run --from-nodes silver.drug_disease
```

All pipelines are auto-discovered via Kedro's `find_pipelines()` mechanism. A `__default__` pipeline is registered as the union of all sub-pipelines, so `kedro run` without filters builds everything.