Architecture

The medallion architecture, pipeline layers, and data flow in Optimus.

Optimus implements a medallion architecture with four data layers, each providing increasing levels of data quality and standardization. The entire pipeline is expressed as a directed acyclic graph (DAG) of nodes built on Kedro.

Medallion Layers

Data flows through four layers, from raw ingestion to a ready-to-use knowledge graph.

Landing

The landing layer contains raw data files exactly as downloaded from external sources. No transformations are applied. Downloads are managed automatically by the OriginHooks system based on metadata in the catalog.

Files in this layer serve as the immutable foundation for reproducibility. Each landing dataset has a BLAKE2b checksum recorded in the catalog to detect upstream changes.
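BLAKE2b is available in Python's standard library, so a checksum like the one recorded in the catalog can be computed with `hashlib` alone. The helper below is an illustrative sketch, not Optimus's actual implementation; the function name and chunk size are assumptions:

```python
import hashlib

def blake2b_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through BLAKE2b and return its hex digest."""
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Streaming in fixed-size chunks keeps memory usage flat even for multi-gigabyte landing files; comparing the stored digest against a fresh one detects upstream changes.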

Optimus ships with built-in support for a wide range of biomedical data sources, including OpenTargets, Bgee, CTD, DrugBank, DrugCentral, DisGeNET, OnSIDES, Reactome, Gene Names, Ontologies (GO, HP, MONDO, UBERON), and PPI databases, among others.

Bronze

The bronze layer performs initial cleaning and normalization of raw data. Each bronze node reads from the landing layer and produces a cleaned dataset with:

  • Standardized column names (snake_case)
  • Filtered rows (removing irrelevant records)
  • Normalized identifiers (e.g., prefixing NCBI gene IDs with NCBIGene:)
  • Deduplicated and sorted records
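The four steps above can be sketched as a single cleaning function. This is a minimal, hypothetical example operating on plain dicts (an actual bronze node would work on a dataframe, and the `gene_id` column is assumed for illustration):

```python
import re

def to_snake_case(name: str) -> str:
    """Normalize a column name like 'Gene ID' or 'geneId' to snake_case."""
    name = re.sub(r"[^\w]+", "_", name.strip())
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.lower().strip("_")

def clean_bronze(rows: list[dict]) -> list[dict]:
    """Illustrative bronze step: rename columns, filter irrelevant rows,
    normalize identifiers, then deduplicate and sort."""
    cleaned = []
    for row in rows:
        row = {to_snake_case(k): v for k, v in row.items()}
        if not row.get("gene_id"):           # filter rows missing a usable ID
            continue
        gid = str(row["gene_id"])
        if not gid.startswith("NCBIGene:"):  # normalize identifiers
            row["gene_id"] = f"NCBIGene:{gid}"
        cleaned.append(row)
    # deduplicate and sort for stable, reproducible output
    unique = {tuple(sorted(r.items())): r for r in cleaned}
    return sorted(unique.values(), key=lambda r: r["gene_id"])
```

Deterministic ordering matters here: sorted, deduplicated output means reruns produce byte-identical bronze datasets.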

Silver

The silver layer constructs the unified knowledge graph from bronze data. This is where entity harmonization, cross-source merging, and schema standardization happen. Silver is split into two groups: entity nodes and edge nodes.

Entity Nodes

Each entity node follows a standardized schema:

{
    "id": "MONDO:0005015",       # Unique identifier with prefix
    "label": "DIS",              # 3-letter node type code
    "properties": { ... }        # Nested struct with source-specific metadata
}
| Node Type          | Label |
|--------------------|-------|
| Anatomy            | ANA   |
| Biological Process | BPO   |
| Cellular Component | CCO   |
| Disease            | DIS   |
| Drug               | DRG   |
| Exposure           | EXP   |
| Gene               | GEN   |
| Molecular Function | MFN   |
| Pathway            | PWY   |
| Phenotype          | PHE   |
| Protein            | PRO   |
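A small constructor can make the schema concrete. The mapping mirrors the table above; the helper name and its validation are hypothetical, not part of Optimus:

```python
# 3-letter node type codes, as listed in the table above
NODE_LABELS = {
    "anatomy": "ANA", "biological_process": "BPO", "cellular_component": "CCO",
    "disease": "DIS", "drug": "DRG", "exposure": "EXP", "gene": "GEN",
    "molecular_function": "MFN", "pathway": "PWY", "phenotype": "PHE",
    "protein": "PRO",
}

def make_entity(node_type: str, curie: str, properties: dict) -> dict:
    """Build an entity record in the standardized silver schema."""
    if ":" not in curie:
        raise ValueError(f"id must be a prefixed identifier, got {curie!r}")
    return {"id": curie, "label": NODE_LABELS[node_type], "properties": properties}
```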

Edge Nodes

Each edge follows a standardized schema:

{
    "from": "DrugBank:DB00945",  # Source entity ID
    "to": "MONDO:0005015",       # Target entity ID
    "label": "DRG-DIS",          # Source-Target type code
    "relation": "indication",    # Standardized relation type
    "undirected": False,         # Directionality flag
    "properties": { ... }        # Nested struct with source-specific metadata
}
| Edge Type                                 | Label   |
|-------------------------------------------|---------|
| Anatomy - Anatomy                         | ANA-ANA |
| Anatomy - Protein                         | ANA-PRO |
| Biological Process - Biological Process   | BPO-BPO |
| Biological Process - Protein              | BPO-PRO |
| Cellular Component - Cellular Component   | CCO-CCO |
| Cellular Component - Protein              | CCO-PRO |
| Disease - Disease                         | DIS-DIS |
| Disease - Phenotype                       | DIS-PHE |
| Disease - Protein                         | DIS-PRO |
| Drug - Disease                            | DRG-DIS |
| Drug - Drug                               | DRG-DRG |
| Drug - Phenotype                          | DRG-PHE |
| Drug - Protein                            | DRG-PRO |
| Exposure - Biological Process             | EXP-BPO |
| Exposure - Cellular Component             | EXP-CCO |
| Exposure - Disease                        | EXP-DIS |
| Exposure - Exposure                       | EXP-EXP |
| Exposure - Molecular Function             | EXP-MFN |
| Exposure - Protein                        | EXP-PRO |
| Molecular Function - Molecular Function   | MFN-MFN |
| Molecular Function - Protein              | MFN-PRO |
| Pathway - Pathway                         | PWY-PWY |
| Pathway - Protein                         | PWY-PRO |
| Phenotype - Phenotype                     | PHE-PHE |
| Phenotype - Protein                       | PHE-PRO |
| Protein - Protein                         | PRO-PRO |

The pipeline enforces that all node and edge properties contain a sources: (direct, indirect) struct, allowing every record in the final graph to be traced back to its originating data sources.
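When two data sources produce the same record, their provenance must be combined rather than overwritten. A minimal sketch of such a merge, assuming the sources struct holds lists of source names under direct and indirect keys:

```python
def merge_sources(a: dict, b: dict) -> dict:
    """Combine the sources structs of two records describing the same
    entity or edge, keeping direct and indirect provenance separate."""
    return {
        "direct": sorted(set(a.get("direct", [])) | set(b.get("direct", []))),
        "indirect": sorted(set(a.get("indirect", [])) | set(b.get("indirect", []))),
    }
```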

Gold

The gold layer exports the finalized knowledge graph to multiple formats. A single export_kg node takes all silver datasets and produces:

  • CSV: Individual files per node/edge type, plus consolidated nodes.csv and edges.csv. Properties are JSON-encoded strings.
  • Parquet: Individual files per node/edge type with native struct properties, plus consolidated files.
  • Neo4j: Uses BioCypher to generate CSV files compatible with neo4j-admin import, then bulk imports into a Neo4j container.
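JSON-encoding the nested properties struct is what lets a flat CSV carry the same information as the Parquet export. A hedged sketch with the standard library (the function name is an assumption, not the actual export_kg code):

```python
import csv
import json

def write_nodes_csv(nodes: list[dict], fh) -> None:
    """Write node records to CSV, JSON-encoding the nested properties
    struct so it survives the flat format."""
    writer = csv.DictWriter(fh, fieldnames=["id", "label", "properties"])
    writer.writeheader()
    for node in nodes:
        writer.writerow({**node, "properties": json.dumps(node["properties"])})
```

Consumers of the CSV export recover the struct with json.loads on the properties column; Parquet consumers read it natively.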

Pipeline Execution

The entire pipeline can be run with a single command:

uv run kedro run --to-nodes gold.export_kg --runner=ParallelRunner --async

Kedro's DAG resolver automatically determines the execution order. The ParallelRunner executes independent nodes concurrently, and --async additionally loads and saves datasets asynchronously within each node.

To run a subset of the pipeline, use Kedro's node selection:

# Run only the bronze layer
uv run kedro run --namespace bronze

# Run a specific node
uv run kedro run --nodes bronze.bgee

# Run up to a specific node
uv run kedro run --to-nodes silver.drug_disease

# Run from a specific node to the end
uv run kedro run --from-nodes silver.drug_disease

All pipelines are auto-discovered via Kedro's find_pipelines() mechanism. A __default__ pipeline is registered as the union of all sub-pipelines, so kedro run without filters builds everything.