
Download

How to get OptimusKG release files and understand the export formats.

OptimusKG is distributed as a set of CSV and Parquet files, available as a release archive from GitHub.

Release Files

Download the latest release:

# Download the release archive
wget https://github.com/mims-harvard/optimus/releases/download/0.56.0/release.zip

# Extract
unzip release.zip

The archive contains the full knowledge graph exported in multiple formats.

File Structure

anatomy.csv
biological_process.csv
cellular_component.csv
disease.csv
drug.csv
exposure.csv
gene.csv
molecular_function.csv
pathway.csv
phenotype.csv
anatomy_anatomy.csv
anatomy_protein.csv
drug_disease.csv
protein_protein.csv
...
nodes.csv
edges.csv
nodes.parquet
edges.parquet

Each format includes individual files per node and edge type, plus consolidated nodes and edges files that combine all types.

CSV Format

Node CSV

Columns: id, label, properties

id,label,properties
MONDO:0005015,DIS,"{""sources"":{""direct"":[""opentargets""],""indirect"":[""mondo""]},""name"":""diabetes mellitus"",""description"":""A metabolic disease ...""}"
ENSG00000000003,GEN,"{""sources"":{""direct"":[""opentargets""],""indirect"":[]},""symbol"":""TSPAN6"",""biotype"":""protein_coding""}"

Edge CSV

Columns: from, to, label, relation, undirected, properties

from,to,label,relation,undirected,properties
DrugBank:DB00945,MONDO:0005015,DRG-DIS,indication,false,"{""sources"":{""direct"":[""opentargets""],""indirect"":[""chembl""]},""highest_clinical_trial_phase"":4.0}"

The properties column is a JSON-encoded string containing all type-specific properties and provenance metadata.
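Because properties arrives as a string, it has to be decoded before nested fields can be used. The following is a minimal sketch using only the Python standard library. The two node rows shown above are embedded inline so the snippet is self-contained, and the helper name read_rows is illustrative, not part of OptimusKG.

```python
import csv
import io
import json

# Two node rows copied from the example above; a stand-in for an opened nodes.csv.
SAMPLE_NODES = '''id,label,properties
MONDO:0005015,DIS,"{""sources"":{""direct"":[""opentargets""],""indirect"":[""mondo""]},""name"":""diabetes mellitus"",""description"":""A metabolic disease ...""}"
ENSG00000000003,GEN,"{""sources"":{""direct"":[""opentargets""],""indirect"":[]},""symbol"":""TSPAN6"",""biotype"":""protein_coding""}"
'''


def read_rows(handle):
    """Yield each CSV row as a dict with the properties column decoded from JSON."""
    for row in csv.DictReader(handle):
        row["properties"] = json.loads(row["properties"])
        yield row


rows = list(read_rows(io.StringIO(SAMPLE_NODES)))
for row in rows:
    # Provenance is nested under properties["sources"].
    print(row["id"], row["label"], row["properties"]["sources"]["direct"])
```

To read a real file, pass an opened handle instead of the inline sample, for example open(path, newline="", encoding="utf-8"), where path points to a CSV file in the extracted archive.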

Parquet Format

Parquet files come in two variants:

  • Individual files (e.g., nodes/disease.parquet): Properties are stored as native, fully typed struct columns, which Polars reads as Struct. This is the most efficient format for analysis, as nested fields can be queried directly without JSON parsing.

  • Consolidated files (nodes.parquet, edges.parquet): Properties are JSON-encoded strings (same as CSV) because different entity types have different property schemas and cannot be stored as a single native struct.

For analysis workflows, prefer the individual Parquet files over consolidated ones. They preserve native types (nested structs, lists, booleans) and are significantly faster to query.

Reading with Polars

import polars as pl

# Read nodes
nodes = pl.read_parquet("release/kg/parquet/nodes.parquet")

# Read edges
edges = pl.read_parquet("release/kg/parquet/edges.parquet")

Neo4j Import

OptimusKG can also be exported to Neo4j using the BioCypher framework. The Neo4j export generates CSV files compatible with neo4j-admin import and bulk-loads them into a Neo4j database.

Neo4j export is available when building OptimusKG from source using the Optimus framework. See the Optimus CLI documentation for details.
