What is OptimusKG?
An explanation of OptimusKG's purpose, importance, and design principles.
OptimusKG is grounded in two foundational principles: reproducibility and extensibility.
To provide reproducible knowledge graphs, OptimusKG requires every data transformation to be deterministic, implemented as an infrastructure-agnostic, single-file Python function. To enable extensibility, OptimusKG is designed as a superset of the Kedro framework (hosted by the Linux Foundation), providing a uniform project template, data abstraction, configuration, and pipeline assembly.
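For illustration only, a transformation in this style is a pure Python function wired into the pipeline as a plain Kedro node. The function, dataset, and column names below (`normalise_gene_symbols`, `raw_genes`, `hgnc_symbol`) are hypothetical, not part of OptimusKG:

```python
import pandas as pd
from kedro.pipeline import node


def normalise_gene_symbols(raw_genes: pd.DataFrame) -> pd.DataFrame:
    """Deterministic transformation: the same input always yields the same output."""
    genes = raw_genes.copy()
    genes["hgnc_symbol"] = genes["hgnc_symbol"].str.strip().str.upper()
    return genes.drop_duplicates(subset="hgnc_symbol")


# The function is registered as a regular Kedro node; its inputs and outputs
# refer to dataset names defined in the catalog.
normalise_genes_node = node(
    func=normalise_gene_symbols,
    inputs="raw_genes",
    outputs="genes",
)
```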
We assume familiarity with the Kedro framework. If you are new to Kedro, we recommend reading the Kedro documentation before continuing.
Core Quality Attributes
Beyond Kedro, OptimusKG provides a data lake organized according to the medallion architecture, designed to overcome the limitations of ad hoc biomedical data integration systems.
Following best practices for data governance, OptimusKG prioritizes four key attributes: centralized data cataloging, built-in data quality management, data lineage, and data discovery.
Centralized Data Cataloging
In OptimusKG, the data catalog is the single source of truth for all datasets, their schemas, their lineage, and their metadata.
The catalog is specified as a set of version-controlled YAML files that map the names of node inputs and outputs (as keys) to their corresponding datasets.
Under the hood, the catalog is backed by the Kedro Data Catalog.
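As a minimal sketch following the standard Kedro YAML conventions (and assuming a recent `kedro-datasets` release where the pandas CSV type is `pandas.CSVDataset`), a catalog entry might look like this; the dataset name and file path are hypothetical:

```yaml
# conf/base/catalog.yml -- illustrative entry; name and path are hypothetical
raw_genes:
  type: pandas.CSVDataset
  filepath: data/01_raw/hgnc_genes.csv
```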
Built-in Data Quality Management
OptimusKG provides built-in data quality checks (injected via Kedro Hooks) in two stages of the pipeline:
- Data Ingestion: OptimusKG has schema enforcement mechanisms like column-level type checks (e.g. validating HGNC gene symbols as strings).
- Data Transformation: OptimusKG has validation hooks that check file formats (e.g. `.owl` ontology structures), column naming conventions (e.g. enforcing `snake_case` in `.csv` files), and data quality metrics (e.g. checking for missing values).
You can find more information about the data quality checks in the Hooks section.
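As a minimal sketch of how such a check can be injected with a Kedro Hook, the class below validates a hypothetical `hgnc_symbol` column after ingestion. The hook class, dataset shape, and column name are assumptions for illustration, not OptimusKG's actual implementation:

```python
import pandas as pd
from kedro.framework.hooks import hook_impl


class GeneSymbolTypeCheckHook:
    """Illustrative ingestion-stage check: loaded gene tables must carry
    string-typed HGNC symbols. Names here are hypothetical."""

    @hook_impl
    def after_dataset_loaded(self, dataset_name: str, data) -> None:
        # Only inspect tabular datasets that expose an HGNC symbol column.
        if isinstance(data, pd.DataFrame) and "hgnc_symbol" in data.columns:
            if not data["hgnc_symbol"].map(lambda v: isinstance(v, str)).all():
                raise TypeError(
                    f"Dataset '{dataset_name}' contains non-string HGNC symbols."
                )
```

A hook like this would be registered through the `HOOKS` tuple in the project's `settings.py`, following the usual Kedro convention.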
Data Lineage
All datasets in OptimusKG have lineage metadata that is tracked through the pipeline. This way, you can trace all nodes and edges in the final graph to their corresponding datasets.
OptimusKG ships with Kedro Viz, which provides a complete visualization of the data lineage.
Data Discovery
OptimusKG's catalog and pipelines use semantic URIs, allowing you to query datasets, their lineage, and their metadata. We provide a unified interface to find datasets, create new ones, and process them.
You can find more information about the catalog in The Catalog section.
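At the Kedro level, discovery is backed by the Data Catalog API. The sketch below shows programmatic discovery from inside a project directory; the search pattern and dataset name are hypothetical, and this is not OptimusKG's unified interface itself:

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# Illustrative discovery session run from inside a Kedro project;
# the search pattern and dataset name below are hypothetical.
bootstrap_project(Path.cwd())
with KedroSession.create(project_path=Path.cwd()) as session:
    catalog = session.load_context().catalog
    gene_datasets = catalog.list("genes")  # regex search over registered dataset names
    raw_genes = catalog.load("raw_genes")  # materialise a single dataset
```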
How is this guide organized?