What is OptimusKG?
An explanation of OptimusKG's purpose, importance, and design principles.
OptimusKG is grounded in two foundational principles: reproducibility and extensibility.
To provide reproducible knowledge graphs, OptimusKG requires every data transformation to be deterministic, implemented as an infrastructure-agnostic, single-file Python function. To enable extensibility, OptimusKG is designed as a superset of the Kedro framework (hosted by the Linux Foundation), providing a uniform project template, data abstraction, configuration, and pipeline assembly.
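For illustration only, a transformation in this style is a pure Python function wired into the pipeline as a plain Kedro node. The function, dataset, and column names below (`normalise_gene_symbols`, `raw_genes`, `hgnc_symbol`) are hypothetical, not part of OptimusKG:

```python
import pandas as pd
from kedro.pipeline import node


def normalise_gene_symbols(raw_genes: pd.DataFrame) -> pd.DataFrame:
    """Deterministic transformation: the same input always yields the same output."""
    genes = raw_genes.copy()
    genes["hgnc_symbol"] = genes["hgnc_symbol"].str.strip().str.upper()
    return genes.drop_duplicates(subset="hgnc_symbol")


# The function is registered as a regular Kedro node; its inputs and outputs
# refer to dataset names defined in the catalog.
normalise_genes_node = node(
    func=normalise_gene_symbols,
    inputs="raw_genes",
    outputs="genes",
)
```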
We assume familiarity with the Kedro framework. If you are new to Kedro, we recommend reading the Kedro documentation before continuing.
Core Quality Attributes
Beyond Kedro, OptimusKG provides a data lake organized according to the medallion architecture, designed to overcome the limitations of ad hoc biomedical data integration systems.
Following best practices for data governance, OptimusKG prioritizes four key attributes: centralized data cataloging, built-in data quality management, data lineage, and data discovery.
Centralized Data Cataloging
In OptimusKG, the data catalog is the single source of truth for all datasets, their schemas, their lineage, and their metadata.
The catalog is specified as a set of version-controlled YAML files that map the names of node inputs and outputs (as keys) to their corresponding datasets.
Under the hood, the catalog is backed by the Kedro Data Catalog.
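As a minimal sketch following the standard Kedro YAML conventions (and assuming a recent `kedro-datasets` release where the pandas CSV type is `pandas.CSVDataset`), a catalog entry might look like this; the dataset name and file path are hypothetical:

```yaml
# conf/base/catalog.yml -- illustrative entry; name and path are hypothetical
raw_genes:
  type: pandas.CSVDataset
  filepath: data/01_raw/hgnc_genes.csv
```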
Built-in Data Quality Management
OptimusKG provides built-in data quality checks (injected via Kedro Hooks) in two stages of the pipeline:
- Data Ingestion: OptimusKG has schema enforcement mechanisms like column-level type checks (e.g. validating HGNC gene symbols as strings).
- Data Transformation: OptimusKG has validation hooks that check file formats (e.g. `.owl` ontology structures), column naming conventions (e.g. enforcing `snake_case` in `.csv` files), and data quality metrics (e.g. checking for missing values).
You can find more information about the data quality checks in the Hooks section.
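As a minimal sketch of how such a check can be injected with a Kedro Hook, the class below validates a hypothetical `hgnc_symbol` column after ingestion. The hook class, dataset shape, and column name are assumptions for illustration, not OptimusKG's actual implementation:

```python
import pandas as pd
from kedro.framework.hooks import hook_impl


class GeneSymbolTypeCheckHook:
    """Illustrative ingestion-stage check: loaded gene tables must carry
    string-typed HGNC symbols. Names here are hypothetical."""

    @hook_impl
    def after_dataset_loaded(self, dataset_name: str, data) -> None:
        # Only inspect tabular datasets that expose an HGNC symbol column.
        if isinstance(data, pd.DataFrame) and "hgnc_symbol" in data.columns:
            if not data["hgnc_symbol"].map(lambda v: isinstance(v, str)).all():
                raise TypeError(
                    f"Dataset '{dataset_name}' contains non-string HGNC symbols."
                )
```

A hook like this would be registered through the `HOOKS` tuple in the project's `settings.py`, following the usual Kedro convention.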
Data Lineage
All datasets in OptimusKG have lineage metadata that is tracked through the pipeline. This way, you can trace all nodes and edges in the final graph to their corresponding datasets.
OptimusKG ships with Kedro Viz, which provides a complete visualization of the data lineage.
Data Discovery
OptimusKG's catalog and pipelines use semantic URIs, allowing you to query datasets, their lineage, and their metadata. We provide a unified interface to find datasets, create new ones, and process them.
You can find more information about the catalog in The Catalog section.
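At the Kedro level, discovery is backed by the Data Catalog API. The sketch below shows programmatic discovery from inside a project directory; the search pattern and dataset name are hypothetical, and this is not OptimusKG's unified interface itself:

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# Illustrative discovery session run from inside a Kedro project;
# the search pattern and dataset name below are hypothetical.
bootstrap_project(Path.cwd())
with KedroSession.create(project_path=Path.cwd()) as session:
    catalog = session.load_context().catalog
    gene_datasets = catalog.list("genes")  # regex search over registered dataset names
    raw_genes = catalog.load("raw_genes")  # materialise a single dataset
```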
How is this guide organized?