The Catalog
A brief introduction to the OptimusKG catalog.
To uniquely identify a resource in the catalog, the catalog folder structure is organized as follows:
Where:
- layer is the layer of the medallion architecture where the dataset is located (e.g. landing, bronze, silver, gold).
- namespace is a short name for the data provider or topic. Namespaces must be in snake case and unique within the catalog.
- dataset_name is the short name for the data product. Dataset names must be in snake case and unique within the namespace.
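The naming rules above can be checked mechanically. The following is a minimal sketch, not part of OptimusKG: the helper names `is_snake_case` and `dataset_id` are hypothetical, the exact snake-case pattern is an assumption, and the layer list only covers the layers named as examples above. The dotted identifier form (`layer.namespace.dataset_name`) follows the identifiers used in the built-in dataset list.

```python
import re

# Assumption: "snake case" means lowercase letters and digits separated by
# single underscores, starting with a letter. The real rule may differ.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")

# Assumption: only the layers given as examples in the text are accepted.
LAYERS = ("landing", "bronze", "silver", "gold")


def is_snake_case(name: str) -> bool:
    """Return True if the whole name matches the assumed snake-case pattern."""
    return SNAKE_CASE.fullmatch(name) is not None


def dataset_id(layer: str, namespace: str, dataset_name: str) -> str:
    """Build a dotted identifier such as 'landing.bgee.<dataset_name>'."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    for label, value in (("namespace", namespace), ("dataset_name", dataset_name)):
        if not is_snake_case(value):
            raise ValueError(f"{label} must be snake case: {value!r}")
    return f"{layer}.{namespace}.{dataset_name}"


if __name__ == "__main__":
    print(dataset_id("landing", "bgee", "homo_sapiens_expressions_advanced"))
```

Uniqueness within the catalog or namespace is not checked here; that would require scanning the existing catalog entries.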
For example, the catalog configuration for the raw data from the bgee provider for the dataset homo_sapiens_expressions_advanced would be located at:
And, inside the homo_sapiens_expressions_advanced.yml file, we would have the following configuration:
As this example shows, the actual data of the dataset is stored in the data/<layer>/<namespace>/<dataset_name>.<file_format> file.
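The data path convention can be expressed as a small helper. This is an illustrative sketch only: the function `data_path` is hypothetical, and the `tsv` extension in the usage line is a placeholder, since the actual file format depends on the dataset's configuration.

```python
from pathlib import PurePosixPath


def data_path(layer: str, namespace: str, dataset_name: str,
              file_format: str, root: str = "data") -> PurePosixPath:
    """Compose data/<layer>/<namespace>/<dataset_name>.<file_format>."""
    return PurePosixPath(root) / layer / namespace / f"{dataset_name}.{file_format}"


if __name__ == "__main__":
    # "tsv" is a placeholder extension chosen for illustration.
    print(data_path("landing", "bgee", "homo_sapiens_expressions_advanced", "tsv"))
```

Using `PurePosixPath` keeps the forward-slash layout shown in the text regardless of the operating system the sketch runs on.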
Built-in datasets
OptimusKG comes with a set of built-in datasets (including ontologies) and transformation pipelines that are used to bootstrap the knowledge graph. The following datasets are currently available:
Ontologies (landing.ontology.*)
- Biolink Model Ontology
- Disease Ontology database of human disease relationships
- Gene Ontology pathway database
- Uberon Anatomical Ontology
- Human Phenotype Ontology
- MONDO Disease Ontology
- Orphanet Rare Disease Ontology
High-Throughput & Expression Data
- Bgee Homo sapiens Expression (landing.bgee.homo_sapiens_expressions_advanced)
- Entrez gene database of gene-specific information (landing.ncbigene.gene2go)
- Gene Name Mappings (landing.gene_names.gene_names)
Chemical & Toxicogenomics
- Comparative Toxicogenomics Database (landing.ctd.ctd_exposure_events)
- DrugBank pharmaceutical database (landing.drugbank.*)
- Drug Central database of drug-disease interactions (landing.drugcentral.psql_dump)
Pathways & Networks
- Reactome pathway knowledge base (landing.reactome.reactome_pathways, landing.reactome.reactome_pathways_relation)
- NCBI2Reactome Mappings (landing.reactome.ncbi2_reactome)
- Human Protein-Protein Interaction (PPI) Network
OpenTargets Partitioned Data
- Evidence by Source (landing.opentargets.evidence.{source_id})
- Targets, Molecules, Diseases, Phenotype Links (landing.opentargets.targets, landing.opentargets.molecule, landing.opentargets.diseases, landing.opentargets.disease_to_phenotype)
- PrimeKG Node & Edge Tables (landing.opentargets.primekg_nodes, landing.opentargets.primekg_edges)
- Cross-Ontology Mappings (landing.opentargets.mondo_efo_mappings, landing.opentargets.drug_mappings)
Data ingestion
OptimusKG provides a set of tools to ingest versioned data from a variety of sources.