The Catalog

So that every resource can be identified uniquely, the catalog folder is organized as follows:

catalog/<layer>/<namespace>/<dataset_name>.yml

Where:

layer is the layer of the medallion architecture where the dataset is located (e.g. landing, bronze, silver, gold).
namespace is a short name for the data provider or topic. Namespaces must be in snake case and unique within the catalog.
dataset_name is the short name for the data product. Dataset names must be in snake case and unique within the namespace.
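
The naming rules above can be sketched as a small helper that builds the catalog file path and rejects names that are not snake case. This is an illustrative sketch only: catalog_path, LAYERS, and the SNAKE_CASE pattern are hypothetical names rather than part of OptimusKG, and the layer list is assumed to be limited to the four examples given above.

```python
import re
from pathlib import PurePosixPath

# Assumption: only the four layers named in the text are allowed.
LAYERS = ("landing", "bronze", "silver", "gold")

# Assumption: "snake case" means lowercase words or digits joined by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def catalog_path(layer: str, namespace: str, dataset_name: str) -> PurePosixPath:
    """Build catalog/<layer>/<namespace>/<dataset_name>.yml (hypothetical helper)."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    for label, value in (("namespace", namespace), ("dataset_name", dataset_name)):
        if not SNAKE_CASE.fullmatch(value):
            raise ValueError(f"{label} must be snake case: {value!r}")
    return PurePosixPath("catalog", layer, namespace, f"{dataset_name}.yml")


if __name__ == "__main__":
    print(catalog_path("landing", "bgee", "homo_sapiens_expressions_advanced"))
```

Uniqueness within the catalog or namespace is not checked here; with this layout the file system already prevents two files from sharing one path.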

For example, the catalog configuration for the raw data from the bgee provider for the dataset homo_sapiens_expressions_advanced would be located at:

catalog/landing/bgee/homo_sapiens_expressions_advanced.yml

And, inside the homo_sapiens_expressions_advanced.yml file, we would have the following configuration:

# file: catalog/landing/bgee/homo_sapiens_expressions_advanced.yml

landing.bgee.homo_sapiens_expressions_advanced:
  type: polars.CSVDataset
  filepath: "data/landing/bgee/Homo_sapiens_expr_advanced.tsv"
  load_args:
    separator: "\t"
    infer_schema: false
    schema:
      "Gene ID": "${pl:String}"
      "Gene name": "${pl:String}"
      "Anatomical entity ID": "${pl:String}"
      "Anatomical entity name": "${pl:String}"
      "Expression": "${pl:String}"
      "Call quality": "${pl:String}"
      "Expression rank": "${pl:Float64}"
      "Including observed data": "${pl:Utf8}"
      "Affymetrix data": "${pl:Utf8}"
      "Affymetrix  experiment count showing expression of this gene in this condition or in sub-conditions with a high quality": "${pl:Int64}"
      "Affymetrix experiment count showing expression of this gene in this condition or in sub-conditions with a low quality": "${pl:Int64}"
      "Affymetrix experiment count showing absence of expression of this gene in this condition or valid parent conditions with a high quality": "${pl:Int64}"
      "Affymetrix experiment count showing absence of expression of this gene in this condition or valid parent conditions with a low quality": "${pl:Int64}"
      "Including Affymetrix observed data": "${pl:Utf8}"
      "EST data": "${pl:Utf8}"
      "EST experiment count showing expression of this gene in this condition or in sub-conditions with a high quality": "${pl:Int64}"
      "EST experiment count showing expression of this gene in this condition or in sub-conditions with a low quality": "${pl:Int64}"
      "Including EST observed data": "${pl:Utf8}"
      "In situ hybridization data": "${pl:Utf8}"
      "In situ hybridization experiment count showing expression of this gene in this condition or in sub-conditions with a high quality": "${pl:Int64}"
      "In situ hybridization experiment count showing expression of this gene in this condition or in sub-conditions with a low quality": "${pl:Int64}"
      "In situ hybridization experiment count showing absence of expression of this gene in this condition or valid parent conditions with a high quality": "${pl:Int64}"
      "In situ hybridization experiment count showing absence of expression of this gene in this condition or valid parent conditions with a low quality": "${pl:Int64}"
      "Including in situ hybridization observed data": "${pl:Utf8}"
      "RNA-Seq data": "${pl:Utf8}"
      "RNA-Seq  experiment count showing expression of this gene in this condition or in sub-conditions with a high quality": "${pl:Int64}"
      "RNA-Seq  experiment count showing expression of this gene in this condition or in sub-conditions with a low quality": "${pl:Int64}"
      "RNA-Seq  experiment count showing absence of expression of this gene in this condition or valid parent conditions with a high quality": "${pl:Int64}"
      "RNA-Seq  experiment count showing absence of expression of this gene in this condition or valid parent conditions with a low quality": "${pl:Int64}"
      "Including RNA-Seq observed data": "${pl:Utf8}"

As this example shows, the actual data for a dataset is stored in the data/<layer>/<namespace>/<dataset_name>.<file_format> file.
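
The relationship between a catalog key and its data location can be sketched as follows. The example splits a key such as landing.bgee.homo_sapiens_expressions_advanced into its three parts and derives the data directory from the first two. CatalogKey, parse_catalog_key, and data_dir are hypothetical names used for illustration, not OptimusKG functions.

```python
from typing import NamedTuple


class CatalogKey(NamedTuple):
    layer: str
    namespace: str
    dataset_name: str


def parse_catalog_key(key: str) -> CatalogKey:
    """Split '<layer>.<namespace>.<dataset_name>' into its parts (hypothetical helper)."""
    # Split on the first two dots only, so any remaining dots stay in dataset_name.
    parts = key.split(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <layer>.<namespace>.<dataset_name>, got {key!r}")
    return CatalogKey(*parts)


def data_dir(key: str) -> str:
    """Return the data/<layer>/<namespace> directory for a catalog key."""
    parsed = parse_catalog_key(key)
    return f"data/{parsed.layer}/{parsed.namespace}"


if __name__ == "__main__":
    key = "landing.bgee.homo_sapiens_expressions_advanced"
    print(parse_catalog_key(key))
    print(data_dir(key))
```

For the Bgee example, the derived directory is the one that contains the filepath declared in the YAML configuration.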

Built-in datasets

OptimusKG ships with a set of built-in datasets (including ontologies) and transformation pipelines that are used to bootstrap the knowledge graph. The following datasets are currently available:

Ontologies (landing.ontology.*)

High-Throughput & Expression Data

Bgee Homo sapiens Expression (landing.bgee.homo_sapiens_expressions_advanced)
Entrez gene database of gene-specific information (landing.ncbigene.gene2go)
Gene Name Mappings (landing.gene_names.gene_names)

Chemical & Toxicogenomics

Comparative Toxicogenomics Database (landing.ctd.ctd_exposure_events)
DrugBank pharmaceutical database (landing.drugbank.*)
Drug Central database of drug-disease interactions (landing.drugcentral.psql_dump)

Pathways & Networks

Reactome pathway knowledge base (landing.reactome.reactome_pathways, landing.reactome.reactome_pathways_relation)
NCBI2Reactome Mappings (landing.reactome.ncbi2_reactome)
Human Protein-Protein Interaction (PPI) Network

OpenTargets Partitioned Data

Evidence by Source (landing.opentargets.evidence.{source_id})
Targets, Molecules, Diseases, Phenotype Links (landing.opentargets.targets, landing.opentargets.molecule, landing.opentargets.diseases, landing.opentargets.disease_to_phenotype)
PrimeKG Node & Edge Tables (landing.opentargets.primekg_nodes, landing.opentargets.primekg_edges)
Cross-Ontology Mappings (landing.opentargets.mondo_efo_mappings, landing.opentargets.drug_mappings)
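
Several entries above use a trailing * (for example landing.drugbank.*) to stand for a group of keys. A minimal sketch of selecting keys by such a pattern, assuming shell-style wildcard semantics, is shown below. How OptimusKG itself resolves these patterns is not stated here; select_keys and KNOWN_KEYS are hypothetical names, and the keys are copied from the list above.

```python
from fnmatch import fnmatchcase

# A few catalog keys copied from the list of built-in datasets.
KNOWN_KEYS = [
    "landing.bgee.homo_sapiens_expressions_advanced",
    "landing.ncbigene.gene2go",
    "landing.reactome.reactome_pathways",
    "landing.reactome.reactome_pathways_relation",
    "landing.reactome.ncbi2_reactome",
]


def select_keys(pattern: str, keys: list[str]) -> list[str]:
    """Return the keys matching a shell-style pattern (case-sensitive), in input order."""
    return [key for key in keys if fnmatchcase(key, pattern)]


if __name__ == "__main__":
    for key in select_keys("landing.reactome.*", KNOWN_KEYS):
        print(key)
```

Note that with fnmatch semantics * also matches dots, so a pattern like landing.* selects every key in the landing layer.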

Data ingestion

OptimusKG provides a set of tools to ingest versioned data from a variety of sources.
