Data Sources
A big part of the OptimusKG project is the ability to handle biomedical data sources. However, these sources are highly heterogeneous, are sometimes neither open nor well documented, and are often hosted on unreliable servers.
Although some data sources do not provide a correct (or at least deterministic) way to download the data, OptimusKG takes a best-effort approach to handle as many cases as possible.
Since, for licensing reasons, we cannot host the data in an OptimusKG object storage, we provide a way to download the data to a local file system while implementing an interface for data sources that allows us to:
- Download the data from a remote source using a unified provider configuration, allowing versioning when possible.
- Store the data in the Landing layer of the local file system, creating all necessary subfolders.
- Verify the integrity of the data by comparing the checksum of the downloaded file with the one defined in the catalog. We use the ChecksumHook to do this.
- Provide a way to load the data into a Kedro dataset, using Custom Datasets when necessary and validating the data against the schema defined in the catalog.
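The download, store, and verify steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the OptimusKG implementation: the function names, the `landing/<provider>/<version>/` folder layout, and the choice of SHA-256 are all assumptions made for the example, and the real ChecksumHook may work differently.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def landing_path(root: Path, provider: str, version: str, filename: str) -> Path:
    """Return <root>/landing/<provider>/<version>/<filename>, creating subfolders.

    The folder layout is hypothetical and only illustrates the Landing layer idea.
    """
    target_dir = root / "landing" / provider / version
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / filename


def download(url: str, target: Path) -> Path:
    """Stream a remote file to disk without loading it fully into memory."""
    with urllib.request.urlopen(url) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out)
    return target


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: Path, expected: str) -> None:
    """Raise ValueError if the file's digest differs from the expected one."""
    actual = sha256_of(path)
    if actual != expected.lower():
        raise ValueError(
            f"Checksum mismatch for {path}: expected {expected}, got {actual}"
        )
```

In this sketch the expected digest would come from the catalog entry for the data source. Raising on a mismatch keeps a corrupted or silently changed file from being loaded into later stages.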
Data sources
Bgee
A database for retrieval and comparison of gene expression patterns across multiple animal species.
The Comparative Toxicogenomics Database
A database focused on the impact of environmental exposures on human health.
DrugBank
A database that contains pharmacological knowledge.
DrugCentral
A resource that curates information about drug-disease interactions.
Gene Names
Gene Names is a database of gene names.
Ontologies
Ontologies are a collection of terms and definitions that are used to describe the data in a knowledge graph.
Open Targets
Open Targets is a platform that integrates evidence linking drug targets to diseases.
Reactome
Reactome is a database of biological pathways.