
Data Sources

A big part of the OptimusKG project is the ability to handle biomedical data sources. However, this kind of data is very heterogeneous, sometimes neither open nor well documented, and often hosted on unreliable servers.

Although some data sources do not provide a correct (or at least deterministic) way to download the data, OptimusKG takes a best-effort approach to handle as many cases as possible.

Since, for licensing reasons, we cannot host the data in an OptimusKG object storage, we instead download the data to the local file system through a data source interface that allows us to:

  • Download the data from a remote source using a unified provider configuration, allowing versioning when possible.
  • Store the data in the Landing layer of the local file system, creating all necessary subfolders.
  • Verify the integrity of the data by comparing the checksum of the downloaded file with the one defined in the catalog. We use the ChecksumHook to do this.
  • Load the data into a Kedro dataset, using Custom Datasets when necessary and validating the data against the schema defined in the catalog.
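
To make the landing and integrity steps above concrete, here is a minimal sketch in plain Python. It is not the OptimusKG implementation and does not reproduce the ChecksumHook. The function names (landing_path, file_checksum, verify_checksum), the landing/<source>/<version> folder layout, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def landing_path(root: Path, source: str, version: str, filename: str) -> Path:
    """Return the target path for a downloaded file, creating subfolders.

    Hypothetical layout: <root>/landing/<source>/<version>/<filename>.
    """
    target_dir = root / "landing" / source / version
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / filename


def file_checksum(path: Path, algorithm: str = "sha256",
                  chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads are never fully in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: Path, expected: str, algorithm: str = "sha256") -> None:
    """Raise ValueError if the file's checksum differs from the expected one.

    `expected` stands in for the value that would be declared in the catalog.
    """
    actual = file_checksum(path, algorithm)
    if actual != expected.lower():
        raise ValueError(
            f"Checksum mismatch for {path}: expected {expected}, got {actual}"
        )
```

A caller would write the downloaded bytes to landing_path(...) and then call verify_checksum(...) with the checksum taken from the catalog entry. Failing loudly on a mismatch keeps a corrupted or silently updated upstream file from reaching later layers.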
