
Gene2Go Reader Dataset

Dataset to handle Gene Ontology (GO) terms.

Gene2GoReaderDataset reads Gene Ontology (GO) terms from the landing.ncbigene.gene2go file. It uses the goatools library under the hood.

class optimuskg.datasets.Gene2GoReaderDataset(
    *,
    filepath: str,
    load_args: dict[str, Any] | None = None,
    version: Version | None = None,
    credentials: dict[str, Any] | None = None,
    fs_args: dict[str, Any] | None = None,
    metadata: dict[str, Any] | None = None,
)

Gene2GoReaderDataset loads data from a Gene Annotation File (GAF) using goatools.Gene2GoReader, with built-in support for local or remote filesystems (via fsspec) and optional on-the-fly decompression of gzip-compressed GAFs.

Example usage for the YAML API:

gene2go:
  type: optimuskg.datasets.Gene2GoReaderDataset
  filepath: data/01_raw/gene2go.gaf.gz
  load_args:
    taxids: [9606]
  fs_args:
    open_args_load:
      encoding: "utf-8"
  metadata:
    description: "Human Gene Ontology associations"

Attributes

DEFAULT_LOAD_ARGS

DEFAULT_FS_ARGS

Methods

__init__(*, filepath, load_args=None, version=None, credentials=None, fs_args=None, metadata=None) : Constructs the dataset, configuring protocol, filesystem, versioning, and optional gzip decompression settings.

_describe() → dict[str, Any] : Returns a dictionary with the dataset's identifying parameters: filepath, protocol, load_args, and version.

load() → Gene2GoReader : Loads (and if necessary decompresses) the GAF file and returns a Gene2GoReader instance.

  • Detects a .gz extension and decompresses to .gaf if no decompressed copy exists.
  • Suppresses stdout during Gene2GoReader initialization.
  • Raises DatasetError on failure to decompress or load.
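
The first two behaviors listed for load() can be sketched with the standard library alone. This is an illustration of the general technique, not the dataset's actual implementation: ensure_decompressed and call_quietly are hypothetical helper names, and the sketch works on local paths only, whereas the real dataset goes through fsspec.

```python
import contextlib
import gzip
import io
import shutil
from pathlib import Path


def ensure_decompressed(path: Path) -> Path:
    """Return an uncompressed path, gunzipping once if `path` ends in .gz."""
    if path.suffix != ".gz":
        return path  # not gzip-compressed, nothing to do
    target = path.with_suffix("")  # e.g. gene2go.gaf.gz -> gene2go.gaf
    if not target.exists():
        # Stream the copy so a large file is never held in memory at once.
        with gzip.open(path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target


def call_quietly(func, *args, **kwargs):
    """Call `func` while discarding anything it prints to stdout."""
    with contextlib.redirect_stdout(io.StringIO()):
        return func(*args, **kwargs)
```

In this sketch, a reader would be created with something like call_quietly(Gene2GoReader, str(ensure_decompressed(path)), taxids=[9606]), with any exception wrapped in DatasetError by the caller.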

save(data: Gene2GoReader) → None : Always raises DatasetError because the dataset is read-only.
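
The read-only contract can be illustrated with a small stand-alone sketch. The DatasetError class below is a local stand-in defined only so the example runs on its own (the real exception comes from the dataset framework), and ReadOnlyDatasetSketch is a hypothetical class, not the library's code.

```python
class DatasetError(Exception):
    """Local stand-in for the framework's DatasetError (assumption for this sketch)."""


class ReadOnlyDatasetSketch:
    """Toy dataset whose save() always fails, mirroring the read-only behavior."""

    def load(self) -> dict:
        # Placeholder payload; the real dataset returns a Gene2GoReader.
        return {"loaded": True}

    def save(self, data) -> None:
        # Writing is never supported, regardless of what `data` is.
        raise DatasetError(f"{type(self).__name__} is read-only; save() is not supported.")
```

Callers that treat datasets generically should therefore expect save() to raise and handle DatasetError rather than assume a write succeeded.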

_exists() → bool : Returns True if the (possibly compressed) source file exists on the configured filesystem, False otherwise.

_release() → None : Releases any cached resources and invalidates the filesystem cache for the dataset's path.

_invalidate_cache() → None : Helper method to clear fsspec's internal cache for the dataset's filepath.
