Gene2Go Reader Dataset

Gene2GoReaderDataset reads Gene Ontology (GO) terms from the landing.ncbigene.gene2go file. It uses the goatools library under the hood.

class optimuskg.datasets.Gene2GoReaderDataset(
    *,
    filepath: str,
    load_args: dict[str, Any] | None = None,
    version: Version | None = None,
    credentials: dict[str, Any] | None = None,
    fs_args: dict[str, Any] | None = None,
    metadata: dict[str, Any] | None = None,
)

Gene2GoReaderDataset loads data from a GO Annotation File (GAF) using goatools.Gene2GoReader, with built-in support for local or remote filesystems (via fsspec) and optional on-the-fly decompression of gzip-compressed GAFs.

Example usage for the YAML API:

gene2go:
  type: optimuskg.datasets.Gene2GoReaderDataset
  filepath: data/01_raw/gene2go.gaf.gz
  load_args:
    taxids: [9606]
  fs_args:
    open_args_load:
      encoding: "utf-8"
  metadata:
    description: "Human Gene Ontology associations"

Attributes

DEFAULT_LOAD_ARGS

DEFAULT_FS_ARGS

Methods

__init__(*, filepath, load_args=None, version=None, credentials=None, fs_args=None, metadata=None) : Constructs the dataset, configuring protocol, filesystem, versioning, and optional gzip decompression settings.

_describe() → dict[str, Any] : Returns a dictionary with the dataset's identifying parameters: filepath, protocol, load_args, and version.

load() → Gene2GoReader : Loads (and if necessary decompresses) the GAF file and returns a Gene2GoReader instance.

Detects a .gz extension and decompresses to .gaf if no decompressed copy exists.
Suppresses stdout during Gene2GoReader initialization.
Raises DatasetError on failure to decompress or load.
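
The first two behaviours of load() can be illustrated with the standard library alone. This is a minimal sketch, not the dataset's actual implementation: ensure_decompressed and call_quietly are hypothetical helper names, the sketch works on local paths only (the real dataset goes through fsspec), and it assumes the decompressed copy is written next to the source with the .gz suffix dropped.

```python
import contextlib
import gzip
import io
import shutil
from pathlib import Path


def ensure_decompressed(path: Path) -> Path:
    """Return a path to an uncompressed copy, decompressing only when needed."""
    if path.suffix != ".gz":
        return path  # not gzip-compressed, use as is
    target = path.with_suffix("")  # e.g. gene2go.gaf.gz -> gene2go.gaf
    if not target.exists():
        # Stream the contents so large files are never held fully in memory.
        with gzip.open(path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target


def call_quietly(func, *args, **kwargs):
    """Call func while discarding anything it prints to stdout."""
    with contextlib.redirect_stdout(io.StringIO()):
        return func(*args, **kwargs)
```

With goatools installed, the two helpers could plausibly be combined as call_quietly(Gene2GoReader, str(ensure_decompressed(path)), taxids=[9606]); whether the dataset passes load_args in exactly this way is an assumption.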

save(data: Gene2GoReader) → None : Always raises DatasetError because the dataset is read-only.
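
The read-only contract can be pictured as follows. This is a sketch of the general pattern only: the DatasetError class below is a local stand-in for the framework's error type, and ReadOnlyDataset is a hypothetical class, not the dataset's real source.

```python
class DatasetError(Exception):
    """Local stand-in for the framework's dataset error type."""


class ReadOnlyDataset:
    """Hypothetical minimal dataset that rejects every save attempt."""

    def __init__(self, filepath: str) -> None:
        self._filepath = filepath

    def save(self, data: object) -> None:
        # Fail loudly instead of silently ignoring the write.
        raise DatasetError(
            f"{type(self).__name__} is read-only; cannot save to {self._filepath}"
        )
```

Raising rather than silently returning means a pipeline that mistakenly lists the dataset as an output fails at the save step instead of appearing to succeed.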

_exists() → bool : Returns True if the (possibly compressed) source file exists on the configured filesystem, False otherwise.

_release() → None : Releases any cached resources and invalidates the filesystem cache for the dataset's path.

_invalidate_cache() → None : Helper method to clear fsspec's internal cache for the dataset's filepath.
