EncodingDataset

class trove.data.ir_encoding_dataset.EncodingDataset(data_args=None, data_args_overrides=None, dataset_name=None, query_path=None, corpus_path=None, format_query=None, format_passage=None, global_id_suffix=None, cache_path=None, load_cache_on_init=True, prefer_cache=True)
__init__(data_args=None, data_args_overrides=None, dataset_name=None, query_path=None, corpus_path=None, format_query=None, format_passage=None, global_id_suffix=None, cache_path=None, load_cache_on_init=True, prefer_cache=True)

Dataset for encoding query and passage texts.

This class supports reading/writing dense embedding vectors to cache on disk.

Parameters:
  • data_args (Optional[DataArguments]) – General arguments for loading and processing the data. Currently only used to determine dataset_name if it is not explicitly provided, which matters only if your query/passage formatting functions require a dataset argument.

  • data_args_overrides (Optional[Dict[str, Any]]) – A mapping from a subset of DataArguments attribute names to their new values. These values override the corresponding attributes of the data_args argument. This is useful if you want to create multiple datasets from the same DataArguments instance, making small changes for each dataset without creating new DataArguments instances.

  • dataset_name (Optional[str]) – Name of the dataset being encoded. Only useful if your query/passage formatting functions require a dataset argument.

  • query_path (Optional[os.PathLike]) – Path to JSONL file containing query texts.

  • corpus_path (Optional[os.PathLike]) – Path to JSONL file containing passage texts.

  • format_query (Optional[Callable]) – Callable that takes the query text and dataset name and returns the formatted query text.

  • format_passage (Optional[Callable]) – Callable that takes the passage text, title, and dataset name and returns the formatted passage text.

  • global_id_suffix (Optional[str]) – Unique file ID used to create globally unique query/passage IDs across files.

  • cache_path (Optional[os.PathLike]) – Path to the cache file.

  • load_cache_on_init (bool) – If True and cache_path is provided and the file exists, load the cached Arrow table at the end of the __init__ method.

  • prefer_cache (bool) – If True, use the cache if present. If False, ignore the cache even if it exists.
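The formatting callables are user-supplied, so their bodies below are hypothetical; only their signatures (query text plus dataset name, and passage text plus title plus dataset name) follow the parameter descriptions above. The prefixes and argument names are assumptions, not part of the trove API:

```python
def format_query(text, dataset=None):
    """Hypothetical query formatter: prepend a task-style prefix."""
    return f"query: {text}"

def format_passage(text, title=None, dataset=None):
    """Hypothetical passage formatter: join the title (if any) with the body."""
    return f"passage: {title} {text}" if title else f"passage: {text}"

# Sketch of constructing the dataset with these callables (requires trove;
# file paths are placeholders):
# from trove.data.ir_encoding_dataset import EncodingDataset
# dataset = EncodingDataset(
#     query_path="queries.jsonl",
#     format_query=format_query,
#     format_passage=format_passage,
#     cache_path="embeddings.cache",
#     load_cache_on_init=True,
# )
```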

property all_rec_ids: List[str]

Returns the list of all record ids in the dataset across all shards.

ignore_cache()

Return raw data even if cache exists.

Return type:

None

prefer_cache()

Return cached embeddings if possible.

Return type:

None

disable_cache()

Context manager to temporarily disable vector cache.
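The three cache controls above can be illustrated with a minimal plain-Python stand-in. This is a sketch of the toggling behavior described in the docstrings, not the trove implementation:

```python
from contextlib import contextmanager

class CacheToggle:
    """Minimal stand-in for the documented cache controls (illustrative only)."""

    def __init__(self):
        self._use_cache = True

    def prefer_cache(self):
        # Return cached embeddings if possible.
        self._use_cache = True

    def ignore_cache(self):
        # Return raw data even if a cache exists.
        self._use_cache = False

    @contextmanager
    def disable_cache(self):
        # Temporarily ignore the cache, then restore the previous setting.
        previous = self._use_cache
        self._use_cache = False
        try:
            yield self
        finally:
            self._use_cache = previous

toggle = CacheToggle()
with toggle.disable_cache():
    pass  # cache is off inside the block
# on exit, the previous setting is restored
```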

shard(shard_idx, num_shards=None, shard_weights=None, hard_shard=False)

Shard the dataset.

It can do either a hard or a soft shard.

In a soft shard, we only mask the indices of rows that are not in the shard. The underlying data is not changed, and you can reverse the sharding by calling the unshard() method.

In a hard shard, however, the underlying data is changed and the effects of sharding are irreversible for this instance. Also keep in mind that after a hard shard, the dataset will appear unsharded to other methods/classes like VectorCacheMixin.

In most cases, you should just use the default soft sharding. Hard sharding is mostly useful if you plan to shard the dataset twice. For example, if you want to encode each shard in a separate session which itself runs in a distributed environment, you can do a hard shard in each session before passing the dataset to trove.RetrievalEvaluator(). This allows RetrievalEvaluator to further shard the dataset in each session into smaller pieces, one for each process.

Parameters:
  • shard_idx (int) – Index of the current shard.

  • num_shards (Optional[int]) – Total number of shards. If shard_weights is provided, num_shards should be either None or equal to len(shard_weights).

  • shard_weights (Optional[List[float]]) – Relative number of items in each shard. If provided, shard the dataset according to these weights. If not provided, all shards are of the same size.

  • hard_shard (bool) – Whether to do a hard (permanent) shard or a soft (superficial) shard.

Return type:

None
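One way weighted sharding can partition rows is by scaling cumulative weights to the number of rows. The helper below is an illustrative sketch of that idea, not the trove implementation; its name and rounding behavior are assumptions:

```python
def shard_indices(num_rows, shard_idx, num_shards=None, shard_weights=None):
    """Illustrative: return the row indices belonging to shard `shard_idx`."""
    if shard_weights is None:
        # Equal-sized shards when no weights are given.
        shard_weights = [1.0] * num_shards
    total = sum(shard_weights)
    # Cumulative weight boundaries scaled to the number of rows.
    bounds = [0]
    acc = 0.0
    for w in shard_weights:
        acc += w
        bounds.append(round(num_rows * acc / total))
    return list(range(bounds[shard_idx], bounds[shard_idx + 1]))

# Two shards with a 3:1 size ratio over 8 rows:
# shard 0 gets rows 0-5, shard 1 gets rows 6-7.
```

A soft shard would use these indices as a mask over the dataset, leaving the underlying data intact; a hard shard would materialize only these rows.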

unshard()

Reverse the sharding.

I.e., unmask the row indices so that all rows are accessible to the dataset again.

Return type:

None