EncodingDataset
- class trove.data.ir_encoding_dataset.EncodingDataset(data_args=None, data_args_overrides=None, dataset_name=None, query_path=None, corpus_path=None, format_query=None, format_passage=None, global_id_suffix=None, cache_path=None, load_cache_on_init=True, prefer_cache=True)
- __init__(data_args=None, data_args_overrides=None, dataset_name=None, query_path=None, corpus_path=None, format_query=None, format_passage=None, global_id_suffix=None, cache_path=None, load_cache_on_init=True, prefer_cache=True)
Dataset for encoding query and passage texts.
This class supports reading/writing dense embedding vectors to cache on disk.
- Parameters:
data_args (Optional[DataArguments]) – General arguments for loading and processing the data. Currently only used to determine dataset_name if it is not explicitly provided, which in turn is only useful if your query/passage formatting functions require a dataset argument.
data_args_overrides (Optional[Dict[str, Any]]) – A mapping from a subset of DataArguments attribute names to their new values. These values override the corresponding attributes of the data_args argument. Useful if you want to create multiple datasets from the same DataArguments instance with small per-dataset changes, without creating new DataArguments instances.
dataset_name (Optional[str]) – Name of the dataset being encoded. Only useful if your query/passage formatting functions require a dataset argument.
query_path (Optional[os.PathLike]) – Path to a JSONL file containing query texts.
corpus_path (Optional[os.PathLike]) – Path to a JSONL file containing passage texts.
format_query (Optional[Callable]) – Callable that takes the query text and dataset name and returns the formatted query text.
format_passage (Optional[Callable]) – Callable that takes the passage text, title, and dataset name and returns the formatted passage text.
global_id_suffix (Optional[str]) – Unique file ID used to create globally unique query/passage IDs across files.
cache_path (Optional[os.PathLike]) – Path to the cache file.
load_cache_on_init (bool) – If true and cache_path is provided and exists, load the cache arrow table at the end of the init method.
prefer_cache (bool) – If true, use the cache if present. If false, ignore the cache even if it exists.
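The doc states that format_query takes the query text and dataset name, and format_passage takes the passage text, title, and dataset name. A minimal sketch of such callables (the prefix strings and argument names are illustrative assumptions, not trove's defaults):

```python
# Hypothetical formatting callables for EncodingDataset.
# The "query: "/"passage: " prefixes are illustrative (e.g. E5-style
# models); adjust to whatever your embedding model expects.

def format_query(text: str, dataset: str) -> str:
    # Prepend an instruction prefix to the raw query text.
    return "query: " + text.strip()

def format_passage(text: str, title: str, dataset: str) -> str:
    # Join the (possibly empty) title and the passage body.
    parts = [p.strip() for p in (title or "", text) if p.strip()]
    return "passage: " + " ".join(parts)

print(format_query("what is dense retrieval?", "msmarco"))
print(format_passage("Dense retrieval uses embeddings.", "DPR", "msmarco"))
```

These can then be passed as the format_query and format_passage arguments of EncodingDataset.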
- property all_rec_ids: List[str]
Returns the list of all record ids in the dataset across all shards.
- ignore_cache()
Return raw data even if cache exists.
- Return type:
None
- prefer_cache()
Return cached embeddings if possible.
- Return type:
None
- disable_cache()
Context manager to temporarily disable vector cache.
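disable_cache() follows the standard pattern of a context manager that toggles state on entry and restores it on exit. A standalone mock of the idea (not trove's implementation):

```python
from contextlib import contextmanager

class CacheToggleDemo:
    """Mock illustrating a disable-cache context manager (not trove's code)."""

    def __init__(self):
        self.use_cache = True

    @contextmanager
    def disable_cache(self):
        prev = self.use_cache
        self.use_cache = False       # cache ignored inside the with-block
        try:
            yield self
        finally:
            self.use_cache = prev    # previous state restored on exit

demo = CacheToggleDemo()
with demo.disable_cache():
    print(demo.use_cache)  # False inside the block
print(demo.use_cache)      # True again afterwards
```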
- shard(shard_idx, num_shards=None, shard_weights=None, hard_shard=False)
Shard the dataset.
It can do both a hard and a soft shard.
In a soft shard, we only mask the index of rows that are not in the shard. The underlying data is not changed, and you can reverse the sharding by calling the unshard() method.
In a hard shard, however, the underlying data is changed and the effects of sharding are irreversible for this instance. Also keep in mind that after a hard shard, the dataset will seem unsharded to other methods/classes like VectorCacheMixin.
In most cases, you should just use the default soft sharding. Hard sharding is mostly useful if you plan to shard the dataset twice. For example, if you want to encode each shard in a separate session that itself runs in a distributed environment, you can do a hard shard in each session before passing the dataset to trove.RetrievalEvaluator(). This allows RetrievalEvaluator to further shard the dataset in each session into smaller pieces for each process.
- Parameters:
shard_idx (int) – Index of the current shard.
num_shards (Optional[int]) – Total number of shards. If shard_weights is provided, num_shards should be either None or equal to len(shard_weights).
shard_weights (Optional[List[float]]) – Relative number of items in each shard. If provided, shard the dataset according to these weights. If not provided, all shards are the same size.
hard_shard (bool) – Whether to do a hard/permanent shard or a soft/superficial shard.
- Return type:
None
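One way to turn relative shard_weights into concrete shard sizes is proportional splitting with the rounding remainder assigned to the last shard. This is illustrative arithmetic only; trove's exact rounding strategy may differ:

```python
def shard_sizes(total: int, weights: list[float]) -> list[int]:
    # Split `total` items into shards proportional to `weights`,
    # giving any rounding remainder to the last shard.
    norm = sum(weights)
    sizes = [int(total * w / norm) for w in weights[:-1]]
    sizes.append(total - sum(sizes))  # remainder goes to the last shard
    return sizes

print(shard_sizes(10, [1.0, 1.0]))  # → [5, 5]
print(shard_sizes(10, [3.0, 1.0]))  # → [7, 3]
```

Whatever the rounding scheme, the sizes always sum to the total number of rows, so every row lands in exactly one shard.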
- unshard()
Reverse the sharding.
I.e., unmask the row indices so that all rows are accessible to the dataset again.
- Return type:
None
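The soft-shard/unshard behavior described above can be pictured with a tiny index-masking mock (a sketch of the concept, not trove's implementation):

```python
class SoftShardDemo:
    """Mock: soft sharding masks row indices; unshard() restores them."""

    def __init__(self, n_rows: int):
        self._all = list(range(n_rows))
        self._visible = list(self._all)   # indices currently accessible

    def shard(self, shard_idx: int, num_shards: int) -> None:
        # Soft shard: keep every num_shards-th row starting at shard_idx.
        # The underlying data (self._all) is untouched.
        self._visible = self._all[shard_idx::num_shards]

    def unshard(self) -> None:
        # Reverse the soft shard: all rows visible again.
        self._visible = list(self._all)

    def __len__(self) -> int:
        return len(self._visible)

ds = SoftShardDemo(10)
ds.shard(0, 2)
print(len(ds))   # → 5 (only this shard's rows are visible)
ds.unshard()
print(len(ds))   # → 10 (sharding reversed)
```

A hard shard, by contrast, would replace self._all itself, which is why it cannot be undone and why later consumers see the shard as a full, unsharded dataset.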