BinaryDataset

class trove.data.ir_dataset_binary.BinaryDataset(data_args, format_query, format_passage, positive_configs=None, negative_configs=None, train_cache_path=None, data_args_overrides=None, trainer=None, outside_trainer=False, num_proc=None)
__init__(data_args, format_query, format_passage, positive_configs=None, negative_configs=None, train_cache_path=None, data_args_overrides=None, trainer=None, outside_trainer=False, num_proc=None)

IR training dataset with only two levels of relevance (i.e., only positive and negative passages).

  • collections of positive (negative) passages are created from the positive_configs (negative_configs) argument.

  • passages that appear in either the positive or the negative collections are included in the resulting dataset.

Parameters:
  • data_args (DataArguments) – general arguments for loading and processing the data

  • format_query (Callable[[str, Optional[str]], str]) – A callable that takes the query text and optionally the dataset name and returns the formatted query text for the model.

  • format_passage (Callable[[str, str, Optional[str]], str]) – A callable that takes the passage text, title, and optionally the dataset name, and returns the formatted passage text for the model.

  • positive_configs (Optional[Union[MaterializedQRelConfig, List[MaterializedQRelConfig]]]) – Config for one or multiple collections of queries, documents, and the relation between them. The passages from these collections are used as positives.

  • negative_configs (Optional[Union[MaterializedQRelConfig, List[MaterializedQRelConfig]]]) – Config for one or multiple collections of queries, documents, and the relation between them. The passages from these collections are used as negatives.

  • train_cache_path (Optional[os.PathLike]) – DO NOT USE. For internal operations only and not stable. If given, create a dataset only for training from this cache file. This is much more memory efficient compared to creating the dataset on-the-fly. You should use export_and_load_train_cache() method to take advantage of this.

  • data_args_overrides (Optional[Dict[str, Any]]) – A mapping from a subset of DataArguments attribute names to their new values. These values override the corresponding attributes of the data_args argument. It is useful if you want to create multiple datasets from the same DataArguments instance with small changes for each dataset, without creating new DataArguments instances.

  • trainer (Optional[Trainer]) – An instance of the transformers.Trainer class. The random seed and epoch from the trainer instance are used to sample positive and negative documents.

  • outside_trainer (bool) – If True, do not use the trainer instance and set both seed and epoch to zero. Useful for debugging without a trainer instance.

  • num_proc (Optional[int]) – arg passed to methods like datasets.Dataset.*
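
A minimal construction sketch (the formatting templates and the data_args, positive_cfg, and negative_cfg variables below are placeholders, not values prescribed by the library):

from trove.data.ir_dataset_binary import BinaryDataset

# Formatting callables; the exact templates are assumptions for illustration.
def format_query(query, dataset_name=None):
    return f"query: {query}"

def format_passage(text, title, dataset_name=None):
    return f"passage: {title} {text}".strip()

# positive_cfg / negative_cfg are MaterializedQRelConfig instances describing the
# queries, the corpus, and the qrels relating them; their construction is omitted
# because the config fields depend on your data layout.
dataset = BinaryDataset(
    data_args=data_args,            # a DataArguments instance
    format_query=format_query,
    format_passage=format_passage,
    positive_configs=positive_cfg,
    negative_configs=negative_cfg,
    outside_trainer=True,           # debug without a Trainer: seed and epoch are both zero
)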

update_metadata()

Updates the metadata for the dataset.

It creates a new fingerprint and metadata dict for the dataset.

Return type:

None

property fingerprint: str

Calculates a unique fingerprint for the contents and output of this dataset.

Datasets with the same fingerprint are backed by the same underlying data but do NOT necessarily generate the same samples. For example, different sampling strategies or query and passage formatting functions produce different outputs from the same underlying data, and thus from datasets with the same fingerprint.

This fingerprint is intended for internal operations only and you should not rely on it. If you do, use it only to identify the underlying data (e.g., for caching and loading) and not the exact samples.
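
For instance, the fingerprint can serve as a cache key that identifies the underlying data; the directory layout below is an illustration only and not part of the library:

from pathlib import Path

# Hypothetical cache layout keyed by the dataset fingerprint.
cache_dir = Path("~/.cache/my_project").expanduser() / dataset.fingerprint
cache_dir.mkdir(parents=True, exist_ok=True)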

property info: Dict

set_index_lookup_storage_type(storage)

Select whether the key-to-row-index lookup table should be stored in memory or in a memory-mapped LMDB dict.

Return type:

None

create_group_from_cache(index)

Loads one query and the related negative and positive passages.

Input and output are the same as create_group_on_the_fly(), but the data is read from the cache file rather than created on-the-fly.

Return type:

Dict[str, Union[str, List[Dict]]]

create_group_on_the_fly(index)

Loads one query and the related negative and positive passages.

The content of the query/passages is also loaded (i.e., records are materialized). The return format is based on tevatron.

Parameters:

index (int) – Index of the query to load

Return type:

Dict[str, Union[str, List[Dict]]]

Returns:

A dict of the following format:

{
    'query_id': 'Unique ID of the query across all files in this dataset',
    'query': 'query text',
    'positive_passages': [  # list of positive documents for this query
        # There could be additional fields in each dict, which should be ignored
        {'_id': 'globally unique id of the passage', 'text': '...', 'title': '...'},
        ...,
        {'_id': ...}
    ],
    'negative_passages': [  # list of negative documents for this query
        # The same data structure and field names as the positive documents
        {'_id': ...},
        ...,
        {'_id': ...}
    ]
}
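
As an illustration, a returned group might be consumed like this (assuming dataset is an already constructed BinaryDataset):

group = dataset.create_group_on_the_fly(0)
query_id, query_text = group['query_id'], group['query']
positive_ids = [p['_id'] for p in group['positive_passages']]
negative_ids = [p['_id'] for p in group['negative_passages']]
# Extra keys in the passage dicts (besides '_id', 'text', 'title') should be ignored.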

set_trainer(trainer)

Set the trainer attribute.

Return type:

None

epoch_and_seed()

If a trainer instance is available, load the seed and current epoch from the trainer.

Return type:

Tuple[int, Union[float, int]]
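
For example (the zero values follow from the outside_trainer description above):

seed, epoch = dataset.epoch_and_seed()
# With outside_trainer=True (no trainer instance), both values are zero.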

export_and_load_train_cache(cache_file=None, cache_pardir=None, num_proc=None, batch_size=16)

Export the training groups to a cache file and load them into a new instance of BinaryDataset.

To reduce memory consumption, it generates all the training groups, writes them into a JSON lines file, and returns a new BinaryDataset instance backed by those cached records. To benefit from the reduced memory consumption, make sure you do not keep any references to the old dataset instance so that it can be garbage collected by the interpreter.

You can do something like:

import gc

dataset = BinaryDataset(...)
dataset = dataset.export_and_load_train_cache()
gc.collect()  # if you want to force the interpreter to release the memory right away

Parameters:
  • cache_file (Optional[PathLike]) – A JSON lines file to save the cached training groups to. If None, a unique cache file is created based on the dataset fingerprint.

  • cache_pardir (Optional[PathLike]) – The parent directory to save the cache file to. If provided, a subdirectory is created in this directory based on the dataset fingerprint and the dataset cache is saved in that subdirectory.

  • num_proc (Optional[int]) – number of workers to use to generate the training groups.

  • batch_size (int) – read the training groups in batches of the given size

Returns:

A new instance of BinaryDataset for training that is backed by the cached training groups.
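
A sketch with explicit caching options (the cache directory path and worker counts below are placeholders):

import gc

dataset = BinaryDataset(...)
dataset = dataset.export_and_load_train_cache(
    cache_pardir="/path/to/cache_root",  # a subdirectory named by the dataset fingerprint is created here
    num_proc=8,                          # workers used to generate the training groups
    batch_size=64,                       # training groups are read in batches of this size
)
gc.collect()  # optionally force the interpreter to release the old instance's memory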