MultiLevelDataset

class trove.data.ir_dataset_multilevel.MultiLevelDataset(data_args, format_query, format_passage, qrel_config=None, eval_cache_path=None, train_cache_path=None, data_args_overrides=None, num_proc=None)
__init__(data_args, format_query, format_passage, qrel_config=None, eval_cache_path=None, train_cache_path=None, data_args_overrides=None, num_proc=None)

IR training dataset with multiple levels of relevancy (supports more than two levels).

  • The collection of related documents for each query is created from one or more qrel_config entries

  • If there are multiple collections and a passage shows up in more than one of them, the data from the last collection takes precedence (i.e., the record returned by the object corresponding to qrel_config[-1] is used for that passage)

Parameters:
  • data_args (DataArguments) – general arguments for loading and processing the data

  • format_query (Callable[[str, Optional[str]], str]) – A callable that takes the query text and optionally the dataset name and returns the formatted query text for the model.

  • format_passage (Callable[[str, str, Optional[str]], str]) – A callable that takes the passage text, the title, and optionally the dataset name, and returns the formatted passage text for the model.

  • data_args_overrides (Optional[Dict[str, Any]]) – A mapping from a subset of DataArguments attribute names to their new values. These values override the corresponding attributes of the data_args argument. This is useful if you want to create multiple datasets from the same DataArguments instance but make small changes for each dataset without creating new DataArguments instances.

  • qrel_config (Optional[Union[MaterializedQRelConfig, List[MaterializedQRelConfig]]]) – Config for one or more collections of queries, passages, and the relation between them. The combination of these collections will make up the content of this dataset.

  • eval_cache_path (Optional[os.PathLike]) – DO NOT USE. For internal operations only and not stable. If given, create a dataset only for evaluation from cache files in this directory. This is much more memory efficient compared to creating the dataset on-the-fly. You should use export_and_load_eval_cache() method to take advantage of this.

  • train_cache_path (Optional[os.PathLike]) – DO NOT USE. For internal operations only and not stable. If given, create a dataset only for training from this cache file. This is much more memory efficient compared to creating the dataset on-the-fly. You should use export_and_load_train_cache() method to take advantage of this.

  • num_proc (Optional[int]) – number of processes, passed to methods like datasets.Dataset.map()
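
For example, a construction might look like the following sketch. Here data_args and qrel_config are assumed to be configured elsewhere (DataArguments and MaterializedQRelConfig instances), and the formatting templates are illustrative, not part of trove:

from trove.data.ir_dataset_multilevel import MultiLevelDataset

# Illustrative formatting callables; the exact templates are a modeling choice.
def format_query(query, dataset_name=None):
    return f"query: {query}"

def format_passage(text, title, dataset_name=None):
    return f"passage: {title} {text}"

# data_args and qrel_config are assumed to be built elsewhere.
dataset = MultiLevelDataset(
    data_args=data_args,
    format_query=format_query,
    format_passage=format_passage,
    qrel_config=qrel_config,
    num_proc=4,
)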

update_metadata()

Updates the metadata for the dataset.

It creates a new fingerprint and metadata dict for the dataset.

Return type:

None

property fingerprint: str

A unique fingerprint for the contents and output of this dataset.

Datasets with the same fingerprint are backed by the same underlying data but do NOT necessarily generate the same samples. For example, different sampling strategies or query and document formatting functions lead to different output from the same underlying data, and thus to the same fingerprint.

This fingerprint is for internal operations only and you should not rely on it. If you do, use it only to identify the underlying data (e.g., for caching and loading), not the exact samples.

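As a sketch, one hedged use is as a data-level cache key (the directory layout here is an assumption, not part of trove):

from pathlib import Path

# Same underlying data -> same fingerprint -> same cache directory,
# even if sampling or formatting differs between dataset instances.
cache_dir = Path("trove_cache") / dataset.fingerprint
cache_dir.mkdir(parents=True, exist_ok=True)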

property info: Dict

set_index_lookup_storage_type(storage)

Select whether the key-to-row-index lookup table should be stored in memory or in a memory-mapped LMDB dict.

Return type:

None

create_group_from_cache(index)

Loads one query and its related passages.

Input and output are the same as create_group_on_the_fly(), but the data is read from a cache file rather than created on-the-fly.

Return type:

Dict[str, Union[str, List[Dict]]]

create_group_on_the_fly(index)

Loads one query and its related passages.

The content of the query/passages is also loaded (i.e., records are materialized).

If a passage exists in multiple collections, we use the data from the last collection in the self.qrel_collections list.

Parameters:

index (int) – Index of the query to load

Return type:

Dict[str, Union[str, List[Dict]]]

Returns:

A dict of the following format:

{
    'query_id': 'unique ID of the query across all files in this dataset',
    'query': 'query text',
    'passages': [  # list of related passages for this query
        # There could be additional fields in these dicts, which should be ignored
        {'_id': 'globally unique ID of the passage', 'text': '...', 'title': '...'},
        ...,
        {'_id': ...},
    ]
}
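
A minimal usage sketch based on this format:

group = dataset.create_group_on_the_fly(0)
print(group['query_id'], group['query'])
for passage in group['passages']:
    # Rely only on the documented fields; ignore any extra keys.
    print(passage['_id'], passage.get('title', ''), passage['text'])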

get_encoding_datasets(get_cache_path=None, encoding_cache_pardir=None, **kwargs)

Generates encoding datasets for the queries and corpora used in this dataset.

This is most useful for inference and evaluation. You can use the datasets in the output of this method to calculate the query/corpus embeddings and then the similarity scores between them. You can also use get_qrel_nested_dict() to get the ground-truth qrels as a nested dict. With the ground-truth qrels and the calculated similarity scores, you can compute the IR evaluation metrics for the model.

You can optionally assign a new cache path to the encoding datasets. If you do so, the cache path in the qrel collection arguments is ignored. Use the arguments of this method to dynamically set the cache path for the encoding datasets.

Parameters:
  • get_cache_path (Optional[Callable]) –

    A callable that generates the cache path for each encoding dataset. The callable should take three keyword arguments:

    • filepath (os.PathLike): path to the input data file

    • file_id (str): globally unique _id for this file (see code of __init__ function for more info)

    • orig_cache_path (Optional[PathLike]): the corresponding cache path for this filepath saved in MaterializedQRel.args

    It should return either None or the filepath to the cache for this dataset (see the sketch below)

  • encoding_cache_pardir (Optional[os.PathLike]) –

    If

    • get_cache_path is not provided (i.e., is None)

    • and encoding_cache_pardir is not None

    • and orig_cache_path is a relative filepath that does not exist on disk

    then we interpret orig_cache_path as a filepath relative to encoding_cache_pardir and use Path(encoding_cache_pardir, orig_cache_path) as the cache path for the encoding dataset.

  • kwargs – keyword arguments passed to EncodingDataset.__init__

Return type:

Tuple[List[EncodingDataset], List[EncodingDataset]]

Returns:

A tuple. The first item is a list of encoding datasets for the queries used in this dataset. The second item is a list of encoding datasets for the corpora used in this dataset.
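
As a sketch, a get_cache_path callable might route all encoding caches into a single directory keyed by file_id; the directory and file naming here are assumptions:

from pathlib import Path

def get_cache_path(*, filepath, file_id, orig_cache_path):
    # Ignore the original cache path and key the cache by the unique file_id.
    return Path('encoding_cache') / f'{file_id}.cache'

query_dsets, corpus_dsets = dataset.get_encoding_datasets(get_cache_path=get_cache_path)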

get_qrel_nested_dict(return_global_ids=True)

Collect the qrel triplets from all qrel collections into the nested dict format used by pytrec_eval.

Parameters:

return_global_ids (bool) – If True, use global IDs for queries and documents.

Return type:

Dict[str, Dict[str, Union[int, float]]]

Returns:

A nested dict where dict[qid][docid] is the score between query qid and document docid in the qrel files.
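
For example, the returned dict can be fed to pytrec_eval; the run dict with model scores is assumed to be computed separately:

import pytrec_eval

# pytrec_eval expects integer relevance judgments, so cast the scores.
qrels = {qid: {docid: int(score) for docid, score in docs.items()}
         for qid, docs in dataset.get_qrel_nested_dict().items()}
# 'run' maps qid -> {docid: model similarity score}, computed elsewhere.
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'ndcg_cut_10', 'recip_rank'})
metrics = evaluator.evaluate(run)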

export_and_load_train_cache(cache_file=None, cache_pardir=None, num_proc=None, batch_size=16)

Export the training groups to a cache file and load them into a new instance of MultiLevelDataset.

To reduce memory consumption, it generates all the training groups, writes them to a JSON Lines file, and returns a new MultiLevelDataset instance backed by those cached records.

To benefit from the reduced memory consumption, make sure you do not keep any references to the old dataset instance, so it can be garbage collected by the interpreter. You can do something like:

import gc

dataset = MultiLevelDataset(...)
dataset = dataset.export_and_load_train_cache()
gc.collect()  # optionally force the interpreter to release the memory right away

Parameters:
  • cache_file (Optional[PathLike]) – a JSON Lines file to save the cached training groups to. If None, a unique cache file is created based on the dataset fingerprint.

  • cache_pardir (Optional[PathLike]) – the directory to save the cache file to. If provided, we create a subdir in this directory based on the dataset fingerprint and save the dataset cache in that subdir.

  • num_proc (Optional[int]) – number of workers to use to generate the training groups.

  • batch_size (int) – read the training groups in batches of this size

Returns:

A new instance of MultiLevelDataset for training that is backed by the cached training groups.

export_and_load_eval_cache(cache_dir=None, cache_pardir=None)

Export the data required for evaluation to cache files and load them into a new instance of MultiLevelDataset.

To reduce memory consumption, it creates all the data required for evaluation, writes it to cache files, and returns a new MultiLevelDataset instance backed by those cache files.

To benefit from the reduced memory consumption, make sure you do not keep any references to the old dataset instance, so it can be garbage collected by the interpreter. You can do something like:

import gc

dataset = MultiLevelDataset(...)
dataset = dataset.export_and_load_eval_cache()
gc.collect()  # optionally force the interpreter to release the memory right away

Parameters:
  • cache_dir (Optional[PathLike]) – a directory where cache files should be saved. If None, create a unique cache directory based on the dataset fingerprint.

  • cache_pardir (Optional[PathLike]) – the parent directory to save the cache files to. If provided, we create a subdir in this directory based on the dataset fingerprint and save the dataset cache in that subdir.

Returns:

A new instance of MultiLevelDataset for evaluation that is backed by the cached files.