MultiLevelDataset
- class trove.data.ir_dataset_multilevel.MultiLevelDataset(data_args, format_query, format_passage, qrel_config=None, eval_cache_path=None, train_cache_path=None, data_args_overrides=None, num_proc=None)
- __init__(data_args, format_query, format_passage, qrel_config=None, eval_cache_path=None, train_cache_path=None, data_args_overrides=None, num_proc=None)
IR training dataset with multiple levels of relevancy (supports more than two levels).
Collections of related documents for each query are created from one or more qrel_config objects. If there are multiple collections and a passage shows up in more than one of them, the data from the last collection takes precedence (i.e., whatever record the object corresponding to qrel_config[-1] returns is used for that specific passage).
- Parameters:
data_args (DataArguments) – general arguments for loading and processing the data
format_query (Callable[[str, Optional[str]], str]) – A callable that takes the query text and optionally the dataset name and returns the formatted query text for the model.
format_passage (Callable[[str, str, Optional[str]], str]) – A callable that takes the passage text, its title, and optionally the dataset name, and returns the formatted passage text for the model.
data_args_overrides (Optional[Dict[str, Any]]) – A mapping from a subset of DataArguments attribute names to their new values. These key-value pairs override the corresponding attributes in the data_args argument. It is useful if you want to create multiple datasets from the same DataArguments instance but make small changes for each dataset without creating new DataArguments instances.
qrel_config (Optional[Union[MaterializedQRelConfig, List[MaterializedQRelConfig]]]) – Config for one or more collections of queries, passages, and the relations between them. The combination of these collections makes up the content of this dataset.
eval_cache_path (Optional[os.PathLike]) – DO NOT USE. For internal operations only and not stable. If given, create a dataset only for evaluation from cache files in this directory. This is much more memory efficient compared to creating the dataset on-the-fly. You should use the export_and_load_eval_cache() method to take advantage of this.
train_cache_path (Optional[os.PathLike]) – DO NOT USE. For internal operations only and not stable. If given, create a dataset only for training from this cache file. This is much more memory efficient compared to creating the dataset on-the-fly. You should use the export_and_load_train_cache() method to take advantage of this.
num_proc (Optional[int]) – argument passed to methods like datasets.Dataset.*
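A minimal construction sketch, assuming data_args and qrel_config are a DataArguments instance and a MaterializedQRelConfig created elsewhere; the formatting callables below are hypothetical and should match your model's expected input format:

```python
from trove.data.ir_dataset_multilevel import MultiLevelDataset

# Hypothetical formatting callables; adjust the templates to your model.
def format_query(text, dataset_name=None):
    return f"query: {text}"

def format_passage(text, title, dataset_name=None):
    return f"passage: {title} {text}".strip()

dataset = MultiLevelDataset(
    data_args=data_args,          # DataArguments instance (assumed)
    format_query=format_query,
    format_passage=format_passage,
    qrel_config=qrel_config,      # MaterializedQRelConfig or a list of them (assumed)
)
```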
- update_metadata()
Updates the metadata for the dataset.
It creates a new fingerprint and metadata dict for the dataset.
- Return type:
None
- property fingerprint: str
A unique fingerprint for the contents and output of this dataset.
Datasets with the same fingerprint are backed by the same underlying data but do NOT necessarily generate the same samples. For example, using different sampling strategies or query and document formatting functions leads to different output from the same underlying data, and thus the same fingerprint.
This fingerprint is for internal operations only and you should not rely on it. If you do, use it only to identify the underlying data (e.g., for caching and loading) and not the exact samples.
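For example, a hedged sketch of using the fingerprint as a cache key (the cache directory and file layout here are hypothetical):

```python
from pathlib import Path

# Identify the underlying data, not the exact samples, by fingerprint.
cache_file = Path("/tmp/trove_cache") / f"{dataset.fingerprint}.jsonl"
if cache_file.exists():
    print("underlying data already cached:", cache_file)
```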
- property info: Dict
- set_index_lookup_storage_type(storage)
Select whether the key-to-row-index lookup table should be stored in memory or in a memory-mapped LMDB dict.
- Return type:
None
- create_group_from_cache(index)
Loads one query and its related passages.
Input and output are the same as create_group_on_the_fly(), but it reads the data from a cache file rather than creating it on-the-fly.
- Return type:
Dict[str, Union[str, List[Dict]]]
- create_group_on_the_fly(index)
Loads one query and its related passages.
The content of the query/passages is also loaded (i.e., records are materialized).
If a passage exists in multiple collections, we use the data from the last collection in the self.qrel_collections list.
- Parameters:
index (int) – Index of the query to load
- Return type:
Dict[str, Union[str, List[Dict]]]
- Returns:
A dict of the following format:
```python
{
    'query_id': 'Unique ID of query across all files in this dataset',
    'query': 'query text',
    'passages': [
        # list of related passages for this query
        {'_id': 'globally unique id of the passage', 'text': '...', 'title': '...'},
        # There could be additional fields in these dicts, which should be ignored
        ...,
        {'_id': ...}
    ]
}
```
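A brief usage sketch, assuming a dataset constructed as in the example above:

```python
# Materialize the first query and its related passages.
group = dataset.create_group_on_the_fly(0)
print(group["query_id"], group["query"])
for passage in group["passages"]:
    # Only '_id', 'text', and 'title' are documented; ignore extra fields.
    print(passage["_id"], passage.get("title", ""))
```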
- get_encoding_datasets(get_cache_path=None, encoding_cache_pardir=None, **kwargs)
Generates encoding datasets for the queries and corpora used in this dataset.
This is most useful for inference and evaluation. You can use the datasets in the output of this method to calculate the query/corpus embeddings and then the similarity scores between them. You can also use get_qrel_nested_dict() to get the groundtruth qrels as a nested dict. With the groundtruth qrels and the calculated similarity scores, you can compute the IR evaluation metrics for the model.
You can optionally assign new cache paths to the encoding datasets. If you do so, the cache path in the qrel collection arguments is ignored. You can use the arguments to this function to dynamically set the cache path for encoding datasets.
- Parameters:
get_cache_path (Optional[Callable]) – A callable that generates the cache path for each encoding dataset. The callable should take three keyword arguments:
  - filepath (os.PathLike): path to the input data file
  - file_id (str): globally unique _id for this file (see the code of the __init__ function for more info)
  - orig_cache_path (Optional[PathLike]): the corresponding cache path for this filepath saved in MaterializedQRel.args
It should return None or the filepath to the cache for this dataset.
encoding_cache_pardir (Optional[os.PathLike]) – If get_cache_path is not provided (i.e., is None), encoding_cache_pardir is not None, and orig_cache_path is a relative filepath that does not exist on disk, then we treat orig_cache_path as a relative filepath and use Path(encoding_cache_pardir, orig_cache_path) as the cache path for the encoding dataset.
kwargs – keyword arguments passed to EncodingDataset.__init__
- Return type:
Tuple[List[EncodingDataset], List[EncodingDataset]]
- Returns:
A tuple: the first item is a list of encoding datasets for the queries used in this dataset; the second item is a list of encoding datasets for the corpora used in this dataset.
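A hedged end-to-end sketch; encode_queries and encode_corpus below are hypothetical stand-ins for your own model inference that map each EncodingDataset to an {id: embedding} dict:

```python
# Encode all queries and corpus files used by this dataset.
query_dsets, corpus_dsets = dataset.get_encoding_datasets()

query_embs, corpus_embs = {}, {}
for qds in query_dsets:
    query_embs.update(encode_queries(qds))   # hypothetical helper
for cds in corpus_dsets:
    corpus_embs.update(encode_corpus(cds))   # hypothetical helper

# Groundtruth relevance judgments for evaluation.
qrels = dataset.get_qrel_nested_dict()       # {qid: {docid: relevance}}
```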
- get_qrel_nested_dict(return_global_ids=True)
Collect the qrel triplets from all qrel collections into the nested dict format used by pytrec_eval.
- Parameters:
return_global_ids (bool) – If true, use global IDs for queries and documents.
- Return type:
Dict[str, Dict[str, Union[int, float]]]
- Returns:
A nested dict where dict[qid][docid] is the score between query qid and document docid in the qrel files.
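For example, a minimal evaluation sketch with pytrec_eval (here run is a hypothetical {qid: {docid: similarity}} dict computed from the embeddings above; note that pytrec_eval expects integer relevance levels in the qrels):

```python
import pytrec_eval

qrels = dataset.get_qrel_nested_dict()
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "ndcg"})
metrics = evaluator.evaluate(run)  # per-query metric values
```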
- export_and_load_train_cache(cache_file=None, cache_pardir=None, num_proc=None, batch_size=16)
Export the training groups to a cache file and load them into a new instance of MultiLevelDataset.
To reduce memory consumption, it generates all the training groups, writes them into a JSON Lines file, and returns a new MultiLevelDataset instance backed by those cached records.
To benefit from the reduced memory consumption, make sure you do not keep any references to the old dataset instance, so it can be garbage collected by the interpreter. You can do something like:
```python
dataset = MultiLevelDataset(...)
dataset = dataset.export_and_load_train_cache()
gc.collect()  # if you want to force the interpreter to release the memory right away
```
- Parameters:
cache_file (Optional[PathLike]) – a JSON Lines file to save the cached training groups. If None, a unique cache file is created based on the dataset fingerprint.
cache_pardir (Optional[PathLike]) – the directory to save the cache file to. If provided, we create a subdir in this directory based on the dataset fingerprint and save the dataset cache in that subdir.
num_proc (Optional[int]) – number of workers to use to generate the training groups.
batch_size (int) – read the training groups in batches of this size.
- Returns:
A new instance of MultiLevelDataset for training that is backed by the cached training groups.
- export_and_load_eval_cache(cache_dir=None, cache_pardir=None)
Export the data required for evaluation to cache files and load them into a new instance of MultiLevelDataset.
To reduce memory consumption, it creates all the data required for evaluation, writes it into cache files, and returns a new MultiLevelDataset instance backed by those cache files.
To benefit from the reduced memory consumption, make sure you do not keep any references to the old dataset instance, so it can be garbage collected by the interpreter. You can do something like:
```python
dataset = MultiLevelDataset(...)
dataset = dataset.export_and_load_eval_cache()
gc.collect()  # if you want to force the interpreter to release the memory right away
```
- Parameters:
cache_dir (Optional[PathLike]) – a directory where cache files should be saved. If None, create a unique cache directory based on the dataset fingerprint.
cache_pardir (Optional[PathLike]) – the parent directory to save the cache files to. If provided, we create a subdir in this directory based on the dataset fingerprint and save the dataset cache in that subdir.
- Returns:
A new instance of MultiLevelDataset for evaluation that is backed by the cached files.