RetrievalEvaluator
- class trove.evaluation.evaluator.RetrievalEvaluator(args=None, model=None, tokenizer=None, data_collator=None, eval_dataset=None, compute_metrics=None, logit_collector=None, tracker_init_kwargs=None, tracker_extra_configs=None, tracker_callbacks=None)
- __init__(args=None, model=None, tokenizer=None, data_collator=None, eval_dataset=None, compute_metrics=None, logit_collector=None, tracker_init_kwargs=None, tracker_extra_configs=None, tracker_callbacks=None)
Simple class for evaluating retrieval performance, mining hard negatives, or just computing the embeddings.
- Parameters:
args (Optional[EvaluationArguments]) – general arguments to control the evaluation/encoding process.
model (Union[PretrainedRetriever, PreTrainedModel, nn.Module]) – retriever model to use
tokenizer (Optional[PreTrainedTokenizerBase]) – (Not used currently)
data_collator (Optional[Callable]) – callable to create a batch from a list of examples. It should be able to tokenize the text if embeddings are not precomputed.
eval_dataset (Optional[MultiLevelDataset]) – Evaluate the performance on this dataset.
compute_metrics (Optional[IRMetrics]) – an instance of IRMetrics class to calculate the IR metrics from predicted scores and ground truth qrels.
logit_collector (Optional[ResultHeapq]) – an instance of ResultHeapq that should take the similarity scores for each batch and keep the topk most similar documents with their scores for each query.
tracker_init_kwargs (Optional[Dict]) – extra kwargs for initializing experiment trackers. See RetrievalEvaluatorUtilsMixin for details.
tracker_extra_configs (Optional[Union[List[Dict], Dict]]) – extra configs to log with experiment trackers. See RetrievalEvaluatorUtilsMixin for details.
tracker_callbacks (Optional[Any]) – One or multiple custom experiment tracker callbacks. See RetrievalEvaluatorUtilsMixin for details.
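A minimal sketch of wiring these pieces together. The import path mirrors the class path above; the helper function and its argument names are hypothetical, and the objects passed in (an EvaluationArguments instance, a retriever model, a data collator, a MultiLevelDataset, an IRMetrics instance, and a ResultHeapq) are assumed to be built elsewhere:
```python
from trove.evaluation.evaluator import RetrievalEvaluator

def build_evaluator(eval_args, retriever, collator, dataset, ir_metrics, result_heapq):
    # Hypothetical helper: assemble an evaluator from pre-built components.
    # eval_args: EvaluationArguments, retriever: PretrainedRetriever / nn.Module,
    # collator: batches (and tokenizes) examples, dataset: MultiLevelDataset,
    # ir_metrics: IRMetrics, result_heapq: ResultHeapq keeping top-k docs per query.
    return RetrievalEvaluator(
        args=eval_args,
        model=retriever,
        data_collator=collator,
        eval_dataset=dataset,
        compute_metrics=ir_metrics,
        logit_collector=result_heapq,
    )
```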
- get_shard_weights(dataset)
Calculates relative shard sizes based on device performance.
In each process, it runs the model on a small subset of the given dataset (identical across processes) and judges the model's throughput by the time it takes to process this subset. The throughput in each process is used as the shard weight for that process.
- Parameters:
dataset (EncodingDataset) – dataset to use for benchmarking. We only use a very small subset of it.
- Return type:
Optional[List[float]]
- Returns:
None if fair sharding is disabled. Otherwise, a list of shard weights such that output[rank] is the shard weight for the process with that rank.
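A short sketch of how the return value could be inspected; the helper name is hypothetical and an EncodingDataset built elsewhere is assumed:
```python
def report_shard_weights(evaluator, encoding_dataset):
    # Benchmark throughput on a small, process-identical subset of the dataset.
    weights = evaluator.get_shard_weights(encoding_dataset)
    if weights is None:
        print("fair sharding is disabled")
    else:
        # weights[rank] is the relative throughput measured in the process with that rank
        for rank, weight in enumerate(weights):
            print(f"rank {rank}: shard weight {weight:.3f}")
    return weights
```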
- encode(eval_dataset=None, encoding_dataset=None, cache_pardir=None, display_name='eval')
Encode texts and cache embeddings.
For eval_dataset, the query and corpus files that it uses are encoded. If an EncodingDataset ends up without any cache filepath, it raises an exception (there is probably something wrong if you are just computing the embeddings and immediately throwing them away).
- Parameters:
eval_dataset (Optional[Union[MultiLevelDataset, Dict, List]]) – If given, encode the query and corpus files used in this dataset.
encoding_dataset (Optional[Union[EncodingDataset, Dict, List]]) – encode the data generated by these EncodingDataset instances.
cache_pardir (Optional[os.PathLike]) – Save the embedding cache here. The order of priority for where the cache is saved is as follows:
1. cache file path already attached to EncodingDataset instances
2. some file in the cache_pardir given to this function
3. some file in EvaluationArguments.encoding_cache_dir if provided
4. no cache is saved (raises an exception)
display_name (str) – Name to use in console logs and progress bars. Ideally, it should contain some information about the dataset being encoded.
- Return type:
None
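A minimal usage sketch, assuming an evaluator and a MultiLevelDataset built elsewhere; the helper name and display name are hypothetical:
```python
def cache_embeddings(evaluator, dataset, cache_dir):
    # Encode the query and corpus files used by `dataset` and write the
    # embedding cache under `cache_dir` (see the priority order above).
    evaluator.encode(
        eval_dataset=dataset,
        cache_pardir=cache_dir,
        display_name="eval-corpus",  # illustrative name for logs and progress bars
    )
```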
- nearest_neighbor_search(query_dataset, corpus_dataset, logit_collector, cache_pardir, display_name)
Find the nearest neighbors for each query.
Note
To save memory in distributed environments, this method only returns the output in the process with rank == 0 and returns None in other processes.
- Parameters:
query_dataset (Union[EncodingDataset, List[EncodingDataset]]) – One or multiple datasets holding the search queries.
corpus_dataset (Union[EncodingDataset, List[EncodingDataset]]) – One or multiple datasets holding the documents to search.
logit_collector (Union[ResultHeapq, FastResultHeapq]) – Instance of ResultHeapq or FastResultHeapq that takes the scores for each batch and keeps the topk most similar documents and their scores for each query.
cache_pardir (os.PathLike) – Write the embedding cache files to this directory.
display_name (str) – Name to use in console logs and progress bars.
- Return type:
Optional[Dict[str, Dict[str, float]]]
- Returns:
a mapping containing the collected topk similarities, which is the output of the logit_collector.
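A hedged sketch of a search call, assuming EncodingDataset instances and a ResultHeapq (or FastResultHeapq) created elsewhere; the helper and display names are hypothetical:
```python
def run_search(evaluator, query_ds, corpus_ds, collector, cache_dir):
    results = evaluator.nearest_neighbor_search(
        query_dataset=query_ds,
        corpus_dataset=corpus_ds,
        logit_collector=collector,
        cache_pardir=cache_dir,
        display_name="nn-search",
    )
    # In distributed runs only rank 0 receives the results; other ranks get None.
    if results is not None:
        # results[qid][docid] -> similarity score for the top-k kept documents
        for qid, doc_scores in results.items():
            best_doc = max(doc_scores, key=doc_scores.get)
            print(qid, best_doc, doc_scores[best_doc])
    return results
```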
- evaluate(eval_dataset=None, logit_collector=None, cache_pardir=None, display_name='eval', broadcast_output=None)
Run evaluation and return metrics and collected logits (i.e., scores).
The intermediate embeddings are written to a temporary cache. If the user does not explicitly ask to cache the embeddings, the temporary cache is deleted before returning from this function.
- Parameters:
eval_dataset (Optional[Union[MultiLevelDataset, Dict[str, MultiLevelDataset]]]) – dataset to evaluate (if not provided, use RetrievalEvaluator.eval_dataset).
logit_collector (Optional[Union[ResultHeapq, FastResultHeapq, Dict[str, Union[ResultHeapq, FastResultHeapq]]]]) – One or multiple instances of ResultHeapq or FastResultHeapq that take the scores for each batch and keep track of the topk most similar documents for each query.
cache_pardir (Optional[os.PathLike]) – Write the embedding cache files to this directory (if not provided, use RetrievalEvaluator.cache_pardir).
display_name (str) – Name to use in console logs and progress bars.
broadcast_output (Optional[bool]) – (only for distributed environments) If True, the output is duplicated across all processes (i.e., this method returns identical output in all processes). If False, only the main process returns the output and other processes return None. Set it to False to save memory on machines with multiple GPUs.
- Return type:
Union[Dict[str, Union[Any, Dict[str, float]]], Dict[str, Dict[str, Union[Any, Dict[str, float]]]], None]
- Returns:
a mapping with two keys, metrics and logits. metrics is a mapping from metric name to metric value. logits is the subset of scores collected by logit_collector and is obtained by calling logit_collector.as_qrel_nested_dict. If eval_dataset is a dict, we return a mapping from keys in eval_dataset to the results for the corresponding dataset.
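A minimal sketch of an evaluation call that falls back to RetrievalEvaluator.eval_dataset; the helper name is hypothetical and the structure of the output follows the return description above:
```python
def evaluate_and_report(evaluator, cache_dir=None):
    output = evaluator.evaluate(
        cache_pardir=cache_dir,    # falls back to RetrievalEvaluator.cache_pardir if None
        broadcast_output=False,    # keep results only on the main process to save memory
    )
    if output is not None:         # non-main processes receive None
        print(output["metrics"])   # metric name -> metric value
        logits = output["logits"]  # nested qrels: logits[qid][docid] = score
    return output
```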
- mine_hard_negatives(eval_dataset=None, query_filepath=None, corpus_filepath=None, logit_collector=None, cache_pardir=None, display_name='eval', num_negs=None, broadcast_output=None)
Mine the most similar documents for each query as hard negatives.
Retrieves the topk most similar documents for each query as hard negatives.
If the resulting eval_dataset object (whether given directly or created from query_filepath and corpus_filepath) contains a valid set of qrel triplets, queries that have no corresponding qrel triplet are ignored; i.e., no documents are mined for them and they are not included in the returned results.
Mined hard negatives are returned in nested qrel format. But instead of returning one qrel object, it creates one qrel object for each pair of query and corpus files (this allows us to read queries and documents from multiple files with potentially non-unique IDs across files). It returns a list of dicts, each corresponding to a pair of query and corpus files:
[
  {
    'query_file': 'path to the file that contains the corresponding queries.',
    'corpus_file': 'path to the file that contains the corresponding documents.',
    'qrel': '''a subset of the mined hard negatives that contains only queries and documents
               from 'query_file' and 'corpus_file'. It is in nested qrel format
               (i.e., qrel[qid][docid] = similarity(qid, docid)).'''
  },
  ...
]
If args.output_dir is provided, mined hard negatives are also written to disk in grouped qrel format in a JSON lines file.
- Parameters:
eval_dataset (Union[MultiLevelDataset, Dict[str, MultiLevelDataset], None]) – dataset to mine hard negatives for.
query_filepath (Union[PathLike, List[PathLike], None]) – file to read queries from.
corpus_filepath (Union[PathLike, List[PathLike], None]) – file to read documents from.
logit_collector (Union[ResultHeapq, FastResultHeapq, Dict[str, Union[ResultHeapq, FastResultHeapq]], None]) – Instance of ResultHeapq or FastResultHeapq that takes the scores for each batch and keeps the topk most similar documents and their scores for each query.
cache_pardir (Optional[PathLike]) – Write the embedding cache files to this directory.
display_name (str) – Name to use in console logs and progress bars.
num_negs (Optional[int]) – number of hard negatives to mine per query. If not provided, use the value of args.search_topk.
broadcast_output (Optional[bool]) – (only for distributed environments) If True, the output is duplicated across all processes (i.e., this method returns identical output in all processes). If False, only the main process returns the output and other processes return None. Set it to False to save memory on machines with multiple GPUs.
- Return type:
Union[List[Dict[str, Union[PathLike, Dict[str, Dict[str, float]]]]], Dict[str, List[Dict[str, Union[PathLike, Dict[str, Dict[str, float]]]]]], None]
- Returns:
A list of mined hard negatives with one entry for each contributing pair of query and corpus files. See the extended method docstring for details. If eval_dataset is a dict, we return a mapping from keys in eval_dataset to the described results for the corresponding dataset.
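A hedged sketch of mining hard negatives directly from query and corpus files; the helper name is hypothetical, and it assumes the evaluator's args provide a cache location or output directory as needed:
```python
def mine_negatives(evaluator, queries_path, corpus_path, k=50):
    mined = evaluator.mine_hard_negatives(
        query_filepath=queries_path,
        corpus_filepath=corpus_path,
        num_negs=k,                # falls back to args.search_topk if omitted
        broadcast_output=False,    # only the main process returns the results
    )
    if mined is not None:
        for group in mined:        # one entry per (query_file, corpus_file) pair
            qrel = group["qrel"]   # qrel[qid][docid] = similarity(qid, docid)
            print(group["query_file"], group["corpus_file"], len(qrel))
    return mined
```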
- remove_temp_cache_pardir()
Delete the cache pardir and all its contents if it was supposed to be temporary.
Does not raise an exception on failure.
- Return type:
None