RetrievalEvaluator

class trove.evaluation.evaluator.RetrievalEvaluator(args=None, model=None, tokenizer=None, data_collator=None, eval_dataset=None, compute_metrics=None, logit_collector=None, tracker_init_kwargs=None, tracker_extra_configs=None, tracker_callbacks=None)
__init__(args=None, model=None, tokenizer=None, data_collator=None, eval_dataset=None, compute_metrics=None, logit_collector=None, tracker_init_kwargs=None, tracker_extra_configs=None, tracker_callbacks=None)

Simple class for evaluating retrieval performance, mining hard negatives, or just computing the embeddings.

Parameters:
  • args (Optional[EvaluationArguments]) – general arguments to control the evaluation/encoding process.

  • model (Union[PretrainedRetriever, PreTrainedModel, nn.Module]) – retriever model to use

  • tokenizer (Optional[PreTrainedTokenizerBase]) – (Not used currently)

  • data_collator (Optional[Callable]) – callable to create a batch from a list of examples. It should be able to tokenize the text if embeddings are not precomputed.

  • eval_dataset (Optional[MultiLevelDataset]) – Evaluate the performance on this dataset.

  • compute_metrics (Optional[IRMetrics]) – an instance of IRMetrics class to calculate the IR metrics from predicted scores and ground truth qrels.

  • logit_collector (Optional[ResultHeapq]) – an instance of ResultHeapq that should take the similarity scores for each batch and keep the topk most similar documents with their scores for each query.

  • tracker_init_kwargs (Optional[Dict]) – extra kwargs for initializing experiment trackers. See RetrievalEvaluatorUtilsMixin for details.

  • tracker_extra_configs (Optional[Union[List[Dict], Dict]]) – extra configs to log with experiment trackers. See RetrievalEvaluatorUtilsMixin for details.

  • tracker_callbacks (Optional[Any]) – One or multiple custom experiment tracker callbacks. See RetrievalEvaluatorUtilsMixin for details.
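
A minimal construction sketch is shown below. Only trove.evaluation.evaluator.RetrievalEvaluator is referenced by its documented path; the other import locations, the constructor arguments, and the model and collator objects are illustrative placeholders and may differ in your installation.

from trove.evaluation.evaluator import RetrievalEvaluator

# Assumed import locations for the helper classes named in this reference;
# adjust them to wherever EvaluationArguments, MultiLevelDataset, and IRMetrics
# live in your version of trove.
from trove import EvaluationArguments, MultiLevelDataset, IRMetrics

args = EvaluationArguments(
    output_dir="eval_output",        # also used when dumping mined hard negatives
    encoding_cache_dir="emb_cache",  # fallback location for embedding caches
)

evaluator = RetrievalEvaluator(
    args=args,
    model=model,                          # a PretrainedRetriever / PreTrainedModel / nn.Module
    data_collator=collator,               # must tokenize text if embeddings are not precomputed
    eval_dataset=MultiLevelDataset(...),  # placeholder: your queries, corpus, and qrels
    compute_metrics=IRMetrics(...),       # placeholder: computes IR metrics from scores and qrels
)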

get_shard_weights(dataset)

Calculates relative shard sizes based on device performance.

In each process, it runs the model on a small subset of the given dataset (identical across processes) and estimates the model's throughput from the time it takes to process this subset. The throughput of each process is used as that process's shard weight.

Parameters:

dataset (EncodingDataset) – dataset to use for benchmarking. We only use a very small subset of it.

Return type:

Optional[List[float]]

Returns:

None if fair sharding is disabled. Otherwise, a list of shard weights such that output[rank] is the shard weight for the process with that rank.
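
For context, a hedged sketch of how the returned weights might be turned into per-process shard sizes; the rank and world_size variables and the use of len() on the dataset are illustrative assumptions, not part of the documented API.

weights = evaluator.get_shard_weights(encoding_dataset)
if weights is None:
    # fair sharding disabled: every process gets an equal shard
    shard_size = len(encoding_dataset) // world_size
else:
    # faster processes (higher throughput) receive proportionally larger shards
    shard_size = int(len(encoding_dataset) * weights[rank] / sum(weights))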

encode(eval_dataset=None, encoding_dataset=None, cache_pardir=None, display_name='eval')

Encode texts and cache embeddings.

  • For eval_dataset, the query and corpus files that it uses are encoded.

  • If an EncodingDataset ends up without any cache filepath, an exception is raised (something is probably wrong if you are computing the embeddings only to immediately throw them away).

Parameters:
  • eval_dataset (Optional[Union[MultiLevelDataset, Dict, List]]) – If given, encode the query and corpus files used in this dataset.

  • encoding_dataset (Optional[Union[EncodingDataset, Dict, List]]) – encode the data generated by these EncodingDataset instances.

  • cache_pardir (Optional[os.PathLike]) –

    Save the embedding cache here. The order of priority for where the cache is saved is as follows:

    • cache file path already attached to EncodingDataset instances

    • some file in cache_pardir given to this function

    • some file in EvaluationArguments.encoding_cache_dir if provided

    • no cache is saved (an exception is raised)

  • display_name (str) – Name to use in console logs and progress bars. Ideally, it should contain some information about the dataset being encoded.

Return type:

None
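
A usage sketch for encode(), assuming the evaluator constructed earlier; the directory and display names are placeholders.

# Precompute and cache embeddings for the query and corpus files of a dataset.
evaluator.encode(
    eval_dataset=eval_dataset,      # its query and corpus files are encoded
    cache_pardir="emb_cache",       # used only if the datasets carry no cache path
    display_name="msmarco-dev",     # shown in console logs and progress bars
)
# If no cache location can be resolved from the datasets, cache_pardir, or
# args.encoding_cache_dir, encode() raises instead of discarding the embeddings.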

Find the nearest neighbors for each query.

Note

To save memory in distributed environments, this method only returns the output in the process with rank == 0 and returns None in all other processes.

Parameters:
  • query_dataset (Union[EncodingDataset, List[EncodingDataset]]) – One or multiple datasets holding the search queries.

  • corpus_dataset (Union[EncodingDataset, List[EncodingDataset]]) – One or multiple datasets holding the documents to search.

  • logit_collector (Union[ResultHeapq, FastResultHeapq]) – Instance of ResultHeapq or FastResultHeapq that takes the scores for each batch and keeps the topk most similar documents and their scores for each query.

  • cache_pardir (os.PathLike) – Write the embedding cache files to this directory.

  • display_name (str) – Name to use in console logs and progress bars.

Return type:

Optional[Dict[str, Dict[str, float]]]

Returns:

A mapping containing the collected topk similarities, which is the output of the logit_collector.
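
The collected output is a nested mapping from query ID to document ID to similarity score. Below is a brief sketch of consuming it on the main process; the helper function is purely illustrative.

def show_top_docs(results, k=5):
    """results[qid][docid] = similarity(qid, docid); None on non-main processes."""
    if results is None:  # non-main ranks receive None
        return
    for qid, doc_scores in results.items():
        topk = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
        print(qid, topk)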

evaluate(eval_dataset=None, logit_collector=None, cache_pardir=None, display_name='eval', broadcast_output=None)

Run evaluation and return metrics and collected logits (i.e., scores).

The intermediate embeddings are written to a temporary cache. If the user does not explicitly ask to cache the embeddings, the temporary cache is deleted before returning from this function.

Parameters:
  • eval_dataset (Optional[Union[MultiLevelDataset, Dict[str, MultiLevelDataset]]]) – dataset to evaluate (if not provided, use RetrievalEvaluator.eval_dataset)

  • logit_collector (Optional[Union[Union[ResultHeapq, FastResultHeapq], Dict[str, Union[ResultHeapq, FastResultHeapq]]]]) – One or multiple instances of ResultHeapq or FastResultHeapq that take the scores for each batch and keep track of the topk most similar documents for each query.

  • cache_pardir (Optional[os.PathLike]) – Write the embedding cache files to this directory (if not provided, use RetrievalEvaluator.cache_pardir)

  • display_name (str) – Name to use in console logs and progress bars.

  • broadcast_output (Optional[bool]) – (only for distributed environments) If True, the output is duplicated across all processes (i.e., this method returns identical output in all processes). If False, only the main process returns the output and the other processes return None. Set it to False to save memory on machines with multiple GPUs.

Return type:

Union[Dict[str, Union[Any, Dict[str, float]]], Dict[str, Dict[str, Union[Any, Dict[str, float]]]], None]

Returns:

A mapping with two keys, metrics and logits. metrics is a mapping from metric name to metric value. logits is the subset of scores collected by logit_collector and is obtained by calling logit_collector.as_qrel_nested_dict. If eval_dataset is a dict, we return a mapping from keys in eval_dataset to the results for the corresponding dataset.
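
A usage sketch for evaluate(); the display name is a placeholder and the metric names in the returned mapping depend on how IRMetrics is configured.

output = evaluator.evaluate(display_name="msmarco-dev", broadcast_output=False)
if output is not None:         # with broadcast_output=False, only rank 0 gets output
    print(output["metrics"])   # mapping from metric name to metric value
    scores = output["logits"]  # nested qrel dict: scores[qid][docid] = similarity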

mine_hard_negatives(eval_dataset=None, query_filepath=None, corpus_filepath=None, logit_collector=None, cache_pardir=None, display_name='eval', num_negs=None, broadcast_output=None)

Mine the most similar documents for each query as hard negatives.

Retrieves the topk most similar documents for each query as hard negatives.

If the resulting eval_dataset object (whether given directly or created from query_filepath and corpus_filepath) contains a valid set of qrel triplets, queries without a corresponding qrel triplet are ignored: no documents are mined for them and they are not included in the returned results.

Mined hard negatives are returned in nested qrel format. However, instead of returning one qrel object, it creates one qrel object for each pair of query and corpus files (this allows us to read queries and documents from multiple files with potentially non-unique IDs across files). It returns a list of dicts, each corresponding to a pair of query and corpus files:

[
    {
        'query_file': 'path to the file that contains the corresponding queries.',
        'corpus_file': 'path to the file that contains the corresponding documents.',
        'qrel': '''a subset of mined hard negatives that contains only queries and documents
            from 'query_file' and 'corpus_file'. It is in nested qrel format
            (i.e., qrel[qid][docid] = similarity(qid, docid)).'''
    },
    ...
]

If args.output_dir is provided, mined hard negatives are also written to disk in grouped qrel format as a JSON Lines file.

Parameters:
  • eval_dataset (Union[MultiLevelDataset, Dict[str, MultiLevelDataset], None]) – dataset to mine hard negatives for.

  • query_filepath (Union[PathLike, List[PathLike], None]) – file to read queries from.

  • corpus_filepath (Union[PathLike, List[PathLike], None]) – file to read documents from.

  • logit_collector (Union[ResultHeapq, FastResultHeapq, Dict[str, Union[ResultHeapq, FastResultHeapq]], None]) – Instance of ResultHeapq or FastResultHeapq that takes the scores for each batch and keeps the topk most similar documents and their scores for each query.

  • cache_pardir (Optional[PathLike]) – Write the embedding cache files to this directory.

  • display_name (str) – Name to use in console logs and progress bars.

  • num_negs (Optional[int]) – number of hard negatives to mine per query. If not provided, use the value of args.search_topk.

  • broadcast_output (Optional[bool]) – (only for distributed environments) If True, the output is duplicated across all processes (i.e., this method returns identical output in all processes). If False, only the main process returns the output and the other processes return None. Set it to False to save memory on machines with multiple GPUs.

Return type:

Union[List[Dict[str, Union[PathLike, Dict[str, Dict[str, float]]]]], Dict[str, List[Dict[str, Union[PathLike, Dict[str, Dict[str, float]]]]]], None]

Returns:

A list of mined hard negatives with one entry for each contributing pair of query and corpus files. See extended method docstring for details. If eval_dataset is a dict, we return a mapping from keys in eval_dataset to the described results for the corresponding dataset.
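
A usage sketch for mine_hard_negatives(), mining from explicit query and corpus files; the file paths are placeholders.

mined = evaluator.mine_hard_negatives(
    query_filepath="queries.jsonl",  # placeholder path
    corpus_filepath="corpus.jsonl",  # placeholder path
    num_negs=50,                     # falls back to args.search_topk if omitted
    broadcast_output=False,
)
if mined is not None:                # non-main processes receive None
    for group in mined:              # one entry per (query_file, corpus_file) pair
        qrel = group["qrel"]         # qrel[qid][docid] = similarity(qid, docid)
        print(group["query_file"], group["corpus_file"], len(qrel))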

remove_temp_cache_pardir()

Delete cache pardir and all its content if it was supposed to be temporary.

Does not raise an exception on failure.

Return type:

None