IRMetrics

class trove.evaluation.metrics.IRMetrics(k_values=None, measures=None, relevance_threshold=1, include_in_batch_negs=False)
__init__(k_values=None, measures=None, relevance_threshold=1, include_in_batch_negs=False)

Calculates IR metrics given qrels and scores.

  • You can call compute(), which simply returns the average metrics and does not change the instance's internal state.

  • You can call add_batch() multiple times and then call aggregate_results() to get the average metrics across all examples seen so far.

  • You can also pass an instance of this class to transformers.Trainer as the compute_metrics argument. In that case, you must set batch_eval_metrics=True to be compatible with this class.

This is how to use it:

from transformers import Trainer, TrainingArguments
from trove.evaluation.metrics import IRMetrics

metric_calculator = IRMetrics(k_values=[10, 100])
args = TrainingArguments(..., batch_eval_metrics=True, label_names=["YOUR_LABEL_NAME"], ...)
trainer = Trainer(args=args, ..., compute_metrics=metric_calculator, ...)
...
Parameters:
  • k_values (Union[int, List[int]]) – a single cutoff value or a list of cutoff values at which to calculate IR metrics, e.g., nDCG@K or recall@K. If provided, a predefined set of IR metrics is instantiated with these cutoff values.

  • measures (Optional[Set[str]]) – a set of measures passed to pytrec_eval. Either specify measures directly, or specify k_values and a default set of IR metrics is instantiated with the given cutoff values (see the construction sketch after this parameter list).

  • relevance_threshold (int) – minimum ground truth relevance level (inclusive) for a document to be considered relevant. Documents with relevance levels smaller than this value are excluded from the ground truth. See pytrec_eval.RelevanceEvaluator.__init__() for details.

  • include_in_batch_negs (bool) – If true, include in-batch negatives in the metric calculation when available. Only used if an instance of this class is passed to transformers.Trainer as the compute_metrics argument.
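For illustration, a minimal construction sketch. The k_values form mirrors the example above; the pytrec_eval measure strings (e.g., "ndcg_cut.10") are standard pytrec_eval names used here as an assumption, not necessarily this class's defaults:

from trove.evaluation.metrics import IRMetrics

# Option 1: give cutoff values and let the class build a default measure set.
metric_calculator = IRMetrics(k_values=[10, 100])

# Option 2: pass pytrec_eval measure strings directly (assumed example values).
metric_calculator = IRMetrics(measures={"ndcg_cut.10", "recall.100", "recip_rank"})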

reset_state()

Clear the metrics collected so far.

Useful if you are using the same instance for multiple eval loops.

Return type:

None

compute(scores, qrels)

Calculates the average IR metrics for the given scores and qrels. Does not modify the internal state of the class.

Parameters:
  • scores (Dict[str, Dict[str, Union[int, float]]]) – a mapping where scores[qid][docid] is the predicted similarity score between query qid and document docid.

  • qrels (Dict[str, Dict[str, int]]) – The ground truth relevance levels. qrels[qid][docid] is the ground truth relevance level of document docid for query qid.

Return type:

Dict[str, float]

Returns:

A mapping from metric name to the average value of the metric for the given examples.
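A minimal sketch of compute() with toy data; the query and document ids are arbitrary, and the exact metric names in the result depend on the configured measures:

metric_calculator = IRMetrics(k_values=[10])

scores = {"q1": {"d1": 12.3, "d2": 3.1}, "q2": {"d1": 0.5, "d3": 9.8}}
qrels = {"q1": {"d1": 1}, "q2": {"d3": 2}}

# e.g., {"ndcg_cut_10": ..., "recall_10": ...}, averaged over the two queries
metrics = metric_calculator.compute(scores=scores, qrels=qrels)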

add_batch(scores, qrels)

Calculates the average metrics for the given scores and qrels.

It also updates the running sum of metric values and the total number of examples in the instance's internal state. However, each call only returns the average metrics for the examples in that call, not across all examples seen so far.

See compute() for description of arguments.

Return type:

Dict[str, float]

Returns:

The average IR metrics for examples in this batch.

aggregate_results()

Calculates the average IR metrics across all examples seen so far.

Return type:

Dict[str, float]
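A sketch of the incremental workflow, assuming batches is a hypothetical iterable of (scores, qrels) pairs shaped as described in compute():

metric_calculator = IRMetrics(k_values=[10, 100])

for scores, qrels in batches:  # hypothetical iterable of (scores, qrels) pairs
    batch_metrics = metric_calculator.add_batch(scores=scores, qrels=qrels)

# Average IR metrics across all examples added so far.
overall_metrics = metric_calculator.aggregate_results()

# Clear the internal state before reusing the instance in another eval loop.
metric_calculator.reset_state()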

check_training_arguments(args)

When used with transformers.Trainer, check that the training arguments it uses are supported by this class.