IRMetrics
- class trove.evaluation.metrics.IRMetrics(k_values=None, measures=None, relevance_threshold=1, include_in_batch_negs=False)
- __init__(k_values=None, measures=None, relevance_threshold=1, include_in_batch_negs=False)
Calculates IR metrics given qrels and scores.
There are three ways to use this class. You can call compute(), which returns the average metrics without changing the instance's internal state. You can call add_batch() multiple times and then call aggregate_results() to get the average metrics across all examples seen so far. You can also pass an instance of this class to transformers.Trainer as the compute_metrics argument; in that case you must set batch_eval_metrics=True to be compatible with this class.
This is how to use it:

    metric_calculator = IRMetrics(k_values=[10, 100])
    args = TrainingArguments(..., batch_eval_metrics=True, label_names=["YOUR_LABEL_NAME"], ...)
    trainer = Trainer(args=args, ..., compute_metrics=metric_calculator, ...)
    ...
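After this setup, evaluation runs through the standard Trainer API. A minimal sketch, assuming the elided arguments above (model, datasets, etc.) are filled in; the exact metric key names in the returned dictionary depend on the configured measures:

    metrics = trainer.evaluate()
    # The Trainer adds its usual "eval_" prefix to every metric name;
    # inspect the returned dict to see the exact keys produced by IRMetrics.
    print(metrics)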
- Parameters:
k_values (Union[int, List[int]]) – a single cutoff value or a list of cutoff values at which to calculate the IR metrics, e.g., nDCG@K or recall@K. If provided, a predefined set of IR metrics is instantiated with these cutoff values.
measures (Optional[Set[str]]) – a set of measures to use with pytrec_eval. You should either specify measures directly or specify k_values, in which case a default set of IR metrics is instantiated with the given k_values.
relevance_threshold (int) – minimum ground-truth relevance level (inclusive) for a document to be considered relevant. Documents with relevance levels smaller than this value are excluded from the ground truth. See pytrec_eval.RelevanceEvaluator.__init__() for details.
include_in_batch_negs (bool) – if true, include the in-batch negatives in the metric calculation when available. Only used when an instance of this class is passed to transformers.Trainer as the compute_metrics argument.
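For illustration, two ways of constructing the class. This is a sketch; the pytrec_eval measure strings in the second call are an assumption and should be checked against the measures pytrec_eval actually supports:

    # Predefined IR metrics at the given cutoffs; relevance levels >= 1 count as relevant.
    metric_calculator = IRMetrics(k_values=[10, 100], relevance_threshold=1)

    # Alternatively, pass pytrec_eval measure names directly instead of k_values
    # (measure strings shown here are hypothetical examples).
    metric_calculator = IRMetrics(measures={"ndcg_cut.10", "recall.100"})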
- reset_state()
Clear the metrics collected so far.
Useful if you are using the same instance for multiple eval loops.
- Return type:
None
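For example, one instance can serve several evaluation loops if it is cleared in between. A sketch building on the Trainer setup above, with validation_set and test_set as placeholder dataset names:

    trainer.evaluate(eval_dataset=validation_set)
    metric_calculator.reset_state()  # drop the validation metrics before the next loop
    trainer.evaluate(eval_dataset=test_set)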
- compute(scores, qrels)
Calculates the average IR metrics for the given scores and qrels. Does not modify the internal state of the class.
- Parameters:
scores (Dict[str, Dict[str, Union[int, float]]]) – a mapping where scores[qid][docid] is the predicted similarity score between query qid and document docid.
qrels (Dict[str, Dict[str, int]]) – the ground-truth relevance scores, where qrels[qid][docid] is the ground-truth relevance score for query qid and document docid.
- Return type:
Dict[str, float]
- Returns:
A mapping from metric name to the average value of the metric for the given examples.
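A minimal standalone call with toy dictionaries in the documented scores[qid][docid] and qrels[qid][docid] layout (the query and document ids are made up for illustration):

    metric_calculator = IRMetrics(k_values=[10])
    scores = {"q1": {"d1": 12.3, "d2": 4.5, "d3": 0.7}}
    qrels = {"q1": {"d1": 1, "d3": 0}}
    avg_metrics = metric_calculator.compute(scores=scores, qrels=qrels)
    # Maps each metric name to its average over the queries in `scores`.
    print(avg_metrics)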
- add_batch(scores, qrels)
Calculates the average metrics for the given scores and qrels.
It also updates the running sum of the metric values and the total number of examples in the instance's internal state. Each call, however, returns the average metrics only for the examples passed in that call, not across all examples seen so far.
See compute() for a description of the arguments.
- Return type:
Dict[str, float]
- Returns:
The average IR metrics for examples in this batch.
- aggregate_results()
Calculates the average IR metrics across all examples seen so far.
- Return type:
Dict[str, float]
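A sketch of the streaming pattern these two methods support, assuming score_batches and qrel_batches are hypothetical iterables of per-batch scores and qrels dictionaries:

    metric_calculator = IRMetrics(k_values=[10, 100])
    for batch_scores, batch_qrels in zip(score_batches, qrel_batches):
        # Returns the averages for this batch only, while accumulating internally.
        batch_metrics = metric_calculator.add_batch(scores=batch_scores, qrels=batch_qrels)
    # Average over every example added so far.
    overall_metrics = metric_calculator.aggregate_results()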
- check_training_arguments(args)
If used with transformers.Trainer, check that the trainer's arguments are supported by this class.