MaterializedQRel

class trove.containers.materialized_qrel.MaterializedQRel(args, file_name_to_id=None, record_mapping_collection=None, num_proc=None)

__init__(args, file_name_to_id=None, record_mapping_collection=None, num_proc=None)

Represents a collection of queries and documents and the relations between them.

You can use this class to combine multiple query files with multiple corpus files and represent the relation between them with a combination of multiple qrel files.

For the collection of queries:

Each query should have a unique ID

Each query should appear only once in each file

Each query should only appear in one file

In sum, each query ID should only appear once across all your files.

At the moment, we do not strictly enforce these restrictions because checking them is computationally expensive for large collections. If your files do not follow these restrictions, you will get logically wrong results (or inconsistent ones, at best).

The same is true for the collection of corpus files and the collection of qrel files. For qrel files, each record should be unique with respect to the combination of qid and docid. I.e., each (qid, docid) combination should only appear once across all your qrel files.
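Since the class does not enforce these restrictions, you may want to check them yourself before loading the data. The following is a minimal sketch of such a check, not part of trove: the helper names find_duplicate_ids and find_duplicate_pairs and the example file names are hypothetical, and the sketch assumes the IDs have already been read from the files.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def find_duplicate_ids(ids_per_file: Dict[str, Iterable[str]]) -> List[str]:
    """Return the IDs that appear more than once across all files (sorted)."""
    counts = Counter(_id for ids in ids_per_file.values() for _id in ids)
    return sorted(_id for _id, n in counts.items() if n > 1)


def find_duplicate_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the (qid, docid) pairs that appear more than once (sorted)."""
    counts = Counter(pairs)
    return sorted(pair for pair, n in counts.items() if n > 1)


# Hypothetical query IDs read from two query files: "q2" violates the rule.
query_ids = {"queries_a.jsonl": ["q1", "q2"], "queries_b.jsonl": ["q2", "q3"]}
duplicate_qids = find_duplicate_ids(query_ids)

# Hypothetical (qid, docid) pairs collected from all qrel files.
qrel_pairs = [("q1", "d1"), ("q1", "d2"), ("q1", "d1")]
duplicate_pairs = find_duplicate_pairs(qrel_pairs)
```

Both lists should be empty for a collection that follows the restrictions above.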

If you do not specify a qrel file (i.e., args.qrel_path is empty), you can still use this class as a namespace container that holds a query collection and a corpus collection (or just one of them) without any known relation between them. E.g., it can hold a query and corpus collection for hard negative mining without knowing the relevance levels between the two. Another example is a MaterializedQRel instance that only has a collection of documents (without qrels or even queries). You can mix such an instance with other MaterializedQRel instances in a trove.data.ir_dataset_multilevel.MultiLevelDataset instance to expand the document pool during nearest neighbor search without adding new qrel triplets.

Parameters:

args (MaterializedQRelConfig) – Information about files that contain the raw data and how to process them.
file_name_to_id (Optional[Dict[os.PathLike, str]]) – A mapping from file name to a unique ID (of type str) for that file. Although each qid and docid is unique in each instance of MaterializedQRel, you might want to combine multiple instances of MaterializedQRel that could potentially assign the same qid or docid to different examples. To make that possible and uniquely identify each record across files, we can create a new ID by combining the original ID (which is unique in each file) with a suffix that uniquely identifies the file that contains the record with that ID. file_name_to_id is the mapping from file names to the file IDs that are used for this purpose. If not provided, we create this mapping based on the hash of the file bytes. You can update the file IDs later, but that leads to repeating a lot of the computations.
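To illustrate the default described for file_name_to_id, here is a minimal sketch that derives a file ID from the hash of the file bytes. It is not trove's implementation: the source does not say which hash function is used, so SHA-256 is an assumption, and hash_file_bytes and build_file_name_to_id are hypothetical helper names.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Iterable


def hash_file_bytes(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file content in chunks so large files are not read at once."""
    digest = hashlib.sha256()  # assumption: the actual hash function may differ
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_file_name_to_id(paths: Iterable[Path]) -> Dict[str, str]:
    """Map each file path to an ID derived from the file content."""
    return {str(p): hash_file_bytes(p) for p in paths}


# Two small hypothetical query files with different content.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "queries_a.jsonl"
    second = Path(tmp) / "queries_b.jsonl"
    first.write_text('{"_id": "q1"}\n')
    second.write_text('{"_id": "q2"}\n')
    file_name_to_id = build_file_name_to_id([first, second])
```

In this sketch, two files with identical bytes get the same ID. Passing an explicit file_name_to_id mapping avoids relying on the file content.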
record_mapping_collection (Optional[Dict[str, RowsByKeySingleSource]]) – Instances of trove.data.ir_dataset_multilevel.RowsByKeySingleSource for the query and corpus files. If provided, these instances are used instead of loading the data into new instances.
num_proc (Optional[int]) – passed to datasets.Dataset.* methods.

static create_score_transform(args_score_transform)

Create score transform function from class arguments.

This method should not be used when applying the score_transform function. Instead, you should call this in __init__ to generate the score_transform callable and use that in the rest of the code.

Parameters:: args_score_transform (Union[str, int, float, Callable[[Dict[str, Any]], Union[int, float]], None]) – value of score_transform in MaterializedQRelConfig object.
Return type:: Callable[[Dict[str, Any]], Union[int, float]]
Returns:: The score transform function that should be applied to each query/document score.

static create_group_filter_fn(args_group_filter_fn, group_top_k, group_bottom_k, group_first_k, group_random_k)

Create group_filter_fn function from class arguments.

You should call this in __init__ to generate the group_filter_fn callable and use that in the rest of the code.

Arguments are a subset of trove.containers.materialized_qrel_config.MaterializedQRelConfig attributes. See the docstring of that class for details.

Return type:: Optional[Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]]]

update_metadata()

Updates the metadata for the container.

It creates a new fingerprint, cache_dir, and metadata dict for the container.

Return type:: None

property fingerprint: str

A unique fingerprint for the contents and output of this container.

Containers with the same fingerprint generate the same output.

property info: Dict

property cache_dir: Path

set_index_lookup_storage_type(storage)

Select whether the key-to-row-index lookup table should be stored in memory or in a memory-mapped LMDB dict.

Return type:: None

local_to_global_id(_id, _file, **_)

Calculate the unique global ID across files for each local ID.

The global ID is the local ID (i.e., original ID) plus the unique file ID appended at the end: global_id = original_id + '_' + unique_file_id

Parameters:

_id (str) – local id to generate a global id from
_file (os.PathLike) – the file that contains the record with this ID
**_ – Not used. Just to capture extra arguments

Return type:

str

Returns:

The global ID (unique across files) corresponding to the local _id
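The formula above can be sketched as a standalone function. This is an illustration only: local_to_global_id_sketch is a hypothetical name, and the file IDs in the example mapping are made up.

```python
from typing import Dict


def local_to_global_id_sketch(
    local_id: str, file_path: str, file_name_to_id: Dict[str, str]
) -> str:
    """global_id = original_id + '_' + unique_file_id"""
    return local_id + "_" + file_name_to_id[file_path]


# Hypothetical mapping from file names to unique file IDs.
file_ids = {"queries_a.jsonl": "fileA", "queries_b.jsonl": "fileB"}

# The same local ID in two different files maps to two distinct global IDs.
gid_a = local_to_global_id_sketch("q1", "queries_a.jsonl", file_ids)
gid_b = local_to_global_id_sketch("q1", "queries_b.jsonl", file_ids)
```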

get_global_qids()

Create global IDs for all queries.

Return type:: List[str]

update_file_name_to_id(name_to_id_mapping)

Assign new global IDs to files used in this class.

Parameters:: name_to_id_mapping (Dict[os.PathLike, str]) – A mapping from filepath to its global ID.
Return type:: None

get_related_recs_for_local_qid(qid, materialize=False, return_global_ids=False, strict=True)

Retrieve the related query and document records for a given qid.

See MaterializedQRel._get_related_recs for more details.

Parameters:

qid (str) – See MaterializedQRel._get_related_recs for more details.
materialize (bool) – Even if this is false, the records could be materialized if return_global_ids is true. See self._get_related_recs() for more details.
return_global_ids (bool) – Whether to return global or local IDs in records
strict (bool) – Decides what to do if there are no corresponding qrel triplets for this qid. In such cases, if strict == False, it returns a tuple of (None, None), and if strict == True, it raises an exception.

Return type:

Tuple[Optional[Dict], Optional[List[Dict]]]

Returns:

A tuple of query record and related document records. See self._get_related_recs() for more details.
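The effect of strict can be illustrated with a toy lookup over plain dicts. This is a sketch of the documented behavior, not trove's code: lookup_related is a hypothetical function, and the exception type is an assumption because the source only says that an exception is raised.

```python
from typing import Dict, List, Optional, Tuple


def lookup_related(
    qid: str,
    queries: Dict[str, Dict],
    qrels: Dict[str, List[Dict]],
    strict: bool = True,
) -> Tuple[Optional[Dict], Optional[List[Dict]]]:
    """Return (query record, related document records) for qid."""
    related = qrels.get(qid)
    if not related:
        if strict:
            # Assumption: KeyError stands in for whatever trove raises.
            raise KeyError(f"no qrel triplets for qid {qid!r}")
        return None, None
    return queries[qid], related


# Toy data: "q2" exists as a query but has no qrel triplets.
queries = {"q1": {"_id": "q1", "text": "a query"}, "q2": {"_id": "q2", "text": "other"}}
qrels = {"q1": [{"_id": "d1", "score": 1}]}

found = lookup_related("q1", queries, qrels)
missing = lookup_related("q2", queries, qrels, strict=False)
```

With strict=False, callers must handle the (None, None) case themselves. With strict=True, a missing qid fails loudly.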

get_related_recs_for_global_qid(qid, materialize=False, return_global_ids=False, strict=True)

Get related query and document records given the global ID of a query.

See self.get_related_recs_for_local_qid() for details.

Parameters:: strict (bool) – Decides what to do if there are no corresponding qrel triplets for this qid. In such cases, if strict == False, it returns a tuple of (None, None), and if strict == True, it raises an exception.
Return type:: Tuple[Optional[Dict], Optional[List[Dict]]]
Returns:: A tuple of query record and related document records, or (None, None) if there are no triplets for the query with the given ID (when strict == False). See self.get_related_recs_for_local_qid() for more details.

get_related_recs(global_qid=None, local_qid=None, materialize=False, return_global_ids=True, strict=True)

Get the related records for some query.

If strict == False and there are no triplets for the query with the given ID, it returns (None, None). See MaterializedQRel.get_related_recs_for_local_qid() for more details.

Parameters:

global_qid (Optional[str]) – global ID of the query. Takes precedence over local_qid if provided.
local_qid (Optional[str]) – local ID of the query. It is ignored if global_qid is provided.

Return type:

Tuple[Optional[Dict], Optional[List[Dict]]]
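The precedence between the two ID arguments can be sketched as follows. resolve_qid is a hypothetical helper, and raising ValueError when neither ID is given is an assumption, since the source does not describe that case.

```python
from typing import Optional, Tuple


def resolve_qid(
    global_qid: Optional[str] = None, local_qid: Optional[str] = None
) -> Tuple[str, str]:
    """Pick which ID to use: global_qid wins whenever it is provided."""
    if global_qid is not None:
        return "global", global_qid
    if local_qid is not None:
        return "local", local_qid
    # Assumption: the source does not say what happens when both are None.
    raise ValueError("provide either global_qid or local_qid")


both = resolve_qid(global_qid="q1_fileA", local_qid="q1")
only_local = resolve_qid(local_qid="q1")
```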

get_qrel_nested_dict(return_global_ids=False)

Converts the qrel triplets to the nested dict format used by pytrec_eval.

Parameters:: return_global_ids (bool) – if true, use global ids for queries and documents.
Return type:: Dict[str, Dict[str, Union[int, float]]]
Returns:: a nested dict where dict[qid][docid] is the score between query qid and document docid.
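As an illustration of the returned structure, the sketch below builds the same nested dict shape from a list of (qid, docid, score) triplets. triplets_to_nested_dict is a hypothetical helper and the triplets are made-up data; it does not reflect how the class stores its qrels internally.

```python
from typing import Dict, Iterable, Tuple, Union

Score = Union[int, float]


def triplets_to_nested_dict(
    triplets: Iterable[Tuple[str, str, Score]]
) -> Dict[str, Dict[str, Score]]:
    """Group (qid, docid, score) triplets so that nested[qid][docid] == score."""
    nested: Dict[str, Dict[str, Score]] = {}
    for qid, docid, score in triplets:
        nested.setdefault(qid, {})[docid] = score
    return nested


# Made-up qrel triplets for two queries.
triplets = [("q1", "d1", 2), ("q1", "d2", 0), ("q2", "d3", 1)]
qrel_dict = triplets_to_nested_dict(triplets)
```

A dict of this shape is what pytrec_eval.RelevanceEvaluator accepts as its qrel argument.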