MaterializedQRel
- class trove.containers.materialized_qrel.MaterializedQRel(args, file_name_to_id=None, record_mapping_collection=None, num_proc=None)
- __init__(args, file_name_to_id=None, record_mapping_collection=None, num_proc=None)
Represents a collection of queries and documents and the relations between them.
You can use this class to combine multiple query files with multiple corpus files and represent the relation between them with a combination of multiple qrel files.
For the collection of queries:
Each query should have a unique ID
Each query should appear only once in each file
Each query should only appear in one file
In sum, each query ID should only appear once across all your files.
At the moment, we do not strictly enforce these restrictions because doing so is computationally expensive for large collections. You will get logically wrong results (or inconsistent ones, at best) if your files do not follow these restrictions.
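Since the restrictions above are not enforced, it can be worth checking them yourself before building the container. The following is a minimal sketch of such a check, not part of the library: the helper name find_duplicate_ids, the "_id" record key, and the in-memory stand-ins for query files are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List


def find_duplicate_ids(files: Dict[str, Iterable[dict]], id_key: str = "_id") -> List[str]:
    """Return the IDs that occur more than once across all files, sorted."""
    # Count every ID across every file; a count above 1 violates the restriction.
    counts = Counter(rec[id_key] for records in files.values() for rec in records)
    return sorted(_id for _id, n in counts.items() if n > 1)


# Hypothetical in-memory stand-ins for two query files.
query_files = {
    "queries_a.jsonl": [{"_id": "q1", "text": "first"}, {"_id": "q2", "text": "second"}],
    "queries_b.jsonl": [{"_id": "q2", "text": "clash"}, {"_id": "q3", "text": "third"}],
}

duplicates = find_duplicate_ids(query_files)
print(duplicates)
```

An empty result means every ID appears exactly once across the given files.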
The same is true for the collection of corpus files and the collection of qrel files. For qrel files, each record should be unique with respect to the combination of qid and docid. I.e., each (qid, docid) combination should only appear once across all your qrel files.
If you do not specify a qrel file (i.e., args.qrel_path is empty), you can still use this class as a namespace container to hold a query and a corpus collection (or just one of them) without any known relation between them. E.g., hold a query and corpus collection for hard negative mining without knowing the relevance levels between the two. Another example is a MaterializedQRel instance that only has a collection of documents (without qrels or even queries). You can mix such an instance with other MaterializedQRel instances in a trove.data.ir_dataset_multilevel.MultiLevelDataset instance to expand the document pool during nearest neighbor search without impacting the qrel triplets (without adding new qrel triplets).
- Parameters:
args (MaterializedQRelConfig) – Information about files that contain the raw data and how to process them.
file_name_to_id (Optional[Dict[os.PathLike, str]]) – A mapping from file name to a unique ID (of type str) for that file. Although each qid and docid is unique in each instance of MaterializedQRel, you might want to combine multiple instances of MaterializedQRel that could potentially assign the same qid or docid to different examples. To make that possible and uniquely identify each record across files, we can create a new ID by combining the original ID (which is unique in each file) with a suffix that uniquely identifies the file that contains the record with that ID. file_name_to_id is the mapping from file names to file IDs that is used for this purpose. If not provided, we create this mapping based on the hash of the file bytes. You can update the file IDs later, but that leads to repeating a lot of the computations.
record_mapping_collection (Optional[Dict[str, RowsByKeySingleSource]]) – Instances of trove.data.ir_dataset_multilevel.RowsByKeySingleSource for query and corpus files. If provided, use these instances instead of loading the data into new instances.
num_proc (Optional[int]) – Passed to datasets.Dataset.* methods.
- static create_score_transform(args_score_transform)
Create score transform function from class arguments.
This method should not be used when applying the score_transform function. Instead, you should call this in __init__ to generate the score_transform callable and use that in the rest of the code.
- Parameters:
args_score_transform (Union[str, int, float, Callable[[Dict[str, Any]], Union[int, float]], None]) – Value of score_transform in the MaterializedQRelConfig object.
- Return type:
Callable[[Dict[str, Any]], Union[int, float]]
- Returns:
The score transform function that should be applied to each query/document score.
- static create_group_filter_fn(args_group_filter_fn, group_top_k, group_bottom_k, group_first_k, group_random_k)
Create the group_filter_fn function from class arguments.
You should call this in __init__ to generate the group_filter_fn callable and use that in the rest of the code.
Arguments are a subset of trove.containers.materialized_qrel_config.MaterializedQRelConfig attributes. See the docstring of that class for details.
- Return type:
Optional[Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]]]
- update_metadata()
Updates the metadata for the container.
It creates a new fingerprint, cache_dir, and metadata dict for the container.
- Return type:
None
- property fingerprint: str
A unique fingerprint for the contents and output of this container.
Containers with the same fingerprint generate the same output.
- property info: Dict
- property cache_dir: Path
- set_index_lookup_storage_type(storage)
Select whether the key-to-row-index lookup table should be stored in memory or in a memory-mapped LMDB dict.
- Return type:
None
- local_to_global_id(_id, _file, **_)
Calculate the unique global ID across files for each local ID.
The global ID is the local ID (i.e., original ID) plus the unique file ID appended at the end:
global_id = original_id + '_' + unique_file_id
- Parameters:
_id (str) – local id to generate a global id from
_file (os.PathLike) – the file that contains the record with this ID
**_ – Not used. Just to capture extra arguments
- Return type:
str
- Returns:
The global ID (unique across files) corresponding to the local _id.
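The formula above can be illustrated with a small standalone sketch. It is not the library's implementation: the helper name make_global_id, the explicit file_name_to_id argument, and the file names and file IDs are hypothetical. It only shows how the same local ID in two files maps to two distinct global IDs.

```python
import os
from typing import Dict, Union

PathT = Union[str, "os.PathLike[str]"]


def make_global_id(_id: str, _file: PathT, file_name_to_id: Dict[str, str]) -> str:
    """Append the unique ID of the containing file to the local ID, separated by '_'."""
    return _id + "_" + file_name_to_id[os.fspath(_file)]


# Hypothetical file IDs; the same local docid "d7" appears in both corpus files.
file_name_to_id = {"corpus_a.jsonl": "fa", "corpus_b.jsonl": "fb"}

gid_a = make_global_id("d7", "corpus_a.jsonl", file_name_to_id)
gid_b = make_global_id("d7", "corpus_b.jsonl", file_name_to_id)
print(gid_a, gid_b)
```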
- get_global_qids()
Create global IDs for all queries.
- Return type:
List[str]
- update_file_name_to_id(name_to_id_mapping)
Assign new global IDs to files used in this class.
- Parameters:
name_to_id_mapping (Dict[os.PathLike, str]) – A mapping from filepath to its global ID.
- Return type:
None
Retrieve the related query and document records for a given qid.
See MaterializedQRel._get_related_recs for more details.
- Parameters:
qid (str) – See MaterializedQRel._get_related_recs for more details.
materialize (bool) – Even if this is false, the records could be materialized if return_global_ids is true. See self._get_related_recs() for more details.
return_global_ids (bool) – Whether to return global or local IDs in records.
strict (bool) – Decides what to do if there are no corresponding qrel triplets for this qid. In such cases, if strict == False, it returns a tuple of (None, None), and if strict == True, it raises an exception.
- Return type:
Tuple[Optional[Dict], Optional[List[Dict]]]
- Returns:
A tuple of query record and related document records. See self._get_related_recs() for more details.
Get related query and document records given the global ID of a query.
See self.get_related_recs_for_local_qid() for details.
- Parameters:
strict (bool) – Decides what to do if there are no corresponding qrel triplets for this qid. In such cases, if strict == False, it returns a tuple of (None, None), and if strict == True, it raises an exception.
- Return type:
Tuple[Optional[Dict], Optional[List[Dict]]]
- Returns:
A tuple of query record and related document records, or (None, None) if there are no triplets for the query with the given ID (when strict == False). See self.get_related_recs_for_local_qid() for more details.
Get the related records for some query.
If strict == False and there are no triplets for the query with the given ID, it returns (None, None). See MaterializedQRel.get_related_recs_for_local_qid() for more details.
- Parameters:
global_qid (Optional[str]) – Global ID of the query. Takes precedence over local_qid if provided.
local_qid (Optional[str]) – Local ID of the query. It is ignored if global_qid is provided.
- Return type:
Tuple[Optional[Dict], Optional[List[Dict]]]
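The lookup rules described above (global_qid takes precedence over local_qid, and strict decides between raising and returning (None, None)) can be sketched with toy data. This is an illustration under stated assumptions, not the library's code: the function lookup, the three dictionaries, and the choice of KeyError as the exception type are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy data: one query with two related documents.
QUERIES: Dict[str, Dict] = {"q1": {"_id": "q1", "text": "what is lmdb"}}
RELATED: Dict[str, List[Dict]] = {"q1": [{"_id": "d1", "score": 1}, {"_id": "d2", "score": 0}]}
GLOBAL_TO_LOCAL: Dict[str, str] = {"q1_fa": "q1"}


def lookup(
    global_qid: Optional[str] = None,
    local_qid: Optional[str] = None,
    strict: bool = True,
) -> Tuple[Optional[Dict], Optional[List[Dict]]]:
    """Return (query record, related document records) for one query."""
    # global_qid wins whenever it is provided; local_qid is then ignored.
    if global_qid is not None:
        qid = GLOBAL_TO_LOCAL.get(global_qid)
    else:
        qid = local_qid
    if qid not in RELATED:
        if strict:
            # Assumption: the real exception type is not documented here.
            raise KeyError(f"no qrel triplets for query {global_qid or local_qid!r}")
        return None, None
    return QUERIES[qid], RELATED[qid]


query, docs = lookup(global_qid="q1_fa", local_qid="ignored")
missing = lookup(local_qid="q9", strict=False)
print(query["_id"], len(docs), missing)
```

With strict left at True, the second call would raise instead of returning (None, None).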
- get_qrel_nested_dict(return_global_ids=False)
Converts the qrel triplets to the nested dict format used by pytrec_eval.
- Parameters:
return_global_ids (bool) – If true, use global IDs for queries and documents.
- Return type:
Dict[str, Dict[str, Union[int, float]]]
- Returns:
A nested dict where dict[qid][docid] is the score between query qid and document docid.
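As a rough illustration of the nested format, the sketch below builds such a dict from plain (qid, docid, score) triplets. The helper to_nested_qrels and the sample triplets are hypothetical and do not reflect how the container stores its data internally.

```python
from typing import Dict, Iterable, Tuple, Union

Score = Union[int, float]


def to_nested_qrels(triplets: Iterable[Tuple[str, str, Score]]) -> Dict[str, Dict[str, Score]]:
    """Group (qid, docid, score) triplets so that result[qid][docid] == score."""
    nested: Dict[str, Dict[str, Score]] = {}
    for qid, docid, score in triplets:
        # Create the inner dict on first sight of a qid, then record the score.
        nested.setdefault(qid, {})[docid] = score
    return nested


# Hypothetical qrel triplets.
triplets = [("q1", "d1", 2), ("q1", "d2", 0), ("q2", "d3", 1)]
qrels = to_nested_qrels(triplets)
print(qrels)
```

Because each (qid, docid) pair is expected to be unique, no pair overwrites another in this structure.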