MaterializedQRelConfig
- class trove.containers.materialized_qrel_config.MaterializedQRelConfig(qrel_path=None, corpus_path=None, query_path=None, corpus_cache=None, query_cache=None, query_subset_path=None, min_score=None, max_score=None, filter_fn=None, group_top_k=None, group_bottom_k=None, group_first_k=None, group_random_k=None, group_filter_fn=None, score_transform=None)
Information about a collection of queries, documents, and (optionally) the relation between them.
You can use both paths to local files and remote HF hub fsspec URIs for qrel_path, corpus_path, query_path, and query_subset_path. HF hub URIs must start with hf://. See the HF hub documentation for the exact structure.
If you do not set the value of qrel_path or set it to an empty list, MaterializedQRel will be a namespace container without any information about the relation between queries and documents. See the trove.containers.materialized_qrel.MaterializedQRel docstring for details.
-
qrel_path: Union[PathLike, List[PathLike], None] = None
One or multiple files that contain triplets of ('qid', 'docid', 'score'). The files do not need to explicitly contain such triplets; they can also be inferred from other types of data. See file_reader.load_qrel() for supported files.
-
corpus_path: Union[PathLike, List[PathLike], None] = None
One or multiple files that contain the passage text and optionally titles.
-
query_path: Union[PathLike, List[PathLike], None] = None
One or multiple files that contain the query texts.
-
corpus_cache: Union[PathLike, List[PathLike], None] = None
(Not directly used) A corresponding cache file name for each corpus_path entry, used to read/write the resulting embedding vectors. This value is not used directly; it is only saved as part of MaterializedQRel.args so that you can use it later. For example, you can use this to store a unique relative filepath for each of the corpus files. Then, at runtime, compute a parent directory (e.g., based on the embedding model name) and combine it with the relative filepath to get the complete path to the cache files.
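The pattern described above can be sketched as follows. This is a minimal illustration and not part of the library: resolve_cache_paths, the cache root, the model name, and the relative file names are all hypothetical stand-ins for whatever you stored in corpus_cache.

```python
from pathlib import Path
from typing import List


def resolve_cache_paths(cache_root: str, model_name: str, relative_paths: List[str]) -> List[Path]:
    """Combine a runtime parent directory with stored relative cache paths."""
    # Model names such as "org/model" contain a slash; flatten it so the
    # name maps to a single directory level.
    parent = Path(cache_root) / model_name.replace("/", "_")
    return [parent / rel for rel in relative_paths]


# Relative names as they might be stored in corpus_cache (hypothetical values).
corpus_cache = ["corpus_part0.cache", "corpus_part1.cache"]
cache_files = resolve_cache_paths("/tmp/emb_cache", "my-org/my-encoder", corpus_cache)
```

Keeping only relative names in the config means the same config can be reused across embedding models, since the model-specific part of the path is decided at runtime.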
-
query_cache: Union[PathLike, List[PathLike], None] = None
(Not directly used) A corresponding cache file name for each query_path entry, used to read/write the resulting embedding vectors. See the docstring for corpus_cache for details.
-
query_subset_path: Union[PathLike, List[PathLike], None] = None
One or multiple files from which a list of query IDs can be read. The available qrel triplets are limited to these queries. See file_reader.load_qids() for the supported files.
-
min_score: Union[int, float, None] = None
If provided, filter the qrel triplets and only keep ones with min_score <= score (the endpoint is included in the interval).
-
max_score: Union[int, float, None] = None
If provided, filter the qrel triplets and only keep ones with score < max_score (the endpoint is NOT included in the interval).
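Taken together, min_score and max_score describe the half-open interval min_score <= score < max_score. The sketch below only illustrates that documented condition on plain dicts; within_score_range is a hypothetical helper, not a function of the library, and it assumes an unset bound is simply skipped.

```python
from typing import Any, Dict, Optional, Union

Number = Union[int, float]


def within_score_range(
    rec: Dict[str, Any],
    min_score: Optional[Number] = None,
    max_score: Optional[Number] = None,
) -> bool:
    """Return True if min_score <= score < max_score; unset bounds are skipped."""
    score = rec["score"]
    if min_score is not None and score < min_score:
        return False  # below the inclusive lower bound
    if max_score is not None and score >= max_score:
        return False  # at or above the exclusive upper bound
    return True


triplets = [
    {"qid": "q1", "docid": "d1", "score": 0},
    {"qid": "q1", "docid": "d2", "score": 1},
    {"qid": "q1", "docid": "d3", "score": 2},
]
# Only the triplet with score 1 satisfies 1 <= score < 2.
kept = [rec for rec in triplets if within_score_range(rec, min_score=1, max_score=2)]
```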
-
filter_fn: Optional[Callable[[Dict[str, Any]], bool]] = None
A callable used for filtering qrel triplets. If provided, min_score and max_score are ignored. filter_fn should take a dict (content of the qrel triplet with qid, docid, and score keys) as input and return a boolean as output. It is used like datasets.Dataset.filter(filter_fn, ...), i.e., the record is kept if filter_fn returns True.
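As an illustration, a callable of the shape described above might look like the following sketch. The function name, the blocklist, and the sample records are hypothetical; the list comprehension only mimics the "keep the record if the callable returns True" behavior and is not how the library applies the filter internally.

```python
from typing import Any, Dict

# Hypothetical set of document IDs to exclude.
BLOCKED_DOCIDS = {"d3"}


def keep_relevant_unblocked(rec: Dict[str, Any]) -> bool:
    """Keep triplets with a score of at least 1 whose document is not blocked."""
    return rec["score"] >= 1 and rec["docid"] not in BLOCKED_DOCIDS


records = [
    {"qid": "q1", "docid": "d1", "score": 0},
    {"qid": "q1", "docid": "d2", "score": 1},
    {"qid": "q1", "docid": "d3", "score": 2},
]
kept = [rec for rec in records if keep_relevant_unblocked(rec)]
```

Such a function would be passed as filter_fn=keep_relevant_unblocked. Because filter_fn overrides min_score and max_score, any score bounds you still want have to be checked inside the callable itself.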
-
group_top_k: Optional[int] = None
If given, filter the available documents for each query and only choose the group_top_k documents with the highest score for each query.
-
group_bottom_k: Optional[int] = None
If given, filter the available documents for each query and only choose the group_bottom_k documents with the lowest score for each query.
-
group_first_k: Optional[int] = None
If given, filter the available documents for each query and only keep the first group_first_k documents (in their original ordering) for each query.
-
group_random_k: Optional[int] = None
If given, filter the available documents for each query and choose group_random_k documents randomly for each query. All documents are returned if the number of available documents for a query is smaller than group_random_k.
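The four group_*_k options differ only in how k documents are picked from one query's list. The sketch below shows one plausible reading of each on plain dicts. The helper names are hypothetical, and the tie-breaking and output ordering shown here are assumptions rather than documented behavior of the library.

```python
import random
from typing import Any, Dict, List

Rec = Dict[str, Any]


def top_k(recs: List[Rec], k: int) -> List[Rec]:
    """The k records with the highest score."""
    return sorted(recs, key=lambda r: r["score"], reverse=True)[:k]


def bottom_k(recs: List[Rec], k: int) -> List[Rec]:
    """The k records with the lowest score."""
    return sorted(recs, key=lambda r: r["score"])[:k]


def first_k(recs: List[Rec], k: int) -> List[Rec]:
    """The first k records in their original order."""
    return recs[:k]


def random_k(recs: List[Rec], k: int, rng: random.Random) -> List[Rec]:
    """k records sampled without replacement, or all of them if too few exist."""
    if len(recs) <= k:
        return list(recs)
    return rng.sample(recs, k)


recs = [
    {"qid": "q1", "docid": "d1", "score": 0.2},
    {"qid": "q1", "docid": "d2", "score": 0.9},
    {"qid": "q1", "docid": "d3", "score": 0.5},
    {"qid": "q1", "docid": "d4", "score": 0.1},
]
highest = top_k(recs, 2)
lowest = bottom_k(recs, 2)
leading = first_k(recs, 2)
sampled = random_k(recs, 2, random.Random(0))
```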
-
group_filter_fn: Optional[Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]]] = None
A callable used to filter the qrel triplets for each query. If given, it overrides group_first_k, group_top_k, group_bottom_k, and group_random_k.
There are several differences between group_filter_fn and filter_fn. filter_fn is used in the __init__ function to filter all the triplets for all queries and obtain the collection of available qrel triplets, whereas group_filter_fn is called whenever you request the list of available triplets for some query (i.e., every time you call methods like get_related_recs_for_*). Unlike filter_fn, the results of group_filter_fn are not cached. filter_fn operates on individual qrel triplets, whereas group_filter_fn operates on the list of all available qrel triplets for some query.
group_filter_fn must be a callable that takes one positional argument. The argument is a list of dict objects, where each dict is a qrel triplet for the query and contains the keys qid, docid, score, and potentially other keys. The input list contains all the available qrel triplets for this query (the list could be empty). The callable should return an output with the same format as its input (i.e., a list of dicts). The behavior of this class is undefined if the callable receives a non-empty list but returns an empty list. If given, this callable is called before score_transform.
This argument is useful for filtering documents based on other documents available for each query. For example, to only keep the N most similar items for each query.
-
score_transform: Union[str, int, float, Callable[[Dict[str, Any]], Union[int, float]], None] = None
A transformation applied to scores at the very last step right before returning them. Acceptable types for score_transform are:
  - None: return the scores as is.
  - callable: it should take a dict (content of the qrel triplet with qid, docid, and score keys) as input and return the transformed score as output. It is used like new_score = score_transform(rec).
  - Union[int, float]: this value is used as the score for all qrel triplets, i.e., the score is a constant for all query-document pairs.
  - str: a predefined behavior. At the moment, floor and ceil are valid behaviors. floor and ceil return int(triplet['score']) and math.ceil(triplet['score']), respectively.
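The accepted types can be read as a small dispatch, sketched below. apply_score_transform is a hypothetical helper written only to mirror the description above; it is not the library's implementation, and the error raised for an unknown string is an assumption. Note that int() truncates toward zero, so for negative scores the floor behavior as documented differs from math.floor.

```python
import math
from typing import Any, Callable, Dict, Union

Number = Union[int, float]
ScoreTransform = Union[str, int, float, Callable[[Dict[str, Any]], Number], None]


def apply_score_transform(rec: Dict[str, Any], score_transform: ScoreTransform) -> Number:
    """Return the score of one qrel triplet after applying score_transform."""
    if score_transform is None:
        return rec["score"]  # scores are returned as is
    if callable(score_transform):
        return score_transform(rec)  # new_score = score_transform(rec)
    if isinstance(score_transform, str):
        if score_transform == "floor":
            return int(rec["score"])
        if score_transform == "ceil":
            return math.ceil(rec["score"])
        # Assumption: unknown names are rejected.
        raise ValueError(f"Unknown score_transform: {score_transform!r}")
    return score_transform  # a constant used for every triplet


rec = {"qid": "q1", "docid": "d1", "score": 2.5}
unchanged = apply_score_transform(rec, None)
floored = apply_score_transform(rec, "floor")
ceiled = apply_score_transform(rec, "ceil")
constant = apply_score_transform(rec, 1)
shifted = apply_score_transform(rec, lambda r: r["score"] + 1)
```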
- ensure_list_of_correct_dtype()
Ensure everything of type List[str].
- to_dict()
Return a json serializable view of the class attributes.
- Return type: Dict
-
qrel_path: