MaterializedQRelConfig

class trove.containers.materialized_qrel_config.MaterializedQRelConfig(qrel_path=None, corpus_path=None, query_path=None, corpus_cache=None, query_cache=None, query_subset_path=None, min_score=None, max_score=None, filter_fn=None, group_top_k=None, group_bottom_k=None, group_first_k=None, group_random_k=None, group_filter_fn=None, score_transform=None)

Information about a collection of queries, documents, and (optionally) the relation between them.

You can use both paths to local files and remote HF hub fsspec URIs for qrel_path, corpus_path, query_path, and query_subset_path. HF hub URIs must start with hf://. See the HF hub documentation for the exact structure.

If you leave qrel_path unset or set it to an empty list, MaterializedQRel acts as a namespace container without any information about the relation between queries and documents. See the trove.containers.materialized_qrel.MaterializedQRel docstring for details.

qrel_path: Union[PathLike, List[PathLike], None] = None

One or multiple files that contain triplets of ('qid', 'docid', 'score'). The files do not need to contain such triplets explicitly; they can also be inferred from other types of data. See file_reader.load_qrel() for the supported files.

corpus_path: Union[PathLike, List[PathLike], None] = None

One or multiple files that contain the passage text and optionally titles.

query_path: Union[PathLike, List[PathLike], None] = None

One or multiple files that contain the query texts.

corpus_cache: Union[PathLike, List[PathLike], None] = None

(Not directly used) A corresponding cache file name for each entry of corpus_path, used to read/write the resulting embedding vectors. This value is not used directly; it is only saved as part of MaterializedQRel.args so that you can use it later. For example, you can store a unique relative filepath for each of the corpus files. Then, at runtime, compute a parent directory (e.g., based on the embedding model name) and combine it with the relative filepath to get the complete path to the cache files.
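The relative-path pattern described above could look like the following minimal sketch. The helper resolve_cache_path, the cache root, the model name, and the file name are all hypothetical and only illustrate the idea; none of them are part of the library.

```python
from pathlib import Path


def resolve_cache_path(cache_root: str, model_name: str, relative_cache: str) -> Path:
    """Join a parent directory computed at runtime with a stored relative cache path.

    Hypothetical helper: the config only stores ``relative_cache``.
    """
    # Model names often contain "/" (e.g. "org/model"); flatten them so the
    # name maps to a single directory level.
    safe_model_name = model_name.replace("/", "--")
    return Path(cache_root) / safe_model_name / relative_cache


# A value one might store as corpus_cache (hypothetical file name).
corpus_cache = "corpus_part0.cache"

# At runtime, the parent directory depends on the embedding model in use.
full_path = resolve_cache_path("/tmp/emb_cache", "org/encoder-v1", corpus_cache)
print(full_path)
```

Because the stored value stays relative, the same config can be reused with different embedding models without cache files colliding.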

query_cache: Union[PathLike, List[PathLike], None] = None

(Not directly used) A corresponding cache file name for each entry of query_path, used to read/write the resulting embedding vectors. See the docstring for corpus_cache for details.

query_subset_path: Union[PathLike, List[PathLike], None] = None

One or multiple files from which a list of query IDs can be read. The available qrel triplets are limited to these queries. See file_reader.load_qids() for the supported files.

min_score: Union[int, float, None] = None

If provided, filter the qrel triplets and only keep the ones with min_score <= score (the endpoint is included in the interval).

max_score: Union[int, float, None] = None

If provided, filter the qrel triplets and only keep the ones with score < max_score (the endpoint is NOT included in the interval).
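Taken together, min_score and max_score describe the half-open interval [min_score, max_score). The sketch below mimics that rule in plain Python; filter_by_score is a hypothetical stand-in for illustration, not the library's implementation.

```python
from typing import Any, Dict, List, Optional, Union

Number = Union[int, float]


def filter_by_score(
    triplets: List[Dict[str, Any]],
    min_score: Optional[Number] = None,
    max_score: Optional[Number] = None,
) -> List[Dict[str, Any]]:
    """Keep triplets with min_score <= score < max_score (each bound is optional)."""
    kept = []
    for rec in triplets:
        # Lower bound is inclusive.
        if min_score is not None and rec["score"] < min_score:
            continue
        # Upper bound is exclusive.
        if max_score is not None and rec["score"] >= max_score:
            continue
        kept.append(rec)
    return kept


triplets = [
    {"qid": "q1", "docid": "d1", "score": 0},
    {"qid": "q1", "docid": "d2", "score": 1},
    {"qid": "q1", "docid": "d3", "score": 2},
    {"qid": "q1", "docid": "d4", "score": 3},
]
kept = filter_by_score(triplets, min_score=1, max_score=3)
kept_docids = [rec["docid"] for rec in kept]
print(kept_docids)  # ['d2', 'd3']
```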

filter_fn: Optional[Callable[[Dict[str, Any]], bool]] = None

A callable used for filtering qrel triplets. If provided, min_score and max_score are ignored. filter_fn should take a dict (the content of the qrel triplet, with qid, docid, and score keys) as input and return a boolean as output. It is used like datasets.Dataset.filter(filter_fn, ...). That is, a record is kept if filter_fn returns True.
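As an illustration, a filter_fn might combine a score threshold with a check that score bounds alone cannot express. The blocklist below is hypothetical, and the list comprehension only stands in for how the predicate is applied to each record.

```python
from typing import Any, Dict

# Hypothetical set of document IDs to exclude; not part of the library.
BLOCKED_DOCIDS = {"d3"}


def filter_fn(rec: Dict[str, Any]) -> bool:
    """Return True to keep a triplet: it must be relevant and not blocked."""
    return rec["score"] >= 1 and rec["docid"] not in BLOCKED_DOCIDS


triplets = [
    {"qid": "q1", "docid": "d1", "score": 0},
    {"qid": "q1", "docid": "d2", "score": 2},
    {"qid": "q2", "docid": "d3", "score": 3},
]

# Stand-in for the filtering step: keep records where the predicate is True.
kept = [rec for rec in triplets if filter_fn(rec)]
print([rec["docid"] for rec in kept])  # ['d2']
```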

group_top_k: Optional[int] = None

If given, filter the available documents for each query and only choose the group_top_k documents with the highest score for each query.

group_bottom_k: Optional[int] = None

If given, filter the available documents for each query and only choose the group_bottom_k documents with the lowest score for each query.

group_first_k: Optional[int] = None

If given, filter the available documents for each query and only keep the first group_first_k documents (in their original ordering) for each query.

group_random_k: Optional[int] = None

If given, filter the available documents for each query and choose group_random_k documents randomly for each query. All documents are returned if the number of available documents for a query is smaller than group_random_k.
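The four group_* selection rules can be sketched in plain Python as follows. These helper functions are hypothetical illustrations of the described behavior, not the library's code; in particular, how the library breaks ties between equal scores and how it seeds its random selection are not specified here.

```python
import random
from typing import Any, Dict, List

Rec = Dict[str, Any]


def top_k(recs: List[Rec], k: int) -> List[Rec]:
    """The k records with the highest score."""
    return sorted(recs, key=lambda r: r["score"], reverse=True)[:k]


def bottom_k(recs: List[Rec], k: int) -> List[Rec]:
    """The k records with the lowest score."""
    return sorted(recs, key=lambda r: r["score"])[:k]


def first_k(recs: List[Rec], k: int) -> List[Rec]:
    """The first k records in their original order."""
    return recs[:k]


def random_k(recs: List[Rec], k: int, rng: random.Random) -> List[Rec]:
    """k records chosen at random, or all of them if fewer than k exist."""
    if len(recs) <= k:
        return list(recs)
    return rng.sample(recs, k)


# All triplets available for a single query.
recs = [
    {"qid": "q1", "docid": "d1", "score": 0.2},
    {"qid": "q1", "docid": "d2", "score": 0.9},
    {"qid": "q1", "docid": "d3", "score": 0.5},
    {"qid": "q1", "docid": "d4", "score": 0.1},
]

print([r["docid"] for r in top_k(recs, 2)])     # ['d2', 'd3']
print([r["docid"] for r in bottom_k(recs, 2)])  # ['d4', 'd1']
print([r["docid"] for r in first_k(recs, 2)])   # ['d1', 'd2']
print(len(random_k(recs, 10, random.Random(0))))  # 4
```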

group_filter_fn: Optional[Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]]] = None

A callable used to filter the qrel triplets for each query. If given, it overrides group_first_k, group_top_k, group_bottom_k, and group_random_k. There are several differences between group_filter_fn and filter_fn.

  • filter_fn is applied once, in the __init__ function, to filter all the triplets for all queries and build the collection of available qrel triplets. In contrast, group_filter_fn is called whenever you attempt to get the list of available triplets for some query (i.e., every time you call methods like get_related_recs_for_*). Unlike filter_fn, the results of group_filter_fn are not cached.

  • filter_fn operates on individual qrel triplets, whereas group_filter_fn operates on the list of all available qrel triplets for some query.

group_filter_fn must be a callable that takes one positional argument: a list of dict objects, each of which is a qrel triplet for the query. Each dict contains the keys qid, docid, and score, and potentially other keys. The input list contains all the available qrel triplets for this query (the list could be empty). The callable should return an output with the same format as its input (i.e., a list of dicts). The behavior of this class is undefined if the callable receives a non-empty list but returns an empty list. If given, this callable is called before score_transform. This argument is useful for filtering documents based on the other documents available for each query, for example, to keep only the N most similar items for each query.
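A minimal sketch of such a callable is shown below. It keeps only the documents whose score is within a margin of the best score for the same query, a decision that depends on the other documents in the group. The factory make_margin_filter and the margin value are hypothetical and not part of the library.

```python
from typing import Any, Callable, Dict, List

Rec = Dict[str, Any]


def make_margin_filter(margin: float) -> Callable[[List[Rec]], List[Rec]]:
    """Build a group filter that keeps records scoring within `margin` of the best one."""

    def group_filter_fn(recs: List[Rec]) -> List[Rec]:
        # The input may be empty; return it unchanged in that case.
        if not recs:
            return []
        best = max(r["score"] for r in recs)
        # The best record always passes, so a non-empty input never
        # produces an empty output.
        return [r for r in recs if r["score"] >= best - margin]

    return group_filter_fn


group_filter_fn = make_margin_filter(margin=0.1)

recs = [
    {"qid": "q1", "docid": "d1", "score": 0.9},
    {"qid": "q1", "docid": "d2", "score": 0.85},
    {"qid": "q1", "docid": "d3", "score": 0.4},
]
print([r["docid"] for r in group_filter_fn(recs)])  # ['d1', 'd2']
```

Since the results of group_filter_fn are not cached, keeping the callable cheap matters when the same query is requested many times.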

score_transform: Union[str, int, float, Callable[[Dict[str, Any]], Union[int, float]], None] = None

A transformation applied to scores at the very last step right before returning them. Acceptable types for score_transform are:

  • None : return the scores as is

  • callable : It should take a dict (the content of the qrel triplet, with qid, docid, and score keys) as input and return the transformed score as output. It is used like new_score = score_transform(rec)

  • Union[int, float] : This value is used as the score for all qrel triplets. I.e., score is a constant for all query-document pairs

  • str : A predefined behavior. At the moment floor and ceil are valid behaviors. floor and ceil return int(triplet['score']) and math.ceil(triplet['score']), respectively
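The four accepted forms can be summarized with a small dispatch function. This is only a sketch of the documented behavior; apply_score_transform is a hypothetical name and the library's actual implementation may differ.

```python
import math
from typing import Any, Callable, Dict, Union

Number = Union[int, float]
Transform = Union[str, int, float, Callable[[Dict[str, Any]], Number], None]


def apply_score_transform(rec: Dict[str, Any], score_transform: Transform) -> Number:
    """Return the score of one qrel triplet after applying score_transform."""
    if score_transform is None:
        return rec["score"]  # scores are returned as is
    if callable(score_transform):
        return score_transform(rec)  # custom function of the whole triplet
    if isinstance(score_transform, (int, float)):
        return score_transform  # constant score for every pair
    if score_transform == "floor":
        # As documented, this is int(score). Note that int() truncates toward
        # zero, which matches a mathematical floor only for non-negative scores.
        return int(rec["score"])
    if score_transform == "ceil":
        return math.ceil(rec["score"])
    raise ValueError(f"Unsupported score_transform: {score_transform!r}")


rec = {"qid": "q1", "docid": "d1", "score": 2.4}
print(apply_score_transform(rec, None))     # 2.4
print(apply_score_transform(rec, 1))        # 1
print(apply_score_transform(rec, "floor"))  # 2
print(apply_score_transform(rec, "ceil"))   # 3
print(apply_score_transform(rec, lambda r: r["score"] * 2))  # 4.8
```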

ensure_list_of_correct_dtype()

Ensure everything is of type List[str].

to_dict()

Return a JSON-serializable view of the class attributes.

Return type:

Dict