file_reader_functions

Loader functions to read various file formats.

trove.data.file_reader_functions.qrel_from_grouped_triplets(filepath, num_proc=None)

Load grouped qrel triplets from a JSONL file.

Each line should have three fields:

{
    'qid': '...',
    'docid': ['docid1', 'docid2', ...],
    'score': [score1, score2, ...]
}

Parameters:

filepath (os.PathLike) – Path to a JSONL file with grouped qrel triplets
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data.

Return type:

Dataset

Returns:

huggingface datasets.Dataset with 'qid', 'docid', and 'score' columns of str, List[str], and List[float] dtype, respectively.
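As an illustration of the expected input, the following minimal sketch writes and re-reads a grouped qrel file using only the standard library. The sample records, the file name, and the helpers `write_grouped_qrels` and `read_grouped_qrels` are hypothetical and not part of trove. Note that on disk each line must be valid JSON, which means double-quoted strings.

```python
import json
import os
import tempfile

# Hypothetical sample data: one record per query, with parallel docid/score lists.
records = [
    {"qid": "q1", "docid": ["d1", "d7"], "score": [1.0, 0.0]},
    {"qid": "q2", "docid": ["d3"], "score": [2.0]},
]


def write_grouped_qrels(path, rows):
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # 'docid' and 'score' are parallel lists, so their lengths must match.
            if len(row["docid"]) != len(row["score"]):
                raise ValueError(f"length mismatch for qid {row['qid']!r}")
            f.write(json.dumps(row) + "\n")


def read_grouped_qrels(path):
    """Read the file back as a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


grouped_path = os.path.join(tempfile.mkdtemp(), "qrels_grouped.jsonl")
write_grouped_qrels(grouped_path, records)
loaded = read_grouped_qrels(grouped_path)

# With trove installed, the same file could then be passed to the loader:
# from trove.data.file_reader_functions import qrel_from_grouped_triplets
# ds = qrel_from_grouped_triplets(grouped_path)
```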

trove.data.file_reader_functions.qrel_from_csv(filepath, num_proc=None)

Load a qrel file from a delimiter-separated file.

It supports only files with exactly three columns: qid, docid, and score. If the header row is missing, the columns are assumed to be in the following order: ['qid', 'docid', 'score'].

Parameters:

filepath (os.PathLike) – path to CSV/TSV file
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data.

Return type:

Dataset

Returns:

huggingface datasets.Dataset with qid, docid, and score columns of str, List[str], and List[float] dtype, respectively.
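The header rule and the grouped return shape described above can be sketched with the standard library as follows. This is an illustration only, not trove's implementation: the header check (first row contains exactly the three column names), the explicit `delimiter` argument, and the helper names `read_flat_qrels` and `group_by_qid` are all assumptions.

```python
import csv
import io
from collections import defaultdict

COLUMNS = ["qid", "docid", "score"]


def read_flat_qrels(text, delimiter="\t"):
    """Parse delimited text into (qid, docid, score) tuples."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    # Assumption: a header row is one that contains exactly the three column names.
    if rows and set(rows[0]) == set(COLUMNS):
        header, rows = rows[0], rows[1:]
    else:
        # No header: fall back to the documented default column order.
        header = COLUMNS
    idx = {name: header.index(name) for name in COLUMNS}
    return [(r[idx["qid"]], r[idx["docid"]], float(r[idx["score"]])) for r in rows]


def group_by_qid(triplets):
    """Collect flat triplets into one record per qid with parallel lists."""
    grouped = defaultdict(lambda: {"docid": [], "score": []})
    for qid, docid, score in triplets:
        grouped[qid]["docid"].append(docid)
        grouped[qid]["score"].append(score)
    return [{"qid": qid, **cols} for qid, cols in grouped.items()]


# Hypothetical inputs: one with a (reordered) header row, one without a header.
with_header = "docid\tqid\tscore\nd1\tq1\t1\nd2\tq1\t0\n"
without_header = "q1\td1\t1\nq2\td3\t2\n"

grouped_a = group_by_qid(read_flat_qrels(with_header))
grouped_b = group_by_qid(read_flat_qrels(without_header))
```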

trove.data.file_reader_functions.qrel_from_pickle(filepath, num_proc=None)

Load qrel triplets from pickle files.

The file is expected to contain a single object of type dict, where object[qid][docid] is the score for query qid and document docid. Both qid and docid should be of type str.

Parameters:

filepath (os.PathLike) – pickle file to load.
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data.

Return type:

Dataset

Returns:

huggingface datasets.Dataset with qid, docid, and score columns of str, List[str], and List[float] dtype, respectively.
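A minimal sketch of producing a pickle file with the nested-dict layout described above. The sample IDs, scores, file name, and the `has_str_keys` helper are hypothetical; only the `object[qid][docid] -> score` structure comes from the description.

```python
import os
import pickle
import tempfile

# Hypothetical qrels: qrels[qid][docid] is the score for that (query, document) pair.
qrels = {
    "q1": {"d1": 1, "d7": 0},
    "q2": {"d3": 2},
}


def has_str_keys(obj):
    """Check that every qid and docid key in the nested dict is a str."""
    return all(
        isinstance(qid, str) and all(isinstance(docid, str) for docid in docs)
        for qid, docs in obj.items()
    )


pkl_path = os.path.join(tempfile.mkdtemp(), "qrels.pkl")
with open(pkl_path, "wb") as f:
    pickle.dump(qrels, f)

# Read it back to confirm the file holds a single dict object.
with open(pkl_path, "rb") as f:
    restored = pickle.load(f)

# With trove installed, the file could then be passed to the loader:
# from trove.data.file_reader_functions import qrel_from_pickle
# ds = qrel_from_pickle(pkl_path)
```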

trove.data.file_reader_functions.qrel_from_tevatron_training_data(filepath, num_proc=None)

Convert a tevatron training file to qrel triplets.

The data is expected to be a JSONL file with the structure that tevatron uses for training files. Each record (i.e., line) should be:

{
    'query_id': 'target query id',
    'query': 'text of the query',
    'positive_passages': [{'docid': 'id of pos doc', 'title': 'title of pos doc', 'text': 'text of pos doc'}, ...],
    'negative_passages': [{'docid': 'id of neg doc', 'title': 'title of neg doc', 'text': 'text of neg doc'}, ...]
}

When creating (qid, docid, score) triplets, we give a score of 0 to all negative passages and a score of 1 to all positive passages.

Parameters:

filepath (os.PathLike) – path to JSONL file in tevatron training format.
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data. Passed to datasets.Dataset.map.

Return type:

Dataset

Returns:

huggingface datasets.Dataset with qid, docid, and score columns of str, List[str], and List[float] dtype, respectively.
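The scoring rule for tevatron records (1 for positives, 0 for negatives) can be sketched for a single record as follows. The helper name `tevatron_record_to_qrel`, the sample record, and the choice to list positives before negatives are assumptions made for illustration, not trove's actual implementation.

```python
import json


def tevatron_record_to_qrel(record):
    """Turn one tevatron training record into a grouped qrel record."""
    pos = [p["docid"] for p in record["positive_passages"]]
    neg = [p["docid"] for p in record["negative_passages"]]
    return {
        "qid": record["query_id"],
        # Assumption: positives are listed first, then negatives.
        "docid": pos + neg,
        # Positives get a score of 1, negatives a score of 0.
        "score": [1.0] * len(pos) + [0.0] * len(neg),
    }


# Hypothetical JSONL line in the tevatron training format.
line = json.dumps({
    "query_id": "q1",
    "query": "what is dense retrieval",
    "positive_passages": [{"docid": "d1", "title": "t1", "text": "..."}],
    "negative_passages": [
        {"docid": "d2", "title": "t2", "text": "..."},
        {"docid": "d3", "title": "t3", "text": "..."},
    ],
})

qrel = tevatron_record_to_qrel(json.loads(line))
```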

trove.data.file_reader_functions.qrel_from_sydir_corpus(filepath, num_proc=None)

Infer qrel triplets from sydir docids.

sydir docids are formatted as f"{qid}_l_{level}_d_{doc_idx}". We parse this and use level as the score field in the qrel triplets.

Parameters:

filepath (os.PathLike) – sydir corpus file
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data.

Return type:

Dataset

Returns:

huggingface datasets.Dataset with qid, docid, and score columns of str, List[str], and List[float] dtype, respectively.
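Parsing the sydir docid format can be sketched with a regular expression. This is a hedged illustration: it assumes that level and doc_idx are non-negative integers and that a qid may itself contain underscores. The helper name `parse_sydir_docid` and the sample docids are hypothetical.

```python
import re

# Mirrors f"{qid}_l_{level}_d_{doc_idx}"; the greedy qid group lets the qid
# contain underscores while the suffix is anchored at the end of the string.
SYDIR_DOCID = re.compile(r"^(?P<qid>.+)_l_(?P<level>\d+)_d_(?P<doc_idx>\d+)$")


def parse_sydir_docid(docid):
    """Return a (qid, docid, score) triplet, using level as the score."""
    match = SYDIR_DOCID.match(docid)
    if match is None:
        raise ValueError(f"not a sydir docid: {docid!r}")
    return {
        "qid": match.group("qid"),
        "docid": docid,
        "score": float(match.group("level")),
    }


triplet = parse_sydir_docid("q12_l_3_d_0")
triplet_with_underscore = parse_sydir_docid("a_b_l_2_d_5")
```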

trove.data.file_reader_functions.qids_from_queries_jsonl(filepath, num_proc=None)

Load qids from the original queries.jsonl files.

It expects an _id field in each record of the JSONL file.

Parameters:

filepath (os.PathLike) – queries.jsonl file to read.
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data.

Return type:

List[str]

Returns:

A list of query IDs
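Reading the _id field from each record can be sketched with the standard library as follows. The helper name `read_qids`, the sample records, the extra text field, and the cast to str are assumptions made for illustration.

```python
import io
import json


def read_qids(lines):
    """Collect the '_id' field from each non-empty JSONL line."""
    # Assumption: IDs are cast to str so the result is always a list of strings.
    return [str(json.loads(line)["_id"]) for line in lines if line.strip()]


# Hypothetical queries.jsonl content; only the '_id' field is required here.
sample = io.StringIO(
    '{"_id": "q1", "text": "first query"}\n'
    '{"_id": "q2", "text": "second query"}\n'
)
qids = read_qids(sample)
```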