file_reader_functions
Loader functions to read various file formats.
- trove.data.file_reader_functions.qrel_from_grouped_triplets(filepath, num_proc=None)
Load grouped qrel triplets from a JSONL file.
Each line should have three fields:
{ 'qid': '...', 'docid': ['docid1', 'docid2', ...], 'score': [score1, score2, ...] }
- Parameters:
filepath (os.PathLike) – Path to a JSONL file with grouped qrel triplets
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data.
- Return type:
Dataset
- Returns:
huggingface datasets.Dataset with 'qid', 'docid', and 'score' columns of str, List[str], and List[float] dtype, respectively.
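The grouped layout described above can be produced from flat (qid, docid, score) triplets. The following is a minimal sketch using only the standard library, not trove's own code; `group_triplets` and `to_jsonl` are hypothetical helper names introduced for illustration. Note that real JSONL lines use double-quoted strings, as emitted by `json.dumps`.

```python
import json
from collections import defaultdict


def group_triplets(triplets):
    """Group flat (qid, docid, score) triplets into one record per qid."""
    grouped = defaultdict(lambda: {"docid": [], "score": []})
    for qid, docid, score in triplets:
        grouped[qid]["docid"].append(docid)
        grouped[qid]["score"].append(float(score))
    # One record per query, in first-seen order of the qids.
    return [{"qid": qid, **fields} for qid, fields in grouped.items()]


def to_jsonl(records):
    """Serialize records as JSONL text: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


triplets = [("q1", "d1", 1), ("q1", "d2", 0), ("q2", "d3", 2)]
records = group_triplets(triplets)
jsonl_text = to_jsonl(records)
```

Writing `jsonl_text` to a file should give input in the shape that qrel_from_grouped_triplets describes, assuming the loader accepts standard JSON encoding.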
- trove.data.file_reader_functions.qrel_from_csv(filepath, num_proc=None)
Load qrels from a delimiter-separated file.
It only supports files with exactly three columns: qid, docid, score.
If the headers are missing, the columns are assumed to be in the following order: ['qid', 'docid', 'score'].
- Parameters:
filepath (os.PathLike) – path to CSV/TSV file
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data.
- Return type:
Dataset
- Returns:
huggingface datasets.Dataset with qid, docid, and score columns of str, List[str], and List[float] dtype, respectively.
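The header rule above (use the header row if present, otherwise assume the qid, docid, score order) can be illustrated with the standard `csv` module. This is only a sketch of one plausible way to apply that rule, not the detection logic trove actually uses; `read_qrel_rows` and `EXPECTED` are hypothetical names.

```python
import csv
import io

EXPECTED = ["qid", "docid", "score"]


def read_qrel_rows(text, delimiter="\t"):
    """Parse delimiter-separated qrels into (qid, docid, score) tuples."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if rows and set(rows[0]) == set(EXPECTED):
        # First row names the three columns (in any order): treat it as a header.
        header, rows = rows[0], rows[1:]
    else:
        # No header row: fall back to the default column order.
        header = EXPECTED
    triplets = []
    for row in rows:
        if not row:  # skip blank lines
            continue
        record = dict(zip(header, row))
        triplets.append((record["qid"], record["docid"], float(record["score"])))
    return triplets


with_header = read_qrel_rows("docid,qid,score\nd1,q1,1\n", delimiter=",")
without_header = read_qrel_rows("q1\td1\t1\n")
```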
- trove.data.file_reader_functions.qrel_from_pickle(filepath, num_proc=None)
Load qrel triplets from pickle files.
The file is expected to contain a single object of type dict.
object[qid][docid] is the corresponding score for query qid and document docid. qid and docid should be of type str.
- Parameters:
filepath (os.PathLike) – pickle file to load.
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data.
- Return type:
Dataset
- Returns:
huggingface datasets.Dataset with qid, docid, and score columns of str, List[str], and List[float] dtype, respectively.
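As an illustration of the expected pickle contents, the sketch below builds a nested dict of the form object[qid][docid] = score, round-trips it through `pickle`, and flattens it into grouped records. It uses in-memory bytes rather than a file and is not trove's implementation; the variable names are made up for the example.

```python
import pickle

# Nested mapping: qrels[qid][docid] is the score; qid and docid are str.
qrels = {"q1": {"d1": 1, "d2": 0}, "q2": {"d3": 2}}

# In practice this would be pickle.dump(qrels, f) on a file opened in "wb" mode.
blob = pickle.dumps(qrels)
loaded = pickle.loads(blob)

# Flatten into one record per query with parallel docid and score lists.
grouped = [
    {
        "qid": qid,
        "docid": list(docs.keys()),
        "score": [float(score) for score in docs.values()],
    }
    for qid, docs in loaded.items()
]
```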
- trove.data.file_reader_functions.qrel_from_tevatron_training_data(filepath, num_proc=None)
Convert a tevatron training file to qrel triplets.
The data is expected to be a JSONL file with the structure that tevatron uses for training files. Each record (i.e., line) should be:
{ 'query_id': 'target query id', 'query': 'text of the query', 'positive_passages': [{'docid': 'id of pos doc', 'title': 'title of pos doc', 'text': 'text of pos doc'}, ...], 'negative_passages': [{'docid': 'id of neg doc', 'title': 'title of neg doc', 'text': 'text of neg doc'}, ...] }
When creating (qid, docid, score) triplets, we give a score of 0 to all negative passages and a score of 1 to all positive passages.
- Parameters:
filepath (os.PathLike) – path to JSONL file in tevatron training format.
num_proc (Optional[int]) – argument passed to datasets.Dataset.map
- Return type:
Dataset
- Returns:
huggingface datasets.Dataset with qid, docid, and score columns of str, List[str], and List[float] dtype, respectively.
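The scoring rule for tevatron records (1 for positive passages, 0 for negative ones) can be sketched for a single record as follows. `tevatron_record_to_triplets` is a hypothetical helper written for this example and only mirrors the rule stated above, not the library's actual code path.

```python
def tevatron_record_to_triplets(record):
    """Turn one tevatron training record into (qid, docid, score) triplets."""
    qid = record["query_id"]
    # Positive passages get score 1, negative passages get score 0.
    triplets = [(qid, p["docid"], 1) for p in record["positive_passages"]]
    triplets += [(qid, p["docid"], 0) for p in record["negative_passages"]]
    return triplets


record = {
    "query_id": "q1",
    "query": "what is dense retrieval",
    "positive_passages": [{"docid": "d1", "title": "t1", "text": "pos text"}],
    "negative_passages": [
        {"docid": "d2", "title": "t2", "text": "neg text"},
        {"docid": "d3", "title": "t3", "text": "neg text"},
    ],
}
triplets = tevatron_record_to_triplets(record)
```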
- trove.data.file_reader_functions.qrel_from_sydir_corpus(filepath, num_proc=None)
Infer qrel triplets from sydir docids.
sydir docids are formatted as f"{qid}_l_{level}_d_{doc_idx}". We parse this and use level as the score field in the qrel triplets.
- Parameters:
filepath (os.PathLike) – sydir corpus file
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data.
- Return type:
Dataset
- Returns:
huggingface datasets.Dataset with qid, docid, and score columns of str, List[str], and List[float] dtype, respectively.
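Parsing the docid pattern f"{qid}_l_{level}_d_{doc_idx}" can be sketched with a regular expression. The example assumes that level and doc_idx are non-negative integers, which the text above does not state; `parse_sydir_docid` is a hypothetical name, and trove may parse docids differently.

```python
import re

# Assumption: level and doc_idx are written as plain non-negative integers.
# The greedy qid group lets the qid itself contain underscores.
SYDIR_DOCID = re.compile(r"^(?P<qid>.+)_l_(?P<level>\d+)_d_(?P<doc_idx>\d+)$")


def parse_sydir_docid(docid):
    """Split a sydir docid into (qid, level, doc_idx)."""
    match = SYDIR_DOCID.match(docid)
    if match is None:
        raise ValueError(f"not a sydir docid: {docid!r}")
    return match["qid"], int(match["level"]), int(match["doc_idx"])


qid, level, doc_idx = parse_sydir_docid("q12_l_3_d_7")
# The parsed level serves as the score of the (qid, docid, score) triplet.
triplet = (qid, "q12_l_3_d_7", float(level))
```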
- trove.data.file_reader_functions.qids_from_queries_jsonl(filepath, num_proc=None)
Load qids from the original queries.jsonl files.
It expects an _id field in each record of the JSONL file.
- Parameters:
filepath (os.PathLike) – queries.jsonl file to read.
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data.
- Return type:
List[str]
- Returns:
A list of query IDs
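Collecting query IDs from JSONL text amounts to reading the _id field of every record. The sketch below does this on an in-memory string with the standard `json` module; `qids_from_jsonl_text` is a hypothetical stand-in, and the "text" field in the sample records is an assumption about what a queries.jsonl record might contain.

```python
import json


def qids_from_jsonl_text(text):
    """Return the _id of every non-empty JSONL line, in file order."""
    return [json.loads(line)["_id"] for line in text.splitlines() if line.strip()]


sample = '{"_id": "q1", "text": "first query"}\n{"_id": "q2", "text": "second query"}\n'
qids = qids_from_jsonl_text(sample)
```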