file_reader_functions
Loader functions to read various file formats.
- trove.data.file_reader_functions.qrel_from_grouped_triplets(filepath, num_proc=None)
Load grouped qrel triplets from JSONL file.
Each line should have three fields:
{ 'qid': '...', 'docid': ['docid1', 'docid2', ...], 'score': [score1, score2, ...] }
- Parameters:
filepath (os.PathLike) – Path to a JSONL file with grouped qrel triplets
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data.
- Return type:
Dataset
- Returns:
huggingface datasets.Dataset with 'qid', 'docid', and 'score' columns of str, List[str], and List[float] dtype, respectively.
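A minimal usage sketch for qrel_from_grouped_triplets; the file name and records below are hypothetical examples, not part of the library:

```python
# Write a small grouped-qrel JSONL file and load it (illustrative data only).
import json

from trove.data.file_reader_functions import qrel_from_grouped_triplets

records = [
    {"qid": "q1", "docid": ["d1", "d2"], "score": [1.0, 0.0]},
    {"qid": "q2", "docid": ["d3"], "score": [2.0]},
]
with open("qrels_grouped.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

ds = qrel_from_grouped_triplets("qrels_grouped.jsonl", num_proc=2)
print(ds.column_names)  # expected: ['qid', 'docid', 'score']
```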
- trove.data.file_reader_functions.qrel_from_csv(filepath, num_proc=None)
Load a qrel file from a delimiter-separated file.
It only supports files with exactly three columns: qid, docid, score.
If the headers are missing, the columns are assumed to be in the following order: ['qid', 'docid', 'score'].
- Parameters:
filepath (os.PathLike) – path to CSV/TSV file
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data.
- Return type:
Dataset
- Returns:
huggingface datasets.Dataset with qid, docid, and score columns of str, List[str], and List[float] dtype, respectively.
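A usage sketch for qrel_from_csv, assuming a small comma-separated file with the three supported columns (file name and rows are illustrative):

```python
# Build a tiny qrel CSV and load it with qrel_from_csv.
import csv

from trove.data.file_reader_functions import qrel_from_csv

with open("qrels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["qid", "docid", "score"])  # header; if omitted, this order is assumed
    writer.writerows([("q1", "d1", 1), ("q1", "d2", 0), ("q2", "d3", 2)])

ds = qrel_from_csv("qrels.csv")
```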
- trove.data.file_reader_functions.qrel_from_pickle(filepath, num_proc=None)
Load qrel triplets from pickle files.
The file is expected to contain a single object of type dict. object[qid][docid] is the corresponding score for query qid and document docid. qid and docid should be of type str.
- Parameters:
filepath (os.PathLike) – pickle file to load.
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data.
- Return type:
Dataset
- Returns:
huggingface datasets.Dataset with qid, docid, and score columns of str, List[str], and List[float] dtype, respectively.
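A usage sketch for qrel_from_pickle, assuming a nested dict of the form object[qid][docid] = score (file name and values are illustrative):

```python
# Pickle a nested qid -> docid -> score dict and load it.
import pickle

from trove.data.file_reader_functions import qrel_from_pickle

qrels = {"q1": {"d1": 1.0, "d2": 0.0}, "q2": {"d3": 2.0}}
with open("qrels.pkl", "wb") as f:
    pickle.dump(qrels, f)

ds = qrel_from_pickle("qrels.pkl")
```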
- trove.data.file_reader_functions.qrel_from_tevatron_training_data(filepath, num_proc=None)
Convert a tevatron training file to qrel triplets.
The data is expected to be a JSONL file with the structure that tevatron uses for training files. Each record (i.e., line) should be:
{ 'query_id': 'target query id', 'query': 'text of the query', 'positive_passages': [{'docid': 'id of pos doc', 'title': 'title of pos doc', 'text': 'text of pos doc'}, ..., ...], 'negative_passages': [{'docid': 'id of neg doc', 'title': 'title of neg doc', 'text': 'text of neg doc'}, ..., ...] }
When creating (qid, docid, score) triplets, we give a score of 0 to all negative passages and a score of 1 to all positive passages.
- Parameters:
filepath (os.PathLike) – path to JSONL file in tevatron training format.
num_proc (Optional[int]) – Arg to datasets.Dataset.map.
- Return type:
Dataset
- Returns:
huggingface datasets.Dataset with qid, docid, and score columns of str, List[str], and List[float] dtype, respectively.
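A usage sketch for qrel_from_tevatron_training_data with a single illustrative record (file name and contents are hypothetical):

```python
# One tevatron-style training record; the loader assigns score 1 to
# positive passages and 0 to negative passages when building triplets.
import json

from trove.data.file_reader_functions import qrel_from_tevatron_training_data

record = {
    "query_id": "q1",
    "query": "text of the query",
    "positive_passages": [{"docid": "d_pos", "title": "pos title", "text": "pos text"}],
    "negative_passages": [{"docid": "d_neg", "title": "neg title", "text": "neg text"}],
}
with open("tevatron_train.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")

ds = qrel_from_tevatron_training_data("tevatron_train.jsonl")
```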
- trove.data.file_reader_functions.qrel_from_sydir_corpus(filepath, num_proc=None)
Infer qrel triplets from sydir docids.
sydir docids are formatted as f"{qid}_l_{level}_d_{doc_idx}". We parse this and use level as the score field in the qrel triplets.
- Parameters:
filepath (os.PathLike) – sydir corpus file
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data.
- Return type:
Dataset
- Returns:
huggingface datasets.Dataset with qid, docid, and score columns of str, List[str], and List[float] dtype, respectively.
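The snippet below only illustrates the docid naming convention that qrel_from_sydir_corpus relies on; it is not the library's internal parsing code:

```python
# Parse a docid of the form f"{qid}_l_{level}_d_{doc_idx}"; `level` becomes the score.
docid = "q42_l_3_d_7"
qid, rest = docid.split("_l_", 1)
level, doc_idx = rest.split("_d_", 1)
print(qid, float(level), doc_idx)  # q42 3.0 7
```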
- trove.data.file_reader_functions.qids_from_queries_jsonl(filepath, num_proc=None)
Load qids from the original queries.jsonl files.
It expects an _id field in each record of the JSONL file.
- Parameters:
filepath (os.PathLike) – queries.jsonl file to read.
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data.
- Return type:
List[str]
- Returns:
A list of query IDs.
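A usage sketch for qids_from_queries_jsonl, assuming a JSONL file where each record carries an _id field (file name and records are illustrative):

```python
# Write a tiny queries.jsonl and read back the query IDs.
import json

from trove.data.file_reader_functions import qids_from_queries_jsonl

queries = [{"_id": "q1", "text": "first query"}, {"_id": "q2", "text": "second query"}]
with open("queries.jsonl", "w") as f:
    for q in queries:
        f.write(json.dumps(q) + "\n")

qids = qids_from_queries_jsonl("queries.jsonl")
print(qids)  # expected: ['q1', 'q2']
```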