file_reader
- exception trove.data.file_reader.FileLoaderNotFoundError
No loader found for the given file.
- trove.data.file_reader.available_loaders()
Reports what loader functions are available for each output type.
- Return type:
None
- trove.data.file_reader.register_loader(output)
A decorator to register functions that read the data from file.
Later we loop through the registered functions to find one that can load the desired data from a given file. All file reader functions should take two keyword arguments: 'filepath' and 'num_proc'. 'filepath' is the file that should be loaded. 'num_proc' is the number of processes that the loader can launch in parallel to load and preprocess the data.
Each reader function should first check whether it can load the given file. A loader function should return None if it cannot read the given file (e.g., if the reader can load CSV but the file is in JSON format). We go through all the readers until we find one that can load the file, so make sure the initial check for the ability to read the file is fast. In the worst case, every reader has to run this check before we find the correct loader.
The expected output for each loader is as follows:
'qrel': an instance of huggingface datasets.Dataset with 'qid', 'docid', and 'score' columns. 'qid' is of type str and represents the query id for the record. 'docid' is a list of str values (List[str]), where each item is the id of one related document for this query. 'score' is a list of int or float (List[Union[int, float]]), where datasets.Dataset[i]['score'][idx] is the similarity score between query 'qid' and document datasets.Dataset[i]['docid'][idx].
'qid': a list of query IDs of type str (List[str]). The returned query IDs MUST be unique, with no duplicates in the list.
'record': an instance of huggingface datasets.Dataset. It just needs to load the records in the given file as-is without any further processing. It is recommended to keep the records in the same order as they appear in the given file. There are no restrictions on what the columns or their data types are.
- Parameters:
output (str) – The data that the loader returns. Accepted values are 'qrel', 'qid', and 'record'.
- Return type:
Callable
- Returns:
A wrapper that registers the given loader function under the 'output' key.
- trove.data.file_reader.load_qrel(filepath, num_proc=None)
Load qrel triplets from file.
It loads grouped triplets of (qid, docid, score) from files. It loops through all registered qrel readers until it finds one that can load the given file. It returns a huggingface datasets.Dataset with 'qid', 'docid', and 'score' columns of str, List[str], and List[float] dtype, respectively. For each record, 'docid' is a list of document ids that are related to 'qid' and 'score' is the list of similarity scores between 'qid' and each document in 'docid'.
- Parameters:
filepath (os.PathLike) – file to read qrels from
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data.
- Return type:
Dataset
- Returns:
huggingface datasets.Dataset with 'qid', 'docid', and 'score' columns of str, List[str], and List[float] dtype, respectively.
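To make the grouped layout concrete, the sketch below turns flat (qid, docid, score) triplets into one row per query. It is an illustration only: `group_triplets` is a hypothetical helper, it produces plain Python dictionaries rather than a datasets.Dataset, and nothing here describes how the registered readers actually group their data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple, Union

Score = Union[int, float]


def group_triplets(triplets: Iterable[Tuple[str, str, Score]]) -> List[Dict]:
    """Group flat triplets into rows with 'qid', 'docid', and 'score' keys."""
    docids: Dict[str, List[str]] = defaultdict(list)
    scores: Dict[str, List[Score]] = defaultdict(list)
    for qid, docid, score in triplets:
        docids[qid].append(docid)
        scores[qid].append(score)
    # docids[q][i] and scores[q][i] always refer to the same document.
    return [{"qid": q, "docid": docids[q], "score": scores[q]} for q in docids]


rows = group_triplets([("q1", "d1", 1.0), ("q2", "d3", 0.0), ("q1", "d2", 0.5)])
```

Here `rows[0]` holds both documents for "q1", with its 'docid' and 'score' lists aligned index by index, which is the relationship the column description above relies on.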
- trove.data.file_reader.load_qids(filepath, num_proc=None)
Load a list of unique qids from file.
It loops through all registered qid readers until it finds one that can load the given file. If there is no dedicated qid reader that can load this file, it assumes it is a file with qrel triplets and returns the list of unique qids in qrel triplets.
- Parameters:
filepath (os.PathLike) – file to read qids from
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data.
- Return type:
List[str]
- Returns:
A list of unique query IDs (List[str])
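The fallback described for load_qids can be sketched as below. The function and its parameters (`load_qids_sketch`, `qid_readers`, `qrel_loader`) are hypothetical and exist only to show the control flow; the real function takes just 'filepath' and 'num_proc', and whether it preserves first-seen order is an assumption of this sketch.

```python
from typing import Callable, Iterable, List, Optional


def load_qids_sketch(
    filepath,
    qid_readers: Iterable[Callable],
    qrel_loader: Callable,
    num_proc: Optional[int] = None,
) -> List[str]:
    """Try dedicated qid readers first, then fall back to qrel triplets."""
    for reader in qid_readers:
        qids = reader(filepath=filepath, num_proc=num_proc)
        if qids is not None:
            return qids
    # No dedicated reader matched: treat the file as qrels and collect
    # each distinct 'qid' once, in first-seen order (assumed here).
    rows = qrel_loader(filepath=filepath, num_proc=num_proc)
    return list(dict.fromkeys(row["qid"] for row in rows))
```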
- trove.data.file_reader.load_records(filepath, num_proc=None)
Load records/rows from file.
It loops through all registered record readers until it finds one that can load the given file. It returns an instance of huggingface datasets.Dataset that contains the records in the file without any further processing or modification.
- Parameters:
filepath (os.PathLike) – file to read the records from.
num_proc (Optional[int]) – Max number of processes when reading and pre-processing the data.
- Return type:
Dataset
- Returns:
An instance of datasets.Dataset containing records from the given file.
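As an illustration of what a 'record' reader is expected to do, here is a minimal sketch for JSON Lines files. `load_jsonl_records` is a hypothetical name, and the sketch returns a plain list of dictionaries to stay dependency-free; a real record loader would return a datasets.Dataset instead.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


def load_jsonl_records(filepath, num_proc: Optional[int] = None) -> Optional[List[Dict]]:
    """Hypothetical record reader: one JSON object per line, loaded as-is."""
    # Fast rejection for files this reader does not handle.
    if Path(filepath).suffix != ".jsonl":
        return None
    with open(filepath, encoding="utf-8") as f:
        # Keep records in file order and do no further processing.
        return [json.loads(line) for line in f if line.strip()]
```

The 'num_proc' argument is accepted but unused here, since a single pass over a small file needs no parallelism.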