DataArguments

class trove.data.data_args.DataArguments(dataset_name=None, group_size=8, positive_passage_no_shuffle=False, negative_passage_no_shuffle=False, passage_selection_strategy='most_relevant', query_max_len=32, passage_max_len=128, pad_to_multiple_of=16)
dataset_name: Optional[str] = None

Name of the dataset. Only used if your query/passage formatting functions behave differently for different datasets.

group_size: int = 8

Number of passages used per query during training, or for approximate evaluation during training (i.e., used only with RetrievalTrainer and NOT with RetrievalEvaluator).

positive_passage_no_shuffle: bool = False

(Only for binary IR datasets) Always use the first positive passage for training.

negative_passage_no_shuffle: bool = False

(Only for binary IR datasets) Always use the first n negative passages for training.

passage_selection_strategy: str = 'most_relevant'

(Only for MultiLevelDataset) How to choose a subset of passages for each query. Valid options are None, 'random', 'least_relevant', and 'most_relevant'.
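The strategy names suggest ranking a query's candidate passages by relevance level and keeping a group_size-sized subset. A minimal sketch of what such a selection could look like (the `select_passages` helper and its signature are hypothetical illustrations, not part of Trove's API):

```python
import random

def select_passages(passages, group_size, strategy="most_relevant"):
    """Hypothetical sketch: pick `group_size` passages for one query.

    `passages` is a list of (passage, relevance_level) pairs, where a
    higher level means more relevant. Not Trove's actual implementation.
    """
    if strategy is None:
        # No preference: keep the passages in their original order.
        return [p for p, _ in passages[:group_size]]
    if strategy == "random":
        return [p for p, _ in random.sample(passages, min(group_size, len(passages)))]
    # Sort by relevance level, highest first for 'most_relevant'.
    reverse = strategy == "most_relevant"
    ranked = sorted(passages, key=lambda pair: pair[1], reverse=reverse)
    return [p for p, _ in ranked[:group_size]]

candidates = [("p1", 0), ("p2", 3), ("p3", 1), ("p4", 2)]
select_passages(candidates, 2, "most_relevant")   # ["p2", "p4"]
select_passages(candidates, 2, "least_relevant")  # ["p1", "p3"]
```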

query_max_len: Optional[int] = 32

The maximum total input sequence length after tokenization for a query. Sequences longer than this will be truncated; sequences shorter will be padded.

passage_max_len: Optional[int] = 128

The maximum total input sequence length after tokenization for a passage. Sequences longer than this will be truncated; sequences shorter will be padded.

pad_to_multiple_of: Optional[int] = 16

If set, pad the sequence to a multiple of the provided value. This is especially useful for enabling Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
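The padded length is simply the tokenized length rounded up to the nearest multiple. A small sketch of the arithmetic:

```python
def padded_length(seq_len, multiple=16):
    # Round seq_len up to the nearest multiple (ceiling division).
    return -(-seq_len // multiple) * multiple

padded_length(100)  # 112  (next multiple of 16 above 100)
padded_length(128)  # 128  (already a multiple, unchanged)
```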

to_dict()

Return a JSON-serializable view of the class attributes.

Return type:

Dict
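For a dataclass like this one, such a view typically matches what `dataclasses.asdict` produces. A minimal sketch using a stand-in class that mirrors the documented fields and defaults (this copy is illustrative only, not Trove's own definition):

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class DataArguments:  # illustrative stand-in with the documented defaults
    dataset_name: Optional[str] = None
    group_size: int = 8
    positive_passage_no_shuffle: bool = False
    negative_passage_no_shuffle: bool = False
    passage_selection_strategy: str = "most_relevant"
    query_max_len: Optional[int] = 32
    passage_max_len: Optional[int] = 128
    pad_to_multiple_of: Optional[int] = 16

args = DataArguments(dataset_name="msmarco", group_size=16)
d = asdict(args)           # dict of all fields, JSON-serializable
json.dumps(d)              # serializes without error
d["group_size"]            # 16
```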