VectorCacheMixin
- class trove.data.vector_cache_mixin.VectorCacheMixin(cache_file_name=None)
- __init__(cache_file_name=None)
Mixin class for reading and writing dense vectors to pyarrow tables.
This class provides functionality for reading and writing dense vectors to pyarrow tables. It is intended to be used as a mixin that adds this capability to other classes with minimal effort. We use it to read and write query and passage embeddings to a cache file during the evaluation of IR models.
It only supports reading and writing `(_id, value)` tuples:
- `_id` should be of type `str`
- `value` is a 1D numpy array of type `float32`
- `_id` must uniquely identify each record in the entire dataset
- Parameters:
cache_file_name (Optional[os.PathLike]) – Path to cache file. Cache is disabled if not provided. It is also possible to add it after initialization.
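A minimal usage sketch, assuming a hypothetical host class `EmbeddingStore` (the mixin is meant to be combined with your own encoder/evaluator class, not instantiated standalone) and a placeholder cache path:
```python
from trove.data.vector_cache_mixin import VectorCacheMixin

# Hypothetical host class: in practice you would mix this into your own class.
class EmbeddingStore(VectorCacheMixin):
    pass

store = EmbeddingStore(cache_file_name="embeddings.arrow")  # placeholder path
print(store.is_cache_available)  # presumably True once a cache file name is set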
- property is_cache_available: bool
- property effective_cache_file_name: PathLike | None
Effective cache file name after taking the `cache_variant_subdir` property into account. You should always use this property to read/write cache files.
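For example (an illustrative snippet; `store` is the hypothetical instance from the sketch above):
```python
# Resolve the path through the property rather than any raw attribute,
# so the cache_variant_subdir is taken into account.
path = store.effective_cache_file_name
if path is not None:
    print(f"cache file: {path}")
```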
- reset_state()
Resets the internal state of the class, as if it were a fresh instance created without a cache filename.
Does NOT change files on disk.
- Return type:
None
- unload_cache()
Purge the cache and index lookup tables that are loaded into memory.
It reverses the effect of `load_cache()`.
- Return type:
None
- update_cache_subdir(subdir, append=True, load=True)
Update the cache nested subdir (e.g., after some permanent change to the corresponding dataset).
When some permanent change has been made to the corresponding dataset (e.g., hard sharding), you should let the `VectorCache` know where to save/load the cache files for the new dataset. This method also resets the state of loaded cache files, etc. (but does not change files on disk). After updating the cache subdir, it calls `update_cache_file_name()` to initialize the new cache if it exists. A usage sketch follows the parameter list below.
- Parameters:
subdir (Optional[os.PathLike]) – New cache files are saved in a subdirectory with this name in the parent directory that would’ve contained the original cache files.
append (bool) – If a cache `subdir` is already set, whether to append the new subdir to it or overwrite it.
load (bool) – If true, load the new cache file if it exists (the state of the previous cache tables is always cleared even if `load` is set to `False`).
- Return type:
None
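A hedged sketch of how this might be used after hard-sharding a dataset (`shard_03` is an illustrative name, not part of the API):
```python
# Point the cache at a nested subdirectory for the resharded dataset.
# In-memory state from the previous cache is cleared; files on disk
# are untouched.
store.update_cache_subdir("shard_03", append=True, load=True)
```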
- update_cache_file_name(file_name=None, load=True)
Resets the state of the class and points to the new cache file.
- Parameters:
file_name (Optional[os.PathLike]) – New cache file.
load (bool) – If true and `file_name` is not `None`, load the new cache file.
- Return type:
None
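For instance (illustrative file name):
```python
# Switch to a different cache file and load it if it already exists.
store.update_cache_file_name("passage_embeddings.arrow", load=True)
```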
- load_cache()
Load the cache file as a memory mapped Arrow table.
- Return type:
None
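To illustrate what memory mapping buys here, a rough pyarrow sketch of the same idea (this is not the library's internal code; `cache.arrow` is a placeholder path):
```python
import pyarrow as pa

with pa.memory_map("cache.arrow", "r") as source:
    table = pa.ipc.open_file(source).read_all()
# The table's buffers reference the mapped file, so vectors are paged in
# on access instead of being copied into memory up front.
```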
- get_cached_value(_id)
If possible, loads the cached value for the given `_id`.
Warning
The returned array shares memory with the Arrow table and must not be modified in place; copy it first if you need to change it.
- Parameters:
_id (str) – Load the cached value corresponding to this unique `_id`.
- Return type:
Optional[ndarray]
- Returns:
If the cache exists, return the cached value as a numpy array; otherwise, return `None`.
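A short sketch of a safe read, given the shared-memory warning above (`store` and the id are illustrative):
```python
import numpy as np

vec = store.get_cached_value("q1")  # "q1" is an illustrative id
if vec is not None:
    vec = vec.copy()  # copy first: the original shares Arrow memory
    vec /= np.linalg.norm(vec)
```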
- open_cache_io_streams()
A context manager to prepare for writing to cache files.
It opens the cache file and creates the necessary write handlers. Should be used like:
```python
with instance.open_cache_io_streams():
    instance.cache_records(...)  # You write to cache here
```
- cache_records(rec_id, value, chunk_size=1_000)
Add a batch of records to the cache.
Writes are not immediate: records are buffered and written to file in chunks once enough of them have accumulated.
- Parameters:
rec_id (List[str]) – a list of unique `_id` values.
value (Union[torch.Tensor, np.ndarray, List[np.ndarray]]) – values to cache. It should be either a 2D `torch.Tensor` or a 2D `np.ndarray`. Each row is treated as a record to be written to the cache.
chunk_size (int) – The number of records to write at a time.
- Return type:
None
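Putting the write path together, a hedged sketch of a caching loop (assumes `store` from the earlier sketch and illustrative ids/vectors; the chunked flushing is handled internally):
```python
import numpy as np

batches = [
    (["p1", "p2"], np.ones((2, 128), dtype=np.float32)),
    (["p3"], np.zeros((1, 128), dtype=np.float32)),
]

with store.open_cache_io_streams():
    for ids, vectors in batches:
        # Each row of `vectors` is one record; writes are buffered and
        # written to the Arrow file in chunks.
        store.cache_records(ids, vectors, chunk_size=1_000)
    store.flush()  # write any remaining buffered records
```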
- flush(chunk_size=1_000)
Write buffered records to file.
- Parameters:
chunk_size (int) – The size of the chunked arrays written to the Arrow file.
- Return type:
None
- property all_rec_ids: List[str]
Returns the list of all record ids in the dataset across all shards.