VectorCacheMixin

class trove.data.vector_cache_mixin.VectorCacheMixin(cache_file_name=None)
__init__(cache_file_name=None)

Mixin class for reading and writing dense vectors to pyarrow tables.

It is intended to be used as a mixin that adds this functionality to other classes with minimal effort. We use it to read and write query and passage embeddings to a cache file during evaluation of IR models.

It only supports reading and writing (_id, value) tuples.

  • _id should be of type str

  • value is a 1D numpy array of type float32

  • _id must uniquely identify each record in the entire dataset

Parameters:

cache_file_name (Optional[os.PathLike]) – Path to the cache file. Caching is disabled if not provided. The file name can also be set after initialization (see update_cache_file_name()).
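
For example, a host class can mix it in as follows. This is a minimal sketch: EmbeddingStore and embeddings.arrow are hypothetical names, not part of the library.

from trove.data.vector_cache_mixin import VectorCacheMixin

class EmbeddingStore(VectorCacheMixin):
    """Hypothetical host class; the mixin adds vector caching to it."""

    def __init__(self, cache_file_name=None):
        super().__init__(cache_file_name=cache_file_name)

store = EmbeddingStore(cache_file_name="embeddings.arrow")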

property is_cache_available: bool
property effective_cache_file_name: PathLike | None

Effective cache file name after taking into account the cache_variant_subdir property.

You should always use this property to read/write cache files.
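
For instance, continuing the hypothetical store from the sketch above:

# May differ from cache_file_name when a cache variant subdir is set.
path = store.effective_cache_file_name
if path is not None:
    print(f"Cache file resolves to: {path}")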

reset_state()

Resets the internal state of the class, as if it were a fresh instance created without a cache filename.

Does NOT change files on disk.

Return type:

None

unload_cache()

Purge the cache and index lookup tables that are loaded into memory.

It reverses the effect of load_cache().

Return type:

None

update_cache_subdir(subdir, append=True, load=True)

Update the nested cache subdir (e.g., after some permanent change to the corresponding dataset).

When a permanent change has been made to the corresponding dataset (e.g., hard sharding), use this method to let the vector cache know where to save and load the cache files for the new dataset. This method also resets the state of loaded cache tables, etc., but it does not change files on disk.

After updating the cache subdir, it calls update_cache_file_name() to initialize the new cache if it exists.

Parameters:
  • subdir (Optional[os.PathLike]) – New cache files are saved in a subdirectory with this name in the parent directory that would’ve contained the original cache files.

  • append (bool) – Whether to append to or overwrite the existing cache subdir if one exists.

  • load (bool) – If true, load the new cache file if it exists (the state of the previous cache tables is always cleared even if load is set to False).

Return type:

None
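
A minimal sketch of the intended call, reusing the hypothetical store from above (the subdir name is made up):

# After a permanent change to the dataset (e.g., hard sharding),
# redirect cache files to a nested subdirectory and load any existing cache.
store.update_cache_subdir("shard_00", append=True, load=True)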

update_cache_file_name(file_name=None, load=True)

Resets the state of the class and points to the new cache file.

Parameters:
  • file_name (Optional[os.PathLike]) – New cache file.

  • load (bool) – If true and file_name is not None, load the new cache file.

Return type:

None
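
For example (the file name is hypothetical):

# Reset the cache state and point to a different file; load it if it exists.
store.update_cache_file_name("passage_embeddings.arrow", load=True)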

load_cache()

Load the cache file as a memory mapped Arrow table.

Return type:

None
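
Together with unload_cache(), this supports a simple load/read/unload cycle. A sketch, assuming the hypothetical store from above:

store.load_cache()    # memory-map the Arrow table for reading
# ... read cached vectors here ...
store.unload_cache()  # release in-memory tables; files on disk are untouched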

get_cached_value(_id)

If possible, loads the cached value for the given _id.

Warning

The returned value shares memory with the underlying Arrow table and must not be modified in place. Make a copy first if you need to mutate it.

Parameters:

_id (str) – Load the cached value corresponding to this unique _id

Return type:

Optional[ndarray]

Returns:

If a cached value exists for _id, returns it as a numpy array; otherwise, returns None.
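
A sketch of the typical read pattern (the _id is hypothetical):

vec = store.get_cached_value("q-42")
if vec is not None:
    vec = vec.copy()  # copy before any in-place change; the original
                      # array shares memory with the Arrow table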

open_cache_io_streams()

A context manager to prepare for writing to cache files.

It opens the cache file and creates the necessary write handlers. It should be used like:

with instance.open_cache_io_streams():
    instance.cache_records(...)  # You write to cache here

cache_records(rec_id, value, chunk_size=1_000)

Add a batch of records to the cache.

The write is not immediate: records are buffered and written in chunks once enough of them have accumulated.

Parameters:
  • rec_id (List[str]) – A list of unique _id values.

  • value (Union[torch.Tensor, np.ndarray, List[np.ndarray]]) – Values to cache: a 2D torch.Tensor, a 2D np.ndarray, or a list of 1D np.ndarray. Each row (or list item) is treated as one record to write to the cache.

  • chunk_size (int) – The number of records to write at a time.

Return type:

None

flush(chunk_size=1_000)

Write buffered records to file.

Parameters:

chunk_size (int) – The size of the chunked arrays written to the Arrow file.

Return type:

None
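
Putting open_cache_io_streams(), cache_records(), and flush() together, a write pass might look like the sketch below. The ids and vectors are fabricated, and the explicit flush() is shown for clarity.

import numpy as np

ids = [f"doc-{i}" for i in range(256)]                  # unique _ids (str)
vectors = np.random.rand(256, 768).astype(np.float32)   # one row per record

with store.open_cache_io_streams():
    store.cache_records(ids, vectors, chunk_size=1_000)
    store.flush()  # write any remaining buffered records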

property all_rec_ids: List[str]

Returns the list of all record ids in the dataset across all shards.
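
For example, to iterate over every cached record (continuing the hypothetical store from above):

store.load_cache()
for rec_id in store.all_rec_ids:
    vec = store.get_cached_value(rec_id)  # read-only view into the cache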