FastResultHeapq

class trove.containers.result_heapq_fast.FastResultHeapq(topk=100, special_docids=None)
__init__(topk=100, special_docids=None)

Keeps track of the topk largest scores for each query.

This is much faster than the simple Python heapq used in ResultHeapq. This class requires you to iterate over document and query batches in nested for loops, and crucially, the outer loop must generate the document batches, i.e.,

for doc_batch in corpus:
    for query_batch in queries:
        # scores: similarities between query_batch and doc_batch
        fast_result_heapq(scores, ...)

It keeps track of the topk most similar documents seen so far. It buffers all score batches for each doc_batch and merges them with results collected so far right before moving on to the next doc_batch.
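
The buffering and merging pattern described above can be illustrated with a minimal pure-Python sketch. This is not the actual implementation of FastResultHeapq, which operates on score tensors. The names collect_topk and merge_topk, the dict-based batches, and the dot-product scoring are all assumptions made for illustration only.

```python
import heapq
from typing import Dict, List, Tuple

Record = Tuple[float, str]  # (score, doc_id); hypothetical record layout


def merge_topk(current: List[Record], buffered: List[Record], k: int) -> List[Record]:
    """Return the k highest-scoring records among the running list and the new buffer."""
    return heapq.nlargest(k, current + buffered)


def collect_topk(
    doc_batches: List[Dict[str, List[float]]],
    query_batches: List[Dict[str, List[float]]],
    k: int,
) -> Dict[str, List[Record]]:
    """Illustrative only: outer loop over documents, inner loop over queries."""
    topk: Dict[str, List[Record]] = {}
    for doc_batch in doc_batches:  # outer loop generates the document batches
        buffer: Dict[str, List[Record]] = {}
        for query_batch in query_batches:  # inner loop generates the query batches
            for qid, qvec in query_batch.items():
                for did, dvec in doc_batch.items():
                    # Assumed similarity: a plain dot product.
                    score = sum(a * b for a, b in zip(qvec, dvec))
                    buffer.setdefault(qid, []).append((score, did))
        # Merge the buffered scores once, right before the next doc_batch.
        for qid, records in buffer.items():
            topk[qid] = merge_topk(topk.get(qid, []), records, k)
    return topk


doc_batches = [{"d1": [1.0, 0.0], "d2": [0.0, 1.0]}, {"d3": [1.0, 1.0]}]
query_batches = [{"q1": [2.0, 1.0]}, {"q2": [1.0, 3.0]}]
result = collect_topk(doc_batches, query_batches, k=2)
print(result["q1"])  # [(3.0, 'd3'), (2.0, 'd1')]
print(result["q2"])  # [(4.0, 'd3'), (3.0, 'd2')]
```

Merging once per document batch, rather than after every score batch, means the running topk list for each query is rebuilt only as many times as there are document batches. This is presumably why the outer loop must iterate over documents.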

Unlike ResultHeapq, it uses GPUs for computation if it receives tensors already on GPU.

This class does not provide all the nice utilities for input/output formatting that ResultHeapq does. To use those utilities, after you are done collecting the similarities, export the data from this class and import it into a fresh instance of ResultHeapq.

The arguments are the same as those of ResultHeapq. See the ResultHeapq docstring for details.

export_result_dump(reset_state=False)

Export all the results collected so far in the same format as ResultHeapq.

See ResultHeapq for details.

Return type:

Dict[str, Dict[str, List[Tuple[str, float]]]]

reset_state()

Clear the data collected so far.

Return type:

None

update_topk_records()

Merges the topk records collected so far with the topk records from the most recent batch and selects the new topk records.

Return type:

None