FastResultHeapq
- class trove.containers.result_heapq_fast.FastResultHeapq(topk=100, special_docids=None)
- __init__(topk=100, special_docids=None)
Keeps track of the topk largest scores for each query.
This is much faster than the simple Python heapq used in ResultHeapq. This class requires you to iterate over document and query batches in nested for loops and, crucially, the outer loop must generate the document batches. I.e.,
    for doc_batch in corpus:
        for query_batch in queries:
            fast_result_heapq(scores, ...)
It keeps track of the topk most similar documents seen so far. It buffers all score batches for each doc_batch and merges them with the results collected so far right before moving on to the next doc_batch.
Unlike ResultHeapq, it uses GPUs for computation if it receives tensors that are already on a GPU.
This class does not provide all the input/output formatting utilities that ResultHeapq does. To use those utilities, after you are done collecting the similarities, export the data from this class and import it into a fresh instance of ResultHeapq.
The arguments are the same as for ResultHeapq. See the ResultHeapq docstring for details.
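The loop order and the buffer-then-merge behaviour described above can be illustrated with a small pure-Python sketch. This is not the trove implementation and it does not use tensors or GPUs. The class ToyBufferedTopK, its add and merge methods, the toy_score function, and the sample data are all hypothetical names made up for this illustration.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, float]  # (doc_id, score)


class ToyBufferedTopK:
    """Illustrative stand-in only, NOT the trove implementation."""

    def __init__(self, topk: int = 100) -> None:
        self.topk = topk
        self.best: Dict[str, List[Record]] = {}    # merged topk per query
        self.buffer: Dict[str, List[Record]] = {}  # scores for the current doc batch

    def add(self, query_ids, doc_ids, scores) -> None:
        # scores[i][j] is the similarity between query i and document j.
        for qid, row in zip(query_ids, scores):
            self.buffer.setdefault(qid, []).extend(zip(doc_ids, row))

    def merge(self) -> None:
        # Combine buffered scores with earlier results and keep the topk largest.
        for qid, new in self.buffer.items():
            merged = self.best.get(qid, []) + new
            merged.sort(key=lambda rec: rec[1], reverse=True)
            self.best[qid] = merged[: self.topk]
        self.buffer = {}


def toy_score(query: str, doc: str) -> float:
    # Hypothetical similarity: number of distinct shared characters.
    return float(len(set(query) & set(doc)))


# Two document batches and two query batches of (id, text) pairs.
corpus = [[("d1", "apple"), ("d2", "grape")], [("d3", "maple"), ("d4", "kiwi")]]
queries = [[("q1", "ample")], [("q2", "wiki")]]

collector = ToyBufferedTopK(topk=2)
for doc_batch in corpus:          # outer loop: document batches
    doc_ids = [d for d, _ in doc_batch]
    for query_batch in queries:   # inner loop: query batches
        query_ids = [q for q, _ in query_batch]
        scores = [[toy_score(qt, dt) for _, dt in doc_batch] for _, qt in query_batch]
        collector.add(query_ids, doc_ids, scores)
    collector.merge()             # merge once, right before the next doc_batch
```

Because the merge happens once per document batch rather than once per score batch, every query sees a complete document batch before its topk list is updated. This is why the document batches have to come from the outer loop.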
- export_result_dump(reset_state=False)
Export all the results collected so far in the same format as ResultHeapq. See ResultHeapq for details.
- Return type:
Dict[str, Dict[str, List[Tuple[str, float]]]]
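The return type above describes three levels of nesting: an outer string key, an inner string key, and a list of (string, float) pairs. As a rough sketch of how such a structure could be walked, the helper below flattens it into rows. What the keys stand for is documented in ResultHeapq and is not assumed here. The name flatten_dump and the sample data are hypothetical placeholders, not part of trove.

```python
from typing import Dict, List, Tuple

ResultDump = Dict[str, Dict[str, List[Tuple[str, float]]]]


def flatten_dump(dump: ResultDump) -> List[Tuple[str, str, str, float]]:
    """Turn the nested dump into flat (outer_key, inner_key, item_id, score) rows."""
    rows = []
    for outer_key, inner in dump.items():
        for inner_key, pairs in inner.items():
            for item_id, score in pairs:
                rows.append((outer_key, inner_key, item_id, score))
    return rows


# Hand-written placeholder data with the documented shape; the keys are made up.
example_dump: ResultDump = {
    "outer_a": {"inner_1": [("x", 0.9), ("y", 0.4)]},
    "outer_b": {"inner_2": [("z", 0.7)]},
}
rows = flatten_dump(example_dump)
```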
- reset_state()
Clear the data collected so far.
- Return type:
None
- update_topk_records()
Merges the topk records collected so far with the topk records from the most recent batch and selects the new topk records.
- Return type:
None