caching issues using dictionary #14

Closed
sbng opened this issue Oct 1, 2024 · 5 comments · Fixed by #15

Comments


sbng commented Oct 1, 2024

After running version 0.2.4 on a large data set, I have observed that caching a large number of key-value pairs sometimes causes cache errors and corrupts some of the cache entries. These occurrences appear completely random and happen only once the cache reaches a significant size. I have not been able to pinpoint the exact cause.


sbng commented Oct 3, 2024

Upon further debugging, I found something very strange in the Encoder class. When this object's cache grows sufficiently large (about 69145 entries), I start getting corrupted values. Example:

ipdb> self.encode(1153669, 1)
b'((\x0c'
ipdb> self._encode_pointer(1153669)
b'0\t\x92\x85'

The pointer value 1153669 encodes to two very different results, and only _encode_pointer returns the correct one. Pointers 1153668-1153671 always encode incorrectly for this Encoder object; with a different object, the error happens on another range of pointers. Example of a correctly encoded pointer:

ipdb> self.encode(1153675,1)
b'0\t\x92\x8b'
ipdb> self._encode_pointer(1153675)
b'0\t\x92\x8b'

After confirming this behavior, I narrowed it down to this line:

type_id = self.python_type_id(value)

I suspect the type_decoder mapping may be doing some kind of caching of these functions and calling the incorrect one once the object grows to a significant size (just speculating). In any case, knowing where the failure occurs, I patched the code to call _encode_pointer(value) directly instead of going through the function mapping, and the random corruption of prefixes/data is completely gone:

                 self.data_list.append(res)
                 pointer_position = self.data_pointer
                 self.data_pointer += len(res)
-                pointer = self.encode(pointer_position, 1)
+                pointer = self._encode_pointer(pointer_position)
                 self.data_cache[cache_key] = pointer
                 return pointer
         return res 

Do you foresee any issue with my change? I am not sure if this is the correct way to fix this issue.
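
For what it's worth, here is a minimal toy sketch of the kind of type-blind cache collision I suspect. TinyEncoder and its helpers are illustrative stand-ins, not the actual Encoder internals, and the pointer helper only handles the 3-byte payload form of the MMDB pointer format, which covers this offset range:

class TinyEncoder:
    def __init__(self):
        # cache keyed by the bare value, with no type information
        self.data_cache = {}

    def _encode_pointer(self, value):
        # MMDB pointer, 3-byte payload form: control byte 0x30,
        # then (value - 526336) big-endian
        return b"\x30" + (value - 526336).to_bytes(3, "big")

    def _encode_uint(self, value):
        # simplified uint encoding, standing in for any other data type
        data = value.to_bytes((value.bit_length() + 7) // 8, "big")
        return bytes([0b11000000 | len(data)]) + data

    def encode(self, value, type_id=None):
        if value in self.data_cache:  # type_id is ignored by the lookup
            return self.data_cache[value]
        res = self._encode_pointer(value) if type_id == 1 else self._encode_uint(value)
        self.data_cache[value] = res
        return res

enc = TinyEncoder()
enc.encode(1153669)                  # 1153669 cached first as plain data...
print(enc.encode(1153669, 1))        # b'\xc3\x11\x9a\x85' -- stale data bytes, not a pointer
print(enc._encode_pointer(1153669))  # b'0\t\x92\x85' -- the correct pointer bytes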

sbng closed this as completed Oct 3, 2024
sbng reopened this Oct 3, 2024
vimt added a commit that referenced this issue Oct 4, 2024
Pointers don't need to be cached

vimt commented Oct 4, 2024

Thank you for bringing up this issue. There is indeed a problem here. I think the bug is that pointer-type values should not be cached.
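
To illustrate the idea with the toy TinyEncoder sketched above (again illustrative, not the literal code in the branch): a pointer encodes a byte offset into the output, so it is cheap to recompute and must never share a cache key with a data value. The fix amounts to a guard at the top of encode:

class FixedTinyEncoder(TinyEncoder):
    def encode(self, value, type_id=None):
        if type_id == 1:
            # pointers bypass the cache entirely; nothing is stored under
            # a key that a plain data value could collide with later
            return self._encode_pointer(value)
        return super().encode(value, type_id)

enc = FixedTinyEncoder()
enc.encode(1153669)            # cache the value as data first
print(enc.encode(1153669, 1))  # b'0\t\x92\x85' -- correct pointer now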

I've created a fix for this bug in a new branch: fix/14-caching-issues-using-dictionary. Could you please test this fix and let me know if it resolves the problem for you?

Your contribution in identifying this bug is greatly appreciated. Thank you for helping me improve the project.


sbng commented Oct 4, 2024

I have run the code through my largest DB of 11 million prefix/data entries. No pointer corruption reported. I think this patch is good to go. Thanks for the fix.


sbng commented Oct 4, 2024

Some interesting info: with this fix, I am seeing about a 20% drop in performance. With the old code I observed a processing rate of about 10,000 prefixes per second, while after the patch I am seeing about 8,000 prefixes per second. I think data accuracy outweighs the performance cost. FYI.

vimt linked a pull request Oct 4, 2024 that will close this issue
vimt closed this as completed in #15 Oct 4, 2024

vimt commented Oct 4, 2024

I've released a new version, v0.2.5; please try it.

About performance: I'm trying to reimplement the entire logic in Rust using PyO3, which should give a big performance boost. But it's still a work in progress. 😂
