caching issues using dictionary #14

Closed
sbng opened this issue Oct 1, 2024 · 5 comments · Fixed by #15

Comments


sbng commented Oct 1, 2024

After running version 0.2.4 on a large data set, I have observed that caching a large number of key-value pairs sometimes causes cache errors and corrupts some of the cache entries. These occurrences appear completely random and happen only once the cache reaches a significant size. I have not been able to pinpoint the exact cause.


sbng commented Oct 3, 2024

Upon further debugging, I found something very strange in the Encoder class. When this object's cache grows sufficiently large (about 69145 entries), I start getting corrupted values. Example:

ipdb> self.encode(1153669, 1)
b'((\x0c'
ipdb> self._encode_pointer(1153669)
b'0\t\x92\x85'

The pointer value 1153669 encodes to two very different results, and only _encode_pointer returns the correct one. Pointers 1153668-1153671 always encode incorrectly for this Encoder object; with a different object, the error happens on another range of pointers. Example of a correctly encoded pointer:

ipdb> self.encode(1153675,1)
b'0\t\x92\x8b'
ipdb> self._encode_pointer(1153675)
b'0\t\x92\x8b'

After confirming this behavior, I narrowed it down to this line:

type_id = self.python_type_id(value)

I suspect the type_decoder mapping may be doing some kind of caching of these functions and calling the incorrect one once the object grows to a significant size (just speculating). In any case, knowing where the failure occurs, I patched the code to call _encode_pointer(value) directly instead of going through the function mapping, and the random corruption of prefixes/data is completely gone:

                 self.data_list.append(res)
                 pointer_position = self.data_pointer
                 self.data_pointer += len(res)
-                pointer = self.encode(pointer_position, 1)
+                pointer = self._encode_pointer(pointer_position)
                 self.data_cache[cache_key] = pointer
                 return pointer
         return res 

Do you foresee any issue with my change? I am not sure if this is the correct way to fix this issue.
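
For what it's worth, here is a minimal toy sketch of the kind of type-blind cache collision I suspect. TinyEncoder and its helpers are illustrative stand-ins, not the actual Encoder internals, and the pointer helper only handles the 3-byte payload form of the MMDB pointer format, which covers this offset range:

class TinyEncoder:
    def __init__(self):
        # cache keyed by the bare value, with no type information
        self.data_cache = {}

    def _encode_pointer(self, value):
        # MMDB pointer, 3-byte payload form: control byte 0x30,
        # then (value - 526336) big-endian
        return b"\x30" + (value - 526336).to_bytes(3, "big")

    def _encode_uint(self, value):
        # simplified uint encoding, standing in for any other data type
        data = value.to_bytes((value.bit_length() + 7) // 8, "big")
        return bytes([0b11000000 | len(data)]) + data

    def encode(self, value, type_id=None):
        if value in self.data_cache:  # type_id is ignored by the lookup
            return self.data_cache[value]
        res = self._encode_pointer(value) if type_id == 1 else self._encode_uint(value)
        self.data_cache[value] = res
        return res

enc = TinyEncoder()
enc.encode(1153669)                  # 1153669 cached first as plain data...
print(enc.encode(1153669, 1))        # b'\xc3\x11\x9a\x85' -- stale data bytes, not a pointer
print(enc._encode_pointer(1153669))  # b'0\t\x92\x85' -- the correct pointer bytes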

sbng closed this as completed Oct 3, 2024
sbng reopened this Oct 3, 2024
vimt added a commit that referenced this issue Oct 4, 2024
Pointers don't need to be cached

vimt commented Oct 4, 2024

Thank you for bringing up this issue. There is indeed a problem here. I think the bug is that pointer-type values should not be cached.
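
To illustrate the idea with the toy TinyEncoder sketched above (again illustrative, not the literal code in the branch): a pointer encodes a byte offset into the output, so it is cheap to recompute and must never share a cache key with a data value. The fix amounts to a guard at the top of encode:

class FixedTinyEncoder(TinyEncoder):
    def encode(self, value, type_id=None):
        if type_id == 1:
            # pointers bypass the cache entirely; nothing is stored under
            # a key that a plain data value could collide with later
            return self._encode_pointer(value)
        return super().encode(value, type_id)

enc = FixedTinyEncoder()
enc.encode(1153669)            # cache the value as data first
print(enc.encode(1153669, 1))  # b'0\t\x92\x85' -- correct pointer now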

I've created a fix for this bug in a new branch: fix/14-caching-issues-using-dictionary. Could you please test this fix and let me know if it resolves the problem for you?

Your contribution in identifying this bug is greatly appreciated. Thank you for helping me improve the project.


sbng commented Oct 4, 2024

I have run the code through my largest DB of 11 million prefix/data entries. No pointer corruption reported. I think this patch is good to go. Thanks for the fix.


sbng commented Oct 4, 2024

Some interesting info: with this fix, I am seeing about a 20% drop in performance. With the old code I observed a processing rate of about 10,000 prefixes per second, while after the patch I am seeing about 8,000 prefixes per second. I think data accuracy outweighs the performance cost. FYI.

vimt linked a pull request Oct 4, 2024 that will close this issue
vimt closed this as completed in #15 Oct 4, 2024

vimt commented Oct 4, 2024

I've released a new version, v0.2.5; please try it.

About performance: I'm trying to reimplement the entire logic in Rust using PyO3, which should give a big performance boost. But it's still a work in progress. 😂
