
Cached string decoder for map keys (CachedKeyDecoder) #54

Merged
merged 9 commits into master from cached_key_string_decoder on Aug 2, 2019

Conversation

sergeyzenchenko
Collaborator

@sergeyzenchenko sergeyzenchenko commented Jun 3, 2019

Check this out, @gfx. We are using it in our project.

A large share of all strings are map keys, and most of the time the total number of unique keys is small.
For example, in our payload of 260k objects, 67% of all strings are keys with only 100 unique values.
I've tried to cache these values and skip string decoding entirely for keys.
The cache uses the byte representation of the key string and stores the actual decoded value, so instead of decoding a key string we just need to look it up in the cache.
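The idea can be sketched in TypeScript roughly as follows (a simplified illustration, not the PR's actual code; `SimpleCachedKeyDecoder` and its internals are made-up names): cached records are bucketed by key byte length, and a byte-wise comparison finds a previously decoded string before falling back to a fresh UTF-8 decode.

```typescript
type KeyCacheRecord = { bytes: Uint8Array; str: string };

class SimpleCachedKeyDecoder {
  // Buckets indexed by (byte length - 1) keep each candidate list short.
  private caches: Array<Array<KeyCacheRecord>> = [];

  constructor(readonly maxKeyLength = 16) {
    for (let i = 0; i < maxKeyLength; i++) this.caches.push([]);
  }

  private find(bytes: Uint8Array, offset: number, byteLength: number): string | null {
    const records = this.caches[byteLength - 1];
    outer: for (const record of records) {
      for (let j = 0; j < byteLength; j++) {
        if (record.bytes[j] !== bytes[offset + j]) continue outer;
      }
      return record.str; // cache hit: skip string decoding entirely
    }
    return null;
  }

  decode(bytes: Uint8Array, offset: number, byteLength: number): string {
    if (byteLength === 0 || byteLength > this.maxKeyLength) {
      // Keys outside the cacheable range are decoded directly.
      return new TextDecoder().decode(bytes.subarray(offset, offset + byteLength));
    }
    const cached = this.find(bytes, offset, byteLength);
    if (cached !== null) return cached;

    const str = new TextDecoder().decode(bytes.subarray(offset, offset + byteLength));
    // Copy the bytes before caching: the caller may reuse the input buffer.
    this.caches[byteLength - 1].push({ bytes: bytes.slice(offset, offset + byteLength), str });
    return str;
  }
}
```

With payloads like the one above (few unique keys repeated across many objects), the linear scan per bucket stays short while most keys hit the cache.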

Results before:

Benchmark on NodeJS/v12.3.1

operation                                                         |   op   |   ms  |  op/s 
----------------------------------------------------------------- | ------: | ----: | ------:
buf = Buffer.from(JSON.stringify(obj));                           |  557200 |  5000 |  111440
buf = JSON.stringify(obj);                                        | 1078100 |  5000 |  215620
obj = JSON.parse(buf);                                            |  394300 |  5001 |   78844
buf = require("msgpack-lite").encode(obj);                        |  416400 |  5000 |   83280
obj = require("msgpack-lite").decode(buf);                        |  313600 |  5000 |   62720
buf = require("@msgpack/msgpack").encode(obj);                    |  646100 |  5000 |  129220
obj = require("@msgpack/msgpack").decode(buf);                    |  561800 |  5000 |  112360
✨  Done in 36.69s.

After:

Benchmark on NodeJS/v12.3.1

operation                                                         |   op   |   ms  |  op/s 
----------------------------------------------------------------- | ------: | ----: | ------:
buf = Buffer.from(JSON.stringify(obj));                           |  598900 |  5000 |  119780
buf = JSON.stringify(obj);                                        | 1151000 |  5000 |  230200
obj = JSON.parse(buf);                                            |  397600 |  5000 |   79520
buf = require("msgpack-lite").encode(obj);                        |  429000 |  5000 |   85800
obj = require("msgpack-lite").decode(buf);                        |  323800 |  5000 |   64760
buf = require("@msgpack/msgpack").encode(obj);                    |  666800 |  5000 |  133360
obj = require("@msgpack/msgpack").decode(buf);                    |  744000 |  5000 |  148800
✨  Done in 36.95s.

I am not sure we should include this directly in the library, but maybe it can be added as an optional speedup that users can enable if required.

It works better when the Decoder is reused, because the cache needs to be populated first.

Let me know what you think about it.

@gfx
Member

gfx commented Jun 3, 2019

Fascinating idea! 👏 👏 👏

I'll merge the string caching in decoders, but some improvements and changes are needed.

The reuse of Decoder

If one wants to decode in the fastest way, they should reuse a Decoder instance anyway. We should rewrite the benchmark script to reuse the Decoder.

And I think decode() should create a Decoder each time in order to keep configuration easy to control.

Default behavior

I'd like to use caching by default, so it should be merged to Decoder.

A cacheSize: number configuration may be a good idea to control memory usage, because uncontrolled caching is a kind of memory leak.

@sergeyzenchenko
Collaborator Author

Sure, this is just a prototype solution that we use; it's not intended to be merged directly. It requires changes and tests.

I think we can allow users to pass a string decoder via options, so a custom decoder can be supplied if needed.

What do you think about it?

@sergeyzenchenko
Collaborator Author

I am not sure yet how to properly implement cacheSize.
In the worst case, the cache can be filled with rarely used keys if they happen to appear frequently in the first documents.

Maybe I can add a cache.compact() function that the user can call on demand,
or it could be called periodically inside the Decoder.
This function would remove the less frequently used keys.

The current version tracks the number of hits for each key and orders the entries by hit count, so lookups of the most frequently used keys are faster.
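The hit-count bookkeeping and compact() described above might look roughly like this (a hypothetical sketch with illustrative names, not the prototype's actual code):

```typescript
type Rec = { key: string; hits: number };

class HitCountCache {
  records: Rec[] = [];

  // Record a lookup: bump the hit count and keep hot keys near the front.
  touch(key: string): void {
    const r = this.records.find((x) => x.key === key);
    if (r) {
      r.hits++;
      this.records.sort((a, b) => b.hits - a.hits);
    } else {
      this.records.push({ key, hits: 1 });
    }
  }

  // On demand (or periodically), drop the less frequently used half.
  compact(): void {
    this.records.sort((a, b) => b.hits - a.hits);
    this.records.length = Math.ceil(this.records.length / 2);
  }
}
```

The trade-off is visible even in this sketch: the sort on every hit adds overhead to the hot path, which is what the merged version later avoids.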

@sergeyzenchenko
Collaborator Author

Lol @gfx , I've just improved performance more :D Now it's 176660 ops/s

@gfx
Member

gfx commented Jun 4, 2019

Lol @gfx , I've just improved performance more :D Now it's 176660 ops/s

Amazing! 😆


cacheSize is not necessarily correct; it's just a hint.


I think we can allow user to pass string decoder to options, so user can pass custom decoder if needed

You mean decode(buffer, { stringDecoder: sharedCachingStringDecoder })? I prefer { stringCache: sharedStringCache } because its intent is clearer: users can focus on the cache engine rather than the whole string decoder.
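A possible shape for such an option (all names here are hypothetical; the final API was still under discussion at this point): the decoder consults an injected cache for map keys and falls back to plain UTF-8 decoding on a miss.

```typescript
// Hypothetical cache interface: the user supplies only the cache engine,
// not a whole string decoder.
interface StringCache {
  get(bytes: Uint8Array): string | null;
  store(bytes: Uint8Array, value: string): void;
}

function decodeKey(bytes: Uint8Array, cache?: StringCache): string {
  if (cache) {
    const hit = cache.get(bytes);
    if (hit !== null) return hit; // cache hit: no UTF-8 decode needed
  }
  const str = new TextDecoder().decode(bytes);
  // Copy the bytes before storing: the caller may reuse the input buffer.
  cache?.store(bytes.slice(), str);
  return str;
}
```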


BTW I have another idea to share encoders and decoders, not directly related to this PR: #56

@gfx
Member

gfx commented Jul 26, 2019

I'd like to proceed with this patch this week. FYI: Chrome 76 is going to be released at the end of this month, and it includes a JSON.parse performance improvement.

@gfx gfx force-pushed the cached_key_string_decoder branch from 838b277 to 8be4057 Compare August 2, 2019 02:26
@gfx gfx changed the title Experimental cache string decoder for map keys Cached string decoder for map keys (CachedKeyDecoder) Aug 2, 2019
```ts
if (records.length >= this.maxLengthPerKey) {
  // `records` is full!
  // Set `record` at a randomized position.
  records[(Math.random() * records.length) | 0] = record;
```
@gfx
Member

I have replaced the original hit-count-based sorting algorithm with a random-picking algorithm to reduce overhead in #get().
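For illustration, the replacement strategy can be isolated as a small generic helper (hypothetical name `insertBounded`; in the PR the logic lives inline in CachedKeyDecoder): once a bucket reaches its bound, a random slot is overwritten instead of letting the cache grow.

```typescript
// Bounded insert with random eviction: the cache never exceeds
// maxLengthPerKey entries per bucket, addressing the memory concern.
function insertBounded<T>(records: T[], record: T, maxLengthPerKey: number): void {
  if (records.length >= maxLengthPerKey) {
    // Overwrite a randomized position; no hit counts to maintain.
    records[(Math.random() * records.length) | 0] = record;
  } else {
    records.push(record);
  }
}
```

Random replacement keeps the lookup path free of bookkeeping; frequently seen keys are quickly re-cached even if they happen to be evicted.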

@codecov-io

codecov-io commented Aug 2, 2019

Codecov Report

Merging #54 into master will increase coverage by 0.08%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #54      +/-   ##
==========================================
+ Coverage   98.16%   98.25%   +0.08%     
==========================================
  Files          15       16       +1     
  Lines         927      972      +45     
  Branches      189      197       +8     
==========================================
+ Hits          910      955      +45     
  Misses         17       17
Impacted Files          | Coverage Δ
----------------------- | ------------------
src/CachedKeyDecoder.ts | 100% <100%> (ø)
src/Decoder.ts          | 98.88% <100%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a0be621...11803d8. Read the comment docs.

@gfx
Member

gfx commented Aug 2, 2019

Benchmark on NodeJS/v12.7.0

operation                                                         |   op   |   ms  |  op/s 
----------------------------------------------------------------- | ------: | ----: | ------:
buf = Buffer.from(JSON.stringify(obj));                           |  507700 |  5000 |  101540
buf = JSON.stringify(obj);                                        |  958100 |  5000 |  191620
obj = JSON.parse(buf);                                            |  346500 |  5000 |   69300
buf = require("msgpack-lite").encode(obj);                        |  361800 |  5001 |   72345
obj = require("msgpack-lite").decode(buf);                        |  267400 |  5000 |   53480
buf = require("@msgpack/msgpack").encode(obj);                    |  510200 |  5000 |  102040
obj = require("@msgpack/msgpack").decode(buf);                    |  825500 |  5000 |  165100

@gfx
Member

gfx commented Aug 2, 2019

Now, this PR is finished. cc: @sergeyzenchenko

```diff
@@ -59,6 +62,9 @@ export class Decoder {
  headByte = HEAD_BYTE_REQUIRED;
  readonly stack: Array<StackState> = [];

  // TODO: parameterize this property.
  readonly cachedKeyDecoder = sharedCachedKeyDecoder;
```
@gfx
Member

I'm thinking about the interface for customizing a key decoder, so it's not public yet.

@gfx gfx merged commit 0a5966e into master Aug 2, 2019
@gfx gfx deleted the cached_key_string_decoder branch August 2, 2019 05:48
@gfx gfx mentioned this pull request Oct 4, 2020