[WIP] Stream based token parser #225

Closed
arthurschreiber wants to merge 15 commits

Conversation

arthurschreiber
Collaborator

This replaces the token parser inside tedious with a stream based version based on the dissolve module.

This is pretty much a big ugly hack at the moment, but it passes all tedious unit and integration tests. (The application I'm working on also passes all its tests with the new parser.) Beware, here be dragons.

The idea behind these changes is to make the token parsing code more performant. Right now, if a token is split over multiple TDS messages, parsing is retried from scratch each time a new message is received. Obviously, this results in incredibly bad performance.

The dissolve module gives us a way to do chunk-based parsing of tokens without having to throw away partial parsing results when tokens span multiple messages.
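
For anyone unfamiliar with dissolve, here's a minimal sketch of the idea (illustrative only, not tedious code): tokens with a 1-byte type, a 2-byte little-endian length, and a payload, parsed correctly across arbitrary chunk boundaries:

```coffee
Dissolve = require "dissolve"

# Each token: 1-byte type, 2-byte little-endian length, then payload.
parser = Dissolve().loop ->
  @uint8("type").uint16le("length").tap ->
    @buffer("data", @vars.length).tap ->
      @push @vars
      @vars = {}

parser.on "data", (token) ->
  console.log token

# Parser state survives across writes, so a token split over two
# chunks is parsed exactly once, with no reparsing from scratch.
parser.write new Buffer([0x01, 0x03, 0x00, 0x61])
parser.write new Buffer([0x62, 0x63])
```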

Here are some performance numbers (see benchmarks/query.coffee):

Old Parser

$ node_modules/.bin/coffee benchmarks/query.coffee
Benchmarking 'nvarchar (small)'
nvarchar (small) x 68.06 ops/sec ±0.32% (22 runs sampled)
Memory: 5.4609375
Benchmarking 'nvarchar (large)'
nvarchar (large) x 1.84 ops/sec ±1.45% (14 runs sampled)
Memory: 19.265625
Benchmarking 'varbinary (small)'
varbinary (small) x 66.09 ops/sec ±0.14% (14 runs sampled)
Memory: 3.6171875
Benchmarking 'varbinary (large)'
varbinary (large) x 1.89 ops/sec ±1.64% (14 runs sampled)
Memory: 242.21484375
Benchmarking 'varbinary (huge)'
varbinary (huge) x 0.02 ops/sec ±9.94% (5 runs sampled)
Memory: 324.49609375
Done!

New Parser

$ node_modules/.bin/coffee benchmarks/query.coffee
Benchmarking 'nvarchar (small)'
nvarchar (small) x 69.67 ops/sec ±0.50% (29 runs sampled)
Memory: 7.828125
Benchmarking 'nvarchar (large)'
nvarchar (large) x 7.01 ops/sec ±1.82% (37 runs sampled)
Memory: 31.6328125
Benchmarking 'varbinary (small)'
varbinary (small) x 65.99 ops/sec ±0.19% (13 runs sampled)
Memory: 6.296875
Benchmarking 'varbinary (large)'
varbinary (large) x 8.09 ops/sec ±2.41% (41 runs sampled)
Memory: 51.69140625
Benchmarking 'varbinary (huge)'
varbinary (huge) x 0.80 ops/sec ±1.13% (8 runs sampled)
Memory: 276.75390625
Done!

As you can see, the switch to dissolve hugely improves performance when receiving data that spans multiple TDS messages, while not impacting the performance of data that is contained inside a single message.

There is a slight increase in the memory usage reported in the benchmarks. I'm not yet sure what is causing this, but I'll take a closer look.

Overall, this is definitely not ready yet, but it's meant to show off some of the work I've been doing and to gather some feedback.

@arthurschreiber
Collaborator Author

The unit tests are failing right now, I'll get that fixed up ASAP. But the integration tests are all passing fine.

@arthurschreiber
Collaborator Author

The latest performance numbers:

Benchmarking 'nvarchar (small)'
nvarchar (small) x 75.82 ops/sec ±0.47% (51 runs sampled)
Memory: 6.64453125
Benchmarking 'nvarchar (large)'
nvarchar (large) x 9.07 ops/sec ±2.04% (46 runs sampled)
Memory: 20.6328125
Benchmarking 'varbinary (small)'
varbinary (small) x 66.26 ops/sec ±0.17% (14 runs sampled)
Memory: 4.609375
Benchmarking 'varbinary (large)'
varbinary (large) x 10.58 ops/sec ±2.98% (51 runs sampled)
Memory: 19.91796875
Benchmarking 'varbinary (huge)'
varbinary (huge) x 1.03 ops/sec ±1.67% (10 runs sampled)
Memory: 191.42578125
Done!

This requires some patches to the dissolve module. I'll open a PR for those shortly on its GitHub project page.

@bretcope
Member

@arthurschreiber are you interested in becoming a maintainer on Tedious? Quite frankly, we could use a couple of new maintainers who are smart and actively using Tedious. @rossipedia and I were making significant contributions for a while, but we ended up porting the project we were using it for away from node.js, so our time and motivation to maintain Tedious have dropped off quite a bit. @patriksimek is still around here and there, but no one has the time to give it the attention it deserves.

You've started tackling streams, performance, and testing concerns, which have all been on our wish list for a very long time. If your intentions are to continue contributing improvements like this, then I don't think anyone would have a problem adding you as a collaborator.

@arthurschreiber
Collaborator Author

@bretcope Sure, why not. 😄 I also got added as a maintainer for the dissolve module that I've been using in this POC.

With some more changes that I made to the dissolve module, I get the following performance numbers now:

Benchmarking 'nvarchar (small)'
nvarchar (small) x 69.76 ops/sec ±0.37% (29 runs sampled)
Memory: 6.1796875
Benchmarking 'nvarchar (large)'
nvarchar (large) x 10.65 ops/sec ±1.27% (52 runs sampled)
Memory: 11.9140625
Benchmarking 'varbinary (small)'
varbinary (small) x 66.47 ops/sec ±0.22% (15 runs sampled)
Memory: 4.35546875
Benchmarking 'varbinary (large)'
varbinary (large) x 13.04 ops/sec ±2.24% (60 runs sampled)
Memory: 20.3046875
Benchmarking 'varbinary (huge)'
varbinary (huge) x 1.36 ops/sec ±1.47% (11 runs sampled)
Memory: 183.859375
Done!

So, fetching a 5MB varbinary column out of the database is around 6 times faster (1.89 → 13.04 ops/sec), while for a 50MB varbinary, the streaming token parser is more than 67 times faster (0.02 → 1.36 ops/sec).

Next steps will be to land all the performance improvements I made in the dissolve module, release a new version of that, and then clean up this PR.

@mbroadst

Is any further work being done on this? It seems like great work.

@arthurschreiber
Collaborator Author

Yeah, I've not forgotten about this. I'm also thinking about rewriting the part of tedious that serializes data back to the stream to use the concentrate module, to reduce memory usage and hopefully also improve performance.
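
For reference, concentrate is the write-side counterpart to dissolve. A minimal sketch of its chainable API (the field layout here is made up, not the actual TDS wire format):

```coffee
Concentrate = require "concentrate"

# Build a small binary message without manually allocating
# and copying buffers. The layout is invented for illustration.
data = Concentrate()
  .uint8(0x01)              # a made-up token type
  .uint16le(5)              # payload length
  .string("hello", "utf8")  # payload
  .result()                 # returns a Buffer
```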

What currently bugs me a bit is that both the old and the new parser are (subjectively) super fugly. Does anyone have an idea of how we could better organize the code?

@pekim
Collaborator

pekim commented Jan 26, 2015

Does this mean that the horrible, nasty, ugly try/catch in token-stream-parser will go? That would be awesome. It's always bugged me.

(It would also mean that the Tokens that span Packets page could go too.)

@arthurschreiber
Collaborator Author

@pekim Yup, that'll be completely gone. That (and the ensuing reparsing of tokens) is one of the reasons memory usage and performance really go down the drain.
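
For context, the pattern in question looks roughly like this (a schematic of the approach, not the actual token-stream-parser source):

```coffee
# Schematic of the old retry-on-underrun approach, for illustration only.
# parseTokens stands in for the real token parsing routine.
class RetryingParser
  constructor: ->
    @buffer = new Buffer(0)

  addMessage: (data) ->
    @buffer = Buffer.concat([@buffer, data])
    try
      # Re-parses all buffered tokens from the start on every new
      # message: quadratic work when a value spans many messages.
      parseTokens(@buffer)
    catch error
      # Ran past the end of the buffer; wait for more data and retry.
      return
```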

@mbroadst

@arthurschreiber imho you're trading one fugly solution for another fugly solution that improves performance/memory consumption, so I'd just merge the code in and then worry about making it look pretty 😄. If you can get it passing all tests, I'd say it's prime for merging (and the concentrate change can come in another PR).

@arthurschreiber
Collaborator Author

I'll try to whip this into a more acceptable state over the course of next week and make it available for others to take for a test ride. Sorry for the delay; I know many people are really looking forward to this.

@arthurschreiber
Collaborator Author

Hey all. Just wanted to let you know that I'm still working on this. I made some performance and memory improvements to the dissolve module I'm using here, which I need to upstream first. Afterwards, I'll finish up this PR for final review.

I'll try to get at it this weekend.

This method is only used by the tests and probably should be removed
in the long term.
@arthurschreiber
Collaborator Author

Okay, I just rebased all the changes here against the latest master changes and fixed up the unit tests to pass.

Here's the latest benchmark result:

$ node_modules/.bin/coffee benchmarks/query.coffee
Benchmarking 'nvarchar (small)'
nvarchar (small) x 69.54 ops/sec ±0.46% (27 runs sampled)
Memory: 4.91796875
Benchmarking 'nvarchar (large)'
nvarchar (large) x 8.70 ops/sec ±2.61% (45 runs sampled)
Memory: 12.40625
Benchmarking 'varbinary (small)'
varbinary (small) x 65.67 ops/sec ±0.20% (11 runs sampled)
Memory: 5.01171875
Benchmarking 'varbinary (4)'
varbinary (4) x 65.82 ops/sec ±0.23% (12 runs sampled)
Memory: 0.55859375
Benchmarking 'varbinary (large)'
varbinary (large) x 19.77 ops/sec ±1.94% (49 runs sampled)
Memory: 41.0078125
Benchmarking 'varbinary (huge)'
varbinary (huge) x 2.27 ops/sec ±1.69% (15 runs sampled)
Memory: 92.015625
Done!

These results are from running on Node.js 0.10.38. Overall, these are incredible performance and memory improvements over what tedious offered before when reading large results back from the database, and there's no noticeable impact when reading small results.

The next step would be to make the sending side stream-based as well, but I want that to happen in a separate PR.

@bretcope @pekim @patriksimek What do you guys think?

@lee-houghton
Contributor

Just out of curiosity... are these changes likely to make it easier to implement streaming of large column values, such as varbinary(max) columns? (In ODBC it's possible by calling SQLGetData repeatedly with a small buffer.)


@arthurschreiber
Collaborator Author

@lee-houghton In theory, this might be possible. I just don't know what the API would have to look like to support this. I'm also not familiar with ODBC.
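
Purely as a strawman (the columnStream event below is entirely made up; nothing like it exists in tedious today), it might look something like this:

```coffee
fs = require "fs"
{Connection, Request} = require "tedious"

connection = new Connection(config)  # config as usual

# Hypothetical: expose large column values as readable streams
# instead of buffering the whole value in memory.
request = new Request "SELECT data FROM blobs WHERE id = 1", (err) ->
  connection.close()

# "columnStream" is invented for this sketch; no such event exists.
request.on "columnStream", (columnName, stream) ->
  stream.pipe fs.createWriteStream("blob.bin")

connection.execSql request
```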

@mbroadst

I can't see a reason not to merge this if you're passing all the tests and you are seeing that kind of increase, but I might be missing something. So, LGTM 👍 from me!

@bretcope
Member

Sounds good to me.

@patriksimek
Collaborator

@arthurschreiber this is great, thanks!

Please take a look at the mssql test suite; there are a few tests that don't pass. There's a quick guide on how to set up the test config.

When I was trying to debug the protocol to make sure it's not an mssql issue, I found that the config.options.debug.token switch doesn't work. It is probably caused by this line missing in the PR.

Please let me know if I can be helpful.
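
For anyone who wants to reproduce this: token-level debug output is enabled through the connection config and surfaced via the connection's debug event. A minimal sketch (server and credentials are placeholders):

```coffee
Connection = require("tedious").Connection

connection = new Connection
  server: "localhost"     # placeholder
  userName: "user"        # placeholder
  password: "secret"      # placeholder
  options:
    debug:
      token: true         # log a line for every parsed token

connection.on "debug", (message) ->
  console.log message
```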

@arthurschreiber
Collaborator Author

@patriksimek Thanks for the heads-up. Looks like the tedious test suite has some holes. I'll check what running the mssql tests shows.

@arthurschreiber
Collaborator Author

@patriksimek I pushed some additional changes that should fix the node-mssql issues. There are still some issues left, but I believe they are caused by node-mssql assuming XACT_ABORT is on by default (which it is not).

@patriksimek
Collaborator

@arthurschreiber thanks, fixed that. I had it enabled in my config by mistake. There are still some issues with local datetime values. I'm not sure if you can see them, as it probably depends on your current time zone. I'll try to investigate.

@patriksimek
Collaborator

I have no idea how to patch this PR, so I made a gist with a changed value-parser.coffee. With that, all tests are passing.

https://gist.github.com/patriksimek/a0dabc66e41657af5612

@arthurschreiber
Collaborator Author

@patriksimek Thanks. I'll try to incorporate that change ASAP.

@arthurschreiber
Collaborator Author

Closed in favor of #285.

@chdh mentioned this pull request on Feb 23, 2017