[WIP] Stream based token parser #225

Closed
arthurschreiber wants to merge 15 commits

Conversation

arthurschreiber
Collaborator

This replaces the token parser inside tedious with a stream based version based on the dissolve module.

This is pretty much a big ugly hack at the moment, but it passes all tedious unit and integration tests. (The application I'm working on also passes all its tests with the new parser.) Beware, here be dragons.

The idea behind these changes is to make the token parsing code more performant. Right now, if a token is split over multiple TDS messages, parsing is retried from scratch each time a new message is received. Obviously, this results in incredibly bad performance.

The dissolve module gives us a way to do chunk-based parsing of tokens without having to throw away partial parsing results when tokens span multiple messages.
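
For anyone unfamiliar with dissolve, here's a minimal sketch of the idea (illustrative only, not tedious code): tokens with a 1-byte type, a 2-byte little-endian length, and a payload, parsed correctly across arbitrary chunk boundaries:

```coffee
Dissolve = require "dissolve"

# Each token: 1-byte type, 2-byte little-endian length, then payload.
parser = Dissolve().loop ->
  @uint8("type").uint16le("length").tap ->
    @buffer("data", @vars.length).tap ->
      @push @vars
      @vars = {}

parser.on "data", (token) ->
  console.log token

# Parser state survives across writes, so a token split over two
# chunks is parsed exactly once, with no reparsing from scratch.
parser.write new Buffer([0x01, 0x03, 0x00, 0x61])
parser.write new Buffer([0x62, 0x63])
```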

Here are some performance numbers (see benchmarks/query.coffee):

Old Parser

$ node_modules/.bin/coffee benchmarks/query.coffee
Benchmarking 'nvarchar (small)'
nvarchar (small) x 68.06 ops/sec ±0.32% (22 runs sampled)
Memory: 5.4609375
Benchmarking 'nvarchar (large)'
nvarchar (large) x 1.84 ops/sec ±1.45% (14 runs sampled)
Memory: 19.265625
Benchmarking 'varbinary (small)'
varbinary (small) x 66.09 ops/sec ±0.14% (14 runs sampled)
Memory: 3.6171875
Benchmarking 'varbinary (large)'
varbinary (large) x 1.89 ops/sec ±1.64% (14 runs sampled)
Memory: 242.21484375
Benchmarking 'varbinary (huge)'
varbinary (huge) x 0.02 ops/sec ±9.94% (5 runs sampled)
Memory: 324.49609375
Done!

New Parser

$ node_modules/.bin/coffee benchmarks/query.coffee
Benchmarking 'nvarchar (small)'
nvarchar (small) x 69.67 ops/sec ±0.50% (29 runs sampled)
Memory: 7.828125
Benchmarking 'nvarchar (large)'
nvarchar (large) x 7.01 ops/sec ±1.82% (37 runs sampled)
Memory: 31.6328125
Benchmarking 'varbinary (small)'
varbinary (small) x 65.99 ops/sec ±0.19% (13 runs sampled)
Memory: 6.296875
Benchmarking 'varbinary (large)'
varbinary (large) x 8.09 ops/sec ±2.41% (41 runs sampled)
Memory: 51.69140625
Benchmarking 'varbinary (huge)'
varbinary (huge) x 0.80 ops/sec ±1.13% (8 runs sampled)
Memory: 276.75390625
Done!

As you can see, the switch to dissolve hugely improves performance when receiving data that spans multiple TDS messages, while not impacting the performance of data that is contained inside a single message.

There is a slight increase in the memory usage reported in the benchmarks. I'm not yet sure what is causing this, but I'll take a closer look.

Overall, this is definitely not ready yet, but it's meant to show off some of the work I've been doing and to gather some feedback.

@arthurschreiber
Collaborator Author

The unit tests are failing right now, I'll get that fixed up ASAP. But the integration tests are all passing fine.

@arthurschreiber
Collaborator Author

The latest performance numbers:

Benchmarking 'nvarchar (small)'
nvarchar (small) x 75.82 ops/sec ±0.47% (51 runs sampled)
Memory: 6.64453125
Benchmarking 'nvarchar (large)'
nvarchar (large) x 9.07 ops/sec ±2.04% (46 runs sampled)
Memory: 20.6328125
Benchmarking 'varbinary (small)'
varbinary (small) x 66.26 ops/sec ±0.17% (14 runs sampled)
Memory: 4.609375
Benchmarking 'varbinary (large)'
varbinary (large) x 10.58 ops/sec ±2.98% (51 runs sampled)
Memory: 19.91796875
Benchmarking 'varbinary (huge)'
varbinary (huge) x 1.03 ops/sec ±1.67% (10 runs sampled)
Memory: 191.42578125
Done!

This requires some patches to the dissolve module. I'll open a PR for those shortly on its GitHub project page.

@bretcope
Member

@arthurschreiber are you interested in becoming a maintainer on Tedious? Quite frankly, we could use a couple of new maintainers who are smart and actively using Tedious. @rossipedia and I were making significant contributions for a while, but we ended up porting the project we were using it for away from node.js, so our time and motivation to maintain Tedious have dropped off quite a bit. @patriksimek is still around here and there, but no one has the time to give it the attention it deserves.

You've started tackling streams, performance, and testing concerns, which have all been on our wish list for a very long time. If your intentions are to continue contributing improvements like this, then I don't think anyone would have a problem adding you as a collaborator.

@arthurschreiber
Collaborator Author

@bretcope Sure, why not. 😄 I also got added as a maintainer for the dissolve module that I've been using in this POC.

With some more changes that I made to the dissolve module, I get the following performance numbers now:

Benchmarking 'nvarchar (small)'
nvarchar (small) x 69.76 ops/sec ±0.37% (29 runs sampled)
Memory: 6.1796875
Benchmarking 'nvarchar (large)'
nvarchar (large) x 10.65 ops/sec ±1.27% (52 runs sampled)
Memory: 11.9140625
Benchmarking 'varbinary (small)'
varbinary (small) x 66.47 ops/sec ±0.22% (15 runs sampled)
Memory: 4.35546875
Benchmarking 'varbinary (large)'
varbinary (large) x 13.04 ops/sec ±2.24% (60 runs sampled)
Memory: 20.3046875
Benchmarking 'varbinary (huge)'
varbinary (huge) x 1.36 ops/sec ±1.47% (11 runs sampled)
Memory: 183.859375
Done!

So, fetching a 5MB varbinary column out of the database is around 6 times faster (1.89 → 13.04 ops/sec), while for a 50MB varbinary, the streaming token parser is more than 67 times faster (0.02 → 1.36 ops/sec).

Next steps will be to land all the performance improvements I made in the dissolve module, release a new version of that, and then clean up this PR.

@mbroadst

Is any further work being done on this? It seems like great work.

@arthurschreiber
Collaborator Author

Yeah, I've not forgotten about this. I'm also thinking about rewriting the part of tedious that serializes data back to the stream to use the concentrate module, to reduce memory usage and hopefully also improve performance.
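
For reference, concentrate is the write-side counterpart to dissolve. A minimal sketch of its chainable API (the field layout here is made up, not the actual TDS wire format):

```coffee
Concentrate = require "concentrate"

# Build a small binary message without manually allocating
# and copying buffers. The layout is invented for illustration.
data = Concentrate()
  .uint8(0x01)              # a made-up token type
  .uint16le(5)              # payload length
  .string("hello", "utf8")  # payload
  .result()                 # returns a Buffer
```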

What currently bugs me a bit is that both the old and the new parser are (subjectively) super fugly. Does anyone have an idea of how we could better organize the code?

@pekim
Collaborator

pekim commented Jan 26, 2015

Does this mean that the horrible, nasty, ugly try/catch in token-stream-parser will go? That would be awesome. It's always bugged me.

(It would also mean that the Tokens that span Packets page could go too.)

@arthurschreiber
Collaborator Author

@pekim Yup, that'll be completely gone. That (and the ensuing reparsing of tokens) is one of the reasons memory usage and performance really go down the drain.
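
For context, the pattern in question looks roughly like this (a schematic of the approach, not the actual token-stream-parser source):

```coffee
# Schematic of the old retry-on-underrun approach, for illustration only.
# parseTokens stands in for the real token parsing routine.
class RetryingParser
  constructor: ->
    @buffer = new Buffer(0)

  addMessage: (data) ->
    @buffer = Buffer.concat([@buffer, data])
    try
      # Re-parses all buffered tokens from the start on every new
      # message: quadratic work when a value spans many messages.
      parseTokens(@buffer)
    catch error
      # Ran past the end of the buffer; wait for more data and retry.
      return
```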

@mbroadst

@arthurschreiber imho you're trading one fugly solution for another fugly solution that improves performance/memory consumption, so I'd just merge the code in and then worry about making it look pretty 😄. If you can get it passing all tests, I'd say it's prime for merging (and the concentrate change can come in another PR).

@arthurschreiber
Collaborator Author

I'll try to whip this into a more acceptable state over the course of next week and make it available for others to take for a test ride. Sorry for the delay; I know many people are really looking forward to this.

@arthurschreiber
Collaborator Author

Hey all. Just wanted to let you know that I'm still working on this. I made some performance and memory improvements to the dissolve module I'm using here, which I need to upstream first. Afterwards, I'll finish up this PR for final review.

I'll try to get at it this weekend.

This method is only used by the tests and probably should be removed
in the long term.
@arthurschreiber
Collaborator Author

Okay, I just rebased all the changes here against the latest master changes and fixed up the unit tests to pass.

Here's the latest benchmark result:

$ node_modules/.bin/coffee benchmarks/query.coffee
Benchmarking 'nvarchar (small)'
nvarchar (small) x 69.54 ops/sec ±0.46% (27 runs sampled)
Memory: 4.91796875
Benchmarking 'nvarchar (large)'
nvarchar (large) x 8.70 ops/sec ±2.61% (45 runs sampled)
Memory: 12.40625
Benchmarking 'varbinary (small)'
varbinary (small) x 65.67 ops/sec ±0.20% (11 runs sampled)
Memory: 5.01171875
Benchmarking 'varbinary (4)'
varbinary (4) x 65.82 ops/sec ±0.23% (12 runs sampled)
Memory: 0.55859375
Benchmarking 'varbinary (large)'
varbinary (large) x 19.77 ops/sec ±1.94% (49 runs sampled)
Memory: 41.0078125
Benchmarking 'varbinary (huge)'
varbinary (huge) x 2.27 ops/sec ±1.69% (15 runs sampled)
Memory: 92.015625
Done!

These results are from running on Node.js 0.10.38. Overall, these are incredible performance and memory improvements over what tedious offered before when reading large results back from the database, and there's no noticeable impact when reading small results.

The next step would be to make the sending side stream-based as well, but I want that to happen in a separate PR.

@bretcope @pekim @patriksimek What do you guys think?

@lee-houghton
Contributor

Just out of curiosity... are these changes likely to make it easier to implement streaming of large column values, such as varbinary(max) columns? (In ODBC it's possible by calling SQLGetData repeatedly with a small buffer.)


@arthurschreiber
Collaborator Author

@lee-houghton In theory, this might be possible. I just don't know what the API would have to look like to support this. I'm also not familiar with ODBC.
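
Purely as a strawman (the columnStream event below is entirely made up; nothing like it exists in tedious today), it might look something like this:

```coffee
fs = require "fs"
{Connection, Request} = require "tedious"

connection = new Connection(config)  # config as usual

# Hypothetical: expose large column values as readable streams
# instead of buffering the whole value in memory.
request = new Request "SELECT data FROM blobs WHERE id = 1", (err) ->
  connection.close()

# "columnStream" is invented for this sketch; no such event exists.
request.on "columnStream", (columnName, stream) ->
  stream.pipe fs.createWriteStream("blob.bin")

connection.execSql request
```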

@mbroadst

I can't see a reason not to merge this if you're passing all the tests and you are seeing that kind of increase, but I might be missing something. So, LGTM 👍 from me!

@bretcope
Member

Sounds good to me.

@patriksimek
Collaborator

@arthurschreiber this is great, thanks!

Please take a look at the mssql test suite; there are a few tests that don't pass. There's a quick guide on how to set up the test config.

When I was trying to debug the protocol to make sure it's not an mssql issue, I found that the config.options.debug.token switch doesn't work. It is probably caused by this line missing in the PR.

Please let me know if I can be helpful.
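
For anyone who wants to reproduce this: token-level debug output is enabled through the connection config and surfaced via the connection's debug event. A minimal sketch (server and credentials are placeholders):

```coffee
Connection = require("tedious").Connection

connection = new Connection
  server: "localhost"     # placeholder
  userName: "user"        # placeholder
  password: "secret"      # placeholder
  options:
    debug:
      token: true         # log a line for every parsed token

connection.on "debug", (message) ->
  console.log message
```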

@arthurschreiber
Collaborator Author

@patriksimek Thanks for the heads-up. Looks like the tedious test suite has some holes. I'll check what running the mssql tests shows.

@arthurschreiber
Collaborator Author

@patriksimek I pushed some additional changes that should fix the node-mssql issues. There are still some issues left, but I believe they are caused by node-mssql assuming XACT_ABORT is on by default (which it is not).

@patriksimek
Collaborator

@arthurschreiber thanks, fixed that. I had it enabled in my config by mistake. There are still some issues with local datetime values. I'm not sure if you can see them, as it probably depends on your current time zone. I'll try to investigate.

@patriksimek
Collaborator

I have no idea how to patch this PR, so I made a gist with a changed value-parser.coffee. With that, all tests are passing.

https://gist.github.com/patriksimek/a0dabc66e41657af5612

@arthurschreiber
Collaborator Author

@patriksimek Thanks. I'll try to incorporate that change ASAP.

@arthurschreiber
Collaborator Author

Closed in favor of #285.

@chdh mentioned this pull request on Feb 23, 2017