Accelerated worksheet parsing #421

Crzyrndm · 2019-11-16T02:33:15Z

I've had reason to use xlnt again recently, this time mostly with xlsx files of a megabyte or more, and the slow loading time was really starting to bite. I did some benchmarking and found that the majority of the time was spent in the XML parsing library, while (at the time) creating the xlnt::worksheet was only ~10-15% of the workload.

Following the examples from the libstudxml docs, I rewrote the segment of the parser that loads rows/cells. This roughly tripled (3x) the throughput of the XML library and cut the load-spreadsheet benchmark time to ~60% of the original. This commit adds the benchmark used to establish the improvements before any other changes are made.
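For anyone wanting to reproduce this kind of measurement, a minimal timing sketch follows. It is not the benchmark added in this PR; `mean_run_time_ms` is a hypothetical helper that times any callable, so it makes no assumptions about the xlnt API.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper (not part of xlnt): runs `work` `runs` times and
// returns the mean wall-clock duration of one run in milliseconds.
template <typename Work>
double mean_run_time_ms(Work &&work, std::size_t runs)
{
    assert(runs > 0); // a mean over zero runs is undefined

    using clock = std::chrono::steady_clock; // monotonic, unaffected by clock changes
    const auto start = clock::now();
    for (std::size_t i = 0; i < runs; ++i)
    {
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / static_cast<double>(runs);
}
```

In a load benchmark the callable would typically construct a fresh workbook and load the file under test on each run, so that no state carries over between iterations.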

Benchmarking also highlighted number_converter::stold. Replacing it with strtod cut another ~15% (3-3.1s to ~2.6s loading the very_large.xlsx file on my machine). Unfortunately, strtod runs in the user's locale, and calling setlocale is a huge can of worms. However, strtod_l (or a variant of it) is available on all major platforms and takes a locale parameter. The story of which header to include on Linux vs BSD/Mac is not so straightforward, though (locale.h vs xlocale.h). Therefore this optimisation is currently only used with the toolchain I could get clear documentation for (_MSC_VER >= 1900 => MSVC 2015+). All other toolchains will use the glacially slow std::istringstream to convert string -> double. This is probably best resolved by using an external library to make the conversions.
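A rough sketch of the split described above, for illustration only. `parse_double_c_locale` is a hypothetical name and this is not the code in the PR: on MSVC 2015+ it calls `_strtod_l` with an explicit "C" locale, and everywhere else it falls back to a `std::istringstream` imbued with the classic locale.

```cpp
#include <locale>
#include <sstream>
#include <stdexcept>
#include <string>
#if defined(_MSC_VER) && _MSC_VER >= 1900
#include <locale.h>
#include <stdlib.h>
#endif

// Hypothetical helper, not xlnt's number_converter: parses `text` as a
// double using '.' as the decimal separator regardless of the user's locale.
inline double parse_double_c_locale(const std::string &text)
{
#if defined(_MSC_VER) && _MSC_VER >= 1900
    // Fast path: _strtod_l takes the locale as a parameter, so the global
    // locale is never touched. The locale object is created once and reused.
    static const _locale_t c_locale = _create_locale(LC_ALL, "C");
    char *end = nullptr;
    const double value = _strtod_l(text.c_str(), &end, c_locale);
    if (end == text.c_str())
    {
        throw std::invalid_argument("not a number: " + text);
    }
    return value;
#else
    // Slow but portable fallback: a stream imbued with the classic ("C") locale.
    std::istringstream stream(text);
    stream.imbue(std::locale::classic());
    double value = 0.0;
    if (!(stream >> value))
    {
        throw std::invalid_argument("not a number: " + text);
    }
    return value;
#endif
}
```

Both branches accept leading numeric text and ignore trailing characters; a stricter version would also check that the whole string was consumed.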

With both improvements loading times are ~50% of current master (2x throughput)

large.xlsx (2MB): 2.4s -> 1.2s
very_large.xlsx (4.5MB): 5.2s -> 2.7s

tfussell · 2019-11-16T02:39:03Z

This is great, thanks! Will try to get this merged in soon.

coveralls · 2019-11-16T02:44:47Z

Coverage increased (+0.04%) to 83.558% when pulling f2ad495 on Crzyrndm:experimental/sheet-data-parser into b221531 on tfussell:dev.

Crzyrndm · 2019-11-17T02:47:17Z

Have cleaned up and commented the code, so it is much less experimental/hacky than the original PR. There's enough low hanging fruit here that I'm pretty sure load times could be halved again. For example, by using a background thread to parse the XML and feeding the output into xlnt::worksheet in the main thread, I'd estimate another +30-50% throughput is available. Re-doing the rest of the XML parsing would help with string heavy worksheets, etc.

It's working at the level I need it though (data processing is now outweighing data loading...), so I'll be stopping here for now (pending review comments). Opened #422 for the number_converter speed issues

resolves #398

Crzyrndm · 2019-11-18T08:51:58Z

Implemented copy & replace of '.' when in a comma locale, as discussed in #422. Also fixed a number of locations where xlsx_consumer was relying on attribute<double> (which is just stringstream again, so it had locale AND speed issues).
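The copy & replace idea can be sketched roughly as below. This is an illustration under assumptions, not the PR's implementation: `parse_double_current_locale` is a hypothetical name, and the sketch assumes the locale's decimal separator is a single byte.

```cpp
#include <algorithm>
#include <clocale>
#include <cstdlib>
#include <string>

// Hypothetical helper: xlsx stores numbers with '.' as the decimal point,
// but std::strtod expects the current C locale's separator. The argument is
// taken by value so the caller's string is left untouched (the "copy"),
// then '.' is swapped for the locale separator (the "replace").
inline double parse_double_current_locale(std::string text)
{
    const std::lconv *conv = std::localeconv();
    // Assumption: single-byte separator; fall back to '.' if none is reported.
    const char separator =
        (conv != nullptr && conv->decimal_point != nullptr && conv->decimal_point[0] != '\0')
            ? conv->decimal_point[0]
            : '.';
    if (separator != '.')
    {
        std::replace(text.begin(), text.end(), '.', separator);
    }
    return std::strtod(text.c_str(), nullptr);
}
```

In a '.' locale the scan is skipped entirely, so the extra cost is only paid when the separator actually differs.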

xlsx_producer is definitely NOT locale aware, since it uses libstudxml for quite a few conversions, but that's a battle for another day

Crzyrndm · 2019-11-19T08:35:15Z

@tfussell
Somewhat related side note: this library currently has very few external dependencies (just libstudxml?). Doing the work for this PR had me looking at other libs in both supporting and performance roles, e.g. testing / benchmarking (1, 2) and alternative non-std containers (particularly hashmaps, which this lib uses heavily).

Do you have a strong opinion / guidelines around adding dependencies? My first recommendation after this PR would be google test / benchmark. Both would assist greatly if I were to continue making improvements to xlsx loading.

tfussell · 2019-12-19T21:28:10Z

@Crzyrndm Sorry it took so long to get this merged in. As far as dependencies, I'm totally open to that. What I do want to avoid with this library is depending on system libraries (like requiring Ubuntu users to do apt-get install libxml). I guess this means vendoring the libraries like I've done with libstudxml.

Crzyrndm added 12 commits November 16, 2019 11:25

add sheet load time benchmark

fa58994

rewrite the sheet data xml parse logic

b27e7fe

a few hacks still, but a very noticeable speed up (2.2 -> 1.1 seconds on large.xlsx)

handle whitespace in xml (e.g. '\n' because sheet was pretty printed)

d83ed0b

parse height as a double

8eda9f2

slightly optimise cell parsing routine, fix formula being incorrectly loaded

ec29e22

fix compile error due to reference to r-value that MSVC silently lifetime extended

c26a59f

and another silent one...

ad7933d

resolve some warnings

ea532c5

specialise number_converter when strtod_l is available

e059d25

in load benchmark, not using the specialisation adds ~10% to execution time

remove all usages of strtod_c

9d687ea

matchup integer types

96beae4

and more warnings suppressed

a580079

Crzyrndm added 5 commits November 16, 2019 19:49

exceptions, not asserts

2307ed4

cleanup and comments

001606a

specialised string equality template for literals strings

600cc9d

1-2% improvement seen locally

move helper functions and types to top of file ( namespace {} )

2b61cac

skip the user facing types, deal direct with the impls

a6fd7cc

this was being done already in most cases, and allows some simplification e.g. no need to check if something is already present, since we're starting with a blank

Crzyrndm mentioned this pull request Nov 17, 2019

number_converter is slow #422

Closed

Crzyrndm added 7 commits November 18, 2019 19:25

move numeric utils into the public headers

2eb88c2

resolves #398

scan and replace '.' with ',' when decimal separator is comma

d69a5de

fix wrong iterator bug

7621b28

fix using attribute<double> (causes bugs when '.' is not the decimal separator

a25187f

fixed more parsing errors, added test for ',' locale serialisation (it fails...)

9784141

add missing include

613d7b6

resolve warnings, remove failing test (CI doesn't know what locale "de-DE" is?)

f2ad495

tfussell merged commit e2262a0 into tfussell:dev Dec 19, 2019

Crzyrndm deleted the experimental/sheet-data-parser branch February 29, 2020 02:22

Crzyrndm mentioned this pull request Mar 1, 2020

microbenchmarks for double<->string conversion, serialisation improvements #447

Merged

Crzyrndm mentioned this pull request Apr 26, 2020

locale aware double->string conversions #467

Merged
