Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accelerated worksheet parsing #421

Merged
merged 24 commits into from
Dec 19, 2019
Merged

Accelerated worksheet parsing #421

merged 24 commits into from
Dec 19, 2019

Conversation

Crzyrndm
Copy link
Collaborator

@Crzyrndm Crzyrndm commented Nov 16, 2019

I've had reason to utilise xlnt again recently, this time commonly with MB+ size xlsx files and the slow loading time was really starting to bite. Did some benchmarking, found the majority of the time taken was spent in the XML parsing library while (at the time) creating the xlnt::worksheet was only ~10-15% of the workload

Following the examples from libstudxml docs, I rewrote the segment of the parser that loads rows/cells. This roughly tripled (3x) the throughput of the XML library, and cut the load-spreadsheet benchmark time nearly in half (~60% of original). This commit adds the benchmark used to establish the improvements before any other changes are made

Benchmarking also highlighted number_converter::stold. Replacing it with strtod cut another ~15% (3-3.1s to ~2.6s loading the very_large.xlsx file on my machine). strtod unfortunately runs in the users locale and calling setlocale is a huge can of worms. However strtod_l or variants of is available on all major platforms and takes a locale parameter. Unfortunately, the story of which header to include in linux vs BSD/mac is not so straight forward (locale.h vs xlocale.h). Therefore this optmisation is currently only used with the toolchain I could get clear documentation for (_MSC_VER >= 1900 => MSVC 2015+). All other toolchains will use the glacially slow std::istringstream to convert string -> double. Probably best resolved by using an external library to make the conversions

With both improvements loading times are ~50% of current master (2x throughput)

  • large.xlsx (2MB): 2.4s -> 1.2s
  • very_large.xlsx (4.5MB): 5.2s -> 2.7s

@tfussell
Copy link
Owner

This is great, thanks! Will try to get this merged in soon.

@coveralls
Copy link

coveralls commented Nov 16, 2019

Coverage Status

Coverage increased (+0.04%) to 83.558% when pulling f2ad495 on Crzyrndm:experimental/sheet-data-parser into b221531 on tfussell:dev.

this was being done already in most cases, and allows some simplification
e.g. no need to check if something is already present, since we're starting with a blank
@Crzyrndm
Copy link
Collaborator Author

Have cleaned up and commented the code, much less experimental/hacky than the original PR. There's enough low hanging fruit here that I'm pretty sure load times could be halved again (e.g. by using a background thread to parse the XML and feeding the output into xlnt::worksheet in the main thread I'd estimate another +30-50% throughput is available. Re-doing the rest of the XML parsing would assist with string heavy worksheets. etc.)

It's working at the level I need it though (data processing is now outweighing data loading...), so I'll be stopping here for now (pending review comments). Opened #422 for the number_converter speed issues

@Crzyrndm
Copy link
Collaborator Author

Crzyrndm commented Nov 18, 2019

Implemented copy & replace of '.' when in a comma locale as discussed in #422, fixed a number of locations where xlsx_consumer was relying on attribute<double> (which is just stringstream again so had locale AND speed issues)

xlsx_producer is definitely NOT locale aware, since it uses libstudxml for quite a few conversions, but that's a battle for another day

@Crzyrndm
Copy link
Collaborator Author

@tfussell
somewhat related side note, this library currently has very few external dependencies (just libstudxml?). Doing the work for this PR had me looking at other libs in both supporting and performance roles (e.g. testing / benchmarking (1, 2), alternative non-std containers (particularly hashmaps which this lib uses heavily)

Do you have a strong opinion / guidelines around adding dependencies (my first recommendation after this PR would be google test / benchmark. Both would assist greatly if I was to continue making improvements to xlsx loading)

@tfussell tfussell merged commit e2262a0 into tfussell:dev Dec 19, 2019
@tfussell
Copy link
Owner

@Crzyrndm Sorry it took so long to get this merged in. As far as dependencies, I'm totally open to that. What I do want to avoid with this library is depending on system libraries (like requiring Ubuntu users to do apt-get install libxml). I guess this means vendoring the libraries like I've done with libstdxml.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants