Accelerated worksheet parsing #421
Conversation
A few hacks remain, but there is a very noticeable speed-up (2.2 s -> 1.1 s on large.xlsx).
In the load benchmark, not using the specialisation adds ~10% to execution time.
This is great, thanks! Will try to get this merged in soon.
1-2% improvement seen locally
This was being done already in most cases, and allows some simplification, e.g. there is no need to check whether something is already present, since we're starting with a blank.
Have cleaned up and commented the code; it is much less experimental/hacky than the original PR. There's enough low-hanging fruit here that I'm pretty sure load times could be halved again. For example, by using a background thread to parse the XML and feeding the output into xlnt::worksheet in the main thread, I'd estimate another 30-50% throughput is available. Redoing the rest of the XML parsing would assist with string-heavy worksheets, etc. It's working at the level I need, though (data processing now outweighs data loading...), so I'll be stopping here for now (pending review comments). Opened #422 for the number_converter speed issues.
Implemented copy & replace of '.' when in a comma locale, as discussed in #422. Fixed a number of locations where xlsx_consumer was relying on attribute&lt;double&gt; (which is just stringstream again, so it had both locale AND speed issues). xlsx_producer is definitely NOT locale aware, since it uses libstudxml for quite a few conversions, but that's a battle for another day.
@tfussell Do you have a strong opinion / guidelines around adding dependencies? (My first recommendation after this PR would be Google Test / Google Benchmark; both would assist greatly if I were to continue making improvements to xlsx loading.)
@Crzyrndm Sorry it took so long to get this merged in. As far as dependencies, I'm totally open to that. What I do want to avoid with this library is depending on system libraries (like requiring Ubuntu users to do …
I've had reason to utilise xlnt again recently, this time commonly with MB+ size xlsx files, and the slow loading time was really starting to bite. Some benchmarking found that the majority of the time was spent in the XML parsing library, while (at the time) creating the xlnt::worksheet was only ~10-15% of the workload.
Following the examples from the libstudxml docs, I rewrote the segment of the parser that loads rows/cells. This roughly tripled (3x) the throughput of the XML library and cut the load-spreadsheet benchmark time nearly in half (~60% of the original). This commit adds the benchmark used to establish the improvements before any other changes are made.
Benchmarking also highlighted number_converter::stold. Replacing it with strtod cut another ~15% (3-3.1 s down to ~2.6 s loading the very_large.xlsx file on my machine). strtod unfortunately runs in the user's locale, and calling setlocale is a huge can of worms. However, strtod_l (or a variant of it) is available on all major platforms and takes a locale parameter. Unfortunately, which header to include on Linux vs BSD/macOS is not so straightforward (locale.h vs xlocale.h), so this optimisation is currently only used with the toolchain I could find clear documentation for (_MSC_VER >= 1900, i.e. MSVC 2015+). All other toolchains will use the glacially slow std::istringstream to convert string -> double. This is probably best resolved by using an external library for the conversions.
With both improvements, loading times are ~50% of current master (2x throughput).