Add parser callback with the ability to filter results. #41
Conversation
…o that streams are read incrementally.
…ssed, including the ability to reject individual elements.
I like the idea and I love the way you improved the parser, but my first experiments show that the runtime is twice as high. Consider the following code:

```cpp
#include <json.hpp>
#include <fstream>

int main(int argc, char** argv)
{
    std::ifstream input_file(argv[1]);
    nlohmann::json j;
    j << input_file;
}
```

For the old version, I can read https://github.com/miloyip/nativejson-benchmark/blob/master/data/canada.json in 140 ms. The new parser takes 260 ms. I used clang 3.6 with …
I agree the default callback is probably adding some time and I think it's worth investigating, but I doubt that it would double the time (see below). I declared the default callback as static, but I don't recall why; now I don't think it adds anything, and I wonder if it may prevent inlining. I suspect the biggest performance hit comes from using the … If the performance penalty can't be eliminated, the parse with callback could be added as a totally separate function. Or you could even leave it out of your main code and include it as a user-contributed patch for those who could use it. It's critical to me for parsing a 1+ GB file, but my case is probably uncommon.
Hi Aaron, thanks for answering! First off - please ignore the messages by AppVeyor. I am currently trying to find out whether MSVC 2015's C++11 support is as good as some people claim... Second, I'll check the pull request as soon as I can find the time. This weekend, I tried to build a version of … I understand your use case, but - if possible - I would like to get an idea of the input data. Getting a 1+ GB JSON file with a real-life task would be a nice benchmark - especially as it is not only about how many milliseconds it takes to create an object. All the best
I ran a debug build in Apple's Instruments to profile your test program with the canada.json data. It spent 1.4% of the time in the default callback, so I think we can ignore it as the problem. I also reminded myself why it's static: it's a user-supplied function and thus not a member function, so it has to be static when defined in its current location, and I put it inside the class declaration so that it could easily specify a … I tried changing the … The program spent a lot of time in push_back and destructing map and vector containers. Looking at the patch diff, I suspect this change is significant:

```diff
- result.push_back(parse_internal());
+ auto value = parse_internal(keep);
+ if (keep and not value.is_discarded())
+ {
+     result.push_back(value);
+ }
```

The sample data contains a lot of arrays, and that …
Hi Aaron, I'll check the code in a minute. I also checked the rest of the parser: a lot of time is wasted when arrays/objects are parsed, because they begin with empty capacity and are resized gradually. I think there is great room for improvement. Another thing is the string handling - the escape function does a terrible job. I'll keep an eye on that.
I tried again with a larger file (http://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file). I get 2410 ms for the version without callback (clang 3.6, …).
I found another instance where … could be changed to …

I didn't compare it to the non-callback version yet, but that one move makes another big performance improvement.

P.S. Nice find for a sample file!
With the additional change described in my previous comment, here are some test results with Xcode 6.3.1 and flags …

Three runs of the jeopardy file: …

Three runs of the canada.json file: …

Do you want to see more improvement before merging? If not, do you want me to update the pull request?
That sounds awesome! Let me give it a try, and then I'll merge. Thanks so much! 👍🏻

Hi @aburgh, I pulled your code and made some minor adjustments. Thanks a lot, and thanks for your patience!

Hi Niels, it appears you didn't include the performance tweaks we found. Would you like me to submit another pull request with them?

Oops... I had problems merging the code. Sorry for that. Yes, another pull request against the current version would be great!
This was closed with #69. |
This request builds on the "incremental" pull request. I separated the two in case you find this change objectionable. The changes implement a callback to a user-provided function (which can be a closure) to notify the user of key parser events: entering object and array elements, closing object and array elements, parsing an object key, and parsing a value. This enables processing elements as they are parsed, for example to provide progress feedback. More importantly, the user function returns a bool to indicate whether to keep the value. This can be used to filter the accumulated elements to reduce memory consumption. A default callback is provided, so existing code should compile and work as before.
Below is an example use case. It parses a JSON file that consists of an array (which is inside a simple object) of a large number of objects. The example just pretty-prints the result, discarding all dictionaries at a depth of 2, but it could do more interesting processing. Without the callback, a 4.1 MB test file uses 12.5 MB of memory. With the callback, it peaks at around 680 KB, most of which is process overhead.