Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parser callback with the ability to filter results. #41

Closed
wants to merge 12 commits into from

Conversation

aburgh
Copy link
Contributor

@aburgh aburgh commented Mar 2, 2015

This request builds on the "incremental" pull request. I separated the two in case you find this change objectionable. The changes implement a callback to a user-provided function (which can be a closure) to notify the user of key parser events: entering object and array elements, closing object and array elements, parsing an object key, and parsing a value. This enables processing elements as they are parsed, for example to provide progress feedback. More importantly, the user function returns a bool to indicate whether to keep the value. This can be used to filter the accumulated elements to reduce memory consumption. There is a default callback provided, so it existing code should compile and work as before.

Below is an example use case. It parses a JSON file that consists of an array (which is inside a simple object) of a large number of objects. The example just pretty-prints the result, discarding all dictionaries at a depth of 2, but it could do more interesting processing. Without the callback, a 4.1 MB test file uses 12.5 MB of memory. With the callback, it peaks at around 680 KB, most of which is process overhead.

using namespace std;
using json = nlohmann::json;

static const auto tab = '\t';
ifstream stream("/tmp/test/json/data.json");

...
{
        bool keyed = false;

        json j = json::parse(stream, [&keyed] (int depth, json::parse_event_t event, const json& element) -> bool {

            switch (event) {
                case json::parse_event_t::object_start:
                    if (not keyed)
                        for (int i = 0; i < depth; i++) cout << tab;
                    cout << '{' << endl;
                    keyed = false;
                    break;

                case json::parse_event_t::object_end:
                    for (int i = 0; i < depth; i++) cout << tab;
                    cout << '}' << endl;
                    if (depth == 2) return false;
                    break;

                case json::parse_event_t::array_start:
                    if (not keyed)
                        for (int i = 0; i < depth; i++) cout << tab;
                    cout << '[' << endl;
                    keyed = false;
                    break;

                case json::parse_event_t::array_end:
                    for (int i = 0; i < depth; i++) cout << tab;
                    cout << ']' << endl;
                    break;

                case json::parse_event_t::key:
                    for (int i = 0; i < depth; i++) cout << tab;
                    cout << element << " = ";
                    keyed = true;
                    break;

                case json::parse_event_t::value:
                    if (keyed) {
                        cout << element << endl;
                    }
                    else {
                        for (int i = 0; i < depth; i++) cout << tab;
                        cout << element << endl;
                    }
                    keyed = false;
                    break;

                default:
                    break;
            }
            return true;
        });
    }

@aburgh aburgh changed the title Add parser callback and with the ability to filter results. Add parser callback with the ability to filter results. Mar 2, 2015
@coveralls
Copy link

Coverage Status

Coverage remained the same at 100.0% when pulling b52dc79 on aburgh:callback into 5526de1 on nlohmann:master.

@nlohmann nlohmann self-assigned this Apr 24, 2015
@nlohmann
Copy link
Owner

I like the idea and I love the way how you improved the parser, my first experiments show that the runtime is twice as high.

Consider the following code:

#include <json.hpp>
#include <fstream>

int main(int argc, char** argv)
{
    std::ifstream input_file(argv[1]);
    nlohmann::json j;
    j << input_file;
}

For the old version, I can read https://github.com/miloyip/nativejson-benchmark/blob/master/data/canada.json in 140 ms. The new parser takes 260 ms. I used clang 3.6 with -flto -O3. I'll try to find out where the speed is going - my guess would be that the ´return true` was not inlined and not all copies could be avoided.

@aburgh
Copy link
Contributor Author

aburgh commented Apr 26, 2015

I agree the default callback is probably adding some time and I think it's worth investigating, but I doubt that it would double the time (see below). I declared the default callback as static, but I don't recall why, and now I don't think it adds anything and I wonder if it may prevent inlining.

I suspect the biggest performance hit comes from using the result variable instead of returning a newly constructed object. I think the way it's written will cause the object to be copied into result, which would be relatively expensive for strings, vectors, and maps. I would try changing the assignments result = ... to result = std::move(...).

If the performance penalty can't be eliminated, the parse with callback could be added as a totally separate function. Or, you could even left out of your main code and include it as a user-contributed patch for those that could use it. It's critical to me for parsing a 1+ GB file, but my case is probably uncommon.

@nlohmann
Copy link
Owner

Hi Aaaron, thanks for answering!

First of - please ignore the messages by AppVeyor. I am currently trying to whether MSVC 2015's C++11 support is as good as some people claim...

Second, I'll check the pull request as soon as I can find the time. This weekend, I tried to build a version of parse_internal() that does not return an object, but manipulates an object passed as reference. This would avoid copying altogether, but I could not get it running correctly. Maybe with another look at your code I get an idea.

I understand your use case, but - if possible - I would like to get an idea of the input data. Getting a 1+ GB JSON file with a real-life task would be a nice benchmark - especially as it is not about how many milliseconds it takes to create an object.

All the best
Niels

@aburgh
Copy link
Contributor Author

aburgh commented Apr 27, 2015

I ran a debug build in Apple's Instruments to profile your test program with the canada.json data. It spent 1.4% of the time in default callback, so I think we can ignore it as the problem. I also reminded myself why it's static: it's a user-supplied function and thus not a member function, and so it has to be static when defined in its current location, and I put it inside the class declaration so that it could easily specify a basic_json parameter type.

I tried changing the result = basic_json... to result = move(basic_json...), but that didn't make a noticeable difference.

The program spent a lot of time in push_back and destructing map and vector containers. Looking at the patch diff, I suspect this change is significant:

-                        result.push_back(parse_internal());
+                        auto value = parse_internal(keep);
+                        if (keep and not value.is_discarded())
+                        {
+                            result.push_back(value);
+                        }

The sample data contains a lot of arrays, and that push_back is a copy. I changed it to ...push_back(move(value)) and my debug build test times dropped from ~300ms to ~225ms, which is recovering 50% or more of the performance drop. If you want me to try it with optimization, let me know.

@nlohmann
Copy link
Owner

Hi Aaron, I'll check the code in a minute. I also checked the rest of the parser: a lot of time is wasted when arrays/objects are parsed, because they begin with empty capacity and are resized gradually. I think there is great room for improvements. Another thing is the string handling - the escape function does a terrible job. I'll keep an eye on that.

@nlohmann
Copy link
Owner

I tried again with a larger file (http://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file). I get 2410 ms for the version without callback (clang 3.6, -O3 -flto -std=c++11) and 3667 ms with callbacks. Any ideas?

@aburgh
Copy link
Contributor Author

aburgh commented Apr 28, 2015

I found another instance where std::move() is needed. In parse_internal(), change:

result[key] = value;

to

result[key] = std::move(value);

I didn't compare it to the non-callback version yet, but that one move makes another big performance improvement.

P.S. Nice find for a sample file!

@aburgh
Copy link
Contributor Author

aburgh commented Apr 29, 2015

With the additional change described in my previous comment, here are some test results with Xcode 6.3.1 and flags -O3 -flto -std=c++11:

Three runs of the jeopardy file:

w/o callback with callback
1222 ms 1279 ms
1184 ms 1247 ms
1233 ms 1264 ms

Three runs of the canada.json file:

w/o callback with callback
69.6 ms 71.2 ms
73.1 ms 78.2 ms
83.8 ms 72.9 ms

Do you want to see more improvement before merging? If not, do you want me to update the pull request?

@nlohmann
Copy link
Owner

That sounds awesome! Let me give it a try, and then I'll merge. Thanks so much! 👍🏻

nlohmann added a commit that referenced this pull request May 3, 2015
@nlohmann
Copy link
Owner

nlohmann commented May 3, 2015

Hi @aburgh, I pulled your code and made some minor adjustments. Thanks a lot, and thanks for your patience!

@nlohmann nlohmann closed this May 3, 2015
@aburgh
Copy link
Contributor Author

aburgh commented May 3, 2015

Hi Niels, it appears you didn't include the performance tweaks we found. Would you like me to submit a another pull request with them?

@nlohmann
Copy link
Owner

nlohmann commented May 3, 2015

Oops... I had problems merging the code. Sorry for that. Yes, another pull request against the current version would be great!

@nlohmann nlohmann reopened this May 3, 2015
@nlohmann
Copy link
Owner

nlohmann commented May 6, 2015

This was closed with #69.

@nlohmann nlohmann closed this May 6, 2015
@aburgh aburgh deleted the callback branch May 10, 2015 21:47
GerHobbelt pushed a commit to GerHobbelt/nlohmann-json that referenced this pull request May 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants