
Parsing of floats locale dependent #302

Closed
jomade opened this issue Sep 2, 2016 · 22 comments

Comments

@jomade

jomade commented Sep 2, 2016

With version 2.0.3, the following code:

#include <clocale>
#include <locale>
#include <string>
#include "json.hpp"

std::locale l("");                         // the environment's default locale
std::string curr_locale = l.name();
float j = nlohmann::json::parse("0.1");    // parsed with the default locale still active
std::setlocale(LC_ALL, "en_US.UTF-8");
float k = nlohmann::json::parse("0.1");    // parsed after switching to en_US.UTF-8

yields the following results (read in the debugger):
curr_locale = "nb_NO.UTF-8"
j = 0
k = 0.100000001

Thus, the fractional part is truncated in the Norwegian locale. This is rather unexpected and should IMO not happen.

I think it originates from the parser calling strtod, which is locale-dependent.
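
For illustration, here is a stand-alone sketch of the suspected root cause that does not involve the library at all; it assumes the nb_NO.UTF-8 locale is installed:

#include <clocale>
#include <cstdio>
#include <cstdlib>

int main()
{
    std::setlocale(LC_ALL, "nb_NO.UTF-8");  // a locale whose decimal separator is ','
    const char* input = "0.1";
    char* end = nullptr;
    const double d = std::strtod(input, &end);
    // strtod stops at the '.', so d == 0 and only 1 of 3 characters is consumed
    std::printf("value = %g, consumed %td of 3 characters\n", d, end - input);
}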

@nlohmann
Owner

nlohmann commented Sep 2, 2016

This seems to be related to #228, but affects the parser. I'll look into it.

@nlohmann
Owner

nlohmann commented Sep 2, 2016

I can reproduce the error with

#include <clocale>
#include <iostream>
#include "src/json.hpp"

int main()
{
    std::setlocale(LC_ALL, "no_NO.UTF-8");
    float j = nlohmann::json::parse("0.1");

    std::setlocale(LC_ALL, "en_US.UTF-8");
    float k = nlohmann::json::parse("0.1");

    std::cerr << "j = " << j << ", k = " << k << std::endl;
}

The output is:

j = 0, k = 0.1

Edit: The reason is that internally std::strtold is called, which is locale-dependent.

@jomade
Author

jomade commented Sep 2, 2016

Thanks for your efforts and quickly looking into it!

@t-b
Contributor

t-b commented Sep 6, 2016

Some systems have strtold_l, to which you can pass the locale to be used for the conversion. Unfortunately, it is not standardized in POSIX or similar. A first attempt at writing a replacement for strtod_l can be found here.
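
For context, a minimal sketch of what such a locale-pinned call looks like (a hypothetical helper, not the code linked above); it assumes a POSIX-like system where newlocale and strtold_l exist (glibc needs _GNU_SOURCE, and the declaring header differs between glibc and macOS/BSD):

#include <locale.h>   // newlocale (strtold_l: <stdlib.h> on glibc, <xlocale.h> on macOS/BSD)
#include <stdlib.h>

long double strtold_c(const char* str, char** end)
{
    // Create a "C" locale once and reuse it; it is deliberately never freed.
    static const locale_t c_locale = newlocale(LC_ALL_MASK, "C", nullptr);
    return strtold_l(str, end, c_locale);
}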

@nlohmann
Owner

Thanks for the link. Unfortunately, it uses string streams, which, though correct, perform much worse than strtold.

nlohmann added a commit that referenced this issue Nov 12, 2016
- took current state of #337
- adjusted parser to reject incomplete numbers
- simplified number parsing code as it could rely on correct numbers
@nlohmann
Owner

I copied the code from PR #337 into a feature branch and worked from there. I cleaned the code a bit and ran the benchmarks:

Before:

parse jeopardy.json                           5  3251660027 ns/op
parse canada.json                           100   139002149 ns/op
parse citm_catalog.json                     200    70643268 ns/op
parse twitter.json                          500    32565660 ns/op
parse numbers/floats.json                    10  1148783185 ns/op
parse numbers/signed_ints.json               20   893569712 ns/op
parse numbers/unsigned_ints.json             20   865393626 ns/op
dump jeopardy.json                           10  1092918030 ns/op
dump jeopardy.json with indent               10  1278158184 ns/op
./json_benchmarks_develop 235.942s

After:

parse jeopardy.json                           5  2758517489 ns/op
parse canada.json                           200    82824924 ns/op
parse citm_catalog.json                     200    67739617 ns/op
parse twitter.json                          500    31993207 ns/op
parse numbers/floats.json                    20   686197008 ns/op
parse numbers/signed_ints.json               20   880843166 ns/op
parse numbers/unsigned_ints.json             20   872672400 ns/op
dump jeopardy.json                           10  1237188893 ns/op
dump jeopardy.json with indent               10  1246099190 ns/op
./json_benchmarks_strtod 246.574s

For the floating-point-heavy tests (canada.json, floats.json) we see a speedup of roughly 1.7x (about 40% less time per parse)!

There are still 3 tests failing:

-------------------------------------------------------------------------------
regression tests
  issue #186 miloyip/nativejson-benchmark: floating-point parsing
-------------------------------------------------------------------------------
src/unit-regression.cpp:36
...............................................................................

src/unit-regression.cpp:325: FAILED:
  CHECK( j.get<double>() == 72057594037927928.0 )
with expansion:
  72057594037927936.0 == 72057594037927928.0

src/unit-regression.cpp:328: FAILED:
  CHECK( j.get<double>() == 9223372036854774784.0 )
with expansion:
  9223372036854775808.0
  ==
  9223372036854774784.0

-------------------------------------------------------------------------------
compliance tests from nativejson-benchmark
  doubles
-------------------------------------------------------------------------------
src/unit-testsuites.cpp:103
...............................................................................

src/unit-testsuites.cpp:113: FAILED:
  CHECK( json::parse(json_string)[0].get<double>() == Approx(expected) )
due to unexpected exception with message:
  json_string := [2
  .225073858507201136057409796709131975934819546351645648023426109724822222021-
  0769455165295239081350879141491589130396211068700864386945946455276572074078-
  2062174337998814106326732925355228688137214901298112245145188984905722230728-
  5255133155755015914397476397983411801999323962548289017107081850690630666655-
  9949382757725720157630626906633326475653000092458883164330377797918696120494-
  9739037782970490505108060994073026293712895895000358379996720725430436028407-
  8895771796150945516748243471030702609144621572289880258182545180325707018860-
  8721131280795122334262883686223215037756666225039825343359745688844239002654-
  9819838548794829220689472168983109969836584681402285424333066033985088644580-
  4001034933970427567186443383770486037861622771738545623065874679014086723327-
  636718751234567890123456789012345678901e-308]
  expected := 2,22507e-308
  type must be number, but is null

===============================================================================
test cases:   34 |   32 passed | 2 failed
assertions: 8974 | 8971 passed | 3 failed

I need to better understand how to detect/handle overflows.

@nlohmann
Owner

Digging into http://www.exploringbinary.com/how-glibc-strtod-works/, I am starting to think the current approach is too simple. @whackashoe, how would you propose to cope with rounding?

@whackashoe
Contributor

I sort of have that feeling too... I've thought about whether we could do some trick with epsilon and predict the loss during the multiplications and divisions to get correct rounding, but I'm not certain how that would look or whether it is even feasible; I'm certainly a floating-point novice. The alternative would be, as in your link, basically scrapping this and doing some bigint-based conversion... I'd really like it if we could beat that performance-wise, though :)

@gregmarr
Contributor

Searching Google for "c++ string to double locale independent", this issue is on the second page of results. :)

Maybe a hybrid approach:

- If strtod_l (or _strtod_l on Windows) is available at compile time, use that.
- If not, check at runtime whether the current locale is "C":
  - If so, use strtod.
  - Otherwise, fall back to the slower implementation.

This could go in a single helper function so there aren't a lot of ifdefs or other forms of conditionals in the code.

I'm not sure what the performance of the latter would be like, or if it's even feasible.
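
As a rough sketch of that hybrid helper (hypothetical names and feature checks, not the library's actual code), preferring strtod_l/_strtod_l where available and using a slow classic-locale stream as the last resort:

#include <clocale>
#include <cstdlib>
#include <locale>
#include <locale.h>   // newlocale/strtod_l (POSIX ext.) or _create_locale/_strtod_l (MSVC)
#include <sstream>
#include <string>

// Note: on glibc, _GNU_SOURCE must be defined before the includes for strtod_l.
inline double locale_independent_strtod(const std::string& s)
{
#if defined(_MSC_VER)
    static const _locale_t c_loc = _create_locale(LC_ALL, "C");
    return _strtod_l(s.c_str(), nullptr, c_loc);
#elif defined(__GLIBC__) || defined(__APPLE__) || defined(__FreeBSD__)
    static const locale_t c_loc = newlocale(LC_ALL_MASK, "C", nullptr);
    return strtod_l(s.c_str(), nullptr, c_loc);
#else
    // Runtime check: in the "C" locale, plain strtod already uses '.' as separator.
    const char* name = std::setlocale(LC_NUMERIC, nullptr);
    if (name != nullptr && std::string(name) == "C")
    {
        return std::strtod(s.c_str(), nullptr);
    }
    // Slow but correct fallback: a stream imbued with the classic locale.
    std::istringstream iss(s);
    iss.imbue(std::locale::classic());
    double value = 0.0;
    iss >> value;
    return value;
#endif
}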

@whackashoe
Contributor

@gregmarr I think having a 10x (or whatever it is) penalty when the locale doesn't match seems like a poor workaround. I did a benchmark a while back and it was pretty bad (10x might be an exaggeration). Really, what we are doing doesn't match strtod; it just happens to provide all the functionality we need (and more...).

@gregmarr
Contributor

Correctness wins over performance, especially when there is a correct and performant option that is usually available.

If you can find a correct solution outside of that, that's great. Otherwise, you'll have to go with something slow when the correct solution isn't otherwise available.

Do we actually know that strtod_l is a problem in any C++11 conforming compiler/library?

http://lua-users.org/lists/lua-l/2016-04/msg00216.html

  • strtod_l is available on linux (at least both glibc and musl) if _GNU_SOURCE is defined
  • _strtod_l is available in MSVC since VS2005
  • strtod_l is available on OSX since Darwin 8.
  • strtod_l is available on FreeBSD since 9.1

I found one reference saying that MinGW's C library has neither of the versions above; in that case, the user simply searched the string, replaced '.' with the locale's decimal separator, and then called strtod.

@TurpentineDistillery

Most of the slowness due to streams comes from initialization and locale imbuing, so this can be mitigated by reusing an initialized stream per thread:

#include <cstddef>      // size_t
#include <cstring>      // std::strlen
#include <locale>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <typeinfo>

// usage: const int x = to_num(...);
//        const long double y = to_num(...);
struct to_num
{
    const char* const data_ = nullptr;
    const size_t      len_  = 0;

    to_num(const to_num&)            = delete;
    to_num(to_num&&)                 = delete;
    to_num& operator=(const to_num&) = delete;
    to_num& operator=(to_num&&)      = delete;

    to_num(const char* const data, size_t len = std::string::npos)
        : data_{ data }
        , len_{ len == std::string::npos ? std::strlen(data) : len }
    {}

    to_num(const std::string& s)
        : data_{ s.data() }
        , len_{ s.size() }
    {}

    template<typename T,
             typename = typename std::enable_if<std::is_arithmetic<T>::value>::type >
    operator T() const
    {
        static thread_local std::unique_ptr<std::stringstream> sstr;

        if(!sstr) {
            sstr.reset(new std::stringstream);
            sstr->imbue(std::locale::classic());
        }

        if(len_ == 0) {
            throw std::runtime_error(
                    std::string("Can't parse empty string as numeric type=")
                    + typeid(T).name());
        }

        sstr->write(data_, static_cast<std::streamoff>(len_));
        T result;
        *sstr >> result;

        // also treat extraction failure (e.g. overflow) as "could not parse",
        // not just trailing garbage
        const bool could_not_parse = sstr->fail() || !sstr->eof();

        // fix-up sstr state regardless of whether
        // the exception is thrown below.
        sstr->clear();       // clear-out flags
        sstr->str(std::string()); // clear-out data

        if(could_not_parse) {
            throw std::runtime_error(
                    "Can't parse " + std::string(data_, len_)
                    + " as a number of type " + typeid(T).name());
        }

        return result;
    }
};

@nlohmann
Owner

@TurpentineDistillery Thanks for the code. I shall have a look and see how it performs compared to the current solution.

@TurpentineDistillery

I tried the stream-based parsing approach myself.
The performance is still garbage : (

There's another approach: query the current locale's decimal-separator character (maybe just once with static const initialization); if not '.', then preprocess the input to look the way the current locale expects (replace '.' with locale's decimal-separator) and then dispatch to strtold or appropriate variant.
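
A rough sketch of that preprocessing idea (a hypothetical helper, not the library's actual implementation); it assumes a JSON number contains at most one '.' and ignores the corner case of a multi-character decimal separator:

#include <clocale>
#include <cstdlib>
#include <string>

inline long double parse_number_with_locale_fixup(std::string number)
{
    // Query the current locale's decimal separator (could be cached).
    const std::lconv* lc = std::localeconv();
    const char sep = (lc != nullptr && lc->decimal_point != nullptr)
                         ? lc->decimal_point[0]
                         : '.';
    if (sep != '.')
    {
        const std::string::size_type pos = number.find('.');
        if (pos != std::string::npos)
        {
            number[pos] = sep;   // make the string look the way this locale expects
        }
    }
    return std::strtold(number.c_str(), nullptr);
}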

@gregmarr
Contributor

gregmarr commented Dec 1, 2016

@TurpentineDistillery I mentioned that possibility at the end of my previous comment. Have we tried the strtod_l function to see what its performance is?

@TurpentineDistillery

@gregmarr, indeed you have! I thought this approach was a bit of a hack when I first thought about it, but now I'm feeling better about it : )

@jomade
Author

jomade commented Dec 1, 2016

Is it possible to store numbers like 1.000.000,00 in JSON? If so, the parsing as you outline it might get more complicated, since it would yield an invalid result if you just replace the comma with a dot. However, it could still be a viable way.

@nlohmann
Owner

nlohmann commented Dec 1, 2016

@jomade No, this is not valid JSON. Only these numbers are allowed: http://json.org/number.gif

@gregmarr
Contributor

gregmarr commented Dec 1, 2016

@jomade and we wouldn't be replacing the comma with a dot. We'd be replacing the dot in 1000000.00 with a comma, yielding 1000000,00, so that the locale-specific function parses it properly.

@qwename
Contributor

qwename commented Dec 29, 2016

@nlohmann The failing regression tests seem to be misleading, as they do not take std::numeric_limits<double>::digits10 into account. This is equal to 15 for me, since double is most likely an IEEE 754 binary64 [1], and we do see that the first 15 digits in both test cases are equal.

I ran std::strtod on strings from 72057594037927928.0 to 72057594037927936.0; the first four values in the range return 72057594037927928.0, while the last four return 72057594037927936.0. The test case 7205759403792793199999e-5 does lean towards the smaller value, but I'm not sure whether this should be relied on.

[1] https://en.wikipedia.org/wiki/IEEE_floating_point#Basic_and_interchange_formats

@gregmarr
Contributor

@qwename With 15 or fewer digits, you are guaranteed that a string to double to string conversion will produce the original string. With 17 digits, you are guaranteed that double to string to double conversion will produce the same double. With more than 15 digits, and an arbitrary sequence of digits, then string to double to string is not guaranteed to produce the original string. It will only do that if the string was produced by a conversion from double to string with 17 digits.
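
To make the 15-vs-17-digit distinction concrete, a small illustration (assumes double is IEEE 754 binary64):

#include <cstdio>
#include <limits>

int main()
{
    std::printf("digits10     = %d\n", std::numeric_limits<double>::digits10);      // 15: string -> double -> string is exact up to 15 significant digits
    std::printf("max_digits10 = %d\n", std::numeric_limits<double>::max_digits10);  // 17: double -> string -> double round-trips
    const double x = 0.1;
    std::printf("%.15g\n", x);   // prints 0.1
    std::printf("%.17g\n", x);   // prints 0.10000000000000001 (enough digits to round-trip the stored value)
}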

@nlohmann nlohmann added this to the Release 2.1.0 milestone Jan 2, 2017
@nlohmann nlohmann modified the milestones: Release 2.1.0, Release 3.0.1 Jan 24, 2017
@nlohmann
Owner

Closed with #450.
