
Parsing of floats locale dependent #302

Closed
jomade opened this issue Sep 2, 2016 · 22 comments

Comments

@jomade

jomade commented Sep 2, 2016

With version 2.0.3, the following code:

#include <clocale>
#include <locale>
#include <string>
#include "json.hpp"

std::locale l("");                         // the environment's default locale
std::string curr_locale = l.name();
float j = nlohmann::json::parse("0.1");    // parsed with the default locale still active
std::setlocale(LC_ALL, "en_US.UTF-8");
float k = nlohmann::json::parse("0.1");    // parsed after switching to en_US.UTF-8

yields the following results (read in the debugger):
curr_locale = "nb_NO.UTF-8"
j = 0
k = 0.100000001

Thus, the fractional part is truncated in the Norwegian locale. This is rather unexpected and should IMO not happen.

I think it originates from the parser calling strtod, which is locale-dependent.
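
For illustration, here is a stand-alone sketch of the suspected root cause that does not involve the library at all; it assumes the nb_NO.UTF-8 locale is installed:

#include <clocale>
#include <cstdio>
#include <cstdlib>

int main()
{
    std::setlocale(LC_ALL, "nb_NO.UTF-8");  // a locale whose decimal separator is ','
    const char* input = "0.1";
    char* end = nullptr;
    const double d = std::strtod(input, &end);
    // strtod stops at the '.', so d == 0 and only 1 of 3 characters is consumed
    std::printf("value = %g, consumed %td of 3 characters\n", d, end - input);
}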

@nlohmann
Owner

nlohmann commented Sep 2, 2016

This seems to be related to #228, but affects the parser. I'll look into it.

@nlohmann
Owner

nlohmann commented Sep 2, 2016

I can reproduce the error with

#include <clocale>
#include <iostream>
#include "src/json.hpp"

int main()
{
    std::setlocale(LC_ALL, "no_NO.UTF-8");
    float j = nlohmann::json::parse("0.1");

    std::setlocale(LC_ALL, "en_US.UTF-8");
    float k = nlohmann::json::parse("0.1");

    std::cerr << "j = " << j << ", k = " << k << std::endl;
}

The output is:

j = 0, k = 0.1

Edit: The reason is that internally std::strtold is called, which is locale-dependent.

@jomade
Author

jomade commented Sep 2, 2016

Thanks for your efforts and quickly looking into it!

@t-b
Contributor

t-b commented Sep 6, 2016

Some systems have strtold_l, to which you can pass the locale to be used for the conversion. Unfortunately, it is not standardized in POSIX or similar. A first attempt at writing a replacement for strtod_l can be found here.
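
For context, a minimal sketch of what such a locale-pinned call looks like (a hypothetical helper, not the code linked above); it assumes a POSIX-like system where newlocale and strtold_l exist (glibc needs _GNU_SOURCE, and the declaring header differs between glibc and macOS/BSD):

#include <locale.h>   // newlocale (strtold_l: <stdlib.h> on glibc, <xlocale.h> on macOS/BSD)
#include <stdlib.h>

long double strtold_c(const char* str, char** end)
{
    // Create a "C" locale once and reuse it; it is deliberately never freed.
    static const locale_t c_locale = newlocale(LC_ALL_MASK, "C", nullptr);
    return strtold_l(str, end, c_locale);
}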

@nlohmann
Owner

Thanks for the link. Unfortunately, it uses string streams, which, though correct, perform much worse than strtold.

nlohmann added a commit that referenced this issue Nov 12, 2016
- took current state of #337
- adjusted parser to reject incomplete numbers
- simplified number parsing code as it could rely on correct numbers
@nlohmann
Owner

I copied the code from PR #337 into a feature branch and worked from there. I cleaned the code a bit and ran the benchmarks:

Before:

parse jeopardy.json                           5  3251660027 ns/op
parse canada.json                           100   139002149 ns/op
parse citm_catalog.json                     200    70643268 ns/op
parse twitter.json                          500    32565660 ns/op
parse numbers/floats.json                    10  1148783185 ns/op
parse numbers/signed_ints.json               20   893569712 ns/op
parse numbers/unsigned_ints.json             20   865393626 ns/op
dump jeopardy.json                           10  1092918030 ns/op
dump jeopardy.json with indent               10  1278158184 ns/op
./json_benchmarks_develop 235.942s

After:

parse jeopardy.json                           5  2758517489 ns/op
parse canada.json                           200    82824924 ns/op
parse citm_catalog.json                     200    67739617 ns/op
parse twitter.json                          500    31993207 ns/op
parse numbers/floats.json                    20   686197008 ns/op
parse numbers/signed_ints.json               20   880843166 ns/op
parse numbers/unsigned_ints.json             20   872672400 ns/op
dump jeopardy.json                           10  1237188893 ns/op
dump jeopardy.json with indent               10  1246099190 ns/op
./json_benchmarks_strtod 246.574s

For the floating-point-heavy tests (canada.json, floats.json) we see a speedup of roughly 1.7x (about 40% less time per parse)!

There are still 3 tests failing:

-------------------------------------------------------------------------------
regression tests
  issue #186 miloyip/nativejson-benchmark: floating-point parsing
-------------------------------------------------------------------------------
src/unit-regression.cpp:36
...............................................................................

src/unit-regression.cpp:325: FAILED:
  CHECK( j.get<double>() == 72057594037927928.0 )
with expansion:
  72057594037927936.0 == 72057594037927928.0

src/unit-regression.cpp:328: FAILED:
  CHECK( j.get<double>() == 9223372036854774784.0 )
with expansion:
  9223372036854775808.0
  ==
  9223372036854774784.0

-------------------------------------------------------------------------------
compliance tests from nativejson-benchmark
  doubles
-------------------------------------------------------------------------------
src/unit-testsuites.cpp:103
...............................................................................

src/unit-testsuites.cpp:113: FAILED:
  CHECK( json::parse(json_string)[0].get<double>() == Approx(expected) )
due to unexpected exception with message:
  json_string := [2
  .225073858507201136057409796709131975934819546351645648023426109724822222021-
  0769455165295239081350879141491589130396211068700864386945946455276572074078-
  2062174337998814106326732925355228688137214901298112245145188984905722230728-
  5255133155755015914397476397983411801999323962548289017107081850690630666655-
  9949382757725720157630626906633326475653000092458883164330377797918696120494-
  9739037782970490505108060994073026293712895895000358379996720725430436028407-
  8895771796150945516748243471030702609144621572289880258182545180325707018860-
  8721131280795122334262883686223215037756666225039825343359745688844239002654-
  9819838548794829220689472168983109969836584681402285424333066033985088644580-
  4001034933970427567186443383770486037861622771738545623065874679014086723327-
  636718751234567890123456789012345678901e-308]
  expected := 2,22507e-308
  type must be number, but is null

===============================================================================
test cases:   34 |   32 passed | 2 failed
assertions: 8974 | 8971 passed | 3 failed

I need to better understand how to detect/handle overflows.

@nlohmann
Owner

Digging into http://www.exploringbinary.com/how-glibc-strtod-works/, I am starting to think the current approach is too simple. @whackashoe, how would you propose to cope with rounding?

@whackashoe
Contributor

I sort of have that feeling too... I've thought about whether we could do some trick with epsilon and predict the loss during the multiplications and divisions to get correct rounding, but I'm not certain how that would look or whether it is even feasible; I'm certainly a floating-point novice. The alternative would be, as in your link, basically scrapping this and doing some bigint-based conversion... I'd really like it if we could beat that performance-wise, though :)

@gregmarr
Contributor

Searching Google for "c++ string to double locale independent", this issue is on the second page of results. :)

Maybe a hybrid approach:

- If strtod_l (or _strtod_l on Windows) is available at compile time, use that.
- If not, check at runtime whether the current locale is "C":
  - If so, use strtod.
  - Otherwise, fall back to the slower implementation.

This could go in a single helper function so there aren't a lot of ifdefs or other forms of conditionals in the code.

I'm not sure what the performance of the latter would be like, or if it's even feasible.
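
As a rough sketch of that hybrid helper (hypothetical names and feature checks, not the library's actual code), preferring strtod_l/_strtod_l where available and using a slow classic-locale stream as the last resort:

#include <clocale>
#include <cstdlib>
#include <locale>
#include <locale.h>   // newlocale/strtod_l (POSIX ext.) or _create_locale/_strtod_l (MSVC)
#include <sstream>
#include <string>

// Note: on glibc, _GNU_SOURCE must be defined before the includes for strtod_l.
inline double locale_independent_strtod(const std::string& s)
{
#if defined(_MSC_VER)
    static const _locale_t c_loc = _create_locale(LC_ALL, "C");
    return _strtod_l(s.c_str(), nullptr, c_loc);
#elif defined(__GLIBC__) || defined(__APPLE__) || defined(__FreeBSD__)
    static const locale_t c_loc = newlocale(LC_ALL_MASK, "C", nullptr);
    return strtod_l(s.c_str(), nullptr, c_loc);
#else
    // Runtime check: in the "C" locale, plain strtod already uses '.' as separator.
    const char* name = std::setlocale(LC_NUMERIC, nullptr);
    if (name != nullptr && std::string(name) == "C")
    {
        return std::strtod(s.c_str(), nullptr);
    }
    // Slow but correct fallback: a stream imbued with the classic locale.
    std::istringstream iss(s);
    iss.imbue(std::locale::classic());
    double value = 0.0;
    iss >> value;
    return value;
#endif
}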

@whackashoe
Contributor

@gregmarr I think having a 10x (or whatever it is) penalty when the locale doesn't match seems like a poor workaround. I did a benchmark a while back and it was pretty bad (10x might be an exaggeration). Really, what we are doing doesn't match strtod; it just happens to provide all the functionality we need (and more...).

@gregmarr
Contributor

Correctness wins over performance, especially when there is a correct and performant option that is usually available.

If you can find a correct solution outside of that, that's great. Otherwise, you'll have to go with something slow when the correct solution isn't otherwise available.

Do we actually know that strtod_l is a problem in any C++11 conforming compiler/library?

http://lua-users.org/lists/lua-l/2016-04/msg00216.html

  • strtod_l is available on linux (at least both glibc and musl) if _GNU_SOURCE is defined
  • _strtod_l is available in MSVC since VS2005
  • strtod_l is available on OSX since Darwin 8.
  • strtod_l is available on FreeBSD since 9.1

I found one reference saying that MinGW's C library has neither of the versions above; in that case, the user simply searched the string, replaced '.' with the locale's decimal separator, and then called strtod.

@TurpentineDistillery

Most of the slowness due to streams comes from initialization and locale imbuing, so this can be mitigated by reusing an initialized stream per thread:

#include <cstddef>      // size_t
#include <cstring>      // std::strlen
#include <locale>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <typeinfo>

// usage: const int x = to_num(...);
//        const long double y = to_num(...);
struct to_num
{
    const char* const data_ = nullptr;
    const size_t      len_  = 0;

    to_num(const to_num&)            = delete;
    to_num(to_num&&)                 = delete;
    to_num& operator=(const to_num&) = delete;
    to_num& operator=(to_num&&)      = delete;

    to_num(const char* const data, size_t len = std::string::npos)
        : data_{ data }
        , len_{ len == std::string::npos ? std::strlen(data) : len }
    {}

    to_num(const std::string& s)
        : data_{ s.data() }
        , len_{ s.size() }
    {}

    template<typename T,
             typename = typename std::enable_if<std::is_arithmetic<T>::value>::type >
    operator T() const
    {
        static thread_local std::unique_ptr<std::stringstream> sstr;

        if(!sstr) {
            sstr.reset(new std::stringstream);
            sstr->imbue(std::locale::classic());
        }

        if(len_ == 0) {
            throw std::runtime_error(
                    std::string("Can't parse empty string as numeric type=")
                    + typeid(T).name());
        }

        sstr->write(data_, static_cast<std::streamoff>(len_));
        T result;
        *sstr >> result;

        // also treat extraction failure (e.g. overflow) as "could not parse",
        // not just trailing garbage
        const bool could_not_parse = sstr->fail() || !sstr->eof();

        // fix-up sstr state regardless of whether
        // the exception is thrown below.
        sstr->clear();       // clear-out flags
        sstr->str(std::string()); // clear-out data

        if(could_not_parse) {
            throw std::runtime_error(
                    "Can't parse " + std::string(data_, len_)
                    + " as a number of type " + typeid(T).name());
        }

        return result;
    }
};

@nlohmann
Owner

@TurpentineDistillery Thanks for the code. I shall have a look and see how it performs compared to the current solution.

@TurpentineDistillery

I tried the stream-based parsing approach myself.
The performance is still garbage : (

There's another approach: query the current locale's decimal-separator character (maybe just once with static const initialization); if not '.', then preprocess the input to look the way the current locale expects (replace '.' with locale's decimal-separator) and then dispatch to strtold or appropriate variant.
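
A rough sketch of that preprocessing idea (a hypothetical helper, not the library's actual implementation); it assumes a JSON number contains at most one '.' and ignores the corner case of a multi-character decimal separator:

#include <clocale>
#include <cstdlib>
#include <string>

inline long double parse_number_with_locale_fixup(std::string number)
{
    // Query the current locale's decimal separator (could be cached).
    const std::lconv* lc = std::localeconv();
    const char sep = (lc != nullptr && lc->decimal_point != nullptr)
                         ? lc->decimal_point[0]
                         : '.';
    if (sep != '.')
    {
        const std::string::size_type pos = number.find('.');
        if (pos != std::string::npos)
        {
            number[pos] = sep;   // make the string look the way this locale expects
        }
    }
    return std::strtold(number.c_str(), nullptr);
}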

@gregmarr
Contributor

gregmarr commented Dec 1, 2016

@TurpentineDistillery I mentioned that possibility at the end of my previous comment. Have we tried the strtod_l function to see what its performance is?

@TurpentineDistillery

@gregmarr, indeed you have! I thought this approach was a bit of a hack when I first thought about it, but now I'm feeling better about it : )

@jomade
Author

jomade commented Dec 1, 2016

Is it possible to store numbers like 1.000.000,00 in JSON? If so, the parsing as you outline it might get more complicated, since it would yield an invalid result if you just replace the comma with a dot. However, it could still be a viable way.

@nlohmann
Owner

nlohmann commented Dec 1, 2016

@jomade No, this is not valid JSON. Only these numbers are allowed: http://json.org/number.gif

@gregmarr
Contributor

gregmarr commented Dec 1, 2016

@jomade and we wouldn't be replacing the comma with a dot. We'd be replacing the dot in 1000000.00 with a comma, yielding 1000000,00, so that the locale-specific function parses it properly.

@qwename
Contributor

qwename commented Dec 29, 2016

@nlohmann The failing regression tests seem to be misleading, as they do not take std::numeric_limits<double>::digits10 into account. This is equal to 15 for me, since double is most likely an IEEE 754 binary64 [1], and we do see that the first 15 digits in both test cases are equal.

I ran std::strtod on strings from 72057594037927928.0 to 72057594037927936.0; the first four values in the range return 72057594037927928.0, while the last four return 72057594037927936.0. The test case 7205759403792793199999e-5 does lean towards the smaller value, but I'm not sure whether this should be relied on.

[1] https://en.wikipedia.org/wiki/IEEE_floating_point#Basic_and_interchange_formats

@gregmarr
Contributor

@qwename With 15 or fewer digits, you are guaranteed that a string to double to string conversion will produce the original string. With 17 digits, you are guaranteed that double to string to double conversion will produce the same double. With more than 15 digits, and an arbitrary sequence of digits, then string to double to string is not guaranteed to produce the original string. It will only do that if the string was produced by a conversion from double to string with 17 digits.
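
To make the 15-vs-17-digit distinction concrete, a small illustration (assumes double is IEEE 754 binary64):

#include <cstdio>
#include <limits>

int main()
{
    std::printf("digits10     = %d\n", std::numeric_limits<double>::digits10);      // 15: string -> double -> string is exact up to 15 significant digits
    std::printf("max_digits10 = %d\n", std::numeric_limits<double>::max_digits10);  // 17: double -> string -> double round-trips
    const double x = 0.1;
    std::printf("%.15g\n", x);   // prints 0.1
    std::printf("%.17g\n", x);   // prints 0.10000000000000001 (enough digits to round-trip the stored value)
}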

@nlohmann nlohmann added this to the Release 2.1.0 milestone Jan 2, 2017
@nlohmann nlohmann modified the milestones: Release 2.1.0, Release 3.0.1 Jan 24, 2017
@nlohmann
Owner

Closed with #450.
