Parsing of floats locale dependent #302
Comments
This seems to be related to #228, but affects the parser. I'll look into it. |
I can reproduce the error with:
#include <clocale>
#include <iostream>
#include "src/json.hpp"

int main()
{
    // Parse the same JSON text under two different locales.
    std::setlocale(LC_ALL, "no_NO.UTF-8");
    float j = nlohmann::json::parse("0.1");
    std::setlocale(LC_ALL, "en_US.UTF-8");
    float k = nlohmann::json::parse("0.1");
    std::cerr << "j = " << j << ", k = " << k << std::endl;
}
The output is:
Edit: The reason is that internally |
Thanks for your efforts and for looking into it so quickly! |
Some systems have |
Thanks for the link - unfortunately, it uses string streams which - though correct - perform much worse than |
I copied the code from PR #337 into a feature branch and worked from there. I cleaned the code a bit and ran the benchmarks: Before:
After:
For floating-point heavy tests (canada.json, floats.json) we see an improvement of about 66%! There are still 3 tests failing:
I need to better understand how to detect/handle overflows. |
Digging into http://www.exploringbinary.com/how-glibc-strtod-works/, I think the current approach is too simple. @whackashoe, how would you propose to cope with rounding? |
I sorta have that feeling too... I've had a bit of a think about whether we could do some trick with epsilon and predict the loss during the multiplications and divisions to get correct rounding, but I'm not super certain how that would look or whether it is even feasible. I'm certainly a floating point novice, heh. The alternative would be, like in your link, basically trashing this and doing some bigint sort of thing... I'd really like it if we could beat that performance-wise though :) |
Searching google for "c++ string to double locale independent", this issue is on the second page. :) Maybe a hybrid approach: if strtod_l (or _strtod_l on Windows) is available at compile time, use that. If not: This could go in a single helper function so there aren't a lot of ifdefs or other forms of conditionals in the code. I'm not sure what the performance of the latter would be like, or if it's even feasible. |
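A minimal sketch of the "use strtod_l when available" half of the hybrid idea above, assuming a platform that provides `strtod_l` (glibc, macOS) or `_strtod_l` (MSVC). The helper name `parse_double_c_locale` is hypothetical and not part of the library; the fallback branch is only a placeholder for whatever slower path is chosen when neither function exists.

```cpp
#include <clocale>
#include <cstdlib>
#include <locale.h>
#if defined(__APPLE__)
#include <xlocale.h>   // strtod_l lives here on macOS
#endif

// Hypothetical helper: parse a double using the "C" locale,
// regardless of the process-wide locale set via setlocale().
double parse_double_c_locale(const char* text, char** end)
{
#if defined(_WIN32)
    // MSVC runtime: _create_locale / _strtod_l
    static const _locale_t c_loc = _create_locale(LC_ALL, "C");
    if (c_loc != nullptr) {
        return _strtod_l(text, end, c_loc);
    }
#else
    // POSIX 2008 locale object, created once and reused.
    static const locale_t c_loc =
        newlocale(LC_ALL_MASK, "C", static_cast<locale_t>(0));
    if (c_loc != static_cast<locale_t>(0)) {
        return strtod_l(text, end, c_loc);
    }
#endif
    // Placeholder fallback (assumption): locale-dependent strtod.
    return std::strtod(text, end);
}
```

Keeping the `#if` in one helper like this matches the suggestion of avoiding conditionals scattered through the parser.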
@gregmarr I think taking a 10x (or whatever it is) penalty when the locale doesn't match seems like a poor workaround. I ran a benchmark a while back and it was pretty bad (10x might be an exaggeration). Really, what we are doing doesn't match strtod; that just happens to provide all the functionality (and more...) |
Correctness wins over performance, especially when there is a correct and performant option that is usually available. If you can find a correct solution outside of that, that's great. Otherwise, you'll have to go with something slow when the correct solution isn't otherwise available. Do we actually know that strtod_l is a problem in any C++11 conforming compiler/library? http://lua-users.org/lists/lua-l/2016-04/msg00216.html
I found one reference where MinGW's C library has neither of the versions above, and in that case the user simply searched the string, replaced '.' with the locale's decimal separator, and then called strtod. |
Most of the slowness of streams comes from initialization and locale imbuing, so it can be mitigated by reusing an initialized stream per thread:
#include <cstddef>
#include <cstring>
#include <ios>
#include <locale>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <typeinfo>

// Usage:
//   const int x = to_num(...)
//   const long double y = to_num(...)
struct to_num
{
    const char* const data_ = nullptr;
    const std::size_t len_ = 0;

    // Non-copyable, non-movable: only meant to live as a temporary.
    to_num(const to_num&) = delete;
    to_num(to_num&&) = delete;
    to_num& operator=(const to_num&) = delete;
    to_num& operator=(to_num&&) = delete;

    to_num(const char* const data, std::size_t len = std::string::npos)
        : data_{ data }
        , len_{ len == std::string::npos ? std::strlen(data) : len }
    {}

    to_num(const std::string& s)
        : data_{ s.data() }
        , len_{ s.size() }
    {}

    template<typename T,
             typename = typename std::enable_if<std::is_arithmetic<T>::value>::type >
    operator T() const
    {
        // One stream per thread, created and imbued with the classic locale once.
        static thread_local std::unique_ptr<std::stringstream> sstr;
        if(!sstr) {
            sstr.reset(new std::stringstream);
            sstr->imbue(std::locale::classic());
        }
        if(len_ == 0) {
            throw std::runtime_error(
                std::string("Can't parse empty string as numeric type=")
                + typeid(T).name());
        }
        sstr->write(data_, static_cast<std::streamsize>(len_));
        T result;
        *sstr >> result;
        // Fail if extraction failed or if unread input remains.
        const bool could_not_parse = sstr->fail() || !sstr->eof();
        // fix-up sstr state regardless of whether
        // the exception is thrown below.
        sstr->clear(); // clear-out flags
        sstr->str(std::string()); // clear-out data
        if(could_not_parse) {
            throw std::runtime_error(
                "Can't parse " + std::string(data_, len_)
                + " as a number of type " + typeid(T).name());
        }
        return result;
    }
}; |
@TurpentineDistillery Thanks for the code. I shall have a look and see how it performs compared to the current solution. |
I tried the stream-based parsing approach myself. There's another approach: query the current locale's decimal-separator character (maybe just once with static const initialization); if not '.', then preprocess the input to look the way the current locale expects (replace '.' with locale's decimal-separator) and then dispatch to strtold or appropriate variant. |
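The separator-replacement idea described above could look roughly like the sketch below. It assumes the locale's decimal separator is a single byte and that the input is already a valid JSON number (so it contains at most one '.'); `parse_with_separator_fixup` is a hypothetical name. Unlike the "query once" variant mentioned above, this sketch queries `localeconv()` on every call, so it stays correct if the locale changes at runtime, at some cost.

```cpp
#include <clocale>
#include <cstdlib>
#include <string>

// Hypothetical helper: rewrite the JSON decimal point into whatever the
// current C locale expects, then hand the token to strtod.
double parse_with_separator_fixup(std::string token)
{
    const char* dp = std::localeconv()->decimal_point;
    // Assumption: single-byte separator; fall back to '.' otherwise.
    const char sep = (dp != nullptr && dp[0] != '\0') ? dp[0] : '.';

    if (sep != '.') {
        const std::string::size_type pos = token.find('.');
        if (pos != std::string::npos) {
            token[pos] = sep;   // a valid JSON number has at most one '.'
        }
    }
    return std::strtod(token.c_str(), nullptr);
}
```

The token is taken by value so the caller's buffer is never modified; a real implementation would probably want to avoid that copy.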
@TurpentineDistillery I mentioned that possibility at the end of my previous comment. Have we tried the strtod_l function to see what its performance is? |
@gregmarr, indeed you have! I thought this approach was a bit of a hack when I first thought about it, but now I'm feeling better about it : ) |
Is it possible to store numbers like 1.000.000,00 in json? If so, the parsing as you outline might get more complicated since it will yield an invalid result if you just replace the comma with a dot. However, it still could be a viable way. |
@jomade No, this is not valid JSON. Only these numbers are allowed: http://json.org/number.gif |
@jomade and we wouldn't be replacing the comma with a dot. We'd be replacing the dot in |
@nlohmann The failing regression tests seem to be misleading as they do not take into account I ran [1] https://en.wikipedia.org/wiki/IEEE_floating_point#Basic_and_interchange_formats |
@qwename With 15 or fewer digits, you are guaranteed that a string to double to string conversion will produce the original string. With 17 digits, you are guaranteed that double to string to double conversion will produce the same double. With more than 15 digits, and an arbitrary sequence of digits, then string to double to string is not guaranteed to produce the original string. It will only do that if the string was produced by a conversion from double to string with 17 digits. |
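The 15/17 digit guarantees above can be checked with a small sketch like this one, assuming IEEE 754 binary64 doubles and the default "C" locale; `format_double` and `round_trips_with_17_digits` are hypothetical helper names used only for illustration.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Format a double with the given number of significant digits.
std::string format_double(double value, int digits)
{
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*g", digits, value);
    return buf;
}

// double -> string (17 significant digits) -> double should be lossless.
bool round_trips_with_17_digits(double value)
{
    const std::string text = format_double(value, 17);
    return std::strtod(text.c_str(), nullptr) == value;
}
```

For example, `format_double(0.1, 15)` gives back `0.1`, while `format_double(0.1, 17)` gives `0.10000000000000001`: the 17-digit form is not the string one started from, but it converts back to exactly the same double.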
Closed with #450. |
With version 2.0.3, the following code:
yields the following results (read in the debugger):
curr_locale = "nb_NO.UTF-8"
j = 0
k = 0.100000001
Thus, the decimals are truncated in the Norwegian locale. This is rather unexpected, and should IMO not happen.
I think it originates from the parser calling strtod, which is locale dependent.
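To check this suspicion in isolation from the library, a small sketch like the following could be used. `strtod_under_locale` is a hypothetical helper; it temporarily switches `LC_NUMERIC`, so it is not thread-safe, and it returns false when the requested locale is not installed on the system.

```cpp
#include <clocale>
#include <cstdlib>
#include <string>

// Hypothetical helper: run strtod on `text` while LC_NUMERIC is set to
// `locale_name`, then restore the previous setting.
bool strtod_under_locale(const char* locale_name, const char* text, double& out)
{
    const char* current = std::setlocale(LC_NUMERIC, nullptr);
    const std::string saved = (current != nullptr) ? current : "C";

    if (std::setlocale(LC_NUMERIC, locale_name) == nullptr) {
        return false;   // locale not available; nothing was changed
    }
    out = std::strtod(text, nullptr);
    std::setlocale(LC_NUMERIC, saved.c_str());
    return true;
}
```

If `nb_NO.UTF-8` is installed, calling this with that locale and the text `"0.1"` would be expected to stop parsing at the '.', which is consistent with the `j = 0` seen in the debugger.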