Problem with reading unicode text from .features file #40
Comments
To me it seems like "D\u00C4\u0085\u00C5\u009Bl\u00C4\u0099\u00C5\u00BCyn\u00C3\u00B3w,Oslo,Stockholm" is the correct JSON encoding for "Dąślężynów,Oslo,Stockholm". I don't understand why Cucumber is sending rubbish back. I'll look into it, possibly later today.
Actually I think this encoded string is already wrong: look at the number of encoded chars.
You are right: it should have been "D\u0105..."
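To make the difference concrete, here is a minimal C++11 sketch that escapes the same word both ways. It is not CukeBins or JSON Spirit code: `decodeUtf8`, `appendEscape`, `escapePerByte` and `escapePerCodePoint` are hypothetical helper names, the decoder assumes well-formed UTF-8, and only code points up to U+FFFF are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Decode a UTF-8 byte string into Unicode code points.
// Sketch only: assumes well-formed input and performs no validation.
std::vector<std::uint32_t> decodeUtf8(const std::string& s) {
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::uint32_t cp;
        std::size_t len;
        if (b < 0x80)      { cp = b;        len = 1; }
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }
        else               { cp = b & 0x07; len = 4; }
        // Fold in the 6 payload bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Append a single \uXXXX escape for a value up to 0xFFFF.
void appendEscape(std::string& out, unsigned value) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "\\u%04X", value);
    out += buf;
}

// What the rubbish output corresponds to: every UTF-8 byte >= 0x80
// is escaped as if it were a code point of its own.
std::string escapePerByte(const std::string& utf8) {
    std::string out;
    for (unsigned char b : utf8) {
        if (b < 0x80) out += static_cast<char>(b);
        else appendEscape(out, b);
    }
    return out;
}

// Expected behaviour: decode first, then escape each non-ASCII code point.
// Code points above U+FFFF would need a surrogate pair; not handled here.
std::string escapePerCodePoint(const std::string& utf8) {
    std::string out;
    for (std::uint32_t cp : decodeUtf8(utf8)) {
        if (cp < 0x80) out += static_cast<char>(cp);
        else appendEscape(out, static_cast<unsigned>(cp));
    }
    return out;
}
```

Feeding the UTF-8 bytes of "Dąślężynów" to `escapePerByte` gives ten escapes (two per accented letter), while `escapePerCodePoint` gives five, starting with `D\u0105`.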
The bug looks like a JSON Spirit issue in dealing with Unicode characters when using 8-bit characters. This test shows the library behavior:
TEST(JsonSpiritTest, handlesUnicodeOnlyIfWideChars) {
    EXPECT_EQ(L"\"\\u9EC4\\u74DC\"", json_spirit::write_string(wmValue(L"\u9EC4\u74DC"), false));
    EXPECT_NE("\"\\u9EC4\\u74DC\"", json_spirit::write_string(mValue("\u9EC4\u74DC"), false));
}
Unfortunately the JSON serialization code in CukeBins is ugly, so I'll try to refactor it while fixing the bug.
I ran a few tests. The components that might have problems are the wire protocol codec (currently using JSON Spirit) and the regular expression matcher. C++ support for Unicode and regular expressions has been standardized only with C++0x, which is still not an option. In C++03 source code encoding is ASCII only, so even CukeBins regular expressions should be encoded using the \u escape character and wide strings. Please note that MSVC is an example of a compiler where wchar_t is 16 bits, so using wide strings would not solve the problem. My proposal for the moment is to treat every char (8-bit) sequence as UTF-8, handled by MSVC and GCC, and...
JSON Spirit
Boost 1.48+ comes with the new Locale library, which handles UTF quite well. I still haven't come to a conclusion on how to deal with the conversion without ICU or Boost Locale. I might introduce a new dependency on ICU for full Unicode support (with any Boost version), or require Boost 1.48+ without ICU for partial support. Edit: since JSON escapes characters as UTF-16/UCS-2 code units, like JavaScript, and since we don't care about counting, surrogate pairs should not be a problem, so I removed the case where wchar_t can't hold UTF-32 code points.
Regex
Here is a brief explanation of Boost Regex Unicode support.
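On the surrogate pair remark in the JSON Spirit notes above: a JSON `\u` escape carries one UTF-16 code unit, so a code point above U+FFFF is written as two escapes. The following is a small sketch of that arithmetic only; `jsonEscapeCodePoint` is a hypothetical name, not part of JSON Spirit or CukeBins, and the input is assumed to be a valid code point (at most U+10FFFF, not itself a surrogate).

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Escape one Unicode code point as JSON \u escapes (UTF-16 code units).
// Assumes cp is a valid scalar value; no range checking is done.
std::string jsonEscapeCodePoint(std::uint32_t cp) {
    char buf[16];
    if (cp <= 0xFFFF) {
        // Basic Multilingual Plane: a single 16-bit code unit.
        std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into two halves.
        std::uint32_t v = cp - 0x10000;
        unsigned high = 0xD800 + (v >> 10);    // top 10 bits
        unsigned low  = 0xDC00 + (v & 0x3FF);  // bottom 10 bits
        std::snprintf(buf, sizeof buf, "\\u%04X\\u%04X", high, low);
    }
    return buf;
}
```

Because the escaper only emits code units and never needs to count characters, it does not matter here whether `wchar_t` is 16 or 32 bits on the compiler in use.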
`cucumber-ruby` expects position values which are based on the index of the code-point instead of the index of the byte. This change modifies the value returned to `cucumber-ruby` before transmitting the resulting JSON over the WireProtocol. Prior to this change, the regex match's position (based on the match's position in the byte array) would cause an `index out of string` error and crash cucumber-ruby when pretty-printing the results of a test. Closes cucumber#40.
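The conversion described in this change can be sketched as follows. This is an illustration under assumptions, not the code from the actual commit: `codePointIndex` is a hypothetical name, and the input is assumed to be well-formed UTF-8. It relies on the fact that UTF-8 continuation bytes always have the bit pattern `10xxxxxx`, so counting every other byte counts code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Translate a byte offset into a UTF-8 string to a code point index by
// counting the code points that start before that offset.
std::size_t codePointIndex(const std::string& utf8, std::size_t byteOffset) {
    std::size_t index = 0;
    for (std::size_t i = 0; i < byteOffset && i < utf8.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        // Skip continuation bytes (10xxxxxx); count everything else.
        if ((b & 0xC0) != 0x80) ++index;
    }
    return index;
}
```

For the UTF-8 bytes of "Dąślężynów,Oslo", the match "Oslo" starts at byte offset 16 but at code point index 11; reporting 16 to a consumer that indexes by code point is the kind of mismatch that can run past the end of the string.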
I have a UTF8 encoded features definition file:
I'm using CukeBins with gtest. I want to compare some UTF-8 encoded text with the string extracted from the 'expected_value' column. The problem is that when extracting the text from 'expected_value' with:
REGEX_PARAM(std::string, expected_value);
I get 'D��l�żynów'.
After digging through the CukeBins code I managed to get closer to the source of the problem.
Here is a log of what Cucumber is sending and receiving around the moment when the problem appears:
Anyway, I found an ugly way to work around the issue that doesn't seem to break anything: I switched off part of the character escaping when json_spirit writes the response to the stream:
File json_spirit_writer_template.h:
OS: 64-bit Ubuntu Linux. Same problem appeared on 2 machines.