Problem with reading unicode text from .features file #40

Closed
PiotrNakonczy-TomTom opened this issue Mar 26, 2012 · 5 comments · Fixed by #224

@PiotrNakonczy-TomTom

I have a UTF8 encoded features definition file:

# language: en

  Feature: 
    ...
  Scenario Outline: 

    ... step defs ...

  Examples:
    |sth|sth|expected_value|
    |...|...|Dąślężynów,Oslo,Stockholm|

I'm using CukeBins with gtest. I want to compare some UTF-8 encoded text with the string extracted from the 'expected_value' column. The problem is that when I extract the text from 'expected_value' with:
REGEX_PARAM(std::string, expected_value);
I get 'D��l�żynów'.

After digging through the CukeBins code I managed to get closer to the source of the problem.
Here is a log of what Cucumber is sending and receiving around the moment the problem appears:

...
cucumber-send:
["step_matches",{"name_to_match":"the labels are present on the tile Dąślężynów,Oslo,Stockholm"}]
cucumber-raw_response:
["success",[{"args":[{"pos":35,"val":"D\u00C4\u0085\u00C5\u009Bl\u00C4\u0099\u00C5\u00BCyn\u00C3\u00B3w,Oslo,Stockholm"}],"id":"7","source":"test_steps.cpp:76"}]]
cucumber-send:
["invoke",{"id":"7","args":["D��l�żynów,Oslo,Stockholm"]}]
...

Anyway, I found an ugly workaround that doesn't seem to break anything: I switched off part of the character escaping that json_spirit performs when it writes the response to the stream:

File json_spirit_writer_template.h:

    template< class String_type >
    String_type add_esc_chars( const String_type& s )
    {
        typedef typename String_type::const_iterator Iter_type;
        typedef typename String_type::value_type     Char_type;

        String_type result;

        const Iter_type end( s.end() );

        for( Iter_type i = s.begin(); i != end; ++i )
        {
            const Char_type c( *i );

            if( add_esc_char( c, result ) ) continue;

            const wint_t unsigned_c( ( c >= 0 ) ? c : 256 + c );

            // This is commented out due to LBSR-2427 - issue with unicode national chars.

//            if( iswprint( unsigned_c ) )
//            {
                result += c;
//            }
//            else
//            {
//                result += non_printable_to_string< String_type >( unsigned_c );
//            }
        }

        return result;
    }

OS: 64-bit Ubuntu Linux. Same problem appeared on 2 machines.

@paoloambrosio
Member

To me it seems like "D\u00C4\u0085\u00C5\u009Bl\u00C4\u0099\u00C5\u00BCyn\u00C3\u00B3w,Oslo,Stockholm" is the correct JSON encoding for "Dąślężynów,Oslo,Stockholm". I don't understand why Cucumber is sending rubbish back. I'll look into it, possibly later today.

@PiotrNakonczy-TomTom
Author

Actually, I think this encoded string is already wrong: look at the number of encoded characters.
For example, the letter 'ą' is 'LATIN SMALL LETTER A WITH OGONEK' according to http://www.utf8-chartable.de/unicode-utf8-table.pl?start=256, so its code point should be u0105, yet it gets interpreted as two separate characters, u00C4 and u0085 (the two bytes of its UTF-8 encoding).
This leads me to think that the problem may appear when reading Unicode text from an iostream or something similar.
Basically, the iswprint() function call does not recognize 'ą' and similar characters as printable.
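
To make the byte-versus-code-point difference concrete, here is a minimal sketch of decoding UTF-8 into code points. The helper name `decodeUtf8` is hypothetical and is not part of CukeBins or JSON Spirit; it assumes well-formed input and does no validation beyond a truncation check.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not CukeBins or JSON Spirit code): decodes a
// well-formed UTF-8 byte string into Unicode code points.
// Assumption: the input is valid UTF-8; overlong or stray bytes are not rejected.
std::vector<unsigned long> decodeUtf8(const std::string& bytes) {
    std::vector<unsigned long> codePoints;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        unsigned long cp;
        std::size_t extra;  // number of continuation bytes that follow the lead byte
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        // Precondition: the sequence is not truncated.
        assert(i + extra < bytes.size());
        for (std::size_t k = 1; k <= extra; ++k) {
            // Each continuation byte contributes its low six bits.
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        }
        codePoints.push_back(cp);
        i += extra + 1;
    }
    return codePoints;
}
```

Under these assumptions, `decodeUtf8("\xC4\x85")` yields the single code point 0x0105, whereas treating each byte as its own character gives 0xC4 and 0x85, which are the values that show up escaped in the log.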

@paoloambrosio
Member

You are right: it should have been "D\u0105..."

@paoloambrosio
Member

The bug looks like a JSON Spirit issue in dealing with Unicode characters when using 8-bit characters. This test shows the library's behavior:

TEST(JsonSpiritTest, handlesUnicodeOnlyIfWideChars) {
    EXPECT_EQ(L"\"\\u9EC4\\u74DC\"", json_spirit::write_string(wmValue(L"\u9EC4\u74DC"), false));
    EXPECT_NE("\"\\u9EC4\\u74DC\"", json_spirit::write_string(mValue("\u9EC4\u74DC"), false));
}

Unfortunately the JSON serialization code in CukeBins is ugly, so I'll try to refactor it while fixing the bug.

@ghost ghost assigned paoloambrosio Mar 28, 2012
@paoloambrosio
Member

I've done a few tests. The components that might have problems are the wire protocol codec (currently using JSON Spirit) and the regular expression matcher. C++ support for Unicode and regular expressions has only been standardized with C++0x, which is still not an option. In C++03 source code encoding is ASCII only, so even CukeBins regular expressions would have to be written using the \u escape character and wide strings. Please note that MSVC is an example of a compiler where wchar_t is 16 bits, so using wide strings would not solve the problem. My proposal for the moment is to treat every char (8-bit) sequence as UTF-8, which both MSVC and GCC handle, and...

JSON Spirit

  • convert every string to wchar_t before decoding or encoding
  • if unicode support is disabled, fail on non-ASCII codes

Boost 1.48+ comes with the new Locale library, which handles UTF quite well. I still haven't come to a conclusion on how to deal with the conversion without ICU or Boost Locale. I might introduce a new dependency on ICU for full Unicode support (with any Boost version), or require Boost 1.48+ without ICU for partial support.

Edit: since JSON \u escapes use UTF-16/UCS-2 code units, like JavaScript, and since we don't care about counting, surrogate pairs should not be a problem, so I removed the case where wchar_t can't hold UTF-32 code points.
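
As an illustration of the encoding side of this proposal, the following is a rough sketch of writing code points as JSON `\uXXXX` escapes, with a surrogate pair for anything above U+FFFF. The names `appendEscape` and `escapeCodePoints` are hypothetical, this is not how JSON Spirit is implemented, and quotes, backslashes and control characters are deliberately left unescaped to keep the sketch short.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: appends one 16-bit code unit as a JSON \uXXXX escape.
void appendEscape(std::string& out, unsigned long unit) {
    char buf[7];  // backslash, 'u', four hex digits, terminating NUL
    std::snprintf(buf, sizeof buf, "\\u%04lX", unit & 0xFFFFUL);
    out += buf;
}

// Hypothetical helper: ASCII passes through unchanged, everything else
// becomes one escape (BMP) or two escapes (a UTF-16 surrogate pair).
// Assumption: the input holds valid code points up to U+10FFFF.
std::string escapeCodePoints(const std::vector<unsigned long>& codePoints) {
    std::string out;
    for (std::size_t i = 0; i < codePoints.size(); ++i) {
        unsigned long cp = codePoints[i];
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x10000UL) {
            appendEscape(out, cp);
        } else {
            cp -= 0x10000UL;
            appendEscape(out, 0xD800UL + (cp >> 10));    // high surrogate
            appendEscape(out, 0xDC00UL + (cp & 0x3FFUL)); // low surrogate
        }
    }
    return out;
}
```

With this sketch, the code points U+0044 and U+0105 come out as `D\u0105`, matching the encoding expected earlier in the thread.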

Regex

  • use boost::u32regex if Boost is compiled with ICU support
  • use boost::wregex if wchar_t is 16 bits and fail on surrogate pairs
  • use boost::regex and fail on non-ASCII codes

Here is a brief explanation of Boost Regex Unicode support.

@muggenhor muggenhor modified the milestones: v0.4.1, v0.4 Jun 3, 2017
@muggenhor muggenhor removed this from the v0.5 milestone May 23, 2018
rkk-ableton pushed a commit to AbletonAppDev/cucumber-cpp that referenced this issue Jun 17, 2019
`cucumber-ruby` expects position values which are based on the index of the
code-point instead of the index of the byte. This change modifies the value
returned to `cucumber-ruby` before transmitting the resulting JSON over the
WireProtocol.

Prior to this change, the regex match's position (based on the match's position
in the byte array) would cause an `index out of string` error and crash
cucumber-ruby when pretty-printing the results of a test.

Closes cucumber#40.
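
For readers who want to see what such a byte-to-code-point position conversion can look like, here is a minimal sketch. The function name `codePointIndex` is hypothetical and this is not the code from the referenced commit; it assumes the text is valid UTF-8 and that the byte offset falls on a character boundary.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not the code from the referenced commit): converts a
// byte offset into a UTF-8 string to the index of the code point that starts
// at that offset, by counting the non-continuation bytes that precede it.
std::size_t codePointIndex(const std::string& utf8, std::size_t byteOffset) {
    std::size_t index = 0;
    for (std::size_t i = 0; i < byteOffset && i < utf8.size(); ++i) {
        // Continuation bytes have the bit pattern 10xxxxxx; every other
        // byte starts a new code point.
        if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80) {
            ++index;
        }
    }
    return index;
}
```

For pure ASCII text the two positions are identical; they only diverge once a multi-byte character appears before the match, which is why the problem shows up with names such as Dąślężynów.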