Problem with reading unicode text from .features file #40

PiotrNakonczy-TomTom · 2012-03-26T12:16:12Z

I have a UTF-8 encoded feature definition file:

# language: en

  Feature: 
    ...
  Scenario Outline: 

    ... step defs ...

  Examples:
    |sth|sth|expected_value|
    |...|...|Dąślężynów,Oslo,Stockholm|

I'm using CukeBins with gtest. I want to compare some UTF-8 encoded text with the string extracted from the 'expected_value' column. The problem is that when I extract the text from 'expected_value' with:
REGEX_PARAM(std::string, expected_value);
I get 'DÄ�Å�lÄ�Å¼ynÃ³w'.

After digging through the CukeBins code I managed to get closer to the source of the problem.
Here is a log of what Cucumber is sending and receiving around the moment when the problem appears:

...
cucumber-send:
["step_matches",{"name_to_match":"the labels are present on the tile Dąślężynów,Oslo,Stockholm"}]
cucumber-raw_response:
["success",[{"args":[{"pos":35,"val":"D\u00C4\u0085\u00C5\u009Bl\u00C4\u0099\u00C5\u00BCyn\u00C3\u00B3w,Oslo,Stockholm"}],"id":"7","source":"test_steps.cpp:76"}]]
cucumber-send:
["invoke",{"id":"7","args":["DÄ�Å�lÄ�Å¼ynÃ³w,Oslo,Stockholm"]}]
...

Anyway, I found an ugly workaround that doesn't seem to break anything: I switched off part of the character escaping that json_spirit performs when it writes the response to the stream:

File json_spirit_writer_template.h:

    template< class String_type >
    String_type add_esc_chars( const String_type& s )
    {
        typedef typename String_type::const_iterator Iter_type;
        typedef typename String_type::value_type     Char_type;

        String_type result;

        const Iter_type end( s.end() );

        for( Iter_type i = s.begin(); i != end; ++i )
        {
            const Char_type c( *i );

            if( add_esc_char( c, result ) ) continue;

            const wint_t unsigned_c( ( c >= 0 ) ? c : 256 + c );

            // This is commented out due to LBSR-2427 - issue with unicode national chars.

//            if( iswprint( unsigned_c ) )
//            {
                result += c;
//            }
//            else
//            {
//                result += non_printable_to_string< String_type >( unsigned_c );
//            }
        }

        return result;
    }

OS: 64-bit Ubuntu Linux. Same problem appeared on 2 machines.


paoloambrosio · 2012-03-28T06:01:40Z

To me it seems like "D\u00C4\u0085\u00C5\u009Bl\u00C4\u0099\u00C5\u00BCyn\u00C3\u00B3w,Oslo,Stockholm" is the correct JSON encoding for "Dąślężynów,Oslo,Stockholm". I don't understand why Cucumber is sending rubbish back. I'll look into it, possibly later today.

PiotrNakonczy-TomTom · 2012-03-28T07:28:43Z

Actually, I think this encoded string is already wrong: look at the number of encoded characters.
For example, the letter 'ą' is 'LATIN SMALL LETTER A WITH OGONEK' according to http://www.utf8-chartable.de/unicode-utf8-table.pl?start=256, so its code point should be U+0105, yet it gets interpreted as two separate characters, U+00C4 and U+0085.
This makes me think that the problem appears when reading Unicode text from an iostream or something similar.
Basically, the iswprint() call does not recognize 'ą' and similar letters as printable characters.
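
To illustrate the mismatch described above, here is a simplified sketch that escapes each non-ASCII byte of a UTF-8 `std::string` on its own. That is roughly what the quoted `add_esc_chars` loop ends up doing when `iswprint` rejects a byte. The name `escape_bytewise` is hypothetical and not part of JSON Spirit; this is an illustration of the effect, not the library's implementation.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical illustration: treat every char as a separate character and
// escape each non-ASCII byte as \u00XX. A two-byte UTF-8 sequence therefore
// turns into two escapes instead of one.
std::string escape_bytewise(const std::string& utf8) {
    std::string out;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(utf8[i]);
        if (b < 0x80) {
            out += static_cast<char>(b);  // plain ASCII passes through
        } else {
            char buf[8];
            const int n = std::snprintf(buf, sizeof buf, "\\u%04X",
                                        static_cast<unsigned>(b));
            assert(n == 6);  // backslash, 'u', four hex digits
            out += buf;
        }
    }
    return out;
}
```

Feeding it the UTF-8 bytes of "Dąślężynów" (written with hex escapes so the source file stays ASCII) gives the same escaped string as the raw response in the log, whereas a code-point-aware encoder would start with "D\u0105".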

paoloambrosio · 2012-03-28T07:53:13Z

You are right: it should have been "D\u0105..."

paoloambrosio · 2012-03-28T17:16:07Z

The bug looks like a JSON Spirit issue in dealing with Unicode characters when using 8-bit characters. This test shows the library behavior:

TEST(JsonSpiritTest, handlesUnicodeOnlyIfWideChars) {
    EXPECT_EQ(L"\"\\u9EC4\\u74DC\"", json_spirit::write_string(wmValue(L"\u9EC4\u74DC"), false));
    EXPECT_NE("\"\\u9EC4\\u74DC\"", json_spirit::write_string(mValue("\u9EC4\u74DC"), false));
}

Unfortunately the JSON serialization code in CukeBins is ugly, so I'll try to refactor it while fixing the bug.

paoloambrosio · 2012-04-09T08:22:56Z

I've done a few tests. The components that might have problems are the wire protocol codec (currently using JSON Spirit) and the regular expression matcher. C++ support for Unicode and regular expressions has been standardized only in C++0x, which is still not an option. In C++03 the source code encoding is ASCII only, so even CukeBins regular expressions would have to be written with \u escapes and wide strings. Note that MSVC is an example of a compiler where wchar_t is 16 bits, so using wide strings would not solve the problem. My proposal for the moment is to treat every char (8-bit) sequence as UTF-8, which both MSVC and GCC handle, and...

JSON Spirit

- convert every string to wchar_t before decoding or encoding
- if unicode support is disabled, fail on non-ASCII codes

Boost 1.48+ comes with the new Locale library, which handles UTF quite well. I still haven't come to a conclusion on how to deal with the conversion without ICU or Boost Locale. I might introduce a new dependency on ICU for full Unicode support (with any Boost version), or require Boost 1.48+ without ICU for partial support.

Edit: since JSON is encoded in UTF-16/UCS-2 like JavaScript, and since we don't care about counting, surrogate pairs should not be a problem, so I removed the case where wchar_t can't hold UTF-32 code points.
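
As a rough sketch of the encoding direction discussed here, the code below decodes a UTF-8 `std::string` into code points and writes each non-ASCII one as a JSON `\uXXXX` escape, using a surrogate pair above U+FFFF. The function names are hypothetical and nothing here comes from JSON Spirit, ICU or Boost Locale. It assumes mostly well-formed input (it rejects bad lead and continuation bytes but does not check for overlong forms) and leaves ASCII untouched, so quotes, backslashes and control characters would still need the usual JSON escaping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical helper: append one code point as JSON \uXXXX escape(s).
void append_json_escape(unsigned long cp, std::string& out) {
    assert(cp <= 0x10FFFF);
    char buf[16];
    if (cp >= 0x10000) {
        // Outside the BMP: split into a UTF-16 surrogate pair.
        cp -= 0x10000;
        std::snprintf(buf, sizeof buf, "\\u%04lX\\u%04lX",
                      0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
    } else {
        std::snprintf(buf, sizeof buf, "\\u%04lX", cp);
    }
    out += buf;
}

// Hypothetical encoder: ASCII passes through, everything else is decoded
// from UTF-8 and escaped per code point (not per byte).
std::string escape_utf8_for_json(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) { out += utf8[i]; ++i; continue; }

        std::size_t len;
        unsigned long cp;
        if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > utf8.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(utf8[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        append_json_escape(cp, out);
        i += len;
    }
    return out;
}
```

With this approach the bytes C4 85 come out as a single `\u0105`, and the result does not depend on whether wchar_t is 16 or 32 bits, because wchar_t is never involved.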

Regex

- use boost::u32regex if Boost is compiled with ICU support
- use boost::wregex if wchar_t is 16 bits and fail on surrogate pairs
- use boost::regex and fail on non-ASCII codes

Here is a brief explanation of Boost Regex Unicode support.
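
The two "fail on ..." fallbacks in the Regex list could be guarded by checks along these lines. This is only a sketch: `require_ascii` and `has_surrogates` are hypothetical names, Boost itself is deliberately not used, and how CukeBins would report the failure is left open.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical guard for the plain boost::regex fallback:
// reject any byte outside the ASCII range.
void require_ascii(const std::string& s) {
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (static_cast<unsigned char>(s[i]) >= 0x80)
            throw std::invalid_argument(
                "non-ASCII input needs Unicode regex support");
    }
}

// Hypothetical guard for the boost::wregex fallback with a 16-bit wchar_t:
// report whether the string contains UTF-16 surrogate code units
// (U+D800..U+DFFF), which a per-wchar_t matcher cannot treat as one character.
bool has_surrogates(const std::wstring& s) {
    for (std::wstring::size_type i = 0; i < s.size(); ++i) {
        const unsigned long u = static_cast<unsigned long>(s[i]);
        if (u >= 0xD800 && u <= 0xDFFF) return true;
    }
    return false;
}
```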

`cucumber-ruby` expects position values which are based on the index of the code-point instead of the index of the byte. This change modifies the value returned to `cucumber-ruby` before transmitting the resulting JSON over the WireProtocol. Prior to this change, the regex match's position (based on the match's position in the byte array) would cause an `index out of string` error and crash cucumber-ruby when pretty-printing the results of a test. Closes cucumber#40.
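
For illustration, a byte offset into a UTF-8 string can be turned into a code-point index by counting the bytes that are not continuation bytes. The sketch below is not the code from the change referenced above; `codepoint_index` is a hypothetical name, and it assumes well-formed UTF-8 with `byte_pos` lying on a code-point boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of code points that start within the first
// byte_pos bytes of a UTF-8 string, i.e. the code-point index of the
// character beginning at byte_pos.
std::size_t codepoint_index(const std::string& utf8, std::size_t byte_pos) {
    assert(byte_pos <= utf8.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_pos; ++i) {
        // Continuation bytes look like 10xxxxxx and never start a code point.
        if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For a string whose first two characters each take two bytes followed by a space, a regex match at byte offset 5 maps to code-point index 3, which is the kind of value `cucumber-ruby` is said to expect.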


PiotrNakonczy-TomTom closed this as completed Mar 28, 2012

PiotrNakonczy-TomTom reopened this Mar 28, 2012

ghost assigned paoloambrosio Mar 28, 2012

muggenhor modified the milestones: v0.4.1, v0.4 Jun 3, 2017

muggenhor removed this from the v0.5 milestone May 23, 2018

rkk-ableton mentioned this issue Jun 17, 2019

Support step definitions with multi-byte characters #224

Merged


jermus67 closed this as completed in #224 Aug 4, 2021
