Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SimpleIon READ ERROR, 10 unit tests failed consistently. #249

Open
cheqianh opened this issue Mar 14, 2023 · 4 comments
Open

SimpleIon READ ERROR, 10 unit tests failed consistently. #249

cheqianh opened this issue Mar 14, 2023 · 4 comments
Labels

Comments

@cheqianh
Copy link
Contributor

cheqianh commented Mar 14, 2023

There are some read issues when using the simpleion module.

Example failed CI/CD workflow - https://github.com/amazon-ion/ion-python/actions/runs/4410728439/jobs/7728500690

1 failed unit test within ion-python is:

test_loads_unicode_utf8_conversion():

9 failed tests from ion-tests are:

bad/clobWithNonAsciiCharacter_ion
bad/utf8/shortUtf8Sequence_3_ion
bad/utf8/outOfUnicodeBounds_1_ion
bad/utf8/outOfUnicodeBounds_2_ion
bad/utf8/wrongUtf8LeadingBits_1_ion
bad/utf8/wrongUtf8LeadingBits_3_ion
bad/utf8/wrongUtf8LeadingBits_2_ion
bad/utf8/shortUtf8Sequence_2_ion
bad/utf8/shortUtf8Sequence_1_ion

Took a look, the root seems to be the recent ion-c change. Need further investigation.

@cheqianh cheqianh added the bug label Mar 14, 2023
@cheqianh cheqianh changed the title SimpleIon READ ERROR, 10 unit tests failed. SimpleIon READ ERROR, 10 unit tests failed consistently. Mar 14, 2023
@cheqianh
Copy link
Contributor Author

cheqianh commented Mar 15, 2023

One example repro:

# data = ['-', test, test, test, test, test, ... ,test]
data = "[ '''\u2013''', "
data += 'test, ' * 3000
data += "test ]"

# below throws IERR_INVALID_SYNTAX
simpleion.loads(data, parse_eagerly=True, text_buffer_size_limit=10000)

Detailed traceback:

ionc.ionc_read() --- simpleion.py#L516
ion_reader_next() in simpleIon --- ioncmodule.c#L1345
ion_reader_next() in ion-c --- ion_reader.c#L535
_ion_reader_next_helper --- ion_reader.c#L549
_ion_reader_text_next --- ion_reader_text.c#L199
_ion_reader_text_check_follow_token --- ion_reader_text.c#L234
IERR_INVALID_SYNTAX --- ion_reader_text.c#L701

@cheqianh
Copy link
Contributor Author

#250 mitigate this issue by building C extension utilizing ion-c 1.1.0. The commit should be reverted once this issue is solved.

@nirosys
Copy link
Contributor

nirosys commented Mar 24, 2023

After some digging, and applying an ion-c fix that is about to be submitted, I was able to fix the single ion-python unit test (non-ion-tests unit tests). The ion-tests failures took some digging to figure out what was going on.

I'm not sure yet why the change to ion-c v1.1.1 triggered these tests to break, but looking at the failures, this is what I found.

I updated ion_read_file_stream_handler in order to add some logging so I could see what was happening:

--- a/amazon/ion/ioncmodule.c
+++ b/amazon/ion/ioncmodule.c
@@ -1361,6 +1361,11 @@ iERR ion_read_file_stream_handler(struct _ion_user_stream *pstream) {
     PyObject *py_buffer = PyObject_CallMethod(stream_handle->py_file, "read", "O", IONC_STREAM_BYTES_READ_SIZE);
 
     if (py_buffer == NULL) {
+        printf("[ion_read_file_stream_handler] py_buffer == NULL, py_file: \n");
+        PyObject_Print(stream_handle->py_file, stdout, 0);
+        printf("\nException:\n");
+        PyObject_Print(PyErr_Occurred(), stdout, 0);
+        printf("\n");
         pstream->limit = NULL;
         FAILWITH(IERR_READ_ERROR);
     }

This resulted in the following output:

[ion_read_file_stream_handler] py_buffer == NULL, py_file: 
<_io.TextIOWrapper name='/Users/glitch/Code/ion-python/vectors/iontestdata/bad/utf8/shortUtf8Sequence_1.ion' mode='r' encoding='utf-8'>
Exception:
<class 'UnicodeDecodeError'>

So, right from the start we're not able to read from the file because the underlying TextIOWrapper has raised a UnicodeDecodeError exception. With this error, the unit test immediately fails, all before ion-c even saw any of the data.

The thing to note here, is that the test data that we're trying to give to ion-c is known to be invalid UTF-8, and the reason for the test is to ensure the parser (ion-c in our case) identifies it as such.

This led me to look at how the files were being opened, and I found the _open function defined in test_vectors.py. So it looks like we were intentionally specifying the encoding. This could be intended for the pure python implementation, but should probably not be used for ion-c. Partly as stated above, but also there are UTF-16 and UTF-32 files within ion-tests, and ion-c does not support anything other than UTF-8. The open function here will open other encodings, and unless I'm mistaken, convert it to UTF-8 as it parses it into python strings causing double validation (python decoding, and ion-c parsing).

I believe not too long ago we discussed only supporting UTF-8, and I believe that was the consensus, but I'm not sure.

I have a PR in the works to update the _open function, and add the UTF-16 & UTF-32 files to the skip list, but we might want to have some more discussion around that. I'm going to dig a little more in order to see if I can identify why the tests pass prior to v1.1.1, then I'll post the PR.

@nirosys
Copy link
Contributor

nirosys commented Mar 27, 2023

I spent more time looking into why the ion-c v1.1.1 release would have changed the behavior of the unit test. A bugfix that was included in v1.1.1 changed how user managed stream handlers were called and did not bubble up read errors. This resulted in a bug where an error reading from the handler would be detected as EOF but only after calling next, which resulted in ion_reader_open_stream returning successfully even if the handler did not.

The python decoder exception described above occurs with both v1.1.0 and v1.1.1, however prior to v1.1.1, if the stream handler returned an error on first call, it would result in ion_reader_open_stream returning an error, which would cause the ion-c python extension to raise an IonException which would satisfy the unit test. This would overwrite the decoder's exception and is why, I assume, it wasn't noticed before.

With the v1.1.1 change, that initial error from ion_reader_open_stream was not returned, which led the ion-c extension to return an iterator. As soon as ionc_read returned python would pick up on the raised exception from the TextIOWrapper's decoder and bubble it up. The raised UnicodeDecodeError was not expected by the unit test causing it to fail.

I have a PR that I'll be submitting shortly for ion-c that addresses the very first bug fix (that addressed the non-ion-test unit test failure), along with an update to address the read error bug. That alone will fix the unit test failures, but I also have an ion-python PR that I'll submit to remove the use of text decoding when using the C extension.

It would probably also be a good idea to check for exceptions after read attempts in the C extension. If an underlying IO exception occurs, it would be nice to make sure that gets bubbled to the user, rather than just an ion related exception. I might include that in the PR as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants