[BUG]: cupy, and by extension cudf changes the system's preferred encoding to ANSI_X3.4-1968 #859

dagardner-nv · 2023-04-07T19:03:58Z

Version

23.07

Which installation method(s) does this occur on?

Docker, Conda, Source

Describe the bug.

Calling various cupy and cudf methods changes the system's preferred encoding from UTF-8 to ANSI_X3.4-1968.
cupy/cupy#7514
rapidsai/cudf#13085

As a result, any code that runs afterwards and reads a UTF-8 data source without explicitly setting the encoding will fail, because `open()` falls back to the preferred encoding when none is given.

Minimum reproducible example

import locale

import cupy as cp

print(locale.getpreferredencoding()) # UTF-8
cpa = cp.arange(0, 10)
print(locale.getpreferredencoding()) # ANSI_X3.4-1968

with open('models/training-tuning-scripts/sid-models/resources/bert-base-cased-vocab.txt') as fh: # contains unicode chars
    contents = fh.read()

Relevant log output

UTF-8
ANSI_X3.4-1968

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[1], line 10
      7 print(locale.getpreferredencoding()) # ANSI_X3.4-1968
      9 with open('models/training-tuning-scripts/sid-models/resources/bert-base-cased-vocab.txt') as fh: # contains unicode chars
---> 10     contents = fh.read()

File ~/work/conda/envs/morpheus/lib/python3.8/encodings/ascii.py:26, in IncrementalDecoder.decode(self, input, final)
     25 def decode(self, input, final=False):
---> 26     return codecs.ascii_decode(input, self.errors)[0]

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1323: ordinal not in range(128)
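The failure in the traceback can be illustrated without cupy. The following is a minimal sketch, not taken from the vocabulary file itself: it assumes a sample character (the pound sign, U+00A3) whose UTF-8 encoding begins with the same 0xc2 byte reported above, and shows that the ASCII codec rejects it while the UTF-8 codec decodes it.

```python
# Sample character chosen for illustration only; its UTF-8 encoding
# starts with 0xc2, the same byte reported in the traceback.
data = '\u00a3'.encode('UTF-8')  # two bytes: 0xc2 0xa3

# Decoding as ASCII fails because 0xc2 is outside the 0-127 range.
try:
    data.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding the same bytes as UTF-8 succeeds.
decoded = data.decode('UTF-8')

print(ascii_error)
print(decoded)
```

This is the same decode step that `fh.read()` performs once the preferred encoding has been switched to ANSI_X3.4-1968 (ASCII).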

Full env printout

No response

Other/Misc.

No response

Code of Conduct

I agree to follow Morpheus' Code of Conduct
I have searched the open bugs and have found no duplicates for this bug report

dagardner-nv · 2023-04-11T22:11:51Z

This is a known issue with NVRTC (cupy/cupy#7514 (comment)). Currently there are only two known work-arounds:

* Set LC_ALL="POSIX"
* Explicitly set the encoding when opening a file, e.g. `with open(vocab_path, encoding='UTF-8')`
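The second work-around can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in Morpheus: the helper name `read_text_utf8` and the temporary sample file are hypothetical, and the sample file stands in for a vocabulary file containing non-ASCII characters.

```python
import os
import tempfile


def read_text_utf8(path):
    """Hypothetical helper: read a text file as UTF-8 regardless of the locale."""
    # Passing encoding explicitly means open() never consults the
    # preferred encoding, so a changed locale cannot affect the read.
    with open(path, encoding='UTF-8') as fh:
        return fh.read()


# Stand-in for a data file containing a non-ASCII character.
sample = 'price: \u00a3100\n'
with tempfile.NamedTemporaryFile('w', encoding='UTF-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

try:
    contents = read_text_utf8(tmp_path)
finally:
    os.remove(tmp_path)

print(contents == sample)
```

Setting the encoding at every call site is more verbose than changing the environment, but it keeps each read independent of whatever the process locale happens to be at that point.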

This PR creates at least one test for each example containing custom stages. This PR currently only covers those examples which do not require additional packages. Part of #849.

* Moves the bert vocabulary files to the `morpheus/data` dir, no longer requiring them to be fetched from LFS and making them available to unittests.
* Fixes type hints and removes a redundant method in `examples/log_parsing/inference.py`
* Removes redundant copies of the `bert-base-cased-hash.txt` and `bert-base-uncased-hash.txt` files, replacing them with symlinks to the files in the `morpheus/data` dir, fixes #850
* Explicitly sets `encoding='UTF-8'` in `examples/log_parsing/postprocessing.py` as a work-around for issue #859
* Adds `py::kw_only` to Python bindings for `TensorMemory` and subclasses to ensure parity with Python impls.
* Sets `repr=False` for the `tensors` field of `TensorMemory`, which avoids a bug when printing due to the fact that we assign the value to `self._tensors`
* Seeds cupy's random number generator in the `manual_seed` method.
* Fixes usage of the `reload_modules` fixture: requesting a reload of multiple modules should be done with `@pytest.mark.reload_modules([mod1, mod2])`, not by calling `reload_modules` twice.
* New test data in `tests/tests_data/log_parsing` is based upon the first 5 rows of data from `models/datasets/validation-data/log-parsing-validation-data-input.csv`

Authors:
- David Gardner (https://github.com/dagardner-nv)

Approvers:
- Michael Demoret (https://github.com/mdemoret-nv)

URL: #885

dagardner-nv added the bug Something isn't working label Apr 7, 2023

github-actions bot added the Needs Triage Need team to review and classify label Apr 7, 2023

dagardner-nv removed the Needs Triage Need team to review and classify label Apr 7, 2023

dagardner-nv added a commit to dagardner-nv/Morpheus that referenced this issue Apr 10, 2023

Explicitly set the encoding as a work-around for nv-morpheus#859

5985006

dagardner-nv self-assigned this Apr 11, 2023

dagardner-nv closed this as completed Apr 11, 2023

dagardner-nv mentioned this issue Apr 12, 2023

[FEA]: Explicitly set encoding parameter on all file open calls #872

Closed

2 tasks

dagardner-nv reopened this Apr 12, 2023

dagardner-nv mentioned this issue Apr 13, 2023

Create tests for examples with custom stages #885

Merged

3 tasks

mdemoret-nv mentioned this issue Apr 19, 2023

[FEA]: Add pylint to the CI Checks stage #896

Closed

2 tasks

dagardner-nv mentioned this issue May 15, 2023

Add Pylint to CI #950

Merged

3 tasks

rapids-bot bot closed this as completed in #950 May 19, 2023

rapids-bot bot closed this as completed in 3eec11d May 19, 2023
