[BUG]: cupy, and by extension cudf changes the system's preferred encoding to ANSI_X3.4-1968 #859

Closed
2 tasks done
dagardner-nv opened this issue Apr 7, 2023 · 1 comment · Fixed by #950
Assignees
Labels
bug Something isn't working

Comments

@dagardner-nv
Contributor

dagardner-nv commented Apr 7, 2023

Version

23.07

Which installation method(s) does this occur on?

Docker, Conda, Source

Describe the bug.

Calling various cupy and cudf methods changes the system's preferred encoding from UTF-8 to ANSI_X3.4-1968 (an alias for ASCII).
cupy/cupy#7514
rapidsai/cudf#13085

The problem is that any code called afterwards that reads a UTF-8 data source without explicitly setting the encoding will fail if the data contains non-ASCII characters.

Minimum reproducible example

import locale

import cupy as cp

print(locale.getpreferredencoding()) # UTF-8
cpa = cp.arange(0, 10)
print(locale.getpreferredencoding()) # ANSI_X3.4-1968

with open('models/training-tuning-scripts/sid-models/resources/bert-base-cased-vocab.txt') as fh: # contains unicode chars
    contents = fh.read()

Relevant log output

UTF-8
ANSI_X3.4-1968

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[1], line 10
      7 print(locale.getpreferredencoding()) # ANSI_X3.4-1968
      9 with open('models/training-tuning-scripts/sid-models/resources/bert-base-cased-vocab.txt') as fh: # contains unicode chars
---> 10     contents = fh.read()

File ~/work/conda/envs/morpheus/lib/python3.8/encodings/ascii.py:26, in IncrementalDecoder.decode(self, input, final)
     25 def decode(self, input, final=False):
---> 26     return codecs.ascii_decode(input, self.errors)[0]

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1323: ordinal not in range(128)
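
The decode failure itself does not need cupy or a GPU to reproduce: it follows from decoding non-ASCII UTF-8 bytes with the ASCII codec. A minimal sketch, assuming an illustrative sample string (not the contents of the vocab file), of why a byte such as 0xc2 triggers the error:

```python
# Illustrative sample text; the real vocab file's contents are not reproduced here.
sample = "copyright \u00a9 2023"
raw = sample.encode("utf-8")  # U+00A9 is encoded as the two bytes 0xC2 0xA9

# Decoding with the matching codec round-trips the text.
decoded_utf8 = raw.decode("utf-8")

# Decoding with ASCII fails on the first byte outside range(128).
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as err:
    ascii_error = err

print(decoded_utf8 == sample)
print(ascii_error)
```

This is what happens implicitly when `open()` is called without an `encoding` argument after the preferred encoding has switched to ASCII.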

Full env printout

No response

Other/Misc.

No response

Code of Conduct

  • I agree to follow Morpheus' Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
@dagardner-nv dagardner-nv added the bug Something isn't working label Apr 7, 2023
@github-actions github-actions bot added the Needs Triage Need team to review and classify label Apr 7, 2023
@dagardner-nv dagardner-nv removed the Needs Triage Need team to review and classify label Apr 7, 2023
dagardner-nv added a commit to dagardner-nv/Morpheus that referenced this issue Apr 10, 2023
@dagardner-nv dagardner-nv self-assigned this Apr 11, 2023
@dagardner-nv
Contributor Author

This is a known issue with NVRTC (cupy/cupy#7514 (comment)). Currently there are only two known work-arounds:

  1. Set LC_ALL="POSIX"
  2. Explicitly set the encoding when opening a file, e.g.: with open(vocab_path, encoding='UTF-8')
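
The second work-around can be sketched as follows. This is a minimal, self-contained example under stated assumptions: the temporary file and its contents are hypothetical stand-ins for the vocab file, and `read_text_utf8` is an illustrative helper name, not something from Morpheus.

```python
import os
import tempfile


def read_text_utf8(path):
    """Read a text file as UTF-8 regardless of the process's preferred encoding."""
    # With an explicit encoding, open() does not fall back to
    # locale.getpreferredencoding(), so a changed locale cannot affect the read.
    with open(path, encoding="UTF-8") as fh:
        return fh.read()


# Hypothetical stand-in for a vocab file containing non-ASCII characters.
expected = "na\u00efve\n\u00a9\n"
fd, vocab_path = tempfile.mkstemp(suffix=".txt")
try:
    with os.fdopen(fd, "w", encoding="UTF-8") as fh:
        fh.write(expected)
    contents = read_text_utf8(vocab_path)
finally:
    os.remove(vocab_path)

print(contents == expected)
```

Passing the encoding at every call site is more verbose than fixing the locale once, but it keeps the read correct even if a library changes the locale later in the process.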

@dagardner-nv dagardner-nv reopened this Apr 12, 2023
rapids-bot bot pushed a commit that referenced this issue Apr 28, 2023
This PR creates at least one test for each example containing custom stages.
This PR currently only covers those examples which do not require additional packages. 
Part of #849.

* Moves the bert vocabulary files to `morpheus/data` dir, no longer requiring them to be fetched from LFS and making them available to unittests.
* Fixes type hints and removes a redundant method in `examples/log_parsing/inference.py`
* Remove redundant copies of `bert-base-cased-hash.txt` and `bert-base-uncased-hash.txt` files, replacing them with symlinks to the files in the `morpheus/data` dir, fixes #850
* Explicitly set `encoding='UTF-8'` in `examples/log_parsing/postprocessing.py` as a work-around for issue #859 
* Add `py::kw_only` to Python bindings for `TensorMemory` and subclasses to ensure parity with Python impls.
* Set `repr=False` for the `tensors` field of `TensorMemory` to avoid a bug when printing, caused by the fact that we assign the value to `self._tensors`
* Seed cupy's random number generator in `manual_seed` method.
* Fix usage of the `reload_modules` fixture: requesting a reload of multiple modules should be done with `@pytest.mark.reload_modules([mod1, mod2])`, not by calling `reload_modules` twice.
* New test data in `tests/tests_data/log_parsing` is based upon the first 5 rows of data from `models/datasets/validation-data/log-parsing-validation-data-input.csv`

Authors:
  - David Gardner (https://github.com/dagardner-nv)

Approvers:
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: #885
@dagardner-nv dagardner-nv mentioned this issue May 15, 2023
3 tasks
@rapids-bot rapids-bot bot closed this as completed in #950 May 19, 2023
@rapids-bot rapids-bot bot closed this as completed in 3eec11d May 19, 2023
Projects
Status: Done