PrePARE UnicodeDecodeError when scanning dummy NO_ATTRIBUTES.nc netCDF file #643

Closed
atmodatcode opened this issue Jan 14, 2022 · 4 comments · Fixed by #644

Comments

@atmodatcode

Hello,
while exploring how the PrePARE checker works, I tried it out on various files.
It works perfectly on a ps_3hr_MPI-ESM1-2-HR_historical_r6i1p1f1_gn_201007120000-201007122100.nc file, but it fails with a strange UnicodeDecodeError when scanning the dummy netCDF file https://raw.githubusercontent.com/AtMoDat/demo_data/main/NO_ATTRIBUTES.nc.

PrePARE.py --variable ps NO_ATTRIBUTES.nc --table-path /pool/data/CMIP6/cmip6-cmor-tables/Tables/CMIP6_3hr.json
Traceback (most recent call last):
  File "/mnt/lustre01/work/bm0021/conda-envs/quality-assurance/lib/python3.9/site-packages/cmip6_cv/PrePARE/PrePARE.py", line 1022, in <module>
    main()
  File "/mnt/lustre01/work/bm0021/conda-envs/quality-assurance/lib/python3.9/site-packages/cmip6_cv/PrePARE/PrePARE.py", line 935, in main
    log_text = f.read()
  File "/mnt/lustre01/work/bm0021/conda-envs/quality-assurance/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9f in position 18771: invalid start byte

The dummy file is a netCDF file with no metadata, but it is otherwise a valid netCDF file.
For example, this is the output when scanning it with the latest CEDA CF Checker:

cfchecks NO_ATTRIBUTES.nc
CHECKING NetCDF FILE: NO_ATTRIBUTES.nc
=====================
Using CF Checker Version 4.1.0
Checking against CF Version CF-1.8
Using Standard Name Table Version 78 (2021-09-21T11:55:06Z)
Using Area Type Table Version 10 (23 June 2020)
Using Standardized Region Name Table Version 4 (18 December 2018)
[...]
ERRORS detected: 0
WARNINGS given: 9
INFORMATION messages: 2

Thanks and best regards

@matthew-mizielinski

matthew-mizielinski commented Jan 18, 2022

Hi @atmodatcode, the file you are working with looks to be a long way from what we would expect to be passed in.

If you give the file a name that has some of the information expected, even just ps_3hr.nc rather than NO_ATTRIBUTES.nc, and run PrePARE --table-path <location> ps_3hr.nc, you'll get past the failure.

The code itself could probably handle this if the two open(logfile, 'r') calls in PrePARE.py (lines 934 and 958) were modified to include encoding='utf8', errors='ignore' arguments. I'm guessing that the C code is outputting some characters that upset Python 3.
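
A minimal sketch of the suggestion above, assuming the log is read in one go as in the traceback (`log_text = f.read()`). The helper name `read_log_text` and the temporary demo file are illustrative and not part of PrePARE. The `errors` argument defaults to `'ignore'` as suggested; `'replace'` is shown as an alternative that keeps bad bytes visible as U+FFFD.

```python
import os
import tempfile


def read_log_text(logfile, errors="ignore"):
    """Read a log file as UTF-8 without raising on invalid bytes.

    Hypothetical helper, not part of PrePARE.
    """
    with open(logfile, "r", encoding="utf8", errors=errors) as f:
        return f.read()


# Build a small demo log containing the stray byte 0x9f from the traceback.
with tempfile.NamedTemporaryFile(delete=False, suffix=".log") as tmp:
    tmp.write(b'Your experiment_id "\x9f" defined in your input file\n')
    demo_log = tmp.name

try:
    ignored = read_log_text(demo_log)                     # invalid byte is dropped
    replaced = read_log_text(demo_log, errors="replace")  # invalid byte becomes U+FFFD
finally:
    os.remove(demo_log)

print(repr(ignored))
print(repr(replaced))
```

Either setting only hides the symptom in the Python reader; the invalid byte would still be written to the log by whatever produced it.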

@mauzey1
Collaborator

mauzey1 commented Jan 21, 2022

I have discovered that an error message was being generated from a string that was never initialized, because the experiment_id attribute was missing from the dataset. This introduced an invalid character into the error message, which caused PrePARE to crash when it tried to read the log file.

cmor/Src/cmor_CV.c

Lines 1035 to 1038 in df7fc34

snprintf(msg, CMOR_MAX_STRING,
         "Your experiment_id \"%s\" defined in your input file\n! "
         "could not be found in your Control Vocabulary file.(%s)\n! ",
         szExperiment_ID, CV_Filename);

I have added some code to check if the attribute exists before proceeding to code that requires the attribute's value. I will create a pull request to merge it.
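
The actual guard lives in C (cmor/Src/cmor_CV.c) and is in the linked pull request. As a rough Python analogue only, the idea is to test for the attribute before building any message that interpolates its value. The function name, the `attrs` dictionary, the wording of the missing-attribute message, and the `CMIP6_CV.json` placeholder are all hypothetical.

```python
def check_experiment_id(attrs, cv_experiment_ids, cv_filename):
    """Return an error message, or None if experiment_id is valid.

    Hypothetical analogue of the C-side guard; not CMOR code.
    """
    experiment_id = attrs.get("experiment_id")
    if experiment_id is None:
        # Attribute is absent: report that, and never format a message
        # around a value that does not exist.
        return "Required global attribute 'experiment_id' is missing."
    if experiment_id not in cv_experiment_ids:
        return (
            f'Your experiment_id "{experiment_id}" defined in your input file\n'
            f"could not be found in your Control Vocabulary file.({cv_filename})"
        )
    return None


known_ids = {"historical"}  # stand-in for the controlled vocabulary
missing = check_experiment_id({}, known_ids, "CMIP6_CV.json")
unknown = check_experiment_id({"experiment_id": "bogus"}, known_ids, "CMIP6_CV.json")
ok = check_experiment_id({"experiment_id": "historical"}, known_ids, "CMIP6_CV.json")
```

With the early return in place, the "could not be found" message is only ever built from a value that was actually read from the file.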

@taylor13
Collaborator

I wonder if there are cases similar to this, since the original coder may not have anticipated handling such an egregiously out-of-conformance file. I realize this only occurred because experiment_id was missing, so maybe there are no other instances of embedded dependencies like this.

@durack1
Contributor

durack1 commented Jan 22, 2022

@taylor13 agreed, it would be great if we could catch such an error and stop. At the very least, what @matthew-mizielinski suggests and what @mauzey1 has implemented is a step in the right direction.

@mauzey1 mauzey1 mentioned this issue Aug 18, 2022
5 participants