Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADIF import fails to read non-ASCII UTF-8 characters correctly #813

Closed
mikaelnousiainen opened this issue Jan 11, 2021 · 8 comments
Closed

Comments

@mikaelnousiainen
Copy link
Contributor

mikaelnousiainen commented Jan 11, 2021

Describe the bug

ADIF import fails to read non-ASCII UTF-8 characters correctly. The issue I'm experiencing seems exactly like #321

Additionally, if the non-ASCII char ('ä' in this example) is the last character in the any field (e.g. comment), it results in invalid UTF-8 to be sent to the DB and the import fails with error:

A Database Error Occurred
Error Number: 1366

Incorrect string value: '\xC3' for column `cloudlog_oh3bhx`.`TABLE_HRD_CONTACTS_V01`.`COL_COMMENT` at row 1

To Reproduce
Steps to reproduce the behaviour:

  1. Have non-ASCII UTF-8 chars in ADIF comment or QTH field, e.g. "Hämeenlinna" in the comment.
  2. Import the ADIF file
  3. Check QSO details -> The example will result in the last char missing: "Hämeenlinn"

Expected behaviour
All ADIF fields with UTF-8 characters will be imported correctly.

Desktop (please complete the following information):

  • OS: Fedora Linux 32
  • Browser: Chromium 87.0.4280.88

Additional context

I've checked that the ADIF contains valid UTF-8 chars and that the MariaDB tables use the default charset of utf8mb4. Something else goes wrong in parsing the ADIF file.

I'm running the latest master, downloaded on Jan 11th.

@mikaelnousiainen
Copy link
Contributor Author

Question: Do system locale settings determine what charset Cloudlog uses to parse uploaded files? I'm running Cloudlog in a Docker container and haven't set any locale settings, so it's defaulting to POSIX locale.

I'm also seeing that there's a charset setting (with default value of UTF-8) in Cloudlog config, which I'd assume to override any system locale settings...

@mikaelnousiainen
Copy link
Contributor Author

I've tried to change the locale to a UTF-8 one -- didn't fix the issue. Any ideas?

@mikaelnousiainen
Copy link
Contributor Author

mikaelnousiainen commented Jan 13, 2021

@magicbug I may have found the issue in ADIF parsing (at least) on line:

https://github.com/magicbug/Cloudlog/blob/master/application/libraries/Adif_parser.php#L163

UTF-8 strings can NOT be split into substrings reliably by index access like this: $value = $value.$record[$a];

If I've understood PHP string correctly (not being a PHP developer), all indexed access to the data should use mb_substr or a similar function:

https://stackoverflow.com/questions/6315750/wrong-output-when-using-array-indexing-on-utf-8-string

As multiple fields on a line in ADIF log may contain UTF-8 chars, it's probably necessary to use multibyte-compatible substr in all places where index access is used currently. The length of each field indicated by the ADIF format is the number of characters, not the number of bytes.

@AndreasK79
Copy link
Contributor

I think you are on the right track. I tried to modify the code and use mb_substr instead. It worked with the test I did.
I did these modifications:
image

Before on the left, after on the right.
If you can test this out yourself and see if that works better, I can submit a patch for it.

@mikaelnousiainen
Copy link
Contributor Author

Thanks for testing! However, I think all of the index references to the ADIF line content need to be changed to support multi-byte charsets. I'll test this and try to push out a PR.

@mikaelnousiainen
Copy link
Contributor Author

@magicbug @AndreasK79 There's now an initial attempt to fix this in PR #830 -- feel free to clean up the code if you wish to :)

@AndreasK79
Copy link
Contributor

@mikaelnousiainen great work. I will test it out later.

@magicbug
Copy link
Owner

Merged into the code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants