Desktop: PDF search text: Remove NULL characters early to avoid possible sync issues #9862

personalizedrefrigerator · 2024-02-05T22:23:56Z

Summary

We now skip OCR when a PDF already has embedded text. However, embedded PDF text may contain NUL characters (sample PDF). This pull request removes such NUL characters just after extracting search text from PDFs.

NUL characters are also known to cause search issues and possibly sync issues.
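
As a rough illustration of the idea (not the code from this pull request), a minimal sketch of a NUL-stripping step might look like the following. The name `replaceNulCharacters` and its optional `replacement` parameter are hypothetical and only exist for this example.

```typescript
// Hypothetical helper for illustration; not taken from the Joplin codebase.
// Replaces every NUL (U+0000) character in `text` with `replacement`.
// With the default replacement (an empty string), NUL characters are removed.
const replaceNulCharacters = (text: string, replacement = ''): string => {
	// split/join handles every occurrence, not just the first one.
	return text.split('\u0000').join(replacement);
};

// Made-up sample standing in for text extracted from a PDF.
const extractedText = 'Rumelhart\u0000 and McClelland\u0000\u0000';

// Strip the NUL characters entirely.
console.log(replaceNulCharacters(extractedText));

// Or substitute the Unicode replacement character (U+FFFD) instead.
console.log(replaceNulCharacters(extractedText, '\uFFFD'));
```

Running this right after text extraction, before the text is stored, would keep NUL characters out of both the search index and any synced data.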

This is a follow-up pull request to #9774.

Notes

The PDF text extraction logic is new. Both "FTS is broken when a note contains null characters" (#9775; see the comment on "Desktop: Resolves #9765: OCR: Use existing PDF text when available", #9764) and "Cannot Synchronize note with 0x00 (NULL byte) pasted in" (#5046) were caused by NUL characters in PDFs.
SQLite's documentation warns against using NUL characters in strings because they can cause functions like length and quote to behave unexpectedly.
The CommonMark spec states that NUL characters (U+0000) should be replaced with the replacement character "�" (U+FFFD).

Notes

It may also make sense to replace NUL characters when saving notes and other items. SQLite's documentation advises against using NUL characters in strings, as doing so may break functions like length and quote.

Testing

Automated tests should verify that:

Notes containing NUL characters are still searchable
Multiple NUL characters in the same string are replaced
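
The two checks above can be sketched roughly as follows. This is only an illustration: `stripNul` and `searchNotes` are hypothetical stand-ins, and the naive substring search here is an assumption that does not reflect how Joplin's SQLite-based full-text search actually works.

```typescript
// Hypothetical stand-ins for illustration only; not Joplin's search engine.
const stripNul = (text: string): string => text.split('\u0000').join('');

// Naive search: case-insensitive substring match over sanitized note bodies.
// Returns the indexes of the matching notes.
const searchNotes = (bodies: string[], query: string): number[] => {
	const needle = query.toLowerCase();
	const matches: number[] = [];
	bodies.forEach((body, index) => {
		if (stripNul(body).toLowerCase().indexOf(needle) !== -1) {
			matches.push(index);
		}
	});
	return matches;
};

// Made-up test data: the first note contains several NUL characters.
const noteWithNuls = 'Rumelhart\u0000 and McClelland\u0000\u0000';
const noteBodies = [noteWithNuls, 'Unrelated note'];

// Check 1: the note containing NUL characters is still found.
const searchResults = searchNotes(noteBodies, 'rumelhart');

// Check 2: every NUL character is gone, not just the first.
const sanitized = stripNul(noteWithNuls);
const allNulsRemoved = sanitized.indexOf('\u0000') === -1;

console.log(searchResults, allNulsRemoved);
```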

I have additionally done the following manual testing on Ubuntu 23.10:

Enable OCR
This PDF previously caused FTS issues due to NUL characters. Download it and attach it to a note.
Wait about 30 seconds for the resource to become searchable.
Search for "Rumelhart" (see original issue)
Verify that the PDF is shown as one of the results.

The "Unknown" character, �, does not seem to be a token separator. Thus, the NUL character test needed to be adjusted.

laurent22 · 2024-02-05T23:53:26Z

What would happen if some ocr data has already been saved with a NULL character, and then this data is synced? Do you know if it's going to break sync?

In which case maybe we should process any OCR text that's been generated recently to strip out NUL characters.

Also (but that's for later) we should probably have safeguards when serializing a sync item - if a null character is found we stop sync to prevent data corruption (provided these characters actually are a problem)
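
A safeguard along those lines could be sketched as below. This is a minimal example under stated assumptions: the `SyncItem` shape, `fieldsWithNul`, and `serializeOrThrow` are all hypothetical names, and JSON is used only as a placeholder for whatever serialization format the sync layer really uses.

```typescript
// Hypothetical item shape and function names; not Joplin's actual sync code.
type SyncItem = Record<string, string | number>;

// Returns the names of string fields that contain a NUL (U+0000) character.
const fieldsWithNul = (item: SyncItem): string[] => {
	return Object.keys(item).filter(key => {
		const value = item[key];
		return typeof value === 'string' && value.indexOf('\u0000') !== -1;
	});
};

// Throws before serializing, so that the caller can stop the sync
// instead of uploading data that may be truncated or corrupted.
const serializeOrThrow = (item: SyncItem): string => {
	const badFields = fieldsWithNul(item);
	if (badFields.length > 0) {
		throw new Error(`Refusing to serialize item: NUL character in ${badFields.join(', ')}`);
	}
	// JSON is a placeholder for the real serialization format.
	return JSON.stringify(item);
};
```

Failing loudly at serialization time would surface the problem on the device that holds the bad data, rather than on whichever client later downloads it.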

personalizedrefrigerator · 2024-02-06T00:21:36Z

What would happen if some ocr data has already been saved with a NULL character, and then this data is synced? Do you know if it's going to break sync?

From an initial test (Syncthing file system sync, encryption disabled), it doesn't seem to break sync. I'm now testing with Joplin Server sync. Edit: Joplin Server sync also works.

I don't know which sync target was being used in the original bug report — it's possible that this is only an issue with a certain sync target (e.g. WebDAV).

Edit 2: I am getting errors while searching on mobile (Cannot execute MATCH query: Search : unable to use function MATCH in the requested context). These seem to happen after attaching the problematic PDF. Because this happens even if OCR is disabled, this is likely a different issue (perhaps caused by the version of SQLite on the Android 7 device).

laurent22 · 2024-02-06T16:23:25Z

At FOSDEM someone showed me a sync error that indicated a corrupted sync item. I'm not sure if it's because he manually edited the item on Nextcloud or if it was due to this particular bug.

Could you maybe try this?

Sync from the desktop with Nextcloud and make sure one of the items has a NULL character
Then sync with the Android app and see if the item is correctly downloaded.

I'm wondering if the network library or something along the way fails on the NULL character.

personalizedrefrigerator · 2024-02-08T18:14:09Z

Could you maybe try this?

Sync from the desktop with Nextcloud and make sure one of the items has a NULL character
Then sync with the Android app and see if the item is correctly downloaded.

I can confirm that any text after a NUL character in a note fails to sync.

Resources seem to sync successfully, despite having NUL characters in their ocr_text.

This was tested with TheGood.Cloud (the first free public provider listed on nextcloud.com). It's very, very slow, however (attempting to attach a several-megabyte PDF timed out). For further tests, I might try self-hosting.

(Originally posted on Discord -- included here to make it easier to find in the future)

Desktop: Remove NULL characters early.

bc1705a

personalizedrefrigerator marked this pull request as draft February 5, 2024 23:02

Fix search engine test -- \x00 is no longer a separator

0d1844b

The "Unknown" character, �, does not seem to be a token separator. Thus, the NUL character test needed to be adjusted.

personalizedrefrigerator marked this pull request as ready for review February 5, 2024 23:17

laurent22 merged commit a906e73 into laurent22:dev Feb 6, 2024
10 checks passed
