Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Desktop: PDF search text: Remove NULL characters early to avoid possible sync issues #9862

Conversation

personalizedrefrigerator
Copy link
Collaborator

@personalizedrefrigerator personalizedrefrigerator commented Feb 5, 2024

Summary

We now skip OCR when a PDF already has embedded text. However, embedded PDF text may contain NUL characters (sample PDF). This pull request removes such NUL characters just after extracting search text from PDFs.

NUL characters are also known to cause search issues and possibly sync issues.

This is a follow-up pull request to #9774.

Notes

See also

Notes

Testing

Automated tests should verify that:

  1. Notes containing NUL characters are still searchable
  2. Multiple NUL characters in the same string are replaced

I have additionally done the following manual testing on Ubuntu 23.10:

  1. Enable OCR
  2. This PDF previously caused FTS issues due to NUL characters. Download it and attach it to a note.
  3. Wait about 30 seconds for the resource to become searchable.
  4. Search for "Rumelhart" (see original issue)
  5. Verify that the PDF is shown as one of the results.

@personalizedrefrigerator personalizedrefrigerator marked this pull request as draft February 5, 2024 23:02
The "Unknown" character, �, does not seem to be a token separator. Thus,
the NUL character test needed to be adjusted.
@personalizedrefrigerator personalizedrefrigerator marked this pull request as ready for review February 5, 2024 23:17
@laurent22
Copy link
Owner

What would happen if some ocr data has already been saved with a NULL character, and then this data is synced? Do you know if it's going to break sync?

In which case maybe we should provide process any ocr text that's been generated recently to strip off null characters.

Also (but that's for later) we should probably have safeguards when serializing a sync item - if a null character is found we stop sync to prevent data corruption (provided these characters actually are a problem)

@personalizedrefrigerator
Copy link
Collaborator Author

personalizedrefrigerator commented Feb 6, 2024

What would happen if some ocr data has already been saved with a NULL character, and then this data is synced? Do you know if it's going to break sync?

From an initial test (Syncthing file system sync, encryption disabled), it doesn't seem to break sync. I'm now testing with Joplin Server sync. Edit: Joplin Server sync also works.

I don't know which sync target was being used in the original bug report — it's possible that this is only an issue with a certain sync target (e.g. WebDAV).

Edit 2: I am getting errors while searching on mobile (Cannot execute MATCH query: Search : unable to use function MATCH in the requested context). These seem to happen after attaching the problematic PDF. Because this happens even if OCR is disabled, this is likely a different issue (perhaps caused by the version of SQLite on the Android 7 device).

@laurent22
Copy link
Owner

At FOSDEM someone shown me a sync error that indicated a corrupted sync item. I'm not sure if it's because he manually edited the item on Nextcloud or if it was due to this particular bug.

Could you maybe try this?

  • Sync from the desktop with Nextcloud and make sure one of the item has a NULL character
  • Then sync with the Android app and see if the item is correctly downloaded.

I'm wondering if the network library or something along the way fails on the NULL character

@laurent22 laurent22 merged commit a906e73 into laurent22:dev Feb 6, 2024
10 checks passed
@personalizedrefrigerator
Copy link
Collaborator Author

Could you maybe try this?

Sync from the desktop with Nextcloud and make sure one of the item has a NULL character
Then sync with the Android app and see if the item is correctly downloaded.

I can confirm that any text after a NUL character in a note fails to sync.

Resources seem to sync successfully, despite having NUL characters in their ocr_text.

This was tested with TheGood.Cloud (the first free public provider listed on nextcloud.com). It's very, very slow, however (attempting to attach a several-megabyte PDF timed out). For further tests, I might try self-hosting.

(Originally posted on Discord -- included here to make it easier to find in the future)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants