Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Store blobs in subdirs with random names (#4309) #5495

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

iequidoo
Copy link
Collaborator

@iequidoo iequidoo commented Apr 25, 2024

Close #4309
Related discussions: #4555, #4583.
TODO: If this is merged, consider reverting #5778.

@iequidoo iequidoo force-pushed the iequidoo/blob-random-file-suffixes branch from b248bbf to ed33f30 Compare April 26, 2024 00:19
@iequidoo iequidoo requested a review from link2xt April 26, 2024 00:52
Copy link
Member

@r10s r10s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, functionality wise, thanks! imu, the PR just always adds a suffix to the stem, not just sometimes (please correct me if i summed that up wrongly :)

i think, this is a reasonable way forward to get rid of weird filenames. however, i leave code-level review to ppl who are more into rust and python.

deltachat.h still says "File name may be mangled" - but that should be "File name will be mangled", or?

wondering what is left in #4583 - i am a little confused by the the two PR :)

ideally, before this gets merged, imu UIs should check their usage of dc_msg_get_file() and change "affected" code to BLOBDATA+dc_msg_get_filename() or dc_msg_save_file() - again imu, UIs can test that already now by forcing a suffix by sending the same file twice
@iequidoo can you please check/refine and file issues for that in all UIs? this is useful in any case

@iequidoo
Copy link
Collaborator Author

iequidoo commented Apr 29, 2024

imu, the PR just always adds a suffix to the stem, not just sometimes

Exactly

deltachat.h still says "File name may be mangled" - but that should be "File name will be mangled", or?

Yes, forgot to move the documentation changes to this PR from the follow-up one.
EDIT: Done.

wondering what is left in #4583 - i am a little confused by the the two PR :)

It removes the original file stem at all, we don't need it as we have dc_msg_get_filename() and the app should use it when exporting attachments (maybe not only, but at least). And also the next PR fixes #5338.

It's still a question what to do if we need to display a file (with random name) in some external program. This could be solved by preserving file names, but storing files in dirs having random names to avoid copying. But afaik there are plans to encrypt the whole blobstorage (especially on Desktop because there's no sandboxing), so copying can't be avoided anyway.

ideally, before this gets merged, imu UIs should check their usage of dc_msg_get_file() and change "affected" code to BLOBDATA+dc_msg_get_filename() or dc_msg_save_file() - again imu, UIs can test that already now by forcing a suffix by sending the same file twice
@iequidoo can you please check/refine and file issues for that in all UIs? this is useful in any case

I think this one can be merged even before switching all UIs to the new API (though it can break sending multipart archives as @adbenitez noted), but definitely not the next PR. I can check Desktop and Android on my own, maybe also iOS, but i don't use it and never looked into its sources :)

@iequidoo iequidoo force-pushed the iequidoo/blob-random-file-suffixes branch 2 times, most recently from 76c82f3 to deca65d Compare May 21, 2024 20:04
@iequidoo
Copy link
Collaborator Author

iequidoo commented May 23, 2024

So far i see the following tasks for the UIs:

  • Android: switch to using dc_msg_save_file() for saving attachments.
  • Desktop: the same (jsonrpc's save_msg_file()).

(I'm going to file tasks, but need to study the apps code a bit more. DONE: deltachat/deltachat-desktop#3851, deltachat/deltachat-android#3091.)
But they look unrelated to the PR so it can be merged. The more important question is

... what to do if we need to display a file (with random name) in some external program. This could be solved by preserving file names, but storing files in dirs having random names to avoid copying. But afaik there are plans to encrypt the whole blobstorage (especially on Desktop because there's no sandboxing), so copying can't be avoided anyway.

@iequidoo
Copy link
Collaborator Author

iequidoo commented Aug 30, 2024

This is likely will never be merged because it breaks opening attachments in external programs (they will always have weird names). The previous approach (copying to a tmp file) failed as well, we don't want to do extra work in general and also it's difficult to maintain that UIs don't cause blob copying when it's not needed, e.g. the UI may want to get the blob path to display an image, this mustn't cause copying as the path isn't passed to an external program, so we need two separate API functions for this.

So my suggestion is to save blobs to separate dirs having random names. These dirs should be garbage-collected in housekeeping (as far as i can tell this is already done in remove_unused_files()). Note that for Desktop, if we plan to encrypt the whole blobstorage there, copying/exporting blobs still will be needed at least in case of passing them to external programs. @adbenitez also mentioned that there's a case of multipart archives, but i think it should be fine to export blobs to a separate dir before passing them to an unarchiver? And we already have an API for this -- CommandApi::save_msg_file().

@Simon-Laux
Copy link
Member

Simon-Laux commented Sep 1, 2024

I think if we change blob storage, then we should already go for a solution that has support for deduplication and encryption.

I think it could make sense to have a second sqlite db (maybe even with a different page size more optimised for storing files, or compressed like in https://sqlite.org/sqlar.html)

Then that sqlite db could have two tables, one with filenames and hash and one with hash, size and content or sth like that. Or just an SQLite Archive (as described in https://sqlite.org/sqlar.html) archive with hashes as filenames and reusing the blobs table in the existing database.

-> https://sqlite.org/fasterthanfs.html

@iequidoo
Copy link
Collaborator Author

iequidoo commented Sep 27, 2024

I think if we change blob storage, then we should already go for a solution that has support for deduplication and encryption.

I still think this may be excessive on platforms with apps sandboxing like Android. And this makes useless the fs-level dedup if any (my Android just uses ext4, but maybe there are devices using Btrfs or so, though i don't see it on https://source.android.com/docs/core/architecture/android-kernel-file-system-support). Fs-level dedup is useful if files are exported outside of Delta Chat.

If we want to use SQLite Archive, then maybe just switch to it? Reading the "Disadvantages" section on https://sqlite.org/sqlar.html, i don't see any critical ones. Still measurements make sense. And on some platforms we still may want to store blobs directly on fs, moreover there may be already a block device encryption or fs-level encryption.

For now i suggest to just store blobs in separate dirs, it's not that huge change and should resolve all issues with file names intersection. EDIT: Done.

@iequidoo iequidoo force-pushed the iequidoo/blob-random-file-suffixes branch from 0610b7d to 8bfb6e9 Compare October 2, 2024 19:28
@iequidoo iequidoo changed the title feat: Use random filename suffixes for blobstorage (#4309) feat: Store blobs in subdirs with random names (#4309) Oct 2, 2024
@iequidoo iequidoo marked this pull request as draft October 2, 2024 19:34
@iequidoo iequidoo force-pushed the iequidoo/blob-random-file-suffixes branch 4 times, most recently from 8a315e8 to d2e299d Compare October 5, 2024 18:47
@iequidoo iequidoo marked this pull request as ready for review October 5, 2024 18:58
Recently there was an accident with a chatbot that replaced its avatar set from the command line
with an unrelated avatar of a contact. Both the `selfavatar` setting and the contact avatar `i`
param pointed to `$BLOBDIR/avatar.png` at the time it was detected. How this happened is unclear,
but it is possible that `avatar.png` was removed, unmounted or otherwise not detected by the core,
and the core stored avatar received from the contact as `avatar.png`, while `selfavatar` config
still pointed to `$BLOBDIR/avatar.png`.

Such bugs are unavoidable even if the core itself has no bugs as we cannot rely on blobdir not
reside on the faulty network filesystem, being incorrectly backed up and restored etc., so we should
assume that files may be randomly removed. Then there may be dangling `$BLOBDIR/...` references in
the database which may accidentally point to unrelated files, could even be an `avatar.png` file
sent to the bot in private.

To prevent such bugs, store blobs in blobdir subdirs with random names. Also this helps when we
receive multiple attachments having the same name -- before, random filename suffixes were added to
subsequent attachments, now attachments preserve their filenames which is important if they are
opened in external programs.
@iequidoo iequidoo force-pushed the iequidoo/blob-random-file-suffixes branch from d2e299d to cb1d008 Compare November 15, 2024 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use random filenames or hashes for blobstorage
4 participants