Compress rarely modified files #869

Closed · matklad opened this issue Feb 21, 2019 · 32 comments
Labels: A-perf (performance issues), E-medium, fun (a technically challenging issue with high impact)

matklad (Member) commented Feb 21, 2019

Crazy idea: source code occupies a non-negligible amount of memory. For rust-analyzer, it is

4222 (47mb) files

which actually is worse than I expected (could this be a bug? Do we include unrelated files?).

It might be a good idea to compress this code on the fly! Specifically, we could store text not as Arc<String> but as an opaque TextBuffer object that can compress and decompress large texts transparently: compress all files after the initial indexing of the project, and decompress them on demand.

This shouldn't be too hard to implement, actually!

To clarify, I still think it's a good idea to keep all the source code in memory, to avoid IO errors, but we could use less memory.
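A minimal sketch of what such an opaque buffer could look like (the TextBuffer name comes from the paragraph above; the codec functions are identity placeholders, and everything else is an assumption rather than rust-analyzer's actual API):

```rust
use std::sync::Arc;

/// Hypothetical opaque text container: callers only see `get_text`,
/// while the internal representation may be plain or compressed.
enum Repr {
    Plain(Arc<String>),
    Compressed(Vec<u8>), // e.g. LZ4- or zstd-compressed UTF-8 bytes
}

pub struct TextBuffer {
    repr: Repr,
}

impl TextBuffer {
    pub fn new(text: Arc<String>) -> TextBuffer {
        TextBuffer { repr: Repr::Plain(text) }
    }

    /// Decompress on demand; the caller always gets the full text.
    pub fn get_text(&self) -> Arc<String> {
        match &self.repr {
            Repr::Plain(text) => Arc::clone(text),
            Repr::Compressed(bytes) => Arc::new(decompress(bytes)),
        }
    }

    /// Called after the initial indexing to trade CPU for memory.
    pub fn compress(&mut self) {
        if let Repr::Plain(text) = &self.repr {
            let compressed = compress(text.as_bytes());
            self.repr = Repr::Compressed(compressed);
        }
    }
}

// Identity placeholders; any byte-oriented codec could slot in here.
fn compress(bytes: &[u8]) -> Vec<u8> {
    bytes.to_vec()
}

fn decompress(bytes: &[u8]) -> String {
    String::from_utf8(bytes.to_vec()).unwrap()
}
```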

matklad added the E-medium and fun labels Feb 21, 2019
vipentti (Contributor):

I wonder if it would be possible to have some sort of LRU caching for the compressed source: compress everything, but let frequently changed files stay in memory uncompressed to avoid unnecessary compression/decompression. I guess it also depends on what kind of overhead the compression itself has.

jrmuizel (Contributor):

Why is the source stored at all? Can't it be read from disk as needed?

matklad (Member, Author) commented Feb 21, 2019

@jrmuizel it's important not to let arbitrary IO into the core incremental computation. We can't guarantee that reading a file twice will yield the same result, and if we get different results in the same incremental session, we'll be in an inconsistent state.

What should be possible is to "copy" files to some ".rust-analyzer" dir and read them from there, with the contract that an IO error while reading from this rust-analyzer-private dir is fatal and requires a restart.

Overall, spending 50 MB of RAM to store text seems like a much better deal than dealing with IO in any form. A good thing about compression is that it gives us memory savings in a purely functional context.

matklad (Member, Author) commented Feb 21, 2019

> I wonder if it would be possible to have some sort of LRU

The simplest form of LRU is "compress everything once in a while". This is what we do for syntax trees, and it seems to work.

vipentti (Contributor):

How does ra_vfs relate to the compression? ra_vfs monitors the files and keeps them in memory, right? So should the compression already happen in ra_vfs?

matklad (Member, Author) commented Feb 21, 2019

Yeah, I think so!

Currently, Vfs stores text as text: Arc<String>, and it could be changed to a more abstract type. I can't sketch the whole design off the top of my head; I expect there will be some interesting questions about lifetimes and interior mutability. If we are introducing a new type for this, it might also be a good occasion to switch the LSP layer to patching files with edits on modification, instead of asking the client for the whole text buffer every time.

marcogroppo (Contributor):

Hi! I've noticed that a lot of non-Rust files (LICENSE, AUTHORS, Dockerfile, COPYING, .gitignore, etc.) are included in the salsa db. Is this by design?

matklad (Member, Author) commented Mar 5, 2019

@marcogroppo that's definitely a bug, only .rs files should be included

matklad (Member, Author) commented Mar 5, 2019

found it:

https://github.com/rust-analyzer/ra_vfs/blob/beac2769f48474a7dc33014a982614c5c13804ea/src/roots.rs#L97-L100

Here we include extension-less files, so that we don't ignore directories. We should probably do additional filtering somewhere in the IO layer to filter out extension-less files.
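For illustration, the kind of check the IO layer could add might look like this (the function name is hypothetical; it keeps directories so traversal continues, but drops files without an .rs extension):

```rust
use std::path::Path;

/// Illustrative IO-layer filter: always descend into directories,
/// but only keep files that have a `.rs` extension.
fn keep_entry(path: &Path) -> bool {
    if path.is_dir() {
        return true;
    }
    path.extension().map_or(false, |ext| ext == "rs")
}
```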

bors bot added a commit to rust-analyzer/ra_vfs that referenced this issue Mar 6, 2019
3: Filter out hidden and extensionless files from watching r=matklad a=vipentti

Relates to discussion in rust-lang/rust-analyzer#869

I'm not sure if this is the appropriate place to do the filtering.

Co-authored-by: Ville Penttinen <villem.penttinen@gmail.com>
marcogroppo (Contributor):

I did a quick check: with the ra_vfs patch, the memory occupied by rust-analyzer's source code is now 3586 files (38 MB), compared to 4305 files (47 MB) without the patch. Another thing I've noticed is that the source code includes tests, benchmarks and examples from libcore and other dependencies.

killercup (Member):

This seems like an interesting idea but one should note that some operating systems already compress memory pages when under pressure (macOS by default, Linux with zram).

vipentti (Contributor) commented Mar 7, 2019

I think once we can properly ignore files that are not necessary, like tests, benchmarks or examples from external sources, the number of files should be reduced even further.

matklad (Member, Author) commented Mar 7, 2019

> I think once we can properly ignore files that are not necessary, like tests, benchmarks or examples from external sources, the number of files should be reduced even further.

I think we should extend the VFS API to allow specifying exclusions together with the roots. Then, we can change the logic in rust-analyzer to ignore tests|benches|examples for crates from crates.io.

kjeremy (Contributor) commented Mar 7, 2019

Could we use the ignore crate for this?

vipentti (Contributor) commented Mar 7, 2019

I think we can use ignore to at least get .gitignore support; maybe we could use it to ignore other things as well?

matklad (Member, Author) commented Mar 7, 2019

Yeah, using gitignore is fine!

We only need to think carefully about the interface between VFS and the rest of the world, such that consumers can flexibly choose the strategy. Perhaps VFS should just accept a BoxFn, so that using gitignore is strictly the consumer's business?
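As a sketch of that division of labour, the VFS could accept a boxed predicate while a consumer builds one on top of the `ignore` crate's gitignore matcher (the VfsFilter alias and gitignore_filter function are hypothetical, not an existing API):

```rust
use std::path::{Path, PathBuf};

use ignore::gitignore::{Gitignore, GitignoreBuilder};

/// Hypothetical filter type the VFS could accept; the VFS itself stays
/// policy-free and just calls the closure for every candidate path.
type VfsFilter = Box<dyn Fn(&Path, bool) -> bool>;

/// Consumer side: build a gitignore-based filter with the `ignore` crate.
fn gitignore_filter(root: PathBuf) -> VfsFilter {
    let mut builder = GitignoreBuilder::new(&root);
    // A missing .gitignore is fine; any error from `add` is ignored here.
    let _ = builder.add(root.join(".gitignore"));
    let matcher: Gitignore = builder.build().unwrap_or_else(|_| Gitignore::empty());
    // Returning `true` means "keep this path".
    Box::new(move |path, is_dir| !matcher.matched(path, is_dir).is_ignore())
}
```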

lnicola (Member) commented Mar 7, 2019

Wouldn't it be good to include the examples, tests and benchmarks, so things like go to definition and find references keep working?

matklad (Member, Author) commented Mar 7, 2019

@lnicola for crates.io dependencies I think that is not important

lnicola (Member) commented Mar 7, 2019

Good point. But for the current project they are.

bors bot added a commit to rust-analyzer/ra_vfs that referenced this issue Mar 18, 2019
4:  Implement Root based filtering for files and folders in Vfs r=matklad a=vipentti

The filtering is done through implementing the trait `Filter` which is then applied to folders and files under the given `RootEntry`.

This relates to discussion in rust-lang/rust-analyzer#869 and in [zulip](https://rust-lang.zulipchat.com/#narrow/stream/185405-t-compiler.2Fwg-rls-2.2E0/topic/ignoring.20in.20VFS). This allows users to provide filtering for each root, enabling crate-specific filtering; for example, for external crates you may exclude `test|bench|example` folders.


Co-authored-by: Ville Penttinen <villem.penttinen@gmail.com>
bors bot added a commit that referenced this issue Mar 21, 2019
997: Improve filtering of file roots r=matklad a=vipentti

`ProjectWorkspace::to_roots` now returns a new `ProjectRoot` which contains
information regarding whether or not the given path is part of the current
workspace or an external dependency. This information can then be used in
`ra_batch` and `ra_lsp_server` to implement more advanced filtering. This allows
us to filter some unnecessary folders from external dependencies such as tests,
examples and benches.

Relates to discussion in #869 

Co-authored-by: Ville Penttinen <villem.penttinen@gmail.com>
matklad (Member, Author) commented Apr 8, 2019

Something @Xanewok and I discussed on Zulip is that we can also fold parsing into the mix and have a three-state repr:

enum SourceState {
    Compressed(Vec<u8>),
    Decompressed(String),
    Parsed(TreeArc<ast::SourceFile>),
}

The repr could change dynamically (so interior mutability is required) depending on access patterns and memory usage. This should also allow us to incrementally reparse files.
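A rough sketch of how such a slot could work behind interior mutability; the ParsedFile type and the codec are placeholders, not the actual rust-analyzer types:

```rust
use std::sync::Mutex;

// Placeholder for the parsed tree (`TreeArc<ast::SourceFile>` at the time);
// it keeps the text around so the example stays self-contained.
struct ParsedFile {
    text: String,
}

enum SourceState {
    Compressed(Vec<u8>),
    Decompressed(String),
    Parsed(ParsedFile),
}

/// Interior mutability lets a `&self` accessor upgrade the representation
/// (decompress, and later parse) without callers noticing the state change.
struct SourceSlot {
    state: Mutex<SourceState>,
}

impl SourceSlot {
    fn text(&self) -> String {
        let mut state = self.state.lock().unwrap();
        if let SourceState::Compressed(bytes) = &*state {
            // Any codec could run here; this sketch uses an identity placeholder.
            let text = decompress(bytes);
            *state = SourceState::Decompressed(text);
        }
        match &*state {
            SourceState::Decompressed(text) => text.clone(),
            SourceState::Parsed(parsed) => parsed.text.clone(),
            SourceState::Compressed(_) => unreachable!("upgraded above"),
        }
    }
}

fn decompress(bytes: &[u8]) -> String {
    String::from_utf8(bytes.to_vec()).unwrap()
}
```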

spadaval commented Jan 3, 2020

(Another) crazy idea: store source code and other large (meta)data in SQLite or a similar database-in-a-file system. This would allow us to reduce memory usage while having a minimal effect on performance.

The new dependencies are not insignificant, but they would probably be acceptable.
This approach also scales better for huge projects.

lnicola (Member) commented Jan 3, 2020

@spadaval this might work at the salsa level, see salsa-rs/salsa#10.

lnicola (Member) commented Sep 19, 2020

I gave this a try at the VFS level, using LZ4:

| Run    | RSS (MB) | CPU time (s) | Mem (MB) |
|--------|----------|--------------|----------|
| before | 812      | 20.24        | 764      |
| before | 812      | 18.69        | 765      |
| before | 814      | 20.74        | 764      |
| after  | 841      | 19.76        | 751      |
| after  | 842      | 21.58        | 751      |
| after  | 813      | 19.54        | 751      |

The uncompressed source code is 43 MB. The tests consisted of starting Code with only RA's main.rs open. I didn't use the custom dictionary feature of the LZ4 crate, but that might help a little, too.

Overall I'm not convinced this is worth it, what do you think?
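For reference, a minimal LZ4 round-trip of the kind such an experiment needs; this uses the lz4_flex crate as one possible binding, since the comment doesn't say which LZ4 crate was used:

```rust
use lz4_flex::{compress_prepend_size, decompress_size_prepended};

fn main() {
    // Takes a path to any source file and round-trips it through LZ4.
    let path = std::env::args().nth(1).expect("usage: lz4-roundtrip <file>");
    let source = std::fs::read(&path).unwrap();
    let compressed = compress_prepend_size(&source);
    println!("{}: {} bytes -> {} bytes", path, source.len(), compressed.len());

    let restored = decompress_size_prepended(&compressed).unwrap();
    assert_eq!(source, restored);
}
```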

matklad (Member, Author) commented Sep 21, 2020

Yeah, seems like it's not worth it at this time!

Thanks for quantifying the wins here @lnicola, that's super helpful!

@matklad matklad closed this as completed Sep 21, 2020
lnicola (Member) commented May 2, 2023

@Veykril think we should revisit this? See table above.

Veykril (Member) commented May 2, 2023

Yes, I think this would be good to revisit (the VFS takes up ~100 MB on r-a for me currently).

Veykril reopened this May 2, 2023
Veykril added the A-perf label May 2, 2023
lnicola (Member) commented May 2, 2023

Some updated baseline numbers after starting Code with main.rs:

  • cache priming disabled: 1028 MB
  • cache priming enabled: 1165 MB
  • FileTextQuery takes 59 MB

lnicola self-assigned this May 2, 2023
Veykril mentioned this issue Jun 19, 2023
Veykril mentioned this issue Sep 11, 2023
nehalem501:

I might suggest trying a more modern compression algorithm like zstd instead of LZ4 this time.
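For comparison with the LZ4 sketch above, a comparable round-trip with the zstd crate (the Rust bindings to the C library mentioned in the next comment) could look like this:

```rust
fn main() -> std::io::Result<()> {
    // Round-trip a file through zstd at compression level 3 (a typical default level).
    let path = std::env::args().nth(1).expect("usage: zstd-roundtrip <file>");
    let source = std::fs::read(&path)?;
    let compressed = zstd::encode_all(source.as_slice(), 3)?;
    let restored = zstd::decode_all(compressed.as_slice())?;
    assert_eq!(source, restored);
    println!("{}: {} bytes -> {} bytes", path, source.len(), compressed.len());
    Ok(())
}
```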

lnicola (Member) commented Nov 17, 2023

The issue with zstd is that it pulls in a lot of C code.

Anyway, I tried this again and the memory usage grew, so there's probably something weird going on that's not related to the compression.

davidbarsky (Contributor):

@lnicola do you still have a branch where you tried this approach? (If not, a description is totally fine!) I wanted to try it out with zstd, since, for organizational reasons, it's substantially easier for me to bundle a bunch of C code.

(if it's successful, it would likely be a private set of patches I wouldn't send as a PR for the aforementioned "way too much C code" reasons.)

lnicola (Member) commented Nov 20, 2023

@davidbarsky yeah, I'll clean it up and rebase tomorrow, but it's pretty trivial.

I think at the time I actually did some tests against zstd (outside of RA, by compressing the files); I don't remember the results, but I think using a custom dictionary wasn't really worth it.

The other thing that's needed here is at least a one-item LRU cache, because without that we're going to keep recompressing the current file when the user is typing. I don't think we generally hit the VFS too much otherwise (except when switching branches). https://github.com/lnicola/rust-analyzer/tree/vfs-log adds some logging we can use to double-check.
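A sketch of such a one-slot cache, with a placeholder FileId type and an identity codec standing in for the real compressor:

```rust
use std::collections::HashMap;

type FileId = u32; // placeholder for the real VFS file id type

/// Single-slot "LRU": only the most recently accessed file stays uncompressed,
/// so typing in one file doesn't trigger a compress/decompress cycle per edit.
struct CompressedStore {
    compressed: HashMap<FileId, Vec<u8>>,
    hot: Option<(FileId, String)>,
}

impl CompressedStore {
    fn get(&mut self, id: FileId) -> Option<&str> {
        if self.hot.as_ref().map(|(hot_id, _)| *hot_id) != Some(id) {
            // Re-compress the previously hot file before evicting it.
            if let Some((old_id, old_text)) = self.hot.take() {
                self.compressed.insert(old_id, compress(old_text.as_bytes()));
            }
            let bytes = self.compressed.remove(&id)?;
            self.hot = Some((id, decompress(&bytes)));
        }
        self.hot.as_ref().map(|(_, text)| text.as_str())
    }
}

// Identity placeholders; LZ4, zstd or any other codec could slot in here.
fn compress(bytes: &[u8]) -> Vec<u8> {
    bytes.to_vec()
}

fn decompress(bytes: &[u8]) -> String {
    String::from_utf8(bytes.to_vec()).unwrap()
}
```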

lnicola (Member) commented Jan 8, 2024

#16307 makes this obsolete by dropping the file contents from the VFS.

We could still compress the contents in the salsa db, but I'm not sure how to implement that without thrashing on the active set of files. Can queries change the inputs?

lnicola closed this as not planned Jan 8, 2024