Serialization to disk #10

Open
nikomatsakis opened this issue Oct 1, 2018 · 8 comments
Labels
rfc Active discussion about a possible future feature
Milestone

Comments

@nikomatsakis
Member

We should support some way to serialize the state of our queries to disk and then reload them for a future session. This is a lot of work and we can learn from rustc, of course. We'd want to do the reloading lazily, for example.

I definitely want to punt on this.

@nikomatsakis nikomatsakis added the rfc Active discussion about a possible future feature label Oct 1, 2018
@nikomatsakis nikomatsakis added this to the Far future milestone Oct 1, 2018
@matklad
Member

matklad commented Nov 15, 2018

A similar but distinct feature would be to transparently spill rarely used values to disk.

IntelliJ relies heavily on a similar feature: when you open a multi-million-line project with lots of dependencies, the indices become really huge.

Note that this is a significantly different setup from rustc, which operates on a crate at a time, and has a reasonable natural cap on the amount of data it must process simultaneously.

@matklad
Member

matklad commented Jan 23, 2019

A very wise observation from: rust-lang/rfcs#1317 (comment)

In a strictly on-demand setting (IDE, not a compiler), serialization to disk creates more problems than it solves.

@lnicola
Contributor

lnicola commented Dec 4, 2019

In a strictly on-demand setting (IDE, not a compiler), serialization to disk creates more problems than it solves.

Note that some popular IDEs like Visual Studio actually use a disk database. VS migrated a while ago from a custom format to a SQLite database: https://devblogs.microsoft.com/cppblog/introducing-c-experimental-editor-tools/.

@lpil

lpil commented Mar 5, 2020

Hi! This would be a desirable feature for me. Is this being worked on?

Not trying to rush you, just trying to evaluate how suitable this library is for my use-case. Thank you. :)

@matklad
Member

matklad commented Mar 5, 2020 via email

@lpil

lpil commented Mar 5, 2020

Thank you

@fogti

fogti commented Apr 9, 2020

I think serialization should be generally opt-in:

  • maybe at salsa::database level:
    rather coarse, without lazy loading or transparent spilling, useful for "whole session" store/load and short-term-running scenarios
  • or even per query:
    fine-grained, with lazy loading and maybe transparent spilling, useful to reduce RAM usage in long-term-running scenarios
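To make the per-query option concrete, here is a minimal sketch in plain Rust. It does not use salsa's API: `Persistence`, `MemoTable`, and the tab-separated snapshot format are all hypothetical names made up for illustration. Each query registers a policy, and only queries that opted in end up in the snapshot.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-query persistence setting; not part of salsa's API.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Persistence {
    /// Memoized in RAM only; recomputed in the next session.
    InMemory,
    /// Opted in: included when the cache is written out.
    OnDisk,
}

/// A toy memo table keyed by (query name, key). Values are plain strings
/// to keep the sketch free of serialization libraries.
#[derive(Default)]
pub struct MemoTable {
    policies: BTreeMap<String, Persistence>,
    values: BTreeMap<(String, String), String>,
}

impl MemoTable {
    /// Declare how a query's results should be treated.
    pub fn register(&mut self, query: &str, policy: Persistence) {
        self.policies.insert(query.to_string(), policy);
    }

    /// Store a memoized result for `query(key)`.
    pub fn insert(&mut self, query: &str, key: &str, value: &str) {
        self.values
            .insert((query.to_string(), key.to_string()), value.to_string());
    }

    /// Look up a memoized result, if any.
    pub fn get(&self, query: &str, key: &str) -> Option<&str> {
        self.values
            .get(&(query.to_string(), key.to_string()))
            .map(String::as_str)
    }

    /// Serialize only the entries of queries that opted in, one
    /// `query<TAB>key<TAB>value` record per line. Assumes query names and
    /// keys contain no tabs and no field contains a newline.
    pub fn snapshot(&self) -> String {
        let mut out = String::new();
        for ((query, key), value) in &self.values {
            if self.policies.get(query) == Some(&Persistence::OnDisk) {
                out.push_str(&format!("{}\t{}\t{}\n", query, key, value));
            }
        }
        out
    }

    /// Reload entries from a snapshot produced by `snapshot`.
    /// Malformed lines are skipped.
    pub fn restore(&mut self, snapshot: &str) {
        for line in snapshot.lines() {
            let mut parts = line.splitn(3, '\t');
            if let (Some(q), Some(k), Some(v)) = (parts.next(), parts.next(), parts.next()) {
                self.insert(q, k, v);
            }
        }
    }
}
```

The database-level variant would be the same table with a single policy for everything. Lazy loading and spilling are left out here; they would need the table to read individual entries from disk on a miss rather than restoring the whole snapshot up front.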

I think I already have a kind of usage scenario ("scenario" as in "salsa is currently not used, but I am investigating potential uses") in zs-filecrawler.

That program first walks through a file list and computes the hash of each file. Then it iterates over the list of hashes, takes the first associated file, and calls a user-defined hook script on that file. It caches the hash list and the progress. It might not really fit the usual `salsa` usage scenario, but the goal is similar: avoid redoing work.
QueryGroup 1: 
  file_content(filepath) <-- hash_data(filepath)
  ^-[maybe lazy input]      --> association [filepath -> hash_of_file_data]

QueryGroup 2:
  hash2file(hash)    <-- call_hook(hash)
  ^-[input, from QG1]   --> implicit association [hash -> done(hook return value)]

Currently, I just take the "session serialization approach", deserialize at startup, and serialize at shutdown/interrupt, but this may lose some progress. I think that the zs-filecrawler utility program could benefit from salsa, but it requires some way to serialize the state (the split into two QueryGroups would simulate that, but it makes interleaving both parts more difficult, and reduces potential benefits).
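For comparison, one way to avoid losing progress on an interrupt without any salsa support is to append each finished hash to a log as soon as its hook returns, instead of serializing everything at shutdown. The sketch below is only an illustration under that assumption: `record_done`, `load_done`, `process_pending`, and the one-hash-per-line log format are hypothetical and are not taken from zs-filecrawler, and the hook is modelled as a closure rather than a script.

```rust
use std::collections::HashSet;
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Append one finished hash to the progress log right after its hook ran,
/// and flush it to disk so an interrupt loses at most the hook in flight.
pub fn record_done(log: &Path, hash: &str) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(log)?;
    writeln!(file, "{}", hash)?;
    file.sync_data()
}

/// Load the set of hashes whose hook already completed.
/// A missing log file means a fresh start.
pub fn load_done(log: &Path) -> io::Result<HashSet<String>> {
    match fs::read_to_string(log) {
        Ok(text) => Ok(text.lines().map(str::to_owned).collect()),
        Err(err) if err.kind() == io::ErrorKind::NotFound => Ok(HashSet::new()),
        Err(err) => Err(err),
    }
}

/// Run `hook` for every hash that is not yet in the log and record each
/// one as soon as it finishes. Returns how many hooks were run.
pub fn process_pending<F>(log: &Path, hashes: &[&str], mut hook: F) -> io::Result<usize>
where
    F: FnMut(&str),
{
    let mut done = load_done(log)?;
    let mut ran = 0;
    for &hash in hashes {
        // `insert` returns false for hashes already done (or duplicated
        // in the input), so each hash is processed at most once.
        if done.insert(hash.to_owned()) {
            hook(hash);
            record_done(log, hash)?;
            ran += 1;
        }
    }
    Ok(ran)
}
```

This only covers the second query group (hash to hook result). It does nothing for invalidation when a file changes, which is the part where a persistent salsa cache would actually help.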

@MichaReiser
Contributor

Thanks for creating salsa. It's an outstanding piece of software and an extremely valuable source of inspiration.

We're exploring adding incremental computation to Ruff, a static analysis tool for Python that is primarily used from the CLI but also comes with an LSP. We're intrigued by salsa's model. It's nice how it handles much of the caching complexity for you. However, we believe that a persistent cache is essential for us because subsequent check times are important when using the CLI locally or in CI. That's how I came across this issue.

Is this a feature where active contributions would be welcomed? Are there ideas on how this could be implemented in Salsa 2022 that I could explore further?
