Compress link_ids lists #48279

Closed


timholy commented Jan 14, 2023

This reduces the size of our precompile cache files, using run-length encoding (RLE) to represent the module of external linkages. Most linkages seem to be against the sysimg itself, and RLE allows long stretches of such linkages to be encoded compactly.

Closes #48218

With a fairly minimal default environment (42 packages including dependencies), here were the sizes of .julia/compiled/v1.10 in bytes:

  • PR: 119218646
  • preceding commit: 126898230

for a savings of about 6%.
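
To illustrate the idea, here is a minimal sketch of run-length encoding a list of per-linkage module ids. It is not the code in this PR: the `rle_run_t` layout and the function names `rle_encode` and `rle_decode` are hypothetical, and the actual on-disk format may differ. It only shows why long stretches of linkages against the same image (e.g. id 0 for the sysimg) collapse to a single entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: `count` consecutive linkages that all target module `id`. */
typedef struct {
    uint32_t id;
    uint32_t count;
} rle_run_t;

/* Encode ids[0..n) into runs and return the number of runs written.
 * `runs` must have room for n entries (the worst case: no two neighbors equal). */
static size_t rle_encode(const uint32_t *ids, size_t n, rle_run_t *runs)
{
    size_t nruns = 0;
    for (size_t i = 0; i < n; i++) {
        if (nruns > 0 && runs[nruns - 1].id == ids[i] &&
            runs[nruns - 1].count < UINT32_MAX) {
            runs[nruns - 1].count++;        /* extend the current run */
        } else {
            runs[nruns].id = ids[i];        /* start a new run */
            runs[nruns].count = 1;
            nruns++;
        }
    }
    return nruns;
}

/* Expand runs back into a flat id list; returns the number of ids written.
 * `cap` is the capacity of `out`. */
static size_t rle_decode(const rle_run_t *runs, size_t nruns,
                         uint32_t *out, size_t cap)
{
    size_t k = 0;
    for (size_t r = 0; r < nruns; r++) {
        for (uint32_t c = 0; c < runs[r].count; c++) {
            assert(k < cap);
            out[k++] = runs[r].id;
        }
    }
    return k;
}
```

With an input such as `{0, 0, 0, 0, 0, 3, 3, 0, 0, 7}`, ten ids become four runs; the savings grow with the length of the sysimg-only stretches.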

timholy added the backport 1.9 (Change should be backported to release-1.9) and pkgimage labels Jan 14, 2023

vtjnash commented Jan 14, 2023

Since most are 0 currently, should we make that a special RefTag of its own, so that this structure is very rarely needed at all? We could give it a sized reftag structure of:

small_image_reloc64 {
    tag:3 (= ExternalLinkageRef)
    id:21 (= reloc_id for this)
    offset:40 (= offset for this in id)
}
small_image_reloc32 {
    tag:3 (= ExternalLinkageRef)
    id:0 (= reloc_id 0 only)
    offset:29 (= offset for this in 0)
}

So on 64-bit, we use some of the spare bits to encode ids for up to 2 million packages and sizes up to 1TB each, without needing the side table.

And on 32-bit, we only have enough spare bits to do this meaningfully for image 0, so we still need to support the side table, but hopefully with much less content.
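
As a rough sketch of the 64-bit layout above, the three fields can be packed into one word with shifts and masks. The field widths (3 + 21 + 40 = 64) come from the proposal; placing the tag in the highest bits and the helper names `reloc64_pack`, `reloc64_tag`, `reloc64_id`, and `reloc64_offset` are assumptions made for illustration, not existing functions.

```c
#include <assert.h>
#include <stdint.h>

/* Field widths from the proposal; bit positions are an assumption. */
#define TAG_BITS    3
#define ID_BITS     21
#define OFFSET_BITS 40

/* Pack tag | id | offset into one 64-bit word, tag in the top bits. */
static uint64_t reloc64_pack(uint64_t tag, uint64_t id, uint64_t offset)
{
    assert(tag < (UINT64_C(1) << TAG_BITS));
    assert(id < (UINT64_C(1) << ID_BITS));         /* up to 2^21, about 2 million ids */
    assert(offset < (UINT64_C(1) << OFFSET_BITS)); /* up to 2^40 bytes, i.e. 1 TiB */
    return (tag << (ID_BITS + OFFSET_BITS)) | (id << OFFSET_BITS) | offset;
}

static uint64_t reloc64_tag(uint64_t r)
{
    return r >> (ID_BITS + OFFSET_BITS);
}

static uint64_t reloc64_id(uint64_t r)
{
    return (r >> OFFSET_BITS) & ((UINT64_C(1) << ID_BITS) - 1);
}

static uint64_t reloc64_offset(uint64_t r)
{
    return r & ((UINT64_C(1) << OFFSET_BITS) - 1);
}
```

On 32-bit the same scheme leaves 29 bits after the 3-bit tag, which is why the proposal spends all of them on the offset and restricts the packed form to image 0.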


timholy commented Jan 14, 2023

Just to check that I understand, you're proposing to encode the module/buildid and offset in the same 64-bit field, right? Do you have a specific proposal for how we represent the buildid in 29 bits? With respect to a constant list of dependency-buildids, perhaps? (Essentially the same role as pkg_build_ids in this PR.)

And also to check, you're proposing to add this as a second type of external linkage to RefTags, right? Maybe call it SysImageLinkage?

enum RefTags {
    DataRef,            // mutable data
    ConstDataRef,       // constant data (e.g., layouts)
    TagRef,             // items serialized via their tags
    SymbolRef,          // symbols
    FunctionRef,        // generic functions
    BuiltinFunctionRef, // builtin functions
    SysImageLinkage,    // pkgimage reference to the sysimage
    ExternalLinkage     // items defined externally (used when serializing packages)
};


vtjnash commented Jan 15, 2023

An array of modules is already an argument to the deserializer, so it could be an index into that array. I gave it 21 bits in the encoding above for the index.
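
A simplified sketch of what "an index into that array" could look like at load time: the id selects an entry in a table, and the offset is applied relative to that entry's base. The `image_t` type and `resolve_ref` function are hypothetical stand-ins; the real deserializer receives an array of modules rather than raw buffers, so this only illustrates the lookup step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one loaded image: a base address and its size. */
typedef struct {
    const char *base;
    size_t size;
} image_t;

/* Resolve (id, offset) to an address inside the id-th image.
 * Both the id and the offset are bounds-checked against the table. */
static const void *resolve_ref(const image_t *images, size_t nimages,
                               uint32_t id, uint64_t offset)
{
    assert(id < nimages);
    assert(offset < images[id].size);
    return images[id].base + offset;
}
```

Because the table is supplied by the caller, the serialized reference only needs the small index, not a full build id.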


timholy commented Jan 15, 2023

OK, good. And the second part? A new tag, right?


vtjnash commented Jan 15, 2023

Yep


timholy commented Jan 18, 2023

I have a mostly-working implementation of your proposal in the teh/compress2 branch. However, I've started to have second thoughts: aside from the fact that the approach is a bit more intrusive (which might be a good or bad thing, not sure), my major concern is that we give up something big to get something small:

  • the small thing we gain is an at-most 1% further decrease in file size, and for the majority of packages it's much smaller than that (see figure below)
  • the big thing we give up is our last free RefTag, and on 32-bit I don't see a good way to expand the RefTags without making the maximum file size a factor of 2 smaller. I think a much better use for that RefTag would be ForeignObject, aka for GAP.jl (cc @fingolfin).
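
The factor of 2 follows from simple bit accounting, sketched here under the assumption that a reference is one machine word split into tag bits and byte-offset bits (the helper `max_addressable_bytes` is hypothetical). Going from 3 to 4 tag bits on 32-bit leaves 28 offset bits instead of 29.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bytes addressable by the offset field when `tag_bits` of a
 * `word_bits`-wide reference are reserved for the tag. */
static uint64_t max_addressable_bytes(unsigned word_bits, unsigned tag_bits)
{
    assert(tag_bits < word_bits);
    assert(word_bits - tag_bits < 64);  /* keep the shift well-defined */
    return UINT64_C(1) << (word_bits - tag_bits);
}
```

For example, `max_addressable_bytes(32, 3)` is 512 MiB while `max_addressable_bytes(32, 4)` is 256 MiB.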

So I'm back to thinking this is the better approach. Thoughts?

The figure below displays the total size of the link_ids as a fraction of total file size, both without (horizontal axis) and with (vertical axis) RLE compression. You can see that RLE achieves about 10x compression even in the worst case, and that for the worst package analyzed among Makie.jl's dependencies, ImageIO, the compressed link_ids are only 1% of the total file size.

[Figure: compression]


vtjnash commented Jan 18, 2023

We can recover more reftags whenever we want, by making them slightly more complicated. Only 4 of them are fixed now; the other 4 could be combined into 1.


timholy commented Jan 18, 2023

OK. What do you think about the approach in teh/compress2 (b43d3e0)?
There are a lot of places where I have to do quite different things on 64-bit and 32-bit, so the code is quite a lot uglier. It all just started to seem like the disadvantages were not worth the small savings in compression. But I can keep working on it and see if I can come up with a less ugly split.

KristofferC removed the backport 1.9 (Change should be backported to release-1.9) label Feb 20, 2023
giordano deleted the teh/compress branch February 25, 2024 21:41
Successfully merging this pull request may close these issues.

pkgimages: storage of link_ids should be compressed