Support ExtraData de-duplication in MySQL backend #1968

Closed
rolandshoemaker opened this issue Nov 12, 2019 · 4 comments

@rolandshoemaker
Contributor

(This is a rather complicated proposition, so it's more of a speculative request, rather than a concrete proposal.)

The CT personality builds leaves whose ExtraData field contains the user-submitted chain, minus the end-entity certificate. This data has relatively low cardinality (many leaves share the same chain) and takes up a considerable amount of space in the storage backend.

It would be great if there were an option to store the ExtraData in a separate table (or something) from LeafData so that it could be deduplicated (keyed on a hash of the data, or something similar), with many LeafData rows referencing one ExtraData row. In certain setups this could save >50% of current storage requirements. It seems likely that this optimization is only really relevant to the CT usage of Trillian, so I'm not entirely sure whether it could cause issues for other personalities. It would likely incur slightly more expensive MySQL queries.
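To make the idea concrete, here is a minimal sketch of the many-to-one layout, assuming a SHA-256 content hash as the dedup key. It uses an in-memory SQLite database purely so it runs standalone; the table names, column names, and helper functions (`open_store`, `add_leaf`, `get_leaf`) are hypothetical and are not Trillian's actual schema or API.

```python
import hashlib
import sqlite3


def open_store():
    """Create a toy store with ExtraData split out of the leaf table.

    Hypothetical schema for illustration only, not Trillian's.
    """
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE extra_data (
            hash BLOB PRIMARY KEY,   -- SHA-256 of the chain bytes
            data BLOB NOT NULL
        );
        CREATE TABLE leaf_data (
            leaf_id INTEGER PRIMARY KEY,
            leaf_value BLOB NOT NULL,
            extra_data_hash BLOB NOT NULL REFERENCES extra_data(hash)
        );
        """
    )
    return db


def add_leaf(db, leaf_value, extra_data):
    """Insert a leaf, storing its ExtraData only if it is not already present."""
    digest = hashlib.sha256(extra_data).digest()
    # SQLite spelling; MySQL would use INSERT IGNORE for the same effect.
    db.execute(
        "INSERT OR IGNORE INTO extra_data (hash, data) VALUES (?, ?)",
        (digest, extra_data),
    )
    cur = db.execute(
        "INSERT INTO leaf_data (leaf_value, extra_data_hash) VALUES (?, ?)",
        (leaf_value, digest),
    )
    return cur.lastrowid


def get_leaf(db, leaf_id):
    """Fetch a leaf together with its ExtraData; this join is the extra query cost."""
    return db.execute(
        "SELECT l.leaf_value, e.data FROM leaf_data l "
        "JOIN extra_data e ON e.hash = l.extra_data_hash "
        "WHERE l.leaf_id = ?",
        (leaf_id,),
    ).fetchone()
```

Two leaves submitted with the same chain end up sharing a single `extra_data` row, which is where the storage saving would come from; the join in `get_leaf` is the "slightly more expensive" read.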

@Martin2112
Contributor

It's an interesting request, though as you say possibly more a CT thing than generic.

We have considered options like not storing leaf data in the database. As it's immutable it could be served from edge caches or whatever. That was more for performance than saving disk space though. e.g. you could pack up ranges of leaves and serve them much faster than mysql can.

Not sure we'd want to make this sort of schema change now, but it's possible that experiments along these lines could be done with a modified CT personality that stores some sort of cache ID instead of the leaf data. That might be a place to start.

@pphaneuf
Contributor

From a purely Trillian point of view, ExtraData should probably not be in there at all, as it has absolutely nothing to do with the Merkle tree.

Having the data together makes it possible (even though it's not the case at the moment) to make get-entries (which makes up about 75% of read requests on our CT logs, the rest being mostly get-sth, and trace amounts of get-sth-consistency) have a much lower latency, by doing only one sequential read, instead of 1 sequential read + N lookups (where N is the number of entries fetched). This would probably be fewer lookups if they were deduped, though.
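As a rough illustration of the lookup counts involved, the sketch below models a get-entries call over deduplicated storage: one sequential read of the leaf range, then one lookup per distinct ExtraData hash rather than one per entry. The function name `get_entries`, the list-of-tuples leaf layout, and the dict-based ExtraData store are all assumptions made for this example, not how Trillian or the CT personality is structured.

```python
def get_entries(leaves, extra_store, start, end):
    """Return entries in [start, end] plus the number of ExtraData lookups made.

    leaves:      list of (leaf_value, extra_hash) in sequence order
    extra_store: dict mapping extra_hash -> ExtraData bytes (deduplicated)
    """
    window = leaves[start:end + 1]  # the single sequential read
    fetched = {}                    # per-request cache of ExtraData rows
    lookups = 0
    entries = []
    for leaf_value, extra_hash in window:
        if extra_hash not in fetched:
            fetched[extra_hash] = extra_store[extra_hash]  # one point lookup
            lookups += 1
        entries.append((leaf_value, fetched[extra_hash]))
    return entries, lookups
```

With N entries that share only a handful of chains, the lookup count is bounded by the number of distinct chains in the range, not by N; with the data stored inline it would be zero.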

Note that if Trillian didn't use revisions for subtrees of logs (which I believe it doesn't need, only maps need them?), you might be looking at both a speedup of proof retrieval and a very sizable reduction in storage requirements (I don't have numbers handy, but if someone told me about 50%, I'd believe it). It would also probably speed up sequencing.

@pav-kv
Contributor

pav-kv commented May 14, 2020

This discussion has migrated to google/certificate-transparency-go#691, as it was CT-specific. Generalisations are possible, but we tend to think that they should happen on the personality side.

I propose to close this issue here.

@pav-kv pav-kv closed this as completed May 14, 2020
@pav-kv
Contributor

pav-kv commented May 14, 2020

A good follow-up request from this one would be optionally interleaving the ExtraData field into the sequenced leaf data. This would help make get-entries calls faster for those kinds of use cases.

6 participants