Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add custom kvp metadata to chunk dictionary #51

Merged
merged 5 commits into from
Nov 2, 2024

Conversation

caesay
Copy link
Contributor

@caesay caesay commented Oct 30, 2024

This PR contains the following changes:

  • few extra gitignores for vscode / rustrover
  • metadata field added to ChunkDictionary, and surfaced through Archive<R>.metadata()
  • compress command gets new arguments: --metadata-file <key> <path> and --metadata-value <key> <value>
  • info command (no arguments) displays metadata keys and content length
  • info command gets new --metadata-key <key> argument which writes the kvp content to stdout (and prints nothing else)
  • refactor info command to use same InputArchive logic as clone command (rationale: info was incorrectly parsing a local Windows file path as a URL and I noticed this was solved already for clone. additionally, if the http server requires an authorization header it's not possible to use without the --http-header argument).

Possible improvements?

  • not sure how I feel about separate --metadata-file <key> <path> and --metadata-value <key> <value> arguments.
  • the metadata dictionary is not deterministic (ideally we should sort either by CLI order, or alphabetically. the latter would be easier)

@oll3
Copy link
Owner

oll3 commented Oct 31, 2024

Great improvements, and good fix to the info command too! Will go through it in more detail later but here are some initial notes.

the metadata dictionary is not deterministic (ideally we should sort either by CLI order, or alphabetically. the latter would be easier).

Yes, I do think the map of kvp should have deterministic order, preferably ordered as when inserted. And since the protobuf map is really just a repeated field of kvp's the order in the dictionary should be stable so it's just a question of how to map it on the Rust side.
Think you can configure prost to map map fields into a BTreeMap, or possibly just a plain Vec of pairs. Vec is probably sufficient until someone adds hundreds/thousands of pairs.

not sure how I feel about separate --metadata-file and --metadata-value arguments.

Personally I would find it to be enough with compress --metadata-value and one could provide files or values using the shell. But I'm ok with having a separate argument for reading directly from a file too.

I should probably put this in a contribution description somewhere... But if/when you do modifications to a PR please just modify each commit which is affected. Like switching to a BTreeMap would belong in the Add metadata commit etc.

@caesay
Copy link
Contributor Author

caesay commented Oct 31, 2024

Regarding --metadata-file <key> <path> and --metadata-value <key> <value>, out of the two, the more important one is --metadata-file, because that is the only way you can reliably provide arbitrary binary data. I doubt that you'll be able to pass a random binary file as an inline command line argument. (certainly not going to happen on Windows)

I tried switching the HashMap to a BTreeMap as suggested:

       prost_build::Config::new()
            .btree_map(&["."])
            .compile_protos(&["proto/chunk_dictionary.proto"], &["proto/"])
            .unwrap();

It kind of worked, in the sense that all of our code after clap did preserve the order in which we initially constructed the first BTreeMap. However, clap itself seems to have no consistency in the order in which it gives the command line arguments. Additionally, even if clap gave us the arguments in the order in which they were specified (it doesn't) I don't know how we'll preserve the order between the two different arguments, because clap doesn't give us the "position" of the argument. Hope that makes sense?

@oll3
Copy link
Owner

oll3 commented Oct 31, 2024

Regarding --metadata-file <key> <path> and --metadata-value <key> <value>, out of the two, the more important one is --metadata-file, because that is the only way you can reliably provide arbitrary binary data. I doubt that you'll be able to pass a random binary file as an inline command line argument. (certainly not going to happen on Windows)

I can see that it may provide some value so I'm ok with providing both.

It kind of worked, in the sense that all of our code after clap did preserve the order in which we initially constructed the first BTreeMap. However, clap itself seems to have no consistency in the order in which it gives the command line arguments.

I see. I would've expected clap to give us the values in order of occurrence but maybe it doesn't. I didn't really find anything in the docs mentioning the order of multiple values and haven't tested for myself. About the order between -file and -value I can see that the order between them will be lost, as you did mention.

But then, for now at least, I think it's enough if the bitar api for building and reading/iterating the kvps keeps the build time order. E.g one should be able to iterate the kvps in the same order as they were provided when building the archive.

Cli argument order could potentially be fixed later. For now, maybe just order the pairs on key name on the cli side?

Other than that I think I would prefer a regular Vec (like Vec<(String, Vec<u8>)>) to hold the kvps after decoding the dictionary. Mainly since I reads well that there is a order, it's performant and simple compared to a BTreeMap/HashMap. If a user has lots of values to look up one can collect them into a HashMap outside of the library. So basically I think the metadata fn on the dictionary could just be something like:

  fn metadata(&self) -> &[(&str, &[u8])]

And the simplest way to achieve this is probably to make a repeated field in protobuf. Since a protobuf map is really stored the same way it's just a question of how prost will decode it.

message ChunkDictionary {
  ...
  repeated KeyValuePair metadata = 8;
}

message KeyValuePair {
  string key = 1;
  bytes value = 2;
}

``
Does this make any sense to you?

@caesay
Copy link
Contributor Author

caesay commented Nov 1, 2024

I think that sorting the keys alphabetically would be the best way to make sure they are deterministic. I can make that change.

Other than that I think I would prefer a regular Vec (like Vec<(String, Vec)>) to hold the kvps after decoding the dictionary.

I don't think performance is really a concern here (it's going to be fast enough either way) but switching to a Vec of key value pairs is going to be slower and will make consuming code more complex. We need at least one extra mapping in the library (to map to/from the protobuf types) and all the use cases require a hashmap anyways, so to support the info command we're going to need to collect the Vec into a hashmap anyway. I'll need to do the same in my own usecases: there's no value in having a Vec when you need to retrieve a value by key.

@oll3
Copy link
Owner

oll3 commented Nov 1, 2024

I'll need to do the same in my own usecases: there's no value in having a Vec when you need to retrieve a value by key.

Fair enough. So would storing in a BTreeMap and have an api something like below be sufficient you think?

    fn metadata_iter(&self) -> impl Iterator<Item=(&str, &[u8])>
    fn metadata_value(&self, key: &str) -> Option<&[u8]>

@caesay
Copy link
Contributor Author

caesay commented Nov 1, 2024

I have made the discussed changes, please let me know if there's anything further!

@caesay caesay force-pushed the cs/custom-metadata branch 2 times, most recently from 49dd712 to c0b3279 Compare November 1, 2024 22:03
@oll3
Copy link
Owner

oll3 commented Nov 2, 2024

Think it looks good!

Was thinking that the metadata length value in the info command should use a unit (eg use human_size!) but realized it looked a bit busy when unit it's repeated for every kvp when we have multiple kvps, so I think this is fine.
Also a metadata_len might have been good to check number of kvps easily. But it's nothing that can't be added when needed.

Thank you, will merge!

@oll3 oll3 force-pushed the cs/custom-metadata branch from c0b3279 to 5c216c0 Compare November 2, 2024 22:31
@oll3 oll3 merged commit 6b84707 into oll3:main Nov 2, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants