[Solved] Defining Key Expressions #24

p-avital · 2022-05-17T10:09:30Z

p-avital
May 17, 2022

Key Expressions (KEs) are the representation of Zenoh's address space, but are currently code-defined rather than abstractly defined. This creates confusion (including among the team) as to their expected and true behavior. This discussion's goal is to reach a new definition of KEs that we may then adapt code to fit, while addressing some of the issues that current KEs raise.

Definitions

To be exact, Keys are Zenoh's address space.

Key Expressions are a language for addressing sets of keys in a compact form.

Any Key is a valid KE that matches only itself (this point is debatable under current implementation, which is one of the issues this discussion seeks to address).

Almost all of Zenoh's directives may be applied to sets of keys equivalently to single keys, so Zenoh's API only works in terms of KEs.

Needs

Looking back at our experience with KEs, here are some of the goals we have set for the new definition of KEs. To simplify possible examples, cke("...") and ke("...") denote the construction of a KeyExpr as currently defined and as proposed respectively, while cke"..." and ke"..." denote the resulting values of these constructions.

Stringifiability

Currently, KeyExpr is shown to the user as a (u64, String) tuple, where the u64 is a Session-bound identifier for a prefix, with 0 always being bound to "". This presentation is confusing, may lead to issues for users who spawn several Sessions, and is not generally useful, since any KeyExpr that reaches user scope has already been resolved by Zenoh, so its prefix is always 0.

Our new definition of KeyExpr should make key_expr.as_str() an infallible and context-independent operation.

Unicity

With their current definition, KEs present a subtle trap for users: there exist many strings that are different, but address the exact same set of keys: cke"a" == cke"/a", or cke"/*/**" == cke"/**/*" are some examples of this. This may lead user-code to misbehave if this isn't accounted for. This is compounded by the fact that we do not offer users a way to check for KE-equivalence.

We propose to solve this issue by restricting the KE language with new syntactic rules that ensure that, for any set of keys, at most a single KE exists to express it. For example, under these rules, ke"/a" would not be a possible value, with ke("/a") either producing the ke"a" value instead, or yielding an error.

Clearly defined syntactic rules

Currently, KeyExpr doesn't enforce any of the "rules" that we have set for excluded characters (such as ?), making these more akin to "suggestions". This may lead to odd behaviors, as these characters are typically excluded because they have or may take roles in certain operations: for example, ? acts as the separator between KE and arguments in selector-strings.

The new definition for KEs should address this by defining a clear set of characters that are forbidden.

Opacity

Currently, KeyExpr exposes its whole structure to users, even though modifying this structure serves no purpose other than creating bugs.

The new KeyExpr type should be as opaque as possible to allow greater flexibility.

Ease of use

Helper functions exist to let the user check KEs for intersection or inclusion, but they are hidden away as raw functions in obscure paths of the zenoh crate.

The new definition of KeyExpr should support common operations such as equality, intersection, comparison and concatenation in a well defined way through methods.

p-avital · 2022-05-17T10:49:34Z

p-avital
May 17, 2022
Author

Proposal 1:

Under this proposal, KeyExpr becomes an opaque type that always has a reference to a string that has been checked for syntactic validity. KeyExpr may implement Deref<Target=str> and AsRef<str> in Rust for convenience.

Validity

The criterion for syntactic validity is BOTH the absence of illegal patterns AND the canonical form of the expression.

#? is the set of forbidden characters.
** AND * may only be preceded and followed by / to reduce matching complexity.
// is an illegal pattern.
Leading and trailing /s are forbidden.
The empty KE is illegal.
These last 3 rules combined make *'s behavior entirely consistent without the handwavy stuff we often have to do when explaining it.
$ is reserved for DSLs, as explained further in the proposal.
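To make the rules above concrete, here is a minimal sketch of an illegal-pattern check. The function name illegal_pattern and the error strings are hypothetical, and the treatment of $ is an assumption: the only DSL tolerated here is the $* form described later in this proposal, and other uses of $ are not examined.

```python
from typing import Optional

# Assumption: the forbidden character set as stated in this proposal.
FORBIDDEN_CHARS = "#?"


def illegal_pattern(ke: str) -> Optional[str]:
    """Return a description of the first violated rule, or None if none is found."""
    if ke == "":
        return "empty key expression"
    if ke.startswith("/") or ke.endswith("/"):
        return "leading or trailing '/'"
    for chunk in ke.split("/"):
        if chunk == "":
            return "empty chunk ('//')"
        for c in FORBIDDEN_CHARS:
            if c in chunk:
                return f"forbidden character {c!r}"
        if chunk in ("*", "**"):
            continue  # whole-chunk wildcards are surrounded by '/' by construction
        # Any other '*' is only tolerated as part of the '$*' DSL marker.
        if "*" in chunk.replace("$*", ""):
            return "'*' or '**' not surrounded by '/'"
    return None
```

For instance, illegal_pattern("a/*/b") returns None, while illegal_pattern("a*/b") and illegal_pattern("/a") each return a message. This only covers the "absence of illegal patterns" half of validity; canon form is a separate check.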

A key belongs to the set defined by a KE iff by replacing all of the KE's ** by any arbitrary sequence of characters, and by replacing all * by any sequence that doesn't contain /, it is possible to reconstruct the key. * is thus equivalent to the [^/]* regex, and ** is equivalent to the .* regex.
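The membership rule can be sketched by translating a KE into a regex exactly as the paragraph above describes. The names ke_to_regex and ke_matches_key are hypothetical, and this is a literal reading of the definition rather than how matching should be implemented: it does not validate either the KE or the key.

```python
import re


def ke_to_regex(ke: str) -> str:
    """Translate a KE into a regex string: '**' becomes '.*', '*' becomes '[^/]*'."""
    parts = []
    i = 0
    while i < len(ke):
        if ke.startswith("**", i):
            parts.append(".*")
            i += 2
        elif ke[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            # Everything else must match literally.
            parts.append(re.escape(ke[i]))
            i += 1
    return "".join(parts)


def ke_matches_key(ke: str, key: str) -> bool:
    """True if the whole key can be reconstructed from the KE."""
    return re.fullmatch(ke_to_regex(ke), key) is not None
```

With this sketch, ke_matches_key("a/*/c", "a/b/c") holds, while ke_matches_key("a/*/c", "a/b/d/c") does not, since * cannot span a /.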

Two KEs intersect if there exists at least one key that belongs to both sets. KE A contains KE B if every key belonging to B also belongs to A.

A string that doesn't contain illegal patterns may be canonized by steps equivalent to this python code:

# Note, this is a very bad algorithm, do not canonize this way!
# This is just a concise way to express canonization.
def canonize(x: str):
  pastX = ""
  while pastX != x:
    pastX = x
    x = x.replace("**/**", "**") # note that python's replace method doesn't alter its arguments
    if pastX == x:
      x = x.replace("**/*", "*/**")
  return x

A string x is considered canon iff x == canonize(x).

It is proposed that the default KeyExpr::from_str constructor returns an error if the string contains an illegal pattern, and performs canonization on behalf of the user, as much better algorithms than the reference one exist to do so, and the costs of detecting non-canon forms and of correcting them are similar.
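As an illustration of what a cheaper canonization could look like, here is a single-pass, chunk-wise sketch. The name canonize_chunks is hypothetical, the input is assumed to be free of illegal patterns, and this is not claimed to be the algorithm Zenoh uses: it simply reorders each run of wildcard chunks so that every * comes first, followed by at most one **.

```python
def canonize_chunks(ke: str) -> str:
    """Canonize a KE in one pass over its '/'-separated chunks."""
    out = []
    singles = 0         # number of '*' chunks seen in the current wildcard run
    has_double = False  # whether the current wildcard run contains a '**'

    def flush() -> None:
        # Emit the pending wildcard run in canon order: all '*', then one '**'.
        nonlocal singles, has_double
        out.extend(["*"] * singles)
        if has_double:
            out.append("**")
        singles = 0
        has_double = False

    for chunk in ke.split("/"):
        if chunk == "*":
            singles += 1
        elif chunk == "**":
            has_double = True
        else:
            flush()
            out.append(chunk)
    flush()
    return "/".join(out)
```

On inputs without illegal patterns this is intended to agree with the reference canonize above, for example canonize_chunks("a/**/*/**/b") gives "a/*/**/b".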

Other constructors may be provided to return errors for non-canon forms, and to unsafely construct KeyExpr without checks, placing the burden of ensuring validity onto the user.

From a network standpoint, routers should check incoming KEs for validity before performing any operation, and drop messages with invalid KEs that may have sneaked into the network.

Concatenation and path joining

Concatenation and path joining should ensure that the KeyExpr invariants are maintained by checking for forbidden patterns and canonizing once done.

Concatenation is defined as simple string concatenation. Where operator overloading is available, it should be performed by the + and += operators.

Path Joining is defined as joining two KEs by a /. Where operator overloading is available, it should be performed by the / and /= operators.

To avoid unexpected behaviors, concatenating a string that starts with * onto a KeyExpr that ends with * yields an error. Should a user REALLY want to do so, they may use KeyExpr::from(format!("{}{}", previous_ke, string_starting_by_start)).

Concatenating a string onto a KeyExpr should check for the * on either side, but be otherwise identical to the code above.

Concatenating a KeyExpr onto a KeyExpr should perform a path-join-like operation, since KEs cannot start or end with a /, and standard concatenation would be available by using as_str on the second KE.
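The operator behavior described in this section can be sketched with a small Python stand-in for KeyExpr. Everything here is illustrative: the class is hypothetical, only a subset of the validity rules is checked, and canonization uses the slow reference algorithm from this proposal.

```python
FORBIDDEN_CHARS = "#?"


def _canonize(x: str) -> str:
    # Reference (slow) canonization, as given earlier in the proposal.
    past = None
    while past != x:
        past = x
        x = x.replace("**/**", "**")
        if past == x:
            x = x.replace("**/*", "*/**")
    return x


class KeyExpr:
    def __init__(self, expr: str):
        # Partial validity check: empty KE, leading/trailing '/', '//', forbidden chars.
        if expr == "" or expr.startswith("/") or expr.endswith("/") or "//" in expr:
            raise ValueError(f"illegal pattern in {expr!r}")
        if any(c in expr for c in FORBIDDEN_CHARS):
            raise ValueError(f"forbidden character in {expr!r}")
        self._expr = _canonize(expr)

    def __str__(self) -> str:
        return self._expr

    def __eq__(self, other) -> bool:
        return isinstance(other, KeyExpr) and self._expr == other._expr

    def __hash__(self) -> int:
        return hash(self._expr)

    def __truediv__(self, other) -> "KeyExpr":
        # Path joining: join with '/', then re-check and canonize.
        return KeyExpr(f"{self._expr}/{other}")

    def __add__(self, other) -> "KeyExpr":
        if isinstance(other, KeyExpr):
            # KeyExpr + KeyExpr behaves like a path join.
            return self / other
        if self._expr.endswith("*") and other.startswith("*"):
            raise ValueError("refusing to concatenate '*' onto '*'")
        # String concatenation, then re-check and canonize.
        return KeyExpr(self._expr + other)
```

For example, KeyExpr("a/**") / KeyExpr("*") yields the canon form a/*/**, and KeyExpr("a/*") + "*" raises an error instead of silently producing a/**.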

Extending the KE language through DSLs

To allow the KE language to grow and become more useful without causing breaking changes, Domain Specific Languages may be added to the KE language through updates. DSLs will be inserted into KEs via the $<DSL_ID><DSL_CONTENT> syntax.

DSLs will be presented to the user as a way to address more specific sets of keys than what the base KE language is able to, at the cost of performance. A $ in a KE will hence act as an easy marker for "this KE is slow".

The first DSL to be introduced will have the ID * and has 0-sized content: $* inside any KE chunk will act just like *, by matching 0 or more characters that aren't slashes. a/b* will now be written a/b$*. Measurements have shown that matching KEs following the a/b* pattern takes 30% to 50% more time than matching KEs following the a/* pattern, so marking this pattern as slow using $* instead of * both lets users know they're using something slow, and hopefully "annoys them" into thinking their key-space through better.

Note that DSLs will still have to be designed to maintain unicity: for the case of $*, it is forbidden to use it alone as a chunk, since * would match the same set of keys: a/$* canonizes to a/*.
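A minimal sketch of how a single chunk containing $* could be canonized and matched follows. The function names are hypothetical, and the regex translation is only one possible way to express "0 or more characters that aren't slashes"; it says nothing about how fast a real implementation would be.

```python
import re


def canonize_chunk(chunk: str) -> str:
    """A chunk made of '$*' alone addresses the same keys as '*', so '*' is canon."""
    return "*" if chunk == "$*" else chunk


def chunk_matches(ke_chunk: str, key_chunk: str) -> bool:
    """Match one KE chunk (possibly containing '$*') against one key chunk."""
    if "/" in key_chunk:
        return False  # a chunk never contains a slash
    if ke_chunk == "*":
        return True
    # Each '$*' stands for zero or more non-slash characters; the rest is literal.
    pattern = "[^/]*".join(re.escape(part) for part in ke_chunk.split("$*"))
    return re.fullmatch(pattern, key_chunk) is not None
```

With this sketch, chunk_matches("b$*", "bar") and chunk_matches("b$*", "b") both hold, while chunk_matches("b$*", "car") does not.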

In the future, we hope to provide a new DSL which will offer a subset of regex's functions, but with many questions regarding unicity and canonization, as well as efficient intersection (especially when both KEs contain the regex-like DSL), this DSL will probably take a long time to develop.

4 replies

Mallets May 18, 2022
Collaborator

It would be great if you could give more insight behind the motivation of:

#[]{}$? is the set of illegal characters. ** may only be preceded and followed by / to reduce matching complexity. // is an illegal pattern. Leading and trailing /s are forbidden. The empty KE is illegal. These last 3 rules combined make *'s behavior entirely consistent without the handwavy stuff we often have to do when explaining it.

In addition to reducing matching complexity (which should also speed up the routing), what's the rationale for prohibiting leading and trailing /? Could you elaborate more?

p-avital May 19, 2022
Author

Since the dawn of time (ok, maybe not that long), Zenoh's ignored trailing slashes. This is mostly because NOT ignoring them opens a can of worms:

How does cke"a/" differ from cke"a"? / being used as a form of hierarchical separator, does it imply that empty chunks are valid chunks? Then why wouldn't * match the empty chunk? The commonly accepted explanation for *'s behavior is that it corresponds to [^/]*, but empty chunks are not allowed to exist.
KEs seek to be URL-compatible, trailing /s may interact very poorly with a hypothetical requirement to treat them as significant.

Leading /s are a similar story, they've been largely ignored because we've always encouraged users to always use them. This is largely due to the file-system analogy, as well as the now-removed existence of workspaces which carried a concept similar to a current working directory, giving sense to the leading / as a qualifier for absolute paths.

With the removal of workspaces, leading / have lost their significance.
Leading / again interacts poorly with the URL-compatibility of KEs.
The currently accepted solution is to always specify a leading /. However, this implies carrying a non-significant character around for the sole reason of convention, which I'm against. As Syndrome said, when everyone is Super, no one is.

In general, the forbidding of leading and trailing /s, combined with the forbidding of //, implies that KEs are treated as a straight-forward /-separated list of chunks, where empty chunks are forbidden. The alternative of allowing empty chunks may cause incompatibilities with URLs, while only providing new and improved ways for users to run into errors.

Note that due to unicity, ke"/a" MUST be semantically different from ke"a", so ignoring empty chunks and/or leading/trailing /s is not an option.

Finally, since we don't currently have better use for them than "let's ignore them", we might as well forbid them, if only to prevent adding meaning to them from breaking user code in the future.

p-avital May 24, 2022
Author

Note: this proposal's take on concatenation was edited after suggestions by @JEnoch

p-avital May 24, 2022
Author

Note: following a discussion with @OlivierHecart and @JEnoch, []{} have been removed from the set of forbidden characters.

The intent to keep them as delimiters for future Domain Specific Languages (DSL) can be fulfilled by the current team consensus that DSLs would be marked by $<DSLid>{<DSLcontent>}.

kydos · 2022-06-03T08:37:31Z

kydos
Jun 3, 2022
Maintainer

I am generally happy with the proposal, but I have two comments:

I still prefer key-expressions to start with "/", as this is in line with what file-systems do as well as the path on URL/URI
I'd like to discuss k-expr registration and multiple sessions.

2 replies

p-avital Jun 6, 2022
Author

This was re-discussed IRL. Just for others to know about the resolution: leading slashes will indeed be forbidden, as keeping them would mean enforcing that they always be there (otherwise, streq-seteq breaks), which is just overhead for the sake of aesthetics.

As for the point on registration (declare_keyexpr) and multiple sessions: it is indeed desirable to ensure that a KeyExpr, even if associated with a session because it was declared for that specific session, be compatible with other sessions.

The good news on that front is that since this new spec requires that KeyExpr always be deref-able into its full expression str, as long as sessions can inspect KeyExpr's internals and find out they aren't concerned by the other session's optimization, they can indeed use the KeyExpr like they would use any that hasn't been declared from their point of view.

Note that this multi-session compatibility of declared KeyExpr hasn't been implemented yet, but we'll probably add this before 0.6.0; that way KeyExpr will always be usable everywhere, which will completely eliminate that bug-category.

p-avital Jun 6, 2022
Author

Again on multi-session compatibility: since peace of mind was a main objective of the KeyExpr refactor, I went ahead and implemented it :)

Note that the current implementation is u16-id based, to keep KeyExpr's size from increasing on the stack, at the cost of limiting the user to 2^16 sessions within a single program. I deeply apologize to any user who wants to open 65537 sessions or more.

cguimaraes · 2022-06-17T06:49:52Z

cguimaraes
Jun 17, 2022
Collaborator

@p-avital / @Mallets which of the following are not valid KEs?

ke("a/*b*") -> second segment has at least a b?
ke("a/**b**") -> should it be equivalent to ke("a/**/*b*/**")?
ke("a/**b") -> should it be equivalent to ke("a/**/*b") or ke("a/**/b")?
ke("a/b**") -> should it be equivalent to ke("a/b*/**") or ke("a/b/**")?

18 replies

p-avital Jun 17, 2022
Author

Just throwing a possible (quick) solution out there:

** may only be surrounded by /, still
transition * to only be allowed to be surrounded by /
introduce $* as our first DSL, to express the meaning of our current *.

That way, a keyexpr is quickly sorted by:

contains $ => will need chunk-wise evaluations or to apply DSLs
contains * but not $ => has wild components, but we never need to look inside * chunks
contains neither => streq is enough for matching.
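The three-way sorting above is cheap to express. Here is a small sketch, where the function name and the returned labels are hypothetical and the KE is assumed to be already valid:

```python
def classify_ke(ke: str) -> str:
    """Sort a valid KE into one of three matching strategies."""
    if "$" in ke:
        # Needs chunk-wise evaluation or DSL handling.
        return "dsl"
    if "*" in ke:
        # Has wild chunks, but no need to look inside them.
        return "wild"
    # No wildcard at all: plain string equality is enough.
    return "literal"
```

For example, classify_ke("a/b$*") gives "dsl", classify_ke("a/*/b") gives "wild", and classify_ke("a/b") gives "literal".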

cguimaraes Jun 20, 2022
Collaborator

I was not expecting it to raise such a discussion.
Anyway, here are my two cents:

I see more benefits when * or ** is only allowed to be surrounded by slashes.

Taking the example from @JEnoch, it allows shortening ke(/a/b/something/*) into ke(1, *), which would otherwise need to be ke(1, something-*).
Also, as a user, I tend to use Zenoh or to organise resources in a tree, in a similar way to what @gabrik just described. I think it is more intuitive and less error-prone, since you have a clear view of your information and data model. If the best practices that @Mallets just shared are something that users are already aware of, it would just help our case to motivate such a choice.

As additional detail as to why ke"a/**b" &co are forbidden, it would also break unicity, as whether we make it behave like ke"a/**/b" or like the more senseful ke"a/**/*b", it's still a different string that addresses the same set of keys, which is forbidden by the unicity requirement

Also, if ke("a/**b") breaks unicity, it might hinder any security model to be defined in the future.

If we want to support this kind of matching we should go all-in and provide a RegExp-like thing, like a/b$[0-9]$ that will be way easier to understand by users and way less inconsistent (this could be done in the future and will be costly for the routing, but hey no magic is possible).

Exactly. We might provide fancier ways to handle KEs that are more regex-oriented, but users must be aware that this will have an impact on performance. Still, the baseline mechanisms provide higher performance, and users can make use of the full power of Zenoh if they translate their information and data model accordingly.

Or robot/sensor-$[.+]$ that will match all of the above, and all the other sensors.

I like this, and although I agree with @JEnoch that handling $[..]$ is tedious and error-prone, I truly believe that the problematic part is the regex itself and not how we describe it in the KEs.

Summarising:

I think ke("a/*/b") is a valid KE, while ke("a*/b") is not. 
To provide in the future regex-oriented KEs that will allow a higher degree of expressiveness, at the cost of performance.
I do not agree in introducing already $* as our first DSL for this release.

p-avital Jun 20, 2022
Author

@cguimaraes The reason I proposed $* as a first DSL (apart from the fun of creating a DSL where the only valid expression is an empty one) is so that we don't remove the current *'s behavior from our feature-set, as some users may get angry at that. I don't think we're ready to tackle even a subset of regex just yet :)

cguimaraes Jun 20, 2022
Collaborator

As long as we define a pattern for the regex within the KEs that is future friendly, then I agree to have it so that users are not losing functionalities.
In that sense, I am more fond of what @gabrik proposed: $[*] than just $*

p-avital Jun 20, 2022
Author

The reason for my $* proposal is that the original intent for DSLs was to work as $<DSL_ID><DSL_CONTENT>, where the DSL would be responsible for selecting its range. Rather than extending a single DSL's capabilities, the strategy would be to introduce DSLs as we become able to support them.
I'd rather keep $[<DSL_CONTENT>] for a more interesting DSL such as character ranges or other regex likes.

Keep in mind that by "supporting a DSL", I mean supporting it to the standards that KE sets, with unicity still being a high priority IMO, so when introducing a DSL, we must be sure that we can unify it across DSLs: ke("a/$*/b") would have to canonize to ke"a/*/b", as would an eventual ke("a/$re{.*}/b"), ke("a/$re{.+}/b"), ke("a/$re{..*}/b") etc... Introducing DSLs will be a tall order if we want unicity, which is a valuable property to simplify routing and user code, so I'd rather introduce the smallest DSL possible for now, and work by DSL addition rather than DSL extension.

$* is simply the shortest available string that interoperates well with the current spec and makes sense for "match 0+ characters within this chunk".

sreeja · 2022-06-21T15:39:04Z

sreeja
Jun 21, 2022

In order to bring clarity to key_expr, I am attempting a context-free grammar style definition with help from @p-avital. The following is the CFG for how it stands now:

# Note: this is a first draft to be built-on if useful.

# Terminals
SLASH -> "/"
STAR -> "*"
LITERAL -> [^#$?*/]+

# Non terminals
key_expr -> chunks 
	| double_wild + SLASH + chunks
	| double_wild 
	| multiple_single_wild + SLASH + double_wild + SLASH + chunks
	| multiple_single_wild + SLASH + double_wild
	| multiple_single_wild + SLASH + chunks
	| multiple_single_wild 
chunks -> valid_chunk + SLASH + double_wild + SLASH + chunks
	| valid_chunk + SLASH + double_wild
	| valid_chunk + SLASH + multiple_single_wild + SLASH + chunks
	| valid_chunk + SLASH + multiple_single_wild
	| valid_chunk
valid_chunk -> free_chunk + SLASH + valid_chunk
	| free_chunk 
free_chunk -> literal_chunk  
	| pattern_chunk
multiple_single_wild -> single_wild + SLASH + multiple_single_wild
	| single_wild 

literal_chunk -> LITERAL 

double_wild -> STAR + STAR

single_wild -> STAR

pattern_chunk -> multi_pattern_prefix + LITERAL
	| multi_pattern_prefix 
	| multi_pattern_suffix + STAR
	| multi_pattern_suffix
multi_pattern_prefix -> pattern_prefix+
multi_pattern_suffix -> pattern_suffix+
pattern_prefix -> LITERAL + STAR
pattern_suffix -> STAR + LITERAL

We can extend this if we decide to include DSL or regex.

0 replies

p-avital · 2022-06-27T16:02:00Z

p-avital
Jun 27, 2022
Author

After considering the discussions in the comments under @cguimaraes's remarks, it has been decided that * may now only be surrounded by /. The reasoning for this is that it allows KE matching to be much faster (about 30%, and KE matching is a big part of what a zenoh router does).

We're not dropping support for the a/b* pattern completely: instead, $* will be introduced as zenoh's first DSL, where exactly one program can be defined: match 0 or more characters that aren't slashes.

This introduces the concept of Domain Specific Languages as the future of extending the KE language, which is planned to follow a $<DSL_ID><DSL_CONTENT> syntax. The goal of such DSLs will be to give users the ability to address more restricted key sets than what the core KE language offers, while increasing awareness of the computational cost of doing such things. We will then be able to tell users that a KE that contains a $ is a slow KE, and that an address space that forces the user to use them is likely ill-designed.

Note that we plan on introducing subsets of regex's feature set as a DSL, but since unicity is a great property for KEs to have, designing such a language is difficult and will take some time.

0 replies

Mallets · 2022-09-12T09:21:41Z

Mallets
Sep 12, 2022
Collaborator

RFC available here: https://github.com/eclipse-zenoh/roadmap/blob/main/rfcs/ALL/Key%20Expressions.md

0 replies
