Support for Static Strings #435

Closed

myoussefmir opened this issue Nov 12, 2020 · 9 comments
Labels
feature A new feature.

Comments

myoussefmir commented Nov 12, 2020

🚀 Feature

Support for Static Strings

Motivation

Usage of strings is common in web programming; support for them is table stakes.

acoglio (Collaborator) commented Apr 27, 2021

Some initial design questions [with comments in brackets] about this feature for DP3:

  • Will these support Unicode? [Probably, since currently console format strings and comments do.]
  • Will these subsume console format strings? [Probably, since we wouldn't want to have two different but similar notions.]
  • Does 'static' refer to just supporting string literals, or are we planning to have a full-fledged string type? [I'm assuming the latter.]
  • Will these support some form of Unicode escapes, and if so which form exactly? [Seems a good idea.]
  • Will these support shorter escapes for common characters (e.g. \n or \")? [Also seems a good idea.]
  • Will we also support a type of (Unicode) characters? [Seems natural if we have strings; the alternative is to use non-negative numbers (code points).]
  • If we also want a type for characters, do we also want character literals, besides string literals, with similar escapes? [Seems natural.]
  • Will we provide operations on strings via (built-in) non-member functions like string_length or via member functions like String::length (where String is a built-in circuit name that collects static member functions)? [A similar issue may apply to other built-in operations that manifest as functions, such as the operations on bits and bytes in [Proposal] bits/bytes syntax #682; we should pick one approach for all of these.]
  • Does compilation to R1CS impose statically knowing the size of all strings? [I think it does.]
  • Since conceptually a string is a sequence of characters, should strings be just arrays of characters with some special syntax? [Note that if we have a type of characters we would have arrays of characters anyhow.]

bendyarm (Collaborator) commented:

I think we want to avoid bringing the Unicode tables into the R1CS. What are the operations that are most needed, and what representation will make them work efficiently?

For example, if a common operation is substringp(sub, super), where sub is a public input and super is private, what would that look like in the gadget if we used UTF-8 vs. UTF-32? It might be feasible to convert UTF-8 into UTF-32 outside the gadget; the code in the gadget could then use field elements to represent the code points, and the algorithm might be simpler than if it always had to look up how many bytes to read to reach the next character.
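
For illustration, here is a rough sketch of that idea in plain Rust (outside any gadget), with code points stored as u32 values; the function names are just placeholders:

```rust
// Decode UTF-8 into one code point (UTF-32 value) per character, outside the
// gadget; inside the gadget each u32 could map onto a field element.
fn to_code_points(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}

// Naive substring predicate over code-point sequences; no multi-byte
// decoding is needed because the conversion already happened above.
fn substringp(sub: &[u32], sup: &[u32]) -> bool {
    sub.is_empty() || sup.windows(sub.len()).any(|w| w == sub)
}

fn main() {
    let sup = to_code_points("confidential log entry: 取引");
    let sub = to_code_points("log");
    assert!(substringp(&sub, &sup));
}
```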

acoglio (Collaborator) commented Apr 28, 2021

I had a look at string support in Solidity.

In short: There are string literals, but they really are just another way to denote byte arrays of various kinds, and there seem to be no operations on strings as such.

A bit more detail:

There is a string type for dynamically sized UTF-8 byte arrays (https://docs.soliditylang.org/en/v0.8.4/types.html#dynamically-sized-byte-array), but without associated operations, not even length or indexing on this type (https://docs.soliditylang.org/en/v0.8.4/types.html#bytes-and-string-as-arrays). The documentation mentions ABI access and 3rd-party string libraries, but these are really operations on byte arrays.

Besides the string type, string literals may have (i.e. denote values of) types bytesN (if they have length N) or bytes (https://docs.soliditylang.org/en/v0.8.4/types.html#string-literals-and-types) -- where bytes is a packed version of the dynamically sized array type byte[]. So they are all byte arrays, and string literals just provide another way to denote them.

There are two kinds of string literals: one without prefix that denotes ASCII, and one prefixed by unicode that denotes Unicode UTF-8.
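
For comparison only (not a proposal), Rust's byte-string literals give a similar "string literal as a byte array" view; a minimal sketch:

```rust
fn main() {
    // A byte-string literal with a statically known length, roughly the
    // Solidity bytesN case: the literal just denotes a fixed-size byte array.
    let greeting: &[u8; 5] = b"hello";
    assert_eq!(greeting.len(), 5);

    // A UTF-8 string literal viewed through its underlying bytes, roughly the
    // dynamically sized string/bytes case.
    let unicode = "héllo";
    assert_eq!(unicode.as_bytes().len(), 6); // 'é' takes two bytes in UTF-8
}
```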

bendyarm (Collaborator) commented:

Here are some more string ideas. No attempt to be complete, but some things to think about.

  • Some algorithms on strings

    • string search (substring predicate)
    • regex search and pattern match
    • edit distance
    • control (string is used to select an option in the program)
  • Creating strings

    • messages
    • results of computations
    • matched patterns
  • Typical operations

    • literal strings. Leo literal strings could have escapes, as Coglio mentions.
      We probably don't need to expose "parse literal string" as an operation inside the circuit.
    • comparisons
    • append
    • format (fill in format indicators like {} in a formatted string)
  • Operations not needed

    • arithmetic operations (although maybe upcase/downcase could use them)
  • Probably not needed

    • exposure of Unicode details, e.g. conversion from UTF-16 to Leo's representation.
      We can probably handle this sort of data conversion in input/output data adapters.
      If such a conversion is needed inside the program (for example, to check
      whether a Unicode Japanese string appears in a legacy JIS X 0208 file),
      it would have to be coded in the Leo program itself.
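
A rough sketch (plain Rust, over code-point arrays as discussed earlier in this thread) of two of the typical operations above, comparison and append; the helper names are placeholders:

```rust
// Equality comparison over code-point arrays.
fn string_eq(a: &[u32], b: &[u32]) -> bool {
    a == b
}

// Append: copy both inputs into a fresh array. In a circuit this would need
// statically known lengths (or a fill-pointer scheme, discussed below).
fn append(a: &[u32], b: &[u32]) -> Vec<u32> {
    let mut out = a.to_vec();
    out.extend_from_slice(b);
    out
}

fn main() {
    let hello: Vec<u32> = "hello ".chars().map(|c| c as u32).collect();
    let world: Vec<u32> = "world".chars().map(|c| c as u32).collect();
    assert!(!string_eq(&hello, &world));
    assert_eq!(append(&hello, &world).len(), 11);
}
```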

Motivating example of string search.

Let's say that you log transactions to a confidential database, and you publish the hashes of the logs
to a blockchain (or other secure location).
Later, it is found that such databases have a vulnerability that, when exploited, results in a certain
message in the log. How can you prove to a regulator that the message is not in your log?

The Leo program takes a public hash of the log, a public string (or regex), and a private copy of the log,
and returns one of four possible values: (1) the hash matches and the string is in the log;
(2) the hash matches and the string is not in the log; (3) the hash does not match the hash of the log;
(4) some other problem made it impossible to answer. (The last value may not be needed.)

You would run the program, and prove to the verifier that the program got answer 2: the string
is not in the log. You could also run it with a string that, if the database is logging properly,
you would expect to see in the log, and prove that it is in the log.

How big are these logs, and how big would the resulting circuit be? This could be challenging.
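
A hedged sketch of that decision logic in plain Rust (no circuit); hash_log is only a stand-in for whatever hash was actually published, and the enum mirrors the four outcomes described above:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

enum Outcome {
    HashMatchesAndStringPresent, // (1)
    HashMatchesAndStringAbsent,  // (2) the case proven to the regulator
    HashMismatch,                // (3)
    OtherProblem,                // (4) possibly unnecessary
}

// Placeholder only; a real deployment would use the hash actually published
// for the log (e.g. a cryptographic hash), not Rust's default hasher.
fn hash_log(log: &str) -> u64 {
    let mut h = DefaultHasher::new();
    log.hash(&mut h);
    h.finish()
}

fn check(public_hash: u64, public_needle: &str, private_log: &str) -> Outcome {
    if hash_log(private_log) != public_hash {
        return Outcome::HashMismatch;
    }
    if private_log.contains(public_needle) {
        Outcome::HashMatchesAndStringPresent
    } else {
        Outcome::HashMatchesAndStringAbsent
    }
}

fn main() {
    let log = "tx 1 ok\ntx 2 ok\n";
    let published = hash_log(log);
    assert!(matches!(
        check(published, "exploit marker", log),
        Outcome::HashMatchesAndStringAbsent
    ));
}
```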

More observations.

If a string is a byte array in UTF-8, then the meaning of a byte is context-dependent.
It would be interesting to see how the string operation gadgets would differ based on
using UTF-8 versus UTF-32. In either case, the array elements could be field elements
for simplicity.

Adapters: it would be good if the definition of a Leo program could include some additional
data conversions on the inputs and outputs of a program. For example, if someone had
a log file in UTF-8, the conversion could turn that into a string of utf-32 characters
outside the circuit. Another example would be a network header that needs to be
decomposed into various fields. There is no advantage to pushing that code into the circuit.
Better to destructure the struct outside the circuit and pass in the fields as inputs.
In general, we want to turn data into R1CS-friendly data types before putting it in the circuit.
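
A minimal sketch of such an input adapter in Rust: reject malformed UTF-8 and produce fixed-width code points before anything reaches the circuit (the function name is a placeholder):

```rust
// Validate and convert raw UTF-8 bytes into code points outside the circuit,
// so the in-circuit representation never deals with multi-byte sequences.
fn utf8_to_code_points(raw: &[u8]) -> Result<Vec<u32>, std::str::Utf8Error> {
    let s = std::str::from_utf8(raw)?;        // reject malformed UTF-8 here
    Ok(s.chars().map(|c| c as u32).collect()) // one u32 (code point) per char
}

fn main() {
    let raw = "log entry: ログ".as_bytes();
    let code_points = utf8_to_code_points(raw).expect("input must be valid UTF-8");
    // `code_points` is what would be handed to the circuit as its string input.
    println!("{} code points", code_points.len());
}
```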

acoglio (Collaborator) commented Apr 29, 2021

Here is a summary of the preliminary and tentative design ideas discussed at today's weekly sync, with some additional thoughts, just so that we have a record of what was discussed. Please add anything I may have forgotten.

Programmer's view (largely independent from R1CS compilation):

  • Add a new scalar type char to Leo.
  • This new type char is isomorphic to the set {0, ..., 10FFFFh}, but it is distinct from all the other Leo types.
  • Values of type char can be denoted via character literals like 'a' or '<escape>' for some suitable choice of escape notations (e.g. \n, \u6FF).
  • Values of type char could perhaps also be denoted via typed integer literals of the form 34char (where the char suffix may be type-inferred like other literals), provided the integer is in the range of Unicode code points (a simple static check).
  • We need to decide which operations to support on char, e.g. < for comparisons, + for sums (the result should be within the range of Unicode code points), casts to/from other integer types (subject to range restrictions), functions to upcase/downcase, etc.
  • Strings are just arrays of chars -- no new type for strings as such.
  • We may consider syntactic sugar like string(10) as a synonym of [char; 10].
  • Polymorphic operations on arrays (e.g. range) are automatically applicable to strings.
  • We need to decide which built-in functions to support on strings -- maybe no operators as it would be odd to have operators that only work with certain arrays but not others.
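
For reference, here is how the analogous pieces behave in Rust today (keeping in mind that Rust's char is restricted to Unicode scalar values, which comes up later in this thread):

```rust
fn main() {
    // Character literals with short and Unicode escapes, as in the bullets above.
    let a: char = 'a';
    let nl: char = '\n';
    let arabic: char = '\u{6FF}';
    assert_eq!(nl as u32, 10);
    assert_eq!(arabic as u32, 0x6FF);

    // Comparisons and casts between char and integer code points.
    assert!(a < 'b');
    assert_eq!(a as u32, 97);
    assert_eq!(char::from_u32(97), Some('a'));

    // "Strings are just arrays of chars": a fixed-size array of char.
    let s: [char; 5] = ['h', 'e', 'l', 'l', 'o'];
    assert_eq!(s.len(), 5);
}
```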

Compilation to R1CS:

  • A char could be represented as a field element, which is more efficient than one field element per bit of the character.
  • The size of strings must be statically known as for any other arrays. This is why the syntactic sugar string(10) has a length, although we might allow string as something whose length must be inferred (but note that we currently have no array notation for length-to-be-inferred).
  • If strings were a separate type from arrays, we would have to replicate some of the machinery to infer array sizes etc. for strings, so that's a reason for just viewing strings as arrays of characters.
  • The choice of operations on characters and strings may be affected by considerations of efficient realization in R1CS.

bendyarm (Collaborator) commented Apr 29, 2021

Variable-length arrays that are implemented by a fixed, larger array and a fill pointer would be especially useful for concatenating strings. Max had a related proposal.
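
A minimal sketch (in Rust, with an assumed BoundedString name) of the fixed-array-plus-fill-pointer idea:

```rust
struct BoundedString<const CAP: usize> {
    data: [char; CAP], // fixed-size backing storage, as a circuit would need
    len: usize,        // fill pointer: how many slots are actually in use
}

impl<const CAP: usize> BoundedString<CAP> {
    fn new() -> Self {
        Self { data: ['\0'; CAP], len: 0 }
    }

    /// Append one character, failing if the fixed capacity is exhausted.
    fn push(&mut self, c: char) -> Result<(), &'static str> {
        if self.len == CAP {
            return Err("capacity exceeded");
        }
        self.data[self.len] = c;
        self.len += 1;
        Ok(())
    }
}

fn main() {
    let mut s = BoundedString::<8>::new();
    for c in "hi".chars() {
        s.push(c).unwrap();
    }
    assert_eq!(s.len, 2);
}
```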

It would be good to have library routines for

  • integer (any int type including field element) to hex string
  • eip-55 20-byte number (field element) to mixed-case hex string
  • field element to base58 or base58check string (for Bitcoin addresses)
  • bech32 for segwit and many other cryptocurrency address encoders
  • other address encoders as demand requires (see https://github.com/ensdomains/address-encoder)
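
The first routine on that list is trivial outside a circuit; here it is in Rust for reference (the others, such as EIP-55, base58check, and bech32, would likely come from existing libraries):

```rust
// Integer to lowercase hex string, done outside the circuit.
fn to_hex(n: u128) -> String {
    format!("{:x}", n)
}

fn main() {
    assert_eq!(to_hex(255), "ff");
    assert_eq!(to_hex(0xdead_beef), "deadbeef");
}
```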

acoglio (Collaborator) commented Apr 29, 2021

Would it make sense for these kinds of variable-length arrays/strings to be realized via (library) circuit types?

bendyarm (Collaborator) commented May 4, 2021

I suggest we use the exact syntax of Rust's character and string escapes for Leo's literal characters and strings.

https://doc.rust-lang.org/reference/tokens.html#character-and-string-literals

Possibly we could drop the rule where whitespace after a backslash-newline
is ignored, but overall I think the Rust design is very good and we could
just use it without modification.
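
Concretely, these are the Rust behaviors being referenced (standard escapes plus the backslash-newline rule that strips the following whitespace):

```rust
fn main() {
    assert_eq!('\n' as u32, 10);        // short escape
    assert_eq!('\u{6FF}' as u32, 0x6FF); // Unicode escape
    assert_eq!("say \"hi\"".len(), 8);   // escaped double quote

    // A backslash at the end of a line continues the string literal and drops
    // the newline together with the next line's leading whitespace:
    let s = "hello \
             world";
    assert_eq!(s, "hello world");
}
```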

I see on Asana that casting is a P1 for DP3a.
Maybe we could treat char as an integer type for casting purposes,
as a way to go from chars to code point values and vice versa.

Another thing on chars potentially representing code points that are not part of any Unicode encoding:
the RFC mentions any "code point" in the range 0 through 10FFFF, but the code points in the range
U+D800–U+DFFF are not "Unicode Scalar Values"
(http://www.unicode.org/glossary/#unicode_scalar_value), i.e., they don't represent any character
and are not legal UTF-32.
So we need to establish the rules on when these values cause what sorts of errors.
If you do
let c: char = '\u{d800}';
would that succeed or fail?
How about if we have casting and you do
let c: char = (char) 55296;
What if you have a char whose value is 55295 and add 1 to it?
What if you then output it, either via console.log or as a return value?
I think we need to answer all these questions.
Btw, Rust restricts the char type to only the "Unicode scalar values".
The advantage of being more restrictive is that you don't have to check validity
every time the string is used for something that might go outside the circuit.
However, the downside of making Leo more restrictive is that more checking
would have to happen in the circuit. I think less checking in the circuit
is important, so I would vote for diverging from Rust in this case.
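
For concreteness, this is what the more restrictive Rust design does with those values today:

```rust
fn main() {
    // let c: char = '\u{d800}';                        // rejected at compile time
    assert_eq!(char::from_u32(0xD800), None);            // 55296: surrogate, not a char
    assert_eq!(char::from_u32(0xD7FF), Some('\u{D7FF}')); // 55295 is fine
    assert_eq!(char::from_u32(0x11_0000), None);          // above 10FFFF also fails
}
```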

collinc97 (Collaborator) commented:

Added to Leo in #974
