nom 5.0 internal design #878

Geal · 2019-03-06T17:56:55Z

After some thought, I reached a satisfying new design for nom 5.0, that I tried in the nomfun repository.
This design uses functions instead of macros, with the same signature as macros combinators, mostly having I -> IResult<I,O,E> functions as arguments, and returning other functions, or applying them directly on some input.
As an example, here is how the pair combinator would be written:

pub fn pair<I, O1, O2, E, F, G>(first: F, second: G) -> IResult<I, (O1, O2), E>
  where F: Fn(I) -> IResult<I, O1, E>,
        G: Fn(I) -> IResult<I, O2, E> {

  move |input: I| {
    let (input, o1) = first(input)?;
    second(input).map(|(i, o2)| (i, (o1, o2)))
  }
}

pub fn pairc<I, O1, O2, E, F, G>(input: I, first: F, second: G) -> IResult<I, (O1, O2), E>
  where F: Fn(I) -> IResult<I, O1, E>,
        G: Fn(I) -> IResult<I, O2, E> {

  pair(first, second)(input)
}

This way we have two versions, one that combines two parsers and makes another one, and another that can take some input.

The macro version can then be rewritten that way:

macro_rules! pair(
  ($i:expr, $submac:ident!( $($args:tt)* ), $submac2:ident!( $($args2:tt)* )) => (
    pair!($i, |i| $submac!(i, $($args)*), |i| $submac2!(i, $($args2)*))
  );

  ($i:expr, $submac:ident!( $($args:tt)* ), $g:expr) => (
    pair!($i, |i| $submac!(i, $($args)*), $g);
  );

  ($i:expr, $f:expr, $submac:ident!( $($args:tt)* )) => (
    pair!($i, $f, |i| $submac!(i, $($args)*));
  );

  ($i:expr, $f:expr, $g:expr) => (
    $crate::pairc($i, $f, $g)
  );
);

As we can see currently in the 5.0 branch, most combinators are easy to replace:

extract the macro's code
put it in a function
add the proper trait bounds to the function
replace the macro's code with a function call

The resulting code is functionally equivalent, has less type inference issues and is much faster when built with link time optimization (I have seen results like 20% faster with the new design on some benchmarks).

Another benefit of this system is that it benefits from better import behaviour. Right now, even in edition 2018, macros that are exported are put at the top level (so the module import like macros use is actually a lie). So I cannot make variants of macros 'except by changing their name, to do stuff like separated_list and separated_list_complete.
This was an issue because we expect slightly different behaviour in streaming parsers (where we're not sure we'll get the whole data at once) or in complete parsers (where we're sure we have the whole input). In nom 4, I tried to solve this by introducing the CompleteByteSlice and CompleteStr input types, that would behave differently, so you could use the same macros but have different parsers depending on the input. This proved difficult to use, especially considering that we might want to switch back and forth between behaviours (streaming parsers using TLV will know that they have the complete data inside the TLV, complete parsers might want to use methods that work directly on &[u8] and &str. Also, most people did not bother reading the documentation about it and started directly using &[u8] and &str as input when they expected the other behaviour, which resulted in quite some time spent explaining it.

So with functions, we can actually make specialized versions of combinators. We could imagine having streaming and complete versions of many0, tag, etc. And we would let people use those versions by importing them directly (use nom::multi::streaming::many0, etc), and they could even use both versions in the same file.

The downside is that there's an enormous amount of work for this:

for most combinators (that are not affected by streaming or not), two functions to add:
- a function returning a closure
- a function that is called directly
for combinators affected by streaming, multiply that by 3:
- a legacy function to keep backward compatibility inside the macro (although I'm not yet sure we should keep the old behaviour, I might just remove CompleteByteSlice and CompleteStr)
- a function that works in streaming
- a function that does not work in streaming

These functions will also require their own documentation and tests, and all of nom's documentation and examples should probably be adapted to this.
I'm making steady progress on converting the combinators, but there's still a lot to do. (TODO: make a checklist of which combinators were ported over or not)

Questions I have to solve now:

do I keep nom 4's behaviour with CompleteByteSlice and CompleteStr ?
do I keep macros related to streaming backward compatible with nom 4?
if not, which behaviour should I go with? Streaming or not?
do I make two versions of each combinators, or can I assume people will be able to write pair(first, second)(input) without any issues?
how do I port combinators that have a variable number of arguments, like do_parse (this one could probably be written directly with the ? syntax like this: https://github.com/Geal/nomfun/blob/master/benches/http.rs#L93-L102 ), tuple, permutation, alt, switch?
how do I reimplement ws?
should I extract method and ws in their own crates? They're not strictly necessary to nom and would make sense as separate libraries

The text was updated successfully, but these errors were encountered:

Gedweb · 2019-03-13T20:31:35Z

Methods was marked as deprecated and reference to nonexistent nom-methods crate, so sad

Geal · 2019-03-13T20:33:25Z

I have not pushed the crate yet. This major version will be a big cleanup, and it's a good occasion to extract things that are not as closely maintained as the rest

Gedweb · 2019-03-13T21:25:22Z

How what about compatibility around a macros and functions moved to different crate?

BTW, in my opinion CompleteStr is useful, but not described on the documentation main page, in a cause many people try use &str directly.

Evelyn-H · 2019-03-23T16:56:46Z

do I make two versions of each combinators, or can I assume people will be able to write pair(first, second)(input) without any issues?

My vote is definitely on having only one version. It keeps the amount of duplicate code to a minimum and encourages first building a parser and then feeding it some input later.
If there are some clear examples given in the docs then I feel like this won't be too confusing for new users. Especially when the alternative is: "Why are there two versions of everything? Which one should I use?" I think that'd be much more confusing.

0X1A · 2019-03-24T17:08:34Z

I don't fully understand marking a macro as deprecated on the latest version, without a proper replacement crate existing. I can just roll back to 4.1.1 but as someone that wants to port a decently sized hand written parser to nom, it doesn't instill much confidence in that this library will be maintained with some stability in the long run. Don't want this to sound overtly critical, I'm a happy user of nom 😄

Geal · 2019-03-25T09:34:34Z

After working a while on nom 5 (cf https://github.com/Geal/nom/compare/5.0 ), I have some answers to these questions:

keeping CompleteByteSlice and CompleteStr means keeping backwards compatibility with nom 4 in macros. Unfortunately, keeping that compatibility means having 3 versions of some functions (nom 4, streaming and complete). I'd rather change the combinators back to nom 3 behaviour, and push users of those types to use the new functions instead
most combinator functions will return a impl Fn(I) -> IResult<I, O, E>, and there's no need for the second version, except for some macros combinators which have type inference issues
combinators like do_parse can be written directly as imperative code (lots of combinators can be written directly, like cond or map and don't need to be rewritten). Others could be implemented on tuples of parsers, maybe?
method and ws will be extracted in other crates

@0X1A if you've followed nom for a few versions, you should have seen that I care a lot about backward compatibility, and major version upgrades should always try to minimize breaks. But this is also a good occasion to clean things up, and there's also a limit to the quantity of code I can maintain. So I'd rather have nom's core be clean and stable, and have external crates with their own release cycles (they should be easier to maintain than nom itself)

Evelyn-H · 2019-03-25T11:35:08Z

Will the macro syntax be kept for 5.0 or is it gonna be only the new function based combinators?

Also, I'd really like to write a (procedural) macro for nom 5 that accepts PEG syntax and converts it to the regular nom combinators under the hood. I feel like that would add a really nice middle ground between parser combinators and parser generators and might speed up writing parsers for simple syntaxes a lot. Especially if you can use previously defined regular nom parsers as a component in the PEG parser.

Is this something you'd be interested in?

A quick example for parsing mathematical expressions:
(syntax is a combination of standard PEG and the lalrpop angle bracket syntax for assigning (sub)results to temporary variables)

let parser = peg!{
    Expr = <l: Product> ("+" <r: Product>)* => { r.fold(l, |a, i| a + i) }
         | <l: Product> ("-" <r: Product>)* => { r.fold(l, |a, i| a - i) }

    Product = <l: Value> ("*" <r: Value>)* => { r.fold(l, |a, i| a * i) }
            | <l: Value> ("/" <r: Value>)* => { r.fold(l, |a, i| a / i) }

    Value = <s: [0-9]+> => { s.parse::<u64>() }
          | "(" <Expr> ")"
}
// and using the (sub) parsers
result: u64 = parser.Expr("2+2*(3-5)") // should return -2

Geal · 2019-03-25T13:07:09Z

the macros will stay in nom 5, but they will use the functions under the hood. A PEG generator sounds nice :)

Evelyn-H · 2019-03-25T14:01:14Z

Great!
I have a simple version working, but it's a bit of a mess right now. I'll clean it up a bit and put it online as soon as I can.

leo60228 · 2019-04-14T17:03:51Z

combinators like do_parse can be written directly as imperative code (lots of combinators can be written directly, like cond or map and don't need to be rewritten).

I needed to do some stuff in a parser that would've been difficult or impossible to do with do_parse, and wrote it as a function. It was annoying to have to update the input every time I used a parser. This is untested, but something like this would've been useful:

pub fn replace_buf<B, T>(input: &mut B, tuple: (B, T)) -> T {
    *input = tuple.0;
    tuple.1
}

This would be called like:

let num = replace_buf(&mut input, le_u8(input)?);

This might've been possible to do using pattern matching, but I wasn't able to figure out how.

Geal · 2019-04-14T17:13:50Z

personally, I don't think it's too complex to do this:

let (input, val1) = parser1(input)?;
let (input, _) = parser2(input)?;
let (input, val3) = parser3(input)?;

Ok((input, MyStruct { val1, val3 }))

But I'm planning to add a tuple based version, so you could do it like that:

let (input, (val1, _, val2)) = parse((parser1, parser2, parser3))(input)?;

Ok((input, MyStruct { val1, val3 }))

leo60228 · 2019-04-14T17:14:56Z

That doesn't work in loops.

Geal · 2019-04-14T17:18:47Z

Use `many0` then

…

On Sun, Apr 14, 2019, 19:14 leo60228 ***@***.***> wrote: That doesn't work in loops. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#878 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAHSAGUR0pQBV2Aqk_WtN8zP7SSodQWvks5vg2IQgaJpZM4bheuz> .

leo60228 · 2019-04-14T17:20:53Z

My specific code was:

    let (mut buf, child_count) = le_u16(buf)?;
    for _ in 0..child_count {
        let (child_buf, child) = take_element(buf, &lookup)?;
        buf = child_buf;
        binel.insert(child);
    }

BinEl is a struct that gets returned from the parser and lookup is an argument to the parser. I suppose I could use a parser that returns a Vec and iterate over it, but that seems like it would have a performance penalty.

Geal · 2019-04-14T17:24:07Z

That does not look very complex to me, I have a lot of code that looks like that, especially in nom. I guess you could use fold_many_m_n for this, to avoid allocations. The switch to functions also mean we can now integrate parsers more easily with iterators. I have not explored it in detail yet, but I guess some interesting patterns could appear.

…

On Sun, Apr 14, 2019, 19:20 leo60228 ***@***.***> wrote: My specific code was: let (mut buf, child_count) = le_u16(buf)?; for _ in 0..child_count { let (child_buf, child) = take_element(buf, &lookup)?; buf = child_buf; binel.insert(child); } BinEl is a struct that gets returned from the parser and lookup is an argument to the parser. I suppose I could use a parser that returns a Vec and iterate over it, but that seems like it would have a performance penalty. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#878 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAHSALO3t7-qubGaapn-eWrWXHKCqrCrks5vg2N2gaJpZM4bheuz> .

leo60228 · 2019-04-14T17:25:35Z

It doesn't matter too much, it was just that buf = child_buf; felt somewhat cumbersome. Great job on nom, by the way!

Geal · 2019-06-17T14:30:59Z

the nom 5 rewrite is now done

Geal added this to the 5.0 milestone Mar 6, 2019

Geal changed the title ~~nom 5.0 conversion~~ nom 5.0 internal design Apr 6, 2019

Geal mentioned this issue Apr 6, 2019

nom 5.0 roadmap #903

Closed

22 tasks

Geal closed this as completed Jun 17, 2019

cpdean mentioned this issue Jun 28, 2019

Regex parsers to be ported from macros to functions #978

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nom 5.0 internal design #878

nom 5.0 internal design #878

Geal commented Mar 6, 2019

Gedweb commented Mar 13, 2019

Geal commented Mar 13, 2019

Gedweb commented Mar 13, 2019 •

edited

Loading

Evelyn-H commented Mar 23, 2019 •

edited

Loading

0X1A commented Mar 24, 2019

Geal commented Mar 25, 2019

Evelyn-H commented Mar 25, 2019 •

edited

Loading

Geal commented Mar 25, 2019

Evelyn-H commented Mar 25, 2019

leo60228 commented Apr 14, 2019 •

edited

Loading

Geal commented Apr 14, 2019

leo60228 commented Apr 14, 2019

Geal commented Apr 14, 2019 via email

leo60228 commented Apr 14, 2019

Geal commented Apr 14, 2019 via email

leo60228 commented Apr 14, 2019

Geal commented Jun 17, 2019

nom 5.0 internal design #878

nom 5.0 internal design #878

Comments

Geal commented Mar 6, 2019

Gedweb commented Mar 13, 2019

Geal commented Mar 13, 2019

Gedweb commented Mar 13, 2019 • edited Loading

Evelyn-H commented Mar 23, 2019 • edited Loading

0X1A commented Mar 24, 2019

Geal commented Mar 25, 2019

Evelyn-H commented Mar 25, 2019 • edited Loading

Geal commented Mar 25, 2019

Evelyn-H commented Mar 25, 2019

leo60228 commented Apr 14, 2019 • edited Loading

Geal commented Apr 14, 2019

leo60228 commented Apr 14, 2019

Geal commented Apr 14, 2019 via email

leo60228 commented Apr 14, 2019

Geal commented Apr 14, 2019 via email

leo60228 commented Apr 14, 2019

Geal commented Jun 17, 2019

Gedweb commented Mar 13, 2019 •

edited

Loading

Evelyn-H commented Mar 23, 2019 •

edited

Loading

Evelyn-H commented Mar 25, 2019 •

edited

Loading

leo60228 commented Apr 14, 2019 •

edited

Loading