Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nom 5.0 internal design #878

Closed
Geal opened this issue Mar 6, 2019 · 17 comments
Closed

nom 5.0 internal design #878

Geal opened this issue Mar 6, 2019 · 17 comments
Milestone

Comments

@Geal
Copy link
Collaborator

Geal commented Mar 6, 2019

After some thought, I reached a satisfying new design for nom 5.0, that I tried in the nomfun repository.
This design uses functions instead of macros, with the same signature as macros combinators, mostly having I -> IResult<I,O,E> functions as arguments, and returning other functions, or applying them directly on some input.
As an example, here is how the pair combinator would be written:

pub fn pair<I, O1, O2, E, F, G>(first: F, second: G) -> IResult<I, (O1, O2), E>
  where F: Fn(I) -> IResult<I, O1, E>,
        G: Fn(I) -> IResult<I, O2, E> {

  move |input: I| {
    let (input, o1) = first(input)?;
    second(input).map(|(i, o2)| (i, (o1, o2)))
  }
}

pub fn pairc<I, O1, O2, E, F, G>(input: I, first: F, second: G) -> IResult<I, (O1, O2), E>
  where F: Fn(I) -> IResult<I, O1, E>,
        G: Fn(I) -> IResult<I, O2, E> {

  pair(first, second)(input)
}

This way we have two versions, one that combines two parsers and makes another one, and another that can take some input.

The macro version can then be rewritten that way:

macro_rules! pair(
  ($i:expr, $submac:ident!( $($args:tt)* ), $submac2:ident!( $($args2:tt)* )) => (
    pair!($i, |i| $submac!(i, $($args)*), |i| $submac2!(i, $($args2)*))
  );

  ($i:expr, $submac:ident!( $($args:tt)* ), $g:expr) => (
    pair!($i, |i| $submac!(i, $($args)*), $g);
  );

  ($i:expr, $f:expr, $submac:ident!( $($args:tt)* )) => (
    pair!($i, $f, |i| $submac!(i, $($args)*));
  );

  ($i:expr, $f:expr, $g:expr) => (
    $crate::pairc($i, $f, $g)
  );
);

As we can see currently in the 5.0 branch, most combinators are easy to replace:

  • extract the macro's code
  • put it in a function
  • add the proper trait bounds to the function
  • replace the macro's code with a function call

The resulting code is functionally equivalent, has less type inference issues and is much faster when built with link time optimization (I have seen results like 20% faster with the new design on some benchmarks).

Another benefit of this system is that it benefits from better import behaviour. Right now, even in edition 2018, macros that are exported are put at the top level (so the module import like macros use is actually a lie). So I cannot make variants of macros 'except by changing their name, to do stuff like separated_list and separated_list_complete.
This was an issue because we expect slightly different behaviour in streaming parsers (where we're not sure we'll get the whole data at once) or in complete parsers (where we're sure we have the whole input). In nom 4, I tried to solve this by introducing the CompleteByteSlice and CompleteStr input types, that would behave differently, so you could use the same macros but have different parsers depending on the input. This proved difficult to use, especially considering that we might want to switch back and forth between behaviours (streaming parsers using TLV will know that they have the complete data inside the TLV, complete parsers might want to use methods that work directly on &[u8] and &str. Also, most people did not bother reading the documentation about it and started directly using &[u8] and &str as input when they expected the other behaviour, which resulted in quite some time spent explaining it.

So with functions, we can actually make specialized versions of combinators. We could imagine having streaming and complete versions of many0, tag, etc. And we would let people use those versions by importing them directly (use nom::multi::streaming::many0, etc), and they could even use both versions in the same file.

The downside is that there's an enormous amount of work for this:

  • for most combinators (that are not affected by streaming or not), two functions to add:
    • a function returning a closure
    • a function that is called directly
  • for combinators affected by streaming, multiply that by 3:
    • a legacy function to keep backward compatibility inside the macro (although I'm not yet sure we should keep the old behaviour, I might just remove CompleteByteSlice and CompleteStr)
    • a function that works in streaming
    • a function that does not work in streaming

These functions will also require their own documentation and tests, and all of nom's documentation and examples should probably be adapted to this.
I'm making steady progress on converting the combinators, but there's still a lot to do. (TODO: make a checklist of which combinators were ported over or not)

Questions I have to solve now:

  • do I keep nom 4's behaviour with CompleteByteSlice and CompleteStr ?
  • do I keep macros related to streaming backward compatible with nom 4?
  • if not, which behaviour should I go with? Streaming or not?
  • do I make two versions of each combinators, or can I assume people will be able to write pair(first, second)(input) without any issues?
  • how do I port combinators that have a variable number of arguments, like do_parse (this one could probably be written directly with the ? syntax like this: https://github.com/Geal/nomfun/blob/master/benches/http.rs#L93-L102 ), tuple, permutation, alt, switch?
  • how do I reimplement ws?
  • should I extract method and ws in their own crates? They're not strictly necessary to nom and would make sense as separate libraries
@Geal Geal added this to the 5.0 milestone Mar 6, 2019
@Gedweb
Copy link

Gedweb commented Mar 13, 2019

Methods was marked as deprecated and reference to nonexistent nom-methods crate, so sad

@Geal
Copy link
Collaborator Author

Geal commented Mar 13, 2019

I have not pushed the crate yet. This major version will be a big cleanup, and it's a good occasion to extract things that are not as closely maintained as the rest

@Gedweb
Copy link

Gedweb commented Mar 13, 2019

How what about compatibility around a macros and functions moved to different crate?

BTW, in my opinion CompleteStr is useful, but not described on the documentation main page, in a cause many people try use &str directly.

@Evelyn-H
Copy link

Evelyn-H commented Mar 23, 2019

do I make two versions of each combinators, or can I assume people will be able to write pair(first, second)(input) without any issues?

My vote is definitely on having only one version. It keeps the amount of duplicate code to a minimum and encourages first building a parser and then feeding it some input later.
If there are some clear examples given in the docs then I feel like this won't be too confusing for new users. Especially when the alternative is: "Why are there two versions of everything? Which one should I use?" I think that'd be much more confusing.

@0X1A
Copy link

0X1A commented Mar 24, 2019

I don't fully understand marking a macro as deprecated on the latest version, without a proper replacement crate existing. I can just roll back to 4.1.1 but as someone that wants to port a decently sized hand written parser to nom, it doesn't instill much confidence in that this library will be maintained with some stability in the long run. Don't want this to sound overtly critical, I'm a happy user of nom 😄

@Geal
Copy link
Collaborator Author

Geal commented Mar 25, 2019

After working a while on nom 5 (cf https://github.com/Geal/nom/compare/5.0 ), I have some answers to these questions:

  • keeping CompleteByteSlice and CompleteStr means keeping backwards compatibility with nom 4 in macros. Unfortunately, keeping that compatibility means having 3 versions of some functions (nom 4, streaming and complete). I'd rather change the combinators back to nom 3 behaviour, and push users of those types to use the new functions instead
  • most combinator functions will return a impl Fn(I) -> IResult<I, O, E>, and there's no need for the second version, except for some macros combinators which have type inference issues
  • combinators like do_parse can be written directly as imperative code (lots of combinators can be written directly, like cond or map and don't need to be rewritten). Others could be implemented on tuples of parsers, maybe?
  • method and ws will be extracted in other crates

@0X1A if you've followed nom for a few versions, you should have seen that I care a lot about backward compatibility, and major version upgrades should always try to minimize breaks. But this is also a good occasion to clean things up, and there's also a limit to the quantity of code I can maintain. So I'd rather have nom's core be clean and stable, and have external crates with their own release cycles (they should be easier to maintain than nom itself)

@Evelyn-H
Copy link

Evelyn-H commented Mar 25, 2019

Will the macro syntax be kept for 5.0 or is it gonna be only the new function based combinators?

Also, I'd really like to write a (procedural) macro for nom 5 that accepts PEG syntax and converts it to the regular nom combinators under the hood. I feel like that would add a really nice middle ground between parser combinators and parser generators and might speed up writing parsers for simple syntaxes a lot. Especially if you can use previously defined regular nom parsers as a component in the PEG parser.

Is this something you'd be interested in?

A quick example for parsing mathematical expressions:
(syntax is a combination of standard PEG and the lalrpop angle bracket syntax for assigning (sub)results to temporary variables)

let parser = peg!{
    Expr = <l: Product> ("+" <r: Product>)* => { r.fold(l, |a, i| a + i) }
         | <l: Product> ("-" <r: Product>)* => { r.fold(l, |a, i| a - i) }

    Product = <l: Value> ("*" <r: Value>)* => { r.fold(l, |a, i| a * i) }
            | <l: Value> ("/" <r: Value>)* => { r.fold(l, |a, i| a / i) }

    Value = <s: [0-9]+> => { s.parse::<u64>() }
          | "(" <Expr> ")"
}
// and using the (sub) parsers
result: u64 = parser.Expr("2+2*(3-5)") // should return -2

@Geal
Copy link
Collaborator Author

Geal commented Mar 25, 2019

the macros will stay in nom 5, but they will use the functions under the hood. A PEG generator sounds nice :)

@Evelyn-H
Copy link

Great!
I have a simple version working, but it's a bit of a mess right now. I'll clean it up a bit and put it online as soon as I can.

@Geal Geal changed the title nom 5.0 conversion nom 5.0 internal design Apr 6, 2019
@Geal Geal mentioned this issue Apr 6, 2019
22 tasks
@leo60228
Copy link

leo60228 commented Apr 14, 2019

combinators like do_parse can be written directly as imperative code (lots of combinators can be written directly, like cond or map and don't need to be rewritten).

I needed to do some stuff in a parser that would've been difficult or impossible to do with do_parse, and wrote it as a function. It was annoying to have to update the input every time I used a parser. This is untested, but something like this would've been useful:

pub fn replace_buf<B, T>(input: &mut B, tuple: (B, T)) -> T {
    *input = tuple.0;
    tuple.1
}

This would be called like:

let num = replace_buf(&mut input, le_u8(input)?);

This might've been possible to do using pattern matching, but I wasn't able to figure out how.

@Geal
Copy link
Collaborator Author

Geal commented Apr 14, 2019

personally, I don't think it's too complex to do this:

let (input, val1) = parser1(input)?;
let (input, _) = parser2(input)?;
let (input, val3) = parser3(input)?;

Ok((input, MyStruct { val1, val3 }))

But I'm planning to add a tuple based version, so you could do it like that:

let (input, (val1, _, val2)) = parse((parser1, parser2, parser3))(input)?;

Ok((input, MyStruct { val1, val3 }))

@leo60228
Copy link

That doesn't work in loops.

@Geal
Copy link
Collaborator Author

Geal commented Apr 14, 2019 via email

@leo60228
Copy link

My specific code was:

    let (mut buf, child_count) = le_u16(buf)?;
    for _ in 0..child_count {
        let (child_buf, child) = take_element(buf, &lookup)?;
        buf = child_buf;
        binel.insert(child);
    }

BinEl is a struct that gets returned from the parser and lookup is an argument to the parser. I suppose I could use a parser that returns a Vec and iterate over it, but that seems like it would have a performance penalty.

@Geal
Copy link
Collaborator Author

Geal commented Apr 14, 2019 via email

@leo60228
Copy link

It doesn't matter too much, it was just that buf = child_buf; felt somewhat cumbersome. Great job on nom, by the way!

@Geal
Copy link
Collaborator Author

Geal commented Jun 17, 2019

the nom 5 rewrite is now done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants