Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Parser] Pop past unreachables where possible #6489

Merged
merged 7 commits into from
Apr 16, 2024
Merged

[Parser] Pop past unreachables where possible #6489

merged 7 commits into from
Apr 16, 2024

Conversation

tlively
Copy link
Member

@tlively tlively commented Apr 11, 2024

We previously would eagerly drop all concretely typed expressions on the stack
when pushing an unreachable instruction. This was semantically correct and
closely modeled the semantics of unreachable instructions, which implicitly drop
the entire stack and start a new polymorphic stack. However, it also meant that
the structure of the parsed IR did not match the structure of the folded input,
which meant that tests involving unreachable children would not parse as
intended, preventing the test from testing the intended behavior.

For example, this wat:

  (i32.add
   (i32.const 0)
   (unreachable)
  )

Would previously parse into this IR:

  (drop
   (i32.const 0)
  )
  (i32.add
   (unreachable)
   (unreachable)
  )

To fix this problem, we need to stop eagerly dropping stack values when
encountering an unreachable instruction so we can still pop expressions pushed
before the unreachable as direct children of later instructions. In the example
above, we need to keep the i32.const 0 on the stack so it is available to be
popped and become a child of the i32.add.

However, the naive solution of simply popping past unreachables would produce
invalid IR in some cases. For example, consider this wat:

  f32.const 0
  unreachable
  i32.add

The naive solution would parse this wat into this IR:

  (i32.add
   (f32.const 0)
   (unreachable)
  )

But we do not want to parse an i32.add with an f32-typed child. Neither do
we want to reject this input, since it is a perfectly valid Wasm fragment. In this
case, we actually want the old behavior of dropping the f32.const and
replacing it with another unreachable as the first child of the i32.add.

To both match the input structure where possible and also gracefully fall back
to the old behavior of dropping expressions prior to the unreachable, collect
constraints on the types of each child for each kind of expression and compare
them to the types of available expressions on the stack when an unreachable
instruction will be popped. When the constraints are satisfied, pop expressions
normally, even after popping the unreachable instruction. Otherwise, drop the
instructions that precede the unreachable instruction to ensure we parse valid
IR.

To collect the constraints, add a new ChildTyper utility that calls a
different callback for each kind of possible type constraint for each child. In
the future, this utility can be used to simplify the validator as well.

@tlively
Copy link
Member Author

tlively commented Apr 11, 2024

@tlively tlively requested a review from kripken April 11, 2024 02:42
@tlively tlively marked this pull request as ready for review April 11, 2024 02:42
@tlively tlively force-pushed the parser-pop-unreachable-tuple branch from f64f57d to 59c021e Compare April 11, 2024 18:34

// For each child, call `noteSubtype` with a pointer to the child and the most
// specific type that the child must have. Subtypings are based only on
// information that would have been parsed out of a binary.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the last sentence here mean? Maybe an example of a subtyping that is somehow not from a binary could help.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember what a counterexample would be, so I've just gone ahead and removed that sentence.

Base automatically changed from parser-pop-unreachable-tuple to main April 11, 2024 19:10

void noteAnyReference(Expression** childp) {
self().noteAnyReferenceType(childp);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these foo() { self.foo() } stubs needed? In CRTP the right thing should just get called anyhow I think - we don't do this elsewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, they're not needed, but they are convenient for avoiding a bunch of calls to self() everywhere and to provide a shorter internal name.

src/ir/child-typer.h Show resolved Hide resolved

void visitBreak(Break* curr, std::optional<Type> labelType = std::nullopt) {
if (!labelType) {
labelType = getLabelType(curr->name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not always do this to get the label's type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes it is more convenient to pass a type explicitly to avoid having to maintain a persistent mapping from labels to types that the callback can use.

For the IRBuilder case, getLabelType is not readily implementable, arguably due to a layering issue. IRBuilder uses indices (like in the binary format) rather than names as the source of truth for labels. It can convert label names into label indices, but only for label names that are explicitly provided at the start of a scope, not for label names that are generated on demand because a branch targets an unnamed label. That means that an implementation of this getLabelType callback would fail to find the type of labels with generated names.

Fixing this would not be as simple as letting IRBuilder translated generated label names into indices because the parser currently uses IRBuilder's name-to-index translation to check whether a name is valid. If the translation worked for generated names, the parser would start admitting modules that happen to use the generated names without them being bound. This is arguably a layering violation if you assume the parser should be responsible for tracking its own mapping of label names to indices, but on the other hand the entire point of IRBuilder is to track state so the parser doesn't have to.

The current solution is to have IRBuilder always pass an explicit type rather than depending on the getLabelType callback. Another solution would be to differentiate generated names from explicit names so that we could have different behavior in the public and private mapping of label names to indices.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks. Yeah, this seems tricky to fix.

I'd suggest adding a comment summarizing some of what you said here, for the benefit of future readers of the code here (though hopefully we'll find a nice refactoring before long).

note(&curr->ifFalse, *type);
} else {
noteAny(&curr->ifTrue);
noteAny(&curr->ifFalse);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't capture that they must be the same type. Is that not important when popping?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They don't need to be the same type in general. In the validator, invalid cases like having children typed i32 and f32 should be caught because the result type would not be the LUB of the children types (unless it is Type::none, in which case it would fail another check that the result should be concrete).

note(&curr->size, Type::i32);
}

void visitTry(Try* curr) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void visitTry(Try* curr) {
void visitTry(Try* curr) {
note(&curr->body, curr->type);


void visitCallRef(CallRef* curr, std::optional<HeapType> ht = std::nullopt) {
if (!ht) {
ht = curr->target->type.getHeapType().getSignature();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be unreachable, in which case getHeapType will fail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider it user error to call this function without providing ht in that case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Makes sense but please add a comment explaining that expectation so it is obvious to potential future users of the API.

for (size_t i = 0; i < params.size(); ++i) {
note(&curr->operands[i], params[i]);
}
note(&curr->target, Type(*ht, Nullable));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this can be any subtype of funcref? This enforces the current type IIANM which means it can't help the validator.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right, this is what is meant by "Subtypings are based only on information that would have been parsed out of a binary." The binary would have included a type annotation, so we report the subtype constraints based on that same type annotation, inferring it from the children if necessary. So the validator will have to validate that it is possible to recover that type before calling into this utility. Same for every instruction that has a type annotation in the binary format but not in the IR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, now that "parsed out of a binary" comment makes sense. I suggest we clarify this with this example at the top of the file.

src/ir/child-typer.h Show resolved Hide resolved
@tlively
Copy link
Member Author

tlively commented Apr 12, 2024

I've fuzzed this for over 200k iterations of TextRoundtrip with the new parser enabled and a custom modification of this PR that always checks that constraints are met rather than checking constraints only after seeing an unreachable.

tlively added 7 commits April 16, 2024 08:40
We previously would eagerly drop all concretely typed expressions on the stack
when pushing an unreachable instruction. This was semantically correct and
closely modeled the semantics of unreachable instructions, which implicitly drop
the entire stack and start a new polymorphic stack. However, it also meant that
the structure of the parsed IR did not match the structure of the folded input,
which meant that tests involving unreachable children would not parse as
intended, preventing the test from testing the intended behavior.

For example, this wat:

```wasm
  (i32.add
   (i32.const 0)
   (unreachable)
  )
```

Would previously parse into this IR:

```wasm
  (drop
   (i32.const 0)
  )
  (i32.add
   (unreachable)
   (unreachable)
  )
```

To fix this problem, we need to stop eagerly dropping stack values when
encountering an unreachable instruction so we can still pop expressions pushed
before the unreachable as direct children of later instructions. In the example
above, we need to keep the `i32.const 0` on the stack so it is available to be
popped and become a child of the `i32.add`.

However, the naive solution of simply popping past unreachables would produce
invalid IR in some cases. For example, consider this wat:

```wasm
  f32.const 0
  unreachable
  i32.add
```

The naive solution would parse this wat into this IR:

```wasm
  (i32.add
   (f32.const 0)
   (unreachable)
  )
```

But we do not want to parse an `i32.add` with an `f32`-typed child. Neither do
we want to reject this input, since it is a perfectly valid Wasm fragment. In this
case, we actually want the old behavior of dropping the `f32.const` and
replacing it with another `unreachable` as the first child of the `i32.add`.

To both match the input structure where possible and also gracefully fall back
to the old behavior of dropping expressions prior to the unreachable, collect
constraints on the types of each child for each kind of expression and compare
them to the types of available expressions on the stack when an unreachable
instruction will be popped. When the constraints are satisfied, pop expressions
normally, even after popping the unreachable instruction. Otherwise, drop the
instructions that precede the unreachable instruction to ensure we parse valid
IR.

To collect the constraints, add a new `ChildTyper` utility that calls a
different callback for each kind of possible type constraint for each child. In
the future, this utility can be used to simplify the validator as well.
@tlively
Copy link
Member Author

tlively commented Apr 16, 2024

@kripken, PTAL at the last commit: documentation comment

@tlively
Copy link
Member Author

tlively commented Apr 16, 2024

This also just passed 1 million RoundtripText iterations 🎉

@tlively tlively merged commit 359d5aa into main Apr 16, 2024
13 checks passed
@tlively tlively deleted the child-typer branch April 16, 2024 17:43
@gkdn gkdn mentioned this pull request Aug 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants