New sibling nodes added by a transformer during traversal are never traversed #91

rgrove · 2013-12-06T22:04:09Z

Nokogiri doesn't update previously-created NodeSets with new siblings. See #90 for background.

JasonBarnabe · 2014-02-10T17:08:10Z

Did you ever file this against Nokogiri?

rgrove · 2014-02-10T17:56:02Z

No, I plan to work around this in Sanitize, but every time I have a few minutes and sit down to think about this I get all depressed about how much time I spend dealing with Nokogiri issues, and then I start fantasizing about switching to a better, HTML5-compliant parser, and then I check on the status of Gumbo and see that it's stalled, and then I feel tempted to start writing a custom parser, and then I realize what a terrible idea that is given how little time I already have to work on Sanitize, and then I'm out of free time and the bug hasn't gotten fixed.

Sorry. I'll try to get to it soon.

rgrove · 2014-02-11T00:56:27Z

I thought I had a straightforward solution for this:

Keep track of each child we visit while traversing a node's children.
After traversing a node's children, refresh the NodeSet of children. Traverse any children in the refreshed set that we haven't already visited.
Repeat until all the node's children have been traversed.

While this guarantees that we visit any new nodes that were added during traversal, it has a couple of problems.

First, it's easy to get into an infinite loop: just write a transformer that adds a sibling to every node it encounters, and we'll keep traversing and adding and traversing and adding forever. But writing a transformer like this would be dumb, so I'm not too worried about this case.

The bigger problem is that when Sanitize cleans up a non-whitelisted node, it (by default) replaces the original node with its children. This means that during traversal we can potentially end up re-traversing nodes we've already visited many times simply because they keep getting re-parented shallower in the tree.

This is going to be trickier than I thought.

JasonBarnabe · 2014-02-11T01:40:26Z

Would it help if you restricted the solution to only traverse new siblings that are added after the current node? That's all I really need for what I describe in #90 - a situation where one node is split into many by a transformer - and seems reasonable to me as long as the behaviour is documented.

I'm not familiar with the internals of your or Nokogiri's code, but it sounds like all you would have to do is ensure that the next node to be processed actually is the previous node's next sibling.

rgrove · 2014-02-11T01:59:28Z

That was actually the first solution I tried, but it turns out not to be reliable since the current node may be deleted or moved during traversal. If it was deleted and we try to get the next sibling, then there is no next sibling. If it was moved, then the next sibling may be under a completely different parent so it's not what we want.

I played around with verifying that the current node's parent is still the same parent it was originally and, if not, getting the next sibling of the previous sibling that we traversed, but this falls apart with nodes that are children of document fragments, since their parent references don't actually point to the fragment (document fragments are weird).

Without the ability to trust the current node as a stable reference point, I'm not sure there's any reliable way to make forward-only traversal work.

JasonBarnabe · 2014-02-11T02:22:18Z

Without the ability to trust the current node as a stable reference point, I'm not sure there's any reliable way to make forward-only traversal work.

Well again, speaking selfishly about what I'm trying to do, can't you just make the assumption and document it? "New nodes created by a transformer will only be traversed if they are added after the original node and the original node retains its position in the document". That doesn't seem like a particularly onerous restriction, and I'm having trouble thinking up a case where that wouldn't be enough.

As another alternative, what if you allowed the transformer to indicate by way of the return value which additional nodes it wants traversed (i.e. the nodes it just created)?

rgrove · 2014-02-11T03:49:45Z

Well again, speaking selfishly about what I'm trying to do, can't you just make the assumption and document it? "New nodes created by a transformer will only be traversed if they are added after the original node and the original node retains its position in the document".

Yes, I could do that to address your particular use case (although with some degree of uncertainty when it comes to children of document fragments). But I'm not comfortable with this approach. Since people use Sanitize for security purposes, I think it needs to be consistent and reliable. This restriction could create a situation that silently results in behavior that the author of a transformer doesn't expect and might not notice.

Seems like we'd end up with the same problem we're trying to fix (silently failing to traverse certain nodes), but with less consistency about when it happens. At least right now you can be certain that any siblings you add during traversal will not be traversed.

As another alternative, what if you allowed the transformer to indicate by way of the return value which additional nodes it wants traversed (i.e. the nodes it just created)?

This would essentially be the same as calling Sanitize.clean_node! on the additional nodes you want to traverse, which is the workaround you're already using, right? In any case, yeah, I may end up implementing some kind of configuration-based solution if I don't have any other brilliant ideas.

JasonBarnabe · 2014-06-26T20:39:59Z

This seems to work fine in 3.0.0. Thanks!

rgrove · 2014-06-26T21:01:20Z

Glad to hear it!

JasonBarnabe mentioned this issue Dec 6, 2013

"Stack level too deep" when HTML-izing something with lots of whitespace stylish-userstyles/userstyles#1

Closed

rgrove added a commit that referenced this issue May 19, 2014

Add a history entry for #91.

9332762

rgrove closed this as completed in 6208727 Jun 21, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New sibling nodes added by a transformer during traversal are never traversed #91

New sibling nodes added by a transformer during traversal are never traversed #91

rgrove commented Dec 6, 2013

JasonBarnabe commented Feb 10, 2014

rgrove commented Feb 10, 2014

rgrove commented Feb 11, 2014

JasonBarnabe commented Feb 11, 2014

rgrove commented Feb 11, 2014

JasonBarnabe commented Feb 11, 2014

rgrove commented Feb 11, 2014

JasonBarnabe commented Jun 26, 2014

rgrove commented Jun 26, 2014

New sibling nodes added by a transformer during traversal are never traversed #91

New sibling nodes added by a transformer during traversal are never traversed #91

Comments

rgrove commented Dec 6, 2013

JasonBarnabe commented Feb 10, 2014

rgrove commented Feb 10, 2014

rgrove commented Feb 11, 2014

JasonBarnabe commented Feb 11, 2014

rgrove commented Feb 11, 2014

JasonBarnabe commented Feb 11, 2014

rgrove commented Feb 11, 2014

JasonBarnabe commented Jun 26, 2014

rgrove commented Jun 26, 2014