-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New sibling nodes added by a transformer during traversal are never traversed #91
Comments
Did you ever file this against Nokogiri? |
No, I plan to work around this in Sanitize, but every time I have a few minutes and sit down to think about this I get all depressed about how much time I spend dealing with Nokogiri issues, and then I start fantasizing about switching to a better, HTML5-compliant parser, and then I check on the status of Gumbo and see that it's stalled, and then I feel tempted to start writing a custom parser, and then I realize what a terrible idea that is given how little time I already have to work on Sanitize, and then I'm out of free time and the bug hasn't gotten fixed. Sorry. I'll try to get to it soon. |
I thought I had a straightforward solution for this:
While this guarantees that we visit any new nodes that were added during traversal, it has a couple of problems. First, it's easy to get into an infinite loop: just write a transformer that adds a sibling to every node it encounters, and we'll keep traversing and adding and traversing and adding forever. But writing a transformer like this would be dumb, so I'm not too worried about this case. The bigger problem is that when Sanitize cleans up a non-whitelisted node, it (by default) replaces the original node with its children. This means that during traversal we can potentially end up re-traversing nodes we've already visited many times simply because they keep getting re-parented shallower in the tree. This is going to be trickier than I thought. |
Would it help if you restricted the solution to only traverse new siblings that are added after the current node? That's all I really need for what I describe in #90 - a situation where one node is split into many by a transformer - and seems reasonable to me as long as the behaviour is documented. I'm not familiar with the internals of your or Nokogiri's code, but it sounds like all you would have to do is ensure that the next node to be processed actually is the previous node's next sibling. |
That was actually the first solution I tried, but it turns out not to be reliable since the current node may be deleted or moved during traversal. If it was deleted and we try to get the next sibling, then there is no next sibling. If it was moved, then the next sibling may be under a completely different parent so it's not what we want. I played around with verifying that the current node's parent is still the same parent it was originally and, if not, getting the next sibling of the previous sibling that we traversed, but this falls apart with nodes that are children of document fragments, since their parent references don't actually point to the fragment (document fragments are weird). Without the ability to trust the current node as a stable reference point, I'm not sure there's any reliable way to make forward-only traversal work. |
Well again, speaking selfishly about what I'm trying to do, can't you just make the assumption and document it? "New nodes created by a transformer will only be traversed if they are added after the original node and the original node retains its position in the document". That doesn't seem like a particularly onerous restriction, and I'm having trouble thinking up a case where that wouldn't be enough. As another alternative, what if you allowed the transformer to indicate by way of the return value which additional nodes it wants traversed (i.e. the nodes it just created)? |
Yes, I could do that to address your particular use case (although with some degree of uncertainty when it comes to children of document fragments). But I'm not comfortable with this approach. Since people use Sanitize for security purposes, I think it needs to be consistent and reliable. This restriction could create a situation that silently results in behavior that the author of a transformer doesn't expect and might not notice. Seems like we'd end up with the same problem we're trying to fix (silently failing to traverse certain nodes), but with less consistency about when it happens. At least right now you can be certain that any siblings you add during traversal will not be traversed.
This would essentially be the same as calling |
This seems to work fine in 3.0.0. Thanks! |
Glad to hear it! |
Nokogiri doesn't update previously-created NodeSets with new siblings. See #90 for background.
The text was updated successfully, but these errors were encountered: