-
Notifications
You must be signed in to change notification settings - Fork 114
Nokogiri::HTML5
does not support line numbers
#53
Comments
Unfortunately, this is a bit of a limitation of the underlying libxml2 library. When it parses XML, it sets line numbers on four types of XML nodes, but it does not expose an API to set line numbers on constructed nodes (which is how Nokogumbo operates). Similarly, Nokogiri does not support an API to set line numbers. Nokogumbo can be build using the system libxml2 which may not have the development headers. If this is the case, there's nothing (reasonable) that can be done. If Nokogumbo is built without using the system libxml2, or if the system libxml2 has the headers, then the line numbers can be set more or less directly in the node structure. Nevertheless, I created pull request #55 to add line numbers when possible. Due to another flaw in libxml2, line numbers for lines 1 through 65534 should work correctly for element, text, and comment nodes. Line 65535 should work for text nodes. Lines greater than 65535 may report the wrong line number or fail to report any line number. CDATA nodes on any line may report the wrong number. These limitations are present in Nokogiri using |
Is there any conceivable way Nokogiri could support line numbers so that we could reliably set them? If so this is an excuse to send the issue upstream. |
Nokogiri could provide an API to do precisely what my PR does to set line numbers. Feel free to ask upstream. |
Reported upstream and made PR by copying your code and pasting it there. If anyone here knows Ruby better than me (probably all of you) and you care to look, please see sparklemotion/nokogiri#1658 |
Hello! I have created a PR upstream that gives us what we need here. Problem is there is a build test not passing. And the bigger problem is that I am not a serious Ruby developer, so basically I'm in over my head. May I please request assistance in reviewing that PR (which is pretty simple) and the connect build test? |
I'm happy to work with anybody on the nokogumbo project who can help me understand what the desired behavior is. But the PR referenced is doing bad things with memory (valgrind picks these up as failures in the test suite) and so the PR is not acceptable even if the tests were passing and the JRuby API was similarly enhanced. |
What is "constructed node"? |
@ar7max Sorry, I didn't realize that was a question for me. libxml2 has functions for parsing xml into a tree of nodes. It also has functions to construct nodes programmatically. Nokogumbo does not use libxml2's native xml parsing functions. Instead, it uses Gumbo to parse the HTML and then constructs the tree of nodes. Unfortunately, libxml2 doesn't expose an API for setting line numbers on nodes and it only sets the line numbers in its native xml parsing functions. My PR performs the same actions that libxml2 itself does. (You can see my reading of the line setting algorithm in this comment.) I don't know what @flavorjones's comment about "doing bad things with memory" means. This is the entirety of the code for setting the line numbers: if (line < 65535)
output_node->line = (unsigned short)line;
else {
output_node->line = 65535;
if (output_node->type == XML_TEXT_NODE)
output_node->psvi = (void *)line;
} It's entirely possible I made a mistake. The most likely place would be the |
@stevecheckoway I'm referring to the PR submitted to Nokogiri, which, when run in the test suite, causes Valgrind to log illegal memory access. Details are in the PR, I believe. |
To be more specific, I'm referring to @fulldecent's upstream PR to Nokogiri, sparklemotion/nokogiri#1658.
|
I guess there is a misunderstanding of what psvi is, cuz as I can understand from code and comments in libxml2 source:
Mb they r using psvi as line number in some situations, but in this way we can't write to it directly. |
@flavorjones I see, thanks. @ar7max I don't think it's doing pointer arithmetic. I think the cast to I wouldn't be at all surprised if line numbers greater than 2^16 - 1 expose a bug in libxml2. There's an option if (ctxt->linenumbers) {
if (ctxt->input != NULL) {
if (ctxt->input->line < 65535)
ret->line = (short) ctxt->input->line;
else {
ret->line = 65535;
if (ctxt->options & XML_PARSE_BIG_LINES)
ret->psvi = (void *) (ptrdiff_t) ctxt->input->line;
}
}
} One possibility would be to not support line numbers larger than 65535 at all. libxml2's reported line numbers for larger xml documents aren't reliable anyway since they're only set on text nodes and for other nodes, it looks for nearby text nodes. |
@stevecheckoway yep, line is unsigned short, so 0 to 65,535 |
Neither libxml2 nor Nokogiri contain an API for setting the line numbers for a node. When the libxml2 headers are available, the line numbers can be set directly in the node structure. Closes: rubys#53
@gjtorikian This should be fixed now. But see https://github.com/rubys/nokogumbo/blob/master/README.md#flavors-of-nokogumbo for limitations. |
@stevecheckoway Sorry, to comment back on this closed issue: I'm still struggling getting the source file line numbers even though I'm on nokogiri-1.8.5 and nokogumbo-2.0.0. The bundled version of libxml2 should be 2.9.8 but to play it safe, I've just submitted a version bump to Homebrew and installed 2.9.8 via Homebrew with the updated local formula. Both gems were installed after making sure Nokogumbo would be forced to use libxml2 with |
@gjtorikian Did you get this to work? I've tried all kind of things to no avail (see my previous comment to @stevecheckoway). |
@svoop Sorry for the silence. I've been caught up with other things. I'll take a look now. |
@svoop The fix was simple, just one line. Assuming the CI pass, I'll push version 2.0.1. Somewhat annoyingly, there's currently no good way to test if the line numbers work because if Nokogumbo doesn't link against libxml2, then line numbers are never set so the tests check if Since this is the only difference between linking against libxml2 and not, the tests can't be more precise. We could expose this information, but I'm not sure what the best way to do that is and I'd rather not just introduce some new API that people might depend on without more consideration. Users can currently test if line numbers are supported by checking |
@svoop I've uploaded version 2.0.1 to rubygems. Please give it a try. (I also figured out a better way to test the line number that will hopefully prevent this or a similar bug from recurring.) |
@stevecheckoway It works like a charm now, big thanks for fixing and pushing this so quickly! |
@svoop Great! |
When I run:
each
node
element has theline
method set to the right line number. However,Nokogiri::HTML5
does not seem to support this.The text was updated successfully, but these errors were encountered: