Skip to content
This repository has been archived by the owner on Aug 26, 2023. It is now read-only.

Nokogiri::HTML5 does not support line numbers #53

Closed
gjtorikian opened this issue Mar 4, 2017 · 21 comments
Closed

Nokogiri::HTML5 does not support line numbers #53

gjtorikian opened this issue Mar 4, 2017 · 21 comments

Comments

@gjtorikian
Copy link

When I run:

@html =  Nokogiri::HTML(content)
@html.css('a').each { |node| ... }

each node element has the line method set to the right line number. However, Nokogiri::HTML5 does not seem to support this.

@stevecheckoway
Copy link
Collaborator

stevecheckoway commented Apr 11, 2017

Unfortunately, this is a bit of a limitation of the underlying libxml2 library. When it parses XML, it sets line numbers on four types of XML nodes, but it does not expose an API to set line numbers on constructed nodes (which is how Nokogumbo operates). Similarly, Nokogiri does not support an API to set line numbers.

Nokogumbo can be build using the system libxml2 which may not have the development headers. If this is the case, there's nothing (reasonable) that can be done. If Nokogumbo is built without using the system libxml2, or if the system libxml2 has the headers, then the line numbers can be set more or less directly in the node structure.

Nevertheless, I created pull request #55 to add line numbers when possible.

Due to another flaw in libxml2, line numbers for lines 1 through 65534 should work correctly for element, text, and comment nodes. Line 65535 should work for text nodes. Lines greater than 65535 may report the wrong line number or fail to report any line number. CDATA nodes on any line may report the wrong number. These limitations are present in Nokogiri using HTML() as well as anything else using libxml2 to parse or describe HTML/XML documents.

@fulldecent
Copy link
Contributor

Is there any conceivable way Nokogiri could support line numbers so that we could reliably set them?

If so this is an excuse to send the issue upstream.

@stevecheckoway
Copy link
Collaborator

Nokogiri could provide an API to do precisely what my PR does to set line numbers. Feel free to ask upstream.

@fulldecent
Copy link
Contributor

Reported upstream and made PR by copying your code and pasting it there.

If anyone here knows Ruby better than me (probably all of you) and you care to look, please see sparklemotion/nokogiri#1658

@fulldecent
Copy link
Contributor

Hello! I have created a PR upstream that gives us what we need here.

Problem is there is a build test not passing. And the bigger problem is that I am not a serious Ruby developer, so basically I'm in over my head.

May I please request assistance in reviewing that PR (which is pretty simple) and the connect build test?

@flavorjones
Copy link
Collaborator

I'm happy to work with anybody on the nokogumbo project who can help me understand what the desired behavior is. But the PR referenced is doing bad things with memory (valgrind picks these up as failures in the test suite) and so the PR is not acceptable even if the tests were passing and the JRuby API was similarly enhanced.

@ar7max
Copy link

ar7max commented Jul 22, 2017

What is "constructed node"?

@stevecheckoway
Copy link
Collaborator

@ar7max Sorry, I didn't realize that was a question for me. libxml2 has functions for parsing xml into a tree of nodes. It also has functions to construct nodes programmatically. Nokogumbo does not use libxml2's native xml parsing functions. Instead, it uses Gumbo to parse the HTML and then constructs the tree of nodes.

Unfortunately, libxml2 doesn't expose an API for setting line numbers on nodes and it only sets the line numbers in its native xml parsing functions. My PR performs the same actions that libxml2 itself does. (You can see my reading of the line setting algorithm in this comment.)

I don't know what @flavorjones's comment about "doing bad things with memory" means. This is the entirety of the code for setting the line numbers:

if (line < 65535)
  output_node->line = (unsigned short)line;
else {
  output_node->line = 65535;
  if (output_node->type == XML_TEXT_NODE)
     output_node->psvi = (void *)line;
}

It's entirely possible I made a mistake. The most likely place would be the psvi member, but here's the libxml2 code using that value as the line number.

@flavorjones
Copy link
Collaborator

@stevecheckoway I'm referring to the PR submitted to Nokogiri, which, when run in the test suite, causes Valgrind to log illegal memory access. Details are in the PR, I believe.

@flavorjones
Copy link
Collaborator

flavorjones commented Nov 15, 2017 via email

@ar7max
Copy link

ar7max commented Nov 15, 2017

I guess there is a misunderstanding of what psvi is, cuz as I can understand from code and comments in libxml2 source:

  • it's defined as void *psvi; /* for type/PSVI informations */
  • there is a casting (long) (ptrdiff_t) so it is somehow relates to pointer arythmetics

Mb they r using psvi as line number in some situations, but in this way we can't write to it directly.

@stevecheckoway
Copy link
Collaborator

@flavorjones I see, thanks.

@ar7max I don't think it's doing pointer arithmetic. I think the cast to ptrdiff_t is just wrong. It should be intptr_t, but I doubt that makes any difference.

I wouldn't be at all surprised if line numbers greater than 2^16 - 1 expose a bug in libxml2. There's an option XML_PARSE_BIG_LINES that causes the parser to set the line number in psvi. Starting from line 1911 of SAX2.c, you can see the logic.

    if (ctxt->linenumbers) {
	if (ctxt->input != NULL) {
	    if (ctxt->input->line < 65535)
		ret->line = (short) ctxt->input->line;
	    else {
	        ret->line = 65535;
		if (ctxt->options & XML_PARSE_BIG_LINES)
		    ret->psvi = (void *) (ptrdiff_t) ctxt->input->line;
	    }
	}
    }

One possibility would be to not support line numbers larger than 65535 at all. libxml2's reported line numbers for larger xml documents aren't reliable anyway since they're only set on text nodes and for other nodes, it looks for nearby text nodes.

@ar7max
Copy link

ar7max commented Jan 6, 2018

@stevecheckoway yep, line is unsigned short, so 0 to 65,535

stevecheckoway added a commit to stevecheckoway/nokogumbo that referenced this issue Sep 2, 2018
Neither libxml2 nor Nokogiri contain an API for setting the line numbers
for a node. When the libxml2 headers are available, the line numbers can
be set directly in the node structure.

Closes: rubys#53
@stevecheckoway
Copy link
Collaborator

@gjtorikian This should be fixed now. But see https://github.com/rubys/nokogumbo/blob/master/README.md#flavors-of-nokogumbo for limitations.

@svoop
Copy link
Contributor

svoop commented Nov 7, 2018

@stevecheckoway Sorry, to comment back on this closed issue: I'm still struggling getting the source file line numbers even though I'm on nokogiri-1.8.5 and nokogumbo-2.0.0. The bundled version of libxml2 should be 2.9.8 but to play it safe, I've just submitted a version bump to Homebrew and installed 2.9.8 via Homebrew with the updated local formula. Both gems were installed after making sure Nokogumbo would be forced to use libxml2 with bundle config build.nokogumbo --with-libxml2. Yet, when I ask a node about where he comes from e.g. td.line, the answer is always 0. Any idea what I could check?

@svoop
Copy link
Contributor

svoop commented Nov 11, 2018

@gjtorikian Did you get this to work? I've tried all kind of things to no avail (see my previous comment to @stevecheckoway).

@stevecheckoway
Copy link
Collaborator

@svoop Sorry for the silence. I've been caught up with other things. I'll take a look now.

@stevecheckoway
Copy link
Collaborator

@svoop The fix was simple, just one line. Assuming the CI pass, I'll push version 2.0.1.

Somewhat annoyingly, there's currently no good way to test if the line numbers work because if Nokogumbo doesn't link against libxml2, then line numbers are never set so the tests check if #line returns 0 or the expected line number.

Since this is the only difference between linking against libxml2 and not, the tests can't be more precise. We could expose this information, but I'm not sure what the best way to do that is and I'd rather not just introduce some new API that people might depend on without more consideration.

Users can currently test if line numbers are supported by checking Nokogiri.HTML5('').root.line == 1 but that doesn't work for testing the behavior.

@stevecheckoway
Copy link
Collaborator

@svoop I've uploaded version 2.0.1 to rubygems. Please give it a try.

(I also figured out a better way to test the line number that will hopefully prevent this or a similar bug from recurring.)

@svoop
Copy link
Contributor

svoop commented Nov 12, 2018

@stevecheckoway It works like a charm now, big thanks for fixing and pushing this so quickly!

@stevecheckoway
Copy link
Collaborator

@svoop Great!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants