No tracking of closing header tags #1987

Closed
glaforge opened this issue Jul 28, 2023 · 1 comment

glaforge commented Jul 28, 2023

I have a use case where I want to cut a long HTML document into sections, following the header tags (h1, h2, etc.).
I'm trying to use position tracking to find the beginning and the end of the headers, but it seems that only the opening tag is tracked and the closing tag is not, except for non-official header tags (h1 to h6 are the official HTML header tags, but my document also has h7 and h8, which are not part of the HTML specification).

Let's take a concrete example. Say you have the following snippet:

<h1>title</h1>
<h2 id="abc">abc</h2>
<p>hello</p>
<h5 id="bcd">bcd</h5>
<p>thanks</p>
<h3 id="cde">cde</h3>
<p>hello</p>
<h7 id="def">def</h7>
<p>thanks</p>
<h3 id="efg">efg</h3>
<p>hello</p>
<h8 id="fgh">fgh</h8>
<p>hello</p>

I then enable position tracking on the parser and select the headers:

var doc = Parser.htmlParser().setTrackPosition(true).parseInput(htmlDoc, uri);
var headers = doc.select("h1, h2, h3, h4, h5, h6, h7, h8, h9");

When I print each element's .sourceRange().start() / end() and .endSourceRange().start() / end(), I get the following output:

Start: 1,1:0 End: 1,5:4 <-> Start: -1,-1:-1 End: -1,-1:-1 — h1 — title
Start: 2,1:15 End: 2,14:28 <-> Start: -1,-1:-1 End: -1,-1:-1 — h2 — abc
Start: 4,1:50 End: 4,14:63 <-> Start: -1,-1:-1 End: -1,-1:-1 — h5 — bcd thanks
Start: 6,1:86 End: 6,14:99 <-> Start: -1,-1:-1 End: -1,-1:-1 — h3 — cde
Start: 8,1:121 End: 8,14:134 <-> Start: 8,17:137 End: 8,22:142 — h7 — def
Start: 10,1:157 End: 10,14:170 <-> Start: -1,-1:-1 End: -1,-1:-1 — h3 — efg
Start: 12,1:192 End: 12,14:205 <-> Start: 12,17:208 End: 12,22:213 — h8 — fgh

The opening header tags have correct start/end positions.
But all the closing header tags (except the non-standard ones like h7 and h8) have -1 values, as if they weren't tracked, or as if there were no closing tag at all.

Shouldn't endSourceRange() return positions other than -1?
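
For the sectioning use case itself, the opening-tag start offsets shown above may already be enough, since they are tracked for every header. The following is a minimal sketch, not jsoup code: `SectionCutter` and `cut` are hypothetical names, and the offsets are assumed to be the character positions printed after the colon in the output above (in jsoup, presumably obtained via `sourceRange().start().pos()`). Each section runs from one header's start offset up to the next header's start offset.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of jsoup.
final class SectionCutter {
    /** Sentinel for an untracked position, as seen in the output above. */
    static final int UNTRACKED = -1;

    /**
     * Cuts html into sections. Each section starts at a header's opening-tag
     * offset and ends just before the next header's offset (or at end of input).
     * Assumes headerStarts is in ascending order; untracked entries are skipped.
     */
    static List<String> cut(String html, int[] headerStarts) {
        List<Integer> starts = new ArrayList<>();
        for (int s : headerStarts) {
            if (s != UNTRACKED) {
                starts.add(s);
            }
        }
        List<String> sections = new ArrayList<>();
        for (int i = 0; i < starts.size(); i++) {
            int from = starts.get(i);
            int to = (i + 1 < starts.size()) ? starts.get(i + 1) : html.length();
            sections.add(html.substring(from, to));
        }
        return sections;
    }

    public static void main(String[] args) {
        String html = "<h1>title</h1>\n<h2 id=\"abc\">abc</h2>\n<p>hello</p>\n";
        // Offsets 0 and 15 correspond to the h1 and h2 rows in the output above.
        for (String section : cut(html, new int[] {0, 15})) {
            System.out.println("[" + section.trim() + "]");
        }
    }
}
```

This sidesteps endSourceRange() entirely, at the cost of treating everything up to the next header as part of the section, regardless of header level.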

jhy closed this as completed in 0eb8232 on Sep 8, 2023
jhy changed the title from "No tracking of closing tags" to "No tracking of closing header tags" on Sep 8, 2023
jhy self-assigned this on Sep 8, 2023
jhy added the "bug" (Confirmed bug that we should fix) and "fixed" labels on Sep 8, 2023
jhy added this to the 1.16.2 milestone on Sep 8, 2023
jhy (Owner) commented Sep 8, 2023

Thanks, good catch! This is fixed now.
