Lambdasoup eats the doctype #32

copy · 2020-04-19T21:53:41Z

I'm doing some rewriting similar to postprocess.ml and found that either Soup.parse or Soup.to_string removes the doctype.

    utop [0]: #require "lambdasoup";;
    utop [1]: Soup.parse "<!doctype html><html><body><b>Hi</b></body></html>" |> Soup.to_string;;
    val _1 : string = "<html><head></head><body><b>Hi</b></body></html>"

The online docs created by postprocess.ml don't have a doctype either.

aantron · 2020-04-20T02:46:15Z

It is to_string (and pretty_print). Lambda Soup should probably check if the top-level element is <html> (as opposed to something else, indicating a fragment), and prepend a doctype on output.

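aantron added a commit that referenced this issue: 2758b31, "Prepend <!DOCTYPE html> to full documents". See #32.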

aantron · 2020-04-20T04:51:19Z

Does the attached commit, which is now in master, address the issue for your usage?

copy · 2020-04-20T16:03:17Z

Yes, current master fixes the issue, thanks!

It seems to have two minor problems though:

- It doesn't reproduce the original doctype
- It doesn't work if the (technically optional) html tag is not present

Neither of the two affects me, but they could surprise other users.

aantron · 2020-04-22T13:23:40Z

This is now in opam in lambdasoup 0.7.1.

> It doesn't reproduce the original doctype

Do you mean the casing? Or the presence/absence of the doctype?

> It doesn't work if the (technically optional) html tag is not present

Lambda Soup would need to distinguish explicitly between full documents and fragments to handle that intelligently. Also, <html> can be absent in the input stream, but it is not optional in the DOM. The parser also inserts the tag whenever it is sure it is parsing a full document, even if it is absent. Otherwise, Lambda Soup assumes it is handling a fragment.

Of course, a user can manipulate the tree with the intention of having a complete document, and not add/maintain an <html> tag.

In all cases, there is the escape hatch of using Soup.signals and manually adding or removing the doctype.

We can solve any further issues when they come up.
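
The Soup.signals escape hatch mentioned above might look roughly like the following. This is a minimal sketch, not Lambda Soup's own implementation: the helper name to_string_with_doctype and its ~doctype parameter are made up for illustration, and it assumes the lambdasoup and markup packages are both available. It drops any doctype signal already in the stream and then prepends the one the caller wants, so it should behave the same whether or not the library emits a doctype itself.

```ocaml
(* Hypothetical helper, not part of Lambda Soup.
   Requires the lambdasoup and markup packages. *)
let to_string_with_doctype ?(doctype = "<!DOCTYPE html>") soup =
  let body =
    soup
    |> Soup.signals
    (* Drop any doctype signal that is already in the stream. *)
    |> Markup.filter (function `Doctype _ -> false | _ -> true)
    |> Markup.write_html
    |> Markup.to_string
  in
  (* Prepend the doctype chosen by the caller. *)
  doctype ^ body
```

Passing ~doctype:"" to this sketch would suppress the doctype entirely.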

copy · 2020-04-22T16:17:01Z

> Do you mean the casing? Or the presence/absence of the doctype?

Also non-html5 doctypes, although those are probably even less common than missing html tags.

> We can solve any further issues when they come up.

Agreed. Cheers for the quick fix.

dmbaturin · 2020-05-31T10:27:21Z

Guys, I have no choice but to go passive aggressive now.
There's a list of reverse dependencies in the opam page.
You could easily check how people use the Soup.pretty_print function and see that they are explicitly compensating for the missing doctype.

See [1] and [2].

You could consider that after this change, their code will produce nonsensical pages with a duplicate doctype. Hell, you could at least mention the maintainers of those projects in the issue and ask their opinions.

Instead you chose to silently break compatibility and leave them without even an option to disable this behaviour or specify their own doctype.

@aantron I'm very grateful to you for creating and maintaining lambdasoup. Soupault would never be possible without your work. But for goodness sake, why couldn't this be an optional argument at least?
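
For downstream code that prepends its own doctype to the output of Soup.pretty_print or Soup.to_string, one defensive option is to prepend only when the serialized string does not already start with a doctype. The sketch below uses only the OCaml standard library. The names starts_with_doctype and ensure_doctype are hypothetical, and the check is a plain case-insensitive prefix test, not a real HTML parse.

```ocaml
(* Hypothetical helpers; standard library only. *)

(* True if [html] begins, after optional whitespace, with "<!doctype"
   in any letter case. *)
let starts_with_doctype (html : string) : bool =
  let prefix = "<!doctype" in
  let n = String.length html and plen = String.length prefix in
  let is_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r' in
  let rec skip i = if i < n && is_space html.[i] then skip (i + 1) else i in
  let start = skip 0 in
  start + plen <= n
  && String.lowercase_ascii (String.sub html start plen) = prefix

(* Prepend [doctype] unless the string already carries one. *)
let ensure_doctype ?(doctype = "<!DOCTYPE html>") (html : string) : string =
  if starts_with_doctype html then html else doctype ^ html
```

With this, ensure_doctype (Soup.to_string soup) would give a single doctype both before and after the behaviour change discussed here.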

aantron · 2020-09-20T19:37:14Z

@dmbaturin, @copy, what about an approach where Lambda Soup saves the doctype, if present, in the top-level soup node, and emits it on serialization?

This makes at least some kind of sense to me, as the soup node represents the whole document. @dmbaturin, would you still want to suppress it?

copy · 2020-09-20T20:46:53Z

@aantron That sounds good to me.

dmbaturin · 2020-09-21T00:18:52Z

@aantron There may be valid reasons to suppress the original doctype and supply your own. For example, if you are adding HTML5 elements to user-supplied pages, it makes sense to force the doctype to HTML5, because the user's original doctype could be XHTML 1.0, for example, and we just purposely broke XHTML compatibility.

Something like a ~keep_doctype:true argument will solve both of these issues. I think it should be true by default.

aantron · 2020-09-21T05:28:54Z

@dmbaturin, I have two other suggestions:

- Provide some functions to manipulate the doctype on a soup node.
- Provide a completely separate mechanism for parsing doctypes from the front of strings, and emitting them. So a user that wants to extract the doctype from the input would, separately from calling parse, also call read_doctype, and get a doctype value which can then be emitted (and somehow analyzed).

I'm trying to decide which approach is the least "magical" and least awkward. Ideally, Lambda Soup's behavior would remain simple, and the APIs would also remain simple for people that don't want to bother thinking about the doctype at all.
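
To make the second suggestion concrete, a standalone read_doctype could be sketched as below. This is only an illustration using the OCaml standard library: no such function is confirmed to exist in Lambda Soup, the return type (the raw doctype text plus the remaining input) is an assumption, and the scan simply stops at the first '>', which ignores the rare case of a '>' inside a quoted identifier.

```ocaml
(* Hypothetical sketch of the proposed read_doctype; standard library only.
   Returns the raw doctype text, if any, and the input that follows it. *)
let read_doctype (input : string) : string option * string =
  let n = String.length input in
  let is_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r' in
  let rec skip i = if i < n && is_space input.[i] then skip (i + 1) else i in
  let start = skip 0 in
  let prefix = "<!doctype" in
  let plen = String.length prefix in
  let has_prefix =
    start + plen <= n
    && String.lowercase_ascii (String.sub input start plen) = prefix
  in
  if not has_prefix then (None, input)
  else
    (* Simplification: the doctype ends at the first '>'. *)
    match String.index_from_opt input start '>' with
    | None -> (None, input)
    | Some stop ->
      let doctype = String.sub input start (stop - start + 1) in
      let rest = String.sub input (stop + 1) (n - stop - 1) in
      (Some doctype, rest)
```

A caller could then pass the second component to parse and later concatenate the saved doctype, or a replacement, in front of the serialized output.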

aantron added a commit that referenced this issue: d066d16, "Store doctype in soup nodes". Second attempt at #32. Closes #33.

aantron · 2020-10-09T19:59:37Z

The commit linked above stores the doctype in the soup node, if the doctype was present.

If the doctype was present and one would like to forcibly drop it, it is possible to do so in this version by selecting the top-level elements from the document and serializing those instead of the document, for example with soup $ "html":

    ("doctype" >:: fun _ ->
      assert_equal
        ("<html></html>" |> parse |> to_string)
        "<html><head></head><body></body></html>";

      assert_equal
        ("<!DOCTYPE html><html></html>" |> parse |> to_string)
        "<!DOCTYPE html><html><head></head><body></body></html>";

      assert_equal
        ("<!DOCTYPE html><html></html>" |> parse $ "html" |> to_string)
        "<html><head></head><body></body></html>");

This isn't obvious, but I decided to defer documenting it until someone asks about it. I also decided to defer adding manipulators for the doctype until they are needed by someone.

I believe the above three cases cover all your needs, @copy and @dmbaturin, as I understood them, and are fairly intuitive. Please let me know if that is not the case.

dmbaturin · 2020-10-10T16:58:27Z

@aantron I think it's a sensible approach, thanks! When do you plan to make an opam release?

aantron · 2020-10-12T05:28:00Z

I'll release this "very soon" (today or tomorrow). In fact, your feedback was the last thing remaining to get before doing it :)

dmbaturin · 2020-10-15T13:56:53Z

@aantron The new version will be 0.8.0?

aantron · 2020-10-16T16:11:26Z

Probably 0.7.2. Sorry about the delay, I had to switch computers and haven't had time to set up. Planning for next week.

aantron · 2020-10-20T16:15:08Z

0.7.2 is now available in opam.

aantron added a commit that referenced this issue Apr 20, 2020

Prepend <!DOCTYPE html> to full documents

2758b31

See #32.

aantron mentioned this issue Apr 22, 2020

Lambda Soup 0.7.1: HTML scraping with CSS ocaml/opam-repository#16258

Merged

aantron closed this as completed Apr 22, 2020

dmbaturin mentioned this issue May 31, 2020

Make forced HTML5 doctype optional. #33

Closed

aantron added a commit that referenced this issue Oct 9, 2020

Store doctype in soup nodes

d066d16

Second attempt at #32. Closes #33.

aantron mentioned this issue Oct 19, 2020

Lambda Soup 0.7.2: HTML scraping with CSS ocaml/opam-repository#17439

Merged

dmbaturin mentioned this issue Oct 21, 2020

Doctypes other than HTML5 aren't emitted correctly #37

Closed
