Doctypes other than HTML5 aren't emitted correctly #37

dmbaturin · 2020-10-21T11:44:13Z

I'm sorry for not giving #32 proper testing!

Your implementation only works for the trivial HTML5 doctype. This makes it impossible to keep the real original doctype for any legacy pages.

Example:

$ cat templates/main.html 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en">
  <head>
...

$ utop
utop # #require "lambdasoup";;
utop # Soup.read_file "templates/main.html" |> Soup.parse |> Soup.pretty_print;;
- : string =
"<!DOCTYPE html><html lang=\"en\">\n <head>\n

The text was updated successfully, but these errors were encountered:

aantron · 2020-10-21T12:47:42Z

This is because Markup's HTML serializer follows the spec here, which allows these doctypes only:

<!DOCTYPE html>
<!DOCTYPE html SYSTEM "about:legacy-compat">

That's obviously too strict. I think this is best solved by actually stripping off the doctype if it is present, rather than sending it to Markup, and outputting it directly from Lambda Soup. If someone later wants legacy doctype support in HTML in Markup, we can move the support there and make Lambda Soup use that, as an internal change.

aantron · 2020-10-21T12:50:36Z

Also note that XHTML should probably be parsed using the XML parser in Markup, and serialized with the XML serializer. That should have precluded most of these issues, which arise from parsing XML with an HTML5 parser and serializing it with an HTML5 serializer.

I understand that the XML parser might not be as convenient, because you would have to use Soup.from_signals and Soup.signals to plug it into Lambda Soup.

aantron · 2020-10-21T12:55:15Z

Given that, how would you like to proceed?

Are you facing a mixture of documents, some of which are HTML5 and some XML, and need some kind of easy way of Markup/Lambda Soup automatically figuring out which parser and serializer to use? I haven't considered that before, but it might be the right thing to do.

dmbaturin · 2020-10-21T13:09:41Z

It's an interesting question. It's not a problem I'm having personally, so it's mostly a theoretical issue. I suppose maybe we should leave it as is until a person who actually has that problem comes along.

The HTML5 spec requires outputting <!DOCTYPE html> as the doctype. However, Markup.ml already has a helper (Markup.html5) for replacing any doctype with the HTML5 doctype. So, it is better for the writer to output whatever doctype is present in the signal stream, trusting the user to call Markup.html5 if it is necessary for their purpose. Part of aantron/lambdasoup#37.

Resolves #37.

aantron · 2020-10-21T19:35:14Z

Can you test again with opam pin add markup --dev-repo?

I modified Markup's HTML writer to output whatever doctype is there, instead of trying to output only HTML5 doctypes. This seems reasonable, given that there is Markup.html5 already available for the user to force an HTML5 doctype, so we should trust whatever doctype is present in the signal stream.

I also added a test to Lambda Soup for round-tripping the doctype from this issue. It passes with Markup master and fails with Markup 1.0.0 release.

See the two commits linked above if curious about the code.

If this works for you, I will release it in Markup 1.1.0. Lambda Soup won't need a release.

Note that Markup master now requires Dune 2.7.0, due to its dune files using the syntax for the new Bisect_ppx integration. If that's too high a constraint, I can downgrade it. I was hoping that the next release of Markup wouldn't be so soon, and there would be time for Dune 2.7.0 to become pretty old and established :)

dmbaturin · 2020-10-21T19:43:55Z

I've just tested it and it seems to work fine. Thanks for the fix!

I actually have both markup and lambdasoup dependencies specified explicitly in soupault.opam due to some previous situation, so I don't mind.
I don't mind dune 2.7.0 either.

aantron · 2020-10-21T19:54:53Z

I'll give it a day in case anything else comes to mind, as the doctype is proving to be a rich source of problems for Lambda Soup and Markup.ml :P

dmbaturin · 2020-10-21T19:57:27Z

Well, I'm working on some other improvements for soupault 2.1.0 too, I wasn't planning to release it right today either. ;)

One visual oddity: traditionally people put a newline between the doctype and the <html> tag when writing HTML by hand, but markup.ml doesn't. That's purely cosmetic of course.

aantron · 2020-10-21T20:14:58Z

Can you open an issue on Markup.ml about this and any other such things? I think we can fix this in the pretty printer (I am assuming based on the code at the beginning of this issue, that you are using it).

aantron · 2020-10-21T20:15:48Z

And ping me if/when you would like a release. I'll wait a bit to hopefully include several changes, and then do it.

dmbaturin · 2020-10-21T20:21:52Z

aantron/markup.ml#58

I need to sit with the Markup.ml code to actually understand it so that I can contriubute in the future...

dmbaturin · 2020-10-25T06:17:03Z

@aantron I've completed my work on changes I want in soupault 2.1.0, so I'm ready to release it whenever the new Markup release is available.

aantron · 2020-10-25T07:15:23Z

I plan to do aantron/markup.ml#59 and retag 1.0.0 and release it in the first half of the coming week.

dmbaturin · 2020-10-25T07:18:13Z

Yeah, I've seen that issue. Whenever the re-tagged markup 1.0.0 is available from the OPAM repos, I'll set the dependency to markup { >= 1.0.0 } and tag the release.

aantron added a commit that referenced this issue Oct 21, 2020

Add test for round-tripping non-HTML5 doctypes

952b89e

Resolves #37.

aantron changed the title ~~Doctypes other than HTML5 aren't parsed and stored correctly~~ Doctypes other than HTML5 aren't emitted correctly Oct 21, 2020

dmbaturin closed this as completed Oct 25, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Doctypes other than HTML5 aren't emitted correctly #37

Doctypes other than HTML5 aren't emitted correctly #37

dmbaturin commented Oct 21, 2020

aantron commented Oct 21, 2020

aantron commented Oct 21, 2020 •

edited

Loading

aantron commented Oct 21, 2020

dmbaturin commented Oct 21, 2020

aantron commented Oct 21, 2020

dmbaturin commented Oct 21, 2020

aantron commented Oct 21, 2020

dmbaturin commented Oct 21, 2020

aantron commented Oct 21, 2020

aantron commented Oct 21, 2020

dmbaturin commented Oct 21, 2020

dmbaturin commented Oct 25, 2020

aantron commented Oct 25, 2020

dmbaturin commented Oct 25, 2020

Doctypes other than HTML5 aren't emitted correctly #37

Doctypes other than HTML5 aren't emitted correctly #37

Comments

dmbaturin commented Oct 21, 2020

aantron commented Oct 21, 2020

aantron commented Oct 21, 2020 • edited Loading

aantron commented Oct 21, 2020

dmbaturin commented Oct 21, 2020

aantron commented Oct 21, 2020

dmbaturin commented Oct 21, 2020

aantron commented Oct 21, 2020

dmbaturin commented Oct 21, 2020

aantron commented Oct 21, 2020

aantron commented Oct 21, 2020

dmbaturin commented Oct 21, 2020

dmbaturin commented Oct 25, 2020

aantron commented Oct 25, 2020

dmbaturin commented Oct 25, 2020

aantron commented Oct 21, 2020 •

edited

Loading