Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doctypes other than HTML5 aren't emitted correctly #37

Closed
dmbaturin opened this issue Oct 21, 2020 · 14 comments
Closed

Doctypes other than HTML5 aren't emitted correctly #37

dmbaturin opened this issue Oct 21, 2020 · 14 comments

Comments

@dmbaturin
Copy link
Contributor

I'm sorry for not giving #32 proper testing!

Your implementation only works for the trivial HTML5 doctype. This makes it impossible to keep the real original doctype for any legacy pages.

Example:

$ cat templates/main.html 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en">
  <head>
...

$ utop
utop # #require "lambdasoup";;
utop # Soup.read_file "templates/main.html" |> Soup.parse |> Soup.pretty_print;;
- : string =
"<!DOCTYPE html><html lang=\"en\">\n <head>\n  
@aantron
Copy link
Owner

aantron commented Oct 21, 2020

This is because Markup's HTML serializer follows the spec here, which allows these doctypes only:

  • <!DOCTYPE html>
  • <!DOCTYPE html SYSTEM "about:legacy-compat">

That's obviously too strict. I think this is best solved by actually stripping off the doctype if it is present, rather than sending it to Markup, and outputting it directly from Lambda Soup. If someone later wants legacy doctype support in HTML in Markup, we can move the support there and make Lambda Soup use that, as an internal change.

@aantron
Copy link
Owner

aantron commented Oct 21, 2020

Also note that XHTML should probably be parsed using the XML parser in Markup, and serialized with the XML serializer. That should have precluded most of these issues, which arise from parsing XML with an HTML5 parser and serializing it with an HTML5 serializer.

I understand that the XML parser might not be as convenient, because you would have to use Soup.from_signals and Soup.signals to plug it into Lambda Soup.

@aantron
Copy link
Owner

aantron commented Oct 21, 2020

Given that, how would you like to proceed?

Are you facing a mixture of documents, some of which are HTML5 and some XML, and need some kind of easy way of Markup/Lambda Soup automatically figuring out which parser and serializer to use? I haven't considered that before, but it might be the right thing to do.

@dmbaturin
Copy link
Contributor Author

It's an interesting question. It's not a problem I'm having personally, so it's mostly a theoretical issue. I suppose maybe we should leave it as is until a person who actually has that problem comes along.

aantron added a commit to aantron/markup.ml that referenced this issue Oct 21, 2020
The HTML5 spec requires outputting <!DOCTYPE html> as the doctype.
However, Markup.ml already has a helper (Markup.html5) for replacing any
doctype with the HTML5 doctype. So, it is better for the writer to
output whatever doctype is present in the signal stream, trusting the
user to call Markup.html5 if it is necessary for their purpose.

Part of aantron/lambdasoup#37.
aantron added a commit that referenced this issue Oct 21, 2020
@aantron
Copy link
Owner

aantron commented Oct 21, 2020

Can you test again with opam pin add markup --dev-repo?

I modified Markup's HTML writer to output whatever doctype is there, instead of trying to output only HTML5 doctypes. This seems reasonable, given that there is Markup.html5 already available for the user to force an HTML5 doctype, so we should trust whatever doctype is present in the signal stream.

I also added a test to Lambda Soup for round-tripping the doctype from this issue. It passes with Markup master and fails with Markup 1.0.0 release.

See the two commits linked above if curious about the code.

If this works for you, I will release it in Markup 1.1.0. Lambda Soup won't need a release.

Note that Markup master now requires Dune 2.7.0, due to its dune files using the syntax for the new Bisect_ppx integration. If that's too high a constraint, I can downgrade it. I was hoping that the next release of Markup wouldn't be so soon, and there would be time for Dune 2.7.0 to become pretty old and established :)

@aantron aantron changed the title Doctypes other than HTML5 aren't parsed and stored correctly Doctypes other than HTML5 aren't emitted correctly Oct 21, 2020
@dmbaturin
Copy link
Contributor Author

I've just tested it and it seems to work fine. Thanks for the fix!

I actually have both markup and lambdasoup dependencies specified explicitly in soupault.opam due to some previous situation, so I don't mind.
I don't mind dune 2.7.0 either.

@aantron
Copy link
Owner

aantron commented Oct 21, 2020

I'll give it a day in case anything else comes to mind, as the doctype is proving to be a rich source of problems for Lambda Soup and Markup.ml :P

@dmbaturin
Copy link
Contributor Author

Well, I'm working on some other improvements for soupault 2.1.0 too, I wasn't planning to release it right today either. ;)

One visual oddity: traditionally people put a newline between the doctype and the <html> tag when writing HTML by hand, but markup.ml doesn't. That's purely cosmetic of course.

@aantron
Copy link
Owner

aantron commented Oct 21, 2020

Can you open an issue on Markup.ml about this and any other such things? I think we can fix this in the pretty printer (I am assuming based on the code at the beginning of this issue, that you are using it).

@aantron
Copy link
Owner

aantron commented Oct 21, 2020

And ping me if/when you would like a release. I'll wait a bit to hopefully include several changes, and then do it.

@dmbaturin
Copy link
Contributor Author

aantron/markup.ml#58

I need to sit with the Markup.ml code to actually understand it so that I can contriubute in the future...

@dmbaturin
Copy link
Contributor Author

@aantron I've completed my work on changes I want in soupault 2.1.0, so I'm ready to release it whenever the new Markup release is available.

@aantron
Copy link
Owner

aantron commented Oct 25, 2020

I plan to do aantron/markup.ml#59 and retag 1.0.0 and release it in the first half of the coming week.

@dmbaturin
Copy link
Contributor Author

Yeah, I've seen that issue. Whenever the re-tagged markup 1.0.0 is available from the OPAM repos, I'll set the dependency to markup { >= 1.0.0 } and tag the release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants