github publish: some markup conversion failures #850

Closed
sknebel opened this issue Nov 20, 2018 · 9 comments

Comments

@sknebel
Contributor

sknebel commented Nov 20, 2018

I just used a fairly complex GH issue (that I had written on GH) to test POSSE-ing to a test repo:
https://www.svenknebel.de/posts/2018/11/8/ to sknebel/random-test-repo#1

The HTML in my post is a cleaned-up version of GitHub's HTML (with mentions of users and other issues removed to cut down the noise).

The following things in the created issue were unexpected:

  • the nested lists didn't convert correctly
  • < > were added around bare links
  • in the "output format" section, a space was added before the italicized "not"
  • the JSON code example was cut off, but I had forgotten to escape a <. The browser still displayed the code that followed, probably because <------ clearly isn't a valid HTML tag, but it is understandable that bridgy (or maybe even mf2py?) failed there (I have since edited the post to use a &lt;)
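
To illustrate the last point, here is a minimal sketch of escaping a code sample before embedding it in a post's HTML, using Python's standard `html.escape`. The `wrap_code` helper and the sample string are hypothetical and not taken from the original post.

```python
import html


def wrap_code(source: str) -> str:
    """Escape a raw code sample and wrap it in <pre><code> (hypothetical helper)."""
    # html.escape replaces &, < and > with entities, so a stray "<------"
    # can no longer be mistaken for the start of a tag by a stricter parser.
    # quote=False leaves " and ' alone, which is fine for element content.
    return "<pre><code>" + html.escape(source, quote=False) + "</code></pre>"


sample = '{"note": "see here <------"}'
print(wrap_code(sample))
```

Escaping at authoring time avoids relying on each parser's error recovery, which differs between browsers and libraries.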

EDIT: feel free to move this issue to granary or ask me to split it up or ... - happy to help you as much as I can.

@snarfed
Owner

snarfed commented Nov 21, 2018

hey, thanks for the report! yeah, converting HTML to markdown will often be imperfect, and mostly at the mercy of html2text, but i can take a look!

@sknebel
Contributor Author

sknebel commented Nov 21, 2018

Just tested: lxml and mf2py handle the <----------- correctly (and escape the < in the HTML output of the e-content), which makes it surprising that it doesn't make it through.

snarfed added a commit to snarfed/granary that referenced this issue Nov 24, 2018
@snarfed
Owner

snarfed commented Nov 24, 2018

fixed the < and > around linked URLs.
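
For readers hitting the same symptom elsewhere: Markdown writes bare links as autolinks of the form `<URL>`, while GitHub already links bare URLs on its own. One way to drop the brackets is a post-processing pass like the sketch below. This is an assumption for illustration only, not necessarily how the referenced granary commit does it; `strip_autolink_brackets` is a hypothetical name.

```python
import re

# Matches a Markdown autolink such as <https://example.com/path>.
# Only http(s) URLs without whitespace or nested angle brackets are handled.
AUTOLINK = re.compile(r"<(https?://[^\s<>]+)>")


def strip_autolink_brackets(markdown: str) -> str:
    """Replace <URL> autolinks with the bare URL (hypothetical helper)."""
    return AUTOLINK.sub(r"\1", markdown)


print(strip_autolink_brackets("see <https://example.com/a?b=1> for details"))
```

The pattern deliberately leaves inline HTML tags and stray comparison operators untouched, since neither starts with `http://` or `https://` right after the `<`.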

@snarfed
Owner

snarfed commented Nov 24, 2018

the nested lists and space before italicized not are afaict bugs in html2text. i may narrow them down and file issues; we'll see.

@sknebel
Contributor Author

sknebel commented Nov 24, 2018

It seems html2text's nested list output is accepted by the original Markdown implementation and those based on it (the Markdown documentation doesn't appear to describe nested lists at all), but CommonMark, on which GitHub's Markdown support is based, specifies nesting explicitly in a way that requires deeper indentation. Its specification has a section on this history: https://spec.commonmark.org/0.28/#motivation
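
As a rough illustration of the CommonMark rule: a nested list item must be indented at least as far as the column where its parent item's content begins, and fewer than four columns beyond it. The sketch below checks that for two lines. The function names are hypothetical, and it simplifies the spec by assuming one to four spaces after the marker and a non-empty item.

```python
import re
from typing import Optional

# Leading indent, a bullet or ordered marker, then 1-4 spaces before content.
LIST_ITEM = re.compile(r"^( *)([-*+]|\d{1,9}[.)])( {1,4})(?=\S)")


def content_column(line: str) -> Optional[int]:
    """Column where a list item's content starts, or None if not a list item."""
    match = LIST_ITEM.match(line)
    return len(match.group(0)) if match else None


def is_nested(parent: str, child: str) -> bool:
    """True if `child` is indented enough to be a sublist item of `parent`."""
    column = content_column(parent)
    if column is None or LIST_ITEM.match(child) is None:
        return False
    indent = len(child) - len(child.lstrip(" "))
    # Less than `column`: the child starts a sibling or a new list.
    # `column + 4` or more: the line is no longer parsed as a list item.
    return column <= indent <= column + 3


print(is_nested("1. foo", "  * bar"))   # two-space indent under an ordered item
print(is_nested("1. foo", "   * bar"))  # three-space indent under the same item
```

With a `1. ` marker the content starts at column 3, so a child indented by only two spaces is not nested under CommonMark, while the same child under a `* ` parent (content at column 2) is.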

@snarfed
Owner

snarfed commented Nov 29, 2020

Looks like the space before __not__ is this html2text bug: Alir3z4/html2text#324

@snarfed
Owner

snarfed commented Nov 29, 2020

...and the two spaces indent for lists is hard coded here: https://github.com/Alir3z4/html2text/blob/296e6f24d16a36bf88b8042d56ebd69ec37aef9c/html2text/__init__.py#L602

@snarfed
Owner

snarfed commented Dec 6, 2020

I've filed Alir3z4/html2text#344 for the list bug, and a PR that fixes it in Alir3z4/html2text#345.

@snarfed
Owner

snarfed commented Dec 7, 2020

fixed! the first three at least, if not the <------ one. thanks for your patience!

snarfed closed this as completed Dec 7, 2020