option to not use html entities #1737

Pomax · 2020-08-03T22:13:52Z

Pomax
Aug 3, 2020

Marked currently turns things like single quotes, double quotes, ampersands, etc. into their html entity equivalent, and this has by and large not been necessary for a long time. Could an option please be added to turn this off, so that code that relies on marked doesn't need a long list of post-processing replacements to turn all those entities back into the real letters/characters/codepoints?

UziTech · 2020-08-04T01:33:31Z

UziTech
Aug 4, 2020
Maintainer

We are trying to follow the CommonMark spec. We are focusing marked on creating HTML from markdown according to the spec, and on using extensions instead of options to change the behavior.

Pomax · 2020-08-06T03:08:04Z

Pomax
Aug 6, 2020
Author

That's very disappointing, given how much bloat "extension" installs tend to carry. It's 2020, and HTML entities for things like apostrophes make literally no sense (unlike the HTML 4.01 era, where they kind of did, until UTF-8 won).

Right now, I'm stuck having to use the following code, which I can guarantee you is less bloat than an extension:

  let converted = marked(data, {
    gfm: true,
    headerIds: false,
    mangle: false,
  })
    // sigh...
    .replace(/&amp;/g, "&")
    .replace(/&#39;/g, "'")
    .replace(/&quot;/g, '"')

This should not be necessary out of the box.

As such, I strongly object to the label added to this issue: this should absolutely not be an extension. Instead, marked should not needlessly bloat resultant HTML with outdated HTML entities. The point is to generate modern HTML5, not ancient HTML4.01, because to quote the spec:

Valid HTML entity references and numeric character references can [emphasis mine] be used in place of the corresponding Unicode character, with the following exceptions:

At the spec level, the word "can" does not mean "should" in the slightest. It means any parser may or may not do this and both options are spec-compliant. Forcing that option without user choice is not "following the spec", it's making the decision for your users, and this should absolutely be an option people should be able to set for marked parsing, not an extension that adds what should be a basic parser flag.

UziTech · 2020-08-06T04:56:04Z

UziTech
Aug 6, 2020
Maintainer

what should the following code display?

document.body.innerHTML = marked("`&amp;`");

If you think it should display &amp;, then the & needs to be changed to &amp; by marked (so the output contains &amp;amp;); otherwise it would display &.
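The round trip behind this question can be sketched without marked itself. The snippet below is only an illustration: `escapeHtml` is a hypothetical stand-in for the escaping a Markdown-to-HTML converter applies to code-span text, and `browserDisplay` is a tiny model of how a browser decodes character references, covering just the few entities mentioned in this thread.

```javascript
// Hypothetical stand-in for converter-side escaping (not marked's actual code).
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Minimal model of browser decoding; only a handful of named references are known.
const NAMED = new Map([
  ["amp", "&"], ["lt", "<"], ["gt", ">"], ["quot", '"'], ["copy", "©"],
]);

function browserDisplay(html) {
  return html.replace(/&(#\d+|[a-z]+);/g, (match, ref) => {
    if (ref.startsWith("#")) return String.fromCodePoint(Number(ref.slice(1)));
    return NAMED.has(ref) ? NAMED.get(ref) : match;
  });
}

const codeSpanText = "&amp;"; // what the author typed between the backticks

console.log(browserDisplay(escapeHtml(codeSpanText))); // &amp;  (what was typed)
console.log(browserDisplay(codeSpanText));             // &      (not what was typed)
```

Under these assumptions, only the escaped version survives the browser's decoding step unchanged.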

UziTech · 2020-08-06T04:57:22Z

UziTech
Aug 6, 2020
Maintainer

removing html entities doesn't make sense if the goal is to output html.

Pomax · 2020-08-06T05:36:00Z

Pomax
Aug 6, 2020
Author

I didn't say to remove entities, I said that marked "currently turns things like single quotes, double quotes, ampersands, etc. into their html entity equivalent".

If the original code literally contains &amp; (or &#39; etc.) then, of course: preserve that, as that was clearly the intention.

But if the original code does not already have html entities, such as marked("# This primer's original & only purpose"), then marked should not be rewriting those perfectly fine characters into HTML entities. They already worked and didn't need replacing to become valid HTML text content.

To wit:

Welcome to Node.js v14.7.0.
Type ".help" for more information.
> const marked = require("marked");
undefined
> marked("# This primer's original & only purpose")
'<h1 id="this-primers-original--only-purpose">This primer&#39;s original &amp; only purpose</h1>\n'
>

That really should be <h1 id="this-primers-original--only-purpose">This primer's original & only purpose</h1>.

UziTech · 2020-08-06T14:09:41Z

UziTech
Aug 6, 2020
Maintainer

If the original code literally contains &amp; (or &#39; etc.) then, of course: preserve that, as that was clearly the intention.

That is the problem: the amount of code that would be required to accurately replace only the ones that are needed would be huge and ever-changing.

Since <h1 id="this-primers-original--only-purpose">This primer&#39;s original &amp; only purpose</h1> displays the same as <h1 id="this-primers-original--only-purpose">This primer's original & only purpose</h1> in the browser anyway, there is no reason not to replace them.

UziTech · 2020-08-06T14:17:33Z

UziTech
Aug 6, 2020
Maintainer

Creating an option that would accurately only replace the html entities that are needed would bloat the marked codebase which is why it would be better as an extension so most people who don't need that functionality would not need the extra code.

Pomax · 2020-08-06T15:03:33Z

Pomax
Aug 6, 2020
Author

Since [...] displays the same as [...] in the browser any way there is no reason not to replace them.

Except that it needlessly grows the content by quite a lot of bytes. That's not super important for short content, but I have two websites that are 120 and 400 page books, respectively: it makes a huge difference not turning relatively frequently used single letters into five or more letters.
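As a rough illustration of the size argument, the overhead for a given piece of text can be counted directly. This is a sketch only: `entityOverhead` is a hypothetical helper, and the entity spellings are the ones that appear in the output quoted in this thread.

```javascript
// Entity written in place of each character, per the output shown in this thread.
const ENTITIES = new Map([
  ["&", "&amp;"],
  ["'", "&#39;"],
  ['"', "&quot;"],
]);

// Extra bytes added when each such character in `text` is written as its entity.
// Every character involved is ASCII, so one character is one byte in UTF-8.
function entityOverhead(text) {
  let extra = 0;
  for (const ch of text) {
    if (ENTITIES.has(ch)) extra += ENTITIES.get(ch).length - 1;
  }
  return extra;
}

console.log(entityOverhead("This primer's original & only purpose")); // 8
```

The figure is measured before any transfer compression, which would be expected to narrow the difference.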

As for the code needing lots of changes, I'm not sure it does? There are only two characters that must be converted to HTML entities in order not to produce invalid HTML, namely < and >, but the markdown spec already treats those as active tag syntax and marked doesn't even touch them at all; it leaves them as is. So... really the change would be to bypass the HTML entity generator if the options said not to perform HTML entity conversion?

(e.g. it feels like this would be a matter of function convertToHTMLEntity(blah) { if (options.noHTMLEntityConversion) return blah; /* and then the original code */ }, rather than trying to rework the actual parsing and target everywhere that tokens are identified as potential HTML entity targets)
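Spelled out as runnable code, that suggestion might look roughly like the sketch below. The names `escapeText` and `noEntityConversion` are hypothetical and are not part of marked's API; the escaping itself is a generic stand-in rather than marked's implementation.

```javascript
// Hypothetical option-guarded escape function (illustrative names only).
function escapeText(text, options = {}) {
  // Bypass: hand the text back untouched when conversion is switched off.
  if (options.noEntityConversion) return text;

  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

const heading = "This primer's original & only purpose";

console.log(escapeText(heading));
// This primer&#39;s original &amp; only purpose

console.log(escapeText(heading, { noEntityConversion: true }));
// This primer's original & only purpose
```

Note that a bypass this simple is unconditional: text such as &copy; or a literal < would also pass through unchanged, so whether it is acceptable depends on what the input can contain.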

UziTech · 2020-08-06T15:47:36Z

UziTech
Aug 6, 2020
Maintainer

marked does not convert &, ', and " to html entities because it would be invalid html otherwise. Marked converts them because the html that is output would not display what the user wants if they weren't converted.

Consider writing markdown that contains any html entity inside a code block:

`&copy;`

marked converts this markdown to the html: <p><code>&amp;copy;</code></p> which will display as &copy; in a browser.

If marked didn't convert & to &amp; the html would be: <p><code>&copy;</code></p> and the browser would display ©

Which is not the correct way to display that markdown.

There are similar examples of how not converting ' and " would create the wrong output in the browser.

This is just one example of many situations where marked would need to use html entities in order to display the correct output in a browser.

The code required to 100% accurately only replace the ones that are needed would be incredibly complex, definitely not as easy as .replace(/&/g, "&amp;").
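To give a feel for where the complexity comes from, here is a deliberately incomplete sketch of "escape only where needed" for plain text content. It is not marked's code, and `escapeOnlyWhereNeeded` is a hypothetical name. It assumes an ampersand needs escaping only when it is followed by something shaped like a character reference with a trailing semicolon. It does not handle the legacy names that browsers also decode without a semicolon (such as &copy), and it does not cover attribute values, where quotes matter as well.

```javascript
// "&" followed by something shaped like a numeric or named character reference.
const REFERENCE_AHEAD = /&(?=#[0-9]+;|#[xX][0-9a-fA-F]+;|[a-zA-Z][a-zA-Z0-9]*;)/g;

// Sketch only: escapes "<" always, and "&" only when a reference-like run follows.
function escapeOnlyWhereNeeded(text) {
  return text.replace(REFERENCE_AHEAD, "&amp;").replace(/</g, "&lt;");
}

console.log(escapeOnlyWhereNeeded("This primer's original & only purpose"));
// This primer's original & only purpose

console.log(escapeOnlyWhereNeeded("&copy;"));
// &amp;copy;

console.log(escapeOnlyWhereNeeded("a < b && c"));
// a &lt; b && c
```

Closing the gaps listed above would require the full table of named references and knowledge of the output context, which is the part that grows.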

The decision to move to extensions instead of options was so that marked doesn't become a massive library catering to everyone's individual situations. With extensions there can be millions of ways to change marked without the library becoming bloated for the majority of users who just need markdown turned into HTML.

UziTech · 2020-08-06T15:54:02Z

UziTech
Aug 6, 2020
Maintainer

If you would like to research the code needed to only convert html entities when needed I would be happy to help you develop an extension for marked.

Pomax · 2020-08-06T23:56:26Z

Pomax
Aug 6, 2020
Author

I might, I'm in the process of completely rewriting https://pomax.github.io/bezierinfo atm anyway, which relies on marked for the principal conversion step (with the back corrections for html entities of course). I can see if there's time left on my budget when I finish the core of the rewrite work.

burtonator · 2021-05-29T18:45:41Z

burtonator
May 29, 2021

HTML entities are no longer needed with Unicode, right? This should be an option. We're having to replace them too...

1 reply

UziTech May 30, 2021
Maintainer

They are still needed. See example in #1737 (comment)

abbychau · 2021-09-11T16:27:55Z

abbychau
Sep 11, 2021

It will be useful when markedjs is used as a library, like in Vue or React.

henri42 · 2023-10-17T19:55:47Z

henri42
Oct 17, 2023

Hello !

It is a bit late, but as an easy workaround, it is possible to use the html-entities package, whose decode function converts entities back to their equivalent characters.

manwithafox · 2023-11-02T13:34:37Z

manwithafox
Nov 2, 2023

What I consider an architectural/design flaw is that HTML encoding happens during parse i.e. the CST already contains different text nodes. IMHO this should be part of the renderer, potentially with an option to turn off.

I'm writing my own renderer and am not interested in HTML at all. I'm just using the CST and have to decode back to the original, but only if the original (raw property) was not encoded. Just using the raw property is also not straightforward and requires other treatment (e.g. whitespace).

Please allow this encoding to be disabled. Thanks a lot!
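The separation being asked for could be sketched as follows. Everything here is hypothetical: the token shape, `htmlRenderer`, `textRenderer`, and `render` are illustrative names and do not reflect marked's actual tokens or renderer API. The point is only that tokens carry unescaped text, and each renderer decides whether to escape.

```javascript
// Hypothetical tokens holding the text exactly as written (no entities yet).
const tokens = [
  { type: "heading", depth: 1, text: "This primer's original & only purpose" },
  { type: "paragraph", text: "Use <b> & enjoy" },
];

function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// The HTML renderer escapes at output time.
const htmlRenderer = {
  heading: (t) => `<h${t.depth}>${escapeHtml(t.text)}</h${t.depth}>`,
  paragraph: (t) => `<p>${escapeHtml(t.text)}</p>`,
};

// A non-HTML renderer uses the same tokens and never sees an entity.
const textRenderer = {
  heading: (t) => `${"#".repeat(t.depth)} ${t.text}`,
  paragraph: (t) => t.text,
};

function render(tokenList, renderer) {
  return tokenList.map((t) => renderer[t.type](t)).join("\n");
}

console.log(render(tokens, htmlRenderer));
console.log(render(tokens, textRenderer));
```

With this split, a custom renderer would not need to decode anything, because nothing was encoded before it ran.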

1 reply

UziTech Nov 2, 2023
Maintainer

If you would like to create a PR to change HTML encoding to the renderer I would be ok with that 😁👍
