Abandon or Modify ElementTree? #420
Comments
Why not? The main feature of Sphinx is not directives, but beautiful themeable output (HTML, LaTeX, etc.). I think that if we start using the Docutils document tree, it will be possible to add Markdown support to Sphinx.
By the way, here is an attempt by someone else to write a Markdown parser for Docutils.
Yes, I was aware of that. However, it does not offer the kind of extension API we do and it is based on a PEG parser. With our extension API, support for Docutils could really be something. Regarding Sphinx, if it turns out to work, great, but that is not at all a goal of mine at this time.
I guess that is the real question: do the extra Docutils possibilities sway you enough to use it? It sounds like you will have to do some customizing whether you use Docutils or ET, but you would have more control over, say, ET. But I suspect most users would say go Docutils. If you tell them you have an option that gives more potential features and one that does not, I suspect most will pick the former 😄.
I did some work mapping the Markdown Elements (as they are output in HTML) to Docutils Node Elements. I have made no effort to map anything used by extensions. This is solely the base syntax. This is what I have come up with before doing any testing:
A few things to note:

- Headers and hard line breaks will need special handling (more on headers below).
- Raw HTML should go in a `raw` node; I still need to confirm that.
- Images are interesting, as Docutils allows an image to be either an inline element or a block-level element, which is supported by the HTML spec. However, in Markdown, images only ever appear as inline elements. Shouldn't be a problem; we just need to be careful.
- Everything else should be pretty straightforward.
- All Docutils elements can be imported from `docutils.nodes` (a small illustrative sketch follows below).
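To make the Docutils side concrete, here is a small, purely illustrative sketch (the tag-to-node table is my rough guess, not the tested mapping):

```python
# Illustrative sketch only: a rough guess at mapping Markdown's HTML output
# tags to Docutils node classes, plus a tiny hand-built fragment.
from docutils import nodes

# Hypothetical tag-to-node table (assumptions, not a verified mapping).
TAG_MAP = {
    'p': nodes.paragraph,
    'em': nodes.emphasis,
    'strong': nodes.strong,
    'blockquote': nodes.block_quote,
    'ul': nodes.bullet_list,
    'ol': nodes.enumerated_list,
    'li': nodes.list_item,
    'code': nodes.literal,
    'pre': nodes.literal_block,
    'a': nodes.reference,
    'img': nodes.image,
    'hr': nodes.transition,
}

# Build <p>Hello <strong>world</strong></p> by hand.
para = nodes.paragraph()
para += nodes.Text('Hello ')
para += nodes.strong('', 'world')
print(para.pformat())  # indented pseudo-XML view of the node tree
```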
Given the situation with headers, I should note that because each header generates a nested section, Docutils does not allow header levels to be skipped. In fact, it will crash hard on inconsistently nested levels. Markdown should never crash hard on bad input. Another issue is that there is no hard rule for which characters in the ReST syntax represent which level. It is simply assumed that they appear in the order they are found (the inconsistency comes when you step back up, then down again -- it is assumed that you use the same pattern going back down). Therefore, the first header encountered is always treated as a level 1 header, regardless of which level the author intended.

Given the above, I don't think I'll be pursuing the use of Docutils at this time. However, I think reviewing it has been beneficial. It has made it more clear how I want to modify/subclass ElementTree. A benefit of using ElementTree is that the changes can be made more incrementally, running tests as I go, which feels much less intimidating.
Yeah, when I saw the wacky header tag handling I was going to comment that this was looking less desirable. But seeing that it can actually break things makes it quite a bit worse, considering Python-Markdown's goals.
Over the past week I've been slowly putting together an altered ElementTree lib which has ended up re-implementing almost everything (the only ElementTree piece still used is its XPath support, which we don't really use anyway). It is hardly close to done, but it occurred to me that I was more-or-less re-implementing Beautiful Soup's document object.

The one weird thing about Beautiful Soup is that you can't create a document unless you parse something. So to create an empty document which elements can be added to, you need to parse an empty string. I assume parsing an empty string is not too much of a performance hit (I should probably confirm that), so it's not too big of a deal, just weird. Once you get past that, the API is very extensive and easy to work with. It is specifically designed for working with HTML and even gives more control than anything I would have custom built myself. Text is represented as child nodes alongside child element nodes. Every node in the document tree knows about its parents, siblings, children, etc. Methods are provided on each node to insert, insert_before, insert_after, append, and the list goes on...

I should note that I would be using version 4 of Beautiful Soup (which breaks compatibility with the more popular version 3, but is necessary to get Python 3 support). Unlike earlier versions, Beautiful Soup 4 is not an HTML parser. It simply wraps existing third party parsers (lxml, html5lib, and Python's HTMLParser) and provides an easy-to-use API for accessing and manipulating the parsed document. Python-Markdown's use case is manipulating/building an HTML document, so the goals align fairly well.

As the home page states: "Beautiful Soup is licensed under the MIT license, so you can also download the tarball, drop the bs4/ directory into almost any Python application (or into your library path) and start using it immediately." That was not the case with earlier versions. Although, there is the issue that the 2to3 tool needs to be run for Python 3, so just copying it into the Markdown lib doesn't make much sense. But it can be listed as a dependency and get installed automatically by the setup script as long as an internet connection is available.

An added plus is that Beautiful Soup comes with its own serializer which is built specifically for HTML (with pretty-printing built in). Although we would lose the ability to distinguish between HTML and XHTML, the only difference in Markdown's syntax is the trailing slash on empty elements such as `<br />` and `<hr />`.

Any thoughts?
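For a feel of the API, a minimal sketch of the empty-string quirk and the node-manipulation methods (illustrative only; not Python-Markdown code):

```python
# Illustrative only: create an "empty" Beautiful Soup 4 document by parsing
# an empty string, then build <p>Hello <em>world</em>!</p> with the node API.
from bs4 import BeautifulSoup

soup = BeautifulSoup('', 'html.parser')   # the odd empty-parse step

p = soup.new_tag('p')
p.append(soup.new_string('Hello '))
em = soup.new_tag('em')
em.string = 'world'
p.append(em)
p.append(soup.new_string('!'))
soup.append(p)

print(soup)             # <p>Hello <em>world</em>!</p>
print(em.parent.name)   # 'p' -- every node knows its parent
```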
Can we change the BeautifulSoup API (i.e. submit a pull request) to make it easier to create an empty document? So that we can simplify our code, at least in the future?
I looked into that (although I have not submitted or requested anything upstream). As I understand it, any BeautifulSoup document is always assumed to have been created by a parser, and that assumption is interwoven throughout the code. For example, the serializer checks which parser was used to choose between various branches in its behavior.

In fact, if you create a fragment and try to serialize it, it will crash hard. It needs to be contained in a "document" object, which is a special node that holds a reference to the parser, among other things. When serializing a child, it looks up the tree to the document root for various data to determine how it should behave. I tried creating a subclass of that document root class which statically sets all the moving parts on the document root, but I am still getting weird errors I can't seem to figure out.
Apparently, I didn't make a copy of my attempt to override the default BeautifulSoup document root class (with one that skips the parsing step and sets defaults). However, I did do this. As I explain in the comments:
Perhaps it is just a crazy idea, but it might actually work. If so, the first step in parsing a document would be to pass it to the HTML parser. All of the non-HTML parts would simply be text nodes. Then, loop through those text nodes and convert them to the appropriate block-level nodes, then inline nodes, etc. If we went that way, the no-need-for-an-HTML-parser argument would no longer apply. What do you think? A bad idea, or brilliant?
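A rough sketch of that first step (an assumption about how it could work, not how Python-Markdown works today):

```python
# Assumption/sketch: run the raw source through an HTML parser first; the
# non-HTML portions come back as plain text nodes, which the block/inline
# parsers would then process in place.
from bs4 import BeautifulSoup, NavigableString

source = "Some *emphasized* text.\n\n<div>raw HTML block</div>\n\nMore text."
soup = BeautifulSoup(source, 'html.parser')

for node in list(soup.descendants):
    if isinstance(node, NavigableString):
        # A real implementation would parse this text for Markdown syntax
        # and replace the node with the resulting elements.
        print(repr(str(node)))
```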
If anyone is interested, my (failed) attempt to create a BeautifulSoup document root subclass is here.
This looks like a giant hack, doesn't it?
Yeah, you're right. Moving on...
Well, it's up to you to make the decision. I didn't even look at the code, just read your summary, so my opinion shouldn't really matter…
I am not sure I understood the rationale for dismissing Docutils so quickly. It seems to me skipping header levels is pretty easy, provided that you put in the right level of nested sections. The advantages you get are immense — you get a lot of interesting writers (which you also don't have to maintain, so you have that going for you) from both the Docutils and Sphinx ecosystems, and the very intriguing Transforms.
@lehmannro the issue is that the HTML output is not valid according to the Markdown rules. According to the Markdown rules, a header at a given level in the source must produce a header tag at that same level in the output, regardless of where it appears in the document.

In fact, if you look at the test frameworks for the various Markdown implementations (including the reference implementation), they all contain a bunch of Markdown files with matching HTML files. The tests are run by passing the Markdown file through the parser and comparing the output to the HTML file. Even one character of difference results in a test failure. The Markdown syntax is expected to produce very specific HTML. Any significant variation from that specific HTML is an error. Markdown is very much tied to HTML. See here for why this matters.

That is very different from the approach taken by Docutils. While parts of Docutils closely mirror HTML, that is coincidental. AFAICT, from the get-go, Docutils was designed for representing a document structure regardless of the output format. Therefore, even the ReST-to-HTML conversion is not always the most obvious. It simply does not give us the option to output the HTML that Markdown users expect/require.

While I agree that a Markdown-to-Docutils tool would be very useful, it does not serve the Markdown community at large very well. As we are the leading (most downloaded from PyPI, at least) implementation of Markdown in Python, unfortunately I don't think we can adopt the use of Docutils for the reasons explained above. That said, a less-mainstream Markdown implementation which supports Docutils and is upfront about the fact that it does not output the expected HTML certainly has a place in the world. As mentioned previously, such an implementation already exists and nothing is stopping anyone from creating others.
I'm not sure the document you linked and your statement are consistent. Look at any of the examples under "What are some examples of interesting divergences between implementations?" (e.g. ATX headers with escapes) — they ALL have different output. From the description of the document, its purpose is to "promote discussion of how *and whether* certain vague aspects of the markdown spec should be clarified." I don't think it's trying to publicly shame parsers which do not adhere to the standard (which, and please correct me if I'm wrong, is very loose, as illustrated by the document).

While I cannot stop you from implementing your favorite Markdown parser in any way you want, a statement such as "If we use Docutils, we can't guarantee that that is what we will get" is simply FUD. There are strict (and, I would claim, stricter than in the Markdown spec) guarantees as to what Docutils does and doesn't produce.

I don't have any experience with the "Markdown test frameworks" but if they test something that's not in the spec they are plain and simple wrong. (That being said, if stripping extraneous …)
There is the problem. If I start a Markdown document with a level 3 header, Markdown requires that an h3 appear in the output. Docutils, however, would treat it as the first (and therefore top-level) section of the document, so the output would not be what Markdown users expect.

Regarding my linking to the Babelmark2 FAQ, the point is that implementations should not differ in their output. Yes, unfortunately many do. However, we should not be creating more differences, which would only make matters worse. Adopting Docutils would do just that.
Sorry, I missed this point the first time, so I'll address it now. AFAICT, Transforms run on the Docutils document object, which does not yet contain any header tags; those only exist after a writer has run.

To make this work with Markdown, the way I would want to do it would be to not use Docutils' nested sections at all. As I can't use Docutils' writers without its sections, I'm open to alternate solutions here. But personally, I'm not seeing them.
Did you have a look at https://gist.github.com/lehmannro/2d2127b7c839282a673d which I linked earlier? It produces a properly nested section structure from flat header levels.
In my opinion, the work (and headaches) to create and maintain code which implements that hack far outweigh the benefits. Give me an example that does not use sections (Markdown has no concept of sections), and perhaps I'll reconsider. In other words, Docutils is not an HTML document object, which is what I need.

My personal opinion is that given the very close mapping between Markdown and HTML, the best way to get from Markdown to Docutils is to do Markdown => HTML => Docutils. An HTML-to-Docutils tool can exist separately from Markdown and serve a much wider audience, but also provide a decent way to get from Markdown to Docutils. In fact, whatever HTML document object library Markdown uses could also have a [document object]-to-Docutils tool which would eliminate the need to first serialize the Markdown document and then parse the HTML into another document object. Think ElementTree2Docutils or BeautifulSoup2Docutils.

Personally, I'm surprised those tools don't exist already. They could work great for converting Markdown to all of Docutils' supported output formats and would serve a much broader audience as well. In fact, BeautifulSoup2Docutils would be immensely useful. You could use it to parse HTML using your choice of any of the decent Python HTML parsers (as per BeautifulSoup's API) and could output to any of Docutils' supported output formats. At that point, any markup language's lack of explicit support for Docutils would only be an optimization issue (skipping the HTML serialization and subsequent parsing would obviously be an optimization -- but offers no additional advantages that I can see).
To be clear, this is a valid Markdown document:
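Something along these lines (an illustrative stand-in, run through Python-Markdown so the required output is explicit):

```python
# Illustrative: Markdown happily accepts headers that start deep and skip
# levels, and the output must use the matching <h#> tags.
import markdown

source = (
    "### I start at level three\n\n"
    "Some text.\n\n"
    "# Then jump to level one\n\n"
    "##### And skip down to five"
)
print(markdown.markdown(source))
# Expected output (each header keeps its own level):
# <h3>I start at level three</h3>
# <p>Some text.</p>
# <h1>Then jump to level one</h1>
# <h5>And skip down to five</h5>
```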
I could keep going, but the point should be obvious. Keeping track of nested section levels would be a real headache when building a Docutils document. Unless someone can point me to a way to not use Docutils sections, consider the subject closed for discussion.
For completeness, I just stumbled on this project: AdvancedHTMLParser. It looks interesting, but its history is limited and I know nothing of its stability. The interesting part is the AdvancedTag object, which is both a node and self-printing (using innerHTML and outerHTML). The lib more-or-less mirrors the JavaScript DOM, which may or may not be a good idea.
Just a quick update. My work on an HTML node toolkit has stabilized. I'd do a release, except that I haven't actually used it for any real work yet. In any event, it's ready to use. However, I'm not sure I want to use it in Markdown.

Now I'm thinking a simpler node structure would be preferred; perhaps only the nodes represented in the Markdown text. For example, Markdown only has list items; no parent ul or ol is actually represented in the document (they are only implied). So perhaps the node tree should reflect that. Each list item node could retain which type it is (alphanumeric (and value), dash, asterisk, plus...), but have no parent list node. Then, when rendering (or perhaps in some intermediary transform) the specifics could be worked out (parent node, list type, item value, etc.). That should give much more control over the various ways that people prefer to have lists rendered, and it doesn't require any modification of the parser; only the rendering (or transform) step would need to be modified.
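A rough sketch of what such list-item nodes might look like (all names below are hypothetical; nothing like this exists in the code yet):

```python
# Hypothetical sketch: a flatter node structure where list items record their
# own marker and no parent ul/ol node is stored; the wrapping list is
# reconstructed at render time.
from dataclasses import dataclass, field

@dataclass
class ListItem:
    marker: str                  # '-', '*', '+', or an ordinal like '1.'
    children: list = field(default_factory=list)

def render_items(items):
    """Group consecutive items and wrap them in the implied <ul> or <ol>."""
    if not items:
        return ''
    tag = 'ol' if items[0].marker[:-1].isdigit() else 'ul'
    body = ''.join('<li>%s</li>' % ''.join(item.children) for item in items)
    return '<%s>%s</%s>' % (tag, body, tag)

print(render_items([ListItem('-', ['One']), ListItem('-', ['Two'])]))
# -> <ul><li>One</li><li>Two</li></ul>
```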
What about using a JSON-like structure? Would it be faster to process? Also, it would probably make it easier to create writers for new output formats.
@andya9 I'm thinking of something very similar to that. It would be more performant to use native Python objects, but yes, something very JSON-like. Perhaps a string representation of the document tree would even be in JSON.
I’m glad to hear that! 😃
Also, thanks for the link to Pandoc's documentation. I had looked before but never found the definition of their internal document structure. Apparently it was more recently broken out into a separate package, and they have a complete definition of the structure. Could be helpful.
I’m really glad I could be useful!
There’s also remarkjs, which allows JSON output.
remarkjs is very cool. See mdast for a definition of the Markdown Abstract Syntax Tree which it uses.
Cool indeed! Thank you for the link, I’ll take a good look.
Here is a simple JSON-based AST in Python I threw together.
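Roughly, the idea looks like this (a simplified sketch, not the actual gist):

```python
# Simplified sketch of a JSON-style AST: plain dicts with 'type', optional
# 'value', and 'children' keys; the whole tree serializes to JSON as-is.
import json

doc = {
    'type': 'document',
    'children': [
        {'type': 'heading', 'level': 1, 'children': [
            {'type': 'text', 'value': 'Hello'},
        ]},
        {'type': 'paragraph', 'children': [
            {'type': 'text', 'value': 'Some '},
            {'type': 'emphasis', 'children': [
                {'type': 'text', 'value': 'emphasized'},
            ]},
            {'type': 'text', 'value': ' text.'},
        ]},
    ],
}

print(json.dumps(doc, indent=2))
```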
Beautiful!
I’m not skilled enough to deal with the preprocessing part, but I’d be glad to help with transforming the AST into output (dict object > HTML element + extensions).
Just wanted to say that I support the decision to leave out Docutils.
We have decided to defer implementing this until a future release. Therefore I am removing this from the 3.0 milestone. The reason is that our most important asset is the rich collection of extensions (both first and third party). Making a change of this magnitude would require every third party extension to undergo a complete refactor. At this time the costs outweigh the gains. However, we will re-evaluate again in the future.
When considering a different AST, it would be useful to revisit #215. I think it's very useful for an application to be able to get access to the AST, but the fact that currently essential post-processing happens after serialization prevents that. Is it necessary for the post-processing to happen on the serialized output, or could it, in theory, be done on the AST instead?

I guess that with ElementTree as the AST, post-processing on the AST would mean you'd have to parse inline HTML and insert that into the tree as well. This would mean that HTML errors would have to be handled (escalated or repaired) instead of forwarded to the output. In my opinion that's a positive, because catching errors early is generally a good thing, but it might not be the Markdown way. But if you'd go for a different AST implementation, you could have inline HTML as raw HTML nodes inside the AST and still post-process those textually, while post-processing all generated HTML in the AST instead.
The short version:
As part of version 3.0 (see #391), should Python-Markdown perhaps abandon ElementTree for a different document object, like Docutils' node tree, or use a modified ElementTree for internally representing the parsed HTML document?
Any and all feedback is welcome.
The long version:
Starting in Python-Markdown version 2.0, internally parsed documents have been represented as ElementTree objects. While this mostly works, there are a few irritations. ElementTree (hereinafter ET) was designed for XML, not HTML, and therefore a few of its design choices are less than ideal when working with HTML.
For example, by design, XML does not generally have text and child nodes interspersed like HTML does. While ET provides `text` and `tail` attributes on each element, it is not as easy to work with as it would be if the text was contained in child "TextNodes" (much like JavaScript's DOM). Additionally, ET nodes have no knowledge of their parent(s), which can be a problem in certain HTML-specific situations (some elements cannot contain other elements as children or grandchildren or great-grandchildren...).

I see two possible workarounds to this: modify ET or use a different type of object.
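To make the irritation concrete, here is how mixed content has to be expressed with ET's `text` and `tail` attributes today:

```python
# Building <p>Hello <em>world</em>!</p> with ElementTree's text/tail model.
import xml.etree.ElementTree as etree

p = etree.Element('p')
p.text = 'Hello '          # text *before* the first child lives on the parent
em = etree.SubElement(p, 'em')
em.text = 'world'
em.tail = '!'              # text *after* the child lives on the child's tail
print(etree.tostring(p, encoding='unicode'))  # <p>Hello <em>world</em>!</p>
```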
Modifying ElementTree
We already have a modified serializer which gives us better HTML output (it is actually a modified HTML serializer from ET), and we already import ET and document that all extensions should import ET from Markdown. Therefore, if we were to change anything (via subclasses, etc.), those changes would propagate throughout all extensions without too much change.
In fact, some time ago, I played around with the idea of making ET nodes aware of their parents. While it worked, I quickly abandoned it as I realized that it would not work for cElementTree. However, on further consideration, we don't really need cElementTree (most of the benefits are in a faster XML parser which we don't use).
Interestingly, in Python 3.3 cElementTree is deprecated. What actually happens is that ET defines the Python implementation and then, at the bottom of the module, tries to import the C implementation, which, upon success, overrides the Python objects of the same name. What is interesting about this is that the Python implementation of the `Element` class (ET's node object) is preserved as `_Element_Py` for external code which needs access to it (as explained in the comments).

I envision a modified ET lib which basically subclasses the Python `Element` object to enforce knowledge of parents for all nodes. Then a `TextNode` would be created which works essentially like Comments work now (a sketch of that pattern follows below). The serializer would then be updated to properly output `TextElement`s. In fact, at some point, the serializer might even be able to lose knowledge of the `text` and `tail` attributes on regular nodes. However, that last bit could wait for all extensions to adopt the new stuff.
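For reference, Comments in ElementTree are created by a factory function which is also stored as the element's tag; a hypothetical `TextNode` could follow the same pattern (none of these names exist in Python-Markdown today):

```python
# ElementTree's Comment is a factory function that is also used as the tag of
# the elements it creates; a hypothetical TextNode could work the same way.
import xml.etree.ElementTree as etree

def TextNode(text=None):
    """Create a text node, analogous to etree.Comment()."""
    elem = etree.Element(TextNode)  # the factory itself serves as the tag
    elem.text = text
    return elem

p = etree.Element('p')
p.append(TextNode('Hello '))
em = etree.SubElement(p, 'em')
em.append(TextNode('world'))

# A custom serializer would check `child.tag is TextNode` and write out
# child.text (escaped) instead of emitting start/end tags -- just as the
# stock serializer special-cases `elem.tag is etree.Comment`.
```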
In addition to `TextElement`, we could also have `RawTextElement` and `AtomicTextElement`. Both would be ignored by the parser (no additional parsing would take place). However, a `RawTextElement` would be given special treatment by the serializer in that no escaping would take place (raw HTML could be stored inline in the document rather than in a separate store with placeholders in the document), whereas an `AtomicTextElement` would be serialized like a regular `TextElement`.

The advantage of an `AtomicTextElement` (over the existing AtomicString) is that a single node could have multiple child text nodes. Today, each node only gets one `text` attribute. Therefore, when an AtomicString is concatenated with an existing `text` string, we lose the 'atomic' quality of the sub-string. However, with this change each sub-string can reside in its own separate text node and maintain the 'atomic' quality when necessary.

Using Docutils
Rather than creating our own one-off hacked version of ET, we could instead use an already existing library which gives us all of the same features (and more). Today, the only widely supported and stable library I'm aware of is Docutils' Document Tree. While the Document Tree is described as an XML representation of a document, Docutils provides a Python API to work with the Document Tree which is very similar to the modified ET API I described above (known parents, TextElement, FixedTextElement...). Unfortunately that API is not documented, although the source code is easy enough to follow.
Until recently, I was of the assumption that to implement something that used Docutils, one would need to define a bunch of directives (etc.) which more-or-less modify the ReST parser. However, take a look at the Overview of the Docutils Architecture. A parser simply needs to create a node tree. In fact, the base Parser class is only a few simple methods. The entire directives mechanism lives in a separate directory under the ReST parser only. Theoretically, one could subclass the base Parser class and build a node tree using whatever parsing method is desired, and Docutils wouldn't care.
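For illustration, a minimal sketch of what plugging into that API could look like (an assumption about the wiring, not working Markdown support):

```python
# Minimal sketch: the base Parser class only asks for a parse() method that
# populates the given document node tree.
from docutils import nodes, parsers

class MarkdownParser(parsers.Parser):
    supported = ('markdown', 'md')

    def parse(self, inputstring, document):
        self.setup_parse(inputstring, document)
        # A real implementation would run Markdown's block/inline parsers
        # here; this just dumps the source into a single paragraph.
        document += nodes.paragraph(text=inputstring)
        self.finish_parse()
```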
For that matter, Python-Markdown would not have to replicate Docutils "Parser" API. We could just use the node tree internally. As a plus, this would give us access to all of the built-in and third party Docutils writers (serializers). In other words, we would get all of Docutils output formats for free.
Additionally, Docutils' node tree also provides for various meta-data to be stored within the node tree. For example, each node can contain the line and column at which its contents were found in the original source document. This provides an easy way for a spellchecker to run against the parser and report misspelled words in the document without first converting it to HTML, among other uses which do not require serialized output.
No, this would not make Python-Markdown suddenly able to be supported by Sphinx. Sphinx is mostly a collection of custom directives built on top of the ReST parser. ReST directives do not make sense in Markdown. However, we could convert Markdown to ReST as many other third party parsers convert various formats to ReST via a ReST writer. There is also at least one third party writer which outputs Markdown from a node tree. By adopting Docutils node tree, Python-Markdown could become part of an ecosystem for converting between all sorts of various document formats (an expandable competitor to Pandoc?).
The downsides to using Docutils are that we would then be relying on a third party library (up till now, Python-Markdown has not relied on any) and all extensions would absolutely be forced to change to support the new version. It is also possible that we wouldn't be able to use the available HTML writer as the default because of some inherent differences between Markdown and ReST (ReST is much more verbose, and we might need to hack the node tree or the writer to get the writer to output correct HTML from a Markdown perspective -- I have not investigated this).
As it stands now, there are various small changes required of extensions between version 2 and 3, but I expect that most extensions would be able to support both without much effort. If we went with Docutils, that would no longer be the case.
Or, maybe this whole thing is a bad idea and we should just continue to use ET as-is.
Any and all feedback is welcome.