Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a parse tree structure (a CST) in addition to the AST in oder to preserve all relevant details of the original markdown #223

Open
nojb opened this issue Jul 6, 2020 · 17 comments
Assignees

Comments

@nojb
Copy link
Contributor

nojb commented Jul 6, 2020

Should the AST preserve as much as possible of the original markdown source, making it a bit more inconvenient to use, or should it strive for convenience and try to simplify as much as possible of the original document?

Features that are currently simplified in the AST (there may be a couple more, will amend if necessary):

  • links are inlined, so references (eg [foo][bar]) are not kept in the AST.
  • the kind of delimiter used in emphasis/strong emphasis (_ or *) is not kept in the AST.
  • the number of backquotes used to delimit a code block is not kept in the AST.

What do you think?

@nojb nojb changed the title Shoudl AST preserve as much as possible of the original markdown? Should AST preserve as much as possible of the original markdown? Jul 6, 2020
@shonfeder
Copy link
Collaborator

What about providing two different representations (as suggested in #216)? At the point that we are encoding the location of link references and noting which character is used for emphasis, or how many characters are used to indicate code blocks, I really think it's reasonable to start seeing the representation as capturing the CST. So we could provide a Cst module that reflects this kind of stuff, and an Ast module (derived by mapping out of the former) that provides a much sleeker interface.

@shonfeder
Copy link
Collaborator

shonfeder commented Jan 22, 2021

While study the spec.txt I just encountered this specification, which i think helps put a point on this problematic:

A [link reference definition] does not correspond to a structural element of a document.

The question here, I think, is whether the AST is meant to represent the structure of a document or the structure of markdown syntax. From the perspective of document modeling, things like link labels are just accidental complexity introduced for the sake of a particular concrete syntax. But from the perspective of a faithful representation of the flavour of markdown we're supporting, the difference between a reference link and an inline link may be significant.

As a user, I had initially been approaching Omd as something like (a limited) pandoc: it reads markdown into an AST representing a document structure. However, iiuc the direction the package has subsequently evolved is more towards a high-fidelity representation of concrete markdown syntax. I suspect that different users will want these different interfaces, but I don't have any empirical evidence to support that yet (except, perhaps, the success of pandoc. But of course that project has a much wider scope than Omd).

@jfrolich
Copy link
Contributor

I think capturing this in the AST would be good because then it's possible to build a markdown printer on the same AST (or converting markdown to something similar). The added complexity wouldn't be too large (I think?).

@shonfeder
Copy link
Collaborator

shonfeder commented Jan 22, 2021

@jfrolich that seems like a good motivating example!

Just to clarify: I'm sure there are use cases where we'd want to preserve this kind of detail. The question to my mind is whether most use cases call for such complications. What I've advocated for so far is that we support both these specialized use cases and the more general ones by having two different representations, a CST and an AST (this approach is used in Dune, for instance). If you want to do more low-level handling of markdown (e.g., requires knowing whether a * or _ was used for emphasis) you'd use the CST. But in the more common case of just wanting to operating on the semantics of the document, we'd want a streamlined AST that gets rid of such implementation details.

re: a markdown printer, what do we imagine is the source from which its printing? Let's consider two cases:

  1. Suppose we want a markdown formatter/pretty-printer, so from markdown -> markdown.

Let's say one requirement is that we change the source document as little as possible while removing irregularities. In this case, we'd probably want to preserve information like the link reference label. However, iiuc, in that case we'd actually need much more detail than we currently have, because we'd also need to disambiguate these two cases:

[foo][bar]

[bar]: /url
[bar]: /url

[foo][bar]

So we'd nee to record the relative location of all link reference definitions. Would we require consumers of the tree look through the entire document to find the url of their links? Or would we put the URL in both the nodes representing the link and the ones representing the link reference, and require users who didn't want to preserve the link reference defs to filter them out? Either case causes twice as many nodes in the tree to represent links as we currently have. IMO, twice as much complexity is a lot more :)

On the other hand, let's say our requirements were different, and that we don't need to preserve the source structure as much as possible. Instead, we just rewrite everything into a minimal, regular markdown form. Let's say we did this by inlining all links. Then both of these examples would become:

[foo](/url)

and our syntax tree can be correspondingly simple.

So it seems that even when we're printing markdown -> markdown, whether or not we need such details depends on our specific requirements and aims.

  1. Let's imagine instead we are printing html -> markdown. Well, in this case, afaik, link reference definitions are irrelevant. In HTML all links are inline. Of course, we could decide on a convention that instead favored representing all links via the reference definition syntax. But I'm not sure why we'd do this.

To conclude, I'll just reiterate my position that it seems to me that only some very specific use cases with particular requirements call for preserving this level of detail. And the cost is not trivial, as each complication could effectively double the tree size for representing the element in question. So, if we are going to further complicate the tree presented to consumers of this library, I think we should be very clear about who benefits and how exactly it serves their needs.

IMO, if we do further complicate our tree, we should only do so as one optional representation.

@shonfeder
Copy link
Collaborator

shonfeder commented Feb 20, 2021

Perhaps what we want here is something like this;

  1. Parse input into an intricate parse tree, preserving lots of minute details like line numbers (Add locations to the AST #21), the character used for emphasis (Capture type of Emphasis/Strong in AST #214), and link reference anchors (in the proper order and with their location). Let's call this type parse_tree. This will be the primary internal representation.
  2. Extend the API with a function to_parse_tree : string -> parse_tree. This will serve those who want low-level access to parse information, but don't want to implement their own parsers.
  3. Refocus and refine the doc type to provide a Pandoc-like pattern-matching interface to the semantic content of the document (for which details like line numbers and anchor locations are completely irrelevant). This will serve those who want a high-level representation of the content encoded in a markdown document.

What do people think of this approach?

@shonfeder
Copy link
Collaborator

shonfeder commented Mar 13, 2021

During the maintainer hand-off meeting last week we discussed versioning the API (particularly the AST), to give us the freedom of evolving our representation while maintaining backwards compatibility. Would the proposed use of a parse tree representation help simplify this?

The idea: instead of needing to write mappings back from the latest AST to previous versions, any new versions of the AST would just be a new maps out from the parse tree. When the parse tree itself is updated, of course, we'd also need to make adjustments to the various versioned ASTs, but I think that should should be very rare.

(User's of the parse tree would need to take on the risk of less backwards compatibility guarantees in exchange for the added flexibility of using a lower-level representation.)

@patricoferris
Copy link
Contributor

Hi all,

I thought I'd add some thoughts here as I've been working on a Omd.doc -> string function based off of the AST that is currently available (that very much WIP branch can be found here https://github.com/patricoferris/omd/tree/omd-print) -- the basic idea I was following was to take the common mark test suite, parse it, print it back to file, re-parse it and compare the HTMLs and it is highlighting some corner cases (I haven't made my way through the many tests yet) that a parse tree structure might very much make easier.

The first one was nested emphasis for example:

Omd.of_string "_*hello*_"
- : Omd.doc = [Omd.Paragraph ([], Omd.Emph ([], Omd.Emph ([], Omd.Text ([], "hello"))))]

A naive printer (like my first one) might just choose * or _ for empasis but then you would get:

Omd.of_string "**hello**"
- : Omd.doc = [Omd.Paragraph ([], Omd.Strong ([], Omd.Text ([], "hello")))]

This isn't too hard to catch, but for me it's a +1 in the corner for having some parse tree like structure which captures most (all) information about the markdown. Another interesting one is the following markdown snippet:

hello
 world
======

Resulting in

[Omd.Heading ([], 1,
  Omd.Concat ([],
   [Omd.Text ([], "hello"); Omd.Soft_break []; Omd.Text ([], "world")]))]

In this case we have a heading with a newline (I didn't know this was possible!) and again a naive printer might choose # for a heading of size 1, but this results in something that is later parsed as:

[Omd.Heading ([], 1, Omd.Text ([], "hello"));
 Omd.Paragraph ([], Omd.Text ([], "world"))]

Again this can be handled with more logic in the printer, but perhaps the parse tree would greatly reduce the complexity of the printer?

Just some thoughts as I've been implementing this code, thanks for all the hard work that goes into maintaining omd :))

@shonfeder
Copy link
Collaborator

Thanks very much or sharing your thoughts here, @patricoferris.

These examples and your rationale seem compelling to me. I think having a parse tree might also help us simplify our parsing routines considerably, likely making it easier to squash the remaining bugs, and easier to support various markdown extensions in the future. IMO, the slight reduction in pure speed we might expect from having to go through an intermediate tree would be well worth the gains in maintainability and extensibility.

Currently open issues a parse tree might help with include:

I'll plan to take a crack at this in a few weeks, as I hope to have some of backlog cleared by then.

@shonfeder shonfeder changed the title Should AST preserve as much as possible of the original markdown? Add a parse tree structure (a CST) in addition to the AST in oder to preserve all relevant details of the original markdown May 7, 2022
@shonfeder
Copy link
Collaborator

It seems to me like conversation around this has converged on the desirability of having an intermediate parse tree, so I've renamed the issue accordingly.

@shonfeder
Copy link
Collaborator

shonfeder commented May 7, 2022

Important reminder here about the context of previous design decisions: when @sonologico reworked the AST in #234, part of the proposed design was

Make the types generic over attributes as well. This could give us a way to support locations and extra details for users that care about it in a lightweight manner.

I think this should provide a way to capture most of the metadata we want from the CST approach, while keeping the main overall API provided by the current AST.

Some preliminary details on this approach provided here: #228 (comment)

@ejgallego
Copy link

ejgallego commented Aug 7, 2022

Hi folks, lack of locations in the parse tree is making my life complex when trying to use omd to support literate programming styles (cc: #21 )

I am wondering tho if the current AST type could story attributes in a per-node uniform basis, as a step to implement this. Concretely, instead of

type 'attr block =
    | Paragraph of 'attr * 'attr I.t
    | List of 'attr * list_type * list_spacing * 'attr block list list
    | Blockquote of 'attr * 'attr block list
    | Thematic_break of 'attr
    | Heading of 'attr * int * 'attr I.t
    | Code_block of 'attr * string * string
    | Html_block of 'attr * string
    | Definition_list of 'attr * 'attr def_elt list

you could do [as we do in other projects:

type 'attr block_r =
    | Paragraph of 'attr I.t
    | List of list_type * list_spacing * 'attr block list list
    | Blockquote of 'attr block list
    | Thematic_break
    | Heading of int * 'attr I.t
    | Code_block of string * string
    | Html_block of string
    | Definition_list of attr def_elt list
and 'attr block =
  { v : 'attr block_r
  ; attr : attr
  }

What do you think? That would make the parsing modifications for Loc.t a bit simpler IMVHO.

@sonologico
Copy link
Collaborator

sonologico commented Aug 19, 2022

The uniformity of access is nice. The compiler parsetree sets a good precedent for the approach. I wouldn't mind switching the representation to that.

A few helper functions to hide away the non-uniform access could also work I imagine. I don't have a preference tbh.

@shonfeder
Copy link
Collaborator

Thanks for helping us explore the design space here, @ejgallego. iiuc, the proposed change to the AST would largely be a reversion back to the representation we left behind due the reasons discussed in #228.

My inclination is to prioritize providing a type representation that allows pattern matching to provide a clean UI for end users to transform and consume the AST. My initial feeling is that the proposed change makes it slightly easier for implementors at the expense of a worse UI for end-users. But I'd definitely be open to some explanations/sketches showing why I'm wrong here.

If, however, I am right that this would make for a worse UI for most end-users of the library, then I'd be in favor of @sonologico's solution that would provide some accessor functions.

@shonfeder
Copy link
Collaborator

@tatchi has asked to take a crack on this, and we'll begin entering into preparatory and design work to plan out the implementation. I'm assigning them to the issue to record the fact! 🎉

@ejgallego
Copy link

Thanks for helping us explore the design space here, @ejgallego. iiuc, the proposed change to the AST would largely be a reversion back to the representation we left behind due the reasons discussed in #228.

@shonfeder , if you mean the attribute sharing node, I don't think that is a revert to #228 , pattern matching is kept for users, but indeed they gotta pattern match e.v . This is well tested in some OCaml projects.

Building terms is a bit more annoying, but you get rid of some other stuff, compare:

[ Heading ([], 1, [Text "A document"]);
    List ([], Bullet '-', Tight,
          [Paragraph ([], [Text "Item 1"]);
           Paragraph ([], [Text "Item 2"]);
           Paragraph ([], [Text "Item 3"])]);
    Paragraph ([], [Link ([],
                          {label = Text "and a link";
                           destination = "./somewhere";
                           title = None})])]

vs
List.map _n
[ Heading (1, [ _n (Text "A document")])
; List (Bullet '-', Tight,
          [ _n (Paragraph [ _n (Text "Item 1")])
          ; _n (Paragraph [ _n (Text "Item 2")])
          ; _n (Paragraph [ _n (Text "Item 3")])
          ])
; Paragraph ([ _n (Link label = Text "and a link";
                                    destination = "./somewhere";
                                    title = None})])]

etc... where let _n v = { v; attr = [] }

So actually, I find the pattern matching and building actually clearer in that second version.

@shonfeder
Copy link
Collaborator

shonfeder commented Sep 12, 2022

@ejgallego

This is well tested in some OCaml projects

Do the projects you have in mind expose a data-type representing an AST that can be manipulated via pattern patching as the primary UI for end-users? Could you share which projects you have in mind?

For the high-level data structure presented to end users for manipulation, we have been targeting an interface that resembles Pandoc's or Omd 1's. This seems pretty well suited for representing a simple document model for end users.

I don't see any obvious advantage in the second example you've given. It looks (slightly) more complicated, more verbose, and (obviously) more expensive to compute. Perhaps I am missing something? In any case, w/r/t to ease of constructing terms construction for end users is probably best delegated to helper functions (as per #252) IMO.

I'm mostly concerned about the representation of the AST as a way of inspecting and matching against, and I think you agree that the uniform attributes would make this less convenient? The representation you're suggesting may be more useful for the CST?

@ejgallego
Copy link

ejgallego commented Sep 12, 2022

Do the projects you have in mind expose a data-type representing an AST that can be manipulated via pattern patching as the primary UI for end-users? Could you share which projects you have in mind?

I'd like to parse Markdown and still have info such as location etc...

I don't think there is a lot here other than a matter of style, I find using a node for the attributes a bit more uniform, and you can use CAst.map etc... to preserve them to some extent, but of course you can also defined ast_map.

I guess it is purely a matter of style.

I don't see how having uniform attributes make thing less convenient, in the end the types:

type 'a ast = { v: 'a ; attrs : attrs }
type md_r = | A of .. | B of ...
and md = md_r ast

and

type md = | A of attrs ... | B of ...

are isomorphic (and both allow pattern matching etc...) the first being a bit more "modular".

This was referenced May 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants