Support extending CodeMirror highlighting with plugins #12

Open · bollwyvl opened this issue Nov 3, 2020 · 15 comments

@bollwyvl (Contributor) commented Nov 3, 2020

Markdown/CommonMark is hard to syntax-highlight anyway, and a number of the plugins in #10 change the syntax in ways that are not covered by the existing ipythongfm mode. CodeMirror has been a sticky wicket in supporting Lab 3 (#11), as CodeMirror's approach doesn't…

The diagram modes, like the

    ```mermaid

and

    ```bob

fences, would be covered by new modes, as the existing mode already defers fenced blocks; but other things like footnote and deflist would need a fair amount of massaging.

Generally, there should be a mechanism for a plugin to confidently add (and test) new syntax highlighting features. Hopefully this wouldn't mean rewriting the mode, but who knows!
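For illustration, a minimal sketch of the "new modes" route, assuming CM5's markdown mode falls back to looking the fence info string up as a mode name (which its fenced-code handling does); the mermaid keyword list here is made up, not a real grammar:

```ts
import * as CodeMirror from "codemirror";
import "codemirror/addon/mode/simple";

// Register a crude mode under the name used in ```mermaid fences.
CodeMirror.defineSimpleMode("mermaid", {
  start: [
    { regex: /%%.*/, token: "comment" },             // mermaid comments
    { regex: /"(?:[^"\\]|\\.)*"/, token: "string" }, // quoted labels
    { regex: /\b(?:graph|flowchart|sequenceDiagram|gantt|classDiagram)\b/, token: "keyword" },
    { regex: /--?>|==>|---/, token: "operator" },    // edges
  ],
});
CodeMirror.defineMIME("text/x-mermaid", "mermaid");
```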

@agoose77 (Owner) commented Nov 3, 2020

Agreed, let's bike shed this ⚡(tomorrow for me!)

@agoose77 (Owner)

I think CodeMirror overlay modes might be the right tool for this job. Each extension that wants to extend the syntax can just add an overlay. If this were the case, we wouldn't need to implement an interface for this; the existing CodeMirror interface should (I believe) suffice.
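A minimal sketch of that overlay approach (CodeMirror.overlayMode and the addon path are real CM5; the {directive} token and CSS class are illustrative):

```ts
import * as CodeMirror from "codemirror";
import "codemirror/addon/mode/overlay";

CodeMirror.defineMode("gfm-with-directives", (config) => {
  const overlay = {
    token(stream: CodeMirror.StringStream) {
      // Highlight a hypothetical {directive} marker on top of the base mode.
      if (stream.match(/^\{[a-z][\w-]*\}/)) return "myst-directive";
      // Otherwise consume up to the next "{" candidate (or end of line).
      while (stream.next() != null && !stream.match(/^\{/, false)) {}
      return null;
    },
  };
  // Assumes the gfm mode is loaded; a plugin would overlay whatever markdown
  // mode the editor is already configured with.
  return CodeMirror.overlayMode(CodeMirror.getMode(config, "gfm"), overlay);
});
```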

@agoose77 (Owner) commented Jan 21, 2022

I had a brief chat with the EBP people, and spent a little time looking into the feasibility of this. AFAICT, with the current markdown-it + CM5 approach, each plugin will need to write two different tokenizers: one for markdown-it and one for a CM mode.

The Double-Implementation Problem

This doesn't sit hugely well with me - it seems crazy that we would effectively do the same work twice. The simplest solution here is to use a Markdown library that does include position information, and fit it into a CM Mode. There would be some challenges here:

  • performance: CM modes need to be fast, and I'm not sure how well a full reparse would work with single-character edits while typing. Unlike markdown rendering, this needs to be immediate.
  • look-ahead: CM modes can look ahead, but then we'd need to handle the case where we've performed a look-ahead to produce some tokens, and then read that same line in the next token() call. I'm sure this is doable, but a bit messy.
  • granularity: CM modes want to style syntactic tokens. I'm not convinced that this abstraction is well enforced by markdown-it-style parsers - markdown-it in particular (despite not producing usable position info) makes it very easy to produce "tokens" that are not syntactic. For example, consider (where I've replaced ` with :):

    ::: python
    some code!
    :::

    markdown-it produces a single AST-like Token for the whole block. Conversely, Lezer produces

    Document(
      FencedCode(
        CodeMark(":::")
        CodeInfo("python")
        CodeText("some code!")
        CodeMark(":::")
      )
    )

I mentioned Lezer - CM6 standardises language information around a concrete syntax tree, which can either be generated by Lezer's LR runtime, or by another parser that produces the same structures.

The summary here is that CodeMirror (5 and 6) really needs an incremental parser, both w.r.t. performance and API matching. So, if one wanted to re-use the parser for Markdown rendering and highlighting, then really we want to satisfy that. I think CM6 has a nicer API here: instead of feeding the parser line-by-line and requesting the formatting each time, CM6 wants the entire parse tree, but can later call back in to reparse only a subset. I don't know CM5 well enough to be sure, but I suspect that we would have to handle parse-tree invalidation inside the CM mode ourselves.
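To make the incremental-parse point concrete, a sketch against the real @lezer/markdown / @lezer/common APIs (the document and edit are illustrative; the node-ref iterate callback assumes a recent @lezer/common):

```ts
import { parser } from "@lezer/markdown";
import { TreeFragment } from "@lezer/common";

const doc1 = "# title\n\nsome *text*\n";
const tree1 = parser.parse(doc1);

// A one-character insertion: "text" -> "texts".
const pos = doc1.indexOf("*text*") + "*text".length;
const doc2 = doc1.slice(0, pos) + "s" + doc1.slice(pos);

// Tell Lezer which range changed; unchanged fragments are reused, so the
// cost of the reparse scales with the edit, not with the document size.
const fragments = TreeFragment.applyChanges(TreeFragment.addTree(tree1), [
  { fromA: pos, toA: pos, fromB: pos, toB: pos + 1 },
]);
const tree2 = parser.parse(doc2, fragments);

// The result is a CST: marks like EmphasisMark survive as positioned nodes.
tree2.iterate({ enter: (node) => console.log(node.name, node.from, node.to) });
```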

Relatedly, there is discussion about how to move beyond TextMate grammars for VSCode.

Wider Issues

Before I thought about the re-parse cost, I was going to suggest something radical: having spoken to the EBP team got me thinking about the fact that we have two separate implementations of markdown-it and its plugin ecosystem, one in Python and one in JS. We could think ahead and use a Rust/WASM base for markdown parsing, which could then be used by the Python tools too. I think this is where the space is probably headed, but it's a lot of work.

Additionally, LSP markdown support is something we've talked about, and being able to share some of the implementation there would be nice too.

Conclusions

I think there are two separate issues now being discussed in this post:

  • sharing parsing between CM and markdown rendering in the browser
  • sharing parsing + rendering with backend for EB / LSP

I am just not familiar enough with the problem space yet to know what the best long-term solution is. If incremental parsing is viable for rendered markup, then it sounds like the best approach - it will also reduce our repaint times (although the DOM/VDOM is ultimately going to be the bottleneck, I suspect). However, IIRC we would need a second pass to handle things like link validation, which isn't possible in a single forward pass.

I've seen a few ideas here:

The only WASM-friendly option is the last one. Toastmark extends commonmark.js to add enough information to be able to build an AST. It seems like they rely on being able to use contextual information to move back to a CST:

> Furthermore, the token structure that composes different elements of the Markdown is simple enough that it can be implemented using an abstract syntax tree just by adding a few pieces of information.

I think Toastmark is avoiding the CM Mode API by instead using marks. Maybe that would be a good interim approach, because the Mode API only handles highlighting and indentation (i.e. not folding).
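A sketch of what that marks-based interim could look like (markText/getAllMarks are real CM5 API; the HighlightToken shape and the wholesale clear-and-repaint are illustrative simplifications):

```ts
import type * as CodeMirror from "codemirror";

interface HighlightToken {
  from: CodeMirror.Position;
  to: CodeMirror.Position;
  cssClass: string; // e.g. "cm-keyword"
}

// Paint (from, to) ranges produced by any position-aware parser.
function applyHighlights(doc: CodeMirror.Doc, tokens: HighlightToken[]): void {
  doc.getAllMarks().forEach((mark) => mark.clear()); // a real impl would diff
  for (const token of tokens) {
    doc.markText(token.from, token.to, { className: token.cssClass });
  }
}
```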

Additionally, rendering / analysis tools might want more than the CST - an AST would be much easier to render. This would warrant a second pass.

I am considering whether it's better to take a longer-term view of the solution here. Rather than investing time into getting highlighting working with CM5 Modes + markdown-it (and writing everything twice), maybe actually moving to lezer (or at least generating a lezer CST) would be a good thing™ in the long run? By dropping markdown-it we would immediately lose the entire ecosystem, which would not be ideal. However, the core Markdown extensions that make it worth using are not too complex. Maybe a community-wide effort here would be sufficient to keep things ticking over?

I can't see a way that we can have our cake and eat it unless we make some bold decisions regarding the future plans here :/

Useful links / recap:

@chrisjsewell commented Jan 21, 2022

@agoose77 something I think you mentioned to me, that you don't mention here, is https://github.com/syntax-tree/mdast (based on https://github.com/syntax-tree/unist), which is basically a nice, language-agnostic (JSONable), extensible AST format for Markdown (and it also includes line/column source mapping).

I'm not sure how this would fit in with the incremental parsing (and lezer etc.), but it feels like a nice, standardised format to centre around, without tying yourself to one "technology" and its particular AST format.

I'm thinking of writing the MyST spec basically as an extension on mdast, and then, in principle 😬, you can just use any parser, renderer, or LSP that supports it.

@bollwyvl (Contributor Author)

Yeah: the balance between future architectural correctness and getting software into peoples' hands is elusive.

In the near term...

We probably need to continue making the most pragmatic choices, such that we can ship software that folk can use, today, with other tools they like. So for now: we have to deal with CM(5). The first step might be a new ipythongfm that is a little better, not optimal... in my mind, the most important features being non-bog-standard markdown magic tokens, e.g. {directive}, etc., and the ability to switch into dedicated modes.

So, the really messy option today to make that possible would be maybe some kinda middleware junk:

```ts
export interface IMarkdownModeOpts {
  modes: { [key: string]: any };  // initially: gfm, tex
  multiplexingModes: any[];
  config: CodeMirror.EditorConfiguration;  // the runtime ones
  modeOptions?: any;  // the runtime ones
}

export interface IPluginProvider {
  // ...
  syntaxExtension: (options: IMarkdownModeOpts) => IMarkdownModeOpts;
}
```

...and then we stack everything up when a mode is requested.
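In other words (still using the hypothetical interfaces above), a fold over the registered providers:

```ts
// Each plugin's syntaxExtension transforms the accumulated options; the
// result feeds whatever builds the final multiplexed mode.
function buildModeOpts(
  providers: IPluginProvider[],
  base: IMarkdownModeOpts
): IMarkdownModeOpts {
  return providers.reduce((opts, provider) => provider.syntaxExtension(opts), base);
}
```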

But longer term...

I am a pretty big proponent of WASM. It seems like an appropriate thing for a rendering engine, but feels like overkill for "just" syntax highlighting. Indeed, we had to deploy some WASM for jupyterlab-simple-syntax because TextMate highlighting bundles use a flavor of regex that is... non-trivial. Felt icky. But for full LSP-grade analysis, as has been noted in this thread and elsewhere (e.g. sync scrolling)... yeah, might as well get your syntax highlighting in the same parse.

Moving outside of the text editing/rendering experience: the jupyterlite experiment has been great, showing that a (mostly) familiar interactive computing experience (pyolite on pyodide on emscripten) is workable... but as a new platform, we're still limited in many ways. I'd say stay tuned in 2022 for more composable stuff that people can plug into...

I think WASM is going to be the bottom of a next-level version of reproducible, interactive computing, and jupyter is well suited to be a banner under which it gets into users' hands. I doubt the next generation of users will think so much about what language a particular function is implemented in, and whether code is being run in-loop in the browser per keystroke or being executed in a massively parallel HPC setting. Things like WASM Types, extended to work with Arrow, and wrapped with metadata like real SI units, will make doing real science pretty awesome.

@chrisjsewell

Thanks @bollwyvl - is there any good reading on WASM? It's not something I've looked into much yet.
Does it basically mean you have to implement the parser in C++/Rust?
And @agoose77, did you mention that there is also some possibility of direct integration with Python? (as opposed to having to call it in a subprocess)

@agoose77 (Owner) commented Jan 22, 2022

Yes, my thoughts on this topic are motivated by the wider landscape of who is using jupyterlab-markup, and who needs markdown rendering more generally. I just don't like the fact that if I want to implement extensions to commonmark that support syntax highlighting + executable books, I'd have to write the same parser/lexer three times!

With respect to delivering solutions now, I am currently in favour of not using the CM5 Mode API, and instead relying solely on the Marks API. I think that is workable, and if so it would allow us to get started on using a high-granularity parser today.

The common problem that we all have is generating a document-aware syntax tree. Whether that is a CST or AST is less important. If we could standardise the parsing of Markdown for "commonmark extensions", then LSP + EB + Jupyter would all get that for free. The rendering again could be shared between EB + Jupyter. I don't know how VSCode would fit into this w.r.t. rendering - they currently seem reluctant to expose the Markdown renderer itself as an extension point. Maybe it wouldn't be so bad to add another editor, which seems to be what they recommend.

mdast (based on unist) does include position information, and so it should be possible to reconstruct the concrete syntax tree from the AST and the original source. For highlighting purposes, one would probably want to keep the CST around and generate the AST from that. As a general syntax tree, it should be possible to generate mdast from lezer, for example. I think choosing mdast as a specification would not be a bad idea. It would certainly move one step towards unifying the ecosystem.
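For reference, a "# title" document in mdast terms, showing the unist position info that makes round-tripping to the source possible (this is the real mdast/unist shape; the literal itself is just an example):

```ts
import type { Root } from "mdast";

const tree: Root = {
  type: "root",
  children: [
    {
      type: "heading",
      depth: 1,
      children: [
        {
          type: "text",
          value: "title",
          // line/column/offset let a consumer map any node back to source
          position: {
            start: { line: 1, column: 3, offset: 2 },
            end: { line: 1, column: 8, offset: 7 },
          },
        },
      ],
      position: {
        start: { line: 1, column: 1, offset: 0 },
        end: { line: 1, column: 8, offset: 7 },
      },
    },
  ],
};
```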

WASM & Rust (which can compile to WASM) are both accessible from Python. This means it is possible to write the implementation once, and re-use it in Python + JS. Of course, this would mean writing code in the common denominator language, e.g. Rust. One could do this in Python as Python can be compiled to WASM, but right now that involves a lot of work & bloat as @bollwyvl alludes to.

@agoose77 (Owner) commented Jan 22, 2022

There is also the benefit of standardising the AST for existing tools: the ToC extension IIRC parses the Markdown to identify headings in notebooks / Markdown documents. Having a generated AST / being able to request the AST would mean:

  • we only do this parsing once (hopefully)
  • syntax extensions are supported out of the box

Another benefit is ProseMirror integration: I imagine that is a lot easier if you're working at the level of an AST.
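On the ToC point above: with a shared mdast tree, heading extraction becomes a tree query rather than a second parse. A sketch using the standard unist utilities (assuming a recent unist-util-visit with the named export):

```ts
import { visit } from "unist-util-visit";
import { toString } from "mdast-util-to-string";
import type { Root, Heading } from "mdast";

// Collect every heading's depth and flattened text from an mdast tree.
function tableOfContents(tree: Root): { depth: number; text: string }[] {
  const headings: { depth: number; text: string }[] = [];
  visit(tree, "heading", (node: Heading) => {
    headings.push({ depth: node.depth, text: toString(node) });
  });
  return headings;
}
```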

@bollwyvl (Contributor Author)

> write the same parser/lexer three times!

Well... part of that comes from tools hand-writing parsers/lexers in an implementation language in the first place. But markdown is a crazy mess to parse properly, even before adding extensibility. Still, if starting over... rather than jumping straight to PARSER IN RUST NOW, at least taking a cursory look at a portable stack like antlr or lark, which focus effort on writing declarative specifications and then generating implementations, might be worthwhile.

Indeed: Jupyter would really benefit from a declarative (preferably JSON-compatible) way for e.g. kernels to describe their language grammars (especially dynamic deviances a la jupyter-lsp/jupyterlab-lsp#191). Briefly, on jupyter-lsp: despite its warts, for the larger code-editing mission, we can't afford to lose what CM5 already represents to the community. We are excited to get our hands on CM6. Maybe I'll warm simple-syntax back up, as those TextMate bundles, supported by half the editors out there, would be even better... but again, see magic-regexes-that-need-wasm! Which brings us to...

> any good reading on wasm

Here's a high level site, some specs (including the forthcoming types) as well as some nuts-and-bolts blog posts, like asciinema, and some position pieces.

> you have to implement the parser in C++/Rust?

WASM is a compilation target for a number of languages now: C, Rust, Erlang, Go, Haskell, etc. There are also some higher-level languages, such as the TypeScript-like AssemblyScript. Initially, this grew out of the corpus of tricks in asm.js, and was meant to enable reasonably performant in-browser execution of otherwise-opaque software: in 2022, it's not much of a stretch to say it's easier to run a lot of things in the browser than natively (and well) on Windows. More recently, WASM is proving interesting as a non-browser technology due to its sandboxing; even more weirdly, Firefox will soon be shipping some vendored stuff compiled from C to WASM, and then back into C!

> direct integration with Python

In JupyterLite, which only cares about (real) browsers, we're using pyodide to deliver the IPython/ipykernel stack, including ipywidgets. Most packages run unmodified! But the biggest win is that you can deploy certain interactive experiences to, theoretically, millions of simultaneous users (willing to maybe download ~100mb of python to their browsers 🤣) with just a free/low-cost static web host and a CDN.

> work & bloat

Pyodide is basically a CPython distribution, and has a conda-like build chain for standing up a Linux-like system with numpy/pandas under emscripten. Unfortunately, its build chain is just conda-like... there's some work starting soon to see if it can actually be conda(-forge), so that we can start getting automated updates of thousands of packages, instead of one per pyodide release to update/add libraries.

However: the ticket to get in the door for that python integration is ~20mb, per kernel. As such, we have been pushing back against using any python wasm as part of the "web server" that runs in the browser, instead re-implementing key parts of jupyter_server and jupyterlab_server in typescript.

Meanwhile...

On the "server" there are a number of standalone runtimes, such as wasmer and wasmtime, as well as things that are shooting for even greater security such as enarx. Wasmer, in particular, has many language-specific bindings, such as wasmer-python. The win here for jupyter-adjacent projects would be to not be chasing the moving target of python ABI complexity per-platform-per-python-per-wheel, and just be able to ship a single WASM blob that would execute anywhere, including the browser, but enjoy a performance profile closer (by order of mangitude) to C-level code than python code.

@chrisjsewell

> least taking a cursory look at a portable stack like antlr or lark,

I'm not sure it's even theoretically possible to parse CommonMark as a context-free grammar? (See e.g. https://roopc.net/posts/2014/markdown-cfg/.) Let alone with any syntax extensions.

@chrisjsewell

> Here's a high level site, some specs (including the forthcoming types) as well as some nuts-and-bolts blog posts, like asciinema, and some position pieces.

Cheers, will check it out!

@bollwyvl (Contributor Author)

> even theoretically possible to parse commonMark as context free grammar

Right, I'll grant that even "old high markdown" is basically the social-media engrish of markup languages.

But there are grammars, and then there are grammars. For syntax highlighting, especially in a narrative language, it just needs to be good enough, fast enough, and really good at handling broken state.

Indeed, having a lenient grammar with terminals like IDK_MAYBE_BROKEN_LOL that quickly consume ambiguous input until the next block boundary is probably not the worst thing in the world... for syntax highlighting.
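A toy sketch of that lenient block scanner (the token names and line classifications are entirely illustrative):

```ts
type BlockKind = "heading" | "fence" | "blank" | "idk_maybe_broken_lol";
interface BlockToken { kind: BlockKind; from: number; to: number }

// Never fails: anything ambiguous gets lumped into the catch-all token
// until the next recognisable block boundary.
function scanBlocks(src: string): BlockToken[] {
  const tokens: BlockToken[] = [];
  let offset = 0;
  for (const line of src.split("\n")) {
    const to = offset + line.length;
    const kind: BlockKind = /^#{1,6}\s/.test(line) ? "heading"
      : /^(```|~~~|:::)/.test(line) ? "fence"
      : /^\s*$/.test(line) ? "blank"
      : "idk_maybe_broken_lol";
    tokens.push({ kind, from: offset, to });
    offset = to + 1; // skip the newline
  }
  return tokens;
}
```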

And there's no helping things like footnote-style markdown refs.

But even something block level would be a fairly big step up for portability, especially for the case of embedding multiple syntax modes inside other syntaxes.

> syntax extensions

in lark, at least, one can make extensible grammars... but if that particular feature isn't portable to other implementations, it wouldn't be as much fun. And I would not wish runtime antlr generation on anyone!

@agoose77 (Owner) commented Jan 23, 2022

> least taking a cursory look at a portable stack like antlr or lark,
>
> I'm not sure it's even theoretically possible to parse commonMark as context free grammar? (See e.g. https://roopc.net/posts/2014/markdown-cfg/). Let alone with any syntax extensions

Right - from the reading that I've done (given that I've not had time to look at it myself yet), writing a formal grammar for Markdown is a very difficult challenge. The author of that link makes a few other comments elsewhere, and essentially their argument is:

  • Any text is valid Markdown, so it is not possible to formalise it in a conventional grammar.
  • Given this, the best thing for a markdown specification is an algorithm implementation e.g. state machine, that encodes the parse rules unambiguously.

There are definitely a number of different concerns/priorities in this thread. As I see them, we have:

  • Performant parsing
  • Non-DRY code for plugin authors
  • Implementation conformance between Jupyter projects
  • Code duplication between ecosystems
  • Standardisation of markup

Maybe some of these concerns do not need to be solved any time soon. But, if we allow ourselves the opportunity to consider them, we have:

  • Performant parsing (for edits) - use an incremental parser. Ultimately, a parser needs to avoid scaling O(doc size) with any small change.
  • Non-DRY code - use the same tokenizer / parser (depending upon how you highlight) for HTML generation and syntax highlighting
  • Implementation conformance - define a canonical implementation, or share the implementation itself! (e.g. targeting WASM)
  • Code duplication between ecosystems - share the implementation, or interop e.g. a headless browser process.
  • Standardisation - see below

I did note that roopc implemented a Markdown specification for a modified Markdown. However, once you start having variations (plugins) on this specification, it would be difficult to resolve how ambiguities should be handled. The easiest and most robust solution that I can see to that is to just have a canonical implementation and decree that that is the right way to parse it. My gut feeling is that the best direction for Jupyter projects as a whole is to:

  • Require that plugins do not break commonmark (test) conformance
  • Share implementations as much as possible

We are already doing this in part with EB + jupyterlab-markup: both use markdown-it / markdown-it ports, and (assuming conformance) that is more consistency than the range of Markdown renderers in use by different platforms (JupyterLab/Notebook with Marked.js; Colab? Kaggle? the GitHub renderer?).

If we don't consider an implementation-defined spec, then the next best thing is a big test suite defining the implementation - something like the sketch below.
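A sketch of what such a conformance gate could look like for a plugin ("spec.json" stands for the example dump from the CommonMark spec repo; the path and the plugin under test are assumptions):

```ts
import MarkdownIt from "markdown-it";
import { readFileSync } from "fs";

interface SpecExample { markdown: string; html: string; example: number }

const examples: SpecExample[] = JSON.parse(readFileSync("spec.json", "utf8"));
const md = new MarkdownIt("commonmark"); // .use(pluginUnderTest) would go here

// Require that enabling the plugin changes none of the CommonMark examples.
for (const ex of examples) {
  if (md.render(ex.markdown) !== ex.html) {
    console.error(`example ${ex.example} diverges from CommonMark`);
  }
}
```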

@bollwyvl (Contributor Author)

That's a lot of stuff, and sorry for encouraging wandering off down the wasm path.

I look forward to a future where a user-driven set of choices is documented and honored by the tools (a la #13), but feel like "conformance" is a very big word to use in this use case, and definitely out of scope of a PR that answers the title/description of this issue.

Basically, after said PR was merged, installing a future 1.x release of this extension would extend the existing JupyterLab 3.x editing experience to highlight some of the new syntax it supports rendering e.g. mermaid (now part of GFM), without breaking the experience provided by other extension authors (e.g. LSP, modes from other languages, collaborative editing with presence). Ideally, this would be managed in a way that downstreams of this plugin could also add additional features... but maybe #40 would demand this anyway.

Even having gross block-level modes, as supported by the existing cm5+ipythongfm, would be sufficient for today's notebook markdown cell editing experience and for markdown documents of reasonable size, like a project's README; whole (jupyter) books are again a whole other beast.

@chrisjsewell

Just as an additional point of reference, you also now have https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocument_semanticTokens

I guess this is similar to overlays, in that it is not intended to provide the full highlighting, just to enhance it.
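For anyone poking at this: per the LSP spec, semantic tokens arrive as a flat array of 5-value groups with delta-encoded positions. A small decoder sketch (the token shape and names are mine):

```ts
interface SemanticToken {
  line: number;      // absolute 0-based line
  char: number;      // absolute 0-based start character
  length: number;
  tokenType: number; // index into the negotiated token-type legend
  modifiers: number; // bitset over the modifier legend
}

function decodeSemanticTokens(data: number[]): SemanticToken[] {
  const out: SemanticToken[] = [];
  let line = 0, char = 0;
  for (let i = 0; i < data.length; i += 5) {
    const [dLine, dChar, length, tokenType, modifiers] = data.slice(i, i + 5);
    line += dLine;
    char = dLine === 0 ? char + dChar : dChar; // dChar restarts on a new line
    out.push({ line, char, length, tokenType, modifiers });
  }
  return out;
}
```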
