Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discuss: 3rd party language packages spec #2328

Closed
joshgoebel opened this issue Dec 19, 2019 · 76 comments · Fixed by #2312
Closed

Discuss: 3rd party language packages spec #2328

joshgoebel opened this issue Dec 19, 2019 · 76 comments · Fixed by #2312
Labels
big picture Policy or high level discussion language

Comments

@joshgoebel
Copy link
Member

joshgoebel commented Dec 19, 2019

Starting a discussion for an official 3rd party grammar format/spec (directory structure, how modules are exported, etc). I've been working on adding support to the build process for seamlessly building 3rd party grammars. The idea being that you just check them out:

mkdir extra
cd extra
git clone git@github.com:highlightjs/highlightjs-robots-txt.git
git clone git@github.com:highlightjs/highlightjs-tsql.git

And then building "just works":

node ./tools/build.js -t browser :common tsql robots-txt

And high-level grammar testing just works (detect tests, markup tests, etc) in context of the whole test suite. Just run tests like normal. The tests for extra languages are bundled with the extra languages, but magically "just work".

In looking at the existing 3rd party stuff it seems we have two major types of layouts. I think perhaps we could support both.

The slightly flatter npm package:
Screen Shot 2019-12-19 at 9 03 10 AM

A more "bundled"/traditional layout:
Screen Shot 2019-12-19 at 9 03 21 AM

One alternative to supporting both is to always prefer/require the bundled layout. I prefer this layout slightly as it makes it easier to add multiple languages to a single "package/repo". For example you might bundle Python, Python REPL... bundled layout gives you an easy way to do that (keeping the same directly structure we use internally). Also if we ever again decided to pull a 3rd party into the core library it'd be trivial to do with the bundled layout as it'd just essentially be copying files from one repo to another.

As to how NPM would work with bundled layout I'd suggest an index.js that simply required the individual modules and exported a hash of them by name:

# index.js
const python = require("./src/languages/python")
const pythonREPL = require("./src/languages/python_repl")
module.exports = { python, pythonREPL }

Thoughts?

@joshgoebel joshgoebel added language big picture Policy or high level discussion labels Dec 19, 2019
@joshgoebel joshgoebel changed the title 3rd party language official format Discuss: 3rd party language official format Dec 19, 2019
@joshgoebel
Copy link
Member Author

joshgoebel commented Dec 19, 2019

Also an anti-pattern (in my opinion) I'm seeing is:

module.exports = function(hljs) {
  hljs.registerLanguage('robots-txt', hljsDefineRobotsTxt);
};

module.exports.definer = hljsDefineRobotsTxt;

Resulting in custom registration code that hides the language name/key:

var hljsDefineRobotsTxt = require('highlightjs-robots-txt');
hljsDefineRobotsTxt(hljs);

I'm not sure what the big advantage here is and I'd prefer to instead see:

module.exports = function(hljs) {
  return {
    case_insensitive: true,
    contains: [
    ...

or (same thing really):

function tsql(hljs) { ... }

module.exports = tsl

And README examples like:

var tsqlGrammar = require('highlightjs-robots-tsql');
hljs.registerLanguage('tsql',tsqlGrammar);

IE, what is exported is the grammar function itself, nothing else. Prefer to let people do the registration themselves. This will allow a simpler build process and also I like that it keeps the naming explicit rather than implicit.

@joshgoebel joshgoebel changed the title Discuss: 3rd party language official format Discuss: 3rd party language grammar official package format Dec 19, 2019
@joshgoebel joshgoebel mentioned this issue Dec 19, 2019
@joshgoebel joshgoebel changed the title Discuss: 3rd party language grammar official package format Discuss: 3rd party language grammar official packaging format Dec 19, 2019
@joshgoebel joshgoebel added this to the 10.0 milestone Dec 19, 2019
@joshgoebel
Copy link
Member Author

joshgoebel commented Dec 19, 2019

From another thread about testing 3rd party languages in isolation.

Feel free to prototype some sort of prospective shared .travis.yml in https://github.com/highlightjs/highlightjs-shexc.

That might wait until later (or require someone else to step up). My focus right now is to make things "just work" in the context of a working highlight.js checkout. (Since you need that to build the actual library anyways, etc). And of course that's immediately most useful to me, as a maintainer of the library - everything in one place.

So checkout highlight.js and then vendored language projects, run full highlight.js test suite. No 3rd step.


The other side of things (as you suggest) would be to have your language stand-alone... then you'd pull in highlight.js as a dependency via npm/yarn... and you might have some sort of scaffold to run your custom tests plus "standardized" detect/markup test... of course currently none of the testing stuff ships with our npm library at all since the library itself is just a build product of the actual source.

Open to discussion this element of things here, but it might be outside the scope of what I'm currently working on.

@joshgoebel
Copy link
Member Author

joshgoebel commented Dec 19, 2019

Another question is how to distribute "web"/CDN versions of languages. Currently I was imagining the source files would all be CommonJS (module.exports)... so they'd work with Node out of the box, but require building for distributable/web usage. I'd be game for ES6 modules, but Node.js isn't 100% there yet, so CommonJS is probably simpler for now.

I think some people may want an easy way to just get the CDN ready version... of course our build process could handle this and then drop CDN ready "binaries" in the dist directory of 3rd party languages (that they could then check-in/distribute)... Short of that I think we're looking at something like UMD? But that seems pretty ugly and complex for this... no?

Ok, actually Solidity does it with just a custom function name (which of course pollutes the global namespace also), but I think this is a bit messy:

// in the external solidity.js file
function hljsDefineSolidity(hljs) {
...

// and then to register after you loaded the JS
hljs.registerLanguage('solidity', window.hljsDefineSolidity);

This was referenced Dec 19, 2019
@joshgoebel
Copy link
Member Author

joshgoebel commented Dec 25, 2019

What I've done with the ShEx module is a little different in that it exposes one grammar by default but allows parametric entrance into that grammar.

Though now that I'm thinking about it all I've seen is that you have two different main modes... inside and outside... can't you just:

  • One language is completely normal
  • The other uses requireLanguage and rawDefinition to change its "context"

But they'd both register normally.

hljs.registerLanguage('shexc', shexc);
hljs.registerLanguage('shexc-texpr', shexc_texpr);

All the custom config logic is hidden inside shexc_inside. At this point you have two normal API language modules and the build system would churn out CDN modules easily.

Am I missing something?


If this is such a common need that most users of your library would need it, it seems (given current limitations of HLJS) that the easily way to do it from a user standpoint would be just to register two slightly different languages - rather than have them copy and pasting code from the README that is different than every other language module?

@joshgoebel
Copy link
Member Author

joshgoebel commented Dec 25, 2019

IE if I'm understanding properly:

shexc-texpr.js

module.exports = function (hljs) {
  return hljs.requireLanguage("shexc").rawDefinition({startingProduction: "tripleExpression"});
}

Or some such. Now it's just a standard language module.

Obviously I wouldn't suggest this if you had 5 or 15 or 50 variant modes, but for just two it doesn't seem a terrible compromise to me. This is also VERY similar to the question we're currently discussing regarding PHP and whether it's a snipper or a full PHP template:

#2330


Although at this point I'm starting to wonder if perhaps the "child mode" should perhaps be it's own language anyways and use "sublanguage" (perhaps overkill) or else the parent should depend on the child (requiring it) - instead of the either/or configuration way you're doing it. Many different ways to solve this problem.

@ericprud
Copy link
Contributor

Obviously I wouldn't suggest this if you had 5 or 15 or 50 variant modes, but for just two it doesn't seem a terrible compromise to me. This is also VERY similar to the question we're currently discussing regarding PHP and whether it's a snipper or a full PHP template:

I've only tested two entry points but there are ~8 corresponding to the assignments to productions. I don't know how many will emerge from the SPARQL module but it is a more complicated language. Given that, it seems worth it to establish the convention internally. It probably won't be worth it for highlightjs unless some other zealots follow the same pattern.

It seems

@joshgoebel
Copy link
Member Author

If you have more than 2-3 I'd almost suggest building your own auto-detect on top of HLJS. What you really want is what I've mentioned elsewhere, one language that maps to multiple variants - then auto-detect could figure out which variant makes the most sense:

php => [php-template OR merely php-snippet]
shexc => [top level context or snippet context]

Long-term this might be internal to a grammar. But for now you'd just build a wrapper very similar to highlightAuto that would maintain a table of these mappings so that if it saw "sparql" for example it would know which variants that mapped to and it would try them all for the "Best fit".

You almost have to deal with it yourself somehow I think. As soon as you get past 2-3 asking user of your grammar to label every snippet with say 8-15 different variants is really a no go. ;-) Or at least not every user friendly. :-)

@joshgoebel
Copy link
Member Author

joshgoebel commented Dec 26, 2019

Actually this MIGHT be possible with the callback support that there is already a PR for... the missing piece of the puzzle is how to "hide" a bunch of language variants... registering them all isn't great, though I suppose if you namespaced them and turned off auto-detect (at the sub-module level) that wouldn't be so bar:

registerLanguage("sparq_context-1", sparq(...))
registerLanguage("sparq_context-2", sparq(...))
registerLanguage("sparq_context-3", sparq(...))

// sparq grammar
{
  multiLanguage: ["sparq_context-1", "sparq_context-2", ...],
  // or you could even have a determiner callback if the context was "easy" to sniff without parsing a the whole content
  // for example for PHP we could just look for "<?php" in the beginning that would be a pretty safe bet...
  determinerCallback: function (...) =>

Then your "main" sparq language would provide a "smart" language and a "multi-context" plugin... and then auto-detected which one (and probably even returned the results). In this case you'd be using a before hook to DO the actual highlighting. Seems like something we could support.

Does that make any sense?

@ericprud
Copy link
Contributor

ericprud commented Dec 26, 2019

Actually this MIGHT be possible with the callback support that there is already a PR for... the missing piece of the puzzle is how to "hide" a bunch of language variants... registering them all isn't great, though I suppose if you namespaced them and turned off auto-detect (at the sub-module level) that wouldn't be so bar:

registerLanguage("sparq_context-1", sparq(...))
registerLanguage("sparq_context-2", sparq(...))
registerLanguage("sparq_context-3", sparq(...))

Then your "main" language would provide a dummy language and a plugin that handled "sparq"... and then auto-detected when one (and probably even returned the results). In this case you'd be using a before hook to DO the actual highlighting. Seems like something we could support.

And then the markup tests would be under the various sparq_context-n directories? That seems possible. I guess we need to measure this against the alternatives:

  1. all packages are responsible for their own tests; the highlightjs test suite calls them and makes no assumptions about what's in them.
  2. funny packages like mine have tests that have to be run explicitly.

@joshgoebel
Copy link
Member Author

And then the markup tests would be under the various sparq_context-n directories? That seems possible.

That is a very good question. I'd love to find a "better" format for the markup tests in general... they definitely don't scale to multiple variants of a single language every well at all as they currently stand.

funny packages like mine have tests that have to be run explicitly.

That may be where you're at for now.

@joshgoebel joshgoebel pinned this issue Dec 26, 2019
@joshgoebel joshgoebel changed the title Discuss: 3rd party language grammar official repository format Discuss: 3rd party language packages spec Dec 26, 2019
@idleberg
Copy link
Contributor

It should also be discussed how 3rd party languages are published. I'm not sure how which permissions come with the standard member role of the npm organization. Can any member publish any package that was published under that organization or only those to which he has access on GitHub?

Speaking of which, @marcosc and I are currently the only members of the @highlightjs organization. I've sent out invites to the rest of the team some months ago. Also, I will step down from my role as the owner once others are aboard.

@joshgoebel
Copy link
Member Author

joshgoebel commented Dec 28, 2019

It should also be discussed how 3rd party languages are published. I'm not sure how which permissions come with the standard member role of the npm organization. Can any member publish any package that was published under that organization or only those to which he has access on GitHub?

That should be a whole other thread - this one is long enough. I see that as a different discussion, but a worthwhile one. If you're like to open one that'd be great.

@idleberg I'd take an invite, but don't think I've seen one.

@ericprud
Copy link
Contributor

What's left to implement/prototype before this rolls out?
Will it be a new major version?
If so, is there any other stuff we want to tackle before release? (I have in mind adding more nuance to the autodetect scoring, which I can write up in another issue.)

@joshgoebel
Copy link
Member Author

What's left to implement/prototype before this rolls out?

It's part of the new build system, so it'll get merged when that gets merged. Only a few more things to polish and then just need to get Egor to sign off on it.

Will it be a new major version?

Should be part of v10 which should be the next release I THINK. If it takes a bit longer I might bump out one last 9 series just to get some of the more recent language fixes out sooner, but right now the goal is mid-Jan, and we might hit that.

If so, is there any other stuff we want to tackle before release?

See the issue topic on v10, but I'm not sure all that will make it.

(I have in mind adding more nuance to the autodetect scoring, which I can write up in another issue.)

I wouldn't mind discussing that with you. I could point you to a few utility/workflow based things I think would really help that you could probably bang out if you were interested. What's needed first (IMHO) is a MUCH larger test suite for auto-detection before more nuanced changes back be made - right now we have no real true way to measure improvement vs regressions.

@joshgoebel
Copy link
Member Author

What's left to implement/prototype before this rolls out?

Of course this is still making the tests easier to run from inside a folder, etc... but I don' think that's a blocker for this functionality, just a future improvement we should find time for.

@gusbemacbe
Copy link
Contributor

I have updated the highlightjs-cypher and generated with your plugin, but the minimal JS file was not generated or updated, @yyyc514 .

@joshgoebel
Copy link
Member Author

@gusbemacbe Think you're all good now right? I made the logging a bit more verbose today:

Starting build.
Finished build.
Writing style files.
Building language files.
................................................................................................................................................................................................
Building extra/highlightjs-cypher/dist/cypher.min.js.
Building extra/highlightjs-lustre/dist/lustre.min.js.
Building extra/highlightjs-robots-txt/dist/robots-txt.min.js.
Building extra/highlightjs-shexc/dist/shexc.min.js.
Building extra/highlightjs-solidity/dist/solidity.min.js.
Building extra/highlightjs-structured-text/dist/iecst.min.js.
Building extra/highlightjs-tsql/dist/tsql.min.js.

Building highlight.js.
-----
Embedded Lang       : 798990 bytes
All Lang            : 798990 bytes
highlight.js        : 1321733 bytes
highlight.min.js    : 840771 bytes
highlight.min.js.gz : 269519 bytes
-----

@joshgoebel joshgoebel unpinned this issue Feb 1, 2020
@joshgoebel
Copy link
Member Author

Closing this thread for now. Improving testing/integration process for 3rd party devs is really a different thing than defining the core format/structure itself. I think we handled that decently here and we're seeing lots of people contributing 3rd party modules, so that's awesome.

One unanswered question here is how a multi-language pack might ship a multi-language distribution distributable and my thoughts on that:

  • Ship all the auto-generated files (in case a user only needs a single file)
  • cd dist; cat lang1.js lang2.js lang3.js > multi-package.min.js;

So the multi package would just be all the individual files pasted together... each one of them doing auto-registration... so then a user could just include that single file and get all 3 languages... or they could (theoretically) include all 3 files separates and get the same result.

I don't know if we should add this last step to the build process and have a standardized name for the bundle? Perhaps just using the repository name?

extra/highlightjs-mde-languages:

  • dist/atl.min.js
  • dist/ocl.min.js
  • dist/xcore.min.js
  • dist/mde-languages.min.js

joshgoebel referenced this issue in highlightjs/highlightjs-mde-languages Feb 6, 2020
Fix code to correctly register languages to reveal.js, when loading from a browser.

Add missing keywords to ATL.
@Serhioromano
Copy link

I did not follow this thread. Can you finalize what should I do to the repo to comply?

@joshgoebel
Copy link
Member Author

If your making a new 3rd party grammar just check out the quick start readme in my branch and look at the other examples repositories.

@ericprud
Copy link
Contributor

ericprud commented Feb 7, 2020

so for code that's shared between the different language files (which is probably 80% for the RDF suite I modeled), would you advise the old exports trick?

@joshgoebel
Copy link
Member Author

joshgoebel commented Feb 7, 2020

so for code that's shared between the different language files (which is probably 80% for the RDF suite I modeled), would you advise the old exports trick?

Hard to say. I'm not sure I've decided how the main repo should work long-term. Your choices are either:

  • exports trick, by which I assume you mean make one grammar requireLanguage the other and have a run-time dependency (all our dependencies now are run-time)
  • use ES6 modules and just put the shared code in it's own file that you import/require into both the languages that need it

If the amount of code is small (and not a burden to download repeatedly - OR is likely to be bundled in a monolithic build) I'd actually consider modules. Run-time dependencies get annoying if you don't load things in the right order, or someone forgets to load one, etc...

Right now using modules would result in a slightly larger build since the code ends up repeated in multiple places - though in the future I plan to fix that (at least for those building a monolithic library). The benefit is no dependency hell.

Related:
#2248

@joshgoebel
Copy link
Member Author

Of note for 3rd party language authors:

#2400

Should be a quick add.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
big picture Policy or high level discussion language
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants