Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CS2] Fix handling of tabbed code blocks in .litcoffee files #4485

Merged
merged 10 commits into from
Apr 6, 2017

Conversation

GeoffreyBooth
Copy link
Collaborator

Fixes #4463.

So since #4345, code blocks in .litcoffee files that were indented with tabs were simply skipped over, and the code never executed. That’s why the tests were “passing,” because they were never running. Oops!

I wanted the tests to be a bit more thorough, so I duplicated the literate.litcoffee file to a version where all the code blocks are indented with tabs. Now not only are we testing that tabbed code blocks work, but that they work in all the same scenarios as spaced code blocks.

After a lot of investigation, and a detour in trying to implement this with Illiterate (see branch) I think the root cause is that our Markdown parsing library Marked just doesn’t handle tabbed code blocks properly. @billymoon was aware of this in his initial implementation, which had a magic token substituting for tabs, but it was buggy (for one thing, it was only replacing the first tab in the file, not all tabs or all leading tabs). I realized that since this input is just getting turned into JavaScript anyway, we might as well replace the leading tabs with spaces before Marked does its parsing, to sidestep Marked‘s bug. This seems to have solved it.

I went a little further to try to preserve correct line numbers for stack traces, which have also been completely blown away since we switched to using Marked. I pretty much have it fixed if you don’t do your own word wrapping for list items, which of course we’re doing in our tests. (Maybe we shouldn’t; I doubt it’s common for people to manually word-wrap their Markdown.) Anyway improving it further could be the subject of another PR if anyone is interested.

…parsing of tabbed .litcoffee files; and more accurate stack traces (assuming you don’t do your own word wrapping within list items)
@GeoffreyBooth GeoffreyBooth added this to the 2.0.0 milestone Apr 5, 2017
# `marked` really doesn’t process tabs too well. All this code is headed
# into JavaScript anyway, so replace the tabs with two spaces to make
# `marked`’s job a little easier.
code = code.replace /^(\t)+/gm, (match) ->
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about multiline strings and regexes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code here is the source code of an entire file. This function is intending to replace all leading tabs (between the start of any line and the first non-tab character) with 2 spaces × the number of leading tabs on that line. Does it not do that?

Copy link
Collaborator

@lydell lydell Apr 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m worried that it changes the meaning of mulitline strings and regexes.

Here’s a multiline string:

	myString = """
		line one
			line two
	"""

	eq myString, "line one\n\tline two"

Copy link
Collaborator

@lydell lydell Apr 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps only replacing the first tab on each line is enough? code.replace /^\t/gm, ' '

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the previous approach, replacing the first tab with a tab plus a magic token. Then after the parsing by Marked, the tab plus token were replaced back with a tab. For some reason I was getting lots of compiler errors regarding indentation when I did that, that I couldn't figure out. I'll look into it some more.

@lydell
Copy link
Collaborator

lydell commented Apr 5, 2017

I doubt it’s common for people to manually word-wrap their Markdown.

Many text editors have a function to split long lines and insert newlines to make all lines fit within, say, 80 columns. In my experience, I’ve seen just as much “soft wrapped” as “hard wrapped” markdown in the wild.

@GeoffreyBooth
Copy link
Collaborator Author

Throw a console.log out before the last line of exports.invertLiterate and you can see what gets produced by Marked and our additional processing. It’s quite close to the original. Parsing literate.litcoffee produces a parsed out that’s 175 lines long, from an original that is 163. The extra lines come from newlines added between the lines of the list items (the Latin paragraphs).

Parsing scope.litcoffee produces an out that is 118 lines long, exactly the same as the original. sourcemap.litcoffee comes in at 166 to the original’s 177, because the original occasionally uses two blank lines between sections. Basically if you’re careful to follow a few rules, you can get your stack traces to match up perfectly. I’m not sure how we would handle these edge cases, but again, I think that’s a separate PR.

@carlsmith
Copy link
Contributor

Just an aside: Almost every style guide for every language recommends spaces over tabs these days. Officially, Python only tolerates tabs to support existing codebases.

There's an opportunity now to mandate spaces. It's obviously going to upset some people, but it would simplify things if the new spec simply disallowed tabs in source code entirely.

@GeoffreyBooth
Copy link
Collaborator Author

@carlsmith That would be a breaking change with no upside for users. Sure it would make our lives easier, but supporting tabs isn't so hard. The compiler shouldn't also be a linter.

@GeoffreyBooth
Copy link
Collaborator Author

How bad would it be to return to the v1 implementation? What concrete bugs does using Marked solve? At the very least it makes stack traces difficult to keep correct. @billymoon

@carlsmith
Copy link
Contributor

Most languages are trying to phase out the use of tabs, but if it's a non-starter, that's cool. It was worth a shot.

Thanks for everything you've been doing @GeoffreyBooth. Appreciate it.

@lydell
Copy link
Collaborator

lydell commented Apr 5, 2017

Yet an approach would be to switch to another markdown parser, such as markdown-it.

@billymoon
Copy link
Contributor

billymoon commented Apr 5, 2017

@GeoffreyBooth I am actually in favour of removing support for literate coffee from the project, and creating a more general project to handle any file-type, which is what I was experimenting with a POC for (https://github.com/billymoon/illiterate). The new workflow would be something like...

illiterate something.coffee.md | coffee

... and for other files ...

illiterate something.js.md | babel

... etc

Ultimately, I really like literate style, and want to be able to use it on all my code, not just coffee-script. Actually, I remember when adding a -l flag for supporting piped literate in coffeescript, we dropped linting feature to make space for literate with the rationale being that linting belongs in an external tool. I think literate is the same.

@carlsmith
Copy link
Contributor

Just a thought: If the CoffeeScript compiler allows scripts to import other CoffeeScript modules, wouldn't it need to understand literate files? I may be wrong. I'm not familiar with how imports work, but in interpreted languages, the interpreter would need to understand literate syntax to import those files.

@lydell
Copy link
Collaborator

lydell commented Apr 5, 2017

CoffeeScript doesn’t care what you try to import.

@GeoffreyBooth
Copy link
Collaborator Author

@billymoon take a look at my illiterate branch. I agree that that’s a good approach long-term, but it would be a pretty major change to CoffeeScript at the moment; not least because part of the compiler itself is written in .litcoffee.

One big problem that needs solving is source maps, or rather how to preserve accurate line numbers in stack traces. Check out that branch and introduce a syntax error near the end of literate.litcoffee or literate_tabbed.litcoffee and you’ll see how useless the error message is.

Also it really should understand tabs properly. Regardless of whatever movement I may be unaware of to get rid of tabs across the world, they make up substantial percentages of repos on GitHub, and I would only expect that to increase as more people realize there are no technical limitations anymore to working with tabs and source control. Tabs provide the benefit that different members of a team can choose their own indentation level (i.e. tell their editor to render a tab as being 2 spaces or 4) and that has really kept the peace on many of the teams I manage. I wouldn’t want to lose that option.

@lydell
Copy link
Collaborator

lydell commented Apr 5, 2017

Please let’s not discuss tabs vs spaces. Markdown supports both spaces and tabs for code blocks, so we have to as well. Period. :)

https://daringfireball.net/projects/markdown/syntax#precode
http://spec.commonmark.org/0.27/#tabs

@carlsmith
Copy link
Contributor

@GeoffreyBooth - I never said anything about any movements across the world. I said...

Almost every style guide for every language recommends spaces over tabs these days. Officially, Python only tolerates tabs to support existing codebases.

...which is objectively true, these days. Tons of code exists that was written with tabs, which is why they're a bitch to phase out, but the general consensus, at least amongst the authors of official style guides, is that we should stop using them.

The Python Style Guide says don't use tabs. Same for Ruby, PHP and Java, as well as Google's style guides for CSS, C++ and JavaScript. They're the first examples I could find yesterday during a similar conversation. I just looked at Google's Markdown Style Guide as well, and it says "use four spaces".

You made it clear that you aren't interested, which I happily accepted, and thanked you for your hard work:

Most languages are trying to phase out the use of tabs, but if it's a non-starter, that's cool. It was worth a shot.
Thanks for everything you've been doing @GeoffreyBooth. Appreciate it.

I don't understand why you mocked me for that, but it doesn't really matter. If you think tabs are better than spaces, and that they're going to grow in popularity (despite the style guides), then you should obviously support them. There's no point countering your argument for tabs over spaces, as we've all heard both sides of that debate a thousand times.

@GeoffreyBooth
Copy link
Collaborator Author

@carlsmith My apologies if I came across as mocking, that wasn’t my intent. And I agree with @lydell, I don’t want yet another flame war about tabs vs spaces. I was trying to avoid taking a side; my point was just that tabs are still quite popular, despite style guides advising against them, and so we need to support them. I doubt many of those projects are just legacy codebases that haven’t been updated to the latest styles. Spaces have been most languages’ recommended style for a long, long time.

@carlsmith
Copy link
Contributor

@GeoffreyBooth - No problem. To be fair, I'm tired, and probably shouldn't have let your comment bother me as much as I did. Sorry mate. I have these days sometimes, and can only apologise.

I understand what you're saying about tabs, and it does make sense. I'm a four-spaces guy, so there's a good chance I'm guilty of an unconscious bias here.

The thing is, when you don't really contribute to a project, but care about it a lot, you end up in a tricky situation. On one hand, you always want to chip in with your own perspective, but on the other hand, you don't want to be the ideas-guy that argues for a particular position, but then expects someone else to implement it.

@GeoffreyBooth
Copy link
Collaborator Author

@carlsmith And I bet it irks you to read the CoffeeScript codebase with its 2-space indentation, doesn’t it? If only it had been written in tabs, that you could choose to display as 4 spaces . . . 😛

I wouldn’t say you don’t contribute to the project. You’ve written a lot in many of the issue threads, and your comments have been valuable and helped guide the discussion and consensus. That in and of itself is a positive contribution, and I appreciate it.

At work I manage teams of developers. I’ve had to mediate this dispute way too many times. I’ve adopted a policy of letting each team decide for themselves per language what they want. Most people—certainly not everyone, but more than half—prefer spaces, and often they can agree on either 2 spaces or 4 for indentation; but sometimes there are fierce disagreements over 2 spaces versus 4. When that happens the solution is to compromise on tabs, so that each camp can see the code the way they prefer. I once had someone insist that even that wasn’t good enough, that she needed not just 4-space indentation but 4 actual space characters as well, and I wrote her some Sublime Text macros to auto-format her code to 4-spaces on file open and back to tabs on save. That’s pretty extreme, but it gets to the larger point that we’re not in the 1970s or 1990s anymore: we’re not constrained by technical limitations that force one way or another, so each team member should get to see their own preference, just like each developer gets to choose their own editor or syntax highlighting. There’s nothing wrong with tabs, like there’s nothing wrong with 2 spaces or 4. They’re all reasonable choices, and we shouldn’t make people preprocess their code to our standard before the compiler will accept it.

I don’t know about Python, but Go is going in the opposite direction: http://stackoverflow.com/a/19094725/223225

@carlsmith
Copy link
Contributor

carlsmith commented Apr 5, 2017

And I bet it irks you to read the CoffeeScript codebase with its 2-space indentation, doesn’t it? If only it had been written in tabs, that you could choose to display as 4 spaces . . . 😛

You've got me there :)

I wouldn’t say you don’t contribute to the project. You’ve written a lot in many of the issue threads, and your comments have been valuable and helped guide the discussion and consensus. That in and of itself is a positive contribution, and I appreciate it.

Thank you. That's very kind of you.

The Go approach is really interesting.

@carlsmith
Copy link
Contributor

@GeoffreyBooth - I hope your team appreciate how well you deal with people. People are difficult. You have the qualities a good lead should have, but all too often lack. Thanks again mate, and sorry for taking things the wrong way.

@billymoon
Copy link
Contributor

@lydell thanks for the tip, I updated illiterate to use markdown-it

@GeoffreyBooth I broke the back of adding source maps to illiterate, currently as inline data uri string. I need to think about the whole cli command structure to allow for optional source maps and on a per file basis, and how to deal with input from stdio, but the source map writing part is done (mapping full source line to destination line, disregarding columns, which are going to be identical in this case). Does that help at all with your experimental illiterate branch?

@GeoffreyBooth
Copy link
Collaborator Author

@billymoon I’m not sure how I would pass along and process the sourcemaps generated by Illiterate in the main CoffeeScript parser. The invertLiterate helper just returns a string. Certainly we can update it to return an object and the main compiler can pull out the source and the sourcemap, but then we’d have to merge this sourcemap with CoffeeScript’s sourcemap at the end of compilation. Do you want to take a crack at that?

I at least figured out why the “replace tabs with a magic token” approach was failing: Marked itself replaces tabs with spaces! See:

node -e 'console.log(require("marked").lexer("\t\tfoo();\t\t\tbar();", {}))'
[ { type: 'code', text: '    foo();            bar();' },
  links: {} ]

@GeoffreyBooth
Copy link
Collaborator Author

Okay, I think I’ve got it! Including correct line numbers in stack traces!

The function is so short now I’ll just paste it here:

exports.invertLiterate = (code) ->
  out = []
  md.renderer.rules =
    code_block: (tokens, idx) ->
      startLine = tokens[idx].map[0]
      lines = tokens[idx].content.split '\n'
      for line, i in lines
        out[startLine + i] = line
  md.render code
  out.join '\n'

Yes it could squeeze even shorter, but I’m not trying to golf here. Anyway let’s unpack this. First, @lydell’s suggestion of MarkdownIt (md above) is a winner! That library doesn’t convert tabs into spaces, so there’s no tab substitution/reparation necessary. I added a new test to both the spaced and tabbed Literate CoffeeScript tests to confirm that block strings are unaffected.

Second, @billymoon’s commit had the brilliant idea of using MarkdownIt’s rules engine for pulling out the code blocks. Much easier than my first idea, which was to use its parse method.

I inspected what the code_block function was getting, and the tokens[idx] object not only contained a content property with the code source, but also a map property that’s an array like [ 70, 72 ] telling us the first and last line of the original source that this token covers. This is perfect! Now we can line up every code block token with its correct line number! So that’s what I’m doing here: splitting each code block into an array of lines, then assigning those lines to out using the line number of each line of code as out’s array index. Then when we join out back together into a string separated by newlines, all the lines of code are have the same line numbers as they did in the original .litcoffee file. Stack traces for syntax errors in Literate CoffeeScript files now always show the correct line content and number.

There is one caveat: the new library introduces a two minor breaking changes to Literate CoffeeScript:

  1. Code blocks within HTML blocks must now be unindented. b8eed42. In other words, if you want a code block within an HTML paragraph, it can’t be entirely indented. At least one line of the block must be at the first column, i.e. unindented.
  2. Code blocks within lists need a blank line separating them from the list item text. 7705438. It’s no longer enough to just indent a code block adjacent to a list item. You need a blank line first.

I can live with these changes. I had to modify two tests to accommodate them, but nothing in scope.litcoffee or sourcemap.litcoffee needed to change. We can note these in the docs, and I don’t think they’d cause too much of an outcry.

@billymoon
Copy link
Contributor

@GeoffreyBooth just checking, is it still desirable to externalise literate feature and pass source maps from illiterate? I can have a bash at that if it's useful.

p.s. you know the code is right when there are only 10 lines :)

@GeoffreyBooth
Copy link
Collaborator Author

@billymoon Yes and no 😄 I think at this point people expect the CoffeeScript compiler to process .litcoffee files, and it would be a breaking change if it didn’t, so coffee test.litcoffee needs to keep working no matter what. I would be happy to replace my 10 lines with a call to Illiterate, if it wants to return the same string that out currently gets returned as. Though that would mean that Illiterate essentially just becomes this function, which may or may not be what you want.

There’s actually no need for source maps using the final implementation I have now, as every line of code is at the same row and column as the original Markdown; Illiterate could take the same approach, which might simplify things.

@GeoffreyBooth GeoffreyBooth merged commit 5e1d978 into jashkenas:2 Apr 6, 2017
@GeoffreyBooth GeoffreyBooth deleted the litcoffee-bug branch April 6, 2017 17:09
@billymoon
Copy link
Contributor

@GeoffreyBooth well, if that's the case, I don't really see much use in putting illiterate in. Keep up the good work :)

@GeoffreyBooth
Copy link
Collaborator Author

@billymoon sorry. I support what you’re doing though, I think you’re right and Illiterate is the way Literate CoffeeScript should’ve been implemented in the first place. If you want to propose a way to replace our current hodgepodge with Illiterate, and it wouldn’t be a breaking change, I would heartily support that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants