Add common extensions to Motorola 68k Assembly #4637

Merged
merged 6 commits into from
Jan 14, 2020

Conversation

idrougge
Contributor

@idrougge idrougge commented Sep 1, 2019

Description

Common extensions for m68k assembly are .asm, .s, .i, .inc. The only currently registered extension x68 is only used within one specific development environment AFAIK.


@stale

stale bot commented Oct 1, 2019

This pull request has been automatically marked as stale because it has not had recent activity, and will be closed if no further activity occurs. If this pull request was overlooked, forgotten, or should remain open for any other reason, please reply here to call attention to it and remove the stale status. Thank you for your contributions.

@stale stale bot added the Stale label Oct 1, 2019
@lildude lildude removed the Stale label Oct 2, 2019
lib/linguist/languages.yml (outdated)
- ".asm"
- ".i"
- ".inc"
- ".s"
Collaborator

Are there any registers or opcodes unique to Motorola we can use to disambiguate assembly files with?

We're definitely going to need some heuristics for .asm and .inc. The latter is particularly important because it sees very general use across a range of unrelated languages…

Contributor Author

M68k assembly is easily distinguished from other assembly languages, both by registers and opcodes.

How would such a heuristic look?

Collaborator

@Alhadis Alhadis Oct 3, 2019

A regular expression; I'm happy to write it for you, provided you give me the names of substrings guaranteed not (or highly unlikely) to appear in the source code of any other assembly language.

Here are our existing heuristics.

Contributor Author

(?im:moveq\b.*?d\d|move\.[bwl]\s+.*\b[ad]\d|movem\.[bwl]\b|btst\b|dbra\b)

Collaborator

Would it be reasonable to limit the moveq heuristic to match two registers (one address, one data)? From what I hear, 68k is unique for differentiating between the two.

If so, we could try this:

(?xi)
	# Mnemonic
	\b moveq (\.l)? \s+
	
	# Immediate value
	\#( \$ -? [0-9a-f]{1,3}
	  |    %  [0-1]{1,8}
	  |    -? [0-9]{1,3}
	  )
	, \s*
	
	# Register
	d[0-7] \b

When writing heuristics, it's best to be as specific as possible; anything which doesn't match is passed down to the (less accurate) classification techniques.
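
To get a feel for how specific the moveq expression above is, here is a minimal sketch that ports it to Python's `re` module in verbose mode. The helper name `looks_like_moveq` and the sample lines are made up for illustration; Linguist itself is written in Ruby, so this only approximates how the pattern would behave there.

```python
import re

# Verbose-mode port of the moveq pattern quoted above (illustrative only).
# In re.VERBOSE an unescaped '#' starts a comment, so the literal '#'
# that introduces the immediate operand must stay escaped as '\#'.
MOVEQ = re.compile(
    r"""
    \b moveq (\.l)? \s+          # mnemonic with optional .l size suffix
    \#( \$ -? [0-9a-f]{1,3}      # hexadecimal immediate
      | %  [0-1]{1,8}            # binary immediate
      | -? [0-9]{1,3}            # decimal immediate
      )
    , \s*
    d[0-7] \b                    # destination must be a data register d0-d7
    """,
    re.VERBOSE | re.IGNORECASE,
)


def looks_like_moveq(line: str) -> bool:
    """Hypothetical helper: True if the line contains a 68k-style moveq."""
    return MOVEQ.search(line) is not None


if __name__ == "__main__":
    for sample in ("moveq #0,d0", "\tMOVEQ.L\t#$7F,D1", "mov eax, 0", "moveq #1,a0"):
        print(repr(sample), looks_like_moveq(sample))
```

Requiring both the `#` immediate and a `d0`-`d7` destination is what keeps the pattern from firing on x86-style `mov` lines or on operands that are not data registers.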

Collaborator

Credit for this expression belongs to @zerkman, since it's taken from the language-m68k grammar we're using to highlight 68k on GitHub (I did clean it up and remove some redundant syntax for clarity).

I've amended the other parts of that expression to use other bits of that grammar, bringing us down to:

(?xim:
	# Mnemonic
	\b moveq (\.l)? \s+
	
	# Immediate value
	\#( \$ -? [0-9a-f]{1,3}
	  |    %  [0-1]{1,8}
	  |    -? [0-9]{1,3}
	  )
	, \s*
	
	# Register
	d[0-7] \b
	
	| ^ \s* move     (\.[bwl])? \s+ (sr|usp), \s* [^\s]+
	| ^ \s* movem     \.[bwl]  \b
	| ^ \s* move[mp] (\.[wl])? \b
	| ^ \s* btst  \b
	| ^ \s* dbra  \b
)

Notice that I've anchored the remaining parts to match at the beginning of a line (with or without indentation). This reduces the risk of incorrectly matching part of a comment in an unrelated file. For the same reason, you'll notice I avoid using wildcards when possible (.*).
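
The effect of anchoring can be sketched in a few lines of Python. The two sample strings below are invented for illustration (they are not taken from the harvested files), and Python's `re` stands in for the Ruby regex engine Linguist uses.

```python
import re

# Same mnemonic, with and without the line-start anchor discussed above.
UNANCHORED = re.compile(r"btst\b", re.IGNORECASE | re.MULTILINE)
ANCHORED = re.compile(r"^\s*btst\b", re.IGNORECASE | re.MULTILINE)

# Hypothetical 68k snippet: the mnemonic starts an (indented) line.
m68k_source = "wait:\n\tbtst\t#6,$bfe001\n\tbne.s\twait\n"

# Hypothetical unrelated file that merely mentions the mnemonic in a comment.
unrelated_c = "/* TODO: port the btst loop from the old driver */\nint main(void) { return 0; }\n"

if __name__ == "__main__":
    print("unanchored on C file:", bool(UNANCHORED.search(unrelated_c)))
    print("anchored on C file:  ", bool(ANCHORED.search(unrelated_c)))
    print("anchored on 68k file:", bool(ANCHORED.search(m68k_source)))
```

The unanchored form fires on the comment in the C file, while the anchored form only accepts the mnemonic at the start of a line. One trade-off worth noting: an instruction that follows a label on the same line would not be caught by the anchored form.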

Collaborator

@Alhadis Alhadis Oct 3, 2019

@idrougge If the above revisions look good to you, then the changes to make to heuristics.yml are below.

I'll still need to test them thoroughly on my end, as well as investigate any possible formats using the .i extension that we've not registered yet.

--- heuristics.yml	2019-10-03 21:45:48.000000000 +1000
+++ heuristics.yml	2019-10-03 22:30:25.000000000 +1000
@@ -49,8 +49,12 @@
   rules:
   - language: ActionScript
     pattern: '^\s*(package\s+[a-z0-9_\.]+|import\s+[a-zA-Z0-9_\.]+;|class\s+[A-Za-z0-9_]+\s+extends\s+[A-Za-z0-9_]+)'
   - language: AngelScript
+- extensions: ['.asm']
+  rules:
+  - language: Motorola 68K Assembly
+    named_pattern: m68k
 - extensions: ['.asc']
   rules:
   - language: Public Key
     pattern: '^(----[- ]BEGIN|ssh-(rsa|dss)) '
@@ -191,8 +195,10 @@
     pattern: '\A\s*[{\[]'
   - language: Slice
 - extensions: ['.inc']
   rules:
+  - language: Motorola 68K Assembly
+    named_pattern: m68k
   - language: PHP
     pattern: '^<\?(?:php)?'
   - language: SourcePawn
     pattern: '^public\s+(?:SharedPlugin(?:\s+|:)__pl_\w+\s*=(?:\s*{)?|(?:void\s+)?__pl_\w+_SetNTVOptional\(\)(?:\s*{)?)'
@@ -383,8 +389,12 @@
   - language: Rust
     pattern: '^(use |fn |mod |pub |macro_rules|impl|#!?\[)'
   - language: RenderScript
     pattern: '#include|#pragma\s+(rs|version)|__attribute__'
+- extensions: ['.s']
+  rules:
+  - language: Motorola 68K Assembly
+    named_pattern: m68k
 - extensions: ['.sc']
   rules:
   - language: SuperCollider
     pattern: '(?i:\^(this|super)\.|^\s*~\w+\s*=\.)'
@@ -486,7 +496,15 @@
   - '^[ \t]*(private|public|protected):$'
   - 'std::\w+'
   fortran: '^(?i:[c*][^abd-z]|      (subroutine|program|end|data)\s|\s*!)'
   key_equals_value: '^[^#!;][^=]*='
+  m68k:
+  - '(?im)\bmoveq(?:\.l)?\s+#(?:\$-?[0-9a-f]{1,3}|%[0-1]{1,8}|-?[0-9]{1,3}),\s*d[0-7]\b'
+  - '(?im)^\s*move(?:\.[bwl])?\s+(?:sr|usp),\s*[^\s]+'
+  - '(?im)^\s*move\.[bwl]\s+.*\b[ad]\d'
+  - '(?im)^\s*movem\.[bwl]\b'
+  - '(?im)^\s*move[mp](?:\.[wl])?\b'
+  - '(?im)^\s*btst\b'
+  - '(?im)^\s*dbra\b'
   objectivec: '^\s*(@(interface|class|protocol|property|end|synchronised|selector|implementation)\b|#import\s+.+\.h[">])'
   perl5: '\buse\s+(?:strict\b|v?5\.)'
   perl6: '^\s*(?:use\s+v6\b|\bmodule\b|\b(?:my\s+)?class\b)'
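
For anyone who wants to try the `m68k` named pattern outside of Linguist, the following is a rough Python sketch. It assumes that the list entries are alternatives, i.e. that a file counts as a match as soon as any one of them matches; that is my reading of the diff, not something verified against Linguist's Ruby code. The function name `matches_m68k` and the sample inputs are hypothetical.

```python
import re

# The seven m68k patterns from the diff above, copied verbatim.
M68K_PATTERNS = [
    r"(?im)\bmoveq(?:\.l)?\s+#(?:\$-?[0-9a-f]{1,3}|%[0-1]{1,8}|-?[0-9]{1,3}),\s*d[0-7]\b",
    r"(?im)^\s*move(?:\.[bwl])?\s+(?:sr|usp),\s*[^\s]+",
    r"(?im)^\s*move\.[bwl]\s+.*\b[ad]\d",
    r"(?im)^\s*movem\.[bwl]\b",
    r"(?im)^\s*move[mp](?:\.[wl])?\b",
    r"(?im)^\s*btst\b",
    r"(?im)^\s*dbra\b",
]
COMPILED = [re.compile(p) for p in M68K_PATTERNS]


def matches_m68k(source: str) -> bool:
    """Assumption: a match on any single pattern is enough."""
    return any(rx.search(source) for rx in COMPILED)


if __name__ == "__main__":
    samples = {
        "68k": "\tmovem.l\td0-d7/a0-a6,-(sp)\n\tdbra\td0,loop\n",
        "x86": "\tmov\teax, 1\n\tint\t0x80\n",
        "php": "<?php\necho 'hi';\n",
    }
    for name, text in samples.items():
        print(name, matches_m68k(text))
```

Files that match none of the patterns would simply fall through to whatever rule or classifier comes next for that extension.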

Contributor Author

I tested the ^ \s* move (\.[bwl])? \s+ (sr|usp), \s* [^\s]+ line, and it only catches moves from sr or usp, not moves to or from any given register. I feel that the Motorola syntax of `move.size` with a register named Dn or An as source or destination is sufficiently dissimilar to other assembly syntaxes to avoid confusion with them, while also catching even the shortest snippet.
Testing may prove the totality of heuristics to still be sufficient to catch all m68k assembly sources.

.i, like .inc, .asm, or .s, is used by assemblers on most platforms AFAIK.

Collaborator

In that case, the third line of the m68k heuristic becomes:

- '(?im)^\s*move\.[bwl]\s+.*\b[ad]\d'

I'll update the diff I just posted.
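
A quick way to see why the extra line matters is to compare it with the sr/usp rule on a couple of inputs. This is only a sketch in Python with invented sample lines; the two expressions are the ones quoted in this thread.

```python
import re

# The sr/usp rule and the sized-move rule, as quoted in this thread.
SR_USP = re.compile(r"(?im)^\s*move(?:\.[bwl])?\s+(?:sr|usp),\s*[^\s]+")
MOVE_SIZED = re.compile(r"(?im)^\s*move\.[bwl]\s+.*\b[ad]\d")

# Hypothetical sample lines, for illustration only.
samples = [
    "\tmove.l\td0,a1",    # ordinary sized move between registers
    "\tmove\tsr,d0",      # move from the status register, no size suffix
    "\tmov\teax, ebx",    # x86-style line that neither rule should accept
]

if __name__ == "__main__":
    for line in samples:
        print(repr(line), bool(SR_USP.search(line)), bool(MOVE_SIZED.search(line)))
```

The two rules complement each other here: the ordinary register move is only caught by the sized-move rule, and the unsized `move sr,d0` is only caught by the sr/usp rule.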

.i, like .inc, .asm, or .s, is used by assemblers on most platforms AFAIK.

There are currently 11,273,157 .i files publicly indexed on GitHub. Surely there must be other formats hidden out there...

Contributor Author

That regex looks fine for heuristics.

A quick glance at those results indicates that a lot of .i files are SWIG files, which may need a language definition of their own.

@stale stale bot added the Stale label Nov 2, 2019
@stale stale bot removed the Stale label Nov 2, 2019
@stale stale bot added the Stale label Dec 2, 2019
@Alhadis Alhadis removed the Stale label Dec 5, 2019
@stale stale bot added the Stale label Jan 1, 2020
@Alhadis Alhadis removed the Stale label Jan 1, 2020
@Alhadis Alhadis self-assigned this Jan 1, 2020
@Alhadis
Collaborator

Alhadis commented Jan 3, 2020

Christ. Going through the search results for .i files reveals not one, but two potential language additions: SWIG interfaces and something called Yorick, which has apparently been around since 1996, according to Wikipedia. Making matters worse is the large number of assembly files that weren't matched by our proposed heuristic, suggesting we either need to improve our heuristic or register .i as an ordinary assembly extension.

@idrougge Are these files Motorola 68k assembly? If so, we can add asm6809 as an alias, which will identify them to Linguist through their Vim modelines (Vim also has an asm68k mode, so it's worth adding that as an alias as well…).

What makes me unsure is that when I googled asm6809, it led me to a cross-assembler targeting Motorola 6809 and Hitachi 6309 processors:

asm6809 is a portable macro cross assembler targeting the Motorola 6809 and Hitachi 6309 processors. These processors are most commonly encountered in the Dragon and Tandy Colour Computer.

@idrougge
Contributor Author

idrougge commented Jan 7, 2020

@Alhadis The files you linked are all Motorola 6809 assembly, which is an 8-bit processor totally distinct from 68000.

I agree that .i should be added as an ordinary assembly extension — it is analogous to .s.

@Alhadis
Collaborator

Alhadis commented Jan 7, 2020

Ah, my bad. I took Motorola 68k to mean anything in the range of "Motorola 68,000—68,999". Wasn't aware that it was specific to 68000 only. Guess that answers my question about Vim modelines, then. 👍

We'll need to add .i to the Assembly entry then. In addition, SWIG will need to be added as a new entry. Regarding Yorick, I concluded that it fails to meet our usage requirements (there were hundreds of files, but only a handful of repositories).

I can push these changes to your branch later, as I'll need to run Linguist on the files I harvested (I'm not currently on a computer where I'm able to do so). Thanks for your input and patience, I realise this PR's been left hanging open for some time. 👍

@idrougge
Contributor Author

idrougge commented Jan 7, 2020

Indeed 68k is 68000…68999, but 6809 < 68000. ;)

@Alhadis
Collaborator

Alhadis commented Jan 7, 2020

Erp. So it is.

This just keeps getting better. 😅 I should stop talking now.

@Alhadis
Collaborator

Alhadis commented Jan 10, 2020

Alright. I've pushed the aforementioned changes to your branch; running Linguist locally on the harvested .i files produced satisfactory results. Have updated the original post to include source/license links for the new samples.

All that's left now is for @lildude to give a final 👍 before we can merge.

@Alhadis Alhadis requested a review from lildude January 10, 2020 06:58
@idrougge
Contributor Author

Thanks!

@lildude lildude merged commit 9cc1b39 into github-linguist:master Jan 14, 2020
ayoubserti pushed a commit to ayoubserti/linguist that referenced this pull request Jan 22, 2020
* Add common extensions to Motorola 68k

* Revert ACE mode for m68k assembly

* Add heuristics for Motorola 68K Assembly

* Add SWIG language and `.i` Assembly extension

Co-authored-by: John Gardner <gardnerjohng@gmail.com>
lildude pushed a commit that referenced this pull request Jan 28, 2020
* add .4dm extensons

* no language for the moment

* change the source of syntax highlighting for Agda (#4768)

* Add interpreters 'csh' and 'tcsh' for language 'Tcsh' (#4760)

* Update languages.yml

* Create regtest_nmmnest.csh

Source: https://github.com/barlage/WRF-kill/blob/master/tools/regtest_nmmnest.csh

* Register `.bibtex` as a BibTeX file-extension (#4764)

* Register `.dof` as an INI file-extension (#4766)

* Register `.epsi` as a PostScript file-extension (#4763)

* Add common extensions to Motorola 68k Assembly (#4637)

* Add common extensions to Motorola 68k

* Revert ACE mode for m68k assembly

* Add heuristics for Motorola 68K Assembly

* Add SWIG language and `.i` Assembly extension

Co-authored-by: John Gardner <gardnerjohng@gmail.com>

* Add file extension for SnakeMake (#3953)

* Add file extension for SnakeMake

Previously a file name was defined for [SnakeMake[(snakemake-wrappers.readthedocs.io): #1834

Currently, the canonical extension is `smk` (see [this discussion](https://groups.google.com/forum/#!topic/Snakemake/segLE-RlV_s) with the author (@johanneskoester) of SnakeMake, and the [FAQ](http://snakemake.readthedocs.io/en/stable/project_info/faq.html#how-do-i-enable-syntax-highlighting-in-vim-for-snakefiles)).

* Adding two Snakemake (smk) example files

* add .4dm extensons

* no language for the moment

* add lang-4d tmLanguage

* link syntax highliting

* typo

Co-authored-by: Guillaume Brunerie <guillaume.brunerie+github@gmail.com>
Co-authored-by: friedc <52925889+friedc@users.noreply.github.com>
Co-authored-by: John Gardner <gardnerjohng@gmail.com>
Co-authored-by: Iggy Drougge <idrougge@mac.com>
Co-authored-by: Nils Homer <nh13@users.noreply.github.com>
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024