Consider taking shebang line into account when identifying files #189

sschuberth · 2018-07-03T07:08:38Z

Currently, searchcode does not seem to take shebang lines into account when identifying the language. I.e., a file starting with #! /usr/bin/env python3 should be identified as Python even if the .py file extension is missing.


sschuberth · 2018-07-03T14:09:15Z

Or, for an even more sophisticated solution, maybe something like https://github.com/github/linguist could be used.

boyter · 2018-07-03T21:59:11Z

I just moved this over to use the same list that http://github.com/boyter/scc/ uses, actually. However, it is based entirely on file extensions.

There used to be some logic in there to guess the file type, but that only applied in the case of duplicate extensions. It was very slow and inaccurate, hence its removal.

This looks like a reasonable compromise.

boyter · 2019-03-10T22:05:27Z

Have updated based on scc to now work with duplicate extensions.

As for dealing with shebang... that might be better as a pure searchcode implementation as it would needlessly slow down scc.

@sschuberth I don't suppose you know of some sort of list of these? If I can get them all in one go it would save some time.

sschuberth · 2019-03-11T09:18:48Z

I don't suppose you know of some sort of list of these?

No, and I don't believe there can be such an official / complete list, because you can use the path to any arbitrary interpreter after #!. My suggestion is to simply hard-code a few common cases in the form of (pseudocode):

if first line starts with "#!" then
    if first line contains case insensitive "python" then
        language = Python
    else if first line contains case insensitive "ruby" then
        language = Ruby
    end
    // ...
end

And only do the above as a fallback if the language hasn't been identified yet by other means, like the file extension.
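The pseudocode and the fallback rule above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class name ShebangDetector, its method names, and the two-entry keyword table are hypothetical and not taken from searchcode or scc.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper; not part of searchcode or scc.
final class ShebangDetector {
    // Assumed keyword-to-language table covering only the two cases from the
    // pseudocode. Insertion order decides which keyword wins on a double match.
    private static final Map<String, String> INTERPRETERS = new LinkedHashMap<>();
    static {
        INTERPRETERS.put("python", "Python");
        INTERPRETERS.put("ruby", "Ruby");
    }

    private ShebangDetector() {}

    // Returns a language only if the line starts with "#!" and mentions a
    // known interpreter keyword (case-insensitive).
    static Optional<String> fromShebang(String firstLine) {
        if (firstLine == null || !firstLine.startsWith("#!")) {
            return Optional.empty();
        }
        String lower = firstLine.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : INTERPRETERS.entrySet()) {
            if (lower.contains(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }

    // Fallback rule: the shebang is consulted only when the extension-based
    // lookup produced nothing (passed here as null).
    static Optional<String> identify(String languageFromExtension, String firstLine) {
        if (languageFromExtension != null) {
            return Optional.of(languageFromExtension);
        }
        return fromShebang(firstLine);
    }
}
```

With this sketch, identify(null, "#! /usr/bin/env python3") yields Python, while a file already classified by its extension keeps that classification regardless of its first line.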

The above could probably be implemented as a FileTypeDetector to power probeContentType.
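A minimal sketch of what such a detector might look like, assuming Java 9 or later. The class name ShebangFileTypeDetector, the 256-byte read limit, and the MIME strings text/x-python and text/x-ruby are assumptions made for illustration. To be picked up by Files.probeContentType, a detector is registered as a service provider, typically through a META-INF/services/java.nio.file.spi.FileTypeDetector file naming the class.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.spi.FileTypeDetector;
import java.util.Locale;

// Hypothetical detector. It is package-private here to keep the sketch in one
// snippet; it would most likely need to be public to be loaded as a service.
class ShebangFileTypeDetector extends FileTypeDetector {
    // Assumed upper bound on how much of the file is inspected.
    private static final int MAX_HEAD_BYTES = 256;

    @Override
    public String probeContentType(Path path) throws IOException {
        if (!Files.isRegularFile(path)) {
            return null; // null means "unknown"; other detectors may still answer
        }
        byte[] buffer = new byte[MAX_HEAD_BYTES];
        int read;
        try (InputStream in = Files.newInputStream(path)) {
            read = in.readNBytes(buffer, 0, buffer.length);
        }
        // ISO-8859-1 maps every byte to a char, so binary files cannot cause
        // a decoding error here.
        return contentTypeFor(new String(buffer, 0, read, StandardCharsets.ISO_8859_1));
    }

    // Looks only at the first line of the given file head.
    static String contentTypeFor(String head) {
        if (head == null || !head.startsWith("#!")) {
            return null;
        }
        int eol = head.indexOf('\n');
        String firstLine = (eol >= 0 ? head.substring(0, eol) : head).toLowerCase(Locale.ROOT);
        if (firstLine.contains("python")) {
            return "text/x-python"; // assumed, unofficial MIME type
        }
        if (firstLine.contains("ruby")) {
            return "text/x-ruby"; // assumed, unofficial MIME type
        }
        return null;
    }
}
```

Reading a bounded prefix rather than a whole line keeps the cost per file small, which matters given the earlier concern about slowing down detection.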

sschuberth · 2019-03-11T09:30:45Z

Alternatively, maybe you can find a way to use the GNU file command's "database" at https://github.com/file/file/tree/master/magic/Magdir from Java (e.g. via https://github.com/j256/simplemagic), as the file command already seems to recognize most shebang lines.

boyter · 2019-03-11T21:08:20Z

I had a feeling that was the case, but was hoping it wasn't.

There are some pretty neat ideas in file. I shamelessly steal ideas from the GNU tools so I might have a look in there as well. Thanks for the pointers.

sschuberth mentioned this issue Jul 3, 2018

Consider taking shebang line into account when identifying files boyter/searchcode#44

Closed

boyter added the enhancement label Jul 3, 2018
