Consider taking shebang line into account when identifying files #189

Open
sschuberth opened this issue Jul 3, 2018 · 6 comments

Comments

@sschuberth

Currently, searchcode does not seem to take shebang lines into account when identifying the language. For example, a file starting with #! /usr/bin/env python3 should be identified as Python even if the .py file extension is missing.

@sschuberth
Author

Or, for an even more sophisticated solution, maybe something like https://github.com/github/linguist could be used.

@boyter
Owner

boyter commented Jul 3, 2018

I actually just moved this over to use the same list that http://github.com/boyter/scc/ uses. However, it is based entirely on file extensions.

There used to be some logic in there to guess the language, but only for files with duplicate extensions. It was very slow and inaccurate, hence its removal.

This looks like a reasonable compromise.

@boyter
Owner

boyter commented Mar 10, 2019

I have updated this, based on scc, so that it now works with duplicate extensions.

As for dealing with shebang... that might be better as a pure searchcode implementation as it would needlessly slow down scc.

@sschuberth I don't suppose you know of some sort of list of these? If I can get them all in one go it would save some time.

@sschuberth
Author

> I don't suppose you know of some sort of list of these?

No, and I don't believe there can be such an official / complete list, because you can use the path to any arbitrary interpreter after #!. My suggestion is to simply hard-code a few common cases in the form of (pseudocode)

if first line starts with "#!" then
    if first line contains case insensitive "python" then
        language = Python
    else if first line contains case insensitive "ruby" then
        language = Ruby
    // ... further common interpreters
    end
end

And only do the above as a fallback if the language hasn't been identified yet by other means, like the file extension.
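
For illustration, the pseudocode and the fallback rule above could be sketched in Java roughly as follows. This is only a minimal sketch under assumptions: the class name `ShebangLanguageGuesser` and both method names are hypothetical and not part of searchcode, and the interpreter table covers only the two cases from the pseudocode.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration only; not an existing searchcode class.
final class ShebangLanguageGuesser {

    // Interpreter keyword -> language name. Only the two cases from the
    // pseudocode are listed; more entries would be added in the same way.
    private static final Map<String, String> INTERPRETERS = new LinkedHashMap<>();

    static {
        INTERPRETERS.put("python", "Python");
        INTERPRETERS.put("ruby", "Ruby");
    }

    private ShebangLanguageGuesser() {
    }

    // Returns a language if the first line is a shebang naming a known interpreter.
    static Optional<String> guessFromShebang(String firstLine) {
        if (firstLine == null || !firstLine.startsWith("#!")) {
            return Optional.empty();
        }
        // Case-insensitive substring match, as in the pseudocode. This is crude:
        // it also matches the keyword anywhere else in the interpreter path.
        String lower = firstLine.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : INTERPRETERS.entrySet()) {
            if (lower.contains(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }

    // Prefers the result of the extension lookup and consults the shebang
    // only when the extension did not identify a language.
    static Optional<String> identify(Optional<String> fromExtension, String firstLine) {
        if (fromExtension.isPresent()) {
            return fromExtension;
        }
        return guessFromShebang(firstLine);
    }
}
```

Because `identify` only looks at the shebang when the extension lookup came back empty, files with a known extension would not pay for the extra check.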

The above could probably be implemented as a FileTypeDetector to power probeContentType.
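
As a rough sketch of that idea, a `java.nio.file.spi.FileTypeDetector` subclass might look like the following. The class name `ShebangFileTypeDetector`, the 256-byte read limit, and the MIME type strings are assumptions made for this example. Note that `probeContentType` returns a MIME type rather than a language name, so searchcode would still need its own mapping from these strings to languages.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.spi.FileTypeDetector;
import java.util.Locale;

// Hypothetical detector for illustration only; not an existing searchcode class.
class ShebangFileTypeDetector extends FileTypeDetector {

    // Assumed upper bound on the shebang length, so large files are not read in full.
    private static final int MAX_HEAD_BYTES = 256;

    @Override
    public String probeContentType(Path path) throws IOException {
        if (!Files.isRegularFile(path)) {
            return null; // null means "not recognized", so other detectors can try
        }
        byte[] head = new byte[MAX_HEAD_BYTES];
        int read;
        try (InputStream in = Files.newInputStream(path)) {
            read = in.read(head);
        }
        if (read < 2) {
            return null;
        }
        // ISO-8859-1 maps every byte to a character, so binary files cannot
        // cause a decoding error here.
        String text = new String(head, 0, read, StandardCharsets.ISO_8859_1);
        if (!text.startsWith("#!")) {
            return null;
        }
        int endOfLine = text.indexOf('\n');
        String firstLine = (endOfLine >= 0 ? text.substring(0, endOfLine) : text)
                .toLowerCase(Locale.ROOT);
        // The MIME type strings below are assumptions chosen for this sketch.
        if (firstLine.contains("python")) {
            return "text/x-python";
        }
        if (firstLine.contains("ruby")) {
            return "text/x-ruby";
        }
        return null;
    }
}
```

For `Files.probeContentType` to pick the detector up, the class would have to be declared `public` and be listed in a `META-INF/services/java.nio.file.spi.FileTypeDetector` file on the class path; calling `new ShebangFileTypeDetector().probeContentType(path)` directly works without that registration.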

@sschuberth
Author

sschuberth commented Mar 11, 2019

Alternatively, maybe you can find a way to use the file command's magic "database" at https://github.com/file/file/tree/master/magic/Magdir from Java (e.g. via https://github.com/j256/simplemagic), as the file command already seems to recognize most shebang lines.

@boyter
Owner

boyter commented Mar 11, 2019

I had a feeling that was the case, but was hoping it wasn't.

There are some pretty neat ideas in file. I shamelessly steal ideas from the GNU tools so I might have a look in there as well. Thanks for the pointers.

2 participants