Incorrect language detected (C++ as C, XML as TypeScript, etc.) #26

aaronfranke · 2019-02-16T20:45:49Z

https://github.com/github/linguist

Linguist is a tool developed by GitHub for the specific purpose of detecting languages. It's a very mature tool that gets it right the majority of the time by using complex rules.

o2sh · 2019-02-17T10:39:20Z

But why?
https://github.com/Aaronepower/tokei is written in Rust and does a great job detecting languages.

aaronfranke · 2019-02-17T18:14:14Z

Is that what Onefetch currently uses? It detects C++ as C in the case of Godot, and it didn't detect anything for the repo of a Godot project (while GitHub detects GDScript).

o2sh · 2019-02-17T18:47:06Z

It only detects the languages that are currently supported by onefetch (WIP):

C
Clojure
C++
C#
Go
Haskell
Java
Lisp
Lua
Python
R
Ruby
Rust
Scala
Shell
TypeScript
JavaScript
Php

Also, tokei ignores all comment lines, which is why the language distribution sometimes differs from GitHub's.
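To illustrate why skipping comments shifts the distribution, here is a minimal sketch. It is not tokei's implementation: `code_lines` and `percent` are hypothetical helpers, and only `//` line comments are handled (no block comments, no string handling).

```rust
/// Counts lines that are neither blank nor `//` line comments.
/// Simplification: real counters also handle block comments and
/// comment markers inside string literals; this sketch does not.
fn code_lines(source: &str) -> usize {
    source
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with("//"))
        .count()
}

/// Share of `part` in `total`, as a percentage.
fn percent(part: usize, total: usize) -> f64 {
    if total == 0 {
        0.0
    } else {
        part as f64 * 100.0 / total as f64
    }
}
```

A heavily commented file contributes far fewer counted lines than its raw size suggests, so a language whose files are mostly comments ends up with a smaller share than a size-based measure would give it.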

Languages supported by tokei: https://github.com/Aaronepower/tokei#supported-languages

aaronfranke · 2019-03-10T00:01:02Z

Upstream issues: XAMPPRocky/tokei#305 and XAMPPRocky/tokei#67

We can leave this closed though if you want.

o2sh · 2019-03-10T11:51:31Z

Ok, with the new title it makes more sense to keep this open.

We'll wait for tokei to fix it then.

Thx @aaronfranke

stale · 2020-08-21T23:45:08Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

aaronfranke · 2020-08-21T23:57:44Z

This issue still exists, though it is likely seen by the devs as low priority, so I'll probably have to bump this again later to please the stale bot.

stale · 2020-11-20T00:32:24Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

aaronfranke · 2020-11-20T01:37:38Z

This issue still exists, though it is likely seen by the devs as low priority, so I'll probably have to bump this again later to please the stale bot.

mapau · 2020-12-28T17:56:04Z

Hi, I added the C header and C++ header languages to the language.rs file in my fork:
https://github.com/mapau/onefetch/tree/feature/add-c-cpp-header

https://github.com/Aaronepower/tokei already detects C headers and C++ headers; only the mapping in onefetch is missing.

Here is a PR #365

o2sh · 2020-12-30T17:46:04Z

I'm not very fond of the idea of having separate entries for header files (CHeader and C++Header). I personally prefer the GitHub Linguist approach of extending the C and C++ detection scope to include their respective header files:

C++:
  type: programming
  tm_scope: source.c++
  ace_mode: c_cpp
  codemirror_mode: clike
  codemirror_mime_type: text/x-c++src
  color: "#f34b7d"
  aliases:
  - cpp
  extensions:
  - ".cpp"
  - ".c++"
  - ".cc"
  - ".cp"
  - ".cxx"
  - ".h"
  - ".h++"
  - ".hh"
  - ".hpp"
  - ".hxx"
  - ".inc"
  - ".inl"
  - ".ino"
  - ".ipp"
  - ".re"
  - ".tcc"
  - ".tpp"

C:
  type: programming
  color: "#555555"
  extensions:
  - ".c"
  - ".cats"
  - ".h"
  - ".idc"
  interpreters:
  - tcc
  tm_scope: source.c
  ace_mode: c_cpp
  codemirror_mode: clike
  codemirror_mime_type: text/x-csrc
  language_id: 41

I doubt the people over at tokei would be ready to make that shift... So, either we stick to tokei's detection rules and merge @mapau's PR, or we override the logic in Onefetch or...
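For the "override the logic in Onefetch" option, a rough sketch of what folding header entries into their parent language could look like. The `Detected` and `Reported` enums and both functions are hypothetical names for illustration; they are not onefetch's or tokei's actual types.

```rust
use std::collections::HashMap;

/// What the detector reports (hypothetical; headers are separate entries).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Detected {
    C,
    CHeader,
    Cpp,
    CppHeader,
    Rust,
}

/// What would be shown to the user (hypothetical; headers folded in).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Reported {
    C,
    Cpp,
    Rust,
}

/// Maps each header entry onto its parent language.
fn fold(lang: Detected) -> Reported {
    match lang {
        Detected::C | Detected::CHeader => Reported::C,
        Detected::Cpp | Detected::CppHeader => Reported::Cpp,
        Detected::Rust => Reported::Rust,
    }
}

/// Sums line counts per reported language after folding headers.
fn fold_counts(counts: &[(Detected, usize)]) -> HashMap<Reported, usize> {
    let mut merged = HashMap::new();
    for &(lang, lines) in counts {
        *merged.entry(fold(lang)).or_insert(0) += lines;
    }
    merged
}
```

Note that this only merges entries. It does nothing about a `.h` file in a C++ project being classified as a C header in the first place, which is the Godot case from earlier in this thread; that would need a content-based check.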

stale · 2021-04-02T17:55:50Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

aaronfranke · 2021-04-02T18:08:40Z

This issue still exists, though it is seen by the devs as low priority, so I'll probably have to bump this again later to please the stale bot. The problem still deserves some kind of solution eventually.

stale · 2021-07-01T19:15:32Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

aaronfranke · 2021-07-01T19:21:16Z

This issue still exists, though it is seen by the devs as low priority, so I'll probably have to bump this again later to please the stale bot. The problem still deserves some kind of solution eventually.

atluft · 2022-10-08T16:33:34Z

Incorrect detection of Verilog using tokei.

tokei uses the file extension *.vg for Verilog, while the more popular choice is *.v.
In tokei, *.v is defined as Coq; all of this is described in tokei issue 520.

When considering a new approach, please consider verilog file identification as a useful test case.
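As a test case, the `.v` collision could be resolved by looking at file contents instead of the extension alone. The sketch below is an assumption about how such a check might look, not something tokei provides: `classify_dot_v` and both keyword lists are made up for illustration and are far from exhaustive.

```rust
use std::cmp::Ordering;

/// Guesses whether a `.v` file is Verilog or Coq by counting
/// whitespace-separated tokens that match a few typical keywords.
/// Returns `None` when the counts tie (including when nothing matches).
fn classify_dot_v(source: &str) -> Option<&'static str> {
    // Hypothetical, deliberately tiny keyword lists.
    const VERILOG: [&str; 4] = ["module", "endmodule", "always", "wire"];
    const COQ: [&str; 4] = ["Theorem", "Lemma", "Proof.", "Qed."];

    let mut verilog = 0usize;
    let mut coq = 0usize;
    for token in source.split_whitespace() {
        if VERILOG.contains(&token) {
            verilog += 1;
        }
        if COQ.contains(&token) {
            coq += 1;
        }
    }

    match verilog.cmp(&coq) {
        Ordering::Greater => Some("Verilog"),
        Ordering::Less => Some("Coq"),
        Ordering::Equal => None,
    }
}
```

Matching is case-sensitive on purpose, so Coq's capitalized `Module` does not count as Verilog's `module`.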

spenserblack · 2022-10-08T19:19:43Z

Thanks for the report, @atluft.
I guess this also means that tokei can have a problem with V, if it is ever implemented.

spenserblack · 2022-10-08T19:29:16Z

@o2sh We might want to create a known-issues.md file, documenting workarounds (alternate extensions to use, tokei config file snippets, etc.).

o2sh · 2022-10-09T16:36:55Z

I'd be happy to do so, but do we actually have any workaround for this? 🤔

As far as I know, tokei still doesn't provide an option that allows users to override the extensions - as suggested here

spenserblack · 2022-10-10T13:22:30Z

Sorry, I incorrectly assumed that tokei allowed language overrides, but I guess that's not implemented yet.

Well, the only workaround that I know of is renaming all Verilog files to *.vg 😅

spenserblack · 2023-02-20T02:19:28Z

Coming back to a really old issue to document a potential solution:

It might be worth creating a new crate that acts as a wrapper for tokei. This wrapper would provide its own function for getting languages, adding the following:

heuristics using regexes
naive Bayesian classification using code samples
Basically, it should do what github-linguist does.

Also, it should probably re-export the rest of tokei's public interface to make usage easier.

Such a crate should probably be in a separate repository, as I anticipate releases occurring on a very different schedule from onefetch.

Additionally, this crate would probably need a lot of community support to provide the heuristics and code samples.

I might attempt to do this sometime, but I can't promise that it will be soon. If someone else wants to take this on, I'll be happy to help and discuss this further.
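To make the "naive Bayesian classification using code samples" idea concrete, here is a small sketch using whitespace tokenization and add-one (Laplace) smoothing. The `Classifier` type and its methods are hypothetical; this is neither linguist's classifier nor an existing crate's API.

```rust
use std::collections::{HashMap, HashSet};

/// Minimal multinomial naive Bayes over whitespace-separated tokens.
struct Classifier {
    token_counts: HashMap<String, HashMap<String, usize>>, // language -> token -> count
    totals: HashMap<String, usize>,                        // language -> total tokens
    samples: HashMap<String, usize>,                       // language -> number of samples
    vocabulary: HashSet<String>,                           // all tokens seen in training
}

impl Classifier {
    fn new() -> Self {
        Classifier {
            token_counts: HashMap::new(),
            totals: HashMap::new(),
            samples: HashMap::new(),
            vocabulary: HashSet::new(),
        }
    }

    /// Records one labeled code sample.
    fn train(&mut self, language: &str, sample: &str) {
        *self.samples.entry(language.to_string()).or_insert(0) += 1;
        let counts = self.token_counts.entry(language.to_string()).or_default();
        for token in sample.split_whitespace() {
            *counts.entry(token.to_string()).or_insert(0) += 1;
            *self.totals.entry(language.to_string()).or_insert(0) += 1;
            self.vocabulary.insert(token.to_string());
        }
    }

    /// Returns the language with the highest log-probability,
    /// or `None` if nothing has been trained. Ties are broken arbitrarily.
    fn classify(&self, text: &str) -> Option<String> {
        let total_samples: usize = self.samples.values().sum();
        let vocab = self.vocabulary.len() as f64;
        let mut best: Option<(f64, &String)> = None;
        for (language, &n) in &self.samples {
            // Log prior: share of training samples for this language.
            let mut score = (n as f64 / total_samples as f64).ln();
            let counts = &self.token_counts[language];
            let total = *self.totals.get(language).unwrap_or(&0) as f64;
            for token in text.split_whitespace() {
                let count = *counts.get(token).unwrap_or(&0) as f64;
                // Add-one smoothing so unseen tokens do not zero out the score.
                score += ((count + 1.0) / (total + vocab).max(1.0)).ln();
            }
            if best.map_or(true, |(s, _)| score > s) {
                best = Some((score, language));
            }
        }
        best.map(|(_, language)| language.clone())
    }
}
```

With only a handful of samples per language this is very unreliable, which is why such a crate would depend on the community contributing samples.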

o2sh · 2023-02-20T10:50:39Z

> that acts as a wrapper for tokei.

You mean exposing the same set of APIs? or is it gonna reuse some of tokei's code?

If I'm understanding correctly, the new project will be similar to github-linguist but implemented in Rust. Does that mean sacrificing some performance for better accuracy?

Regardless, it's definitely an intriguing challenge. If executed well, it could gain a lot of traction, especially given the current state/limitations of existing solutions 😢.

I'd be happy to help 👍

spenserblack · 2023-02-20T13:40:22Z

> You mean exposing the same set of APIs? or is it gonna reuse some of tokei's code?

Mostly exposing the same API. Basically a bunch of pub use. For the re-implementation of get_statistics, some code reuse might be necessary.

This could also be a fork of tokei. I was just thinking about "wrapping" tokei since AFAIK get_statistics is the only part we're not satisfied with. But maybe forking makes more sense so we wouldn't have to re-implement as much internal code 🤷

Also, since we're mentioning github-linguist, I should note that linguist actually analyzes the HEAD (or other rev) of a repository, not the filesystem, which is pretty cool. But I'm not sure if we'd want that since we display pending change stats.

So I guess the first question is: do we want to improve tokei for our purposes, or port linguist to Rust?

spenserblack · 2023-08-14T18:27:02Z

Hey everyone following this 👋

There's been a bit of discussion here, but to keep you all up to date: I went ahead and started a project called gengo that should be more linguist-like, to hopefully improve our language detection eventually. Unlike tokei, there can be file extension collisions, and gengo will try to pick the right language using heuristics. For example, for this comment, it would need to register ts as an XML file extension, and include a heuristic to be confident that the .ts file is actually XML.
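A rough sketch of the idea, assuming a lookup that may return several candidate languages per extension and a content check that picks among them. The `candidates` and `detect` functions and the specific checks are made up for illustration; this is not gengo's actual API or data.

```rust
/// Candidate languages for an extension. More than one entry means
/// the extension collides and a heuristic has to decide.
fn candidates(extension: &str) -> &'static [&'static str] {
    match extension {
        "ts" => &["TypeScript", "XML"],
        "rs" => &["Rust"],
        _ => &[],
    }
}

/// Picks a language from the extension, falling back to a content
/// heuristic when the extension is ambiguous.
fn detect(extension: &str, contents: &str) -> Option<&'static str> {
    let langs = candidates(extension);
    match langs {
        [] => None,
        [only] => Some(*only),
        _ => match extension {
            "ts" => {
                // Hypothetical heuristic: XML files (for example Qt translation
                // files) typically start with an XML declaration or a <TS> root.
                let head = contents.trim_start();
                if head.starts_with("<?xml")
                    || head.starts_with("<!DOCTYPE TS")
                    || head.starts_with("<TS")
                {
                    Some("XML")
                } else {
                    Some("TypeScript")
                }
            }
            // No heuristic registered: fall back to the first candidate.
            _ => langs.first().copied(),
        },
    }
}
```

The same shape would apply to `.h`: register both C and C++ as candidates, then add a heuristic that looks for C++-only constructs.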

But right now, gengo doesn't support nearly enough languages. While I can just grab the data from linguist (and maybe I eventually will), right now I'm hoping that language support grows more organically, with discussion for each added language. So if you'd like to contribute, please do! I'll definitely need help with languages that I'm unfamiliar with, especially when it comes to adding heuristics, for example for C and C++ .h header files.

Edit: See spenserblack/gengo#34

o2sh closed this as completed Mar 9, 2019

aaronfranke changed the title ~~Use GitHub Linguist to detect languages~~ Improve language detection system to recognize C++ headers Mar 10, 2019

o2sh reopened this Mar 10, 2019

aeter mentioned this issue Apr 10, 2019

Add assembly detection and ascii image #30

Merged

erikgaal added the feature request label Nov 10, 2019

stale bot added the wontfix label Aug 21, 2020

stale bot removed the wontfix label Aug 21, 2020

stale bot added the wontfix label Nov 20, 2020

stale bot removed the wontfix label Nov 20, 2020

o2sh added the discussion label Dec 30, 2020

stale bot added the wontfix label Apr 2, 2021

stale bot removed the wontfix label Apr 2, 2021

stale bot added the wontfix label Jul 1, 2021

o2sh added the pinned label Jul 1, 2021

stale bot removed the wontfix label Jul 1, 2021

o2sh added the upstream label Oct 9, 2022

o2sh removed the feature request label Mar 11, 2023

o2sh removed the pinned label Mar 19, 2023

spenserblack mentioned this issue May 1, 2023

Support analysis of tracked files #1033

Closed

spenserblack mentioned this issue Aug 16, 2023

Support XML spenserblack/gengo#43

Merged

This was referenced Aug 25, 2023

Switch to gengo #1152

Draft

Onefetch languages checklist spenserblack/gengo#34

Closed

spenserblack mentioned this issue Nov 12, 2023

Make it easier to discover that data languages (JSON, YAML) and prose (Markdown) are not reported by default #1208

Closed

spenserblack linked a pull request Apr 8, 2024 that will close this issue

Switch to gengo #1305

Open

3 tasks

o2sh unpinned this issue Sep 1, 2024
