Port new Tokeniser from Linguist #193

Open
bzz opened this issue Jan 28, 2019 · 2 comments

@bzz

bzz commented Jan 28, 2019

Part of #155.

Right now enry uses a content tokenization approach based on the regexps from Linguist before v5.3.2.

This issue is about enry producing the same results as the new, flex-based scanner introduced in github/linguist#3846.

This is important because it affects the accuracy of the Bayesian classifier, and the classifier tests in both projects make a strong assumption that all samples can be distinguished by a content classifier alone.

@bzz

bzz commented Feb 8, 2019

Linguist's tokenizer is defined by the flex grammar tokenizer.l.

1. Generating Go code from flex grammar

Go does have a limited port of flex, https://gitlab.com/cznic/golex, but it is missing 2 features that are needed in order to use it with the above definition:

  • Trailing context (re1/re2).
  • All flex %-prefixed options except %s and %x.

(reproduction instructions and the resulting logs follow)

wget https://raw.githubusercontent.com/github/linguist/master/ext/linguist/tokenizer.l
go get -u modernc.org/golex
golex -o lex.go tokenizer.l

tokenizer.l:35:1: unknown %option "never-interactive yywrap reentrant nounput warn nodefault header-file=\"lex.linguist_yy.h\" extra-type=\"struct tokenizer_extra *\" prefix=\"linguist_yy\""
tokenizer.l:87:16 - "\<[[:alnum:]_!.^/?-]+              {" - trailing context not supported
tokenizer.l:103:15 - "[[:alnum:]_.@#^/*]+                {" - trailing context not supported

At this point it is hard to estimate the effort of adding those features upstream.

2. Porting lexer grammar to Ragel

An instructive go-nuts thread on this subject suggests trying a slightly more complex solution, similar to the discussion in #167, based on Ragel, another FSM generator that can be "compiled" to Go code. That would only require porting 1 file, .l -> .rl, which is a much more manageable effort.

3. Using flex-generated native lexer through the cgo

Hidden behind a build tag, this option directly uses the same native, flex-generated tokenizer from Linguist. This is low-hanging fruit, as it does not require much porting effort, and it is the simplest way to verify the classifier accuracy hypothesis from #194.

@bzz

bzz commented Apr 8, 2019

#193 (comment) updated to include another option: using the existing flex-based tokenizer through cgo.
