Port new Tokeniser from Linguist #193

bzz · 2019-01-28T10:10:52Z

Part of #155.

Right now enry uses a content tokenization approach based on the regexps that Linguist used before v5.3.2.

This issue is about enry producing the same results as the new, flex-based scanner introduced in github/linguist#3846.

This is important because it affects Bayesian classifier accuracy, and the classifier tests in both projects make a strong assumption that all samples can be distinguished by the content classifier alone.

bzz · 2019-02-08T21:34:44Z

The Linguist tokenizer is defined by a flex grammar, tokenizer.l.

1. Generating Go code from flex grammar

Go does have a limited port of flex in https://gitlab.com/cznic/golex, but it is missing 2 features needed to use it with the above definition:

Trailing context (re1/re2).
All flex % prefixed options except %s and %x.

(reproduction instructions and logs follow)

wget https://raw.githubusercontent.com/github/linguist/master/ext/linguist/tokenizer.l
go get -u modernc.org/golex
golex -o lex.go tokenizer.l

tokenizer.l:35:1: unknown %option "never-interactive yywrap reentrant nounput warn nodefault header-file=\"lex.linguist_yy.h\" extra-type=\"struct tokenizer_extra *\" prefix=\"linguist_yy\""
tokenizer.l:87:16 - "\<[[:alnum:]_!.^/?-]+              {" - trailing context not supported
tokenizer.l:103:15 - "[[:alnum:]_.@#^/*]+                {" - trailing context not supported

At this point it is hard to estimate the effort of adding those features upstream.
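For readers unfamiliar with trailing context: a flex rule written as re1/re2 matches re1 only when it is followed by re2, and re2 is not consumed. Go's regexp package has no lookahead, so the sketch below approximates this by matching both parts and keeping only the first capture group. The `trailingContext` type and the identifier-before-parenthesis rule are hypothetical and are not taken from tokenizer.l, and this is not how golex or flex implement the feature.

```go
package main

import (
	"fmt"
	"regexp"
)

// trailingContext approximates a flex rule of the form re1/re2:
// re1 matches only when followed by re2, and re2 is left unconsumed.
type trailingContext struct {
	re *regexp.Regexp
}

// newTrailingContext anchors the combined pattern at the start of the input
// and wraps re1 in a capture group so its end offset can be recovered.
func newTrailingContext(re1, re2 string) *trailingContext {
	re := regexp.MustCompile(`^(` + re1 + `)(?:` + re2 + `)`)
	// Leftmost-longest matching is closer to how flex picks a match.
	re.Longest()
	return &trailingContext{re: re}
}

// match returns how many bytes of input belong to the re1 part,
// or -1 when the rule does not match at the start of input.
func (tc *trailingContext) match(input string) int {
	loc := tc.re.FindStringSubmatchIndex(input)
	if loc == nil {
		return -1
	}
	return loc[3] // end offset of capture group 1 (the re1 part)
}

func main() {
	// Hypothetical rule: an identifier, only when followed by "(".
	tc := newTrailingContext(`[A-Za-z_]+`, `\(`)
	fmt.Println(tc.match("foo(bar)")) // 3: "foo" is consumed, "(" is not
	fmt.Println(tc.match("foo bar"))  // -1: no "(" after the identifier
}
```

A scanner using this helper would advance by the returned length only, so the trailing part stays available for the next rule.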

2. Porting lexer grammar to Ragel

An instructive go-nuts thread on this subject suggests it is worth trying a slightly more complex solution, similar to the discussion in #167, based on Ragel, another FSM generator that can be "compiled" to Go code. That would only require porting 1 file, .l -> .rl, which is a much more manageable effort.

3. Using flex-generated native lexer through the cgo

Hidden behind a compilation tag, this option uses the same native, flex-generated tokenizer from Linguist directly. This is low-hanging fruit, as it does not require much porting effort, and it is the simplest way to verify the hypothesis about classifier accuracy from #194.

bzz · 2019-04-08T14:38:52Z

#193 (comment) updated to include another option: using the existing flex-based tokenizer through cgo.

bzz added the enhancement label Jan 28, 2019

bzz mentioned this issue Jan 28, 2019

Bayesian classifier cann't distinguish "SQL" vs "PLpgSQL" #194

Open

smola mentioned this issue Jan 28, 2019

Sync with github/linguist #155

Open

4 tasks

bzz mentioned this issue Mar 15, 2019

Breakdown of django/django is different from Linguist #204

Closed

bzz added this to the v1.8.0 milestone Mar 15, 2019

creachadair mentioned this issue Apr 4, 2019

CLI: sync report logic \w Linguist #214

Merged

2 tasks

bzz self-assigned this Apr 8, 2019

bzz mentioned this issue Apr 8, 2019

New, optional flex-based tokenizer #218

Merged

smacker mentioned this issue Apr 11, 2019

Auto-detect language when code is pasted bblfsh/web#190

Open
