Just upgraded my deps for a project and some tests broke – managed to track it down to a difference in natural's SentenceTokenizer
I wrote this quick script to get words and sentences for a given string:
const natural = require("natural");

const wordTokenizer = new natural.WordTokenizer();
const sentenceTokenizer = new natural.SentenceTokenizer();

const tokenizeStuff = (input) => {
  const sentences = sentenceTokenizer.tokenize(input);
  const words = wordTokenizer.tokenize(input);
  return { sentences, words };
};

const testInput = `
This is some test content.
We're trying to figure out variations in versions of the package.
`.trim();

console.log(tokenizeStuff(testInput));
and then tried running it with 6.3.1 and then 6.5.0. This was the result:
james ~/Documents/natural-test
$ yarn add natural@6.3.1 && node test.js
...
{
  sentences: [
    'This is some test content.',
    "We're trying to figure out variations in versions of the package."
  ],
  words: [
    'This', 'is',
    'some', 'test',
    'content', 'We',
    're', 'trying',
    'to', 'figure',
    'out', 'variations',
    'in', 'versions',
    'of', 'the',
    'package'
  ]
}
james ~/Documents/natural-test
$ yarn add natural@6.5.0 && node test.js
...
{
  sentences: [
    'This is some test content.',
    "We're trying to figure out variations in versions of the",
    'package.'
  ],
  words: [
    'This', 'is',
    'some', 'test',
    'content', 'We',
    're', 'trying',
    'to', 'figure',
    'out', 'variations',
    'in', 'versions',
    'of', 'the',
    'package'
  ]
}
The sentence "We're trying to figure out variations in versions of the package." is incorrectly split before the last word in version 6.5.0, whereas the result produced by the previous version, 6.3.1, looks correct to me.
(Pretty sure the word tokenizer is unaffected; I only included it in my test while tracking down where the difference came from.)
As far as I can tell, this is a bug
Looks like the change was introduced in a7a8a23