Version 6.5.0 introduces a breaking change for sentence tokenizing #689

jmsv · 2023-06-18T01:00:27Z

I just upgraded my deps for a project and some tests broke. I managed to track it down to a difference in natural's SentenceTokenizer.

I wrote this quick script to get words and sentences for a given string:

const natural = require("natural");

const wordTokenizer = new natural.WordTokenizer();
const sentenceTokenizer = new natural.SentenceTokenizer();

const tokenizeStuff = (input) => {
  const sentences = sentenceTokenizer.tokenize(input);
  const words = wordTokenizer.tokenize(input);

  return { sentences, words };
};

const testInput = `
This is some test content.

We're trying to figure out variations in versions of the package.
`.trim();

console.log(tokenizeStuff(testInput));

I then ran it with 6.3.1 and with 6.5.0. These were the results:

james ~/Documents/natural-test 
$ yarn add natural@6.3.1 && node test.js
...

{
  sentences: [
    'This is some test content.',
    "We're trying to figure out variations in versions of the package."
  ],
  words: [
    'This',    'is',
    'some',    'test',
    'content', 'We',
    're',      'trying',
    'to',      'figure',
    'out',     'variations',
    'in',      'versions',
    'of',      'the',
    'package'
  ]
}

james ~/Documents/natural-test 
$ yarn add natural@6.5.0 && node test.js
...

{
  sentences: [
    'This is some test content.',
    "We're trying to figure out variations in versions of the",
    'package.'
  ],
  words: [
    'This',    'is',
    'some',    'test',
    'content', 'We',
    're',      'trying',
    'to',      'figure',
    'out',     'variations',
    'in',      'versions',
    'of',      'the',
    'package'
  ]
}

The sentence "We're trying to figure out variations in versions of the package." is incorrectly split before its last word in version 6.5.0, whereas the result produced by the previous version, 6.3.1, looks correct to me.

(Pretty sure the word tokenizer is unaffected; I only included it in my test while tracking down where the difference came from.)

As far as I can tell, this is a bug
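For anyone who wants to guard against this kind of split in their own tests, here is a minimal sketch. It does not call natural at all: `findUnterminatedSentences` is a hypothetical helper of mine, and the two arrays are just the sentence outputs copied from the runs above. It assumes every sentence in the input ends with `.`, `!` or `?` (optionally followed by a closing quote or bracket), which holds for my test input but not for text in general.

```javascript
// Hypothetical helper (not part of natural): returns the sentences that do
// not end in terminal punctuation. A sentence cut off mid-way, like the
// 6.5.0 output above, shows up here.
const findUnterminatedSentences = (sentences) =>
  sentences.filter((sentence) => !/[.!?]["')\]]*$/.test(sentence.trim()));

// Sentence arrays copied from the 6.3.1 and 6.5.0 runs above.
const from631 = [
  "This is some test content.",
  "We're trying to figure out variations in versions of the package.",
];
const from650 = [
  "This is some test content.",
  "We're trying to figure out variations in versions of the",
  "package.",
];

console.log(findUnterminatedSentences(from631));
// → []
console.log(findUnterminatedSentences(from650));
// → [ "We're trying to figure out variations in versions of the" ]
```

In a real test you would pass the result of `sentenceTokenizer.tokenize(input)` instead of the hard-coded arrays and assert that the returned list is empty.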

Looks like the change was introduced in a7a8a23


Hugo-ter-Doest · 2023-11-26T16:17:31Z

Fixed in #705

Hugo-ter-Doest · 2023-11-26T16:26:30Z

See also https://gist.github.com/Hugo-ter-Doest/4ed21fb7eb5077814d998fa61a726566 for a breakdown of the regular expression since it is quite unreadable in the source code.

Hugo-ter-Doest added a commit that referenced this issue Nov 26, 2023

Fixed issue #689

846e6aa

Hugo-ter-Doest added a commit that referenced this issue Nov 26, 2023

Fixes issue #689 (#705)

1b830b1

* Fixed issue #689 * Fixed indentation * Trailing whitespace

Hugo-ter-Doest closed this as completed Nov 26, 2023
