Version 6.5.0 introduces a breaking change for sentence tokenizing #689

Closed
jmsv opened this issue Jun 18, 2023 · 2 comments

jmsv commented Jun 18, 2023

Just upgraded my deps for a project and some tests broke. I managed to track it down to a difference in natural's SentenceTokenizer.

I wrote this quick script to get words and sentences for a given string:

const natural = require("natural");

const wordTokenizer = new natural.WordTokenizer();
const sentenceTokenizer = new natural.SentenceTokenizer();

const tokenizeStuff = (input) => {
  const sentences = sentenceTokenizer.tokenize(input);
  const words = wordTokenizer.tokenize(input);

  return { sentences, words };
};

const testInput = `
This is some test content.

We're trying to figure out variations in versions of the package.
`.trim();

console.log(tokenizeStuff(testInput));

I then ran it with 6.3.1 and then with 6.5.0. These were the results:

james ~/Documents/natural-test 
$ yarn add natural@6.3.1 && node test.js
...

{
  sentences: [
    'This is some test content.',
    "We're trying to figure out variations in versions of the package."
  ],
  words: [
    'This',    'is',
    'some',    'test',
    'content', 'We',
    're',      'trying',
    'to',      'figure',
    'out',     'variations',
    'in',      'versions',
    'of',      'the',
    'package'
  ]
}

james ~/Documents/natural-test 
$ yarn add natural@6.5.0 && node test.js
...

{
  sentences: [
    'This is some test content.',
    "We're trying to figure out variations in versions of the",
    'package.'
  ],
  words: [
    'This',    'is',
    'some',    'test',
    'content', 'We',
    're',      'trying',
    'to',      'figure',
    'out',     'variations',
    'in',      'versions',
    'of',      'the',
    'package'
  ]
}

The sentence "We're trying to figure out variations in versions of the package." is incorrectly split before the last word in version 6.5.0, whereas the result produced by the earlier version 6.3.1 seems accurate to me.

(pretty sure the word tokenizer is unaffected, I had just included it in my test when trying to track down where the difference came from)

As far as I can tell, this is a bug
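A regression check for this case could look roughly like the sketch below. The helper `splitsAsExpected` and the stand-in `naiveSentenceTokenize` are hypothetical names introduced for illustration. The stand-in is not natural's implementation; it only keeps the snippet runnable without the package installed.

```javascript
// Hypothetical stand-in tokenizer, NOT natural's implementation:
// splits on whitespace that follows ".", "!" or "?".
const naiveSentenceTokenize = (text) =>
  text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

// Returns true when the tokenizer yields exactly the expected sentences.
const splitsAsExpected = (tokenize, input, expected) => {
  const actual = tokenize(input);
  return (
    actual.length === expected.length &&
    actual.every((sentence, i) => sentence === expected[i])
  );
};

const input = `
This is some test content.

We're trying to figure out variations in versions of the package.
`.trim();

const expected = [
  "This is some test content.",
  "We're trying to figure out variations in versions of the package.",
];

// The output reported above for 6.5.0, wrapped as a tokenizer for comparison.
const reported650 = () => [
  "This is some test content.",
  "We're trying to figure out variations in versions of the",
  "package.",
];

console.log(splitsAsExpected(naiveSentenceTokenize, input, expected));
console.log(splitsAsExpected(reported650, input, expected));
```

With natural installed, passing `(text) => sentenceTokenizer.tokenize(text)` as the first argument should run the same check against the real tokenizer.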

Looks like the change was introduced in a7a8a23

Hugo-ter-Doest added a commit that referenced this issue Nov 26, 2023
* Fixed issue #689

* Fixed indentation

* Trailing whitespace

Hugo-ter-Doest (Collaborator) commented

Fixed in #705

Hugo-ter-Doest (Collaborator) commented

See also https://gist.github.com/Hugo-ter-Doest/4ed21fb7eb5077814d998fa61a726566 for a breakdown of the regular expression since it is quite unreadable in the source code.
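As a general illustration of why such a breakdown helps, a long pattern can be assembled from small commented pieces. The sketch below is an assumption-laden example: the pieces `terminator`, `closers` and `gap` are made up for illustration and are not the patterns used in natural or in the gist.

```javascript
// Illustrative pieces only; these are NOT the patterns used in natural.
const terminator = /[.!?]+/; // one or more sentence-ending marks
const closers = /["')\]]*/; // optional closing quotes or brackets
const gap = /\s+/; // whitespace separating two sentences

// Split on whitespace that follows a terminator plus optional closers.
const boundary = new RegExp(
  `(?<=${terminator.source}${closers.source})${gap.source}`
);

const splitSentences = (text) =>
  text.split(boundary).filter((s) => s.length > 0);

console.log(splitSentences('He said "stop." Then he left. Done?'));
```

Each piece can be read and tested independently, which is harder to do when the whole expression sits on one line.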
