Skip to content

Commit

Permalink
Merge pull request #6143 from RasaHQ/fix_empty_strings
Browse files Browse the repository at this point in the history
Fix empty strings as output of whitespace tokenizer
  • Loading branch information
rasabot authored Jul 6, 2020
2 parents dcb0f2d + 56f2f50 commit 8f51534
Show file tree
Hide file tree
Showing 3 changed files with 5 additions and 3 deletions.
1 change: 1 addition & 0 deletions changelog/6143.bugfix.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Prevent ``WhitespaceTokenizer`` from outputting empty list of tokens.
6 changes: 3 additions & 3 deletions rasa/nlu/tokenizers/whitespace_tokenizer.py
Original file line number Diff line number Diff line change
Expand Up @@ -70,11 +70,11 @@ def tokenize(self, message: Message, attribute: Text) -> List[Token]:
text,
).split()

words = [self.remove_emoji(w) for w in words]
words = [w for w in words if w]

# if we removed everything like smiles `:)`, use the whole text as 1 token
if not words:
words = [text]

words = [self.remove_emoji(w) for w in words]
words = [w for w in words if w]

return self._convert_words_to_tokens(words, text)
1 change: 1 addition & 0 deletions tests/nlu/tokenizers/test_whitespace_tokenizer.py
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,7 @@
),
(":)", [":)"], [(0, 2)]),
("Hi :-)", ["Hi"], [(0, 2)]),
("👍", ["👍"], [(0, 1)]),
],
)
def test_whitespace(text, expected_tokens, expected_indices):
Expand Down

0 comments on commit 8f51534

Please sign in to comment.