A dot added in dates #52

starkadur · 2024-10-07T15:01:54Z

If I send in "17 júní", the tokenizer returns "17. júní". Even though I use tokenize() (and not split_into_sentences()) and read the txt property (which should contain the original source text for the token), I still get this extra dot.

peturorri · 2024-10-07T15:40:31Z

I think you're looking for the original property of the tokens, not txt. See: https://github.com/mideind/Tokenizer/blob/master/src/tokenizer/tokenizer.py#L95

starkadur · 2024-10-08T15:44:24Z

Do all tokens have the original property? I always get an error when trying to access it:
txt = token.original
causes an error while
txt = token.txt
does not.

peturorri · 2024-10-09T17:09:04Z

They should all have original, although it can sometimes be None.

Can you provide a complete example of the code you're trying to run, along with the version of the tokenizer package you're using?
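
While narrowing this down, a defensive accessor may help tell apart the two cases mentioned above: a token object with no original attribute at all, and a token whose original is None. The following is only a minimal sketch. The NewTok and OldTok classes and the source_text helper are hypothetical stand-ins for illustration, not the tokenizer package's actual classes or API, and the sample values are made up.

```python
from collections import namedtuple

# Hypothetical stand-ins for token objects (NOT the tokenizer package's class).
# NewTok models a token that carries both txt and original;
# OldTok models a token object that has no original attribute at all.
NewTok = namedtuple("NewTok", ["txt", "original"])
OldTok = namedtuple("OldTok", ["txt"])


def source_text(token):
    """Return token.original when it exists and is not None, else token.txt."""
    # getattr with a default avoids AttributeError on objects lacking the attribute.
    original = getattr(token, "original", None)
    return original if original is not None else token.txt


# Illustrative values only: the normalized text has the dot, the original does not.
date_token = NewTok(txt="17. júní", original="17 júní")
print(source_text(date_token))
```

If a helper like this falls back to txt for every token, that would suggest the token objects in use do not expose original, which is why the package version matters.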
