FSATraversal may return NOT_FOUND instead of AUTOMATON_HAS_PREFIX #92
Comments
The way it works is fine: 'ax' returns no match because there was no such string in the dictionary. A prefix match is returned if your input string exists fully in the automaton but does not correspond to any automaton string (think 'abc' encoded in the automaton and an 'ab' query). You can always traverse the automaton yourself. The FSATraversal methods are merely utilities, so copy the code over and customize it to your needs. |
Sorry, maybe there's a misunderstanding. In both cases the query "ax" is not accepted by the automaton, but its prefix "a" is. |
In both of these cases no-match is the right result. I explained above what automaton_has_prefix means in this context, but look at the code and you'll see the semantics of these enums. |
Well, in the first case the actual result is MatchResult.AUTOMATON_HAS_PREFIX. |
Darn, apologies Steven -- you're right, something is wrong here, I'll dig. |
Much appreciated and no need to apologize! |
I looked at the code and, sure enough, it looks wrong: it returns AUTOMATON_HAS_PREFIX only when the arc in the automaton is terminal (that is, no longer sequence is encoded in the automaton and processing has to end). I don't see it used anywhere in the code (other than in a few irrelevant tests), and I wonder whether fixing it to work as it should is better than removing the constant altogether. For larger automata there will nearly always be some kind of matching prefix (even if it's a single-letter one). People who really need it can code a manual traversal, and those who do lookups will typically just need NOT_FOUND. What is your scenario? How did you come across it? |
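To make the outcomes under discussion concrete, here is a minimal sketch that classifies a query against a plain hand-rolled trie. It does not use morfologik's FSA or FSATraversal classes, and every name in it (`TrieMatchSketch`, `Kind`, `classify`) is hypothetical. The enum only approximates the semantics described in this thread, so treat it as one possible reading of the intended behavior rather than the library's definition.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a hand-rolled trie, not morfologik's automaton.
final class TrieMatchSketch {
    // Hypothetical outcome names; they only approximate the kinds discussed in the thread.
    enum Kind { EXACT, INPUT_IS_PREFIX, DICTIONARY_HAS_PREFIX, NONE }

    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean accepting; // true if a dictionary string ends at this node
    }

    static Node build(String... words) {
        Node root = new Node();
        for (String word : words) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.accepting = true;
        }
        return root;
    }

    static Kind classify(Node root, String input) {
        Node node = root;
        boolean acceptedPrefixSeen = false;
        for (int i = 0; i < input.length(); i++) {
            // The proper prefix input[0..i) ends at 'node'.
            if (node.accepting) {
                acceptedPrefixSeen = true;
            }
            Node child = node.next.get(input.charAt(i));
            if (child == null) {
                // Traversal is stuck: report whether any earlier prefix was accepted,
                // regardless of whether longer strings continue from that prefix.
                return acceptedPrefixSeen ? Kind.DICTIONARY_HAS_PREFIX : Kind.NONE;
            }
            node = child;
        }
        if (node.accepting) {
            return Kind.EXACT;
        }
        // Whole input consumed without reaching an accepting node.
        return node.next.isEmpty() ? Kind.NONE : Kind.INPUT_IS_PREFIX;
    }
}
```

Under this reading, the query "ax" yields the same outcome whether the dictionary is {"a"} or {"a", "ab"}, which is the consistency asked for in the report. When the input is fully consumed on a non-accepting node, the sketch reports `INPUT_IS_PREFIX` even if a shorter prefix was accepted; that priority is a design choice of the sketch.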
I was trying to find the longest accepted string in an untokenized input sequence. E.g.
This worked very well with a small test sample of license plate regions since I got a match of kind "AUTOMATON_HAS_PREFIX" and the first unmatched index in the input sequence. I completely agree that a custom implementation of FSATraversal will do the trick and I think that's a valid solution. I was merely curious to see if that behavior was intentional or not, since I found it to be inconsistent. |
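The custom traversal mentioned above could look roughly like the following sketch. It assumes a plain hand-rolled trie instead of morfologik's automaton, and the names `LongestPrefixSketch` and `longestAcceptedEnd` are hypothetical. The dictionary entries in the usage note are made-up codes, not real license plate regions.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a hand-rolled trie, not morfologik's automaton.
final class LongestPrefixSketch {
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean accepting; // true if a dictionary string ends at this node
    }

    static Node build(String... words) {
        Node root = new Node();
        for (String word : words) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.accepting = true;
        }
        return root;
    }

    /**
     * Returns the exclusive end index of the longest dictionary string that
     * starts at 'start' in 'input', or -1 if no dictionary string starts there.
     * The returned index is also the first unmatched position in the input.
     */
    static int longestAcceptedEnd(Node root, CharSequence input, int start) {
        Node node = root;
        int end = -1;
        for (int i = start; i < input.length(); i++) {
            node = node.next.get(input.charAt(i));
            if (node == null) {
                break; // no arc for this character, stop and keep the best match so far
            }
            if (node.accepting) {
                end = i + 1; // remember the longest accepted prefix seen so far
            }
        }
        return end;
    }
}
```

With the made-up dictionary {"M", "MA"}, calling `longestAcceptedEnd(root, "MAB1234", 0)` returns 2, so the caller can split the input into "MA" and the remainder "B1234".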
Thanks for filing the bug report, and sorry again for the hasty response. This code hasn't been touched for years, and it's funny that things like this go unnoticed for so long (it means nobody actually used this stuff!). I think I'll fix AUTOMATON_HAS_PREFIX to actually work as expected and correct the JavaDocs on that class. I'll need to review the use cases in existing code (not just in morfologik-stemming, but in other places as well) and I'm currently on a short holiday, so I'll get back to it next week at the earliest. If you can temporarily copy/paste the traversal routine into your own code, you'll have full control over how it works. |
I can definitely do the traversal in our own code, and since I'm still in the experimentation phase I'm not in a hurry either. Thank you for your time and your work! |
Fixed and released, thanks Steven. |
Hello,
I came across some behavior I find unexpected, but I am unsure if it is indeed unintended or not.
Here's a unit test I created using release 2.1.3 of morfologik-fsa and morfologik-fsa-builders:
The test "expected" is successful ("green"), but the test "unexpected" fails ("red"), because match.kind is "NO_MATCH".
Is "NO_MATCH" the intended result or is there something wrong with my test?