Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FSATraversal may return NOT_FOUND instead of AUTOMATON_HAS_PREFIX #92

Closed
stevendolg opened this issue Jul 15, 2017 · 11 comments
Closed
Assignees
Labels
Milestone

Comments

@stevendolg
Copy link

Hello,

I came across some behavior I find unexpected, but I am unsure if it is indeed unintended or not.
Here's a unit test I created using release 2.1.3 of morfologik-fsa and morfologik-fsa-builders:

import static org.junit.Assert.assertEquals;

import java.util.ArrayList;
import java.util.List;

import org.junit.Test;

import morfologik.fsa.FSA;
import morfologik.fsa.FSATraversal;
import morfologik.fsa.MatchResult;
import morfologik.fsa.builders.FSABuilder;

public class MorfologikTest {

    @Test
    public void expected() throws Exception {
        List<byte[]> inputs = new ArrayList<>();
        inputs.add("a".getBytes("UTF-8"));

        FSA fsa = FSABuilder.build(inputs);

        FSATraversal fsaTraversal = new FSATraversal(fsa);
        MatchResult match = fsaTraversal.match("ax".getBytes("UTF-8"));

        assertEquals(MatchResult.AUTOMATON_HAS_PREFIX, match.kind);
    }

    @Test
    public void unexpected() throws Exception {
        List<byte[]> inputs = new ArrayList<>();
        inputs.add("a".getBytes("UTF-8"));
        inputs.add("ab".getBytes("UTF-8"));

        FSA fsa = FSABuilder.build(inputs);

        FSATraversal fsaTraversal = new FSATraversal(fsa);
        MatchResult match = fsaTraversal.match("ax".getBytes("UTF-8"));

        assertEquals(MatchResult.AUTOMATON_HAS_PREFIX, match.kind);
    }
}

The test "expected" is successful ("green"), but the test "unexpected" fails ("red"), because match.kind is "NO_MATCH".

Is "NO_MATCH" the intended result or is there something wrong with my test?

@dweiss
Copy link
Member

dweiss commented Jul 15, 2017

The way it works is fine -- 'ax' returns no match because there was no such string in the dictionary. Prefix match is returned if your input string exists fully in the automaton, but does not correspond to any automaton string (think 'abc' encoded in the automaton and 'ab' query).

You can always traverse the automaton yourself -- FSATraversal are merely utilities, copy over the code and customize to your needs.

@dweiss dweiss closed this as completed Jul 15, 2017
@stevendolg
Copy link
Author

Sorry, maybe there's a misunderstanding.

In both cases the query "ax" is not accepted by the automaton, but its prefix "a" is.
While in the first test the matching result is "AUTOMATON_HAS_PREFIX", in the second test it is "NO_MATCH".
Shouldn't both cases produce the same matching result, since "a" is a prefix of "ax" and "a" is accepted by the automaton?

@dweiss
Copy link
Member

dweiss commented Jul 16, 2017

In both of these cases no-match is the right result. I explained above what automaton_has_prefix means in this context, but look at the code and you'll see the semantics of these enums.

@stevendolg
Copy link
Author

Well in the first case the actual result is MatchResult.AUTOMATON_HAS_PREFIX

@dweiss dweiss reopened this Jul 16, 2017
@dweiss
Copy link
Member

dweiss commented Jul 16, 2017

Darn, apologies Steven -- you're right, something is wrong here, I'll dig.

@stevendolg
Copy link
Author

Much appreciated and no need to apologize!

@dweiss
Copy link
Member

dweiss commented Jul 16, 2017

I looked at the code and to sure it looks wrong... it returns AUTOMATON_HAS_PREFIX only when the arc in the automaton is terminal (that is: there was no longer sequence encoded in the automaton and the processing needs to end).

I don't see it used anywhere in the code (other than a few irrelevant tests) and I wonder whether fixing this to actually work as it should is better than just removing the whole constant altogether... For larger automata there will nearly always be some kind of matching prefix (even if it's a single-letter one)... people who really need it can code manual traversal and those who do lookups will typically just need NOT_FOUND.

What is your scenario? How did you come across it?

@dweiss dweiss self-assigned this Jul 16, 2017
@dweiss dweiss added the bug label Jul 16, 2017
@dweiss dweiss added this to the 2.1.4 milestone Jul 16, 2017
@dweiss dweiss changed the title Unexpected NO_MATCH? FSATraversal may return NOT_FOUND instead of AUTOMATON_HAS_PREFIX Jul 16, 2017
@stevendolg
Copy link
Author

I was trying to find the longest accepted string in an untokenized input sequence.

E.g.

  • accept regions of license plates in Austria: "Wels", "Wien", "Salzburg", ...
  • find the longest match at the start of an actual license plate, e.g. "Wien 12345 AA"

This worked very well with a small test sample of license plate regions since I got a match of kind "AUTOMATON_HAS_PREFIX" and the first unmatched index in the input sequence.
However when I completed the list of regions and added "Wels Umgebung", "Wien Umgebung", "Salzburg Umgebung" this suddenly stopped working and I got "NO_MATCH" instead.

I completely agree that a custom implementation of FSATraversal will do the trick and I think that's a valid solution. I was merely curious to see if that behavior was intentional or not, since I found it to be inconsistent.

@dweiss
Copy link
Member

dweiss commented Jul 16, 2017

Thanks for filing the bug report and sorry again for hasty response -- this code hasn't been touched for years and it's funny things like that go unnoticed for so long (means nobody actually used this stuff!).

I think what I'll do is I'll fix the AUTOMATON_HAS_PREFIX to actually work as expected and correct the JavaDocs on that class. I'll need to review the use cases in existing code (not just in morfologik-stemming, but in other places as well) and I'm currently on short holidays, so I'll go back to it next week at the earliest. If you can temporarily copy/paste the traversal routine into your own code you'll have full control over how it works.

@stevendolg
Copy link
Author

I can definitely do the traversal in our own code and since I'm still in the experimentation phase I'm not in a hurry either.

Thank you for your time and your work!

@dweiss dweiss closed this as completed in 2168689 Jul 31, 2017
@dweiss
Copy link
Member

dweiss commented Jul 31, 2017

Fixed and released, thanks Steven.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants