Library usage questions... #104
Comments
Hi Thierry, glad to hear you found the pyahocorasick module useful. There's no way to limit the number of hits; you have to do this in your Python code. But it seems to be an interesting feature. Would you open a feature request for this? The automaton will always return all occurrences of strings from the set, even when they overlap. I think the approach from #21 is what you need, but it's not ready. It seems I'll need to get back to this, as you're another person who has requested it. |
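For illustration, limiting the number of hits on the Python side could look roughly like the sketch below. It assumes the matches arrive as an iterator of `(end_index, value)` pairs, which is what `Automaton.iter(text)` yields; the helper name `first_hits` and the sample data are made up for this example, and a plain list stands in for the automaton so the snippet runs without the library installed.

```python
from itertools import islice


def first_hits(matches, limit):
    """Return at most `limit` items from an iterator of matches."""
    # islice stops pulling from the iterator once `limit` items were taken,
    # so the rest of the text is never scanned.
    return list(islice(matches, limit))


# Stand-in for automaton.iter("Advanced Python Book"); the values are
# assumed to be the matched words themselves.
sample = iter([(14, "Advanced Python"), (14, "Python"), (19, "Python Book")])

print(first_hits(sample, 2))
# [(14, 'Advanced Python'), (14, 'Python')]

# With pyahocorasick itself this would presumably be:
#     first_hits(automaton.iter(text), 2)
```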
Thanks for your reply! |
@tflorac thank you, I always need help, as my day usually has 24h only :) |
I'll complete my initial question with another one... |
@tflorac I'm afraid there is no such rule. The automaton works in such a way that when it matches the last character of some word, it can also figure out other words with the same suffix. But there's no notion of "the first character of a match". BTW, what should be highlighted if you have the set ["Advanced Python", "Python Book"] and the string is "Advanced Python Book"? There is a partial overlap. |
@WojciechMula: well... so at first I'm going to follow a simple process where, if two (or more) terms match at the same position, the first one wins and is selected! I don't actually have any rule to answer your question for the use case you give. I can't define any priority between the three terms, except by giving a "weight" to each term (statically or, for example, based on their respective length)... :-/ |
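A rough sketch of the "one winner per position" idea with length used as the weight is given below. It assumes each match is an `(end_index, word)` pair, as reported by `Automaton.iter` when the stored value is the word itself; `to_spans` and `one_per_start` are hypothetical helper names, not part of pyahocorasick.

```python
def to_spans(matches):
    """Turn (end_index, word) pairs into (start, end, word), end inclusive."""
    return [(end - len(word) + 1, end, word) for end, word in matches]


def one_per_start(spans):
    """Keep one span per start position: the longest; the first seen wins ties."""
    best = {}
    for start, end, word in spans:
        if start not in best or end > best[start][1]:
            best[start] = (start, end, word)
    return [best[start] for start in sorted(best)]


# Hand-written matches for the text "Advanced Python Book".
matches = [
    (7, "Advanced"),
    (14, "Advanced Python"),
    (14, "Python"),
    (19, "Python Book"),
]

print(one_per_start(to_spans(matches)))
# [(0, 14, 'Advanced Python'), (9, 19, 'Python Book')]
```

Note that the two surviving spans still overlap partially, so this rule alone does not settle the "Advanced Python Book" case; a further priority rule would be needed for that.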
I also came here to request exactly this... I think the functionality "keep the longest match" will be very well received :) @WojciechMula Oh, and thanks for being so quick in responding to people's requests, I'm having a great time. I'm building a library on top of this: I had pretty much finished it, but only now realized the overlap "issue" :( In my case I want to add 2 strings and count on the fact that only the longest one will be kept. |
I think this works as Python logic, but it would be awesome if we could get the speed from C, as I think this will be really slow:
In my case it seems that incorporating this code yields a ~15% slowdown. |
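As an illustration only (this is not the snippet from the comment above, and not something pyahocorasick provides), a pure-Python "keep the longest match" filter might be sketched as follows. It assumes matches have already been converted to `(start, end, word)` spans with an inclusive `end`; the function name `keep_longest` is hypothetical.

```python
def keep_longest(spans):
    """Greedy filter: longer spans win, overlapping shorter ones are dropped."""
    chosen = []
    # Longest first (start - end is the negated length minus one);
    # among equally long spans, the leftmost comes first.
    for start, end, word in sorted(spans, key=lambda s: (s[0] - s[1], s[0])):
        # Accept the span only if it lies entirely before or after
        # every span accepted so far.
        if all(end < c_start or start > c_end for c_start, c_end, _ in chosen):
            chosen.append((start, end, word))
    return sorted(chosen)


# Hand-written spans for the text "Advanced Python Book".
spans = [
    (0, 14, "Advanced Python"),
    (9, 14, "Python"),
    (9, 19, "Python Book"),
    (16, 19, "Book"),
]

print(keep_longest(spans))
# [(0, 14, 'Advanced Python'), (16, 19, 'Book')]
```

The overlap check compares every candidate with all accepted spans, so the cost grows quadratically with the number of matches; that is acceptable for a sketch but is one reason a filter like this could be slow in pure Python.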
Hi,
I'm just starting to discover and use your library to extract terms defined in a thesaurus from an input text, and "highlight" them in an HTML output.
It works quite well and quickly, but I still have a few issues:
Many thanks for any advice!
Best regards,
Thierry