
Analyzer - multiple languages and nlp engines #312

Merged
merged 16 commits into microsoft:master on Jul 22, 2020

Conversation

dhpollack
Contributor

Initially this was my attempt to use stanza, an NLP engine by Stanford. More
generally, it's an update that makes it easier to add NLP engines and custom
recognizers. Specifically, I standardized the format of the recognizers,
removed global variables where possible, and removed a lot of hard-coded
defaults.

I am thinking of using presidio for several non-English projects at work,
and these are some of the changes I made along the way.

Below is a list of the changes:

  • make spacy and/or stanza optional
    • remove the en_core_web_lg requirement from the install
  • allow predefined recognizers to take parameters
    • this makes it easy to use them as non-English recognizers
  • create config files for different NLP engines
  • create tests for stanza
  • make all spacy and stanza tests optional
  • create a Dockerfile for an anaconda-based image
    • the conda build of pytorch uses MKL and is much faster on CPU
  • completely rewrote the IBAN recognizer
    • the current version only recognizes IBANs when they are the entire
      string; this version finds IBANs inside sentences (see the sketch
      after this list)
  • fixed some tests
  • created a run.sh file so the Docker containers can be run without rebuilding them
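
To make the IBAN change concrete, here is a minimal, hypothetical illustration of the difference between whole-string matching and searching inside a sentence; the pattern below is deliberately simplified and is not the recognizer's actual code:

import re

# Simplified IBAN-like pattern, for illustration only; a real recognizer also
# validates per-country lengths and the mod-97 checksum.
IBAN_PATTERN = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")

text = "Please wire the funds to DE89370400440532013000 by Friday."

# Whole-string matching (old behavior): fails because of the surrounding words.
print(bool(IBAN_PATTERN.fullmatch(text)))                  # False

# Searching within the sentence (new behavior): finds the embedded IBAN.
print([m.group(0) for m in IBAN_PATTERN.finditer(text)])   # ['DE89370400440532013000']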

"Breaking" Changes:

  • I would like to use black, but it's not super friendly with pylint. My
    suggestion is to drop pylint and use black instead.
  • The default spacy model is en rather than en_core_web_lg, and no spacy
    models are downloaded by default. The idea is to let the user choose
    which models they want. For non-English users this saves a lot of time
    at installation, because you don't need to download a large spacy model
    that you aren't going to use (see the example below).
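
For example (the model name below is purely illustrative), a German-only user could install just the model they actually need after installing presidio-analyzer:

import spacy
from spacy.cli import download

download("de_core_news_sm")          # illustrative model choice
nlp = spacy.load("de_core_news_sm")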

Relevant Issues:
#302

Signed-off-by: David Pollack d.pollack@solvemate.com

@omri374
Contributor

omri374 commented Apr 22, 2020

Hi @dhpollack, thanks for your awesome contribution! We'll look at it and add comments.
Some initial thoughts:

  1. Have you looked at the presidio-research repo for evaluating NLP models? We didn't look at stanza but we did compare spaCy to Flair and CRF models. We'd like to make sure that Stanza is a. fast enough and b. accurate enough. Flair, for example, was extremely accurate but very slow, and CRF was very fast but less accurate.
  2. Have you considered using spacy-stanza?

@dhpollack
Contributor Author

> Hi @dhpollack, thanks for your awesome contribution! We'll look at it and add comments.
> Some initial thoughts:
>
> 1. Have you looked at the [presidio-research](https://github.com/microsoft/presidio-research/) repo for evaluating NLP models? We didn't look at stanza but we did compare spaCy to Flair and CRF models. We'd like to make sure that Stanza is a. fast enough and b. accurate enough. Flair, for example, was extremely accurate but very slow, and CRF was very fast but less accurate.
> 2. Have you considered using [spacy-stanza](https://spacy.io/universe/project/spacy-stanza)?

@omri374

  1. I looked at it, but only after I had started hacking away at the original library. I also wanted the anonymization of text more than the analyzer, so it was easier to use this library. Additionally, I was looking into eventually deploying a service via k8s or docker.

  2. I hadn't seen this. This looks like it would be easier to use. I'll definitely look further into it.

For my use case, speed was less of a priority than accuracy. Currently I'm only using a single-core instance to test things out and it was fast enough for my use case, but I am sure that on a GPU machine it would be much faster. My main goal was to see how difficult it would be to add models from different libraries to the analyzer (and ultimately the anonymization) and to make it a bit more user-friendly to add non-English pipelines. An ancillary goal was to play with stanza and see how easy it would be to use as an alternative to spacy.

Feel free to cherry-pick certain parts of this PR. It looks like there is another PR that adds another language (Hebrew), but I wanted to throw this out there as an alternative way of doing it. Instead of creating new Python classes for each language, one can get the same results with something like the following:

il_domain_recognizer = DomainRecognizer(
    pattern_groups=[("IL Domain()", u"(www.)?[\u05D0-\u05EA]{2,63}.co.il", 0.5)],
    supported_entity="IL_DOMAIN_NAME",
    supported_language="he",
)

Then you can even use config files for different languages / recognizers that are based on the predefined recognizers.

Another option for me would be to use the presidio-research repo and then reimplement the anonymization part in Python. I've just gotten started with this project, so I'll take a look into that as well.

@omri374
Contributor

omri374 commented Apr 22, 2020

One thing we had issues with regarding languages is the context phrases. Even in patterns like credit cards (which are international), the context words are still language-specific. One option would be to create a recognizer per language; the other is to provide a set of context phrases for each language.

What do you think would be the most practical and user friendly?

@dhpollack
Contributor Author

I made the context an input as well. I just didn't use it in the example because the PR with the example used the same context as the default.
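
As a hedged illustration (the import path, parameter names, and German context words below are assumptions based on this PR's description, not verified API):

from presidio_analyzer.predefined_recognizers import CreditCardRecognizer

# Hypothetical: reuse the international credit-card pattern, but supply
# German context words and mark the recognizer as German.
de_credit_card = CreditCardRecognizer(
    context=["kreditkarte", "kartennummer", "visa", "mastercard"],
    supported_language="de",
)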

@dhpollack
Contributor Author

I standardized the format of the predefined recognizers. If you look at the code you should be able to see what's configurable and what's not. But do you mean at run time, loading all the predefined recognizers for each language? I hadn't tackled that problem yet, but off the top of my head I would use a configuration file with a list of recognizers and load those, otherwise load the defaults, roughly like the sketch below.
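
A rough sketch of that idea (the config file name, format, and fields are hypothetical, not part of this PR):

import yaml
from presidio_analyzer.recognizer_registry.recognizer_registry import RecognizerRegistry

with open("recognizers.yaml") as f:              # hypothetical config file
    config = yaml.safe_load(f) or {}

registry = RecognizerRegistry()
for lang in ["en", "de"]:                        # illustrative language set
    recognizer_names = config.get(lang)          # e.g. ["CreditCardRecognizer", ...]
    if not recognizer_names:
        # No entry for this language: fall back to the predefined defaults.
        registry.load_predefined_recognizers([lang], "spacy")
    else:
        # ...instantiate and register only the listed recognizers here...
        pass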

@elyase

elyase commented Apr 22, 2020

@dhpollack I would be very interested in testing this PR. I installed the Python wheel from your PR and downloaded the spacy models, but when I try:

from presidio_analyzer import AnalyzerEngine
engine = AnalyzerEngine(default_language="de")
text = "My name is David and I live in Miami"
response = engine.analyze(correlation_id=0, text = text, entities=[], language='de', all_fields=True, score_threshold=0.5)

I am getting a:

2020-04-22 20:45:27 ERROR: Failed to get recognizers hash
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/elyase/Downloads/presidio/presidio-analyzer/presidio_analyzer/analyzer_engine.py", line 198, in analyze
    all_fields=all_fields)
  File "/Users/elyase/Downloads/presidio/presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py", line 113, in get_recognizers
    "No matching recognizers were found to serve the request.")
ValueError: No matching recognizers were found to serve the request.

Are you able to run that example?

@dhpollack
Contributor Author

dhpollack commented Apr 23, 2020

@elyase

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.recognizer_registry.recognizer_registry import RecognizerRegistry
from presidio_analyzer.nlp_engine.spacy_nlp_engine import SpacyNlpEngine

# Build the registry and NLP engine explicitly, then load the German recognizers.
registry = RecognizerRegistry()
nlp = SpacyNlpEngine({"de": "de"})
registry.load_predefined_recognizers(["de"], "spacy")

engine = AnalyzerEngine(registry=registry, nlp_engine=nlp, default_language="de")
text = "My name is David and I live in Miami"
response = engine.analyze(correlation_id=0, text=text, entities=[], language='de', all_fields=True, score_threshold=0.5)

I did the above and got a different error about not having a hash for the API store. I think this is because I don't have redis running on my machine. However, this is a simplified version of what app.py does. I've never tried this outside of app.py, so I'm not exactly sure how it should work.

@elyase

elyase commented Apr 23, 2020

@dhpollack that works. The hash error is fortunately just logging output. Thanks a lot!

@omri374
Contributor

omri374 commented Apr 23, 2020

We have a task for ignoring this error when running the analyzer in a standalone way.

Contributor

@omri374 omri374 left a comment

First, thanks for the great contribution. It makes the code much more readable and robust.
I would suggest looking at spacy-stanza. As it's maintained by Explosion AI, I would assume it's production ready. This would make the NLP engine (and other parts) much simpler to understand and maintain.
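
For reference, a minimal sketch of what using spacy-stanza directly looks like, assuming the spaCy 2.x-era spacy-stanza API that was current at the time (this is not the wrapper code in this PR):

import stanza
from spacy_stanza import StanzaLanguage

stanza.download("de")        # one-time model download
snlp = stanza.Pipeline(lang="de")
nlp = StanzaLanguage(snlp)   # Stanza pipeline behind a spaCy-style nlp object

doc = nlp("Angela Merkel wohnt in Berlin.")
print([(ent.text, ent.label_) for ent in doc.ents])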

Review comments (since resolved) on:
  • presidio-analyzer/Pipfile
  • presidio-analyzer/tests/test_stanza_recognizer.py
  • presidio-analyzer/tests/test_spacy_recognizer.py
  • presidio-analyzer/requirements.txt
  • presidio-analyzer/presidio_analyzer/recognizer_result.py
@dhpollack
Contributor Author

@omri374 What are your thoughts on dropping pylint in favor of black? It's also possible to continue using pylint, but there are a few things that black does that pylint complains about. I guess we could just ignore those in the pylintrc file. I started doing a bit of programming in Rust and C++ and found that standardized code formatters remove a lot of the cognitive energy that goes into silly things like "should I put these parameters each on their own line, all on one line, two per line, ...". I especially love it on other OSS projects that I've worked on.

@dhpollack dhpollack changed the title analyzer - multiple languages and nlp engines [WIP] analyzer - multiple languages and nlp engines Apr 24, 2020
@dhpollack
Contributor Author

Ok, now I'm using pytest marks to mark the tests that behave differently with the big model vs the small one. Also, the test times on my machine are 1.8s vs 6.3s vs 22s (spacy, spacy + stanza, spacy + stanza + en_core_web_lg). There is currently only one test that gets different results with en_core_web_lg vs en. I am using pytest --runslow to opt in to the large-model tests rather than skipping them.
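
For context, the standard pytest pattern behind a --runslow flag looks roughly like the conftest.py below (the marker and option names in the PR may differ slightly):

import pytest

def pytest_addoption(parser):
    parser.addoption("--runslow", action="store_true", default=False,
                     help="also run tests marked as slow")

def pytest_configure(config):
    config.addinivalue_line("markers", "slow: mark test as slow to run")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        return  # --runslow given: run everything, including slow tests
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)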

@omri374
Contributor

omri374 commented Apr 26, 2020

@dhpollack regarding Black, I'm not familiar enough with it. We'll add it to the backlog to investigate. Thanks for the suggestion!

@omri374
Contributor

omri374 commented Apr 30, 2020

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@microsoft microsoft deleted a comment from azure-pipelines bot May 2, 2020
@dhpollack
Contributor Author

@omri374 do you want to break out these changes separately yourself? If you cherry-pick certain changes but not all then it gets harder to merge / rebase into master. I maintain a separate branch internally, but I can keep this branch roughly up-to-date with that one.

Also I think you can run your CI pipeline and it should work.

@omri374
Contributor

omri374 commented May 12, 2020

@dhpollack I think we should consider spacy-stanza first. This is the main blocker for accepting this PR. Breaking it into sub-PRs or cherry-picking might require some manual work, but it would be easier to review. We need another reviewer to merge this.
Did you get a chance to try out spacy-stanza? If this is something you can introduce into this PR, I would be happy to work on breaking it into smaller PRs. WDYT?

@dhpollack
Contributor Author

Spacy-stanza is in there. I made a few significant changes since the original PR: spacy-stanza being one, and pytest-style tests being the other. But it might be good to get a fresh set of eyes on it and see what they think.

@omri374
Contributor

omri374 commented May 19, 2020

@dhpollack thanks for the updates. If you prefer that we go over this together over a meeting, feel free to reach out: presidio@microsoft.com

@omri374
Contributor

omri374 commented May 20, 2020

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@omri374
Contributor

omri374 commented May 21, 2020

@dhpollack CI fails on linting. Could you please verify that pylint and flake8 pass successfully?

@dhpollack
Contributor Author

> @dhpollack CI fails on linting. Could you please verify that pylint and flake8 pass successfully?

@omri374 done

David Pollack added 13 commits July 15, 2020 14:19
* made spacy required
* using spacy-stanza for stanza models
* refactor tests to use pytest
* make one test reliant on big model optional
All tests have been refactored to use pytest.  Previously, there was a
mix of unittest, pytest and miscellaneous global initializations.  This
commit moves everything to pytest.  There is now extensive use of
fixtures instead of global variables and parametrized tests instead of
duplicated code for each test.  The major difference is that
parametrized tests are not individually named.
This installs the big spacy model by default in the Docker image and the Azure
pipeline.
* add documentation and doc strings
* change yaml field names to be more logical
@dhpollack
Contributor Author

@omri374 I made the changes based on the review. I am getting a branch conflict error, but it might just be an issue with the GitHub frontend and my browser cache. It should have been fixed with this commit.

@omri374 omri374 self-requested a review July 19, 2020 06:59
@omri374
Contributor

omri374 commented Jul 19, 2020

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

…e_langs

* fix merge conflicts with documentation
@omri374
Contributor

omri374 commented Jul 20, 2020

@balteravishay @eladiw could you please review the non-python and non-md files and add your comments?

@omri374 omri374 merged commit e5fe414 into microsoft:master Jul 22, 2020
@omri374
Contributor

omri374 commented Jul 22, 2020

Merged! Thanks a lot @dhpollack !

@omri374 omri374 changed the title [WIP] analyzer - multiple languages and nlp engines Analyzer - multiple languages and nlp engines Sep 15, 2020