Merge pull request #110 from umcu/improvements

Improve logic for rule based entity matching
umcu · Jul 10, 2024 · 113b55f
2 parents adb0947 + 7b2175b

Showing 17 changed files with 528 additions and 325 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -14,14 +14,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 * Loading and exporting `InfoExtractionDataset` as dictionaries or JSON files
 * Metric support for multi-class qualifiers
 * Mantra GSC corpus for evaluation
+* In the `RuleBasedEntityMatcher`, option to add terms as a `dict` (in addition to `str`, `list` and `Term`)
+* In the `RuleBasedEntityMatcher`, option to add terms from dict (`add_terms_from_dict`), json (`add_terms_from_json`) or csv (`add_terms_from_csv`)
+* In the `Term` class, an option to override arguments that were not set
 
 ### Changed
 
 * Made the `default` field for `Qualifier` optional
 * `InfoExtractionDataset` and `InfoExtractionMetrics` use `Qualifier` objects for qualifiers rather than `dict`
-* `InfoExtractionDataset` and `InfoExtractionMetrics` no longer track or use qualifier defaults
+* :exclamation: `InfoExtractionDataset` and `InfoExtractionMetrics` no longer track or use qualifier defaults
 * Moved test cases to the data directory in a more open format, so they can be used by others
 * Made qualifiers optional for metrics in `Annotation`
+* Added a `normalize` method to `Normalizer`, so it can be used/tested directly
+* The logic for determining whether the `RuleBasedEntityMatcher` should internally use the phrase matcher or the matcher is simplified
+
+### Deprecated
+
+* :exclamation: The `create_concept_dict` method, which is now replaced by `add_terms_from_csv` in `RuleBasedEntityMatcher`
+* :exclamation: In the `RuleBasedEntityMatcher`, the `load_concepts` method, which is now replaced by `add_terms_from_dict` and `add_terms_from_json`
 
 ## 0.8.1 (2024-06-27)
 

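The changelog marks `load_concepts` as deprecated in favour of `add_terms_from_dict`, but it does not say how the old method behaves in the meantime. One common pattern, sketched here under that assumption, is to keep the old name as a thin wrapper that emits a `DeprecationWarning` and forwards to the new method. `MatcherSketch` is a hypothetical stand-in, not the clinlp `RuleBasedEntityMatcher`.

```python
import warnings


class MatcherSketch:
    """Hypothetical stand-in for an entity matcher; not the clinlp class."""

    def __init__(self) -> None:
        self.terms: dict[str, list] = {}

    def add_terms_from_dict(self, terms: dict[str, list]) -> None:
        # Merge the new terms into whatever was added before.
        for concept, concept_terms in terms.items():
            self.terms.setdefault(concept, []).extend(concept_terms)

    def load_concepts(self, concepts: dict[str, list]) -> None:
        # Old entry point: warn the caller, then delegate to the new method.
        warnings.warn(
            "load_concepts is deprecated, use add_terms_from_dict instead",
            DeprecationWarning,
            stacklevel=2,
        )
        self.add_terms_from_dict(concepts)


matcher = MatcherSketch()
matcher.add_terms_from_dict({"sepsis": ["sepsis", "lijnsepsis"]})
```

With this shape, existing callers keep working while the warning points them at the replacement.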
diff --git a/README.md b/README.md
@@ -49,7 +49,7 @@ nlp.add_pipe("clinlp_normalizer")
 nlp.add_pipe("clinlp_sentencizer")
 
 # Entities
-concepts = {
+terms = {
     "prematuriteit": [
         "preterm", "<p3", "prematuriteit", "partus praematurus"
     ],
@@ -62,7 +62,7 @@ concepts = {
 }
 
 entity_matcher = nlp.add_pipe("clinlp_rule_based_entity_matcher", config={"attr": "NORM", "fuzzy": 1})
-entity_matcher.load_concepts(concepts)
+entity_matcher.add_terms_from_dict(terms)
 
 # Qualifiers
 nlp.add_pipe("clinlp_context_algorithm", config={"phrase_matcher_attr": "NORM"})

diff --git a/docs/source/components.md b/docs/source/components.md
@@ -72,14 +72,14 @@ The sentencizer is a rule-based sentence boundary detector. It is designed to de
 | example | `nlp.add_pipe("clinlp_rule_based_entity_matcher")` |
 | requires | `-` |
 | assigns | `doc.spans['ents']` |
-| config options | `attr = "TEXT"` <br /> `proximity = 0` <br /> `fuzzy = 0` <br /> `fuzzy_min_len = 0` <br /> `pseudo = False` |
+| config options | `attr = "TEXT"` <br /> `proximity = 0` <br /> `fuzzy = 0` <br /> `fuzzy_min_len = 0` <br /> `pseudo = False` <br /> `resolve_overlap = False` <br /> `spans_key = 'ents'` |
 
 The `clinlp_rule_based_entity_matcher` component can be used for matching entities in text, based on a dictionary of known concepts and their terms/synonyms. It includes options for matching on different token attributes, proximity matching, fuzzy matching and non-matching pseudo/negative terms.
 
 The most basic example would be the following, with further options described below:
 
 ```python
-concepts = {
+terms = {
     "sepsis": [
         "sepsis",
         "lijnsepsis",
@@ -93,7 +93,7 @@ concepts = {
 }
 
 entity_matcher = nlp.add_pipe("clinlp_rule_based_entity_matcher")
-entity_matcher.load_concepts(concepts)
+entity_matcher.add_terms_from_dict(terms)
 ```
 
 ```{admonition} Spans vs ents
@@ -149,7 +149,7 @@ The settings above are described at the matcher level, but can all be overridden
 ```python
 from clinlp.ie import Term
 
-concepts = {
+terms = {
     "sepsis": [
         "sepsis",
         "lijnsepsis",
@@ -161,7 +161,7 @@ concepts = {
 }
 
 entity_matcher = nlp.add_pipe("clinlp_rule_based_entity_matcher", config={"attr": "NORM", "fuzzy": 1})
-entity_matcher.load_concepts(concepts)
+entity_matcher.add_terms_from_dict(terms)
 ```
 
 In the above example, by default the `NORM` attribute is used, and `fuzzy` is set to `1`. For the terms `early onset` and `late onset`, proximity matching is additionally set to `1`, on top of the matcher-level configuration of matching the `NORM` attribute with fuzzy matching. For the `EOS` and `LOS` abbreviations the `TEXT` attribute is used (so the matching is case sensitive), and fuzzy matching is disabled.
@@ -171,7 +171,7 @@ In the above example, by default the `NORM` attribute is used, and `fuzzy` is se
 On the term level, it is possible to add pseudo or negative patterns, for those phrases that need to be excluded. For example:
 
 ```python
-concepts = {
+terms = {
     "prematuriteit": [
         "prematuur",
         Term("prematuur ademhalingspatroon", pseudo=True),
@@ -186,7 +186,7 @@ In this case `prematuur` will be matched, but not in the context of `prematuur a
 Finally, if you need more control than literal phrases and terms as explained above, the entity matcher also accepts [`spaCy` patterns](https://spacy.io/usage/rule-based-matching#adding-patterns). These patterns do not respect any other configurations (like attribute, fuzzy, proximity, etc.):
 
 ```python
-concepts = {
+terms = {
     "delier": [
         Term("delier", attr="NORM"),
         Term("DOS", attr="TEXT"),
@@ -208,9 +208,41 @@ concepts = {
 }
 ```
 
-#### Concept dictionary from external source
+#### Adding concept sets
 
-External lists of concepts (e.g. from a medical thesaurus such as `UMLS`) can also be loaded directly from `csv` through the `create_concept_dict` function. Your `csv` should contain a combination of concept and phrase on each line, with optional columns to configure the `Term`-options described above (e.g. `attribute`, `proximity`, `fuzzy`). You may present the columns in any order, but make sure the names match the `Term` attributes. Any other columns are ignored. For example:
+External lists of concepts (e.g. from a medical thesaurus such as `UMLS`) can also be loaded directly from `JSON` or `csv`.
+
+##### Adding terms from json
+
+Terms from `JSON` can be added by using `add_terms_from_json`. Your json should have the following format:
+
+```json
+{
+    "terms": {
+        "concept_identifier": [
+            "term",
+            {
+                "phrase": "term",
+                "attr": "some_attr"
+            },
+            [
+                {
+                    "NORM": "term"
+                }
+            ]
+        ],
+        "next_concept_identifier": [
+            "other_term"
+        ]
+    }
+}
+```
+
+Each term can be presented as a `str` (direct phrase), `dict` (arguments directly passed to `clinlp.ie.Term`), or `list` (a `spaCy` pattern). Any top-level keys other than `terms` are ignored, so metadata (e.g. a description, authors, etc.) can be added.
+
+##### Adding terms from csv
+
+Terms from `csv` can be added through the `add_terms_from_csv` function. Your `csv` should contain a combination of concept and phrase on each line, with optional columns to configure the `Term`-options described above (e.g. `attr`, `proximity`, `fuzzy`). You may present the columns in any order, but make sure the names match the `Term` attributes. Any other columns are ignored. For example:
 
 | **concept** | **phrase** | **attr** | **proximity** | **fuzzy** | **fuzzy_min_len** | **pseudo** | **comment** |
 |--|--|--|--|--|--|--|--|
@@ -221,25 +253,6 @@ External lists of concepts (e.g. from a medical thesaurus such as `UMLS`) can al
 | veneus_infarct | veneus infarct | | | | | | |
 | veneus_infarct | VI | TEXT | | | | | |
 
-Will result in the following concept dictionary:
-
-```python
-{
-    "prematuriteit": [
-        "prematuriteit",
-        Term("<p3", proximity=1, fuzzy=1, fuzzy_min_len=2),
-    ],
-    "hypotensie": [
-        "hypotensie",
-        Term("bd verlaagd", proximity=1)
-    ],
-    "veneus_infarct": [
-        "veneus infarct",
-        Term("VI", attr="TEXT")
-    ]
-}
-```
-
 ## Qualification
 
 ### `clinlp_context_algorithm`

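The `csv` and `JSON` layouts documented in this diff carry the same information. As a rough sketch using only the standard library (this is not clinlp's own loader), the code below groups `csv` rows by concept and produces a document in the `{"terms": ...}` shape. The function name `csv_to_terms_document`, the type conversions, and the sample rows (modelled on the dictionary example that this diff removes) are assumptions for illustration.

```python
import csv
import io
import json

# Option columns; the names follow the table header in the docs.
STR_COLUMNS = ("attr",)
INT_COLUMNS = ("proximity", "fuzzy", "fuzzy_min_len")
BOOL_COLUMNS = ("pseudo",)


def csv_to_terms_document(csv_text: str) -> dict:
    """Group csv rows by concept into a {"terms": {...}} document (illustrative)."""
    terms: dict[str, list] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        options = {}
        for col in STR_COLUMNS:
            if row.get(col):
                options[col] = row[col]
        for col in INT_COLUMNS:
            if row.get(col):
                options[col] = int(row[col])
        for col in BOOL_COLUMNS:
            if row.get(col):
                options[col] = row[col].strip().lower() == "true"
        # Rows without options stay plain strings; the rest become dicts.
        # Unknown columns such as "comment" are simply never read.
        term = {"phrase": row["phrase"], **options} if options else row["phrase"]
        terms.setdefault(row["concept"], []).append(term)
    return {"terms": terms}


SAMPLE_CSV = """concept,phrase,attr,proximity,fuzzy,fuzzy_min_len,pseudo,comment
prematuriteit,prematuriteit,,,,,,
prematuriteit,<p3,,1,1,2,,
hypotensie,hypotensie,,,,,,
hypotensie,bd verlaagd,,1,,,,
veneus_infarct,veneus infarct,,,,,,
veneus_infarct,VI,TEXT,,,,,
"""

document = csv_to_terms_document(SAMPLE_CSV)
print(json.dumps(document, indent=4, ensure_ascii=False))
```

Each `dict` entry mirrors the keyword arguments of a `Term`, so `{"phrase": "VI", "attr": "TEXT"}` corresponds to `Term("VI", attr="TEXT")` in the Python examples.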
diff --git a/docs/source/getting_started.md b/docs/source/getting_started.md
@@ -103,7 +103,7 @@ First, we will add the `clinlp_rule_based_entity_matcher`, along with some sampl
 ```python
 from clinlp.ie import Term
 
-concepts = {
+terms = {
     "prematuriteit": [
         "preterm", "<p3", "prematuriteit", "partus praematurus"
     ],
@@ -120,7 +120,7 @@ entity_matcher = nlp.add_pipe(
     config={"attr": "NORM", "fuzzy": 1}
 )
 
-entity_matcher.load_concepts(concepts)
+entity_matcher.add_terms_from_dict(terms)
 ```
 
 The above code adds three concepts to be matched (`prematuriteit`, `hypotensie`, and `veneus_infarct`), along with synonyms to match. Additionally, it configures how the entity matcher performs the matching. Here, the entity matcher is configured to match against the `NORM` attribute by default, which it finds in the `Token.norm_` property that the `clinlp_normalizer` set earlier. The `fuzzy` parameter specifies how much the concept text and the real text can differ (based on the edit distance). Some settings are overruled at the `Term` level. For instance, the `proximity=1` parameter for `bd verlaagd` specifies that at most one token may be skipped between the words `bd` and `verlaagd`.
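
To make the `fuzzy` setting more concrete, here is a minimal sketch of a Levenshtein edit distance and a threshold check against it. It only illustrates the idea of "differs by at most `fuzzy` edits"; the actual matching is done by spaCy and may differ in detail. The helper names `edit_distance` and `within_fuzzy` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep on a match
                )
            )
        previous = current
    return previous[-1]


def within_fuzzy(term: str, token: str, fuzzy: int) -> bool:
    """True when the token is at most `fuzzy` edits away from the term."""
    return edit_distance(term, token) <= fuzzy


print(within_fuzzy("hypotensie", "hypotensi", fuzzy=1))  # True: one deletion
```

Under this plain Levenshtein definition, swapping two adjacent letters counts as two edits, so a typo such as `prematuritiet` would not match `prematuriteit` with a threshold of `1`.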

diff --git a/scripts/generate_clinlp_docs.py b/scripts/generate_clinlp_docs.py
@@ -28,7 +28,7 @@
     "betreft een Premature zuigeling",
 ]
 
-concepts = {
+terms = {
     "C0002871_anemie": [
         "anemie",
     ],
@@ -66,7 +66,7 @@ def get_model() -> Language:
         config={"attr": "NORM", "fuzzy": 1, "fuzzy_min_len": 8},
     )
 
-    entity_matcher.load_concepts(concepts)
+    entity_matcher.add_terms_from_dict(terms)
 
     nlp.add_pipe(
         "clinlp_context_algorithm",

diff --git a/scripts/generate_qualifier_regression_data.py b/scripts/generate_qualifier_regression_data.py
@@ -21,14 +21,14 @@ def get_model() -> Language:
     nlp.add_pipe("clinlp_sentencizer")
 
     # Entities
-    concepts = {
+    terms = {
         "named_entity": ["ENTITY"],
     }
 
     entity_matcher = nlp.add_pipe(
         "clinlp_rule_based_entity_matcher", config={"attr": "NORM"}
     )
-    entity_matcher.load_concepts(concepts)
+    entity_matcher.add_terms_from_dict(terms)
 
     # Qualifiers
     nlp.add_pipe("clinlp_context_algorithm", config={"phrase_matcher_attr": "NORM"})