Commit

Merge pull request #110 from umcu/improvements
Improve logic for rule based entity matching
vmenger authored Jul 10, 2024
2 parents adb0947 + 7b2175b commit 113b55f
Showing 17 changed files with 528 additions and 325 deletions.
12 changes: 11 additions & 1 deletion CHANGELOG.md
@@ -14,14 +14,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
* Loading and exporting `InfoExtractionDataset` as dictionaries or JSON files
* Metric support for multi-class qualifiers
* Mantra GSC corpus for evaluation
* In the `RuleBasedEntityMatcher`, option to add terms as a `dict` (in addition to `str`, `list` and `Term`)
* In the `RuleBasedEntityMatcher`, option to add terms from dict (`add_terms_from_dict`), json (`add_terms_from_json`) or csv (`add_terms_from_csv`)
* In the `Term` class, an option to override arguments that were not set

### Changed

* Made the `default` field for `Qualifier` optional
* `InfoExtractionDataset` and `InfoExtractionMetrics` use `Qualifier` objects for qualifiers rather than `dict`
* `InfoExtractionDataset` and `InfoExtractionMetrics` no longer track or use qualifier defaults
* :exclamation: `InfoExtractionDataset` and `InfoExtractionMetrics` no longer track or use qualifier defaults
* Moved test cases to data directory in more open format, so they can be used by others
* Made qualifiers optional for metrics in `Annotation`
* Added a `normalize` method to `Normalizer`, so it can be used/tested directly
* The logic for determining whether the `RuleBasedEntityMatcher` should internally use the phrase matcher or the matcher is simplified

### Deprecated

* :exclamation: The `create_concept_dict` method, which is now replaced by `add_terms_from_csv` in `RuleBasedEntityMatcher`
* :exclamation: In the `RuleBasedEntityMatcher`, the `load_concepts` method, which is now replaced by `add_terms_from_dict` and `add_terms_from_json`

## 0.8.1 (2024-06-27)

4 changes: 2 additions & 2 deletions README.md
@@ -49,7 +49,7 @@ nlp.add_pipe("clinlp_normalizer")
nlp.add_pipe("clinlp_sentencizer")

# Entities
concepts = {
terms = {
"prematuriteit": [
"preterm", "<p3", "prematuriteit", "partus praematurus"
],
@@ -62,7 +62,7 @@ concepts = {
}

entity_matcher = nlp.add_pipe("clinlp_rule_based_entity_matcher", config={"attr": "NORM", "fuzzy": 1})
entity_matcher.load_concepts(concepts)
entity_matcher.add_terms_from_dict(terms)

# Qualifiers
nlp.add_pipe("clinlp_context_algorithm", config={"phrase_matcher_attr": "NORM"})
69 changes: 41 additions & 28 deletions docs/source/components.md
@@ -72,14 +72,14 @@ The sentencizer is a rule-based sentence boundary detector. It is designed to de
| example | `nlp.add_pipe("clinlp_rule_based_entity_matcher")` |
| requires | `-` |
| assigns | `doc.spans['ents']` |
| config options | `attr = "TEXT"` <br /> `proximity = 0` <br /> `fuzzy = 0` <br /> `fuzzy_min_len = 0` <br /> `pseudo = False` |
| config options | `attr = "TEXT"` <br /> `proximity = 0` <br /> `fuzzy = 0` <br /> `fuzzy_min_len = 0` <br /> `pseudo = False` <br /> `resolve_overlap = False` <br /> `spans_key = 'ents'` |

The `clinlp_rule_based_entity_matcher` component can be used for matching entities in text, based on a dictionary of known concepts and their terms/synonyms. It includes options for matching on different token attributes, proximity matching, fuzzy matching and non-matching pseudo/negative terms.

The most basic example would be the following, with further options described below:

```python
concepts = {
terms = {
"sepsis": [
"sepsis",
"lijnsepsis",
@@ -93,7 +93,7 @@ concepts = {
}

entity_matcher = nlp.add_pipe("clinlp_rule_based_entity_matcher")
entity_matcher.load_concepts(concepts)
entity_matcher.add_terms_from_dict(terms)
```

```{admonition} Spans vs ents
@@ -149,7 +149,7 @@ The settings above are described at the matcher level, but can all be overridden
```python
from clinlp.ie import Term

concepts = {
terms = {
"sepsis": [
"sepsis",
"lijnsepsis",
@@ -161,7 +161,7 @@ concepts = {
}

entity_matcher = nlp.add_pipe("clinlp_rule_based_entity_matcher", config={"attr": "NORM", "fuzzy": 1})
entity_matcher.load_concepts(concepts)
entity_matcher.add_terms_from_dict(terms)
```

In the above example, by default the `NORM` attribute is used, and `fuzzy` is set to `1`. In addition, for the terms `early onset` and `late onset` proximity matching is set to `1`, in addition to matcher-level config of matching the `NORM` attribute and fuzzy matching. For the `EOS` and `LOS` abbreviations the `TEXT` attribute is used (so the matching is case sensitive), and fuzzy matching is disabled.
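The way term-level settings take precedence over matcher-level settings can be sketched in plain Python. This is only an illustration of the behaviour described above, not `clinlp`'s implementation; the helper `resolve_settings` and the dictionary layout are hypothetical.

```python
# Matcher-level configuration from the example above (attr and fuzzy are set,
# the remaining values are the documented defaults).
matcher_config = {
    "attr": "NORM",
    "proximity": 0,
    "fuzzy": 1,
    "fuzzy_min_len": 0,
    "pseudo": False,
}


def resolve_settings(term_args, defaults):
    """Hypothetical helper: values set on the term win over matcher defaults."""
    overrides = {key: value for key, value in term_args.items() if value is not None}
    return {**defaults, **overrides}


# "early onset" only sets proximity, so attr and fuzzy come from the matcher.
early_onset = resolve_settings({"proximity": 1}, matcher_config)

# "EOS" overrides both attr and fuzzy.
eos = resolve_settings({"attr": "TEXT", "fuzzy": 0}, matcher_config)

print(early_onset)
print(eos)
```

Settings that a term leaves unset fall back to the matcher-level value, which is why `early onset` is still matched on `NORM` with fuzzy matching enabled.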
@@ -171,7 +171,7 @@ In the above example, by default the `NORM` attribute is used, and `fuzzy` is se
On the term level, it is possible to add pseudo or negative patterns, for those phrases that need to be excluded. For example:

```python
concepts = {
terms = {
"prematuriteit": [
"prematuur",
Term("prematuur ademhalingspatroon", pseudo=True),
@@ -186,7 +186,7 @@ In this case `prematuur` will be matched, but not in the context of `prematuur a
Finally, if you need more control than literal phrases and terms as explained above, the entity matcher also accepts [`spaCy` patterns](https://spacy.io/usage/rule-based-matching#adding-patterns). These patterns do not respect any other configurations (like attribute, fuzzy, proximity, etc.):

```python
concepts = {
terms = {
"delier": [
Term("delier", attr="NORM"),
Term("DOS", attr="TEXT"),
@@ -208,9 +208,41 @@ concepts = {
}
```

#### Concept dictionary from external source
#### Adding concept sets

External lists of concepts (e.g. from a medical thesaurus such as `UMLS`) can also be loaded directly from `csv` through the `create_concept_dict` function. Your `csv` should contain a combination of concept and phrase on each line, with optional columns to configure the `Term`-options described above (e.g. `attribute`, `proximity`, `fuzzy`). You may present the columns in any order, but make sure the names match the `Term` attributes. Any other columns are ignored. For example:
External lists of concepts (e.g. from a medical thesaurus such as `UMLS`) can also be loaded directly from `JSON` or `csv`.

##### Adding terms from json

Terms from `JSON` can be added by using `add_terms_from_json`. Your json should have the following format:

```json
{
"terms": {
"concept_identifier": [
"term",
{
"phrase": "term",
"attr": "some_attr"
},
[
{
"NORM": "term"
}
]
],
"next_concept_identifier": [
"other_term"
]
}
}
```

Each term can be presented as a `str` (direct phrase), `dict` (arguments passed directly to `clinlp.ie.Term`), or `list` (a `spaCy` pattern). Any top-level keys other than `terms` are ignored, so metadata can be added (e.g. a description, authors, etc.).
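As a rough illustration of this format, the sketch below writes a small terms file with Python's standard `json` module and classifies each entry by its type. The file name `terms.json`, the `description` key, the pattern `delirant` and the helper `describe_term` are hypothetical. The call to `add_terms_from_json` is left as a comment, because its exact signature is an assumption here.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical example content; keys other than "terms" are metadata.
content = {
    "description": "Example term set",
    "terms": {
        "delier": [
            "delier",                           # str: direct phrase
            {"phrase": "DOS", "attr": "TEXT"},  # dict: arguments for Term
            [{"NORM": "delirant"}],             # list: spaCy pattern
        ],
    },
}


def describe_term(term):
    """Illustrative helper: classify a term entry by its JSON type."""
    if isinstance(term, str):
        return "phrase"
    if isinstance(term, dict):
        return "term arguments"
    if isinstance(term, list):
        return "spaCy pattern"
    raise TypeError(f"Unsupported term type: {type(term).__name__}")


# Write the file and read it back, as a loader would.
path = Path(tempfile.mkdtemp()) / "terms.json"
path.write_text(json.dumps(content, indent=2), encoding="utf-8")

loaded = json.loads(path.read_text(encoding="utf-8"))
kinds = [describe_term(term) for term in loaded["terms"]["delier"]]
print(kinds)

# Assumed usage, with the file path as the first argument:
# entity_matcher.add_terms_from_json(str(path))
```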

##### Adding terms from csv

Terms from `csv` can be added through the `add_terms_from_csv` function. Your `csv` should contain a combination of concept and phrase on each line, with optional columns to configure the `Term`-options described above (e.g. `attribute`, `proximity`, `fuzzy`). You may present the columns in any order, but make sure the names match the `Term` attributes. Any other columns are ignored. For example:

| **concept** | **phrase** | **attr** | **proximity** | **fuzzy** | **fuzzy_min_len** | **pseudo** | **comment** |
|--|--|--|--|--|--|--|--|
@@ -221,25 +253,6 @@ External lists of concepts (e.g. from a medical thesaurus such as `UMLS`) can al
| veneus_infarct | veneus infarct | | | | | | |
| veneus_infarct | VI | TEXT | | | | | |

Will result in the following concept dictionary:

```python
{
"prematuriteit": [
"prematuriteit",
Term("<p3", proximity=1, fuzzy=1, fuzzy_min_len=2),
],
"hypotensie": [
"hypotensie",
Term("bd verlaagd", proximity=1)
],
"veneus_infarct": [
"veneus infarct",
Term("VI", attr="TEXT")
]
}
```
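The mapping from `csv` rows to terms can be sketched with the standard `csv` module. This is a minimal illustration under stated assumptions, not how `clinlp` parses the file: the helper `terms_from_csv` and the type conversions are hypothetical, and configured terms are emitted as `dict` entries (the form accepted alongside `str`, `list` and `Term`) rather than `Term` objects, so the block runs without `clinlp` installed.

```python
import csv
import io
from collections import defaultdict

# Rows chosen to reproduce the concept dictionary described in this section.
CSV_TEXT = """concept,phrase,attr,proximity,fuzzy,fuzzy_min_len,pseudo,comment
prematuriteit,prematuriteit,,,,,,
prematuriteit,<p3,,1,1,2,,
hypotensie,hypotensie,,,,,,
hypotensie,bd verlaagd,,1,,,,
veneus_infarct,veneus infarct,,,,,,
veneus_infarct,VI,TEXT,,,,,
"""

# Assumed conversions for the optional Term columns; any other column
# (such as comment) is ignored.
CONVERTERS = {
    "attr": str,
    "proximity": int,
    "fuzzy": int,
    "fuzzy_min_len": int,
    "pseudo": lambda value: value.strip().lower() == "true",
}


def terms_from_csv(text):
    """Group rows by concept; empty cells mean the option is not set."""
    terms = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        options = {
            name: convert(row[name])
            for name, convert in CONVERTERS.items()
            if row.get(name)
        }
        if options:
            terms[row["concept"]].append({"phrase": row["phrase"], **options})
        else:
            terms[row["concept"]].append(row["phrase"])
    return dict(terms)


terms = terms_from_csv(CSV_TEXT)
print(terms)
```

Leaving a cell empty means the option is not set on the term, so the matcher-level configuration applies to it.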

## Qualification

### `clinlp_context_algorithm`
4 changes: 2 additions & 2 deletions docs/source/getting_started.md
@@ -103,7 +103,7 @@ First, we will add the `clinlp_rule_based_entity_matcher`, along with some sampl
```python
from clinlp.ie import Term

concepts = {
terms = {
"prematuriteit": [
"preterm", "<p3", "prematuriteit", "partus praematurus"
],
@@ -120,7 +120,7 @@ entity_matcher = nlp.add_pipe(
config={"attr": "NORM", "fuzzy": 1}
)

entity_matcher.load_concepts(concepts)
entity_matcher.add_terms_from_dict(terms)
```

The above code adds three concepts to be matched (`prematuriteit`, `hypotensie`, and `veneus_infarct`), along with synonyms to match. Additionally, it configures how the entity matcher performs the matching. Here, the entity matcher is configured to match against the `NORM` attribute by default, which it reads from the `Token.norm_` property that the `clinlp_normalizer` set earlier. The `fuzzy` parameter specifies how much the term and the actual text may differ (based on the edit distance). Some settings are overruled at the `Term` level. For instance, the `proximity=1` parameter for `bd verlaagd` specifies that at most one token may be skipped between the words `bd` and `verlaagd`.
4 changes: 2 additions & 2 deletions scripts/generate_clinlp_docs.py
@@ -28,7 +28,7 @@
"betreft een Premature zuigeling",
]

concepts = {
terms = {
"C0002871_anemie": [
"anemie",
],
@@ -66,7 +66,7 @@ def get_model() -> Language:
config={"attr": "NORM", "fuzzy": 1, "fuzzy_min_len": 8},
)

entity_matcher.load_concepts(concepts)
entity_matcher.add_terms_from_dict(terms)

nlp.add_pipe(
"clinlp_context_algorithm",
4 changes: 2 additions & 2 deletions scripts/generate_qualifier_regression_data.py
@@ -21,14 +21,14 @@ def get_model() -> Language:
nlp.add_pipe("clinlp_sentencizer")

# Entities
concepts = {
terms = {
"named_entity": ["ENTITY"],
}

entity_matcher = nlp.add_pipe(
"clinlp_rule_based_entity_matcher", config={"attr": "NORM"}
)
entity_matcher.load_concepts(concepts)
entity_matcher.add_terms_from_dict(terms)

# Qualifiers
nlp.add_pipe("clinlp_context_algorithm", config={"phrase_matcher_attr": "NORM"})