Refactor entity matching name cleaner to be more efficient #3953

katie-lamb · 2024-11-08T23:05:15Z

Overview

As part of the SEC to EIA record linkage development, I had to make some changes to the PUDL company name cleaning module to make it more efficient and useful. The code for this module was originally pulled from OS Climate's repo, but it was no longer maintained there. I didn't make significant changes when I pulled out that module, so it had some quirks and inefficiencies.

What problem does this address?

The name cleaner was very slow. It's still pretty slow on big datasets, but it is now about 3x faster than before.

What did you change?

Instead of using apply to run the regex replacement rules, I used pd.Series.replace so that the replacement is vectorized.
Removed some coupling in the cleaning rules and restructured the CompanyNameCleaner class.
Updated some of the regex rules to make them more effective.
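
For reviewers who want a concrete picture of the first change, here is a minimal sketch of the two approaches side by side. The rules in RULES and both function names are hypothetical stand-ins, not the actual PUDL rule set or the CompanyNameCleaner API; the only real library call assumed is pd.Series.replace with regex=True.

```python
import re

import pandas as pd

# Hypothetical rules for illustration only; not the PUDL rule set.
# Dict order matters: rules are applied top to bottom.
RULES = {
    r"[.,]": "",                 # drop periods and commas
    r"\s+": " ",                 # collapse runs of whitespace
    r"\bcorporation\b": "corp",  # abbreviate a legal term
}


def clean_with_apply(names: pd.Series) -> pd.Series:
    """Old style: call a Python function once per row via apply."""

    def clean_one(name: str) -> str:
        for pattern, repl in RULES.items():
            name = re.sub(pattern, repl, name)
        return name

    return names.apply(clean_one)


def clean_with_replace(names: pd.Series) -> pd.Series:
    """New style: hand each rule to pandas for the whole column at once."""
    for pattern, repl in RULES.items():
        names = names.replace(pattern, repl, regex=True)
    return names


names = pd.Series(["acme  corporation, inc.", "beta corporation"])
print(clean_with_replace(names).tolist())
```

Both functions should produce the same cleaned column; how much faster the second one is will depend on the data and the rules, so the 3x figure above should be read as specific to this PR.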

Documentation

Make sure to update relevant aspects of the documentation.

Tasks

Update the release notes: reference the PR and related issues.
Update relevant Data Source jinja templates (see docs/data_sources/templates).
Update relevant table or source description metadata (see src/metadata).
Review and update any other aspects of the documentation that might be affected by this PR.

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

If updating analyses or data processing functions: make sure to update or write data validation tests (e.g. test_minmax_rows()).
Run make pytest-coverage locally to ensure that the merge queue will accept your PR.
Review the PR yourself and call out any questions or issues you have.
For minor ETL changes or data additions, once make pytest-coverage passes, make sure you have a fresh full PUDL DB downloaded locally, materialize new/changed assets and all their downstream assets and run relevant data validation tests using pytest and --live-dbs.
For bigger ETL or data changes run the full ETL locally and then run the data validations using make pytest-validate.
Alternatively, run the build-deploy-pudl GitHub Action manually.

katie-lamb · 2024-12-04T18:34:19Z

@zschira I'm not sure how to fix the docs error... I posted about this in Slack, but here's the problem:

The warning (treated as error) occurs here and is about duplicate object descriptions for the class variables that I describe here in my CompanyNameCleaner class. When I look at the doc page that's generated by sphinx, it looks fine, although it lists all of these attributes twice, first with the full description and then with no description. Am I not doing the docstring for this class correctly? Is there something I should include so sphinx ignores this?

bendnorman · 2024-12-04T20:20:49Z

src/pudl/analysis/record_linkage/name_cleaner.py

+class CompanyNameCleaner(BaseModel):
+    """Class to normalize/clean up text based company names.
+
+    Attributes:


To resolve the warning/error, you need to move the attribute docstrings directly below the attribute definitions:

    cleaning_rules_list: list[str] = DEFAULT_CLEANING_RULES_LIST
    """A list of cleaning rules that the CompanyNameCleaner should apply. Will be validated to ensure rules comply to allowed cleaning functions."""

You can use this pudl class as a reference.

I think you're getting a duplicate warning because sphinx is creating empty doc strings for the attributes and then it looks at the doc string of the class and creates doc strings for the attributes again.
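
A minimal sketch of the suggested layout, assuming a plain dataclass as a stand-in for the pydantic BaseModel used in the PR. The class name CompanyNameCleanerSketch and the placeholder contents of DEFAULT_CLEANING_RULES_LIST are hypothetical; the point is only where the docstring sits.

```python
from dataclasses import dataclass, field

# Placeholder default; the real list lives in name_cleaner.py.
DEFAULT_CLEANING_RULES_LIST = ["remove_word_the_from_the_end"]


@dataclass
class CompanyNameCleanerSketch:
    """Class to normalize/clean up text based company names."""

    # No "Attributes:" section in the class docstring. Each attribute is
    # documented by the string literal placed directly below it, which is
    # the placement Sphinx autodoc picks up as an attribute docstring.
    cleaning_rules_list: list[str] = field(
        default_factory=lambda: list(DEFAULT_CLEANING_RULES_LIST)
    )
    """A list of cleaning rules that the cleaner should apply."""


cleaner = CompanyNameCleanerSketch()
print(cleaner.cleaning_rules_list)
```

With this layout each attribute is described in exactly one place, which should avoid the duplicate object description.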

katie-lamb · 2024-12-05T03:51:35Z

Thanks for the help @bendnorman ! I think now it's just the same error that you mentioned was generally causing problems in PUDL?

zschira

A couple of small questions, but overall looks good!

One random thing I started thinking about while looking at this: maybe we were going about this the wrong way when we tried to think up a way to avoid illegal states with string cleaning, and a better approach would be to just provide as much insight into what's happening as possible. It could be cool to try to turn this into a series of dagster ops so you could see the order in the dagster UI and inspect intermediate outputs.

These are all just random thoughts though, not something I think we should try to actually do right now.

zschira · 2024-12-13T19:25:34Z

src/pudl/analysis/record_linkage/name_cleaner.py

        """Apply the cleaning rules from the dictionary of regex rules."""
+        if self.place_word_the_at_beginning:


Why is this rule handled differently? Is it a combination of two rules?

This has to do with the relationship between the remove_word_the_from_the_end, remove_word_the_from_the_beginning, and place_word_the_at_beginning rules. I decided that if you want to place the word "the" at the beginning, then you should do this first, in case you then want to remove "the" from the end or from the beginning. These rules are all kind of in conflict and feed into the idea that there are "states" that we go through - an op to handle "the" would make sense in a later refactor.
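
To make the ordering concern concrete, here is a small sketch using plain re.sub. The three regex patterns and the apply_rules helper are assumptions written for this illustration; they are not copied from name_cleaner.py and the real rules may differ.

```python
import re

# Assumed patterns, for illustration only.
PLACE_THE_AT_BEGINNING = (r"^(?P<name>.+), the$", r"the \g<name>")
REMOVE_THE_FROM_BEGINNING = (r"^the\s+", "")
REMOVE_THE_FROM_END = (r"\s+the$", "")


def apply_rules(name: str, rules: list[tuple[str, str]]) -> str:
    """Apply (pattern, replacement) pairs to one name, in the given order."""
    for pattern, repl in rules:
        name = re.sub(pattern, repl, name)
    return name


raw = "southern company, the"

# Placing "the" first lets the later removal rules see a normalized name.
placed_first = apply_rules(
    raw, [PLACE_THE_AT_BEGINNING, REMOVE_THE_FROM_END, REMOVE_THE_FROM_BEGINNING]
)

# Removing from the end first strips "the" but strands the comma, and the
# placement rule no longer matches.
removed_first = apply_rules(
    raw, [REMOVE_THE_FROM_END, PLACE_THE_AT_BEGINNING, REMOVE_THE_FROM_BEGINNING]
)

print(repr(placed_first))   # 'southern company'
print(repr(removed_first))  # 'southern company,'
```

Under these assumed patterns the result depends on rule order, which is the kind of implicit state an explicit op for handling "the" could make visible.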

katie-lamb and others added 9 commits November 8, 2024 14:59

refactor name cleaner

2e9fd96

fix up

e88d6d8

fix legal terms dict variable

b6928a9

fix read in of legal term dictionary json

b1c2752

Merge branch 'main' into rl-cleaning-upates

1be805f

update release notes

b05f766

[pre-commit.ci] auto fixes from pre-commit.com hooks

a198dfe

For more information, see https://pre-commit.ci

Merge branch 'main' into rl-cleaning-upates

5dda49d

Merge branch 'rl-cleaning-upates' of https://github.com/catalyst-coop…

7e477f5

…erative/pudl into rl-cleaning-upates

katie-lamb marked this pull request as ready for review December 2, 2024 22:41

katie-lamb requested a review from zschira December 2, 2024 22:41

Merge branch 'main' into rl-cleaning-upates

4479bc2

bendnorman reviewed Dec 4, 2024

fix doc strings on name cleaner

a6bd569

Merge branch 'main' into rl-cleaning-upates

f5c0315

zschira approved these changes Dec 13, 2024

katie-lamb added 2 commits December 18, 2024 11:21

fix name cleaner rule

64b214c

Merge branch 'main' into rl-cleaning-upates

fe2d8b1

katie-lamb added this pull request to the merge queue Dec 18, 2024

Merged via the queue into main with commit 0dd0530 Dec 18, 2024
17 checks passed

katie-lamb deleted the rl-cleaning-upates branch December 18, 2024 21:27
