-
-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor entity matching name cleaner to be more efficient #3953
Conversation
For more information, see https://pre-commit.ci
…erative/pudl into rl-cleaning-upates
@zschira I'm not sure how to fix the docs error... I posted about this in Slack, but here's the problem: The warning (treated as error) occurs here and is about duplicate object descriptions for the class variables that I describe here in my CompanyNameCleaner class. When I look at the doc page that's generated by sphinx, it looks fine, although it lists all of these attributes twice, first with the full description and then with no description. Am I not doing the docstring for this class correctly? Is there something I should include so sphinx ignores this? |
class CompanyNameCleaner(BaseModel): | ||
"""Class to normalize/clean up text based company names. | ||
|
||
Attributes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I resolve the warning/error you need to move the attribute doc strings directly below the attribute definitions:
cleaning_rules_list: list[str] = DEFAULT_CLEANING_RULES_LIST
"""A list of cleaning rules that the
CompanyNameCleaner should apply. Will be validated to ensure
rules comply to allowed cleaning functions."""
You can use this pudl class as a reference.
I think you're getting a duplicate warning because sphinx is creating empty doc strings for the attributes and then it looks at the doc string of the class and creates doc strings for the attributes again.
Thanks for the help @bendnorman ! I think now it's just the same error that you mentioned was generally causing problems in PUDL? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of small questions, but overall looks good!
One random thing I started thinking about while looking at this is that maybe we were going about this the wrong way when trying to think up a way to avoid illegal states with string cleaning, and instead a better approach would be to just provide as much insight into what's happening as possible. It could be cool to try to turn this into a series of dagster op
's so you could see the order in the dagster UI and inspect intermediate outputs.
This is all just some random thoughts though, not something I think we should try to actually do right now.
"""Apply the cleaning rules from the dictionary of regex rules.""" | ||
if self.place_word_the_at_beginning: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this rule handled differently? Is it a combination of two rules?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has to do with the relationship between the "remove_word_the_from_the_end"
, "remove_word_the_from_the_beginning"
, and place_word_the_at_beginning
rules. I decided that if you want to place the word the at the beginning, then you should do this first in case you want to then remove "the" from the end or "the" from the beginning. These rules are all kind of in conflict and feed into the idea that there are "states" that we go through - an op
to handle "the" would make sense in a later refactor.
Overview
As part of the SEC to EIA record linkage development, I had to make some changes to the PUDL company name cleaning module to make it more efficient and useful. The code for this module was originally pulled OS Climate's repo, but it was no longer maintained there. I didn't make significant changes when I pulled out that module, and thus it had some quirks and inefficiencies.
What problem does this address?
What did you change?
apply
to apply the regex replacement rules, I usedpd.Series.replace
so that this replacement is vectorized.CompanyNameCleaner
classDocumentation
Make sure to update relevant aspects of the documentation.
Tasks
Testing
How did you make sure this worked? How can a reviewer verify this?
To-do list