Refactor entity matching name cleaner to be more efficient #3953

Merged 14 commits, Dec 18, 2024
8 changes: 7 additions & 1 deletion docs/release_notes.rst
@@ -70,11 +70,17 @@ EPA CEMS
~~~~~~~~
* Added 2024 Q3 of CEMS data. See :issue:`3943` and :pr:`3948`.

FERC to EIA Record Linkage
Record Linkage
^^^^^^^^^^^^^^^^^^^^^^^^^^
* Updated the ``splink`` FERC to EIA development notebook to be compatible with
  the latest version of ``splink``. This notebook is not run in production but
  is helpful for visualizing model weights and understanding what happens under
  the hood.
* Made the ``pudl.analysis.record_linkage.name_cleaner`` company name cleaning
  module more efficient by replacing all uses of ``.apply`` with
  ``pd.Series.replace``, so that the regex replacement rules are vectorized.
  Also removed some of the allowed replacement rules to make the cleaner simpler
  and more effective. The module now runs approximately 3x faster when cleaning
  a string Series.

.. _release-v2024.10.0:

4 changes: 1 addition & 3 deletions src/pudl/analysis/record_linkage/eia_ferc1_record_linkage.py
@@ -78,11 +78,9 @@
cleaning_rules_list=[
"remove_word_the_from_the_end",
"remove_word_the_from_the_beginning",
"replace_amperstand_between_space_by_AND",
"replace_ampersand_by_AND",
"replace_hyphen_by_space",
"replace_hyphen_between_spaces_by_single_space",
"replace_underscore_by_space",
"replace_underscore_between_spaces_by_single_space",
"remove_all_punctuation",
"remove_numbers",
"remove_math_symbols",
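
For readers unfamiliar with the pattern described in the release note, here is a minimal sketch of moving from per-element ``.apply`` to vectorized ``pd.Series.replace`` with ``regex=True``. The rule names are taken from the diff above, but the regex patterns, the ``ASSUMED_RULES`` table, and both function names are hypothetical illustrations, not the actual contents of ``pudl.analysis.record_linkage.name_cleaner``.

```python
import re

import pandas as pd

# Hypothetical rule table. The keys are rule names that appear in the diff;
# the (pattern, replacement) pairs are assumptions made for illustration only.
ASSUMED_RULES: dict[str, tuple[str, str]] = {
    "replace_ampersand_by_AND": (r"&", " and "),
    "replace_hyphen_by_space": (r"-", " "),
    "replace_underscore_by_space": (r"_", " "),
    "remove_numbers": (r"\d+", ""),
}


def clean_names_apply(names: pd.Series, rule_names: list[str]) -> pd.Series:
    """Baseline: run every rule on each element through ``.apply``."""

    def _clean_one(value):
        if not isinstance(value, str):
            return value  # leave missing values untouched
        value = value.lower()
        for rule in rule_names:
            pattern, repl = ASSUMED_RULES[rule]
            value = re.sub(pattern, repl, value)
        return re.sub(r"\s+", " ", value).strip()

    return names.apply(_clean_one)


def clean_names_vectorized(names: pd.Series, rule_names: list[str]) -> pd.Series:
    """Vectorized: one ``Series.replace(..., regex=True)`` call per rule."""
    out = names.str.lower()
    for rule in rule_names:
        pattern, repl = ASSUMED_RULES[rule]
        out = out.replace(pattern, repl, regex=True)
    # Collapse the extra whitespace introduced by the replacements.
    return out.replace(r"\s+", " ", regex=True).str.strip()


if __name__ == "__main__":
    example = pd.Series(["Acme Power & Light", "north-west_energy 2", None])
    print(clean_names_vectorized(example, list(ASSUMED_RULES)))
```

Both functions give the same cleaned strings on this example. The vectorized version makes one pass over the whole Series per rule instead of calling a Python function once per row, which is the kind of change the release note credits for the speedup.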