Update ferc-ferc plant matching with ccai implementation. #3007
Conversation
This looks good so far! Is it the PCA or the clustering that blows up memory?
@zschira I made some somewhat confusing but hopefully helpful plots to better understand the distance between clusters. Using the fitted clustering model, these plots are run on the small subset of data I've been using (2,000 records) and show the clusters that are p=40 merges from the final merge.

The y-axis shows the distance at which two nodes are merged into one, indicated by a bracket connecting them. I recommend ignoring the labels on the x-axis, but each label basically represents the size of its node. The red horizontal line indicates the threshold I've been using for two clusters to be merged (currently 0.5). Merges above this line didn't happen and aren't represented in the labels; merges below it did happen and formed new clusters.

I progressively zoomed in on the y-axis so we can see what's going on:

- There's a big jump from a merge distance of 10 down to 2. I think this means that 10 is way too large a cluster distance threshold.
- Zooming in on a y-axis of 0 to 1, it still seems like a lot of merging happens at a much smaller distance. Maybe that's an indication the threshold should actually be lower? Or maybe it doesn't matter much whether the threshold is 0.5 or 0.1; I'm still thinking about that and not entirely sure what to make of it.
- There's even more merging in the <0.05 range, but that's to be expected. There's maybe more clustering of larger nodes (bigger on the x-axis), which is good.

It's a little harder to visualize results with the model run on the full dataset, but for the most part the results seem to align with the smaller sample dataset. I ran this with p=20, so 20 merges from the final merge, because it was impossible to tell what was going on with p=40.
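(For concreteness, a rough sketch of how plots like these can be produced with scipy, assuming the model is hierarchical/agglomerative and we have a linkage matrix; the random `features` array is a stand-in for the real vectorized plant records, and "average" linkage is an assumption.)

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
features = rng.random((2000, 10))  # stand-in for the embedded plant records

# "average" linkage is an assumption; the real model may use another method.
Z = linkage(features, method="average")

fig, ax = plt.subplots(figsize=(12, 5))
# truncate_mode="lastp" with p=40 shows only the last 40 merges, matching the
# "p=40 merges from the final merge" description. Collapsed leaves are labeled
# with the number of records they contain (the x-axis sizes mentioned above).
dendrogram(Z, truncate_mode="lastp", p=40, ax=ax)
ax.axhline(y=0.5, color="red", linestyle="--")  # the 0.5 merge threshold
ax.set_ylabel("merge distance")
plt.show()
```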
I'm definitely confused here. With several thousand expected clusters (one for each FERC plant) it seems hard to visualize all of them at once. Have you made histograms of the cluster sizes in the old vs. new systems? If you randomly select a few clusters and look at the records in them, do they actually look like the same plant?
@zaneselvans I've spot checked a number of plant IDs and so far they've all looked like the same plant to me. I think doing some focused spot checking using the dendrogram as a guide would be interesting: for example, look at clusters that merge just above our threshold in the 0.5-0.6 range and see whether they look like the same plant or not, do the same just below the threshold, and then maybe zoom way in on clusters with very small distances and see if we can find any that don't seem to be good matches. More generally, it would be interesting to find some cases where we're clearly failing (matching plants that shouldn't be matched, or not matching plants that should) and see where/why those went wrong.
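(A sketch of how that spot check could be driven off the linkage matrix `Z` from the earlier sketch; the 0.4-0.6 window and the two cut heights are illustrative values.)

```python
from scipy.cluster.hierarchy import fcluster

# Column 2 of a scipy linkage matrix holds the merge distance, so we can
# count the borderline merges near the 0.5 threshold directly.
near_threshold = (Z[:, 2] > 0.4) & (Z[:, 2] < 0.6)
print(f"{near_threshold.sum()} merges in the 0.4-0.6 distance range")

# Cut the tree at 0.5 vs 0.6: clusters whose membership differs between the
# two cuts are exactly the ones merged in that borderline range, and they
# are the natural candidates for manual review.
labels_lo = fcluster(Z, t=0.5, criterion="distance")
labels_hi = fcluster(Z, t=0.6, criterion="distance")
print(f"clusters at t=0.5: {labels_lo.max()}, at t=0.6: {labels_hi.max()}")
```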
You can think of the dendrograms as a sample of the several thousand clusters created by the model. In our case the p=40 parameter is not that meaningful, since I'm instead using a distance threshold to decide when merging should stop; the dendrogram is mostly helpful for judging whether that threshold is appropriate. As @zschira pointed out, the second validation step is probably to spot check nodes that are merged right below or right above that 0.5 mark. This also explains why the average distance between records in a cluster is always very small (0.05 or less) even when I experiment with a distance threshold in the 0.2-1 range: the vast majority of merges happen between nodes at a distance <0.05, so the average within-cluster distance isn't a very helpful metric for judging whether a threshold in the 0.2-1 range is good.
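(For reference, a minimal sketch of threshold-based stopping with scikit-learn's `AgglomerativeClustering`; the linkage method and threshold value are assumptions, and `features` is the stand-in array from the earlier sketch.)

```python
from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(
    n_clusters=None,         # required to be None when using distance_threshold
    distance_threshold=0.5,  # merges at or above this distance don't happen
    linkage="average",       # assumption; the real model may differ
    compute_distances=True,  # keep merge distances for dendrogram plots
)
labels = model.fit_predict(features)
print(f"{labels.max() + 1} clusters at threshold 0.5")
```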
Yes, currently in the notebook in the CCAI repo. Here's a comparison of cluster-size histograms from the smaller sample dataset (2,000 records; I didn't have a screenshot of the full-model histogram), where the average new cluster size is ~5.8. I'll do some spot checking around the current distance threshold as Zach suggests.
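(A sketch of how such a histogram comparison might be computed; `steam_df` and the two plant ID columns are hypothetical names, not the real PUDL schema.)

```python
import matplotlib.pyplot as plt

# Records per cluster under the old and new matching.
old_sizes = steam_df.groupby("plant_id_ferc1_old").size()
new_sizes = steam_df.groupby("plant_id_ferc1").size()

fig, axes = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(10, 4))
axes[0].hist(old_sizes, bins=range(1, old_sizes.max() + 2))
axes[0].set_title(f"old (mean cluster size {old_sizes.mean():.1f})")
axes[1].hist(new_sizes, bins=range(1, new_sizes.max() + 2))
axes[1].set_title(f"new (mean cluster size {new_sizes.mean():.1f})")
for ax in axes:
    ax.set_xlabel("records per cluster")
plt.show()
```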
I think the y-axis label might belong on the x-axis? It's interesting that there's a bump at the very high end of the length spectrum (like, one record for every possible year) in the old version, but not in the new version. I wonder why that would have happened, and whether we've actually lost some good long time series, or if they were bad for some reason and the new algorithm does a better job of distinguishing them?
Oh yep, I made those too fast; switch the label from the y-axis to the x-axis.
That's a good point, and that disparity at the high end of the length spectrum happens in the full dataset as well. I'll spot check some of those.
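(Continuing the hypothetical names from the histogram sketch, one way to drive that spot check: find old clusters with one record per report year and see how the new matching split their records.)

```python
# Clusters that span every report year in the data.
n_years = steam_df["report_year"].nunique()
full_length_old_ids = old_sizes[old_sizes == n_years].index

# Cluster IDs aren't comparable across the two systems, so compare record
# memberships instead: pull one full-length old cluster and see which new
# clusters its records landed in.
records = steam_df[steam_df["plant_id_ferc1_old"] == full_length_old_ids[0]]
print(records["plant_id_ferc1"].value_counts())
```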
I think this looks good! I left a few small comments in the cross_year module. As a side note, it might be useful for us to track a couple of validation checks beyond just whether there are duplicate report years. That probably belongs in the experiment tracking infrastructure and metrics, I'm assuming, and would go in with a later PR.
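(A minimal sketch of what such a validation check could look like; the function, dataframe, and column names are assumptions, not the PR's actual experiment-tracking code.)

```python
import pandas as pd

def validate_clusters(df: pd.DataFrame) -> dict:
    """Compute simple sanity metrics for FERC-FERC plant clusters."""
    by_cluster = df.groupby("plant_id_ferc1")
    # A well-formed cluster should have at most one record per report year.
    dupes = by_cluster["report_year"].apply(lambda y: y.duplicated().any())
    return {
        "clusters_with_duplicate_years": int(dupes.sum()),
        "n_clusters": df["plant_id_ferc1"].nunique(),
        "mean_cluster_size": float(by_cluster.size().mean()),
    }
```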
Sorry, I didn't finish making all my comments at once. Here are a few more things.
I think you might need to add an …
It looks like when the modules under … I'm running the full ETL locally using the …
After including the new import paths, I get a failure on the plant parts EIA (traceback not reproduced here).
Hmm, attempting to re-run just the EPA CEMS assets, it seems to do them two-by-two. So maybe it was just that the whole EPA CEMS graph job had failed and left the unstarted ghost assets in the UI.
@zschira The builds passed last night, so this could go into …

Do we know why the test coverage drops by half a percent between this PR and …?

Edit: pytest wasn't running the tests in …
column_transform_from_key("name_cleaner_transform"), | ||
column_transform_from_key("string_transform"), | ||
], | ||
"weight": 2.0, |
Why are there weights for these first two features, but not the rest? Do they default to 1.0?
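(For readers unfamiliar with the pattern: weights in record-linkage feature configs typically scale each feature's vectorized columns before distances are computed, defaulting to 1.0 when unspecified. The sketch below illustrates that convention; it is not the actual implementation in this PR.)

```python
import numpy as np

def weighted_feature_matrix(
    feature_vectors: dict[str, np.ndarray], config: dict[str, dict]
) -> np.ndarray:
    """Scale each feature's vectorized columns by its configured weight."""
    blocks = []
    for name, vectors in feature_vectors.items():
        # Features without an explicit weight fall back to 1.0.
        weight = config.get(name, {}).get("weight", 1.0)
        blocks.append(weight * vectors)
    return np.hstack(blocks)
```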
```diff
@@ -0,0 +1 @@
+"""This module implements models for various forms of record linkage."""
```
There were some failures in the steam table processing due to `pudl.analysis.fuel_by_plant` not being imported in `pudl/analysis/__init__.py`. We have a lot of places where we import a whole module rather than the individual functions or constants within it, so I feel like adding the imports here for now would help avoid confusion when that pattern breaks on some modules.
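(A sketch of the suggested fix in `pudl/analysis/__init__.py`; the `record_linkage` module name is illustrative.)

```python
# pudl/analysis/__init__.py
# Importing each submodule here makes e.g. pudl.analysis.fuel_by_plant
# resolvable anywhere `pudl` itself is imported.
from pudl.analysis import fuel_by_plant  # noqa: F401
from pudl.analysis import record_linkage  # noqa: F401
```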
Closes catalyst-cooperative/ccai-entity-matching#109.
This PR pulls in @katie-lamb's CCAI implementation of the FERC-FERC inter-year plant matching process. The new implementation works well and seems to run much faster than the old one (~2 seconds vs. ~8 seconds on the `etl_fast` dataset).

Remaining work: