[0.2.dev4] Utilities for merging tables #102

Merged 23 commits into master from data-utils on Mar 27, 2019

@smmaurer commented Mar 6, 2019

This PR adds utilities to perform merges using implicit join keys (column names that match index names) instead of requiring Orca broadcasts. See issues #78 and #100.

New utilities

  • utils.validate_table()
  • utils.validate_all_tables()
  • utils.merge_tables()

The validation tools check that tables conform to the new, stricter spec: unique index or multi-index values, and no duplicate index/column names within a table. They also print some stats indicating whether columns that match the index names of other tables make sense as join keys.
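For illustration, here is a rough sketch of the kinds of checks described above, written in plain pandas. The helper names `check_table()` and `join_key_match_rate()` are hypothetical and are not the signatures of `utils.validate_table()`; the match rate is just one plausible statistic for judging a join key.

```python
import pandas as pd

def check_table(df):
    """Return a list of spec problems (hypothetical helper, not utils.validate_table)."""
    problems = []
    # Index (or multi-index) values must be unique
    if not df.index.is_unique:
        problems.append('index values are not unique')
    # No name may be shared between index levels and columns
    index_names = [n for n in df.index.names if n is not None]
    names = index_names + list(df.columns)
    dupes = sorted({n for n in names if names.count(n) > 1})
    if dupes:
        problems.append('duplicate index/column names: {}'.format(dupes))
    return problems

def join_key_match_rate(df, other):
    """Share of df[key] values found in other's index, where key is other's
    (single-level) index name. Assumes df has a column with that name."""
    key = other.index.name
    return df[key].isin(other.index).mean()
```

A match rate well below 1.0 would suggest that the column is not a clean foreign key into the other table.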

The merge tool replaces orca.merge_tables() and is simpler and more deterministic. It also supports merging on multi-indexes (useful for interaction terms and sampling weights), which Orca's merge does not.
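To make the implicit-join-key idea concrete, here is a minimal sketch in plain pandas. It is not the implementation of `utils.merge_tables()`: the function name `merge_on_implicit_keys()` is hypothetical, and it assumes every table has a named index and that tables are passed in an order where each one's index names already appear in the accumulated result.

```python
import pandas as pd

def merge_on_implicit_keys(target, others):
    """Left-join each table in `others` onto `target`, using the index name(s)
    of the incoming table as join keys (hypothetical sketch)."""
    merged = target
    for right in others:
        keys = list(right.index.names)  # one name, or several for a multi-index
        available = set(merged.columns) | set(merged.index.names)
        missing = [k for k in keys if k not in available]
        if missing:
            raise ValueError('no column or index named {}'.format(missing))
        # Move indexes into columns so the merge is a plain column join,
        # then restore the target's original index afterwards
        index_names = list(merged.index.names)
        merged = (merged.reset_index()
                        .merge(right.reset_index(), on=keys, how='left')
                        .set_index(index_names))
    return merged
```

For example, a households table with a `building_id` column can pick up `zone_id` from a buildings table indexed by `building_id`, and then zone attributes from a table indexed by `zone_id`, with no broadcasts registered.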

Other changes

  • updates utils.get_data() to use the new merge tool instead of Orca's
  • updates older model step templates (BinaryLogitStep and OLSRegressionStep) to use utils.get_data(), removing any reliance on broadcasts
  • raises the pandas requirement to 0.23

(The pandas API is moving toward letting users refer to index and non-index columns interchangeably by name, which is nice: it's easier, and it aligns better with other data formats. Version 0.23 allows merge keys to be specified by name whether or not they are indexes.)
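A small example of the pandas behavior mentioned above, assuming pandas 0.23 or later and made-up table contents. Here `zone_id` is an ordinary column in one table and the index in the other, and `on='zone_id'` works for both. The left table's index is reset first so that `building_id` is kept regardless of how pandas treats the left index in a mixed column/index merge.

```python
import pandas as pd

buildings = pd.DataFrame(
    {'zone_id': [10, 10, 20]},
    index=pd.Index([1, 2, 3], name='building_id'))
zones = pd.DataFrame(
    {'density': [0.5, 1.5]},
    index=pd.Index([10, 20], name='zone_id'))

# 'zone_id' is a column on the left and an index level on the right
merged = pd.merge(buildings.reset_index(), zones, on='zone_id', how='left')
merged = merged.set_index('building_id')
```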

Versioning

0.2.dev4

To do before merging

  • finish and test utils.merge_tables()
  • add support for tables from Orca
  • add support for merges not strictly in order
  • update utils.get_data() to use new merge tool
  • make sure all the model step templates are using the new utilities, and not relying on broadcasts
  • raise pandas requirement to 0.23
  • finalize versioning
  • update docs and changelog

@smmaurer marked this pull request as ready for review March 26, 2019 01:18

coveralls commented Mar 26, 2019

Coverage increased (+0.4%) to 91.894% when pulling 3f8bcb5 on data-utils into b23a284 on master.

@smmaurer merged commit c911238 into master Mar 27, 2019
@smmaurer deleted the data-utils branch March 27, 2019 00:50
@smmaurer changed the title from "Utilities for merging tables" to "[0.2.dev4] Utilities for merging tables" Mar 29, 2019