Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shouldn't Pandas 'categorical' type map to OmniSci dictionary encoded string? #114

Closed
mflaxman10 opened this issue Nov 12, 2018 · 3 comments
Assignees
Milestone

Comments

@mflaxman10
Copy link
Contributor

Pymapd currently does not accept categorical columns from pandas, and issues an appropriate error message.

So this is a feature enhancement request.... would expect that categorical column type matches intent of OmniSci dictionary and could always be held within one.

Current behavior:
Unsupported type error

Expected behavior:
Pandas categorical columns are mapped to text encoded dictionary. This could either be done on the general case, or further optimized so that dictionary size is optimized.

@randyzwitch
Copy link
Contributor

I think at a higher level, the question might be "Is there any reason not to always use text encoding dict"? I'd argue that for almost all cases, especially ones where users don't specify, there's probably not much harm in dictionary-encoding all string types.

I'm not as certain about trying to optimize the column widths. On the one hand, it's probably what a user wants, but on the other if the user had a preference they should build the table definition themselves. It also builds in more logic to maintain.

It could be the case that we could add a dry_run argument showing what the table definition might be. Then, users could alter/submit the table definition themselves with optimized settings.

@mflaxman10
Copy link
Contributor Author

Agreed, but the first issue here is that categorical columns types are rejected (not that they are treated as text, encoded or not). So in this particular case, we already have user intent/metadata telling us that the column represents something categorical which has already been dictionary-encoded in the source. OmniSci already has a quasi-equivalent concept. I'm wondering if they aren't equivalent-enough to add the re-encoding to dict, even if in-transition in thrift the concept doesn't exist.

@TomAugspurger
Copy link
Contributor

FYI, pandas will have a proper dict-encoded string in the near-term (next year or so): pandas-dev/pandas#20899

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants