Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[druid] fix bug around handling NULLs #4358

Merged
merged 1 commit into from
Feb 7, 2018

Conversation

mistercrunch
Copy link
Member

fillna would miss out on identifying STRING columns for Druid and
replace None in string columns with a numeric 0. This
mixed type column would confuse
pandas down the line on some operations like df.pivot_table.

superseeds #4236

@xrmx 👀

@mistercrunch
Copy link
Member Author

Note that get_fillna_for_columns assumes that the dataframe's column names are bound to table column names and related types. While this is generally the case, this is not enforced, and the bug could show up again if a string column in a dataframe is not named after a database column.

Viz like the Sankey may use specific columns as a source but rename dataframe columns to source and target for instance. In this case the current logic would fail at identifying the df columns as strings and the mixed type issue may re-appear.

superset/viz.py Outdated
return ' NULL'
if col:
if col.is_string:
return 'NULL'
Copy link
Contributor

@xrmx xrmx Feb 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The space was added on purpose for having NULL entries before other values after sorting

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha added it back

@mistercrunch
Copy link
Member Author

mistercrunch commented Feb 6, 2018

A more final solution (outside the scope of this bugfix PR) would be to get better at typing dataframes (currently we have a lot of object types depending on the db engine) and pandas is much slower when operating over object types. We should infer type earlier on in the pipeline and have fillna logic be based on dataframe types instead of column types. (though dataframe types can be inferred from the database types in the first place, though drivers/pandas may fail at that). @betodealmeida and I had a conversation about that last week.

fillna would miss out on identifying STRING columns for Druid and
replace None in string columns with a numeric `0`. This
mixed type column would confuse
pandas down the line on some operations like `df.pivot_table`.
@mistercrunch mistercrunch merged commit 31a0b6e into apache:master Feb 7, 2018
@mistercrunch mistercrunch deleted the fix_cannot_compare branch February 7, 2018 16:19
michellethomas pushed a commit to michellethomas/panoramix that referenced this pull request May 24, 2018
fillna would miss out on identifying STRING columns for Druid and
replace None in string columns with a numeric `0`. This
mixed type column would confuse
pandas down the line on some operations like `df.pivot_table`.
wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
fillna would miss out on identifying STRING columns for Druid and
replace None in string columns with a numeric `0`. This
mixed type column would confuse
pandas down the line on some operations like `df.pivot_table`.
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.23.0 labels Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.23.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants