Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(pinot): The python_date_format for a temporal column was not being passed to get_timestamp_expr #24942

Merged
merged 27 commits into from
Aug 27, 2023

Conversation

ege-st
Copy link
Contributor

@ege-st ege-st commented Aug 10, 2023

SUMMARY

Refactoring the Pinot plugin for Superset to bring it inline with how plugins operate in the latest version of Superset.

This refactoring is also addressing two bugs in the Apache Pinot DB Engine spec:

  1. If a column was marked as temporal and no time grain was provided then the query would be constructed with illegal {} around the column name causing Pinot to reject the query as syntactically invalid. The code has been updated to remove the incorrect {}
  2. In some cases, get_timestamp_expr will be called by models.adhoc_column_to_sqla and None is explicitly passed for the pdf parameter. This would cause the Pinot get_timestamp_expr to fault. To fix this, the models.adhoc_column_to_sqla method is updated to get the python_date_format for a column (if that data is available). (Without this fix, queries created for charts simply do not use the date format a user sets for a temporal column).

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

  1. Connected to a Pinot data set and marked a long column as epoch_ms and tested creating a Bar Chart V2 to confirm that the query would be correctly constructed (this tested the issue where None is passed for the pdf parameter.
  2. Setting a String column as temporal and using %Y-%m-%d as the PDF value and confirming that when a chart is created the correct Date Time conversion function is used to parse the text date time.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Congrats on making your first PR and thank you for contributing to Superset! 🎉 ❤️

We hope to see you in our Slack community too! Not signed up? Use our Slack App to self-register.

if time_grain:
granularity = cls.get_time_grain_expressions().get(time_grain)
if not granularity:
raise NotImplementedError(f"No pinot grain spec for '{time_grain}'")
else:
return TimestampExpression("{{col}}", col)
return TimestampExpression("{col}", col)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you also mind adding some unit tests for Pinot which cover the get_timestamp_expr function. You can find many other examples of this in the other DB engine specs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, not a problem.

@@ -1011,7 +1013,7 @@ def adhoc_column_to_sqla( # pylint: disable=too-many-locals
if is_dttm and has_timegrain:
sqla_column = self.db_engine_spec.get_timestamp_expr(
col=sqla_column,
pdf=None,
pdf=pdf,
Copy link
Member

@john-bodley john-bodley Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that this change may actually break existing logic—given it was explicitly set to None. Would you mind adding a unit test for this which helps not just to provide code coverage, but also helps reviewers et al. grok the consequence of the change.

@zhaoyongjie it seems like you added this logic in #21163 and thus you probably have the most context as to why we historically weren't defining the pdf variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, in my testing whenever I tried to create a chart or use a dashboard, if a column was marked as temporal it would always call get_timestamp_expr via adhoc_column_to_sqla which means that the user defined date format is never passed to the DB Engine Spec.

It's possible that the root cause of the issue is that get_timestamp_expr is being called through adhoc_column_to_sqla which it should be getting called via TableColumn.get_timestamp_expression (the only other call path to get_timestamp_expr I could find. But all my tests pointed to adhoc_column_to_sqla being the root cause.

Copy link
Member

@zhaoyongjie zhaoyongjie Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@john-bodley the "pdf" is a shortcut for "date format (seconds or milliseconds)", this code was existing in many years, the "pdf" only used in Calculated Column and Columns from database, but not used in Adhoc expression, so we shouldn't make this change.

image

Copy link
Member

@zhaoyongjie zhaoyongjie Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ege-st

The design of current Pinot DB spec is completely incorrect. Maintaining our own Pinot driver and db_spec should solve your issue.

class PinotEngineSpec(BaseEngineSpec):  # pylint: disable=abstract-method
    engine = "pinot"
    engine_name = "Apache Pinot"
    allows_subqueries = False
    allows_joins = False
    allows_alias_in_select = True
    allows_alias_in_orderby = False

    # https://docs.pinot.apache.org/users/user-guide-query/supported-transformations#datetime-functions
    _time_grain_expressions = {
        None: "{col}",
        "PT1S": "CAST(DATE_TRUNC('second', CAST({col} AS TIMESTAMP)) AS TIMESTAMP)",
        "PT1M": "CAST(DATE_TRUNC('minute', CAST({col} AS TIMESTAMP)) AS TIMESTAMP)",
        "PT5M": "CAST(ROUND(DATE_TRUNC('minute', CAST({col} AS TIMESTAMP)), 300000) as TIMESTAMP)",
        "PT10M": "CAST(ROUND(DATE_TRUNC('minute', CAST({col} AS TIMESTAMP)), 600000) as TIMESTAMP)",
        "PT15M": "CAST(ROUND(DATE_TRUNC('minute', CAST({col} AS TIMESTAMP)), 900000) as TIMESTAMP)",
        "PT30M": "CAST(ROUND(DATE_TRUNC('minute', CAST({col} AS TIMESTAMP)), 1800000) as TIMESTAMP)",
        "PT1H": "CAST(DATE_TRUNC('hour', CAST({col} AS TIMESTAMP)) AS TIMESTAMP)",
        "P1D": "CAST(DATE_TRUNC('day', CAST({col} AS TIMESTAMP)) AS TIMESTAMP)",
        "P1W": "CAST(DATE_TRUNC('week', CAST({col} AS TIMESTAMP)) AS TIMESTAMP)",
        "P1M": "CAST(DATE_TRUNC('month', CAST({col} AS TIMESTAMP)) AS TIMESTAMP)",
        "P3M": "CAST(DATE_TRUNC('quarter', CAST({col} AS TIMESTAMP)) AS TIMESTAMP)",
        "P1Y": "CAST(DATE_TRUNC('year', CAST({col} AS TIMESTAMP)) AS TIMESTAMP)",
    }

    @classmethod
    def column_datatype_to_string(
        cls, sqla_column_type: TypeEngine, dialect: Dialect
    ) -> str:
        # Pinot driver infers TIMESTAMP column as LONG, so make the quick fix.
        # When the Pinot driver fix this bug, current method could be removed.
        if isinstance(sqla_column_type, types.TIMESTAMP):
            return sqla_column_type.compile().upper()
        else:
            return super().column_datatype_to_string(sqla_column_type, dialect)

driver at: https://github.com/BurdaForward/pinot-dbapi/tree/bf_release

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhaoyongjie I'm aware of what the Python Date Format (PDF) represents, though thanks for clarifying that this shouldn't be used for ad-hoc expressions.

Note we do already have a Pino DB engine spec, but maybe only adding the column_datatype_to_string method is required.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ege-st I also wondered if this was an underlying issue with the Pino SQLAlchemy dialect. You might want to look into the visit_label method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhaoyongjie I've confirmed that this error happens with the latest versions of Pinot: so Superset can't alias a projection to the same name as a column that already exists. I looked at the diff you provided but it appears to be diffing a version of models.py that is not the same as the one in the master branch.

@john-bodley could you provide some more detail? Is SQL Alchemy generating the alias name used in the projection? If so, then it could be an issue with the dialect, but if Superset generates the alias label then I'm not sure how the dialect can address this.

Copy link
Member

@zhaoyongjie zhaoyongjie Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ege-st the git-diffs are from Superset 2.1.0 branch. There aren't many changes, so you should apply the changes manually 🖨️🖨️🖨️

You should change this part of code on Master branch

def make_sqla_column_compatible(
self, sqla_col: ColumnElement, label: str | None = None
) -> ColumnElement:
"""Takes a sqlalchemy column object and adds label info if supported by engine.
:param sqla_col: sqlalchemy column instance
:param label: alias/label that column is expected to have
:return: either a sql alchemy column or label instance if supported by engine
"""
label_expected = label or sqla_col.name
# add quotes to tables
if self.db_engine_spec.allows_alias_in_select:
label = self.db_engine_spec.make_label_compatible(label_expected)
sqla_col = sqla_col.label(label)
sqla_col.key = label_expected
return sqla_col

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhaoyongjie so I believe I figured out a workaround for the alias issue. If I just set allows_alias_in_select = False then the query generated by Superset does not use an alias and the query is then compatible with Pinot. So, I don't think any of the additional changes you kindly suggested are necessary.

One question that I have is: what is the purpose of the pdf that gets defined in the dataset configuration? Since it isn't passed into the engine spec when creating a chart, it can't be used in the query generation, so it doesn't seem to serve a purpose?

Copy link
Member

@zhaoyongjie zhaoyongjie Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ege-st

so I believe I figured out a workaround for the alias issue. If I just set allows_alias_in_select = False then the query generated by Superset does not use an alias and the query is then compatible with Pinot. So, I don't think any of the additional changes you kindly suggested are necessary.

Sounds good! It should be worked.

One question that I have is: what is the purpose of the pdf that gets defined in the dataset configuration? Since it isn't passed into the engine spec when creating a chart, it can't be used in the query generation, so it doesn't seem to serve a purpose?

I think the original design of "pdf" is a hard-code for getting a timestamp from a string, but a type conversion expression is more graceful, --- should push down the function and run in DB rather than calculate in client.

@ege-st
Copy link
Contributor Author

ege-st commented Aug 15, 2023 via email

@zhaoyongjie
Copy link
Member

zhaoyongjie commented Aug 16, 2023

@ege-st when you have time, could you fix the unit test and code style? If you need helping, feel free to ping me at GitHub or Slack DM.

Copy link
Member

@zhaoyongjie zhaoyongjie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code change LGTM, waiting for CI.

TimeGrain.QUARTER: True,
TimeGrain.YEAR: True,
None: "{col}",
"PT1S": "CAST(DATE_TRUNC('second', CAST({col} AS TIMESTAMP)) AS TIMESTAMP)",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we keep the enums here, eg, TimeGrain.SECOND instead of PT1S? It makes it much easier to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@ege-st
Copy link
Contributor Author

ege-st commented Aug 22, 2023

@nytai could you take a look at this PR, I believe your review is the last one required.

Thanks!

@ege-st
Copy link
Contributor Author

ege-st commented Aug 27, 2023

@zhaoyongjie @john-bodley Is the PR good to merge? Thanks!

@zhaoyongjie
Copy link
Member

zhaoyongjie commented Aug 27, 2023 via email

@zhaoyongjie
Copy link
Member

zhaoyongjie commented Aug 27, 2023 via email

@ege-st ege-st changed the title fix(pinot): The python_date_format for a temporal column was not being passed to get_timestamp_expr refactor(pinot): The python_date_format for a temporal column was not being passed to get_timestamp_expr Aug 27, 2023
@ege-st
Copy link
Contributor Author

ege-st commented Aug 27, 2023

@zhaoyongjie Done.

@zhaoyongjie zhaoyongjie merged commit c2a21d2 into apache:master Aug 27, 2023
33 checks passed
@zhaoyongjie
Copy link
Member

@ege-st Thanks for tackling this problem!

@michael-s-molina michael-s-molina added the v3.0 Label added by the release manager to track PRs to be included in the 3.0 branch label Aug 29, 2023
michael-s-molina pushed a commit that referenced this pull request Aug 30, 2023
…ot being passed to `get_timestamp_expr` (#24942)

(cherry picked from commit c2a21d2)
@martin-raymond
Copy link
Contributor

Hi @ege-st ! what do you think of #25749 ? we think it might be related to your fix

@ege-st
Copy link
Contributor Author

ege-st commented Oct 25, 2023

Hi @martin-raymond, I'll take a look today and let you know.

@martin-raymond
Copy link
Contributor

@ege-st do you have any news ?

cccs-rc pushed a commit to CybercentreCanada/superset that referenced this pull request Mar 6, 2024
@mistercrunch mistercrunch added 🍒 3.0.0 🍒 3.0.1 🍒 3.0.2 🍒 3.0.3 🍒 3.0.4 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 3.1.0 labels Mar 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/L v3.0 Label added by the release manager to track PRs to be included in the 3.0 branch 🍒 3.0.0 🍒 3.0.1 🍒 3.0.2 🍒 3.0.3 🍒 3.0.4 🚢 3.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants