Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Update time grain expressions for Spark >= 3.x #18690

Merged

Conversation

thomasdesr
Copy link
Contributor

@thomasdesr thomasdesr commented Feb 12, 2022

SUMMARY

Spark removed date format string 'u' in Spark 3.0 [0] so the grain expressions need to be updated because ones that depend on that format string are failing 😢. I switched the behavior to use date_trunc which has been around since 2.3 [1] based on how the Athena db engine spec handles this [2].

I think its probably fine to raise the minimum supported version to 2.3 because this is a "Databricks" db_engine_spec and Databricks hasn't publicly supported something that old since ~Jan 2020 [3].

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

I can't reproduce the old behavior (I don't have that old a spark cluster xD) but here's a table of timestamps from a current version of spark run through these same expressions:

_timestamp 2021-02-08T23:58:02 2021-02-08T23:58:41 2021-02-08T23:59:31
PT1S 2021-02-08T23:58:02 2021-02-08T23:58:41 2021-02-08T23:59:31
PT1M 2021-02-08T23:58:00 2021-02-08T23:58:00 2021-02-08T23:59:00
PT1H 2021-02-08T23:00:00 2021-02-08T23:00:00 2021-02-08T23:00:00
P1D 2021-02-08T00:00:00 2021-02-08T00:00:00 2021-02-08T00:00:00
P1W 2021-02-08T00:00:00 2021-02-08T00:00:00 2021-02-08T00:00:00
P1M 2021-02-01T00:00:00 2021-02-01T00:00:00 2021-02-01T00:00:00
P3M 2021-01-01T00:00:00 2021-01-01T00:00:00 2021-01-01T00:00:00
P1Y 2021-01-01T00:00:00 2021-01-01T00:00:00 2021-01-01T00:00:00
P1W/1970-01-03T00:00:00Z 2021-02-13T00:00:00 2021-02-13T00:00:00 2021-02-13T00:00:00
1969-12-28T00:00:00Z/P1W 2021-02-07T00:00:00 2021-02-07T00:00:00 2021-02-07T00:00:00

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

Spark removed date format string 'u' in Spark 3.0. Switch to using date_trunc which has been around since 2.3
@thomasdesr thomasdesr changed the title fix: Update time grain expressions for Spark >= 2.3.0 fix: Update time grain expressions for Spark >= 3.x Feb 16, 2022
@thomasdesr
Copy link
Contributor Author

cc @betodealmeida (who seems to have done most of the work on the databricks/spark engine spec)

If you could at least boop the "allow tests to run" button that'd be great :D

Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks for updating it!

@@ -37,7 +51,7 @@ class DatabricksODBCEngineSpec(BaseEngineSpec):
# the syntax for the ODBC engine is identical to the Hive one, so
# we can reuse the expressions from `HiveEngineSpec`
# pylint: disable=protected-access
_time_grain_expressions = HiveEngineSpec._time_grain_expressions
_time_grain_expressions = DatabricksHiveEngineSpec._time_grain_expressions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also define time_grain_expressions outside the classes, and then reuse it in both classes, I think it might be clearer:

time_grain_expression = { ... }

class DatabricksHiveEngineSpec(HiveEngineSpec):
    _time_grain_expressions = time_grain_expressions
    ...

class DatabricksODBCEngineSpec(BaseEngineSpec):
    _time_grain_expressions = time_grain_expressions
    ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good here! :D

@codecov
Copy link

codecov bot commented Feb 17, 2022

Codecov Report

Merging #18690 (d2a8fd7) into master (225015f) will increase coverage by 0.12%.
The diff coverage is 100.00%.

❗ Current head d2a8fd7 differs from pull request most recent head df0f950. Consider uploading reports for the commit df0f950 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18690      +/-   ##
==========================================
+ Coverage   66.28%   66.40%   +0.12%     
==========================================
  Files        1605     1619      +14     
  Lines       62863    62940      +77     
  Branches     6341     6341              
==========================================
+ Hits        41666    41798     +132     
+ Misses      19545    19490      -55     
  Partials     1652     1652              
Flag Coverage Δ
hive 52.28% <100.00%> (+0.14%) ⬆️
mysql 81.55% <100.00%> (+0.22%) ⬆️
postgres 81.60% <100.00%> (+0.22%) ⬆️
presto 52.13% <100.00%> (+0.14%) ⬆️
python 82.03% <100.00%> (+0.22%) ⬆️
sqlite 81.29% <100.00%> (+0.22%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/db_engine_specs/databricks.py 90.90% <100.00%> (+0.90%) ⬆️
superset/key_value/commands/create.py 80.95% <0.00%> (-3.05%) ⬇️
superset/sql_parse.py 98.61% <0.00%> (-0.87%) ⬇️
superset/models/core.py 88.91% <0.00%> (-0.19%) ⬇️
superset/sql_lab.py 81.64% <0.00%> (ø)
superset/key_value/utils.py 100.00% <0.00%> (ø)
superset/dashboards/filter_state/api.py 100.00% <0.00%> (ø)
superset/explore/form_data/commands/parameters.py 100.00% <0.00%> (ø)
superset/utils/pandas_postprocessing.py
superset/utils/pandas_postprocessing/compare.py 88.88% <0.00%> (ø)
... and 32 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 225015f...df0f950. Read the comment docs.

@thomasdesr
Copy link
Contributor Author

thomasdesr commented Mar 7, 2022

@betodealmeida Sorry, gentle ping to boop the "approve running tests" button. 🙇

@villebro
Copy link
Member

villebro commented Mar 7, 2022

FYI @thomasdesr CI tests started

@thomasdesr thomasdesr force-pushed the fix-spark-time-grain-expressions branch from a982006 to df0f950 Compare March 8, 2022 18:41
@thomasdesr
Copy link
Contributor Author

Okay I've now gone through https://github.com/apache/superset/blob/master/.github/workflows/superset-python-misc.yml and validated that while a few things are still failing locally they don't seem to be related to this change. Hopefully this should be the last time I need to ask for a boop xD

(cc @villebro 🙇)

@betodealmeida
Copy link
Member

Just booped it! :)

@betodealmeida betodealmeida merged commit 03b2b06 into apache:master Mar 8, 2022
@betodealmeida
Copy link
Member

Merged! Thanks for the PR, @thomasdesr!

villebro pushed a commit that referenced this pull request Apr 3, 2022
* Fix the time grain expressions for Spark >= 2.3.0

Spark removed date format string 'u' in Spark 3.0. Switch to using date_trunc which has been around since 2.3

* Review: Pull out time_grain_expressoins into its own thing

(cherry picked from commit 03b2b06)
@mistercrunch mistercrunch added 🍒 1.5.0 🍒 1.5.1 🍒 1.5.2 🍒 1.5.3 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 2.0.0 labels Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels lts-v1 size/S 🍒 1.5.0 🍒 1.5.1 🍒 1.5.2 🍒 1.5.3 🚢 2.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants