Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(hive): Update CSV to Hive upload prefix #14255

Merged

Conversation

john-bodley
Copy link
Member

@john-bodley john-bodley commented Apr 20, 2021

SUMMARY

This PR updates the default CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC to leverage the previously unused database and schema. Previously the S3 key (which is used as the table location in Hive) was agnostic of database (cluster) or table schema meaning uploading a CSV to an existing table with the same name (sans cluster and schema) created via the CSV flow would overwrite the S3 key.

The fix is to ensure that the database ID and schema (if defined) are used in the default callable. Not an undefined schema infers the default (database dependent) schema.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TEST PLAN

CI.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@john-bodley john-bodley force-pushed the john-bodley--fix-hive-upload-to-s3 branch 3 times, most recently from 569f20b to 7f711a8 Compare April 20, 2021 23:08
@@ -209,7 +209,7 @@ def test_create_table_from_csv_if_exists_fail_with_schema(mock_table, mock_g):

@mock.patch(
"superset.db_engine_specs.hive.config",
{**app.config, "CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC": lambda *args: True},
{**app.config, "CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC": lambda *args: ""},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC function returns a str and not a boolean.

@john-bodley john-bodley force-pushed the john-bodley--fix-hive-upload-to-s3 branch from 7f711a8 to 01aa899 Compare April 20, 2021 23:11
@john-bodley john-bodley marked this pull request as ready for review April 20, 2021 23:11
@john-bodley john-bodley force-pushed the john-bodley--fix-hive-upload-to-s3 branch 2 times, most recently from defaced to c4e75a5 Compare April 20, 2021 23:25
database: "Database",
user: "models.User", # pylint: disable=unused-argument
schema: Optional[str],
) -> str:
Copy link
Member Author

@john-bodley john-bodley Apr 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note the return type has changed from Optional[str] to str since a path must be specified.

@codecov
Copy link

codecov bot commented Apr 21, 2021

Codecov Report

Merging #14255 (a922093) into master (852e840) will decrease coverage by 0.10%.
The diff coverage is 100.00%.

❗ Current head a922093 differs from pull request most recent head ce0ee19. Consider uploading reports for the commit ce0ee19 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master   #14255      +/-   ##
==========================================
- Coverage   76.97%   76.87%   -0.11%     
==========================================
  Files         953      953              
  Lines       48061    48062       +1     
  Branches     5971     5971              
==========================================
- Hits        36997    36947      -50     
- Misses      10862    10913      +51     
  Partials      202      202              
Flag Coverage Δ
hive ?
mysql 80.73% <100.00%> (+<0.01%) ⬆️
postgres 80.76% <100.00%> (?)
presto 80.46% <100.00%> (+<0.01%) ⬆️
python 81.03% <100.00%> (-0.20%) ⬇️
sqlite 80.37% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/config.py 91.06% <100.00%> (+0.03%) ⬆️
superset/db_engines/hive.py 0.00% <0.00%> (-82.15%) ⬇️
superset/db_engine_specs/hive.py 74.42% <0.00%> (-16.42%) ⬇️
superset/utils/core.py 88.60% <0.00%> (-0.13%) ⬇️
superset/connectors/sqla/models.py 89.83% <0.00%> (ø)
superset/views/base_api.py 98.28% <0.00%> (+0.42%) ⬆️
superset/db_engine_specs/postgres.py 96.84% <0.00%> (+1.05%) ⬆️
superset/result_set.py 98.38% <0.00%> (+1.61%) ⬆️
superset/utils/celery.py 89.65% <0.00%> (+3.44%) ⬆️
superset/db_engine_specs/sqlite.py 96.87% <0.00%> (+6.25%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 852e840...ce0ee19. Read the comment docs.

Copy link
Member

@ktmud ktmud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

# Note the final empty path enforces a trailing slash.
return os.path.join(
CSV_TO_HIVE_UPLOAD_DIRECTORY, str(database.id), schema or "", ""
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using database.database_name instead of database.id to have more readable file paths (as well as allowing re-creating database yet still uploading to the same directory)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ktmud that was my original thinking as well, however these can have spaces etc. in them which could be problematic for S3 keys. I thought a database ID was the safest and also remains unchanged should someone rename a database (the name is purely for presentation purposes).

@john-bodley john-bodley force-pushed the john-bodley--fix-hive-upload-to-s3 branch 2 times, most recently from e51582c to 71b2443 Compare April 21, 2021 06:05
@john-bodley john-bodley force-pushed the john-bodley--fix-hive-upload-to-s3 branch from 71b2443 to 0ea817e Compare April 21, 2021 06:57
@john-bodley john-bodley merged commit a8781c5 into apache:master Apr 24, 2021
@john-bodley john-bodley deleted the john-bodley--fix-hive-upload-to-s3 branch April 24, 2021 01:20
QAlexBall pushed a commit to QAlexBall/superset that referenced this pull request Dec 29, 2021
* fix(hive): Update CSV to Hive upload prefix

* Trigger notification

Co-authored-by: John Bodley <john.bodley@airbnb.com>
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.2.0 labels Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/M 🚢 1.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants