fix(hive): Update CSV to Hive upload prefix #14255

john-bodley · 2021-04-20T21:57:17Z

SUMMARY

This PR updates the default CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC to leverage the previously unused database and schema. Previously the S3 key (which is used as the table location in Hive) was agnostic of database (cluster) or table schema meaning uploading a CSV to an existing table with the same name (sans cluster and schema) created via the CSV flow would overwrite the S3 key.

The fix is to ensure that the database ID and schema (if defined) are used in the default callable. Not an undefined schema infers the default (database dependent) schema.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TEST PLAN

CI.

ADDITIONAL INFORMATION

Has associated issue:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

john-bodley · 2021-04-20T23:10:41Z

tests/db_engine_specs/hive_tests.py

@@ -209,7 +209,7 @@ def test_create_table_from_csv_if_exists_fail_with_schema(mock_table, mock_g):

 @mock.patch(
    "superset.db_engine_specs.hive.config",
-    {**app.config, "CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC": lambda *args: True},
+    {**app.config, "CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC": lambda *args: ""},


The CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC function returns a str and not a boolean.

john-bodley · 2021-04-20T23:26:41Z

superset/config.py

+    database: "Database",
+    user: "models.User",  # pylint: disable=unused-argument
+    schema: Optional[str],
+) -> str:


Note the return type has changed from Optional[str] to str since a path must be specified.

codecov · 2021-04-21T00:01:16Z

Codecov Report

Merging #14255 (a922093) into master (852e840) will decrease coverage by 0.10%.
The diff coverage is 100.00%.

❗ Current head a922093 differs from pull request most recent head ce0ee19. Consider uploading reports for the commit ce0ee19 to get more accurate results

@@            Coverage Diff             @@
##           master   #14255      +/-   ##
==========================================
- Coverage   76.97%   76.87%   -0.11%     
==========================================
  Files         953      953              
  Lines       48061    48062       +1     
  Branches     5971     5971              
==========================================
- Hits        36997    36947      -50     
- Misses      10862    10913      +51     
  Partials      202      202

Flag	Coverage Δ
hive	`?`
mysql	`80.73% <100.00%> (+<0.01%)`	⬆️
postgres	`80.76% <100.00%> (?)`
presto	`80.46% <100.00%> (+<0.01%)`	⬆️
python	`81.03% <100.00%> (-0.20%)`	⬇️
sqlite	`80.37% <100.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
superset/config.py	`91.06% <100.00%> (+0.03%)`	⬆️
superset/db_engines/hive.py	`0.00% <0.00%> (-82.15%)`	⬇️
superset/db_engine_specs/hive.py	`74.42% <0.00%> (-16.42%)`	⬇️
superset/utils/core.py	`88.60% <0.00%> (-0.13%)`	⬇️
superset/connectors/sqla/models.py	`89.83% <0.00%> (ø)`
superset/views/base_api.py	`98.28% <0.00%> (+0.42%)`	⬆️
superset/db_engine_specs/postgres.py	`96.84% <0.00%> (+1.05%)`	⬆️
superset/result_set.py	`98.38% <0.00%> (+1.61%)`	⬆️
superset/utils/celery.py	`89.65% <0.00%> (+3.44%)`	⬆️
superset/db_engine_specs/sqlite.py	`96.87% <0.00%> (+6.25%)`	⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 852e840...ce0ee19. Read the comment docs.

ktmud

LGTM!

ktmud · 2021-04-21T00:56:54Z

superset/config.py

+    # Note the final empty path enforces a trailing slash.
+    return os.path.join(
+        CSV_TO_HIVE_UPLOAD_DIRECTORY, str(database.id), schema or "", ""
+    )


How about using database.database_name instead of database.id to have more readable file paths (as well as allowing re-creating database yet still uploading to the same directory)?

@ktmud that was my original thinking as well, however these can have spaces etc. in them which could be problematic for S3 keys. I thought a database ID was the safest and also remains unchanged should someone rename a database (the name is purely for presentation purposes).

* fix(hive): Update CSV to Hive upload prefix * Trigger notification Co-authored-by: John Bodley <john.bodley@airbnb.com>

pull-request-size bot added the size/M label Apr 20, 2021

john-bodley force-pushed the john-bodley--fix-hive-upload-to-s3 branch 3 times, most recently from 569f20b to 7f711a8 Compare April 20, 2021 23:08

john-bodley commented Apr 20, 2021

View reviewed changes

john-bodley force-pushed the john-bodley--fix-hive-upload-to-s3 branch from 7f711a8 to 01aa899 Compare April 20, 2021 23:11

john-bodley marked this pull request as ready for review April 20, 2021 23:11

john-bodley requested review from bkyryliuk, etr2460 and ktmud April 20, 2021 23:11

john-bodley force-pushed the john-bodley--fix-hive-upload-to-s3 branch 2 times, most recently from defaced to c4e75a5 Compare April 20, 2021 23:25

john-bodley commented Apr 20, 2021

View reviewed changes

ktmud approved these changes Apr 21, 2021

View reviewed changes

john-bodley force-pushed the john-bodley--fix-hive-upload-to-s3 branch 2 times, most recently from e51582c to 71b2443 Compare April 21, 2021 06:05

fix(hive): Update CSV to Hive upload prefix

0ea817e

john-bodley force-pushed the john-bodley--fix-hive-upload-to-s3 branch from 71b2443 to 0ea817e Compare April 21, 2021 06:57

Trigger notification

ce0ee19

john-bodley merged commit a8781c5 into apache:master Apr 24, 2021

john-bodley deleted the john-bodley--fix-hive-upload-to-s3 branch April 24, 2021 01:20

QAlexBall pushed a commit to QAlexBall/superset that referenced this pull request Dec 29, 2021

fix(hive): Update CSV to Hive upload prefix (apache#14255)

31711b3

* fix(hive): Update CSV to Hive upload prefix * Trigger notification Co-authored-by: John Bodley <john.bodley@airbnb.com>

mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.2.0 labels Mar 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(hive): Update CSV to Hive upload prefix #14255

fix(hive): Update CSV to Hive upload prefix #14255

john-bodley commented Apr 20, 2021 •

edited

Loading

john-bodley Apr 20, 2021

john-bodley Apr 20, 2021 •

edited

Loading

codecov bot commented Apr 21, 2021 •

edited

Loading

ktmud left a comment

ktmud Apr 21, 2021

john-bodley Apr 21, 2021

fix(hive): Update CSV to Hive upload prefix #14255

fix(hive): Update CSV to Hive upload prefix #14255

Conversation

john-bodley commented Apr 20, 2021 • edited Loading

SUMMARY

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TEST PLAN

ADDITIONAL INFORMATION

john-bodley Apr 20, 2021

Choose a reason for hiding this comment

john-bodley Apr 20, 2021 • edited Loading

Choose a reason for hiding this comment

codecov bot commented Apr 21, 2021 • edited Loading

Codecov Report

ktmud left a comment

Choose a reason for hiding this comment

ktmud Apr 21, 2021

Choose a reason for hiding this comment

john-bodley Apr 21, 2021

Choose a reason for hiding this comment

john-bodley commented Apr 20, 2021 •

edited

Loading

john-bodley Apr 20, 2021 •

edited

Loading

codecov bot commented Apr 21, 2021 •

edited

Loading