Add new hive _push method that supports cli operation and table properties #43

Merged: 1 commit merged into master from mw_hive_push_implementation on Mar 1, 2018

Conversation

matthewwardrop (Collaborator):

I think it's time we added the CLI operation for Hive to upstream Omniduct, given that it is by far the fastest method (and, for older versions of Hive, the only method) for getting data from pandas DataFrame objects directly into Hive. There are a few edge cases I haven't covered here, such as removing the columns associated with partition keys when exporting the DataFrame to CSV (see the sketch below). Is there anything else I missed?
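As a rough illustration of that partition-key edge case, here is a minimal, hypothetical sketch (drop_partition_columns is an assumed helper name, not the PR's actual API):

import pandas as pd

def drop_partition_columns(df, partition_keys):
    # Hive stores partition values in the directory layout rather than in the
    # data files themselves, so partitioned columns should be dropped before
    # the DataFrame is serialized to CSV for upload via the CLI.
    return df.drop(columns=[key for key in partition_keys if key in df.columns])

df = pd.DataFrame({'id': [1, 2], 'value': ['a', 'b'], 'ds': ['2018-03-01'] * 2})
print(drop_partition_columns(df, partition_keys=['ds']))  # the 'ds' column is removed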

danfrankj (Collaborator) commented on this diff:

temp_dir = tempfile.mkdtemp(prefix='omniduct_hiveserver2')
tmp_fname = os.path.join(temp_dir, 'data_{}.csv'.format(time.time()))
logger.info('Saving dataframe to file... {}'.format(tmp_fname))
df.fillna(r'\N').to_csv(tmp_fname, index=False, header=False,
Does r'\N' get parsed as NULL by Hive later?

matthewwardrop (Collaborator, Author): Yes, r'\N' is the character sequence interpreted by Hive as NULL (by default).
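To make the round trip concrete, here is a small sketch of what the CSV export produces; Hive's text SerDe reads the literal two-character sequence \N back as NULL (configurable per table via the serialization.null.format property):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', None]})
# Missing values are written as the literal sequence \N, which Hive treats as NULL.
print(df.fillna(r'\N').to_csv(index=False, header=False))
# 1.0,x
# \N,\N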

;
""").render(**locals())

print(cmd)
danfrankj (Collaborator): Leftover print statement.

matthewwardrop (Collaborator, Author): Fixed.

'U': 'STRING', # Unicode
'V': 'STRING' # void
}
sep = sep or chr(1)
danfrankj (Collaborator): Can you add a comment explaining why this character? It should probably also be moved up to the other defaults (above the DTYPE map).

matthewwardrop (Collaborator, Author): chr(1) is the CTRL-A character, which is the default field separator Hive uses when tables are stored as text files. It's a much better choice than r'\t', since tab characters are not at all infrequent in strings one might want to store.
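For illustration, a minimal sketch of how the CTRL-A delimiter behaves with pandas (the DataFrame contents here are made up):

import io
import pandas as pd

SEP = chr(1)  # CTRL-A (\x01): Hive's default field delimiter for text-format tables

df = pd.DataFrame({'a': [1, 2], 'b': ['has\ttab', 'plain']})
buf = io.StringIO()
# The embedded tab survives intact because it is not the delimiter.
df.to_csv(buf, sep=SEP, index=False, header=False)
print(repr(buf.getvalue()))  # '1\x01has\ttab\n2\x01plain\n'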

danfrankj (Collaborator):

Overall LGTM.

FYI, I'm working on making pandas' to_sql faster, and that should land soonish: pandas-dev/pandas#19664

matthewwardrop (Collaborator, Author) commented Feb 28, 2018:

@danfrankj Does your upstream work in pandas affect Hive, or just Presto? I mean, old versions of Hive don't even have the INSERT statement ;).

danfrankj (Collaborator):

@matthewwardrop After the upstream change, df.to_sql will check the SQLAlchemy dialect for supports_multivalues_insert and, if it's supported, do a bulk insert rather than inserting line by line.
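For reference, a minimal sketch of that dialect check (the connection URL is illustrative and assumes a SQLAlchemy dialect such as the one PyHive provides is installed):

from sqlalchemy import create_engine

engine = create_engine('presto://localhost:8080/hive/default')  # illustrative URL
# After the pandas change, to_sql consults this dialect flag: when True, rows are
# batched into a single multi-row INSERT instead of one INSERT statement per row.
print(getattr(engine.dialect, 'supports_multivalues_insert', False))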

matthewwardrop (Collaborator, Author):

@danfrankj That part I understood... I'm just wondering whether or not to make the CLI approach the default for Hive. Does Hive via pyhive and/or impyla support multivalue inserts for new enough versions of Hive?

matthewwardrop force-pushed the mw_hive_push_implementation branch 3 times, most recently from 661cfda to fc6588f on March 1, 2018 at 20:10.
matthewwardrop merged commit dc52db3 into master on Mar 1, 2018.
matthewwardrop deleted the mw_hive_push_implementation branch on March 1, 2018 at 21:15.