Add new hive _push method that supports cli operation and table properties #43

Merged: 1 commit merged into master from mw_hive_push_implementation on Mar 1, 2018

Conversation

matthewwardrop (Collaborator):

I think it's time we added the CLI operation for Hive to upstream Omniduct, given that it is by far the fastest method (and, for older versions of Hive, the only method) for getting data from pandas DataFrame objects directly into Hive. There are a few edge cases I haven't covered here, such as removing the columns associated with partition keys when exporting the DataFrame to CSV (see the sketch below). Is there anything else I missed?
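As a rough illustration of that partition-key edge case, here is a minimal, hypothetical sketch (drop_partition_columns is an assumed helper name, not the PR's actual API):

import pandas as pd

def drop_partition_columns(df, partition_keys):
    # Hive stores partition values in the directory layout rather than in the
    # data files themselves, so partitioned columns should be dropped before
    # the DataFrame is serialized to CSV for upload via the CLI.
    return df.drop(columns=[key for key in partition_keys if key in df.columns])

df = pd.DataFrame({'id': [1, 2], 'value': ['a', 'b'], 'ds': ['2018-03-01'] * 2})
print(drop_partition_columns(df, partition_keys=['ds']))  # the 'ds' column is removed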

danfrankj (Collaborator) commented on this diff:

temp_dir = tempfile.mkdtemp(prefix='omniduct_hiveserver2')
tmp_fname = os.path.join(temp_dir, 'data_{}.csv'.format(time.time()))
logger.info('Saving dataframe to file... {}'.format(tmp_fname))
df.fillna(r'\N').to_csv(tmp_fname, index=False, header=False,
Does r'\N' get parsed as NULL by Hive later?

matthewwardrop (Collaborator, Author): Yes, r'\N' is the character sequence interpreted by Hive as NULL (by default).
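To make the round trip concrete, here is a small sketch of what the CSV export produces; Hive's text SerDe reads the literal two-character sequence \N back as NULL (configurable per table via the serialization.null.format property):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', None]})
# Missing values are written as the literal sequence \N, which Hive treats as NULL.
print(df.fillna(r'\N').to_csv(index=False, header=False))
# 1.0,x
# \N,\N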

;
""").render(**locals())

print(cmd)
danfrankj (Collaborator): Leftover print statement.

matthewwardrop (Collaborator, Author): Fixed.

'U': 'STRING', # Unicode
'V': 'STRING' # void
}
sep = sep or chr(1)
danfrankj (Collaborator): Can you add a comment explaining why this character? It should probably also be moved up to the other defaults (above the DTYPE map).

matthewwardrop (Collaborator, Author): chr(1) is the CTRL-A character, which is the default field separator Hive uses when tables are stored as text files. It's a much better choice than r'\t', since tab characters are not at all infrequent in strings one might want to store.
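For illustration, a minimal sketch of how the CTRL-A delimiter behaves with pandas (the DataFrame contents here are made up):

import io
import pandas as pd

SEP = chr(1)  # CTRL-A (\x01): Hive's default field delimiter for text-format tables

df = pd.DataFrame({'a': [1, 2], 'b': ['has\ttab', 'plain']})
buf = io.StringIO()
# The embedded tab survives intact because it is not the delimiter.
df.to_csv(buf, sep=SEP, index=False, header=False)
print(repr(buf.getvalue()))  # '1\x01has\ttab\n2\x01plain\n'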

danfrankj (Collaborator):

Overall LGTM.

FYI, I'm working on making pandas' to_sql faster, and that should land soonish: pandas-dev/pandas#19664

matthewwardrop (Collaborator, Author) commented Feb 28, 2018:

@danfrankj Does your upstream work in pandas affect Hive, or just Presto? I mean, old versions of Hive don't even have the INSERT statement ;).

danfrankj (Collaborator):

@matthewwardrop After the upstream change, df.to_sql will check the SQLAlchemy dialect for supports_multivalues_insert and, if it's supported, do a bulk insert rather than inserting line by line.
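For reference, a minimal sketch of that dialect check (the connection URL is illustrative and assumes a SQLAlchemy dialect such as the one PyHive provides is installed):

from sqlalchemy import create_engine

engine = create_engine('presto://localhost:8080/hive/default')  # illustrative URL
# After the pandas change, to_sql consults this dialect flag: when True, rows are
# batched into a single multi-row INSERT instead of one INSERT statement per row.
print(getattr(engine.dialect, 'supports_multivalues_insert', False))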

matthewwardrop (Collaborator, Author):

@danfrankj That part I understood... I'm just wondering whether or not to make the CLI approach the default for Hive. Does Hive via pyhive and/or impyla support multivalue inserts for new enough versions of Hive?

matthewwardrop force-pushed the mw_hive_push_implementation branch 3 times, most recently from 661cfda to fc6588f on March 1, 2018 at 20:10.
matthewwardrop merged commit dc52db3 into master on Mar 1, 2018.
matthewwardrop deleted the mw_hive_push_implementation branch on March 1, 2018 at 21:15.