-
Notifications
You must be signed in to change notification settings - Fork 137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add default compression, and warning about local paths to dataframe_to_mds #748
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for contributing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm but before approval would like @XiaohanZhangCMU to verify comment
@srowen can you run |
I guess there are some test failures from test_utils/test_merge_index you need to adjust. The ground truth there are different now. |
Yeah I pushed linter fixes, but it is still complaining about the changed file. Hm. I will look at the test fixes yes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
OK I think it might be ready to go. |
@srowen of course, thanks for getting this one in! Will definitely make a difference for a lot of users. Apologies for my comma suggestions haha. Resolved those. |
Description of changes:
Set more appropriate defaults for compression in dataframe_to_mds.
In a distributed Spark env, it's probably worth saving I/O with compression, as many data types in MDS files will compress well (tokens). This enables default zstd compression if not specified.
Warn about local-only paths
In Databricks in particular, it's possible to provide a path to remote storage as a FUSE-mounted 'local' path likes /Volumes or /dbfs. It is interpreted as a local path, not remote. This works, but, in practice seems to be suboptimal as MDSWriter will causes lots of writes to the 'local' file. Instead it's probably better and more in line with default logic in CloudUploader to specify a temp dir as the local path, and the FUSE path as remote.
Because this is not necessarily a problem, it generates a warning only, not an error.
Issue #, if available:
None.
Merge Checklist:
Put an
x
without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.General
Tests
pre-commit
on my change. (check out thepre-commit
section of prerequisites)