
✨ support for files bigger than 1MB in sync #509

Merged
6 commits merged into databrickslabs:main on Mar 21, 2023

Conversation

greentim
Contributor

@greentim greentim commented Oct 6, 2022

Proposed changes

Bug report: #508
The rest API for put only works for files 1MB or less. There is a separate api to call for larger files, that requires three or more calls: 1 to create the request, 1 or more add-block calls to upload 1MB chunks , and finally a close. This PR will use put if the file length is < 1MB, and otherwise uses the create/add/close apis.

Types of changes

What types of changes does your code introduce to dbx?
Put an x in the boxes that apply

  • [x] Bugfix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Documentation Update (if none of the other choices apply)

Further comments

Switching the tests to AsyncMock was necessary to properly mock the calls, because the result of the API call is now awaited in order to obtain the handle for the block uploads. test_delete_secure was duplicated in main, so one copy was removed. We have been using this patch at my company since June 2022 with no issues.
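
A minimal, self-contained sketch (not the dbx test itself; the session, URL, and handle value are made up) of why AsyncMock is needed once the API response is awaited:

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def create_upload_handle(session) -> int:
    # Simplified stand-in for the client: the create response must be awaited
    # before the handle for the add-block calls can be read.
    return (await session.post("http://fakehost.asdf/api/2.0/dbfs/create"))["handle"]


async def main() -> None:
    session = MagicMock()
    # An AsyncMock returns an awaitable when called; a plain MagicMock would
    # fail here with "object MagicMock can't be used in 'await' expression".
    session.post = AsyncMock(return_value={"handle": 7})
    assert await create_upload_handle(session) == 7


asyncio.run(main())
```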

…v2.0 will use the streaming api calls to upload the content and avoid the http 400 MAX_BLOCK_SIZE_EXCEEDED error
@renardeinside
Contributor

hi @greentim ,
thanks a lot for your PR! Please sign your commits; we don't approve unsigned commits for security reasons.

@greentim
Contributor Author

greentim commented Oct 14, 2022 via email

@renardeinside
Contributor

@matthayes please review this PR

@renardeinside
Contributor

renardeinside commented Nov 10, 2022

hi @greentim ,
could you please resolve the conflicts with main? I'll take a look at this PR myself and merge it for 0.8.7; sorry for the delay.

@jeremy010203

Hello @renardeinside,
I performed the merge myself at https://github.com/jeremy010203/dbx/tree/greentim-bug/dbfs_sync_1mb since it is a feature that I need and the creator of the PR seems inactive.
I am not familiar with the process; do I have to open a new PR? Thanks

@greentim greentim force-pushed the bug/dbfs_sync_1mb branch 2 times, most recently from aa72d8d to 73a326b Compare December 13, 2022 16:03
@greentim
Contributor Author

Hello @renardeinside, @jeremy010203, sorry I missed the previous message for some reason. I have resolved the conflicts and I believe this should be ready for review again.

@jeremy010203

@matthayes is it possible to have a look at this PR, please?
It would be very useful for our team as we migrate to Databricks using dbx!
Thanks a lot 😃

Contributor

@matthayes matthayes left a comment


Sorry for the delayed review. I ran a test and this appears to work. I synced a large file, then fetched it, then ran a diff, and the contents matched. Thanks for the fix!

@matthayes
Contributor

Please fix the failing unit test. Otherwise this looks good to me.

@matthayes
Contributor

Here's the fix:

diff --git a/tests/unit/sync/clients/test_dbfs_client.py b/tests/unit/sync/clients/test_dbfs_client.py
index d2f2190..e983390 100644
--- a/tests/unit/sync/clients/test_dbfs_client.py
+++ b/tests/unit/sync/clients/test_dbfs_client.py
@@ -240,11 +240,11 @@ def test_put_max_block_size_exceeded(client: DBFSClient, dummy_file_path_2mb: st
     def mock_post(url, *args, **kwargs):
         resp = AsyncMock()
         setattr(type(resp), "status", PropertyMock(return_value=200))
-        if "/base/api/2.0/dbfs/put" in url:
+        if "/api/2.0/dbfs/put" in url:
             contents = kwargs.get("json").get("contents")
             if len(contents) > 1024 * 1024:  # replicate the api error thrown when contents exceeds max allowed
                 setattr(type(resp), "status", PropertyMock(return_value=400))
-        elif "/base/api/2.0/dbfs/create" in url:
+        elif "/api/2.0/dbfs/create" in url:
             # return a mock response json
             resp.json = MagicMock(side_effect=mock_json)
 
@@ -262,11 +262,11 @@ def test_put_max_block_size_exceeded(client: DBFSClient, dummy_file_path_2mb: st
     chunks = textwrap.wrap(base64.b64encode(bytes(expected_contents, encoding="utf8")).decode("ascii"), 1024 * 1024)
 
     assert session.post.call_count == len(chunks) + 2
-    assert session.post.call_args_list[0][1]["url"] == "http://fakehost.asdf/base/api/2.0/dbfs/create"
-    assert session.post.call_args_list[1][1]["url"] == "http://fakehost.asdf/base/api/2.0/dbfs/add-block"
-    assert session.post.call_args_list[2][1]["url"] == "http://fakehost.asdf/base/api/2.0/dbfs/add-block"
-    assert session.post.call_args_list[3][1]["url"] == "http://fakehost.asdf/base/api/2.0/dbfs/add-block"
-    assert session.post.call_args_list[4][1]["url"] == "http://fakehost.asdf/base/api/2.0/dbfs/close"
+    assert session.post.call_args_list[0][1]["url"] == "http://fakehost.asdf/api/2.0/dbfs/create"
+    assert session.post.call_args_list[1][1]["url"] == "http://fakehost.asdf/api/2.0/dbfs/add-block"
+    assert session.post.call_args_list[2][1]["url"] == "http://fakehost.asdf/api/2.0/dbfs/add-block"
+    assert session.post.call_args_list[3][1]["url"] == "http://fakehost.asdf/api/2.0/dbfs/add-block"
+    assert session.post.call_args_list[4][1]["url"] == "http://fakehost.asdf/api/2.0/dbfs/close"

@codecov

codecov bot commented Mar 20, 2023

Codecov Report

Merging #509 (843356c) into main (4c221a9) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #509      +/-   ##
==========================================
+ Coverage   93.75%   93.77%   +0.01%     
==========================================
  Files          98       98              
  Lines        3796     3805       +9     
  Branches      474      476       +2     
==========================================
+ Hits         3559     3568       +9     
  Misses        179      179              
  Partials       58       58              
| Impacted Files | Coverage Δ |
| --- | --- |
| dbx/sync/clients.py | 100.00% <100.00%> (ø) |


@renardeinside renardeinside changed the title from "files that are larger than the 1MB limit allowed in the dbfs put API …" to "✨ support for files bigger than 1MB in sync" on Mar 21, 2023
@renardeinside renardeinside merged commit f55b40a into databrickslabs:main Mar 21, 2023