pip reinstall is executed in every single task #8

Closed
jmeidam opened this issue Sep 20, 2023 · 4 comments

jmeidam commented Sep 20, 2023

I am trying to convert an existing dbx project to bundles.
I have some tasks of type python_wheel_task.

One such task looks like this (they're all similar):

        - task_key: "data_raw"
          depends_on:
            - task_key: "process_init"
          job_cluster_key: "somejobcluster"
          python_wheel_task:
            package_name: "myproject"
            entry_point: "data_raw"
          libraries:
            - whl: ./dist/myproject-*.whl

and I have defined the following artifact:

    artifacts:
      the_wheel:
        type: whl
        path: .
        build: poetry build
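
For context, here is roughly how these two pieces sit together in one databricks.yml; a minimal sketch, with the bundle and job names as placeholders and everything else taken from the snippets above:

```yaml
bundle:
  name: myproject  # placeholder bundle name

artifacts:
  the_wheel:
    type: whl
    path: .
    build: poetry build

resources:
  jobs:
    myproject_job:  # placeholder job name
      tasks:
        - task_key: "data_raw"
          depends_on:
            - task_key: "process_init"
          job_cluster_key: "somejobcluster"
          python_wheel_task:
            package_name: "myproject"
            entry_point: "data_raw"
          libraries:
            - whl: ./dist/myproject-*.whl
```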

In dbx, the wheel would be installed once on the job cluster.
Now I noticed that every task is converted to a notebook that contains the following code:

%pip install --force-reinstall /Workspace/Shared/dbx/projects/myproject/.internal/.../myproject-0.0.0-py3-none-any.whl

This seems rather wasteful of run time if you have many tasks doing small things on the same cluster.

Am I missing a setting, or is this done by design?

@andrewnester

At the moment this is done by design. We're aware of the concerns and working on a path forward.

You can follow this issue for more updates: databricks/cli#783

As a data point, could you please share which Databricks runtime version you are using for the clusters running your Python wheel jobs?

Thanks!


jmeidam commented Sep 21, 2023

Hi Andrew, thanks for the link.

I am using 11.3.x-scala2.12
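
If it helps, that version is pinned in the bundle under the job cluster definition; a minimal sketch, with node type and sizing as placeholders:

```yaml
resources:
  jobs:
    myproject_job:  # placeholder job name
      job_clusters:
        - job_cluster_key: "somejobcluster"
          new_cluster:
            spark_version: "11.3.x-scala2.12"
            node_type_id: "i3.xlarge"  # placeholder
            num_workers: 1             # placeholder
```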

github-merge-queue bot pushed a commit to databricks/cli that referenced this issue Sep 26, 2023
## Changes
Instead of always using a notebook wrapper for Python wheel tasks, let's
make it an opt-in option.

Now, by default, Python wheel tasks are deployed as-is to the Databricks
platform.
If a notebook wrapper is required (DBR < 13.1 or other configuration
differences), users can provide the following experimental setting:

```
experimental:
  python_wheel_wrapper: true
```

Fixes #783,
databricks/databricks-asset-bundles-dais2023#8

## Tests
Added unit tests.

Integration tests passed for both cases:

```
    helpers.go:163: [databricks stdout]: Hello from my func
    helpers.go:163: [databricks stdout]: Got arguments:
    helpers.go:163: [databricks stdout]: ['my_test_code', 'one', 'two']
    ...
Bundle remote directory is ***/.bundle/ac05d5e8-ed4b-4e34-b3f2-afa73f62b021
Deleted snapshot file at /var/folders/nt/xjv68qzs45319w4k36dhpylc0000gp/T/TestAccPythonWheelTaskDeployAndRunWithWrapper3733431114/001/.databricks/bundle/default/sync-snapshots/cac1e02f3941a97b.json
Successfully deleted files!
--- PASS: TestAccPythonWheelTaskDeployAndRunWithWrapper (214.18s)
PASS
coverage: 93.5% of statements in ./...
ok      github.com/databricks/cli/internal/bundle       214.495s        coverage: 93.5% of statements in ./...

```

```
    helpers.go:163: [databricks stdout]: Hello from my func
    helpers.go:163: [databricks stdout]: Got arguments:
    helpers.go:163: [databricks stdout]: ['my_test_code', 'one', 'two']
    ...
Bundle remote directory is ***/.bundle/0ef67aaf-5960-4049-bf1d-dc9e29157421
Deleted snapshot file at /var/folders/nt/xjv68qzs45319w4k36dhpylc0000gp/T/TestAccPythonWheelTaskDeployAndRunWithoutWrapper2340216760/001/.databricks/bundle/default/sync-snapshots/edf0b322cee93b13.json
Successfully deleted files!
--- PASS: TestAccPythonWheelTaskDeployAndRunWithoutWrapper (192.36s)
PASS
coverage: 93.5% of statements in ./...
ok      github.com/databricks/cli/internal/bundle       195.130s        coverage: 93.5% of statements in ./...

```
@andrewnester

The change addressing this issue was just released in CLI version 0.206.0; feel free to give it a try.
Since you're using runtime 11.3.x, please upgrade to DBR 13.2+, as the fix only applies there.

You can find more details here: databricks/cli#797
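
If you need to stay on DBR 11.3 for now, the notebook wrapper can still be opted into with the experimental setting from the change above; a minimal sketch, assuming it sits at the top level of databricks.yml:

```yaml
# Opt back into the notebook wrapper for Python wheel tasks (needed on DBR < 13.1)
experimental:
  python_wheel_wrapper: true
```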


pietern commented Oct 12, 2023

The linked PR has been merged and Python wheel tasks are no longer wrapped by a notebook by default.

pietern closed this as completed on Oct 12, 2023
hectorcast-db pushed a commit to databricks/cli that referenced this issue Oct 13, 2023