Added databricks.labs.blueprint.paths.WorkspacePath as pathlib.Path equivalent #115

Merged
nfx merged 2 commits into main from feat/wspath on Jul 5, 2024

@nfx nfx commented Jul 5, 2024

Python-native pathlib.Path-like interfaces

This library exposes subclasses of the pathlib path classes from Python's standard
library that work with Databricks Workspace paths. Compared with plain str paths, they offer a more Pythonic way
to navigate, create, read, and delete workspace objects. The classes are designed as drop-in replacements
for pathlib.Path and add functionality specific to Databricks Workspace paths, such as as_fuse() and recursive rmdir().

[back to top]

Working With User Home Folders

This code initializes a client for a Databricks workspace, creates a relative workspace path
(~/some-folder/foo/bar/baz), and verifies that the path is not absolute. Calling absolute() on the relative path
is not implemented and raises NotImplementedError. The code then expands ~ to the user's home directory, creates
the specified directory if it does not already exist, and checks it through the equivalent /Users/<user_name>/... path.
Finally, it removes the parent directory: a plain rmdir() raises BadRequest, so recursive=True is required.

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import BadRequest
from databricks.labs.blueprint.paths import WorkspacePath

name = 'some-folder'
ws = WorkspaceClient()
wsp = WorkspacePath(ws, f"~/{name}/foo/bar/baz")
assert not wsp.is_absolute()

try:
    wsp.absolute()
except NotImplementedError:
    pass  # absolute() is not implemented for relative paths

with_user = wsp.expanduser()
with_user.mkdir()

user_name = ws.current_user.me().user_name
wsp_check = WorkspacePath(ws, f"/Users/{user_name}/{name}/foo/bar/baz")
assert wsp_check.is_dir()

try:
    wsp_check.parent.rmdir()
except BadRequest:
    pass  # a plain rmdir() is rejected here; recursive=True is needed
wsp_check.parent.rmdir(recursive=True)

assert not wsp_check.exists()
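
As a rough illustration of what expanding ~ means conceptually, the sketch below models it with the standard library's PurePosixPath. The helper expand_home and the placeholder user name are hypothetical, and the /Users/<user_name> home layout is taken from the wsp_check path above. This is not WorkspacePath's actual implementation, which asks the workspace for the current user.

```python
from pathlib import PurePosixPath


def expand_home(path: PurePosixPath, user_name: str) -> PurePosixPath:
    """Hypothetical helper: replace a leading "~" with /Users/<user_name>."""
    parts = path.parts
    if not parts or parts[0] != "~":
        return path  # nothing to expand
    return PurePosixPath("/Users", user_name, *parts[1:])


relative = PurePosixPath("~/some-folder/foo/bar/baz")
expanded = expand_home(relative, "username@example.com")  # placeholder user

assert not relative.is_absolute()
assert expanded.is_absolute()
assert expanded.as_posix() == "/Users/username@example.com/some-folder/foo/bar/baz"
```

The helper is pure string manipulation and needs no workspace connection, which makes the relative-versus-absolute distinction easy to check locally.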

[back to top]

Relative File Paths

This code expands the ~ symbol to the full path of the user's home directory, computes the relative path from the
home directory to ~/some-folder/foo/bar/baz, and verifies that it matches the expected
relative path (some-folder/foo/bar/baz). It then confirms that the expanded path is absolute, checks that
calling absolute() on it returns the path itself, and converts the path to a FUSE-compatible path
(/Workspace/Users/username@example.com/some-folder/foo/bar/baz).

from pathlib import Path
from databricks.sdk import WorkspaceClient
from databricks.labs.blueprint.paths import WorkspacePath

name = 'some-folder'
ws = WorkspaceClient()
wsp = WorkspacePath(ws, f"~/{name}/foo/bar/baz")
with_user = wsp.expanduser()

home = WorkspacePath(ws, "~").expanduser()
relative_name = with_user.relative_to(home)
assert relative_name.as_posix() == f"{name}/foo/bar/baz"

assert with_user.is_absolute()
assert with_user.absolute() == with_user
# strip the leading "/" so that the join keeps the /Workspace prefix
assert with_user.as_fuse() == Path("/Workspace") / with_user.as_posix().lstrip("/")
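
One detail worth keeping in mind when building an expected FUSE path by hand: in standard pathlib, joining with an absolute right-hand side discards the left-hand side. The sketch below shows this with PurePosixPath and no workspace connection. The helper to_fuse and the placeholder user name are hypothetical; it is not the library's as_fuse().

```python
from pathlib import PurePosixPath

home = PurePosixPath("/Users/username@example.com")  # placeholder user
with_user = home / "some-folder/foo/bar/baz"

# relative_to() strips the common prefix and yields a relative path.
assert with_user.relative_to(home).as_posix() == "some-folder/foo/bar/baz"

# Joining with an absolute path replaces the left-hand side entirely.
assert PurePosixPath("/Workspace") / with_user.as_posix() == with_user


def to_fuse(path: PurePosixPath) -> PurePosixPath:
    """Hypothetical helper: prefix an absolute workspace path with /Workspace."""
    return PurePosixPath("/Workspace") / path.as_posix().lstrip("/")


assert to_fuse(with_user).as_posix() == (
    "/Workspace/Users/username@example.com/some-folder/foo/bar/baz"
)
```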

[back to top]

Browser URLs for Workspace Paths

The as_uri() method returns a browser-accessible URI for the workspace path. This example retrieves the current
user's username from the Databricks workspace client, builds the expected URI for ~/some-folder/foo/bar/baz
from the host URL and the percent-encoded username (@ becomes %40), and then verifies that the URI
returned by with_user.as_uri() matches it:

from databricks.sdk import WorkspaceClient
from databricks.labs.blueprint.paths import WorkspacePath

name = 'some-folder'
ws = WorkspaceClient()
wsp = WorkspacePath(ws, f"~/{name}/foo/bar/baz")
with_user = wsp.expanduser()

user_name = ws.current_user.me().user_name
browser_uri = f'{ws.config.host}#workspace/Users/{user_name.replace("@", "%40")}/{name}/foo/bar/baz'

assert with_user.as_uri() == browser_uri
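
The manual replace("@", "%40") above is a special case of percent-encoding. A minimal sketch of the same URI shape using urllib.parse.quote follows; the helper browser_uri and the host value are placeholders for illustration, not part of the library.

```python
from urllib.parse import quote


def browser_uri(host: str, workspace_path: str) -> str:
    """Hypothetical helper: build a '#workspace' link for an absolute path."""
    # quote() keeps "/" by default and percent-encodes "@" as "%40".
    return f"{host}#workspace{quote(workspace_path)}"


uri = browser_uri(
    "https://example.cloud.databricks.com",  # placeholder host
    "/Users/username@example.com/some-folder/foo/bar/baz",
)
assert uri == (
    "https://example.cloud.databricks.com#workspace"
    "/Users/username%40example.com/some-folder/foo/bar/baz"
)
```

Using quote() also covers other reserved characters, such as spaces in folder names, which a single replace() call would miss.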

[back to top]

read/write_text(), read/write_bytes(), and glob() Methods

This code creates a WorkspacePath object for the path ~/some-folder/a/b/c, expands it to the full user path,
and creates the directory along with any necessary parent directories. It then creates a file named hello.txt within
this directory, writes "Hello, World!" to it, and verifies the content. The code recursively lists all .txt files
under the directory and ensures there is exactly one, which is hello.txt. Finally, it deletes hello.txt and confirms
that the file no longer exists.

from databricks.sdk import WorkspaceClient
from databricks.labs.blueprint.paths import WorkspacePath

name = 'some-folder'
ws = WorkspaceClient()
wsp = WorkspacePath(ws, f"~/{name}/a/b/c")
with_user = wsp.expanduser()
with_user.mkdir(parents=True)

hello_txt = with_user / "hello.txt"
hello_txt.write_text("Hello, World!")
assert hello_txt.read_text() == "Hello, World!"

files = list(with_user.glob("**/*.txt"))
assert len(files) == 1
assert hello_txt == files[0]
assert files[0].name == "hello.txt"

with_user.joinpath("hello.txt").unlink()

assert not hello_txt.exists()

The write_bytes() and read_bytes() methods round-trip binary content in the same way:

from databricks.sdk import WorkspaceClient
from databricks.labs.blueprint.paths import WorkspacePath

name = 'some-folder'
ws = WorkspaceClient()

wsp = WorkspacePath(ws, f"~/{name}")
with_user = wsp.expanduser()
with_user.mkdir(parents=True)

hello_bin = with_user.joinpath("hello.bin")
hello_bin.write_bytes(b"Hello, World!")

assert hello_bin.read_bytes() == b"Hello, World!"

with_user.joinpath("hello.bin").unlink()

assert not hello_bin.exists()
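
Because the calls above mirror pathlib.Path, the same sequence can be tried locally against a temporary directory. The sketch below uses only the standard library; the roundtrip function is a hypothetical helper, and passing a WorkspacePath to it instead of a local Path is an assumption that would need checking against a real workspace.

```python
import tempfile
from pathlib import Path


def roundtrip(folder: Path) -> list[str]:
    """Write, read back, glob, and delete files under `folder`."""
    folder.mkdir(parents=True, exist_ok=True)

    hello_txt = folder / "hello.txt"
    hello_txt.write_text("Hello, World!")
    assert hello_txt.read_text() == "Hello, World!"

    hello_bin = folder / "hello.bin"
    hello_bin.write_bytes(b"Hello, World!")
    assert hello_bin.read_bytes() == b"Hello, World!"

    # Only the .txt file matches the recursive pattern.
    names = sorted(p.name for p in folder.glob("**/*.txt"))

    hello_txt.unlink()
    hello_bin.unlink()
    assert not hello_txt.exists() and not hello_bin.exists()
    return names


with tempfile.TemporaryDirectory() as tmp:
    found = roundtrip(Path(tmp) / "a" / "b" / "c")

assert found == ["hello.txt"]
```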

[back to top]

Moving Files

This code creates a WorkspacePath object for the path ~/some-folder, expands it to the full user path, and creates
the directory along with any necessary parent directories. It then creates a file named hello.txt within this directory
and writes "Hello, World!" to it. The code then renames the file to hello2.txt, verifies that hello.txt no longer exists,
and checks that the content of hello2.txt is "Hello, World!".

from databricks.sdk import WorkspaceClient
from databricks.labs.blueprint.paths import WorkspacePath

name = 'some-folder'
ws = WorkspaceClient()

wsp = WorkspacePath(ws, f"~/{name}")
with_user = wsp.expanduser()
with_user.mkdir(parents=True)

hello_txt = with_user / "hello.txt"
hello_txt.write_text("Hello, World!")

hello_txt.replace(with_user / "hello2.txt")

assert not hello_txt.exists()
assert (with_user / "hello2.txt").read_text() == "Hello, World!"
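
For comparison, this is how replace() behaves on a local pathlib.Path: an existing target is overwritten, and the new path is returned. The sketch runs against a temporary directory; whether WorkspacePath mirrors each of these details is an assumption to verify, not something this example demonstrates.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    source = folder / "hello.txt"
    target = folder / "hello2.txt"

    source.write_text("Hello, World!")
    target.write_text("stale")  # an existing target is overwritten

    moved = source.replace(target)  # returns the new path (Python 3.8+)

    source_gone = not source.exists()
    content = target.read_text()

assert source_gone
assert moved == target
assert content == "Hello, World!"
```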

[back to top]

Working With Notebook Sources

This code initializes a Databricks WorkspaceClient, creates a WorkspacePath object for the path ~/some-folder, and
defines two items within this folder: a text file (a.txt) and a Python notebook (b). It creates the notebook with
the specified content and writes "Hello, World!" to the text file. The make_notebook helper is a test fixture and is
not defined in this snippet. The code then retrieves all files in the folder, asserts there are exactly two, and
verifies the suffix and content of each. Specifically, it checks that a.txt has a .txt suffix and that b has a .py
suffix even though its name has no extension, with the notebook containing the expected code.

from databricks.sdk import WorkspaceClient
from databricks.labs.blueprint.paths import WorkspacePath

ws = WorkspaceClient()

folder = WorkspacePath(ws, "~/some-folder")

txt_file = folder / "a.txt"
py_notebook = folder / "b"  # notebooks have no file extension

make_notebook(path=py_notebook, content="display(spark.range(10))")
txt_file.write_text("Hello, World!")

files = {_.name: _ for _ in folder.glob("**/*")}
assert len(files) == 2

assert files["a.txt"].suffix == ".txt"
assert files["b"].suffix == ".py"  # suffix is determined from ObjectInfo
assert files["b"].read_text() == "# Databricks notebook source\ndisplay(spark.range(10))"
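
The comment "suffix is determined from ObjectInfo" suggests that a notebook's suffix comes from workspace metadata rather than from its name. A minimal sketch of that idea follows. The helper effective_suffix, the LANGUAGE_SUFFIXES mapping, and the string values for object type and language are all assumptions for illustration; the library's real lookup may differ.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical mapping from notebook language to file suffix.
LANGUAGE_SUFFIXES = {"PYTHON": ".py", "SQL": ".sql"}


def effective_suffix(name: str, object_type: str, language: Optional[str]) -> str:
    """Hypothetical helper: pick a suffix from metadata for notebooks."""
    if object_type == "NOTEBOOK" and language is not None:
        return LANGUAGE_SUFFIXES.get(language, "")
    return PurePosixPath(name).suffix  # regular files keep their own suffix


assert effective_suffix("a.txt", "FILE", None) == ".txt"
assert effective_suffix("b", "NOTEBOOK", "PYTHON") == ".py"
assert PurePosixPath("b").suffix == ""  # the name alone carries no suffix
```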

[back to top]

github-actions bot commented Jul 5, 2024

✅ 18/18 passed, 2 skipped, 41s total

Running from acceptance #151

@nfx nfx merged commit 0ea0db9 into main Jul 5, 2024
8 of 9 checks passed
@nfx nfx deleted the feat/wspath branch July 5, 2024 10:20
nfx added a commit that referenced this pull request Jul 5, 2024
* Added `databricks.labs.blueprint.paths.WorkspacePath` as `pathlib.Path` equivalent ([#115](#115)). This commit introduces the `databricks.labs.blueprint.paths.WorkspacePath` library, providing Python-native `pathlib.Path`-like interfaces to simplify working with Databricks Workspace paths. The library includes `WorkspacePath` and `WorkspacePathDuringTest` classes offering advanced functionality for handling user home folders, relative file paths, browser URLs, and file manipulation methods such as `read/write_text()`, `read/write_bytes()`, and `glob()`. This addition brings enhanced, Pythonic ways to interact with Databricks Workspace paths, including creating and moving files, managing directories, and generating browser-accessible URIs. Additionally, the commit includes updates to existing methods and introduces new fixtures for creating notebooks, accompanied by extensive unit tests to ensure reliability and functionality.
* Added propagation of `blueprint` version into `User-Agent` header when it is used as library ([#114](#114)). A new feature has been introduced in the library that allows for the propagation of the `blueprint` version and the name of the command line interface (CLI) command used in the `User-Agent` header when the library is utilized as a library. This feature includes the addition of two new pairs of `OtherInfo`: `blueprint/X.Y.Z` to indicate that the request is made using the `blueprint` library and `cmd/<name>` to store the name of the CLI command used for making the request. The implementation involves using the `with_user_agent_extra` function from `databricks.sdk.config` to set the user agent consistently with the Databricks CLI. Several changes have been made to the test file for `test_useragent.py` to include a new test case, `test_user_agent_is_propagated`, which checks if the `blueprint` version and the name of the command are correctly propagated to the `User-Agent` header. A context manager `http_fixture_server` has been added that creates an HTTP server with a custom handler, which extracts the `blueprint` version and the command name from the `User-Agent` header and stores them in the `user_agent` dictionary. The test case calls the `foo` command with a mocked `WorkspaceClient` instance and sets the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables to test the propagation of the `blueprint` version and the command name in the `User-Agent` header. The test case then asserts that the `blueprint` version and the name of the command are present and correctly set in the `user_agent` dictionary.
* Bump actions/checkout from 4.1.6 to 4.1.7 ([#112](#112)). In this release, the version of the "actions/checkout" action used in the `Checkout Code` step of the acceptance workflow has been updated from 4.1.6 to 4.1.7. This update may include bug fixes, performance improvements, and new features, although specific changes are not mentioned in the commit message. The `Unshallow` step remains unchanged, continuing to fetch and clean up the repository's history. This update ensures that the latest enhancements from the "actions/checkout" action are utilized, aiming to improve the reliability and performance of the code checkout process in the GitHub Actions workflow. Software engineers should be aware of this update and its potential impact on their workflows.

Dependency updates:

 * Bump actions/checkout from 4.1.6 to 4.1.7 ([#112](#112)).
@nfx nfx mentioned this pull request Jul 5, 2024