
[EPIC] Detect notebook include graph by analysing workspace file imports and sys.path manipulation #1202

Closed
18 tasks done
Tracked by #1085
nfx opened this issue Apr 1, 2024 · 3 comments · Fixed by #1633
Labels
CUJ critical user journey · migrate/code Abstract Syntax Trees and other dark magic

Comments

nfx commented Apr 1, 2024

Relevant issues:

With Databricks Runtime 11.2 and above, you can create and manage source code files in the Databricks workspace, and then import these files into your notebooks as needed.

Related info:

The following list orders precedence from highest to lowest; a lower number means higher precedence.

  1. Libraries in the current working directory (Git folders only).
  2. Libraries in the Git folder root directory (Git folders only).
  3. Notebook-scoped libraries (%pip install in notebooks).
  4. Cluster libraries (using the UI, CLI, or API).
  5. Libraries included in Databricks Runtime.
  6. Libraries installed with init scripts might resolve before or after built-in libraries, depending on how they are installed. Databricks does not recommend installing libraries with init scripts.
  7. Libraries in the current working directory (not in Git folders).
  8. Workspace files appended to the sys.path.
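For context, item 8 is the pattern this epic needs to detect: a notebook extends `sys.path` with a workspace folder and then imports source files from it. A minimal illustration of that usage (the paths and module name are made up):

```python
import sys

# Make a workspace folder importable; it resolves last, per the precedence list above.
sys.path.append("/Workspace/Users/someone@example.com/utils")

import helpers  # a hypothetical module living in that workspace folder
```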

Proposed Solution

Detect `import` statements and `sys.path` manipulations so that imported workspace files are treated as dependencies.
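A rough sketch of how detecting `sys.path` manipulation could work, using the standard-library `ast` module for illustration (the actual UCX implementation may use a different parser and richer matching):

```python
import ast

def appended_sys_paths(source: str) -> list[str]:
    """Collect string literals passed to sys.path.append/insert (illustrative)."""
    paths = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match sys.path.append(...) and sys.path.insert(...)
        if (
            isinstance(func, ast.Attribute)
            and func.attr in ("append", "insert")
            and isinstance(func.value, ast.Attribute)
            and func.value.attr == "path"
            and isinstance(func.value.value, ast.Name)
            and func.value.value.id == "sys"
        ):
            for arg in node.args:
                if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                    paths.append(arg.value)
    return paths

print(appended_sys_paths("import sys\nsys.path.append('/Workspace/libs')"))
# ['/Workspace/libs']
```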

Additional Context

No response

@ericvergnaud

@nfx the 'Python library precedence' link above returns an empty file (maybe only when called from outside a DB account)


ericvergnaud commented Apr 10, 2024

Splitting into more granular issues:
#1342
#1346
#1360
#1363
#1365
#1379
#1382
#1358
#1399
#1410
#1421
#1427
#1439
#1468

nfx pushed a commit that referenced this issue Apr 11, 2024
## Changes
Parse Python code to process `import` and `from ... import` instructions
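As an illustration of the kind of processing this commit describes (not the PR's actual code), collecting module names from both import forms with the standard-library `ast` module might look like:

```python
import ast

def imported_modules(source: str) -> set[str]:
    """Collect module names from `import x` and `from x import y` (illustrative)."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # node.module is None for relative imports like `from . import x`
            modules.add(node.module)
    return modules

print(imported_modules("import os\nfrom utils import helpers"))
# {'os', 'utils'}
```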

### Linked issues
#1202 
Resolves #1346

### Functionality 

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`

### Tests

- [ ] manually tested
- [x] added unit tests
- [ ] added integration tests
- [ ] verified on staging environment (screenshot attached)

---------

Co-authored-by: Cor <jczuurmond@protonmail.com>
nfx pushed a commit that referenced this issue Apr 12, 2024
## Changes
Implement the required methods in `PythonLinter`

### Linked issues
#1202 
Resolves #1379

### Functionality 

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`

### Tests
- [ ] manually tested
- [x] added unit tests
- [ ] added integration tests
- [ ] verified on staging environment (screenshot attached)
nfx pushed a commit that referenced this issue Apr 12, 2024
## Changes
Filter whitelisted Python import dependencies
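Conceptually, this step keeps only imports that are not on the known-compatibility whitelist; a toy sketch under that assumption (names and structure are illustrative, not the PR's API):

```python
# Stand-in for the compatibility catalog shipped with the tool.
KNOWN_COMPATIBLE = {"os", "sys", "json", "pyspark"}

def unknown_dependencies(modules: set[str]) -> set[str]:
    """Drop whitelisted imports; what remains needs dependency resolution."""
    return {m for m in modules if m not in KNOWN_COMPATIBLE}

print(unknown_dependencies({"os", "my_workspace_module"}))
# {'my_workspace_module'}
```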

### Linked issues
Related to #1202 
Resolves #1365

---------

Co-authored-by: Cor <jczuurmond@protonmail.com>
nfx pushed a commit that referenced this issue Apr 15, 2024
## Changes
Implement a `SitePackages` scanner that uses metadata to link module root names with the actual Python code belonging to installed packages
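On Python 3.10+, the standard library exposes a similar mapping via `importlib.metadata.packages_distributions()`; a minimal sketch of the idea (the scanner in this PR reads package metadata itself and may differ):

```python
from importlib.metadata import packages_distributions

# Maps top-level module names to the distributions that provide them,
# e.g. "yaml" -> ["PyYAML"] when PyYAML is installed.
module_to_dist = packages_distributions()
print(module_to_dist.get("yaml"))
```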

### Linked issues
#1202
Resolves #1410 

### Functionality 

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`

### Tests
- [ ] manually tested
- [x] added unit tests
- [ ] added integration tests
- [ ] verified on staging environment (screenshot attached)
nfx pushed a commit that referenced this issue Apr 15, 2024
## Changes
add SHELL CellLanguage 

### Linked issues
#1202
Resolves #1399
@ericvergnaud

@nfx from its title, this ticket only applies to workspaces. Should we change the title or create a ticket for the local file system?

@nfx nfx added the CUJ critical user journey label Apr 24, 2024
nfx pushed a commit that referenced this issue Apr 24, 2024
…dencies (#1529)

## Changes
Create a `DependencyProblem` class that can later be converted to an `Advice` whenever a problem is encountered.
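A hypothetical sketch of the shape described above (field names and the conversion method are assumptions, not the PR's actual class):

```python
from dataclasses import dataclass

@dataclass
class DependencyProblem:
    code: str          # e.g. "import-not-found"
    message: str
    source_path: str   # file or notebook where the problem was found

    def as_advice(self) -> str:
        # A real Advice would carry structured position info; render a string here.
        return f"{self.code}: {self.message} ({self.source_path})"

print(DependencyProblem("import-not-found", "cannot locate 'helpers'", "nb.py").as_advice())
```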

### Linked issues
#1202
Resolves #1444
Resolves #1431
Resolves #1439

### Functionality 

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`

### Tests

- [ ] manually tested
- [x] added unit tests
- [ ] added integration tests
- [ ] verified on staging environment (screenshot attached)


Supersedes PRs #1447 and #1448
nfx added a commit that referenced this issue Apr 26, 2024
*  Fix test failure: `test_running_real_remove_backup_groups_job` ([#1445](https://github.com/databrickslabs/ucx/issues/1445)). In this release, a fix has been implemented to address an issue with the `test_running_real_remove_backup_groups_job` function in the `tests/integration/test_installation.py` file, which was causing a test failure. The changes include the addition of a retry mechanism to wait for the group to be deleted, which will help ensure that the group is properly deleted. This mechanism retries the `ws.groups.get` command up to a minute in case of a `NotFound` or `InvalidParameterValue` exception. It is important to note that this commit introduces a change to the `test_running_real_remove_backup_groups_job` function. Manual testing was conducted to verify the changes, but no new unit or integration tests were added. As a software engineer adopting this project, you should be aware of this modification and its potential impact on your testing processes. This change is part of our ongoing efforts to maintain and improve the project.
* A notebook linter to detect DBFS references within notebook cells ([#1393](https://github.com/databrickslabs/ucx/issues/1393)). A new linter has been developed for Notebooks that examines SQL and Python cells for references to DBFS (Databricks File System) mount points or folders, raising Advisory or Deprecated warnings as necessary. This enhances code security and maintainability by helping developers avoid potential issues when working with DBFS. The `NotebookLinter` class accepts a `Languages` object and a `Notebook` object in its constructor, and the `lint` method now checks for DBFS references in the notebook's cells. Two new methods, `original_offset` and `new_cell`, have been added to the `Cell` class, and the `extract_cells` method has been updated accordingly. The `_remove_magic_wrapper` method has also been improved for better code processing and reusability. This linter uses the sqlglot library with the `databricks` dialect to parse SQL statements, recognizing Databricks-specific SQL functions and syntax. This ensures that code using DBFS follows best practices and is up-to-date.
* Added CLI commands to trigger table migration workflow ([#1511](https://github.com/databrickslabs/ucx/issues/1511)). A new CLI command, "migrate-tables", has been added to facilitate table migration in a more flexible and convenient manner. This command, implemented in the "cli.py" file of the "databricks/labs/ucx" package, triggers the `migrate-tables` and `migrate-external-hiveserde-tables-in-place-experimental` workflows. It identifies tables with the `EXTERNAL_HIVESERDE` attribute and prompts the user to confirm running the migration for external HiveSerDe tables. The migration process can be assigned to a specific metastore and default catalog, with the latter set to `hive_metastore` if not specified. These changes provide improved table management and migration capabilities, offering greater control and ease of use for our software engineering audience.
* Added CSV, JSON and include path in mounts ([#1329](https://github.com/databrickslabs/ucx/issues/1329)). The latest open-source library update introduces CSV and JSON support in the TablesInMounts class, which crawls for tables within mounts. A new parameter, 'include_paths_in_mount', has been included to specify a list of paths for crawling. This feature allows users to crawl and include specific paths in their mount crawl, providing more fine-grained control over the crawling process. Additionally, new methods have been added to detect CSV, JSON, and partitioned Parquet files, while existing methods have been updated to handle the new parameter. New tests have been added to ensure that only the specified paths are included in the crawl and that the correct file formats are detected. These changes enhance the functionality and flexibility of the TablesInMounts feature, providing greater control and precision in crawling and detecting various file formats.
* Added CTAS migration workflow for external tables that cannot be migrated in place ([#1510](https://github.com/databrickslabs/ucx/issues/1510)). A new CTAS (Create Table As Select) migration workflow has been added for external tables that cannot be migrated in-place, enabling more efficient and flexible data management. The `MigrateExternalTablesCTAS` method is added, facilitating the creation of a Change Data Capture (CDC) task for external tables using CTAS queries. New integration tests have been introduced, covering HiveSerDe format migration, and handling potential NotFound errors with retry decorators and timeout settings. Additionally, a new JSON file for testing has been added, enabling testing of migration workflows for external Hive tables that cannot be in-place migrated. New modules and methods for migrating hive serde tables in-place, handling other external CTAS tables, and managing hive serde CTAS tables have been added, and test cases have been updated to include these new methods.
* Added Python linter for table creation with implicit format ([#1435](https://github.com/databrickslabs/ucx/issues/1435)). A new linter has been implemented for Python code to advise on explicit table format specification when using Databricks Runtime (DBR) 8.0 and later versions. This change comes in response to the default table format changing from `parquet` to `delta` in DBR 8.0 when no format is specified. The linter checks for 'writeTo', 'table', 'insertInto', and `saveAsTable` method invocations without an explicit format and suggests updates to include an explicit format. It supports `format` invocation in the same chain of calls and as a direct argument for 'saveAsTable'. Linting is the only functionality provided, and the linter skips linting when the DBR version is 8.0 or later. The linter is implemented in 'table_creation.py', making use of reusable AST utilities in 'python_ast_util.py', and is accompanied by unit tests. The `code migration` workflow has been updated to include this new linting functionality.
* Added Support for Migrating Table ACL of Interactive clusters using SPN ([#1077](https://github.com/databrickslabs/ucx/issues/1077)). A new class `ServicePrincipalClusterMapping` has been added to store the mapping between an interactive cluster and its corresponding Service Principal, and the `AzureServicePrincipalCrawler` class has been updated to include a new method `get_cluster_to_storage_mapping`. This method retrieves the mapping between clusters and their corresponding SPNs by iterating through all the clusters in the workspace, filtering out job clusters and clusters with specific data security modes, and retrieving the corresponding SPNs using the existing `_get_azure_spn_from_cluster_config` method. The retrieved mapping is then returned as a list of `ServicePrincipalClusterMapping` objects. Additionally, the commit adds support for migrating table ACLs of interactive clusters using a Service Principal Name (SPN) in Azure environments, achieved through the introduction of new classes, functions, and changes to existing functionality in various modules. These changes facilitate more flexible and secure ACL management for interactive clusters in Azure environments.
* Added Support for migrating Schema/Catalog ACL for Interactive cluster ([#1413](https://github.com/databrickslabs/ucx/issues/1413)). This pull request adds support for migrating schema and catalog ACLs for interactive clusters, specifically for AWS and Azure, addressing partially issues [#1192](https://github.com/databrickslabs/ucx/issues/1192) and [#1193](https://github.com/databrickslabs/ucx/issues/1193). It identifies database ACL grants from the PrincipalACL class, maps Hive Metastore schema to Unity Catalog (UC) schema and catalog using Table Mapping, and replaces Hive Metastore actions with equivalent UC actions. While it covers both cloud platforms, external location permission is excluded and will be addressed in a separate PR. Changes include updating the `_SPARK_CONF` variable in the test_migrate.py file and modifying the `test_migrate_external_tables_with_principal_acl_azure` function to skip tests in non-Azure environments. The `CatalogSchema` class now accepts a `principal_acl` parameter, and a new test function, `test_catalog_schema_acl`, has been added. This PR introduces new methods, modifies existing functionality, and includes unit, integration, and manual tests.
* Added `.suffix` override for notebooks in `WorkspacePath` ([#1557](https://github.com/databrickslabs/ucx/issues/1557)). A new `.suffix` override has been added to the `WorkspacePath` class for more consistent handling of workspace notebooks and files, addressing issue [#1455](https://github.com/databrickslabs/ucx/issues/1455). This enhancement includes a `Language` class import from `databricks.sdk.service.workspace` and enables setting the language and import format for uploading notebooks using the `make_notebook` fixture's `language` parameter. The commit also adds an `overwrite` parameter to handle existing notebook overwriting and modifies the `ws.workspace.upload` function to support new `language` and `format` parameters. Additionally, a new test case `test_file_and_notebook_in_same_folder_with_different_suffixes` in `test_wspath.py` ensures proper behavior when working with multiple file types in a single folder within the workspace.
* Added `databricks labs ucx logs` command ([#1350](https://github.com/databrickslabs/ucx/issues/1350)). A new `databricks labs ucx logs` command has been introduced, facilitating the logging of events in UCX installations, addressing issue [#1350](https://github.com/databrickslabs/ucx/issues/1350) and fixing [#1282](https://github.com/databrickslabs/ucx/issues/1282). The command is implemented in the `logs.py` file, and retrieves and logs the most recent run of each job, displaying a warning if there are no jobs to relay logs for. The implementation includes the `relay_logs` method, which logs records using `logging.getLogger`, and the `_fetch_logs` method to retrieve logs for a specified workflow and run. The `tests/unit/test_cli.py` file has been updated to include a new test case for the `logs` function, ensuring the logs are fetched correctly from the Databricks workspace. The `cli_command.py` module includes the new `logs` function, responsible for fetching logs and printing them to the console. Overall, this feature enhances the diagnostic capabilities of the UCX installer, providing a dedicated command for generating and managing logs.
* Added assessment workflow test with external hms ([#1460](https://github.com/databrickslabs/ucx/issues/1460)). This release introduces a new assessment workflow test using an external Hive Metastore Service (hms), which has been manually tested and verified on the staging environment. The `validate_workflow` function has been updated to allow skipping known failed tasks. A new method, `test_running_real_assessment_job_ext_hms`, has been added, which sets up an external hms cluster with specific configurations, grants permissions to a group, deploys and runs a workflow, and validates its success while skipping failed tasks on the SQL warehouse. The `test_migration_job_ext_hms` method has also been updated to include an assertion to check if the Hive Metastore version and GlueCatalog are enabled. Additionally, integration tests have been added to ensure the functionality of these new features. This release is aimed at improving the library's integration with external hms and providing more flexibility in testing and validating workflows.
* Added back prompts for table migration job cluster configuration ([#1195](https://github.com/databrickslabs/ucx/issues/1195)). A new function, `_config_table_migration`, has been added to the `install.py` file to improve the configuration of the table migration job cluster. This function allows users to set the parallelism, minimum and maximum number of workers for auto-scaling. The `spark_conf_dict` parameter is updated with the new spark configuration. The code has been refactored, simplifying the creation of schemas, catalogs, and tables, and improving readability. The `test_table_migration_job` function has been updated to utilize the new schema object, checking if the tables are migrated correctly and validating the configuration of the cluster. Additional properties such as `parallelism_for_migrating`, `min_workers_for_auto_scale`, and `max_workers_for_auto_scale` have been introduced for configuring the parallelism and number of workers for auto-scaling. These properties are passed as arguments to the `test_fresh_install` function. The `test_install_cluster_override_jobs` function has replaced the `WorkspaceInstallation` instance with `WorkflowsDeployment`, which may affect how the installation process handles clusters and jobs. The `test_fresh_install` function now includes configurations for SQL warehouse type, mapping workspace groups, and configuring the number of days for submit runs history, number of threads, policy ID, minimum and maximum workers, renamed group prefix, warehouse ID, workspace start path, and `spark_conf` with the parallelism configuration for spark SQL sources.
* Added check for DBFS mounts in SQL code ([#1351](https://github.com/databrickslabs/ucx/issues/1351)). In this release, we have added a check for Databricks File System (DBFS) mounts in SQL code, enhancing the system's ability to handle DBFS-related operations within SQL contexts. We have introduced a new `FromDbfsFolder` class in the DBFS module of the source code, which is added to the SQL SequentialLinter for SQL code checking. This change ensures that any references to DBFS mounts in SQL code are valid and properly formatted, improving the system's ability to validate SQL code that interacts with DBFS mounts. Additionally, we have updated the `test_dbfs.py` file with new methods to test DBFS-related functionality, and the `FromDbfsFolder` class is now responsible for identifying deprecated DBFS usage in SQL code. These updates provide developers with better insights into how DBFS usage is handled in SQL code and facilitate smoother data manipulation and retrieval for end-users and software engineers adopting this project.
* Added check for circular view dependency ([#1502](https://github.com/databrickslabs/ucx/issues/1502)). A circular view dependency check has been implemented in the hive metastore to prevent infinite loops during view migrations. This check ensures that views do not depend on each other in a circular manner, handling cases where view A depends on view B, view B depends on view C, and view C depends on view A. Two new methods, `test_migrate_circular_views_raises_value_error` and `test_migrate_circular_view_chain_raises_value_error`, have been added to the `test_views_sequencer.py` file to test for circular view dependencies and circular dependency chains. These methods utilize a mock backend to simulate a real SQL backend and check if the code raises a `ValueError` with the correct error message when circular view dependencies are detected. Additionally, an existing test has been updated, and an error message related to circular view references has been modified. The changes have been manually tested and verified with unit tests. Integration tests and staging environment verification are pending.
* Added commands for metastores listing & assignment ([#1489](https://github.com/databrickslabs/ucx/issues/1489)). A new feature has been implemented in the Unity Catalog (UCX) tool to enhance metastore management and assignment. This feature includes two new commands: `assign-metastore` and `show-all-metastores`. The `assign-metastore` command automatically assigns a UCX metastore to a specified workspace, while the `show-all-metastores` command displays all possible metastores that can be assigned to a workspace. These changes have been thoroughly tested using manual testing and unit tests, with new user documentation added to support this functionality. However, verification on a staging environment is still pending. The new methods have been implemented in the `cli_command.py` file, and the diff shows the addition of the `AccountMetastores` class and its import in the `cli_command.py` file. A new default catalog can be set using the default_namespace setting API. This feature is expected to improve the overall management and assignment of metastores in UCX.
* Added document for table migration workflow ([#1229](https://github.com/databrickslabs/ucx/issues/1229)). This release introduces detailed documentation for a table migration workflow, designed to facilitate the migration of tables from the Hive Metastore to the Unity Catalog in Databricks. The migration process consists of three stages: assessment, group migration, and the table migration workflow. The table migration workflow includes several tasks such as creating table mappings, migrating credentials, and creating catalogs and schemas in the Unity Catalog. The documentation includes the necessary commands to perform these tasks, along with dependency CLI commands like `create-table-mapping`, `principal-prefix-access`, `migrate-credentials`, `migrate-locations`, `create-catalogs-schemas`, and `create-uber-principal`. Additionally, the document covers table migration workflow tasks, including `migrate_dbfs_root_delta_tables` and `migrate_external_tables_sync`, along with other considerations such as running the workflow multiple times, setting higher workers for auto-scale, creating an instance pool, and manually editing job cluster configurations. The table migration workflow requires the assessment workflow and group migration workflow to be completed before running the table migration commands. The utility commands section includes the `ensure-assessment-run` command, the `repair-run` command, and other commands for UCX installation, configuration, and troubleshooting. This comprehensive documentation should assist developers and administrators in migrating tables from the Hive Metastore to the Unity Catalog in Databricks.
* Added error handling to udf crawling ([#1459](https://github.com/databrickslabs/ucx/issues/1459)). This commit addresses error handling in UDF (User Defined Function) crawling, specifically resolving flakiness in `test_permission_for_files_anonymous_func`. Changes include updates to the `apply_group_permissions` method in `manager.py`, introducing error gathering, checking for errors, and raising a `ManyError` exception if necessary. Additionally, the `test_tables_returning_error_when_show_tables` test has been modified to correctly check for a non-existent schema in the Hive Metastore, resolving inconsistencies in test behavior. The `snapshot` method in `logs.py` has been revised to handle specific error messages during testing, enhancing the reliability of UDF crawling. These changes have been manually tested and verified in a staging environment.
* Added functionality to migrate external tables using Create Table (No Sync) ([#1432](https://github.com/databrickslabs/ucx/issues/1432)). A new feature has been implemented in the open-source library to enable migrating external tables in Databricks' Hive metastore using the "Create Table (No Sync)" method. This feature introduces new methods `_migrate_non_sync_table` and `_get_create_in_place_sql` for handling migration and SQL query generation. The existing methods `_migrate_dbfs_root_table` and `_migrate_acl` have been updated to accommodate these changes. Additionally, a new test case has been added to demonstrate migration of external tables while preserving their location and properties. During migration, SQL queries are generated using the `sqlglot` library, with the SQL create statement for a given table key being obtained through the newly implemented `sql_show_create` method. The `sql_migrate_view` method has also been updated to create a view if it doesn't already exist. The implementation includes a new file in the `tests/unit/hive_metastore/tables/` directory, containing JSON data representing source and destination of migration, including catalog, database, name, object type, table format, workspace name, catalog name, schema mappings, and table mappings.
* Added initial version of account-level installer ([#1339](https://github.com/databrickslabs/ucx/issues/1339)). The commit introduces the initial version of an account-level installer that enables account administrators to install UCX (Unity Catalog eXtensions) on all workspaces in a Databricks account simultaneously. The installer performs necessary authentication to log in to the account, prompts for configuration for the first workspace, runs the installer, and then confirms if the user wants to repeat the process for the remaining workspaces. A new method `prompt_for_new_installation` saves answers to a new `InstallationConfig` data class, allowing answers to be reused for other workspaces. The command `databricks labs install ucx` now supports an account-level installation mode with the environment variable `UCX_FORCE_INSTALL` set to `account`. The changes include handling for `PermissionDenied`, `NotFound`, and `ValueError` exceptions, as well as modifications to the `sync_workspace_info` method to accept a list of workspaces. The `README.md` file has been updated with new sections on advanced force install over existing UCX and installing UCX on all workspaces within a Databricks account. The commit also modifies the `hms_lineage.py` method `apply` to include a new parameter `is_account_install`, which determines whether the HMS lineage init script should be enabled or a new global init script should be added, regardless of the user's response to prompts. Relevant user documentation and tests have been added, and the changes are manually tested. The commit additionally introduces a new method `AccountInstaller` and modifies the existing command `databricks labs install ucx ...`.
* Added integration tests with external HMS & Glue ([#1408](https://github.com/databrickslabs/ucx/issues/1408)). In this release, we have added integration tests for end-to-end workflows with an external Hive Metastore (HMS) and Apache Glue. The new test suite `test_ext_hms.py` utilizes a `sql_backend` fixture with `CommandContextBackend` to execute queries on a cluster with the external HMS set up, and requires a new environment variable `TEST_EXT_HMS_CLUSTER_ID`. Additionally, we have introduced a `make_mounted_location` fixture in `fixtures.py` for testing mounted locations in DBFS with a random suffix. The changes include updates to existing tests for migrating managed tables, tables with cache, external tables, and views, and the addition of tests for reverting migrated tables and handling table mappings. We have also added tests for migrating managed tables with ACLs and introduced a new `CommandContextBackend` class with methods for executing and fetching SQL commands, saving tables, and truncating tables. The new test suite includes manual testing, integration tests, and verification on a staging environment.
* Added linting for DBFS usage ([#1341](https://github.com/databrickslabs/ucx/issues/1341)). A new file, dbfs.py, has been added to the project, implementing a linter to detect DBFS (Databricks File System) file system paths in Python code. The linter uses an AST (Abstract Syntax Tree) visitor pattern to search for file system paths within the code, returning Deprecation or Advisory warnings for deprecated usage in calls or constant strings, respectively. This will help project maintainers and users identify and migrate away from deprecated file system paths in their Python code. The linter is also capable of detecting the usage of DBFS paths in string constants, function calls, and variable assignments, recognizing three types of DBFS path patterns and spark.read.parquet() function calls that use DBFS paths. The addition of this feature will ensure the proper usage of file systems in the code and aid in the transition from DBFS to other file systems.
* Added log task to parse logs and store the logs in the ucx database ([#1272](https://github.com/databrickslabs/ucx/issues/1272)). A new log task has been added that parses logs and stores them in the ucx database, with the ability to only store logs that exceed a minimum log level. The log crawler task has been added to all workflows after other tasks have run. A new CLI command has been added to retrieve errors and warnings from the latest workflow run. The LogRecord has been updated to include all relevant fields. The functionality is thoroughly tested with unit and integration tests. Existing workflows are modified and a new table for logs is added to the SQL database. User documentation and new methods have been added where necessary. This commit resolves issues [#1148](https://github.com/databrickslabs/ucx/issues/1148) and [#1283](https://github.com/databrickslabs/ucx/issues/1283).
* Added migration for non delta dbfs tables using Create Table As Select (CTAS). Convert such tables to Delta tables ([#1434](https://github.com/databrickslabs/ucx/issues/1434)). This release introduces enhancements to migrate non-Delta DBFS root tables to managed Delta tables, expanding support for various table types and configurations during migration. New methods have been added to improve CTAS functionality and SQL statement generation safety. Grant assignments are now supported during migration, along with updated integration tests and additional table format compatibility. The release includes code modifications to import `escape_sql_identifier`, add new methods like `_migrate_table_create_ctas` and `_get_create_in_place_sql`, and update existing methods such as `_migrate_non_sync_table`. Specific changes in the diff file include modifications to "fixtures.py", where the `table_type` variable is set to "TableType.EXTERNAL" for non-Delta tables, and the SQL statement is adjusted accordingly. Additionally, a new test has been added for migrating non-Delta DBFS root tables, ensuring migration success by checking target table properties.
* Added migration for views sequentially ([#1177](https://github.com/databrickslabs/ucx/issues/1177)). The `Migrate views sequentially` feature modifies the views migration process in the Hive metastore to provide better clarity and control. The `ViewsMigrator` class has been renamed to `ViewsMigrationSequencer` and now processes a list of `TableToMigrate` instances instead of fetching tables from `TablesCrawler`. This change introduces a new method, `_migrate_views`, to manage batches of views during migration, ensuring that preliminary tasks have succeeded before running tasks. The `migrate_table` method of `TableMigrate` now requires a mandatory `what` argument to prevent accidental view migrations, and the corresponding tests are updated accordingly. This feature does not add new documentation, CLI commands, or tables, but it modifies an existing command and workflow. Unit tests are added for the new functionality, and the target audience is software engineers who adopt this project. While this commit resolves issue [#1172](https://github.com/databrickslabs/ucx/issues/1172), integration tests are still required for comprehensive validation. Software engineers reviewing the code should focus on understanding the logic behind the renaming and the new `__hash__` and `__eq__` methods in the `TableToMigrate` class to maintain and extend the functionality in a consistent manner.
* Added missing step sync-workspace-info ([#1330](https://github.com/databrickslabs/ucx/issues/1330)). A new step, "sync-workspace-info," has been added to the table migration workflow in the CLI subgraph, prior to the `create-table-mapping` step. This step is designed to synchronize workspace information, ensuring its accuracy and currency before creating table mappings. These changes are confined to the table migration workflow and do not affect other parts of the project. The README file has been updated to reflect the new step in the Table Migration Workflow section, providing detailed information for software engineers. The addition of `sync-workspace-info` aims to streamline the migration process, enhancing the overall efficiency and reliability of the open-source library.
* Added roadmap workflows and tasks to Table Migration Workflow document ([#1274](https://github.com/databrickslabs/ucx/issues/1274)). The table migration workflow has been significantly enhanced in this release to provide additional functionality and flexibility. The `migrate-tables` workflow now includes new tasks for creating table mappings, catalogs and schemas, a principal, prefixing access, migrating credentials and locations, and creating catalog schemas. Additionally, there are new workflows for migrating views, migrating tables using CTAS, and experimentally migrating ParquetHiveSerDe, OrcSerde, AvroSerde, LazySimpleSerDe, JsonSerDe, and OpenCSVSerde tables in place. An experimental workflow for migrating Delta and Parquet data found in DBFS mounts but not registered as Hive Metastore tables into UC tables has also been introduced. Due to the complexity of the migration process, multiple runs of the workflow may be necessary to ensure successful migration of all tables. For more detailed information, please refer to the table migration design.
* Added support for %pip cells ([#1401](https://github.com/databrickslabs/ucx/issues/1401)). The recent commit introduces support for a new `PipCell` in the notebook functionality, enabling the execution of pip commands directly in the notebook environment. The `PipCell` comes with methods specific to its functionality, such as `language`, `is_runnable`, `build_dependency_graph`, and `migrate_notebook_path`. The `language` property returns the string `PIP`, and the `is_runnable` method returns `True`, indicating that this type of cell can be executed. The `build_dependency_graph` and `migrate_notebook_path` methods are currently empty but may be implemented in the future. Additionally, the `CellLanguage` enumeration has been updated to include a new item for the `PIP` language. This change also includes the addition of a new magic command `%pip install some-package`, allowing for easy installation and management of python packages within the notebook. Furthermore, the commit introduces a new tuple `PIP_NOTEBOOK_SAMPLE` in the `test_notebook.py` file for testing pip cells in the notebook, thereby enhancing the versatility of the project. Overall, this commit adds a new, useful functionality for running pip commands within a notebook context.
* Added support for %sh cells ([#1400](https://github.com/databrickslabs/ucx/issues/1400)). A new cell type, SHELL, has been introduced in this release, which is implemented in the `ShellCell` class. The `language` property of this class returns `CellLanguage.SHELL`. The `is_runnable` method has been added and returns `True`, but it is marked as `TODO`. The `build_dependency_graph` and `migrate_notebook_path` methods are no-ops. A new case for the `SHELL` CellLanguage has been added to the `CellLanguage` Enum and assigned to the `ShellCell` class. The release also includes a new sample notebook, "notebook-with-shell-cell.py.txt", with a shell script that can be executed using the `%sh` magic command. Two new tuples, `SHELL_NOTEBOOK_SAMPLE` and `PIP_NOTEBOOK_SAMPLE`, have been added to `source_code/test_notebook.py` for testing the new `%sh` cell functionality. Overall, this release adds support for the new `SHELL` cell type, but does not implement any specific behavior for it yet.
* Added support for migrating Table ACL for interactive cluster in AWS using Instance Profile ([#1285](https://github.com/databrickslabs/ucx/issues/1285)). This change adds support for migrating table Access Control Lists (ACLs) in AWS for interactive clusters utilizing Instance Profiles. The update introduces a new method, `get_iam_role_from_cluster_policy`, which replaces the previous `_get_iam_role_from_cluster_policy` method. This new method extracts the IAM role ARN from the cluster policy JSON object and returns the IAM role name. The `create_uber_principal` method has also been updated to use the new `get_iam_role_from_cluster_policy` method for determining the IAM role name in the cluster policy. Additionally, AWS and Google Cloud Platform (GCP) support has been added to the `principal_locations` method, which now checks for Azure, AWS, and GCP in that order. If GCP is not detected, a `NotImplementedError` is raised. These enhancements improve the migration process for table ACLs in AWS interactive clusters by utilizing Instance Profiles and providing unified handling for ACL migration across multiple cloud providers.
* Added support for views in `table-migration` workflow ([#1325](https://github.com/databrickslabs/ucx/issues/1325)). A new file, `migration_status.py`, has been added to track table migration status in a Hive metastore, and the `MigrationStatusRefresher` class has been updated to use a new approach for migrating views. The files `views_sequencer.py` and `test_views_sequencer.py` have been renamed to `view_migrate.py` and `test_view_migrate.py`, respectively. A new `MigrationIndex` class has been introduced in the `migration_status` module to keep track of the migration status of tables. The `ViewMigrationSequencer` class has been updated to accept a `migration_index` as an argument, which is used to determine the migration order of views. Relevant tests have been updated to reflect these changes and cover different scenarios of view migration, including views with no dependencies, direct views, views with dependencies, and deep nested views. The changes also include rewriting view code to point to the migrated tables and decoupling the queries module from `table_migrate`.
* Added workflow for in-place migrating external Parquet, Orc, Avro hiveserde tables ([#1412](https://github.com/databrickslabs/ucx/issues/1412)). This change introduces a workflow for in-place upgrading external Hive tables with Parquet, ORC, or Avro hiveserde formats. A new workflow, `MigrateHiveSerdeTablesInPlace`, has been added, which upgrades the specified hiveserde tables to Unity Catalog. The `tables.py` module includes new functions to describe the table, extract hiveserde details, and update the DDL with the new table name and mount point if necessary. A new function, `_migrate_external_table_hiveserde`, has been added to `table_migrate.py`, and the `TablesMigrator` class now includes two new arguments: `mounts` and `hiveserde_in_place_migrate`. These arguments control which hiveserde to migrate and replace the DBFS mnt table location if any. This allows for multiple tasks to run in parallel and migrate only one type of hiveserde at a time. The majority of the code from a previous pull request has been removed as only a subset of table formats can be in-place migrated to UC with DDL from `show create table`. This change includes new unit and integration tests and has been manually tested.
* Addressed Issue with Disabled Feature in certain regions ([#1275](https://github.com/databrickslabs/ucx/issues/1275)). In this release, we have implemented changes to address Issue [#1275](https://github.com/databrickslabs/ucx/issues/1275), which is related to a disabled feature in certain regions. Specifically, a new class attribute, ERRORS_TO_IGNORE with a value of ["FEATURE_DISABLED"], has been added to the PermissionManager class. The inventorize_permissions method has been updated to handle the `FEATURE_DISABLED` error by logging it and skipping it instead of raising an exception. This change improves the system's robustness by handling such cases more gracefully. Additionally, a new test method, 'test_manager_inventorize_ignore_error', has been added to demonstrate how to handle the error caused by the disabled feature in certain regions. This method introduces a new function, 'raise_error', that raises a `DatabricksError` with a specific error message and code. The `PermissionManager` object is then initialized with a mock `some_crawler` object and the `inventorize_permissions` method of the `PermissionManager` object is called, and the expected data is asserted to be written to the 'hive_metastore.test_database.permissions' table. The scope of these changes is limited to modifying the `test_manager_inventorize` method and adding the new `test_manager_inventorize_ignore_error` method to the 'tests/unit/workspace_access/test_manager.py' file.
* Addressed a bug with AWS UC Role Update. Adding unit tests ([#1429](https://github.com/databrickslabs/ucx/issues/1429)). A bug in the AWS Unity Catalog (UC) Role's trust policy update feature has been resolved by updating the `aws.py` file with a new method `_databricks_trust_statement`. This enhancement accurately updates the trust policy for UC roles, with modifications in the `create_uc_role` and `update_uc_trust_role` methods. New unit tests have been added, including the `test_update_uc_trust_role_append` function, which incorporates a mocked AWS CLI command for updating trust relationship policies and checks for updated trust relationships containing two principals and external ID matching conditions. The test function also includes a new mocked response for the `iam get-role` command, returning the role details with an updated ARN to verify if the trust relationship policy is updated correctly. This improvement simplifies the trust policy document generation and enhances the overall functionality of the feature.
* Allow reinstall when retry the failed table migration integration test ([#1224](https://github.com/databrickslabs/ucx/issues/1224)). The latest update introduces the capability to reinstall a table migration job in the event of a failed integration test, effectively addressing issue [#1224](https://github.com/databrickslabs/ucx/issues/1224). Previously, if the table migration job failed during an integration test, the test could not be retried due to the installation being marked as failed. Now, when executing the test_table_migration_job and test_table_migration_job_cluster_override functions, users will be prompted with "Do you want to update the existing installation?" and given the option to select `yes` to proceed with the reinstallation. This functionality is implemented by adding the `extend_prompts` parameter to the new_installation function call in both functions, with the value being a dictionary containing the new prompt. This addition allows the test to retry and the installation to be marked as successful if the table migration job is successful.
* Build dependency graph for local files ([#1462](https://github.com/databrickslabs/ucx/issues/1462)). This commit introduces a local file dependency graph builder, refactoring dependency classes to differentiate between resolution and loading. A new method, LocalFileMigrator.build_dependency_graph, is implemented, following the pattern of NotebookMigrator, for building dependency graphs of local files. The DependencyResolver class and its methods have been refactored for clarity. The Whitelist class is used to parse a compatibility catalog file, and the DependencyResolver's get_advices method returns recommendations for updating to compatible module versions. Test functions compare expected and actual advice objects for correct recommendations. No changes to user documentation, CLI commands, workflows, or tables are made in this commit. Unit tests have been added to ensure that the changes work as expected.
* Build dependency graph for site packages ([#1504](https://github.com/databrickslabs/ucx/issues/1504)). This commit introduces a dependency graph for site packages, adding package files as dependencies if they're not recognized during import, and addressing an infinite loop issue in cyclical graphs. The changes involve introducing new classes `WrappingLoader`, `SitePackageContainer`, and `SitePackage`, as well as updating the `DependencyResolver` class to use the new `SitePackages` object for locating dependencies. Additionally, this commit resolves an issue where the `locate_dependency` method would incorrectly match graph paths starting with './', and includes new unit tests and the removal of a deprecation warning for a specific dependency. The target audience for this commit is software engineers adopting the project.
* Build notebook dependency graph for `%run` cells ([#1279](https://github.com/databrickslabs/ucx/issues/1279)). A new Notebook class is implemented to parse source code and split it into cells, and a NotebookDependencyGraph class is added with related utilities to discover dependencies in `%run` cells, addressing issue [#1201](https://github.com/databrickslabs/ucx/issues/1201). This new functionality allows for the creation of a dependency graph, aiding in better code organization and understanding dependencies in the code. The Notebook class defines a `parse` method to process source code and return a Notebook object, and a `to_migrated_code` method to apply necessary modifications for running code in different environments. The NotebookDependencyGraph class offers a `build_dependency_graph` method to construct a directed acyclic graph (DAG) of dependencies between cells. The commit also includes renaming the test file for the notebook migrator and updating the Notebooks class to NotebookMigrator in the test functions.
* Bump actions/checkout from 3 to 4 ([#1191](https://github.com/databrickslabs/ucx/issues/1191)). In this release, the "actions/checkout" dependency version has been updated from 3 to 4 in the project's acceptance workflow, addressing issue [#1191](https://github.com/databrickslabs/ucx/issues/1191). The new "actions/checkout@v4" improves the reliability and performance of the code checkout process, with better handling of shallow clones and submodules. This is achieved by setting the fetch-depth to 0 for a full clone of the repository, ensuring all bug fixes and improvements of the latest version are utilized. The update provides enhanced submodule handling and improved performance, resulting in a more stable and efficient checkout process for the project's CI/CD pipeline.
* Bump actions/setup-python from 4 to 5 ([#1189](https://github.com/databrickslabs/ucx/issues/1189)). In this release, the version of the `actions/setup-python` library has been updated from 4 to 5 in various workflow files, including `.github/workflows/acceptance.yml`, `.github/workflows/push.yml`, and `.github/workflows/release.yml`. This update ensures the usage of the latest available version of the Python environment setup action, which may include bug fixes, performance improvements, and new features. The `setup-python@v5` action is configured with appropriate cache and installation parameters. As a software engineer, it is important to review this change to ensure compatibility with the project's specific requirements and configurations related to `actions/setup-python`. Communicating these improvements to colleagues and maintaining up-to-date dependencies can help ensure the reliability and performance of the project.
* Bump codecov/codecov-action from 1 to 4 ([#1190](https://github.com/databrickslabs/ucx/issues/1190)). In this update, we have improved the project's CI/CD workflow by upgrading the `codecov-action` version from 1 to 4. This change primarily affects the `Publish test coverage` job, where we have replaced `codecov/codecov-action@v1` with `codecov/codecov-action@v4`. By implementing this update, the most recent features and bug fixes from `codecov-action` will be utilized in the project's testing and coverage reporting. Moreover, this update may introduce modifications to the `codecov-action` configuration options and input parameters, requiring users to review the updated documentation to ensure their usage remains correct. The anticipated benefit of this change is enhanced accuracy and up-to-date test coverage reporting for the project.
* Bump databricks-sdk from 0.23.0 to 0.24.0 ([#1223](https://github.com/databrickslabs/ucx/issues/1223)). In this release, the dependency for the `databricks-sdk` package has been updated from version 0.23.0 to 0.24.0, which may include bug fixes, performance improvements, or new features. Additionally, specific versions have been fixed for `databricks-labs-lsql` and "databricks-labs-blueprint", and the PyYAML package version has been constrained to a range between 6.0.0 and 7.0.0. This update enhances the reliability and compatibility of the project with other libraries and packages. However, the "jobs.py" file's `crawl` function now uses the `RunType` type instead of `ListRunsRunType` when calling the `list_runs` method, which could affect the job's behavior. Therefore, further investigation is required to ensure that the updated functionality aligns with the expected behavior.
* Bump softprops/action-gh-release from 1 to 2 ([#1188](https://github.com/databrickslabs/ucx/issues/1188)). The softprops/action-gh-release package has been updated from version 1 to version 2 in this release, enhancing the reliability and efficiency of release automation. The update specifically affects the "release.yml" file in the ".github/workflows" directory, where the action-gh-release is called. While there are no specific details about the changes included in this version, it is expected to contain bug fixes, performance improvements, and possibly new features. By updating to the latest version, software engineers can ensure the smooth operation of their release processes, taking advantage of the enhanced functionality and improved performance.
* Bumped databricks-sdk from 0.24.0 to 0.26.0 ([#1388](https://github.com/databrickslabs/ucx/issues/1388)). In this release, the databricks-sdk version has been updated from 0.24.0 to 0.26.0 to resolve a breaking change where the `AzureManagedIdentity` struct has been split into `AzureManagedIdentityResponse` and `AzureManagedIdentityRequest`. This change enhances the library's compatibility with Azure services. The `tests/unit/aws/test_credentials.py` file has been updated to replace `AzureManagedIdentity` instances with `AzureManagedIdentityResponse`. The `AzureManagedIdentityResponse` struct represents the response from the Databricks SDK when requesting information about an Azure managed identity, while the `AzureManagedIdentityRequest` struct represents the request sent to the Databricks SDK when requesting the creation or modification of an Azure managed identity. These changes improve the codebase's modularity and maintainability, allowing for clearer separation of concerns and more flexibility in handling managed identities in Databricks. The updated `AzureManagedIdentityResponse` and `AzureManagedIdentityRequest` structs have been manually tested, but they have not been verified on a staging environment. The functionality of the code remains mostly the same, except for the split `AzureManagedIdentity` struct. The revised dependencies list includes "databricks-sdk~=0.26.0", "databricks-labs-lsql~=0.4.0", "databricks-labs-blueprint~=0.4.3", "PyYAML>=6.0.0,<7.0.0", and "sqlglot>=23.9,<23.12".
* Cleaned up integration test suite ([#1422](https://github.com/databrickslabs/ucx/issues/1422)). In this release, the integration test suite has been improved through the removal of the outdated `test_mount_listing` test and the fixing of `test_runtime_crawl_permissions`. The fixed test now accurately checks permissions for runtime crawlers and verifies that tables are dropped before being created. The API client and permissions of the WorkspaceClient are now properly handled, with the `do` and `permissions.get` methods returning an empty value. These changes ensure that the integration tests are up-to-date, accurate, functional, and have been manually tested. Issue [#1129](https://github.com/databrickslabs/ucx/issues/1129) has been resolved as a result of these improvements.
* Create UC External Location, Schema, and Table Grants based on workspace-wide Azure SPN mount points ([#1374](https://github.com/databrickslabs/ucx/issues/1374)). This PR introduces new functionality to create UC external location, schema, and table grants based on workspace-wide Azure SPN mount points. The `get_interactive_cluster_grants` function in the `grants.py` file has been updated to include new grants for principals and catalog grants for the `hive_metastore` catalog. The `_get_privilege` function has also been updated to accept `locations` and `mounts` inputs. Additionally, new test methods `test_migrate_external_tables_with_principal_acl_azure` and `test_migrate_external_tables_with_spn_azure` have been added to test migrating managed and external tables with principal and SPN ACLs in Azure. Existing test methods have been modified to support a new user and UCX group access. These changes improve the management of UC resources in a Databricks workspace and are tested through manual testing and the addition of unit tests. However, there is no mention of integration tests or verification on staging environments in this PR.
* Decouple `InstallState` from `WorkflowsDeployment` constructor ([#1246](https://github.com/databrickslabs/ucx/issues/1246)). In pull request [#1209](https://github.com/databrickslabs/ucx/issues/1209), the `InstallState` class was decoupled from the `WorkflowsDeployment` constructor in the `install.py` file. This refactoring allows for increased modularity and maintainability by representing the state of an installation with the `InstallState` class, which includes information such as status and configuration. The `InstallState` class is created from the `Installation` object using the `from_installation` class method in the `run` and `current` methods, and is then passed as an argument to the `WorkflowsDeployment` constructor. This change also affects several methods, such as `create_jobs`, `repair_run`, `latest_job_status`, and others, which have been updated to use `InstallState` instead of `Installation`. In the `test_installation.py` file, the `WorkflowsDeployment` constructor has been updated to accept an `InstallState` object as a separate argument. This refactoring improves the code's decoupling, readability, and flexibility, allowing for more customization and configuration of `InstallState` independently of the `installation` object.
* Decouple `InstallState` from `WorkspaceDeployment` constructor. In this refactoring change, the `InstallState` object has been decoupled from the `WorkspaceDeployment` constructor, improving modularity and maintainability. The `InstallState` object is now initialized separately and passed as an argument to the `WorkspaceInstallation` and `WorkflowsDeployment` classes. The `state` property has been removed from the `WorkspaceDeployment` class, and the `run_workflow` and `validate_step` methods now access the `_install_state` object directly. The `_create_dashboards`, `_trigger_workflow`, and `_remove_jobs` methods have been updated to use `self._install_state` instead of `self._workflows_installer.state`. This change does not impact functionality but enhances the flexibility of managing the installation state and enables easier testing and modification of the `InstallState` object separately from the `WorkspaceDeployment` class.
* Delete src/databricks/labs/ucx/source_code/dbfsqueries.py ([#1396](https://github.com/databrickslabs/ucx/issues/1396)). In this release, we have removed the DBFS querying functionality previously implemented in the `dbfsqueries.py` module located at `src/databricks/labs/ucx/source_code/`. This change is intended to streamline the project and improve maintainability. Any existing code that relies on this functionality will no longer work and should be updated accordingly. The DBFS querying code was removed because it is either no longer needed or will be reimplemented differently in the future; no replacement API ships in this release.
* Detect DBFS use in SQL statements in notebooks ([#1372](https://github.com/databrickslabs/ucx/issues/1372)). A new linter, 'notebook-linter', has been implemented to detect and raise advisories for DBFS (Databricks File System) usage in SQL statements within notebooks. This feature helps in identifying and migrating away from DBFS references. The linter is a class that parses a Databricks notebook and applies available linters to the code cells based on the language of the cell. It specifically checks for DBFS usage in SQL statements and raises advisories accordingly. New unit tests have been added to ensure the functionality of the linter and manual testing has been conducted. This feature resolves issue [#1108](https://github.com/databrickslabs/ucx/issues/1108) and promotes best practices for file system usage within notebooks.
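
  As a rough illustration of the approach (not the linter's actual classes), such a check can be built on `sqlglot`, which this project already uses for SQL parsing:

  ```python
  import sqlglot
  from sqlglot import expressions as exp

  DBFS_PREFIXES = ("dbfs:/", "/dbfs/", "/mnt/")

  def find_dbfs_references(sql: str) -> list[str]:
      """Return table-like references in a SQL statement that point at DBFS."""
      statement = sqlglot.parse_one(sql, read="databricks")
      return [
          table.name
          for table in statement.find_all(exp.Table)
          if table.name.startswith(DBFS_PREFIXES)
      ]

  print(find_dbfs_references("SELECT * FROM parquet.`dbfs:/mnt/foo/bar`"))
  # ['dbfs:/mnt/foo/bar']
  ```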
* Detect `sys.path` manipulation ([#1380](https://github.com/databrickslabs/ucx/issues/1380)). A new feature has been added to our open-source library that enables the detection of `sys.path` manipulation in Python code. This functionality is implemented through updates to the PythonLinter class, which now includes methods to parse the abstract syntax tree (AST) and identify modifications to `sys.path`. Additionally, the linter can now list imported sources and appended `sys.path` entries with the new `list_import_sources` and `list_appended_sys_paths` methods, respectively. The new functionality is covered by several test cases and is accompanied by updates to the documentation, CLI command, and existing workflows. Unit tests have been added to ensure the proper functioning of the new feature, which resolves issue [#1379](https://github.com/databrickslabs/ucx/issues/1379) and is linked to issue [#1202](https://github.com/databrickslabs/ucx/issues/1202).
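
  A condensed sketch of the AST-based detection (the real `PythonLinter` also handles aliases and `os.path.abspath`, so treat this as the core idea only):

  ```python
  import ast

  class SysPathVisitor(ast.NodeVisitor):
      """Collect string literals appended to sys.path."""

      def __init__(self) -> None:
          self.appended_paths: list[str] = []

      def visit_Call(self, node: ast.Call) -> None:
          func = node.func
          # Match sys.path.append("...") and sys.path.insert(0, "...")
          if (isinstance(func, ast.Attribute)
                  and func.attr in ("append", "insert")
                  and isinstance(func.value, ast.Attribute)
                  and func.value.attr == "path"
                  and isinstance(func.value.value, ast.Name)
                  and func.value.value.id == "sys"):
              for arg in node.args:
                  if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                      self.appended_paths.append(arg.value)
          self.generic_visit(node)

  visitor = SysPathVisitor()
  visitor.visit(ast.parse("import sys\nsys.path.append('/Workspace/Shared/libs')"))
  print(visitor.appended_paths)  # ['/Workspace/Shared/libs']
  ```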
* Detect direct access to cloud storage and raise a deprecation warning ([#1506](https://github.com/databrickslabs/ucx/issues/1506)). In this release, the PySpark linter has been updated to detect and issue deprecation warnings for direct access to cloud storage. This new check complements the existing functionality of detecting table names that need migration. The changes include the addition of a new `AstHelper` class to extract fully-qualified function names from PySpark AST nodes and a `DirectFilesystemAccessMatcher` class to match calls to functions that perform direct filesystem access. The `TableNameMatcher` class has been updated to check if the table name is a constant string and raise a deprecation warning if the table has been migrated in the Unity Catalog. These updates aim to encourage the use of more secure and recommended methods for accessing cloud storage in PySpark code. This feature resolves issue [#1133](https://github.com/databrickslabs/ucx/issues/1133) and is signed off by Jim Idle.
* Detect imported files and packages ([#1362](https://github.com/databrickslabs/ucx/issues/1362)). This commit introduces new functionality to parse Python code for `import` and `import from` processing instructions, allowing for the detection of imported files and packages. This resolves issue [#1346](https://github.com/databrickslabs/ucx/issues/1346) and is related to issue [#1202](https://github.com/databrickslabs/ucx/issues/1202). The changes include a new method for loading files, modifications to an existing method for loading objects, and new conditions for error checking. A new `WorkspaceFile` object is created for each imported file, and the `_load_source` method has been updated to validate object info. The functionality is covered by unit tests. User documentation and a new CLI command have been added. Additionally, a new workflow and table have been added, and an existing workflow and table have been modified. Co-authored by Cor <jczuurmond@protonmail.com>.
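
  At its core, import detection is a walk over the Python AST; a minimal sketch:

  ```python
  import ast

  def list_import_sources(source: str) -> list[str]:
      """Return the module names referenced by import statements."""
      modules = []
      for node in ast.walk(ast.parse(source)):
          if isinstance(node, ast.Import):
              modules.extend(alias.name for alias in node.names)
          elif isinstance(node, ast.ImportFrom) and node.module:
              modules.append(node.module)
      return modules

  print(list_import_sources("import utils\nfrom lib.helpers import clean"))
  # ['utils', 'lib.helpers']
  ```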
* Document troubleshooting guide ([#1226](https://github.com/databrickslabs/ucx/issues/1226)). A new troubleshooting guide has been added to the UCX toolkit documentation, providing comprehensive guidance on identifying and resolving common errors. The guide includes instructions for gathering and interpreting logs from both Databricks workspace and UCX command line, as well as resources for further assistance, such as the UCX GitHub repository, Databricks community, Databricks support, and Databricks partners. It also covers specific error scenarios, including cryptic authentication errors and issues with UCX installation, with detailed steps for troubleshooting and resolution. The guide can be found in the docs/troubleshooting.md file and is linked from the main README.md, which has undergone minor revisions to installation and migration processes, as well as the removal of the previous link for questions and bug fixes in favor of the new troubleshooting guide.
* Don't fail `main` branch build with `no-cheat` ([#1461](https://github.com/databrickslabs/ucx/issues/1461)). A new GitHub Actions workflow called `no-cheat` has been developed to maintain code quality and consistency in pull requests. This workflow checks out the code with full history and verifies that the code added in a pull request does not disable any linter checks (for example, via `pylint: disable` directives). If any such directives are found, the workflow fails the build and prints the number of instances found. This check is especially useful for projects that value code quality and consistency, as it helps to maintain a uniform code style throughout the codebase. Additionally, the push.yml file in the .github/workflows directory has been adjusted so that the check guards new code without failing builds on the `main` branch itself, reinforcing the project's commitment to maintaining high code standards.
* Enforced removal of commented-out code on `make fmt` ([#1493](https://github.com/databrickslabs/ucx/issues/1493)). In this release, the `make fmt` process has been updated to enforce the removal of dead commented-out code through the use of `pylint`. The `pyproject.toml` file has been modified to utilize the new version of `databricks-labs-pylint` (0.3.0) and incorporate the `databricks.labs.pylint.eradicate` plugin for identifying and removing dead code. The `databricks.labs.ucx.hive_metastore.locations` module, specifically the `locations.py` and `lsp.py` files, has undergone changes to eliminate dead code and update commented-out lines. The `do_POST` and `do_GET` methods in the `lsp.py` file have been affected. No functional changes have been introduced; the modifications focus on removing dead code and improving code quality. This update enhances code readability and maintainability, promotes consistency, and eliminates unnecessary code accumulation throughout the project. A test case in the "test_generic.py" file verifies the removal of dead code, further ensuring the codebase's integrity and reliability. This release offers a cleaner, more efficient, and consistent codebase for software engineers to adopt and work with.
* Enhanced migrate views task to support views created with explicit column list ([#1375](https://github.com/databrickslabs/ucx/issues/1375)). The `migrate views` task has been enhanced to support views that are created with an explicit column list, addressing issue [#1375](https://github.com/databrickslabs/ucx/issues/1375). A lookup based on `SHOW CREATE TABLE` has been added to extract the column list from the create script, allowing for better handling of views with defined column lists. The commit also introduces a new dependency, "sqlglot~=23.9.0", and updates the PyYAML dependency. Test functions and methods have been updated in the test file 'test_table_migrate.py' to ensure that views with explicit column lists are migrated correctly, including a new test function 'test_migrate_view_with_columns'. This improvement helps make the migrate views task more robust and capable of handling a wider variety of views.
* Ensure that USE statements are recognized and apply to table references without a qualifying schema in SQL and pyspark ([#1433](https://github.com/databrickslabs/ucx/issues/1433)). This change introduces a new class, CurrentSessionState, in the databricks.labs.ucx.source_code.base module to manage the current session's database name during table initialization in SQL and PySpark. The purpose of this enhancement is to ensure proper recognition and application of USE statements to table references without a qualifying schema, addressing schema ambiguity and aligning with Spark documentation. The linter class has been updated to track session schema, and SQL parsing methods have been modified to accurately interpret the read argument as the dialect for parsing. Improved handling of table references and schema alignment in SQL and PySpark contexts enhances code robustness and user-friendliness.
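
  A simplified sketch of the session-state idea, using `sqlglot` (the real `CurrentSessionState` carries more context than just the schema):

  ```python
  import sqlglot
  from sqlglot import expressions as exp

  def qualified_tables(sql: str, default_schema: str = "default") -> list[str]:
      """Resolve table references, honouring USE statements along the way."""
      current_schema = default_schema
      resolved = []
      for statement in sqlglot.parse(sql, read="databricks"):
          if isinstance(statement, exp.Use):
              current_schema = statement.this.name
              continue
          for table in statement.find_all(exp.Table):
              schema = table.db or current_schema
              resolved.append(f"{schema}.{table.name}")
      return resolved

  sql = "USE sales; SELECT * FROM orders; SELECT * FROM finance.invoices"
  print(qualified_tables(sql))  # ['sales.orders', 'finance.invoices']
  ```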
* Expand documentation for end to end workflows with external HMS ([#1458](https://github.com/databrickslabs/ucx/issues/1458)). The updated UCX library now supports integration with an external Hive Metastore, providing users with the flexibility to choose between the default workspace metastore or an external one. Upon detecting an external metastore in cluster policies and Spark configurations, UCX will prompt the user to connect to it, creating a new policy with the chosen external metastore configuration. This change does not affect SQL Warehouse data access configurations, and users must ensure both job clusters and SQL Warehouses are configured for the same external Hive metastore. When setting up UCX with an external metastore, the assessment workflow scans tables and views, and the table migration workflow upgrades them accordingly. The inventory database is stored in the external Hive metastore and can only be queried with the correct configuration. When using multiple external Hive metastores, users can choose between having multiple UCX installations or manually modifying the cluster policy and SQL data access configuration to point to the correct external Hive metastore.
* Extend service principal migration with option to create access connectors with managed identity for each storage account ([#1417](https://github.com/databrickslabs/ucx/issues/1417)). This commit extends the service principal migration feature by adding the capability to create access connectors with managed identities for each storage account. A new CLI command and updated existing command are included, as well as new methods for creating, listing, getting, and deleting access connectors. The `AccessConnector` class is added to represent an access connector with properties such as id, name, location, and tags. The necessary permissions for these new access connectors will be set in a later PR. The changes also include updates to user documentation and new unit and integration tests. This feature will allow users to migrate their service principals to UC storage credentials and create Databricks Access Connectors for their storage accounts, all with the convenience of managed identities, improving security and management.
* Extended wait time for group checking during tests ([#1464](https://github.com/databrickslabs/ucx/issues/1464)). In this release, a modification has been implemented to address eventual consistency issues in group APIs that can cause failed tests. The update extends the wait time for group checking during tests to up to 2 minutes, specifically affecting the retried decorator in the `tests/integration/workspace_access/test_groups.py` file. The `timeout` parameter in the `wait` function's retry decorator has been adjusted from 60 seconds to 120 seconds, enhancing the reliability of tests interacting with group APIs. This adjustment ensures reliable group verification, even in the presence of delays or inconsistencies in group APIs, thereby improving the stability and robustness of the system.
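
  In the integration tests, the pattern looks roughly like the following, using the `retried` helper from the Databricks SDK (simplified from the actual test code):

  ```python
  from datetime import timedelta

  from databricks.sdk.errors import NotFound
  from databricks.sdk.retries import retried

  # Group APIs are eventually consistent, so retry the lookup for up to
  # 2 minutes before concluding that a freshly created group is missing.
  @retried(on=[NotFound], timeout=timedelta(minutes=2))
  def wait_for_group(ws, group_id: str):
      return ws.groups.get(group_id)
  ```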
* Fix: `test_delete_ws_groups_should_delete_renamed_and_reflected_groups_only` and `test_running_real_remove_backup_groups_job` ([#1476](https://github.com/databrickslabs/ucx/issues/1476)). This release includes a fix for issues [#1473](https://github.com/databrickslabs/ucx/issues/1473) and [#1472](https://github.com/databrickslabs/ucx/issues/1472) in the tests for deleting workspace groups. In `test_groups.py`, the `test_delete_ws_groups_should_delete_renamed_and_reflected_groups_only` and `test_running_real_remove_backup_groups_job` tests have been updated to improve the reliability of checking if a group is deleted. Previously, the tests would fail if a group was not found immediately after deletion; now they pass once a `NotFound` exception is raised after a few retries, and fail only if the group can still be found after a couple of minutes. The logic for handling `NotFound` errors has been updated to use a new `get_group` function that raises a `KeyError` when the group is not found, which is then caught and expected to fail with a `NotFound` error. This ensures that group deletion is verified reliably and that `NotFound` errors are handled correctly.
* Fixed UCX policy creation when instance pool is specified ([#1457](https://github.com/databrick…
@nfx nfx mentioned this issue Apr 26, 2024
nfx added a commit that referenced this issue Apr 26, 2024
* A notebook linter to detect DBFS references within notebook cells ([#1393](https://github.com/databrickslabs/ucx/issues/1393)). A new linter has been implemented in the open-source library to identify references to Databricks File System (DBFS) mount points or folders within SQL and Python cells of Notebooks, raising Advisory or Deprecated alerts when detected. This feature, resolving issue [#1108](https://github.com/databrickslabs/ucx/issues/1108), enhances code maintainability by discouraging DBFS usage and improves security by avoiding hard-coded DBFS paths. The linter's functionality includes parsing the code and searching for Table elements within statements, raising warnings when DBFS references are found. Implementation changes include updates to the `NotebookLinter` class, a new `from_source` class method, and an `original_offset` argument in the `Cell` class. The linter now also supports the `databricks` dialect for SQL code parsing.
* Added CLI commands to trigger table migration workflow ([#1511](https://github.com/databrickslabs/ucx/issues/1511)). A new `migrate_tables` command has been added to the 'databricks.labs.ucx.cli' module, which triggers the `migrate-tables` workflow and, optionally, the `migrate-external-hiveserde-tables-in-place-experimental` workflow. The `migrate-tables` workflow is responsible for managing table migrations, while the `migrate-external-hiveserde-tables-in-place-experimental` workflow handles migrations for external hiveserde tables. The new `What` class from the 'databricks.labs.ucx.hive_metastore.tables' module is used to identify hiveserde tables. If hiveserde tables are detected, the user is prompted to confirm running the `migrate-external-hiveserde-tables-in-place-experimental` workflow. The `migrate_tables` command requires a WorkspaceClient and Prompts objects and accepts an optional WorkspaceContext object, which is set to the WorkspaceContext of the WorkspaceClient if not provided. Additionally, a new `migrate_external_hiveserde_tables_in_place` command has been added which will run the `migrate-external-hiveserde-tables-in-place-experimental` workflow if it finds any hiveserde tables, making it easier to manage table migrations from the command line.
* Added CSV, JSON and include path in mounts ([#1329](https://github.com/databrickslabs/ucx/issues/1329)). In this release, the TablesInMounts function has been enhanced to support CSV and JSON file formats, along with the existing Parquet and Delta table formats. The new `include_paths_in_mount` parameter has been introduced, enabling users to specify a list of paths to crawl within all mounts. The WorkspaceConfig class in the config.py file has been updated to accommodate these changes. Additionally, a new `_assess_path` method has been introduced to assess the format of a given file and return a `TableInMount` object accordingly. Several existing methods, such as `_find_delta_log_folders`, `_is_parquet`, `_is_csv`, `_is_json`, and `_path_is_delta`, have been updated to reflect these improvements. Furthermore, two new unit tests, `test_mount_include_paths` and `test_mount_listing_csv_json`, have been added to ensure the proper functioning of the TablesInMounts function with the new file formats and the `include_paths_in_mount` parameter. These changes aim to improve the functionality and flexibility of the TablesInMounts library, allowing for more precise crawling and identification of tables based on specific file formats and paths.
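
  A simplified sketch of the kind of path assessment involved (the real `TablesInMounts` logic is more thorough and also inspects file content):

  ```python
  from pathlib import Path

  def assess_format(path: Path) -> str | None:
      """Guess the table format of a directory by its contents."""
      if (path / "_delta_log").exists():
          return "DELTA"
      suffixes = {child.suffix.lower() for child in path.iterdir() if child.is_file()}
      if ".parquet" in suffixes:
          return "PARQUET"
      if ".csv" in suffixes:
          return "CSV"
      if ".json" in suffixes:
          return "JSON"
      return None
  ```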
* Added CTAS migration workflow for external tables that cannot be migrated in place ([#1510](https://github.com/databrickslabs/ucx/issues/1510)). In this release, we have added a new CTAS (Create Table As Select) migration workflow for external tables that cannot be migrated in place. This feature includes a `MigrateExternalTablesCTAS` class with three tasks to migrate non-SYNC supported and non-HiveSerde external tables, migrate HiveSerde tables, and migrate views from the Hive Metastore to the Unity Catalog. We have also added new methods for managed and external table migration, deprecated old methods, and added a new test function to ensure proper CTAS migration for external tables using HiveSerDe. This change also introduces a new JSON file for external table configurations and a mock backend to simulate the Hive Metastore and test the migration process. Overall, these changes improve the migration capabilities for external tables and ensure a more flexible and reliable migration process.
* Added Python linter for table creation with implicit format ([#1435](https://github.com/databrickslabs/ucx/issues/1435)). A new linter has been added to the Python library to advise on implicit table formats when the 'writeTo', 'table', 'insertInto', or `saveAsTable` methods are invoked without an explicit format specified in the same chain of calls. This feature is useful for software engineers working with Databricks Runtime (DBR) v8.0 and later, where the default table format changed from `parquet` to `delta`. The linter, implemented in 'table_creation.py', utilizes reusable AST utilities from 'python_ast_util.py' and is not automated, providing advice instead of fixing the code. The linter skips linting when a DBR version of 8.0 or higher is passed, as the default format change only applies to versions prior to 8.0. Unit tests have been added for both files as part of the code migration workflow.
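
  A rough sketch of the check (names are illustrative; the real linter covers more call shapes):

  ```python
  import ast

  WRITE_METHODS = {"writeTo", "table", "insertInto", "saveAsTable"}

  def has_implicit_format(source: str) -> bool:
      """True if a write call chain never specifies an explicit format."""
      for node in ast.walk(ast.parse(source)):
          if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
              continue
          if node.func.attr not in WRITE_METHODS:
              continue
          # Walk back down the call chain looking for a .format(...) call.
          chain, explicit = node.func.value, False
          while isinstance(chain, (ast.Call, ast.Attribute)):
              if isinstance(chain, ast.Call):
                  chain = chain.func
                  continue
              if chain.attr == "format":
                  explicit = True
              chain = chain.value
          if not explicit:
              return True
      return False

  print(has_implicit_format("df.write.saveAsTable('t')"))                  # True
  print(has_implicit_format("df.write.format('delta').saveAsTable('t')"))  # False
  ```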
* Added Support for Migrating Table ACL of Interactive clusters using SPN ([#1077](https://github.com/databrickslabs/ucx/issues/1077)). This change introduces support for migrating table Access Control Lists (ACLs) of interactive clusters using a Service Principal Name (SPN) for Azure Databricks environments in the UCX project. It includes modifications to the `hive_metastore` and `workspace_access` modules, as well as the addition of new classes, methods, and import statements for handling ACLs and grants. This feature enables more secure and granular control over table permissions when using SPN authentication for interactive clusters in Azure. This will benefit software engineers working with interactive clusters in Azure Databricks by enhancing security and providing more control over data access.
* Added Support for migrating Schema/Catalog ACL for Interactive cluster ([#1413](https://github.com/databrickslabs/ucx/issues/1413)). This commit adds support for migrating schema and catalog ACLs for interactive clusters, specifically for AWS and Azure, with partial fixes for issues [#1192](https://github.com/databrickslabs/ucx/issues/1192) and [#1193](https://github.com/databrickslabs/ucx/issues/1193). The changes identify and filter database ACL grants, create mappings from Hive metastore schema to Unity Catalog schema and catalog, and replace Hive metastore actions with equivalent Unity Catalog actions for both schema and catalog. External location permission is not included in this commit and will be addressed separately. New methods for creating mappings, updating principal ACLs, and getting catalog schema grants have been added, and existing functionalities have been modified to handle both AWS and Azure. The code has undergone manual testing and passed unit and integration tests. The changes are targeted towards software engineers who adopt the project.
* Added `databricks labs ucx logs` command ([#1350](https://github.com/databrickslabs/ucx/issues/1350)). A new command, 'databricks labs ucx logs', has been added to the open-source library to enhance logging and debugging capabilities. This command allows developers and administrators to view logs from the latest job run or specify a particular workflow name to display its logs. By default, logs with levels of INFO, WARNING, and ERROR are shown, but the --debug flag can be used for more detailed DEBUG logs. This feature utilizes the relay_logs method from the deployed_workflows object in the WorkspaceContext class and addresses issue [#1282](https://github.com/databrickslabs/ucx/issues/1282). The addition of this command aims to improve the usability and maintainability of the framework, making it easier for users to diagnose and resolve issues.
* Added check for DBFS mounts in SQL code ([#1351](https://github.com/databrickslabs/ucx/issues/1351)). A new feature has been introduced to check for Databricks File System (DBFS) mounts within SQL code, enhancing data management and accessibility in the Databricks environment. The `dbfsqueries.py` file in the `databricks/labs/ucx/source_code` directory now includes a function that verifies the presence of DBFS mounts in SQL queries and returns appropriate messages. The `Languages` class in the `__init__` method has been updated to incorporate a new class, `FromDbfsFolder`, which replaces the existing `from_table` linter with a new linter, `DBFSUsageLinter`, for handling DBFS usage in SQL code. In addition, the DBFS usage linter has been extended with new methods to check for deprecated DBFS mounts in SQL code, returning deprecation warnings as needed. These enhancements ensure more robust handling of DBFS mounts throughout the system, allowing for better integration and management of DBFS-related issues in SQL-based operations.
* Added check for circular view dependency ([#1502](https://github.com/databrickslabs/ucx/issues/1502)). A circular view dependency check has been implemented to prevent issues caused by circular dependencies in views. This includes a new test for chained circular dependencies (A->B, B->C, C->A) and an update to the existing circular view dependency test. The checks have been implemented through modifications to the tests in `test_views_sequencer.py`, including a new test method and an update to the existing test method. If any circular dependencies are encountered during migration, a ValueError with an error message will be raised. These changes include updates to the `tables_and_views.json` file, with the addition of a new view `v12` that depends on `v11`, creating a circular dependency. The changes have been tested through the addition of unit tests and are expected to function as intended. The existing `_next_batch` method has been updated, and two new methods, `_check_circular_dependency` and `_get_view_instance`, have been introduced.
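
  The underlying check is classic cycle detection over the view dependency graph; a minimal sketch:

  ```python
  def check_circular_dependency(views: dict[str, set[str]]) -> None:
      """Raise ValueError if the view dependency graph contains a cycle."""

      def visit(view: str, stack: list[str]) -> None:
          if view in stack:
              cycle = " -> ".join(stack[stack.index(view):] + [view])
              raise ValueError(f"Circular view dependency detected: {cycle}")
          for dependency in views.get(view, ()):
              visit(dependency, stack + [view])

      for view in views:
          visit(view, [])

  # A -> B -> C -> A is rejected:
  try:
      check_circular_dependency({"A": {"B"}, "B": {"C"}, "C": {"A"}})
  except ValueError as e:
      print(e)
  ```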
* Added commands for metastores listing & assignment ([#1489](https://github.com/databrickslabs/ucx/issues/1489)). This commit introduces new commands for handling metastores in the Databricks Labs Unity Catalog (UCX) tool, which enables more efficient management of metastores. The `databricks labs ucx assign-metastore` command automatically assigns a metastore to a specified workspace when possible, while the `databricks labs ucx show-all-metastores` command displays all possible metastores that can be assigned to a workspace. These changes include new methods for handling metastores in the account and workspace classes, as well as new user documentation, manual testing, and unit tests. The new functionality is added to improve the usability and efficiency of the UCX tool in handling metastores. Additional information on the UCX metastore commands is provided in the README.md file.
* Added functionality to migrate external tables using Create Table (No Sync) ([#1432](https://github.com/databrickslabs/ucx/issues/1432)). A new feature has been implemented for migrating external tables in Databricks' Hive metastore using the "Create Table (No Sync)" method. This feature includes the addition of two new methods, `_migrate_non_sync_table` and `_get_create_in_place_sql`, for handling migration and SQL query generation. The existing methods `_migrate_dbfs_root_table` and `_migrate_acl` have also been updated. A test case has been added to demonstrate migration of external tables while preserving their location and properties. This new functionality provides more flexibility in managing migrations for specific use cases. The SQL parsing library sqlglot has been utilized to replace the current table name with the updated catalog and change the CREATE statement to CREATE IF NOT EXISTS. This increases the efficiency and security of migrating external tables in Databricks' Hive metastore.
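
  With `sqlglot`, the kind of DDL rewrite described above looks roughly like this (the catalog name and DDL below are made up for illustration):

  ```python
  import sqlglot
  from sqlglot import expressions as exp

  ddl = "CREATE TABLE hive_metastore.sales.orders (id INT) LOCATION 's3://bucket/orders'"
  statement = sqlglot.parse_one(ddl, read="databricks")
  # Point the statement at the upgraded catalog...
  for table in statement.find_all(exp.Table):
      table.set("catalog", exp.to_identifier("main"))
  # ...and make the migration idempotent with IF NOT EXISTS.
  statement.set("exists", True)
  print(statement.sql(dialect="databricks"))
  # CREATE TABLE IF NOT EXISTS main.sales.orders (id INT) LOCATION 's3://bucket/orders'
  ```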
* Added initial version of account-level installer ([#1339](https://github.com/databrickslabs/ucx/issues/1339)). A new account-level installer has been added to the UCX library, allowing account administrators to install UCX on all workspaces within an account in a single operation. The installer authenticates to the account, prompts the user for configuration of the first workspace, and then runs the installation and offers to repeat the process for all remaining workspaces. This is achieved through the creation of a new `prompt_for_new_installation` method which saves user responses to a new `InstallationConfig` data class, allowing for reuse in other workspaces. The existing `databricks labs install ucx` command now supports account-level installation when the `UCX_FORCE_INSTALL` environment variable is set to 'account'. The changes have been manually tested and include updates to documentation and error handling for `PermissionDenied`, `NotFound`, and `ValueError` exceptions. Additionally, a new `AccountInstaller` class has been added to manage the installation process at the account level.
* Added linting for DBFS usage ([#1341](https://github.com/databrickslabs/ucx/issues/1341)). A new linter, "DBFSUsageLinter", has been added to our open-source library to check for deprecated file system paths in Python code, specifically for Databricks File System (DBFS) usage. Implemented as part of the "databricks.labs.ucx.source_code" package in the "languages.py" file, this linter defines a visitor, "DetectDbfsVisitor", that detects file system paths in the code and checks them against a list of known deprecated paths. If a match is found, it creates a Deprecation or Advisory object with information about the deprecated code, including the line number and column offset, and adds it to a list. This feature will assist in identifying and removing deprecated file system paths from the codebase, ensuring consistent and proper use of DBFS within the project.
* Added log task to parse logs and store the logs in the ucx database ([#1272](https://github.com/databrickslabs/ucx/issues/1272)). A new log task has been added to parse logs and store them in the ucx database, added as a log crawler task to all workflows after other tasks have completed. The LogRecord has been updated to include all necessary fields, and logs below a certain minimum level will no longer be stored. A new CLI command to retrieve errors and warnings from the latest workflow run has been added, while existing commands and workflows have been modified. User documentation has been updated, and new methods have been added for log parsing and storage. A new table called `logs` has been added to the database, and unit and integration tests have been added to ensure functionality. This change also resolves issues [#1148](https://github.com/databrickslabs/ucx/issues/1148) and [#1283](https://github.com/databrickslabs/ucx/issues/1283), with modifications to existing classes such as RuntimeContext, TaskRunWarningRecorder, and LogRecord, and the addition of new classes and methods including HiveMetastoreLineageEnabler and LogRecord in the logs.py file. The deploy_schema function has been updated to include the new table, and the existing command `databricks labs ucx` has been modified to accommodate the new log functionality. Existing workflows have been updated and a new workflow has been added, all of which are tested through unit tests, integration tests, and manual testing. The `TaskLogger` class and `TaskRunWarningRecorder` class are used to log and record task run data, with the `parse_logs` method used to parse log files into partial log records, which are then used to create snapshot rows in the `logs` table.
* Added migration for non delta dbfs tables using Create Table As Select (CTAS). Convert such tables to Delta tables ([#1434](https://github.com/databrickslabs/ucx/issues/1434)). In this release, we've developed new methods to migrate non-Delta DBFS root tables to managed Delta tables, enhancing compatibility with various table formats and configurations. We've added support for safer SQL statement generation in our Create Table As Select (CTAS) functionality and incorporated new creation methods. Additionally, we've introduced grant assignments during the migration process and updated integration tests. The changes include the addition of a `TablesMigrator` class with an updated `migrate_tables` method, a new `PrincipalACL` parameter, and the `test_dbfs_non_delta_tables_should_produce_proper_queries` function to test the migration of non-Delta DBFS tables to managed Delta tables. These improvements promote safer CTAS functionality and expanded compatibility for non-Delta DBFS root tables.
* Added support for `%pip` cells ([#1401](https://github.com/databrickslabs/ucx/issues/1401)). A new cell type, `%pip`, has been introduced to the notebook interface, allowing for the execution of pip commands within the notebook. The new class, `PipCell`, has been added with several methods, including `is_runnable`, `build_dependency_graph`, and `migrate_notebook_path`, enabling the notebook interface to recognize and handle pip cells differently from other cell types. This allows for the installation of Python packages directly within a notebook setting, enhancing the notebook environment and providing users with the ability to dynamically install necessary packages as they work. The new sample notebook file demonstrates the installation of a package using the `%pip install` command. The implementation includes modifying the notebook runtime to recognize and execute `%pip` cells, and installing packages in a manner consistent with standard pip installation processes. Additionally, a new tuple, `PIP_NOTEBOOK_SAMPLE`, has been added to the existing test notebook sample tuple list, enabling testing the handling of `%pip` cells during notebook splitting.
* Added support for %sh cells ([#1400](https://github.com/databrickslabs/ucx/issues/1400)). A new `SHELL` CellLanguage has been implemented to support %sh cells, enabling the execution of shell commands directly within the notebook interface. This enhancement, addressing issue [#1400](https://github.com/databrickslabs/ucx/issues/1400) and linked to [#1399](https://github.com/databrickslabs/ucx/issues/1399) and [#1202](https://github.com/databrickslabs/ucx/issues/1202), streamlines the process of running shell scripts in the notebook, eliminating the need for external tools. The new SHELL_NOTEBOOK_SAMPLE tuple, part of the updated test suite, demonstrates the feature's functionality with a shell cell, while the new methods manage the underlying mechanics of executing these shell commands. These changes not only extend the platform's capabilities by providing built-in support for shell commands but also improve productivity and ease-of-use for teams relying on shell commands as part of their data processing and analysis pipelines.
* Added support for migrating Table ACL for interactive cluster in AWS using Instance Profile ([#1285](https://github.com/databrickslabs/ucx/issues/1285)). This change adds support for migrating table access control lists (ACLs) for interactive clusters in AWS using an Instance Profile. A new method `get_iam_role_from_cluster_policy` has been introduced in the `AwsACL` class, which replaces the static method `_get_iam_role_from_cluster_policy`. The `create_uber_principal` method now uses this new method to obtain the IAM role name from the cluster policy. Additionally, the project now includes AWS Role Action and AWS Resource Permissions to handle permissions for migrating table ACLs for interactive clusters in AWS. New methods and classes have been added to support AWS-specific functionality and handle AWS instance profile information. Two new tests have been added to tests/unit/test_cli.py to test various scenarios for interactive clusters with and without ACL in AWS. A new argument `is_gcp` has been added to WorkspaceContext to differentiate between Google Cloud Platform and other cloud providers.
* Added support for views in `table-migration` workflow ([#1325](https://github.com/databrickslabs/ucx/issues/1325)). A new `MigrationStatus` class has been added to track the migration status of tables and views in a Hive metastore, and a `MigrationIndex` class has been added to check if a table or view has been migrated or not. The `MigrationStatusRefresher` class has been updated to use a new approach for migrating tables and views, and is now responsible for refreshing the migration status of tables and indexing it using the `MigrationIndex` class. A `ViewsMigrationSequencer` class has also been introduced to sequence the migration of views based on dependencies. These changes improve the migration process for tables and views in the `table-migration` workflow.
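
  Conceptually, the two classes fit together like this (a stripped-down sketch; the real record holds more fields):

  ```python
  from dataclasses import dataclass

  @dataclass
  class MigrationStatus:
      src_schema: str
      src_table: str
      dst_catalog: str | None = None
      dst_schema: str | None = None
      dst_table: str | None = None

  class MigrationIndex:
      def __init__(self, statuses: list[MigrationStatus]):
          self._index = {(s.src_schema, s.src_table): s for s in statuses}

      def is_migrated(self, schema: str, table: str) -> bool:
          status = self._index.get((schema, table))
          return status is not None and status.dst_table is not None

  index = MigrationIndex([MigrationStatus("sales", "orders", "main", "sales", "orders")])
  print(index.is_migrated("sales", "orders"))   # True
  print(index.is_migrated("sales", "returns"))  # False
  ```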
* Added workflow for in-place migrating external Parquet, Orc, Avro hiveserde tables ([#1412](https://github.com/databrickslabs/ucx/issues/1412)). This change introduces a new workflow, `MigrateHiveSerdeTablesInPlace`, for in-place upgrading external Parquet, Orc, and Avro hiveserde tables to the Unity Catalog. The workflow includes new functions to describe the table and extract hiveserde details, update the DDL from `show create table`, and replace the old table name with the migration target and DBFS mount table location if any. A new function `_migrate_external_table_hiveserde` has been added to `table_migrate.py`, and two new arguments, `mounts` and `hiveserde_in_place_migrate`, have been added to the `TablesMigrator` class. These arguments control which hiveserde to migrate and replace the DBFS mnt table location if any, enabling multiple tasks to run in parallel and migrate only one type of hiveserde at a time. This feature does not include user documentation, new CLI commands, or changes to existing commands, but it does add a new workflow and modify the existing `migrate_tables` function in `table_migrate.py`. The changes have been manually tested, but no unit tests, integration tests, or staging environment verification have been provided.
* Build dependency graph for local files ([#1462](https://github.com/databrickslabs/ucx/issues/1462)). This commit refactors dependency classes to distinguish between resolution and loading, and introduces new classes to handle different types of dependencies. A new method, `LocalFileMigrator.build_dependency_graph`, is implemented, following the pattern of `NotebookMigrator`, to build a dependency graph for local files. This resolves issue [#1202](https://github.com/databrickslabs/ucx/issues/1202) and addresses issue [#1360](https://github.com/databrickslabs/ucx/issues/1360). While the refactoring and implementation of new methods improve the accuracy of dependency graphs and ensure that dependencies are correctly registered based on the file's language, there are no user-facing changes, such as new or modified CLI commands, tables, or workflows. Unit tests are added to ensure that the new changes function as expected.
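
  A toy version of the graph construction, building on plain AST import parsing (the real resolver also handles packages, notebooks, and language detection):

  ```python
  import ast
  from pathlib import Path

  def build_dependency_graph(path: Path, graph: dict[Path, set[Path]] | None = None) -> dict[Path, set[Path]]:
      """Recursively register imports that resolve to sibling .py files."""
      if graph is None:
          graph = {}
      if path in graph:  # already visited; also breaks import cycles
          return graph
      graph[path] = set()
      for node in ast.walk(ast.parse(path.read_text())):
          names = []
          if isinstance(node, ast.Import):
              names = [alias.name for alias in node.names]
          elif isinstance(node, ast.ImportFrom) and node.module:
              names = [node.module]
          for name in names:
              candidate = path.parent / f"{name.split('.')[0]}.py"
              if candidate.exists():  # only local files become graph edges
                  graph[path].add(candidate)
                  build_dependency_graph(candidate, graph)
      return graph
  ```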
* Build dependency graph for site packages ([#1504](https://github.com/databrickslabs/ucx/issues/1504)). This commit introduces changes to the dependency graph building process for site packages within the ucx project. When a package is not recognized, package files are added as dependencies to prevent errors during import dependency determination, thereby fixing an infinite loop issue when encountering cyclical graphs. This resolves issue [#1427](https://github.com/databrickslabs/ucx/issues/1427) and is related to [#1202](https://github.com/databrickslabs/ucx/issues/1202). The changes include adding new methods for handling package files as dependencies and preventing infinite loops when visiting cyclical graphs. The `SitePackage` class in the `site_packages.py` file has been updated to handle package files more accurately, with the `__init__` method now accepting `module_paths` as a list of Path objects instead of a list of strings. A new method, `module_paths`, has also been introduced. Unit tests have been added to ensure the correct functionality of these changes, and a hack in the PR will be removed once issue [#1421](https://github.com/databrickslabs/ucx/issues/1421) is implemented.
* Build notebook dependency graph for `%run` cells ([#1279](https://github.com/databrickslabs/ucx/issues/1279)). A new `Notebook` class has been developed to parse source code and split it into cells, and a `NotebookDependencyGraph` class with related utilities has been added to discover dependencies in `%run` cells, addressing issue [#1201](https://github.com/databrickslabs/ucx/issues/1201). The new functionality enhances the management and tracking of dependencies within notebooks, improving code organization and efficiency. The commit includes updates to existing notebooks to utilize the new classes and methods, with no impact on existing functionality outside of the `%run` context.
* Create UC External Location, Schema, and Table Grants based on workspace-wide Azure SPN mount points ([#1374](https://github.com/databrickslabs/ucx/issues/1374)). This change adds new functionality to create Unity Catalog (UC) external location, schema, and table grants based on workspace-wide Azure Service Principal Names (SPN) mount points. The majority of the work was completed in a previous pull request. The main change in this pull request is the addition of a new test function, `test_migrate_external_tables_with_principal_acl_azure`, which tests the migration of tables with principal ACLs in an Azure environment. This function includes the creation of a new user with cluster access, another user without cluster access, and a new group with cluster access to validate the migration of table grants to these entities. The `make_cluster_permissions` method now accepts a `service_principal_name` parameter, and after migrating the tables with the `acl_strategy` set to `PRINCIPAL`, the function checks if the appropriate grants have been assigned to the Azure SPN. This change is part of an effort to improve the integration of Unity Catalog with Azure SPNs and is accessible through the UCX CLI command. The changes have been tested through manual testing, unit tests, and integration tests and have been verified in a staging environment.
* Detect DBFS use in SQL statements in notebooks ([#1372](https://github.com/databrickslabs/ucx/issues/1372)). A new linter has been added to detect and discourage the use of DBFS (Databricks File System) in SQL statements within notebooks. This linter raises deprecated advisories for any identified DBFS folder or mount point references in SQL statements, encouraging the use of alternative storage options. The change is implemented in the `NotebookLinter` class of the 'notebook_linter.py' file, and is tested through unit tests to ensure proper functionality. The target audience for this update includes software engineers who use Databricks or similar platforms, as the new linter will help users transition away from using DBFS in their SQL statements and adopt alternative storage methods.
* Detect `sys.path` manipulation ([#1380](https://github.com/databrickslabs/ucx/issues/1380)). A change has been introduced to the Python linter to detect manipulation of `sys.path`. New classes, AbsolutePath and RelativePath, have been added as subclasses of SysPath. The SysPathVisitor class has been implemented to track additions to sys.path and the visit_Call method in SysPathVisitor checks for 'sys.path.append' and 'os.path.abspath' calls. The new functionality includes a new method, collect_appended_sys_paths in PythonLinter, and a static method, list_appended_sys_paths, to retrieve the appended paths. Additionally, new tests have been added to the PythonLinter to detect manipulation of the `sys.path` variable, specifically the `list_appended_sys_paths` method. The new test cases include using aliases for `sys`, `os`, and `os.path`, and using both absolute and relative paths. This improvement will enhance the linter's ability to detect potential issues related to manipulation of the `sys.path` variable. The change resolves issue [#1379](https://github.com/databrickslabs/ucx/issues/1379) and is linked to issue [#1202](https://github.com/databrickslabs/ucx/issues/1202). No user documentation or CLI commands have been added or modified, and no manual testing has been performed. Unit tests for the new functionality have been added.
* Detect direct access to cloud storage and raise a deprecation warning ([#1506](https://github.com/databrickslabs/ucx/issues/1506)). In this release, the Pyspark linter has been enhanced to detect and issue deprecation warnings for direct access to cloud storage. This change, which resolves issue [#1133](https://github.com/databrickslabs/ucx/issues/1133), introduces new classes `AstHelper` and `TableNameMatcher` to determine the fully-qualified name of functions and replace instances of direct cloud storage access with migration index table names. Instances of direct access using 'dbfs:/', 'dbfs://', and default 'dbfs:' references will now be detected and flagged with a deprecation warning. The test file `test_pyspark.py` has been updated to include new tests for detecting direct cloud storage access. Users should be aware of these changes when updating their code to avoid deprecation warnings.
* Detect imported files and packages ([#1362](https://github.com/databrickslabs/ucx/issues/1362)). This commit introduces functionality to parse Python code for `import` and `import from` processing instructions, enabling the detection and management of imported files and packages. It includes a new CLI command, modifications to existing commands, new and updated workflows, and additional tables. The code modifications include new methods for visiting Import and ImportFrom nodes, and the addition of unit tests to ensure correctness. Relevant user documentation has been added, and the new functionality has been tested through manual testing, unit tests, and verification on a staging environment. This comprehensive update enhances dependency management, code organization, and understanding for a more streamlined user experience.
* Enhanced migrate views task to support views created with explicit column list ([#1375](https://github.com/databrickslabs/ucx/issues/1375)). The commit enhances the migrate views task to better support handling of views with an explicit column list, improving overall compatibility. A new lookup based on `SHOW CREATE TABLE` has been added to extract the column list from the create script, ensuring accurate migration. The `_migrate_view_table` method has been refactored, and a new `_sql_migrate_view` method is added to fetch the create statement of the view. The `ViewToMigrate` class has been updated with a new `_view_dependencies` method to determine view dependencies in the new SQL text. Additionally, new methods `safe_sql_key` and `add_table` have been introduced, and the `sqlglot.parse` method is used to parse the code with `databricks` as the read argument. A new test for migrating views with an explicit column list has been added, along with the `upgraded_from` and `upgraded_to` table properties, and the migration status is updated to reflect successful migration. New test functions have also been added to test the migration of views with columns and ACLs. Dependency sqlglot has been updated to version ~=23.9.0, enhancing the overall functionality and compatibility of the migrate views task.
* Ensure that USE statements are recognized and apply to table references without a qualifying schema in SQL and pyspark ([#1433](https://github.com/databrickslabs/ucx/issues/1433)). This commit enhances the library's functionality in handling `USE` statements in both SQL and PySpark by ensuring they are recognized and applied to table references without a qualifying schema. A new `CurrentSessionState` class is introduced to manage the current schema of a session, and existing classes such as `FromTable` and `TableNameMatcher` are updated to use this new class. Additionally, the `lint` and `apply` methods have been updated to handle `USE` statements and improve the precision of table reference handling. These changes are particularly useful when working with tables in different schemas, ensuring the library can manage table references more accurately in SQL and PySpark. A new fixture, 'extended_test_index', has been added to support unit tests, and the test file 'test_notebook.py' has been updated to better reflect the intended schema for each table reference.
* Expand documentation for end to end workflows with external HMS ([#1458](https://github.com/databrickslabs/ucx/issues/1458)). The UCX toolkit has been updated to support integration with an external Hive Metastore (HMS), in addition to the default workspace HMS. This feature allows users to easily set up UCX to work with an existing external HMS, providing greater flexibility in managing and accessing data. During installation, UCX will scan for evidence of an external HMS in the cluster policies and Spark configurations. If found, UCX will prompt the user to connect to the external HMS, create a new policy with the necessary Spark and data access configurations, and set up job clusters accordingly. However, users will need to manually update the data access configuration for SQL Warehouses that are not configured for external HMS. Users can also create a cluster policy with appropriate Spark configurations and data access for external HMS, or edit existing policies in specified UCX workflows. Once set up, the assessment workflow will scan tables and views from the external HMS, and the table migration workflow will upgrade tables and views from the external HMS to the Unity Catalog. Users should note that if the external HMS is shared between multiple workspaces, a different inventory database name should be specified for each UCX installation. It is important to plan carefully when setting up a workspace with multiple external HMS, as the assessment dashboard will fail if the SQL warehouse is not configured correctly. Users can have multiple UCX installations in a workspace, each set up with a different external HMS, or manually modify the cluster policy and SQL data access configuration to point to the correct external HMS after UCX has been installed.
* Extend service principal migration with option to create access connectors with managed identity for each storage account ([#1417](https://github.com/databrickslabs/ucx/issues/1417)). This commit extends the service principal migration feature to create access connectors with managed identities for each storage account, enhancing security and isolation by preventing cross-account access. A new CLI command has been added, and an existing command has been modified. The `create_access_connectors_for_storage_accounts` method creates access connectors with the required permissions for each storage account used in external tables. The `_apply_storage_permission` method has also been updated. New unit and integration tests have been included, covering various scenarios such as secret value decoding, secret read exceptions, and single storage account testing. The necessary permissions for these connectors will be set in a subsequent pull request. Additionally, a new method, `azure_resources_list_access_connectors`, and `azure_resources_get_access_connector` have been introduced to ensure access connectors are returned as expected. This change has been tested manually and through automated tests, ensuring backward compatibility while providing improved security features.
* Fixed UCX policy creation when instance pool is specified ([#1457](https://github.com/databrickslabs/ucx/issues/1457)). In this release, we have made significant improvements to the handling of instance pools in UCX policy creation. The `policy.py` file has been updated to properly handle the case when an instance pool is specified, by setting the `instance_pool_id` attribute and removing the `node_type_id` attribute in the policy definition. Additionally, the availability attribute has been removed for all cloud providers, including AWS, Azure, and GCP, when an instance pool ID is provided. A new `pop` method call has also been added to remove the `gcp_attributes.availability` attribute when an instance pool ID is provided. These changes ensure consistency in the policy definition across all cloud providers. Furthermore, tests for this functionality have been updated in the 'test_policy.py' file, specifically the `test_cluster_policy_instance_pool` function, to check the correct addition of the instance pool to the cluster policy. The purpose of these changes is to improve the reliability and functionality of UCX policy creation, specifically when an instance pool is specified.
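
  The adjustment amounts to swapping fixed attributes in the policy definition; a sketch following the Databricks cluster-policy JSON format (the helper name is hypothetical):

  ```python
  def definition_with_instance_pool(policy: dict, instance_pool_id: str) -> dict:
      """Pin the policy to an instance pool instead of a node type."""
      policy = dict(policy)  # don't mutate the caller's definition
      policy["instance_pool_id"] = {"type": "fixed", "value": instance_pool_id}
      # The pool already fixes the node type and availability, so those
      # attributes must not appear in the policy definition.
      for key in ("node_type_id", "aws_attributes.availability",
                  "azure_attributes.availability", "gcp_attributes.availability"):
          policy.pop(key, None)
      return policy

  base = {"node_type_id": {"type": "fixed", "value": "i3.xlarge"},
          "aws_attributes.availability": {"type": "fixed", "value": "ON_DEMAND"}}
  print(definition_with_instance_pool(base, "pool-123"))
  ```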
* Fixed `migrate-credentials` command on aws ([#1501](https://github.com/databrickslabs/ucx/issues/1501)). In this release, the `migrate-credentials` command for the `labs.yml` configuration file has been updated to include new flags for specifying a subscription ID and AWS profile. This allows users to scan a specific storage account and authenticate using a particular AWS profile when migrating credentials for storage access to UC storage credentials. The `create-account-groups` command remains unchanged. Additionally, several issues related to the `migrate-credentials` command for AWS have been addressed, such as hallucinating the presence of a `--profile` flag, using a monotonically increasing role ID, and not handling cases where there are no IAM roles to migrate. The `run` method of the `AwsUcStorageCredentials` class has been updated to handle these cases, and several test functions have been added or updated to ensure proper functionality. These changes improve the functionality and robustness of the `migrate-credentials` command for AWS.
* Fixed edge case for `RegexSubStrategy` ([#1561](https://github.com/databrickslabs/ucx/issues/1561)). In this release, we have implemented fixes for the `RegexSubStrategy` class within the `GroupMigrationStrategy`, addressing an issue where matching account groups could not be found using the display name. The `generate_migrated_groups` function has been updated to include a check for account groups with matching external IDs when either the display name or regex substitution of the display name fails to yield a match. Additionally, we have expanded testing for the `GroupManager` class, which handles group management. This includes new tests using regular expressions to match groups, and ensuring that the `GroupManager` class can correctly identify and manage groups based on different criteria such as the group's ID, display name, or external ID. These changes improve the robustness of the `GroupMigrationStrategy` and ensure the proper functioning of the `GroupManager` class when using regular expression substitution and matching.
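
  A minimal sketch of regex-based matching (the real strategy additionally falls back to matching on external IDs, which is omitted here):

  ```python
  import re

  def match_account_group(ws_group: str, account_groups: set[str],
                          pattern: str, replacement: str) -> str | None:
      """Map a workspace group to an account group via regex substitution,
      falling back to the raw display name when the substitution misses."""
      candidate = re.sub(pattern, replacement, ws_group)
      if candidate in account_groups:
          return candidate
      return ws_group if ws_group in account_groups else None

  # Strip a workspace-specific prefix such as "ws_" from the group name:
  print(match_account_group("ws_data-engineers", {"data-engineers"}, r"^ws_", ""))
  # data-engineers
  ```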
* Fixed table in mount partition scans for JSON and CSV ([#1437](https://github.com/databrickslabs/ucx/issues/1437)). This release introduces a fix for an issue where table scans on partitioned CSV and JSON files were not being correctly identified. The `TablesInMounts` scan function has been updated to accurately detect these files, addressing the problem reported in issue [#1389](https://github.com/databrickslabs/ucx/issues/1389) and linked issue [#1437](https://github.com/databrickslabs/ucx/issues/1437). To ensure functionality, new private methods `_find_partition_file_format` and `_assess_path` have been introduced, with the latter updated to handle partitioned directories. Additionally, unit tests have been added to test partitioned CSVs and JSONs, simulating the file system's response to various calls. These changes provide enhanced detection and handling of partitioned CSVs and JSONs in the `TablesInMounts` scan function.
* Forward remote logs on `run_workflow` and removed `destroy-schema` workflow in favour of `databricks labs uninstall ucx` ([#1349](https://github.com/databrickslabs/ucx/issues/1349)). In this release, the `destroy-schema` workflow has been removed and replaced with the `databricks labs uninstall ucx` command, addressing issue [#1186](https://github.com/databrickslabs/ucx/issues/1186). The `run_workflow` function has been updated to forward remote logs, and the `run_task` function now accepts a new argument `sql_backend`. The `Task` class includes a new method `is_testing()` and has been updated to use `RuntimeBackend` before `SqlBackend` in the `databricks.labs.lsql.backends` module. The `TaskLogger` class has been modified to include a new argument `attempt` and a new class method `log_path()`. The `verify_metastore` method in the `verification.py` file has been updated to handle `PermissionDenied` exceptions more gracefully. The `destroySchema` class and its `destroy_schema` method have been removed. The `workflow_task.py` file has been updated to include a new argument `attempt` in the `task_run_warning_recorder` method. These changes aim to improve the system's efficiency, error handling, and functionality.
* Give all access connectors `Storage Blob Data Contributor` role ([#1425](https://github.com/databrickslabs/ucx/issues/1425)). A new change has been introduced to grant the `Storage Blob Data Contributor` role, which provides the highest level of data access, to all access connectors for each storage account in the system. This adjustment, part of issue [#142](https://github.com/databrickslabs/ucx/issues/142)
* Grant uber principal write permissions so that SYNC command will succeed ([#1505](https://github.com/databrickslabs/ucx/issues/1505)). A change has been implemented to modify the `databricks labs ucx create-uber-principal` command, granting the uber principal write permissions on Azure Blob Storage. This aligns with the existing implementation on AWS where the uber principal has write access to all S3 buckets. The modification includes the addition of a new role, "STORAGE_BLOB_DATA_CONTRIBUTOR", to the `_ROLES` dictionary in the `resources.py` file. A new method, `clean_up_spn`, has also been added to clear ucx uber service principals. This change resolves issue [#939](https://github.com/databrickslabs/ucx/issues/939) and ensures consistent behavior with AWS, enabling the uber principal to have write permissions on all Azure blob containers and ensuring the success of the `SYNC` command. The changes have been manually tested but not yet verified on a staging environment.
* Handled new output format of `SHOW TBLPROPERTIES` command ([#1381](https://github.com/databrickslabs/ucx/issues/1381)). A recent commit has been made to address an issue with the `test_revert_migrated_table` test failing due to the new output format of the `SHOW TBLPROPERTIES` command in the open-source library. Previously, the output was blank if a table property was missing, but now it shows a message indicating that the table does not have the specified property. The commit updates the `is_migrated` method in the `migration_status.py` file to handle this new output format, where the method now uses the `fetch` method to retrieve the `upgraded_to` property for a given schema and table. If the property is missing, the method will continue to the next table. The commit also updates tests for the changes, including a manual test that has not been verified on a staging environment. Changes have been made in the `test_table_migrate.py` file, where rows with table properties have been updated to return new data, and the `timestamp` function now sets the `datetime.datetime` to a `FakeDate`. No new methods have been added, and existing functionality related to `SHOW TBLPROPERTIES` command output handling has been changed in scope.
* Ignore whitelisted imports ([#1367](https://github.com/databrickslabs/ucx/issues/1367)). This commit introduces a new class `DependencyResolver` that filters Python import dependencies based on a whitelist, and updates to the `DependencyGraph` class to support this new resolver. A new optional parameter `resolver` has been added to the `NotebookMigrator` class constructor and the `DependencyGraph` constructor. A new file `whitelist.py` has been added, introducing classes and functions for defining and managing a whitelist of Python packages based on their name and version. These changes aim to improve control over which dependencies are included in the dependency graph, contributing to a more modular and maintainable codebase.
* Increased memory for ucx clusters ([#1366](https://github.com/databrickslabs/ucx/issues/1366)). This release introduces an update to enhance memory configuration for UCX clusters, addressing issue [#1366](https://github.com/databrickslabs/ucx/issues/1366). The main change involves a new method for selecting a node type with a minimum of 16GB of memory and local disk enabled, implemented in the policy.py file of the installer module. This modification results in the `node_type_id` parameter for creating clusters, instance pools, and pipelines now requiring a minimum memory of 16 GB. This change is reflected in the fixtures.py file, `ws.clusters.select_node_type()`, `ws.instance_pools.create()`, and `pipelines.PipelineCluster` method calls, ensuring that any newly created clusters, instance pools, and pipelines benefit from the increased memory allocation. This update aims to improve user experience by offering higher memory configurations out-of-the-box for UCX-related workloads.
* Integrate detection of notebook dependencies ([#1338](https://github.com/databrickslabs/ucx/issues/1338)). In this release, the NotebookMigrator has been updated to integrate dependency graph construction for detecting notebook dependencies, addressing issues 1204, 1286, and 1326. The changes include modifying the NotebookMigrator class to include the dependency graph and updating relevant tests. A new file, python_linter.py, has been added for linting Python code, which now detects calls to "dbutils.notebook.run" with dynamic paths. The linter uses the ast module to parse the code and locate nodes matching the specified criteria. The NotebookMigrator's apply method has been updated to check for ObjectType.NOTEBOOK, loading the notebook using the new _load_notebook method, and incorporating a new _apply method for modifying the code in the notebook based on applicable fixes. A new DependencyGraph class has been introduced to build a graph of dependencies within the notebook, and several new methods have been added, including _load_object, _load_notebook_from_path, and revert. This release is co-authored by Cor and aims to improve dependency management in the notebook system.
* Isolate grants computation when migrating tables ([#1233](https://github.com/databrickslabs/ucx/issues/1233)). In this release, we have implemented a change to improve the reliability of table migrations. Previously, grants to migrate were computed and snapshotted outside the loop that iterates through tables to migrate, which could lead to inconsistencies if the grants or migrated groups changed during migration. Now, grants are re-computed for each table, reducing the chance of such issues. We have introduced a new method `_compute_grants` that takes in the table to migrate, ACL strategy, and snapshots of all grants to migrate, migrated groups, and principal grants. If `acl_strategy` is `None`, it defaults to an empty list. The method checks each strategy in the ACL strategy list, extending the `grants` list if the strategy is `AclMigrationWhat.LEGACY_TACL` or `AclMigrationWhat.PRINCIPAL`. The `migrate_tables` method has been updated to use this new method to compute grants. It first checks if `acl_strategy` is `None`, and if so, sets it to an empty list. It then calls `_compute_grants` with the current table, `acl_strategy`, and the snapshots of all grants to migrate, migrated groups, and principal grants. The computed grants are then used to migrate the table. This change enhances the robustness of the migration process by isolating grants computation for each table.
* Log more often from workflows ([#1348](https://github.com/databrickslabs/ucx/issues/1348)). In this update, the log formatting for the debug log file in the "tasks.py" file of the "databricks/labs/ucx/framework" module has been modified. The `TimedRotatingFileHandler` function has been adjusted to rotate the log file every minute, increasing the frequency of log file rotation from every 10 minutes. Furthermore, the logging format has been enhanced to include the time, level name, name, thread name, and message. These improvements are in response to issue [#1171](https://github.com/databrickslabs/ucx/issues/1171) and the implementation of more frequent logging as per issue [#1348](https://github.com/databrickslabs/ucx/issues/1348), ensuring more detailed and up-to-date logs for debugging and analysis purposes.
* Make `databricks labs ucx assign-metastore` prompt for workspace if no workspace id provided ([#1500](https://github.com/databrickslabs/ucx/issues/1500)). The `databricks labs ucx assign-metastore` command has been updated to allow for an optional `workspace_id` parameter, with a prompt for the workspace ID displayed if it is not provided. Both the `assign-metastore` and `show-all-metastores` commands have been made account-level only. The functionality of the `migrate_local_code` function remains unchanged. Error handling for etag issues related to default catalog settings has been implemented. Unit tests and manual testing have been conducted on a staging environment to verify the changes. The `show_all_metastores` and `assign_metastore` commands have been updated to accept an optional `workspace_id` parameter. The unit tests cover various scenarios, including cases where a user has multiple metastores and needs to select one, as well as cases where a default catalog name is provided and needs to be selected. If no metastore is found, a `ValueError` will be raised. The `metastore_id` and `workspace_id` flags in the yml file have been renamed to `metastore-id` and `workspace-id`, respectively, and a new `default-catalog` flag has been added.
* Modified update existing role to amend the AssumeRole statement rather than rewriting it ([#1423](https://github.com/databrickslabs/ucx/issues/1423)). The `_aws_role_trust_doc` method of the `aws.py` file has been updated to return a dictionary object instead of a JSON string for the AWS IAM role trust policy document. This change allows for more fine-grained control when updating the trust relationships of an existing role in AWS IAM. The `create_uc_role` method has been updated to pass the role trust document to the `_create_role` method using the `_get_json_for_cli` method. The `update_uc_trust_role` method has been refactored to retrieve the existing role's trust policy document, modify its `Statement` field, and replace it with the returned value of the `_aws_role_trust_doc` method with the specified `external_id`. Additionally, the `test_update_uc_trust_role` function in the `test_aws.py` file has been updated to provide more detailed and realistic mocked responses for the `command_call` function, including handling the case where the `iam update-assume-role-policy` command is called and returning a mocked response with a modified assume role policy document that includes a new principal with an external ID condition. These changes improve the testing capabilities of the `test_update_uc_trust_role` function and provide more comprehensive testing of the assume role statement and role update functionality.
* Modifies dependency resolution logic to detect deprecated use of s3fs package ([#1395](https://github.com/databrickslabs/ucx/issues/1395)). In this release, the dependency resolution logic has been enhanced to detect and handle deprecated usage of the s3fs package. A new function, `_download_side_effect`, has been implemented to mock the download behavior of the `workspace_client_mock` function, allowing for more precise control during testing. The `DependencyResolver` class now includes a list of `Advice` objects to inform developers about the use of deprecated dependencies, without modifying the `DependencyGraph` class. This change also introduces a new import statement for the s3fs package, encouraging the adoption of up-to-date packages and practices for improved system compatibility and maintainability. Additionally, a unit test file, test_s3fs.py, has been added with test cases for various import scenarios of s3fs to ensure proper detection and issuance of deprecation warnings.
* Prompt for warehouse choice in uninstall if the original chosen warehouse does not exist anymore ([#1484](https://github.com/databrickslabs/ucx/issues/1484)). In this release, we have added a new method `_check_and_fix_if_warehouse_does_not_exists()` to the `WorkspaceInstaller` class, which checks if the specified warehouse in the configuration still exists. If it doesn't, the method generates a new configuration using a new `WorkspaceInstaller` object, saves it, and updates the `_sql_backend` attribute with the new warehouse ID. This change ensures that if the original chosen warehouse no longer exists, the user will be prompted to choose a new one during uninstallation. Additionally, we have added a new import statement for `ResourceDoesNotExist` exception and introduced a new function `test_uninstallation_after_warehouse_is_deleted`, which simulates a scenario where a warehouse has been manually deleted and checks if the uninstallation process correctly resets the warehouse. The `StatementExecutionBackend` object is initialized with a non-existent warehouse ID, and the configuration and sql_backend objects are updated accordingly. This test case ensures that the uninstallation process handles the scenario where a warehouse has been manually deleted.
* Propagate source location information within the import package dependency graph ([#1431](https://github.com/databrickslabs/ucx/issues/1431)). This change modifies the dependency graph build logic within several modules of the `databricks.labs.ucx` package to propagate source location information within the import package dependency graph. A new `ImportDependency` class now represents import sources, and a `list_import_sources` method returns a list of `ImportDependency` objects, which include import string and original source code file path. A new `IncompatiblePackage` class is added to the `Whitelist` class, returning `UCCompatibility.NONE` when checking for compatibility. The `ImportChecker` class checks for deprecated imports and returns `Advice` or `Deprecation` objects with location information. Unit tests have been added to ensure the correct behavior of these changes. Additionally, the `Location` class and a new test function for invalid processors have been introduced.
* Scan `site-packages` ([#1411](https://github.com/databrickslabs/ucx/issues/1411)). A SitePackages scanner has been implemented, enhancing the linkage of module root names with the actual Python code within installed packages using metadata. This development addresses issue [#1410](https://github.com/databrickslabs/ucx/issues/1410) and is connected to [#1202](https://github.com/databrickslabs/ucx/issues/1202). New functionalities include user documentation, a CLI command, a workflow, and a table, accompanied by modifications to an existing command and workflow, as well as alterations to another table. Unit tests have been added to ensure the feature's proper functionality. In the diff, a new unit test file for `site_packages.py` has been added, checking `databrix` compatibility, which is reported as incompatible. This enhancement aims to bolster the user experience by providing more detailed insights into installed packages.
* Select DISTINCT job_run_id ([#1352](https://github.com/databrickslabs/ucx/issues/1352)). A modification has been implemented to optimize the SQL query for accessing log data, now retrieving distinct job_run_ids instead of a single one, nested in a subquery. The enhanced query selects the message field from the inventory.logs table, filtering based on job_run_id matches with the latest timestamp within the same table. This change enables multiple job_run_ids to correlate with the same timestamp, delivering a more holistic perspective of logs at a given moment. By upgrading the query functionality to accommodate multiple job run IDs, this improvement ensures more precise and detailed retrieval of log data.
* Support table migration to Unity Catalog in Python code ([#1210](https://github.com/databrickslabs/ucx/issues/1210)). This release introduces changes to the Python codebase that enhance the SparkSql linter/fixer to support migrating Spark SQL table references to Unity Catalog. The release includes modifications to existing commands, specifically `databricks labs ucx migrate_local_code`, and the addition of unit tests. The `SparkSql` class has been updated to support a new `index` parameter, allowing for migration support. New classes including `QueryMatcher`, `TableNameMatcher`, `ReturnValueMatcher`, and `SparkMatchers` have been added to hold various matchers for different spark methods. The release also includes modifications to existing methods for caching, creating, getting, refreshing, and un-caching tables, as well as updates to the `listTables` method to reflect the new format. The `saveAsTable` and `register` methods have been updated to handle variable and f-string arguments for the table name. The `databricks labs ucx migrate_local_code` command has been modified to handle spark.sql function calls that include a table name as a parameter and suggest necessary changes to migrate to the new Unity Catalog format. Integration tests are still needed.
* When building dependency graph, raise problems with problematic dependencies ([#1529](https://github.com/databrickslabs/ucx/issues/1529)). A new `DependencyProblem` class has been added to the databricks.labs.ucx.source_code.dependencies module to handle issues encountered during dependency graph construction. This class is used to raise issues when problematic dependencies are encountered during the build of the dependency graph. The `build_dependency_graph` method of the `SourceContainer` abstract class now accepts a `problem_collector` parameter, which is a callable function that collects and handles dependency problems. Instead of raising `ValueError` exceptions, the `DependencyProblem` class is used to collect and store information about the issues. This change improves error handling and diagnostic information during dependency graph construction. Relevant user documentation, a new CLI command, and a new workflow have been added, along with modifications to existing commands and workflows. Unit tests have been added to verify the new functionality.
* WorkspacePath to implement `pathlib.Path` API ([#1509](https://github.com/databrickslabs/ucx/issues/1509)). A new file, 'wspath.py', has been added to the `mixins` directory of the 'databricks.labs.ucx' package, implementing the custom Path object 'WorkspacePath'. This subclass of 'pathlib.Path' provides additional methods and functionality for the Databricks Workspace, including 'cwd()', 'home()', 'scandir()', and 'listdir()'. `WorkspacePath` interacts with the Databricks Workspace API for operations such as checking if a file/directory exists, creating and deleting directories, and downloading files. The `WorkspacePath` class has been updated to implement 'pathlib.Path' API for a more intuitive and consistent interface when working with file and directory paths. The class now includes methods like 'absolute()', 'exists()', 'joinpath()', 'parent', and supports the `with` statement for thread-safe code. A new test file 'test_wspath.py' has been added for the WorkspacePath mixin. New methods like 'expanduser()', 'as_fuse()', 'as_uri()', 'replace()', 'write_text()', 'write_bytes()', 'read_text()', and 'read_bytes()' have also been added. 'mkdir()' and 'rmdir()' now raise errors when called on non-absolute paths and non-empty directories, respectively.
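
As a rough illustration of the `pathlib.Path`-style ergonomics described in the `WorkspacePath` entry above, here is a minimal usage sketch; the constructor arguments and the workspace path are assumptions for illustration, not the verified API.

```python
# Hypothetical usage sketch of WorkspacePath; constructor arguments are
# assumed for illustration and may differ from the actual API.
from databricks.sdk import WorkspaceClient
from databricks.labs.ucx.mixins.wspath import WorkspacePath

ws = WorkspaceClient()
folder = WorkspacePath(ws, "/Users/me@example.com/reports")
folder.mkdir()  # raises when called on a non-absolute path
note = folder / "notes.txt"  # joinpath(), as in pathlib
note.write_text("migration checklist")
assert note.exists()
print(note.read_text())
```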

Dependency updates:

 * Bump actions/checkout from 3 to 4 ([#1191](https://github.com/databrickslabs/ucx/pull/1191)).
 * Bump actions/setup-python from 4 to 5 ([#1189](https://github.com/databrickslabs/ucx/pull/1189)).
 * Bump codecov/codecov-action from 1 to 4 ([#1190](https://github.com/databrickslabs/ucx/pull/1190)).
 * Bump softprops/action-gh-release from 1 to 2 ([#1188](https://github.com/databrickslabs/ucx/pull/1188)).
 * Bump databricks-sdk from 0.23.0 to 0.24.0 ([#1223](https://github.com/databrickslabs/ucx/pull/1223)).
 * Updated databricks-labs-lsql requirement from ~=0.3.0 to >=0.3,<0.5 ([#1387](https://github.com/databrickslabs/ucx/pull/1387)).
 * Updated sqlglot requirement from ~=23.9.0 to >=23.9,<23.11 ([#1409](https://github.com/databrickslabs/ucx/pull/1409)).
 * Updated sqlglot requirement from <23.11,>=23.9 to >=23.9,<23.12 ([#1486](https://github.com/databrickslabs/ucx/pull/1486)).
nfx added a commit that referenced this issue Apr 26, 2024
* A notebook linter to detect DBFS references within notebook cells
([#1393](https://github.com/databrickslabs/ucx/issues/1393)). A new
linter has been implemented in the open-source library to identify
references to Databricks File System (DBFS) mount points or folders
within SQL and Python cells of Notebooks, raising Advisory or Deprecated
alerts when detected. This feature, resolving issue
[#1108](https://github.com/databrickslabs/ucx/issues/1108), enhances
code maintainability by discouraging DBFS usage, and improves security
by avoiding hard-coded DBFS paths. The linter's functionality includes
parsing the code and searching for Table elements within statements,
raising warnings when DBFS references are found. Implementation changes
include updates to the `NotebookLinter` class, a new `from_source` class
method, and an `original_offset` argument in the `Cell` class. The
linter now also supports the `databricks` dialect for SQL code parsing.
* Added CLI commands to trigger table migration workflow
([#1511](https://github.com/databrickslabs/ucx/issues/1511)). A new
`migrate_tables` command has been added to the 'databricks.labs.ucx.cli'
module, which triggers the `migrate-tables` workflow and, optionally,
the `migrate-external-hiveserde-tables-in-place-experimental` workflow.
The `migrate-tables` workflow is responsible for managing table
migrations, while the
`migrate-external-hiveserde-tables-in-place-experimental` workflow
handles migrations for external hiveserde tables. The new `What` class
from the 'databricks.labs.ucx.hive_metastore.tables' module is used to
identify hiveserde tables. If hiveserde tables are detected, the user is
prompted to confirm running the
`migrate-external-hiveserde-tables-in-place-experimental` workflow. The
`migrate_tables` command requires a WorkspaceClient and Prompts objects
and accepts an optional WorkspaceContext object, which is set to the
WorkspaceContext of the WorkspaceClient if not provided. Additionally, a
new `migrate_external_hiveserde_tables_in_place` command has been added
which will run the
`migrate-external-hiveserde-tables-in-place-experimental` workflow if it
finds any hiveserde tables, making it easier to manage table migrations
from the command line.
* Added CSV, JSON and include path in mounts
([#1329](https://github.com/databrickslabs/ucx/issues/1329)). In this
release, the TablesInMounts function has been enhanced to support CSV
and JSON file formats, along with the existing Parquet and Delta table
formats. The new `include_paths_in_mount` parameter has been introduced,
enabling users to specify a list of paths to crawl within all mounts.
The WorkspaceConfig class in the config.py file has been updated to
accommodate these changes. Additionally, a new `_assess_path` method has
been introduced to assess the format of a given file and return a
`TableInMount` object accordingly. Several existing methods, such as
`_find_delta_log_folders`, `_is_parquet`, `_is_csv`, `_is_json`, and
`_path_is_delta`, have been updated to reflect these improvements.
Furthermore, two new unit tests, `test_mount_include_paths` and
`test_mount_listing_csv_json`, have been added to ensure the proper
functioning of the TablesInMounts function with the new file formats and
the `include_paths_in_mount` parameter. These changes aim to improve the
functionality and flexibility of the TablesInMounts library, allowing
for more precise crawling and identification of tables based on specific
file formats and paths.
* Added CTAS migration workflow for external tables cannot be in place
migrated ([#1510](https://github.com/databrickslabs/ucx/issues/1510)).
In this release, we have added a new CTAS (Create Table As Select)
migration workflow for external tables that cannot be migrated in-place.
This feature includes a `MigrateExternalTablesCTAS` class with three
tasks: migrating external tables that are neither SYNC-supported nor
HiveSerde, migrating HiveSerde tables, and migrating views from the
Hive Metastore to the Unity Catalog. We have also added new methods for
managed and
external table migration, deprecated old methods, and added a new test
function to ensure proper CTAS migration for external tables using
HiveSerDe. This change also introduces a new JSON file for external
table configurations and a mock backend to simulate the Hive Metastore
and test the migration process. Overall, these changes improve the
migration capabilities for external tables and ensure a more flexible
and reliable migration process.
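
As a hedged sketch of the CTAS approach, the generated statement boils
down to a fully-qualified `CREATE TABLE ... AS SELECT`; the helper
below is illustrative only, not the actual `MigrateExternalTablesCTAS`
internals.

```python
# Illustrative CTAS statement builder, assuming simple backtick-quoted
# identifiers; not the actual UCX implementation.
def ctas_sql(src_schema: str, src_table: str,
             dst_catalog: str, dst_schema: str, dst_table: str) -> str:
    src = f"`hive_metastore`.`{src_schema}`.`{src_table}`"
    dst = f"`{dst_catalog}`.`{dst_schema}`.`{dst_table}`"
    return f"CREATE TABLE IF NOT EXISTS {dst} AS SELECT * FROM {src}"

print(ctas_sql("sales", "orders_serde", "main", "sales", "orders_serde"))
```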
* Added Python linter for table creation with implicit format
([#1435](https://github.com/databrickslabs/ucx/issues/1435)). A new
linter has been added to the Python library to advise on implicit table
formats when the 'writeTo', 'table', 'insertInto', or `saveAsTable`
methods are invoked without an explicit format specified in the same
chain of calls. This feature is useful for software engineers working
with Databricks Runtime (DBR) v8.0 and later, where the default table
format changed from `parquet` to 'delta'. The linter, implemented in
'table_creation.py', utilizes reusable AST utilities from
'python_ast_util.py' and is not automated, providing advice instead of
fixing the code. The linter skips linting when a DBR version of 8.0 or
higher is passed, as the default format change only applies to versions
prior to 8.0. Unit tests have been added for both files as part of the
code migration workflow.
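
A simplified sketch of how such a linter can flag `saveAsTable` calls
with no explicit `format(...)` in the same call chain, using only the
standard `ast` module; this mirrors the idea, not the exact UCX code.

```python
import ast

CREATORS = {"writeTo", "table", "insertInto", "saveAsTable"}

def implicit_format_calls(source: str) -> list[int]:
    """Line numbers of table-creating calls lacking format() in the chain."""
    advices = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        if node.func.attr not in CREATORS:
            continue
        chain, has_format = node.func.value, False
        # Walk back down the receiver chain looking for .format(...)
        while isinstance(chain, ast.Call) and isinstance(chain.func, ast.Attribute):
            if chain.func.attr == "format":
                has_format = True
                break
            chain = chain.func.value
        if not has_format:
            advices.append(node.lineno)
    return advices

print(implicit_format_calls('df.write.saveAsTable("t")'))                  # [1]
print(implicit_format_calls('df.write.format("delta").saveAsTable("t")'))  # []
```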
* Added Support for Migrating Table ACL of Interactive clusters using
SPN ([#1077](https://github.com/databrickslabs/ucx/issues/1077)). This
change introduces support for migrating table Access Control Lists
(ACLs) of interactive clusters using a Security Principal Name (SPN) for
Azure Databricks environments in the UCX project. It includes
modifications to the `hive_metastore` and `workspace_access` modules, as
well as the addition of new classes, methods, and import statements for
handling ACLs and grants. This feature enables more secure and granular
control over table permissions when using SPN authentication for
interactive clusters in Azure. This will benefit software engineers
working with interactive clusters in Azure Databricks by enhancing
security and providing more control over data access.
* Added Support for migrating Schema/Catalog ACL for Interactive cluster
([#1413](https://github.com/databrickslabs/ucx/issues/1413)). This
commit adds support for migrating schema and catalog ACLs for
interactive clusters, specifically for AWS and Azure, with partial fixes
for issues [#1192](https://github.com/databrickslabs/ucx/issues/1192)
and [#1193](https://github.com/databrickslabs/ucx/issues/1193). The
changes identify and filter database ACL grants, create mappings from
Hive metastore schema to Unity Catalog schema and catalog, and replace
Hive metastore actions with equivalent Unity Catalog actions for both
schema and catalog. External location permission is not included in this
commit and will be addressed separately. New methods for creating
mappings, updating principal ACLs, and getting catalog schema grants
have been added, and existing functionalities have been modified to
handle both AWS and Azure. The code has undergone manual testing and
passed unit and integration tests. The changes are targeted towards
software engineers who adopt the project.
* Added `databricks labs ucx logs` command
([#1350](https://github.com/databrickslabs/ucx/issues/1350)). A new
command, 'databricks labs ucx logs', has been added to the open-source
library to enhance logging and debugging capabilities. This command
allows developers and administrators to view logs from the latest job
run or specify a particular workflow name to display its logs. By
default, logs with levels of INFO, WARNING, and ERROR are shown, but the
--debug flag can be used for more detailed DEBUG logs. This feature
utilizes the relay_logs method from the deployed_workflows object in the
WorkspaceContext class and addresses issue
[#1282](https://github.com/databrickslabs/ucx/issues/1282). The addition
of this command aims to improve the usability and maintainability of the
framework, making it easier for users to diagnose and resolve issues.
* Added check for DBFS mounts in SQL code
([#1351](https://github.com/databrickslabs/ucx/issues/1351)). A new
feature has been introduced to check for Databricks File System (DBFS)
mounts within SQL code, enhancing data management and accessibility in
the Databricks environment. The `dbfsqueries.py` file in the
`databricks/labs/ucx/source_code` directory now includes a function that
verifies the presence of DBFS mounts in SQL queries and returns
appropriate messages. The `Languages` class in the `__init__` method has
been updated to incorporate a new class, `FromDbfsFolder`, which
replaces the existing `from_table` linter with a new linter,
`DBFSUsageLinter`, for handling DBFS usage in SQL code. In addition, the
DBFS usage linter has been extended with methods that check for
deprecated DBFS mounts in SQL code and return deprecation warnings as
needed. These enhancements
ensure more robust handling of DBFS mounts throughout the system,
allowing for better integration and management of DBFS-related issues in
SQL-based operations.
* Added check for circular view dependency
([#1502](https://github.com/databrickslabs/ucx/issues/1502)). A circular
view dependency check has been implemented to prevent issues caused by
circular dependencies in views. This includes a new test for chained
circular dependencies (A->B, B->C, C->A) and an update to the existing
circular view dependency test. The checks have been implemented through
modifications to the tests in `test_views_sequencer.py`, including a new
test method and an update to the existing test method. If any circular
dependencies are encountered during migration, a ValueError with an
error message will be raised. These changes include updates to the
`tables_and_views.json` file, with the addition of a new view `v12` that
depends on `v11`, creating a circular dependency. The changes have been
tested through the addition of unit tests and are expected to function
as intended. No new methods have been added, but changes have been made
to the existing `_next_batch` method and two new methods,
`_check_circular_dependency` and `_get_view_instance`, have been
introduced.
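
Conceptually, the circular-dependency check is a depth-first walk over
the view dependency map; the generic sketch below illustrates the idea
and is not the UCX `_check_circular_dependency` code itself.

```python
def find_cycle(deps: dict[str, set[str]]) -> list[str] | None:
    """Return one dependency cycle as a list of view names, or None."""
    def dfs(node: str, stack: list[str]) -> list[str] | None:
        if node in stack:  # we came back to a view already on the path
            return stack[stack.index(node):] + [node]
        for child in deps.get(node, ()):
            found = dfs(child, stack + [node])
            if found:
                return found
        return None
    for start in deps:
        cycle = dfs(start, [])
        if cycle:
            return cycle
    return None

print(find_cycle({"A": {"B"}, "B": {"C"}, "C": {"A"}}))  # ['A', 'B', 'C', 'A']
```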
* Added commands for metastores listing & assignment
([#1489](https://github.com/databrickslabs/ucx/issues/1489)). This
commit introduces new commands for handling metastores in the Databricks
Labs Unity Catalog (UCX) tool, which enables more efficient management
of metastores. The `databricks labs ucx assign-metastore` command
automatically assigns a metastore to a specified workspace when
possible, while the `databricks labs ucx show-all-metastores` command
displays all possible metastores that can be assigned to a workspace.
These changes include new methods for handling metastores in the account
and workspace classes, as well as new user documentation, manual
testing, and unit tests. The new functionality is added to improve the
usability and efficiency of the UCX tool in handling metastores.
Additional information on the UCX metastore commands is provided in the
README.md file.
* Added functionality to migrate external tables using Create Table (No
Sync) ([#1432](https://github.com/databrickslabs/ucx/issues/1432)). A
new feature has been implemented for migrating external tables in
Databricks' Hive metastore using the "Create Table (No Sync)" method.
This feature includes the addition of two new methods,
`_migrate_non_sync_table` and `_get_create_in_place_sql`, for handling
migration and SQL query generation. The existing methods
`_migrate_dbfs_root_table` and `_migrate_acl` have also been updated. A
test case has been added to demonstrate migration of external tables
while preserving their location and properties. This new functionality
provides more flexibility in managing migrations for specific use cases.
The SQL parsing library sqlglot has been utilized to replace the current
table name with the updated catalog and change the CREATE statement to
CREATE IF NOT EXISTS. This increases the efficiency and security of
migrating external tables in Databricks' Hive metastore.
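
As a rough illustration of that sqlglot-based rewrite (an assumed
shape, not the actual `_get_create_in_place_sql` code), a captured DDL
statement can be re-targeted like this:

```python
# Illustrative sqlglot rewrite: retarget the table to the new catalog
# and make the CREATE idempotent; the real UCX code may differ.
import sqlglot
from sqlglot import exp

ddl = "CREATE TABLE hive_metastore.sales.orders (id INT) LOCATION 's3://bucket/orders'"
statement = sqlglot.parse_one(ddl, read="databricks")
for table in statement.find_all(exp.Table):
    table.set("catalog", exp.to_identifier("main"))
statement.set("exists", True)  # renders as CREATE TABLE IF NOT EXISTS
print(statement.sql(dialect="databricks"))
```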
* Added initial version of account-level installer
([#1339](https://github.com/databrickslabs/ucx/issues/1339)). A new
account-level installer has been added to the UCX library, allowing
account administrators to install UCX on all workspaces within an
account in a single operation. The installer authenticates to the
account, prompts the user for configuration of the first workspace, and
then runs the installation and offers to repeat the process for all
remaining workspaces. This is achieved through the creation of a new
`prompt_for_new_installation` method which saves user responses to a new
`InstallationConfig` data class, allowing for reuse in other workspaces.
The existing `databricks labs install ucx` command now supports
account-level installation when the `UCX_FORCE_INSTALL` environment
variable is set to 'account'. The changes have been manually tested and
include updates to documentation and error handling for
`PermissionDenied`, `NotFound`, and `ValueError` exceptions.
Additionally, a new `AccountInstaller` class has been added to manage
the installation process at the account level.
* Added linting for DBFS usage
([#1341](https://github.com/databrickslabs/ucx/issues/1341)). A new
linter, "DBFSUsageLinter", has been added to our open-source library to
check for deprecated file system paths in Python code, specifically for
Databricks File System (DBFS) usage. Implemented as part of the
"databricks.labs.ucx.source_code" package in the "languages.py" file,
this linter defines a visitor, "DetectDbfsVisitor", that detects file
system paths in the code and checks them against a list of known
deprecated paths. If a match is found, it creates a Deprecation or
Advisory object with information about the deprecated code, including
the line number and column offset, and adds it to a list. This feature
will assist in identifying and removing deprecated file system paths
from the codebase, ensuring consistent and proper use of DBFS within the
project.
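
In spirit, the visitor walks the AST and compares string constants
against known DBFS prefixes; below is a condensed sketch with
illustrative names, not the shipped `DetectDbfsVisitor`.

```python
import ast

DEPRECATED_PREFIXES = ("/dbfs/", "dbfs:/", "/mnt/")

def deprecated_paths(source: str) -> list[tuple[int, int, str]]:
    """Collect (line, column, value) for strings that look like DBFS paths."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value.startswith(DEPRECATED_PREFIXES):
                hits.append((node.lineno, node.col_offset, node.value))
    return hits

print(deprecated_paths('df = spark.read.csv("/dbfs/mnt/raw/data.csv")'))
```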
* Added log task to parse logs and store the logs in the ucx database
([#1272](https://github.com/databrickslabs/ucx/issues/1272)). A new log
task has been added to parse logs and store them in the ucx database,
added as a log crawler task to all workflows after other tasks have
completed. The LogRecord has been updated to include all necessary
fields, and logs below a certain minimum level will no longer be stored.
A new CLI command to retrieve errors and warnings from the latest
workflow run has been added, while existing commands and workflows have
been modified. User documentation has been updated, and new methods have
been added for log parsing and storage. A new table called `logs` has
been added to the database, and unit and integration tests have been
added to ensure functionality. This change also resolves issues
[#1148](https://github.com/databrickslabs/ucx/issues/1148) and
[#1283](https://github.com/databrickslabs/ucx/issues/1283), with
modifications to existing classes such as RuntimeContext,
TaskRunWarningRecorder, and LogRecord, and the addition of new classes
and methods including HiveMetastoreLineageEnabler and LogRecord in the
logs.py file. The deploy_schema function has been updated to include the
new table, and the existing command `databricks labs ucx` has been
modified to accommodate the new log functionality. Existing workflows
have been updated and a new workflow has been added, all of which are
tested through unit tests, integration tests, and manual testing. The
`TaskLogger` class and `TaskRunWarningRecorder` class are used to log
and record task run data, with the `parse_logs` method used to parse log
files into partial log records, which are then used to create snapshot
rows in the `logs` table.
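
A hedged sketch of the log-parsing step follows; the exact layout and
separators of UCX log lines, and the `LogRecord` fields, are
assumptions here, chosen only to show the record-per-line idea.

```python
import re
from dataclasses import dataclass

# Assumed line layout for illustration: "<date time> <LEVEL> [component] message"
LINE = re.compile(r"^(?P<time>\S+ \S+) (?P<level>[A-Z]+) \[(?P<component>[^\]]+)\] (?P<message>.*)$")

@dataclass
class LogRecord:
    time: str
    level: str
    component: str
    message: str

def parse_logs(lines: list[str]) -> list[LogRecord]:
    records = []
    for line in lines:
        match = LINE.match(line)
        if match:  # continuation lines of multi-line messages are skipped here
            records.append(LogRecord(**match.groupdict()))
    return records

print(parse_logs(["2024-04-26 10:15:01 WARNING [assessment] table scan skipped"]))
```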
* Added migration for non delta dbfs tables using Create Table As Select
(CTAS). Convert such tables to Delta tables
([#1434](https://github.com/databrickslabs/ucx/issues/1434)). In this
release, we've developed new methods to migrate non-Delta DBFS root
tables to managed Delta tables, enhancing compatibility with various
table formats and configurations. We've added support for safer SQL
statement generation in our Create Table As Select (CTAS) functionality
and incorporated new creation methods. Additionally, we've introduced
grant assignments during the migration process and updated integration
tests. The changes include the addition of a `TablesMigrator` class with
an updated `migrate_tables` method, a new `PrincipalACL` parameter, and
the `test_dbfs_non_delta_tables_should_produce_proper_queries` function
to test the migration of non-Delta DBFS tables to managed Delta tables.
These improvements promote safer CTAS functionality and expanded
compatibility for non-Delta DBFS root tables.
* Added support for %pip cells
([#1401](https://github.com/databrickslabs/ucx/issues/1401)). A new cell
type, %pip, has been introduced to the notebook interface, allowing for
the execution of pip commands within the notebook. The new class,
PipCell, has been added with several methods, including is_runnable,
build_dependency_graph, and migrate_notebook_path, enabling the notebook
interface to recognize and handle pip cells differently from other cell
types. This allows for the installation of Python packages directly
within a notebook setting, enhancing the notebook environment and
providing users with the ability to dynamically install necessary
packages as they work. The new sample notebook file demonstrates the
installation of a package using the %pip install command. The
implementation includes modifying the notebook runtime to recognize and
execute %pip cells, and installing packages in a manner consistent with
standard pip installation processes. Additionally, a new tuple,
PIP_NOTEBOOK_SAMPLE, has been added to the existing test notebook sample
tuple list, enabling testing the handling of %pip cells during notebook
splitting.
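
The cell-splitting idea can be sketched as below; real exported
notebooks encode magics behind `# MAGIC` markers, so this is a
deliberate simplification of the actual `PipCell` handling.

```python
PIP_MAGIC = "%pip"

def split_cells(source: str, separator: str = "# COMMAND ----------") -> list[tuple[str, str]]:
    """Split notebook source into cells and tag the ones that run pip."""
    cells = []
    for chunk in source.split(separator):
        body = chunk.strip()
        if not body:
            continue
        kind = "pip" if body.startswith(PIP_MAGIC) else "python"
        cells.append((kind, body))
    return cells

src = "import os\n# COMMAND ----------\n%pip install databricks-sdk\n"
print(split_cells(src))
# [('python', 'import os'), ('pip', '%pip install databricks-sdk')]
```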
* Added support for %sh cells
([#1400](https://github.com/databrickslabs/ucx/issues/1400)). A new
`SHELL` CellLanguage has been implemented to support %sh cells, enabling
the execution of shell commands directly within the notebook interface.
This enhancement, addressing issue
[#1400](https://github.com/databrickslabs/ucx/issues/1400) and linked to
[#1399](https://github.com/databrickslabs/ucx/issues/1399) and
[#1202](https://github.com/databrickslabs/ucx/issues/1202), streamlines
the process of running shell scripts in the notebook, eliminating the
need for external tools. The new SHELL_NOTEBOOK_SAMPLE tuple, part of
the updated test suite, demonstrates the feature's functionality with a
shell cell, while the new methods manage the underlying mechanics of
executing these shell commands. These changes not only extend the
platform's capabilities by providing built-in support for shell commands
but also improve productivity and ease-of-use for teams relying on shell
commands as part of their data processing and analysis pipelines.
* Added support for migrating Table ACL for interactive cluster in AWS
using Instance Profile
([#1285](https://github.com/databrickslabs/ucx/issues/1285)). This
change adds support for migrating table access control lists (ACLs) for
interactive clusters in AWS using an Instance Profile. A new method
`get_iam_role_from_cluster_policy` has been introduced in the `AwsACL`
class, which replaces the static method
`_get_iam_role_from_cluster_policy`. The `create_uber_principal` method
now uses this new method to obtain the IAM role name from the cluster
policy. Additionally, the project now includes AWS Role Action and AWS
Resource Permissions to handle permissions for migrating table ACLs for
interactive clusters in AWS. New methods and classes have been added to
support AWS-specific functionality and handle AWS instance profile
information. Two new tests have been added to tests/unit/test_cli.py to
test various scenarios for interactive clusters with and without ACL in
AWS. A new argument `is_gcp` has been added to WorkspaceContext to
differentiate between Google Cloud Platform and other cloud providers.
* Added support for views in `table-migration` workflow
([#1325](https://github.com/databrickslabs/ucx/issues/1325)). A new
`MigrationStatus` class has been added to track the migration status of
tables and views in a Hive metastore, and a `MigrationIndex` class has
been added to check if a table or view has been migrated or not. The
`MigrationStatusRefresher` class has been updated to use a new approach
for migrating tables and views, and is now responsible for refreshing
the migration status of tables and indexing it using the
`MigrationIndex` class. A `ViewsMigrationSequencer` class has also been
introduced to sequence the migration of views based on dependencies.
These changes improve the migration process for tables and views in the
`table-migration` workflow.
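
Conceptually, `MigrationIndex` is a lookup keyed by source schema and
table; the toy sketch below (field names are assumptions, not the
actual `MigrationStatus` schema) shows how such an index answers
whether an object has been migrated.

```python
from dataclasses import dataclass

@dataclass
class MigrationStatus:
    src_schema: str
    src_table: str
    dst_catalog: str | None = None  # None until the object is migrated

class MigrationIndex:
    def __init__(self, statuses: list[MigrationStatus]):
        self._index = {(s.src_schema, s.src_table): s for s in statuses}

    def is_migrated(self, schema: str, table: str) -> bool:
        status = self._index.get((schema, table))
        return status is not None and status.dst_catalog is not None

index = MigrationIndex([MigrationStatus("sales", "orders", "main")])
print(index.is_migrated("sales", "orders"))   # True
print(index.is_migrated("sales", "returns"))  # False
```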
* Added workflow for in-place migrating external Parquet, Orc, Avro
hiveserde tables
([#1412](https://github.com/databrickslabs/ucx/issues/1412)). This
change introduces a new workflow, `MigrateHiveSerdeTablesInPlace`, for
in-place upgrading external Parquet, Orc, and Avro hiveserde tables to
the Unity Catalog. The workflow includes new functions to describe the
table and extract hiveserde details, update the DDL from `show create
table`, and replace the old table name with the migration target and
DBFS mount table location if any. A new function
`_migrate_external_table_hiveserde` has been added to
`table_migrate.py`, and two new arguments, `mounts` and
`hiveserde_in_place_migrate`, have been added to the `TablesMigrator`
class. These arguments control which hiveserde to migrate and replace
the DBFS mount table location if any, enabling multiple tasks to run in
parallel and migrate only one type of hiveserde at a time. This feature
does not include user documentation, new CLI commands, or changes to
existing commands, but it does add a new workflow and modify the
existing `migrate_tables` function in `table_migrate.py`. The changes
have been manually tested, but no unit tests, integration tests, or
staging environment verification have been provided.
* Build dependency graph for local files
([#1462](https://github.com/databrickslabs/ucx/issues/1462)). This
commit refactors dependency classes to distinguish between resolution
and loading, and introduces new classes to handle different types of
dependencies. A new method, `LocalFileMigrator.build_dependency_graph`,
is implemented, following the pattern of `NotebookMigrator`, to build a
dependency graph for local files. This resolves issue
[#1202](https://github.com/databrickslabs/ucx/issues/1202)
and addresses issue
[#1360](https://github.com/databrickslabs/ucx/issues/1360).
While the refactoring and implementation of new methods improve the
accuracy of dependency graphs and ensure that dependencies are correctly
registered based on the file's language, there are no user-facing
changes, such as new or modified CLI commands, tables, or workflows.
Unit tests are added to ensure that the new changes function as
expected.
* Build dependency graph for site packages
([#1504](https://github.com/databrickslabs/ucx/issues/1504)). This
commit introduces changes to the dependency graph building process for
site packages within the ucx project. When a package is not recognized,
package files are added as dependencies to prevent errors during import
dependency determination, thereby fixing an infinite loop issue when
encountering cyclical graphs. This resolves issue
[#1427](https://github.com/databrickslabs/ucx/issues/1427) and is
related to [#1202](https://github.com/databrickslabs/ucx/issues/1202).
The changes include adding new methods for handling package files as
dependencies and preventing infinite loops when visiting cyclical
graphs. The `SitePackage` class in the `site_packages.py` file has been
updated to handle package files more accurately, with the `__init__`
method now accepting `module_paths` as a list of Path objects instead of
a list of strings. A new method, `module_paths`, has also been
introduced. Unit tests have been added to ensure the correct
functionality of these changes, and a hack in the PR will be removed
once issue [#1421](https://github.com/databrickslabs/ucx/issues/1421) is
implemented.
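
The infinite-loop fix reduces to tracking visited nodes while walking
the graph; a generic sketch of that guard:

```python
def walk(graph: dict[str, list[str]], root: str) -> list[str]:
    """Visit every dependency reachable from root exactly once, cycles included."""
    seen: set[str] = set()
    order: list[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:  # already visited: break the cycle here
            continue
        seen.add(node)
        order.append(node)
        stack.extend(graph.get(node, []))
    return order

# 'a' and 'b' import each other; the walk still terminates.
print(walk({"a": ["b"], "b": ["a"]}, "a"))  # ['a', 'b']
```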
* Build notebook dependency graph for `%run` cells
([#1279](https://github.com/databrickslabs/ucx/issues/1279)). A new
`Notebook` class has been developed to parse source code and split it
into cells, and a `NotebookDependencyGraph` class with related utilities
has been added to discover dependencies in `%run` cells, addressing
issue [#1201](https://github.com/databrickslabs/ucx/issues/1201). The
new functionality enhances the management and tracking of dependencies
within notebooks, improving code organization and efficiency. The commit
includes updates to existing notebooks to utilize the new classes and
methods, with no impact on existing functionality outside of the `%run`
context.
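
A minimal regex-based sketch of extracting `%run` targets from a cell
follows; the shipped parser is cell-aware rather than a plain regex.

```python
import re

RUN_MAGIC = re.compile(r"^\s*%run\s+(?P<path>\S+)", re.MULTILINE)

def run_targets(cell_source: str) -> list[str]:
    """Notebook paths referenced by %run magics in a cell."""
    return [m.group("path") for m in RUN_MAGIC.finditer(cell_source)]

print(run_targets("%run ./shared/utils\nprint('hello')"))  # ['./shared/utils']
```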
* Create UC External Location, Schema, and Table Grants based on
workspace-wide Azure SPN mount points
([#1374](https://github.com/databrickslabs/ucx/issues/1374)). This
change adds new functionality to create Unity Catalog (UC) external
location, schema, and table grants based on workspace-wide Azure Service
Principal Names (SPN) mount points. The majority of the work was
completed in a previous pull request. The main change in this pull
request is the addition of a new test function,
`test_migrate_external_tables_with_principal_acl_azure`, which tests the
migration of tables with principal ACLs in an Azure environment. This
function includes the creation of a new user with cluster access,
another user without cluster access, and a new group with cluster access
to validate the migration of table grants to these entities. The
`make_cluster_permissions` method now accepts a `service_principal_name`
parameter, and after migrating the tables with the `acl_strategy` set to
`PRINCIPAL`, the function checks if the appropriate grants have been
assigned to the Azure SPN. This change is part of an effort to improve
the integration of Unity Catalog with Azure SPNs and is accessible
through the UCX CLI command. The changes have been tested through manual
testing, unit tests, and integration tests and have been verified in a
staging environment.
* Detect DBFS use in SQL statements in notebooks
([#1372](https://github.com/databrickslabs/ucx/issues/1372)). A new
linter has been added to detect and discourage the use of DBFS
(Databricks File System) in SQL statements within notebooks. This linter
raises deprecated advisories for any identified DBFS folder or mount
point references in SQL statements, encouraging the use of alternative
storage options. The change is implemented in the `NotebookLinter` class
of the 'notebook_linter.py' file, and is tested through unit tests to
ensure proper functionality. The target audience for this update
includes software engineers who use Databricks or similar platforms, as
the new linter will help users transition away from using DBFS in their
SQL statements and adopt alternative storage methods.
* Detect `sys.path` manipulation
([#1380](https://github.com/databrickslabs/ucx/issues/1380)). A change
has been introduced to the Python linter to detect manipulation of
`sys.path`. New classes, AbsolutePath and RelativePath, have been added
as subclasses of SysPath. The SysPathVisitor class has been implemented
to track additions to sys.path and the visit_Call method in
SysPathVisitor checks for 'sys.path.append' and 'os.path.abspath' calls.
The new functionality includes a new method, `collect_appended_sys_paths`,
in `PythonLinter`, and a static method, `list_appended_sys_paths`, to
retrieve the appended paths. Additionally, new tests have been added to
the PythonLinter to detect manipulation of the `sys.path` variable,
specifically the `list_appended_sys_paths` method. The new test cases
include using aliases for `sys`, `os`, and `os.path`, and using both
absolute and relative paths. This improvement will enhance the linter's
ability to detect potential issues related to manipulation of the
`sys.path` variable. The change resolves issue
[#1379](https://github.com/databrickslabs/ucx/issues/1379) and is linked
to issue [#1202](https://github.com/databrickslabs/ucx/issues/1202). No
user documentation or CLI commands have been added or modified, and no
manual testing has been performed. Unit tests for the new functionality
have been added.
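
The detection hinges on spotting `sys.path.append(...)` calls in the
AST; the condensed sketch below shows the idea, ignoring the alias
handling the real SysPathVisitor supports.

```python
import ast

def appended_sys_paths(source: str) -> list[str]:
    """String literals passed to sys.path.append(...); aliases not handled."""
    paths = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        func = node.func
        target = func.value  # expect the `sys.path` attribute chain
        if (func.attr == "append"
                and isinstance(target, ast.Attribute) and target.attr == "path"
                and isinstance(target.value, ast.Name) and target.value.id == "sys"):
            for arg in node.args:
                if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                    paths.append(arg.value)
    return paths

print(appended_sys_paths("import sys\nsys.path.append('/Workspace/libs')"))
```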
* Detect direct access to cloud storage and raise a deprecation warning
([#1506](https://github.com/databrickslabs/ucx/issues/1506)). In this
release, the Pyspark linter has been enhanced to detect and issue
deprecation warnings for direct access to cloud storage. This change,
which resolves issue
[#1133](https://github.com/databrickslabs/ucx/issues/1133), introduces
new classes `AstHelper` and `TableNameMatcher` to determine the
fully-qualified name of functions and replace instances of direct cloud
storage access with migration index table names. Instances of direct
access using 'dbfs:/', 'dbfs://', and default 'dbfs:' references will
now be detected and flagged with a deprecation warning. The test file
`test_pyspark.py` has been updated to include new tests for detecting
direct cloud storage access. Users should be aware of these changes when
updating their code to avoid deprecation warnings.
* Detect imported files and packages
([#1362](https://github.com/databrickslabs/ucx/issues/1362)). This
commit introduces functionality to parse Python code for `import` and
`import from` processing instructions, enabling the detection and
management of imported files and packages. It includes a new CLI
command, modifications to existing commands, new and updated workflows,
and additional tables. The code modifications include new methods for
visiting Import and ImportFrom nodes, and the addition of unit tests to
ensure correctness. Relevant user documentation has been added, and the
new functionality has been tested through manual testing, unit tests,
and verification on a staging environment. This comprehensive update
enhances dependency management, code organization, and understanding for
a more streamlined user experience.
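
In outline, the detection visits `Import` and `ImportFrom` nodes and
records the top-level module names; a minimal sketch:

```python
import ast

def imported_modules(source: str) -> set[str]:
    """Top-level module names pulled in by import and from-import statements."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return modules

print(imported_modules("import s3fs\nfrom pyspark.sql import functions as F"))
# {'s3fs', 'pyspark'}
```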
* Enhanced migrate views task to support views created with explicit
column list
([#1375](https://github.com/databrickslabs/ucx/issues/1375)). The commit
enhances the migrate views task to better support handling of views with
an explicit column list, improving overall compatibility. A new lookup
based on `SHOW CREATE TABLE` has been added to extract the column list
from the create script, ensuring accurate migration. The
`_migrate_view_table` method has been refactored, and a new
`_sql_migrate_view` method is added to fetch the create statement of the
view. The `ViewToMigrate` class has been updated with a new
`_view_dependencies` method to determine view dependencies in the new
SQL text. Additionally, new methods `safe_sql_key` and `add_table` have
been introduced, and the `sqlglot.parse` method is used to parse the
code with `databricks` as the read argument. A new test for migrating
views with an explicit column list has been added, along with the
`upgraded_from` and `upgraded_to` table properties, and the migration
status is updated to reflect successful migration. New test functions
have also been added to test the migration of views with columns and
ACLs. Dependency sqlglot has been updated to version ~=23.9.0, enhancing
the overall functionality and compatibility of the migrate views task.
* Ensure that USE statements are recognized and apply to table
references without a qualifying schema in SQL and pyspark
([#1433](https://github.com/databrickslabs/ucx/issues/1433)). This
commit enhances the library's functionality in handling `USE` statements
in both SQL and PySpark by ensuring they are recognized and applied to
table references without a qualifying schema. A new
`CurrentSessionState` class is introduced to manage the current schema
of a session, and existing classes such as `FromTable` and
`TableNameMatcher` are updated to use this new class. Additionally, the
`lint` and `apply` methods have been updated to handle `USE` statements
and improve the precision of table reference handling. These changes are
particularly useful when working with tables in different schemas,
ensuring the library can manage table references more accurately in SQL
and PySpark. A new fixture, 'extended_test_index', has been added to
support unit tests, and the test file 'test_notebook.py' has been
updated to better reflect the intended schema for each table reference.
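
The idea behind `CurrentSessionState` can be sketched with sqlglot:
remember the schema set by `USE`, then qualify bare table names with
it. This is illustrative, not the UCX implementation.

```python
import sqlglot
from sqlglot import exp

def qualify(queries: list[str], default_schema: str = "default") -> list[str]:
    """Apply the schema from USE statements to unqualified table references."""
    schema, out = default_schema, []
    for query in queries:
        statement = sqlglot.parse_one(query, read="databricks")
        if isinstance(statement, exp.Use):
            schema = statement.this.name  # remember the session schema
            continue
        for table in statement.find_all(exp.Table):
            if not table.db:  # no qualifying schema in the source
                table.set("db", exp.to_identifier(schema))
        out.append(statement.sql(dialect="databricks"))
    return out

print(qualify(["USE sales", "SELECT * FROM orders"]))
# ['SELECT * FROM sales.orders']
```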
* Expand documentation for end to end workflows with external HMS
([#1458](https://github.com/databrickslabs/ucx/issues/1458)). The UCX
toolkit has been updated to support integration with an external Hive
Metastore (HMS), in addition to the default workspace HMS. This feature
allows users to easily set up UCX to work with an existing external HMS,
providing greater flexibility in managing and accessing data. During
installation, UCX will scan for evidence of an external HMS in the
cluster policies and Spark configurations. If found, UCX will prompt the
user to connect to the external HMS, create a new policy with the
necessary Spark and data access configurations, and set up job clusters
accordingly. However, users will need to manually update the data access
configuration for SQL Warehouses that are not configured for external
HMS. Users can also create a cluster policy with appropriate Spark
configurations and data access for external HMS, or edit existing
policies in specified UCX workflows. Once set up, the assessment
workflow will scan tables and views from the external HMS, and the table
migration workflow will upgrade tables and views from the external HMS
to the Unity Catalog. Users should note that if the external HMS is
shared between multiple workspaces, a different inventory database name
should be specified for each UCX installation. It is important to plan
carefully when setting up a workspace with multiple external HMS, as the
assessment dashboard will fail if the SQL warehouse is not configured
correctly. Users can have multiple UCX installations in a workspace,
each set up with a different external HMS, or manually modify the
cluster policy and SQL data access configuration to point to the correct
external HMS after UCX has been installed.
* Extend service principal migration with option to create access
connectors with managed identity for each storage account
([#1417](https://github.com/databrickslabs/ucx/issues/1417)). This
commit extends the service principal migration feature to create access
connectors with managed identities for each storage account, enhancing
security and isolation by preventing cross-account access. A new CLI
command has been added, and an existing command has been modified. The
`create_access_connectors_for_storage_accounts` method creates access
connectors with the required permissions for each storage account used
in external tables. The `_apply_storage_permission` method has also been
updated. New unit and integration tests have been included, covering
various scenarios such as secret value decoding, secret read exceptions,
and single storage account testing. The necessary permissions for these
connectors will be set in a subsequent pull request. Additionally, a new
method, `azure_resources_list_access_connectors`, and
`azure_resources_get_access_connector` have been introduced to ensure
access connectors are returned as expected. This change has been tested
manually and through automated tests, ensuring backward compatibility
while providing improved security features.
* Fixed UCX policy creation when instance pool is specified
([#1457](https://github.com/databrickslabs/ucx/issues/1457)). In this
release, we have made significant improvements to the handling of
instance pools in UCX policy creation. The `policy.py` file has been
updated to properly handle the case when an instance pool is specified,
by setting the `instance_pool_id` attribute and removing the
`node_type_id` attribute in the policy definition. Additionally, the
availability attribute has been removed for all cloud providers,
including AWS, Azure, and GCP, when an instance pool ID is provided. A
new `pop` method call has also been added to remove the
`gcp_attributes.availability` attribute when an instance pool ID is
provided. These changes ensure consistency in the policy definition
across all cloud providers. Furthermore, tests for this functionality
have been updated in the 'test_policy.py' file, specifically the
`test_cluster_policy_instance_pool` function, to check the correct
addition of the instance pool to the cluster policy. The purpose of
these changes is to improve the reliability and functionality of UCX
policy creation, specifically when an instance pool is specified.
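
The fix boils down to swapping node-type fields for the pool reference
in the policy definition; a hedged sketch of that reshaping follows
(the keys mirror cluster-policy JSON, not the UCX source).

```python
def apply_instance_pool(policy: dict, instance_pool_id: str | None) -> dict:
    """Prefer an instance pool over explicit node type and availability settings."""
    if not instance_pool_id:
        return policy
    policy["instance_pool_id"] = {"type": "fixed", "value": instance_pool_id}
    for key in ("node_type_id", "aws_attributes.availability",
                "azure_attributes.availability", "gcp_attributes.availability"):
        policy.pop(key, None)  # the pool dictates these settings
    return policy

policy = {"node_type_id": {"type": "fixed", "value": "Standard_D4s_v3"}}
print(apply_instance_pool(policy, "pool-0123"))
```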
* Fixed `migrate-credentials` command on aws
([#1501](https://github.com/databrickslabs/ucx/issues/1501)). In this
release, the `migrate-credentials` command for the `labs.yml`
configuration file has been updated to include new flags for specifying
a subscription ID and AWS profile. This allows users to scan a specific
storage account and authenticate using a particular AWS profile when
migrating credentials for storage access to UC storage credentials. The
`create-account-groups` command remains unchanged. Additionally, several
issues related to the `migrate-credentials` command for AWS have been
addressed, such as incorrectly assuming the presence of a `--profile` flag,
using a monotonically increasing role ID, and not handling cases where
there are no IAM roles to migrate. The `run` method of the
`AwsUcStorageCredentials` class has been updated to handle these cases,
and several test functions have been added or updated to ensure proper
functionality. These changes improve the functionality and robustness of
the `migrate-credentials` command for AWS.
* Fixed edge case for `RegexSubStrategy`
([#1561](https://github.com/databrickslabs/ucx/issues/1561)). In this
release, we have implemented fixes for the `RegexSubStrategy` class
within the `GroupMigrationStrategy`, addressing an issue where matching
account groups could not be found using the display name. The
`generate_migrated_groups` function has been updated to include a check
for account groups with matching external IDs when either the display
name or regex substitution of the display name fails to yield a match.
Additionally, we have expanded testing for the `GroupManager` class,
which handles group management. This includes new tests using regular
expressions to match groups, and ensuring that the `GroupManager` class
can correctly identify and manage groups based on different criteria
such as the group's ID, display name, or external ID. These changes
improve the robustness of the `GroupMigrationStrategy` and ensure the
proper functioning of the `GroupManager` class when using regular
expression substitution and matching; a sketch of the fallback logic
appears after this list.
* Fixed table in mount partition scans for JSON and CSV
([#1437](https://github.com/databrickslabs/ucx/issues/1437)). This
release introduces a fix for an issue where table scans on partitioned
CSV and JSON files were not being correctly identified. The
`TablesInMounts` scan function has been updated to accurately detect
these files, addressing the problem reported in issue
[#1389](https://github.com/databrickslabs/ucx/issues/1389) and linked
issue [#1437](https://github.com/databrickslabs/ucx/issues/1437). To
ensure functionality, new private methods `_find_partition_file_format`
and `_assess_path` have been introduced, with the latter updated to
handle partitioned directories. Additionally, unit tests have been added
to test partitioned CSVs and JSONs, simulating the file system's
response to various calls. These changes provide enhanced detection and
handling of partitioned CSVs and JSONs in the `TablesInMounts` scan
function; an illustrative sketch of partition detection appears after
this list.
* Forward remote logs on `run_workflow` and removed `destroy-schema`
workflow in favour of `databricks labs uninstall ucx`
([#1349](https://github.com/databrickslabs/ucx/issues/1349)). In this
release, the `destroy-schema` workflow has been removed and replaced
with the `databricks labs uninstall ucx` command, addressing issue
[#1186](https://github.com/databrickslabs/ucx/issues/1186). The
`run_workflow` function has been updated to forward remote logs, and the
`run_task` function now accepts a new argument `sql_backend`. The `Task`
class includes a new method `is_testing()` and has been updated to use
`RuntimeBackend` before `SqlBackend` in the
`databricks.labs.lsql.backends` module. The `TaskLogger` class has been
modified to include a new argument `attempt` and a new class method
`log_path()`. The `verify_metastore` method in the `verification.py`
file has been updated to handle `PermissionDenied` exceptions more
gracefully. The `destroySchema` class and its `destroy_schema` method
have been removed. The `workflow_task.py` file has been updated to
include a new argument `attempt` in the `task_run_warning_recorder`
method. These changes aim to improve the system's efficiency, error
handling, and functionality.
* Give all access connectors `Storage Blob Data Contributor` role
([#1425](https://github.com/databrickslabs/ucx/issues/1425)). A new
change has been introduced to grant the `Storage Blob Data Contributor`
role, which provides the highest level of data access, to all access
connectors for each storage account in the system. This adjustment is
part of issue [#142](https://github.com/databrickslabs/ucx/issues/142).
* Grant uber principal write permissions so that SYNC command will
succeed ([#1505](https://github.com/databrickslabs/ucx/issues/1505)). A
change has been implemented to modify the `databricks labs ucx
create-uber-principal` command, granting the uber principal write
permissions on Azure Blob Storage. This aligns with the existing
implementation on AWS where the uber principal has write access to all
S3 buckets. The modification includes the addition of a new role,
"STORAGE_BLOB_DATA_CONTRIBUTOR", to the `_ROLES` dictionary in the
`resources.py` file. A new method, `clean_up_spn`, has also been added
to clean up UCX uber service principals. This change resolves issue
[#939](https://github.com/databrickslabs/ucx/issues/939) and ensures
consistent behavior with AWS, enabling the uber principal to have write
permissions on all Azure blob containers and ensuring the success of the
`SYNC` command. The changes have been manually tested but not yet
verified on a staging environment.
* Handled new output format of `SHOW TBLPROPERTIES` command
([#1381](https://github.com/databrickslabs/ucx/issues/1381)). A recent
commit has been made to address an issue with the
`test_revert_migrated_table` test failing due to the new output format
of the `SHOW TBLPROPERTIES` command in the open-source library.
Previously, the output was blank if a table property was missing, but
now it shows a message indicating that the table does not have the
specified property. The commit updates the `is_migrated` method in the
`migration_status.py` file to handle this new output format, where the
method now uses the `fetch` method to retrieve the `upgraded_to`
property for a given schema and table. If the property is missing, the
method will continue to the next table. The commit also updates tests
for the changes, including a manual test that has not been verified on a
staging environment. Changes have been made in the
`test_table_migrate.py` file, where rows with table properties have been
updated to return new data, and the `timestamp` function now sets the
`datetime.datetime` to a `FakeDate`. No new methods have been added;
only the existing `SHOW TBLPROPERTIES` output handling has changed in
scope. A sketch of the updated check appears after this list.
* Ignore whitelisted imports
([#1367](https://github.com/databrickslabs/ucx/issues/1367)). This
commit introduces a new class `DependencyResolver` that filters Python
import dependencies based on a whitelist, and updates to the
`DependencyGraph` class to support this new resolver. A new optional
parameter `resolver` has been added to the `NotebookMigrator` class
constructor and the `DependencyGraph` constructor. A new file
`whitelist.py` has been added, introducing classes and functions for
defining and managing a whitelist of Python packages based on their name
and version. These changes aim to improve control over which
dependencies are included in the dependency graph, contributing to a
more modular and maintainable codebase.
* Increased memory for UCX clusters
([#1366](https://github.com/databrickslabs/ucx/issues/1366)). This
release introduces an update to enhance memory configuration for UCX
clusters, addressing issue
[#1366](https://github.com/databrickslabs/ucx/issues/1366). The main
change involves a new method for selecting a node type with a minimum of
16GB of memory and local disk enabled, implemented in the policy.py file
of the installer module. This modification results in the `node_type_id`
parameter for creating clusters, instance pools, and pipelines now
requiring a minimum memory of 16 GB. This change is reflected in the
fixtures.py file, `ws.clusters.select_node_type()`,
`ws.instance_pools.create()`, and `pipelines.PipelineCluster` method
calls, ensuring that any newly created clusters, instance pools, and
pipelines benefit from the increased memory allocation. This update
improves the user experience by offering higher memory configurations
out-of-the-box for UCX-related workloads; see the sketch after this
list.
* Integrate detection of notebook dependencies
([#1338](https://github.com/databrickslabs/ucx/issues/1338)). In this
release, the NotebookMigrator has been updated to integrate dependency
graph construction for detecting notebook dependencies, addressing
issues [#1204](https://github.com/databrickslabs/ucx/issues/1204),
[#1286](https://github.com/databrickslabs/ucx/issues/1286), and
[#1326](https://github.com/databrickslabs/ucx/issues/1326). The changes
include modifying the
NotebookMigrator class to include the dependency graph and updating
relevant tests. A new file, python_linter.py, has been added for linting
Python code, which now detects calls to "dbutils.notebook.run" with
dynamic paths. The linter uses the ast module to parse the code and
locate nodes matching the specified criteria. The NotebookMigrator's
apply method has been updated to check for ObjectType.NOTEBOOK, loading
the notebook using the new _load_notebook method, and incorporating a
new _apply method for modifying the code in the notebook based on
applicable fixes. A new DependencyGraph class has been introduced to
build a graph of dependencies within the notebook, and several new
methods have been added, including _load_object,
_load_notebook_from_path, and revert. This release is co-authored by Cor
and aims to improve dependency management in the notebook system; a
minimal linting sketch appears after this list.
* Isolate grants computation when migrating tables
([#1233](https://github.com/databrickslabs/ucx/issues/1233)). In this
release, we have implemented a change to improve the reliability of
table migrations. Previously, grants to migrate were computed and
snapshotted outside the loop that iterates through tables to migrate,
which could lead to inconsistencies if the grants or migrated groups
changed during migration. Now, grants are re-computed for each table,
reducing the chance of such issues. We have introduced a new method
`_compute_grants` that takes in the table to migrate, ACL strategy, and
snapshots of all grants to migrate, migrated groups, and principal
grants. If `acl_strategy` is `None`, it defaults to an empty list. The
method checks each strategy in the ACL strategy list, extending the
`grants` list if the strategy is `AclMigrationWhat.LEGACY_TACL` or
`AclMigrationWhat.PRINCIPAL`. The `migrate_tables` method has been
updated to use this new method to compute grants. It first checks if
`acl_strategy` is `None`, and if so, sets it to an empty list. It then
calls `_compute_grants` with the current table, `acl_strategy`, and the
snapshots of all grants to migrate, migrated groups, and principal
grants. The computed grants are then used to migrate the table. This
change enhances the robustness of the migration process by isolating
grants computation for each table; a sketch of the per-table computation
appears after this list.
* Log more often from workflows
([#1348](https://github.com/databrickslabs/ucx/issues/1348)). In this
update, the log formatting for the debug log file in the "tasks.py" file
of the "databricks/labs/ucx/framework" module has been modified. The
`TimedRotatingFileHandler` function has been adjusted to rotate the log
file every minute, increasing the frequency of log file rotation from
every 10 minutes. Furthermore, the logging format has been enhanced to
include the time, level name, name, thread name, and message. These
improvements are in response to issue
[#1171](https://github.com/databrickslabs/ucx/issues/1171) and the
implementation of more frequent logging as per issue
[#1348](https://github.com/databrickslabs/ucx/issues/1348), ensuring
more detailed and up-to-date logs for debugging and analysis; a
standard-library sketch of the handler setup appears after this list.
* Make `databricks labs ucx assign-metastore` prompt for workspace if no
workspace id provided
([#1500](https://github.com/databrickslabs/ucx/issues/1500)). The
`databricks labs ucx assign-metastore` command has been updated to
accept an optional `workspace_id` parameter, prompting for the workspace
ID if it is not provided. Both the `assign-metastore` and
`show-all-metastores` commands have been made account-level only. The
functionality of the `migrate_local_code` function remains unchanged.
Error handling for etag issues related to default catalog settings has
been implemented. Unit tests and manual testing have been conducted on a
staging environment to verify the changes. The `show_all_metastores` and
`assign_metastore` commands have been updated to accept an optional
`workspace_id` parameter. The unit tests cover various scenarios,
including cases where a user has multiple metastores and needs to select
one, as well as cases where a default catalog name is provided and needs
to be selected. If no metastore is found, a `ValueError` will be raised.
The `metastore_id` and `workspace_id` flags in the yml file have been
renamed to `metastore-id` and `workspace-id`, respectively, and a new
`default-catalog` flag has been added.
* Modified update existing role to amend the AssumeRole statement rather
than rewriting it
([#1423](https://github.com/databrickslabs/ucx/issues/1423)). The
`_aws_role_trust_doc` method of the `aws.py` file has been updated to
return a dictionary object instead of a JSON string for the AWS IAM role
trust policy document. This change allows for more fine-grained control
when updating the trust relationships of an existing role in AWS IAM.
The `create_uc_role` method has been updated to pass the role trust
document to the `_create_role` method using the `_get_json_for_cli`
method. The `update_uc_trust_role` method has been refactored to
retrieve the existing role's trust policy document, modify its
`Statement` field, and replace it with the returned value of the
`_aws_role_trust_doc` method with the specified `external_id`.
Additionally, the `test_update_uc_trust_role` function in the
`test_aws.py` file has been updated to provide more detailed and
realistic mocked responses for the `command_call` function, including
handling the case where the `iam update-assume-role-policy` command is
called and returning a mocked response with a modified assume role
policy document that includes a new principal with an external ID
condition. These changes improve the testing capabilities of the
`test_update_uc_trust_role` function and provide more comprehensive
testing of the assume-role statement and role update functionality; a
sketch of the amendment appears after this list.
* Modifies dependency resolution logic to detect deprecated use of s3fs
package ([#1395](https://github.com/databrickslabs/ucx/issues/1395)). In
this release, the dependency resolution logic has been enhanced to
detect and handle deprecated usage of the s3fs package. A new function,
`_download_side_effect`, has been implemented to mock the download
behavior of the `workspace_client_mock` function, allowing for more
precise control during testing. The `DependencyResolver` class now
includes a list of `Advice` objects to inform developers about the use
of deprecated dependencies, without modifying the `DependencyGraph`
class. This change also introduces a new import statement for the s3fs
package, encouraging the adoption of up-to-date packages and practices
for improved system compatibility and maintainability. Additionally, a
unit test file, test_s3fs.py, has been added with test cases for various
import scenarios of s3fs to ensure proper detection and issuance of
deprecation warnings; a sketch of such an import check appears after
this list.
* Prompt for warehouse choice in uninstall if the original chosen
warehouse does not exist anymore
([#1484](https://github.com/databrickslabs/ucx/issues/1484)). In this
release, we have added a new method
`_check_and_fix_if_warehouse_does_not_exists()` to the
`WorkspaceInstaller` class, which checks if the specified warehouse in
the configuration still exists. If it doesn't, the method generates a
new configuration using a new `WorkspaceInstaller` object, saves it, and
updates the `_sql_backend` attribute with the new warehouse ID. This
change ensures that if the originally chosen warehouse no longer exists,
the user will be prompted to choose a new one during uninstallation.
Additionally, we have added a new import statement for
`ResourceDoesNotExist` exception and introduced a new function
`test_uninstallation_after_warehouse_is_deleted`, which simulates a
scenario where a warehouse has been manually deleted and checks if the
uninstallation process correctly resets the warehouse. The
`StatementExecutionBackend` object is initialized with a non-existent
warehouse ID, and the configuration and sql_backend objects are updated
accordingly. This test case ensures that the uninstallation process
handles the scenario where a warehouse has been manually deleted.
* Propagate source location information within the import package
dependency graph
([#1431](https://github.com/databrickslabs/ucx/issues/1431)). This
change modifies the dependency graph build logic within several modules
of the `databricks.labs.ucx` package to propagate source location
information within the import package dependency graph. A new
`ImportDependency` class now represents import sources, and a
`list_import_sources` method returns a list of `ImportDependency`
objects, which include import string and original source code file path.
A new `IncompatiblePackage` class is added to the `Whitelist` class,
returning `UCCompatibility.NONE` when checking for compatibility. The
`ImportChecker` class checks for deprecated imports and returns `Advice`
or `Deprecation` objects with location information. Unit tests have been
added to ensure the correct behavior of these changes. Additionally, the
`Location` class and a new test function for invalid processors have
been introduced.
* Scan `site-packages`
([#1411](https://github.com/databrickslabs/ucx/issues/1411)). A
SitePackages scanner has been implemented, enhancing the linkage of
module root names with the actual Python code within installed packages
using metadata. This development addresses issue
[#1410](https://github.com/databrickslabs/ucx/issues/1410) and is
connected to [#1202](https://github.com/databrickslabs/ucx/issues/1202).
New functionalities include user documentation, a CLI command, a
workflow, and a table, accompanied by modifications to an existing
command and workflow, as well as alterations to another table. Unit
tests have been added to ensure the feature's proper functionality. In
the diff, a new unit test file for `site_packages.py` has been added,
checking `databrix` compatibility, which is reported as incompatible.
This enhancement bolsters the user experience by providing more detailed
insights into installed packages; a standard-library sketch of the
metadata linkage appears after this list.
* Select DISTINCT job_run_id
([#1352](https://github.com/databrickslabs/ucx/issues/1352)). A
modification has been implemented to optimize the SQL query for
accessing log data: a subquery now retrieves all distinct `job_run_id`
values instead of a single one. The enhanced query selects the message
field from the inventory.logs table, filtering on `job_run_id` values
that match the latest timestamp within the same table. This change
allows multiple `job_run_id` values to correlate with the same
timestamp, giving a more complete view of the logs at a given moment and
ensuring more precise retrieval of log data; a sketch of the reshaped
query appears after this list.
* Support table migration to Unity Catalog in Python code
([#1210](https://github.com/databrickslabs/ucx/issues/1210)). This
release introduces changes to the Python codebase that enhance the
SparkSql linter/fixer to support migrating Spark SQL table references to
Unity Catalog. The release includes modifications to existing commands,
specifically `databricks labs ucx migrate_local_code`, and the addition
of unit tests. The `SparkSql` class has been updated to support a new
`index` parameter, allowing for migration support. New classes including
`QueryMatcher`, `TableNameMatcher`, `ReturnValueMatcher`, and
`SparkMatchers` have been added to hold various matchers for different
spark methods. The release also includes modifications to existing
methods for caching, creating, getting, refreshing, and un-caching
tables, as well as updates to the `listTables` method to reflect the new
format. The `saveAsTable` and `register` methods have been updated to
handle variable and f-string arguments for the table name. The
`databricks labs ucx migrate_local_code` command has been modified to
handle spark.sql function calls that include a table name as a parameter
and suggest the changes needed to migrate to the new Unity Catalog
format; a before/after sketch appears after this list. Integration tests
are still needed.
* When building dependency graph, raise problems with problematic
dependencies
([#1529](https://github.com/databrickslabs/ucx/issues/1529)). A new
`DependencyProblem` class has been added to the
databricks.labs.ucx.source_code.dependencies module to handle issues
encountered during dependency graph construction. This class is used to
raise issues when problematic dependencies are encountered during the
build of the dependency graph. The `build_dependency_graph` method of
the `SourceContainer` abstract class now accepts a `problem_collector`
parameter, which is a callable function that collects and handles
dependency problems. Instead of raising `ValueError` exceptions, the
`DependencyProblem` class is used to collect and store information about
the issues. This change improves error handling and diagnostic
information during dependency graph construction. Relevant user
documentation, a new CLI command, and a new workflow have been added,
along with modifications to existing commands and workflows. Unit tests
have been added to verify the new functionality; a sketch of the
collector pattern appears after this list.
* WorkspacePath to implement `pathlib.Path` API
([#1509](https://github.com/databrickslabs/ucx/issues/1509)). A new
file, 'wspath.py', has been added to the `mixins` directory of the
'databricks.labs.ucx' package, implementing the custom Path object
'WorkspacePath'. This subclass of 'pathlib.Path' provides additional
methods and functionality for the Databricks Workspace, including
'cwd()', 'home()', 'scandir()', and 'listdir()'. `WorkspacePath`
interacts with the Databricks Workspace API for operations such as
checking if a file/directory exists, creating and deleting directories,
and downloading files. The `WorkspacePath` class has been updated to
implement 'pathlib.Path' API for a more intuitive and consistent
interface when working with file and directory paths. The class now
includes methods like 'absolute()', 'exists()', 'joinpath()', 'parent',
and supports the `with` statement for thread-safe code. A new test file
'test_wspath.py' has been added for the WorkspacePath mixin. New methods
like 'expanduser()', 'as_fuse()', 'as_uri()', 'replace()',
'write_text()', 'write_bytes()', 'read_text()', and 'read_bytes()' have
also been added. 'mkdir()' and 'rmdir()' now raise errors when called on
non-absolute paths and non-empty directories, respectively; a usage
sketch appears after this list.
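
For the instance-pool policy fix above, here is a minimal sketch,
assuming a flattened policy dictionary; the function and key names are
illustrative, not the actual UCX code:

```python
def apply_instance_pool(policy: dict, instance_pool_id: str | None) -> dict:
    """Prefer the instance pool over explicit node type and availability."""
    if not instance_pool_id:
        return policy
    policy["instance_pool_id"] = {"type": "fixed", "value": instance_pool_id}
    # the pool already pins the node type, so drop it from the policy
    policy.pop("node_type_id", None)
    # availability is managed by the pool on every cloud
    for key in ("aws_attributes.availability",
                "azure_attributes.availability",
                "gcp_attributes.availability"):
        policy.pop(key, None)
    return policy
```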
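For the `RegexSubStrategy` edge case, a sketch of the fallback, assuming
simple group objects with `display_name` and `external_id` attributes
(illustrative, not the real `GroupManager` API):

```python
import re

def find_account_group(ws_group, account_groups, pattern: str, replace: str):
    candidate = re.sub(pattern, replace, ws_group.display_name)
    by_name = {g.display_name: g for g in account_groups}
    if candidate in by_name:
        return by_name[candidate]
    # fallback: match on external ID when regex substitution yields no hit
    by_external_id = {g.external_id: g for g in account_groups}
    return by_external_id.get(ws_group.external_id)
```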
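For the partitioned CSV/JSON scan fix, an illustrative sketch of how a
partitioned table can be recognised on a local file tree; the real
`TablesInMounts` logic works against mounted storage and is more
involved:

```python
import os

def detect_partitioned_format(root: str) -> str | None:
    """Return CSV or JSON if leaf files of that format sit under key=value dirs."""
    for dirpath, _dirnames, filenames in os.walk(root):
        # partition directories look like .../year=2024/month=04/...
        if "=" not in os.path.basename(dirpath):
            continue
        for name in filenames:
            if name.startswith("_"):  # skip _SUCCESS and similar markers
                continue
            ext = os.path.splitext(name)[1].lower()
            if ext in (".csv", ".json"):
                return ext.lstrip(".").upper()
    return None
```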
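For the `SHOW TBLPROPERTIES` change, a hedged sketch of the updated
check; the `backend.fetch` call and property name follow the prose, but
the exact row shape is an assumption:

```python
def is_migrated(backend, schema: str, table: str) -> bool:
    for row in backend.fetch(f"SHOW TBLPROPERTIES {schema}.{table} ('upgraded_to')"):
        value = str(row["value"])
        # newer runtimes return an explanatory message instead of a blank row
        if "does not have property" in value:
            return False
        return True
    return False
```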
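For the cluster memory increase, the node selection can be expressed
with the Databricks SDK as below; the parameters mirror the prose, and
workspace authentication is assumed to be configured:

```python
from databricks.sdk import WorkspaceClient

ws = WorkspaceClient()
# pick the smallest available node type with local disk and >= 16 GB memory
node_type_id = ws.clusters.select_node_type(local_disk=True, min_memory_gb=16)
```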
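For the notebook dependency detection, a minimal linting sketch using
the standard `ast` module, flagging `dbutils.notebook.run` calls whose
path argument is not a plain string literal (illustrative, not the
actual `python_linter.py`):

```python
import ast

code = 'dbutils.notebook.run(f"/Shared/{name}", 60)'

for node in ast.walk(ast.parse(code)):
    if not isinstance(node, ast.Call) or not node.args:
        continue
    if ast.unparse(node.func) == "dbutils.notebook.run":
        path = node.args[0]
        if not isinstance(path, ast.Constant):  # f-string, variable, etc.
            print(f"dynamic notebook path at line {node.lineno}")
```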
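For the isolated grants computation, a sketch of the per-table logic;
the enum and the grant filtering are assumptions shaped by the prose:

```python
from enum import Enum, auto

class AclMigrationWhat(Enum):  # stand-in for the real enum
    LEGACY_TACL = auto()
    PRINCIPAL = auto()

def compute_grants(table, acl_strategy, legacy_grants, principal_grants):
    acl_strategy = acl_strategy or []
    grants = []
    # re-computed per table, so mid-migration changes are picked up
    if AclMigrationWhat.LEGACY_TACL in acl_strategy:
        grants.extend(g for g in legacy_grants if g.table == table.name)
    if AclMigrationWhat.PRINCIPAL in acl_strategy:
        grants.extend(g for g in principal_grants if g.table == table.name)
    return grants
```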
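For the more frequent workflow logging, a standard-library sketch of the
handler configuration described above:

```python
import logging
from logging.handlers import TimedRotatingFileHandler

# rotate the debug log every minute instead of every 10 minutes
handler = TimedRotatingFileHandler("debug.log", when="M", interval=1)
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s [%(name)s][%(threadName)s] %(message)s"
))
logging.getLogger().addHandler(handler)
```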
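For the AssumeRole amendment, a sketch of appending to the `Statement`
field rather than rewriting the whole document; the principal ARN and
external ID are placeholders:

```python
import json

def amend_trust_policy(existing: dict, principal_arn: str, external_id: str) -> str:
    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": principal_arn},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
    }
    # amend the existing statements instead of replacing the whole policy
    existing.setdefault("Statement", []).append(statement)
    return json.dumps(existing)  # suitable for `aws iam update-assume-role-policy`
```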
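For the s3fs deprecation detection, an illustrative import check with
the `ast` module; the advice strings are a stand-in for the real
`Advice` objects:

```python
import ast

DEPRECATED = {"s3fs"}

def find_deprecated_imports(source: str) -> list[str]:
    advices = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        for name in names:
            if name in DEPRECATED:
                advices.append(f"line {node.lineno}: deprecated use of {name}")
    return advices
```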
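For the `site-packages` scanner, the module-to-distribution linkage it
describes is available in the standard library from Python 3.10 onwards,
shown here for illustration:

```python
from importlib.metadata import packages_distributions

mapping = packages_distributions()
print(mapping.get("yaml"))      # e.g. ['PyYAML'] when installed
print(mapping.get("databrix"))  # None: unknown, hence incompatible
```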
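For the `DISTINCT job_run_id` change, a sketch of the reshaped query as
described; the table and column names come from the prose, and the
production query may differ:

```python
QUERY = """
SELECT message
FROM inventory.logs
WHERE job_run_id IN (
    SELECT DISTINCT job_run_id
    FROM inventory.logs
    WHERE timestamp = (SELECT MAX(timestamp) FROM inventory.logs)
)
"""
```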
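For the Spark SQL migration support, a before/after illustration of the
kind of rewrite the linter suggests; the three-level name is a made-up
example, and `spark` is assumed to be a live session in a notebook:

```python
spark.sql("SELECT * FROM sales")               # before: implicit hive_metastore
spark.sql("SELECT * FROM main.default.sales")  # after: catalog.schema.table in UC
```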
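For the dependency-problem collection, a sketch of the collector
pattern; the class fields and the check itself are illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DependencyProblem:
    code: str
    message: str

def build_dependency_graph(paths: list[str],
                           problem_collector: Callable[[DependencyProblem], None]) -> None:
    for path in paths:
        # collect problems instead of raising ValueError mid-build
        if not path.endswith((".py", ".ipynb")):
            problem_collector(DependencyProblem(
                "unsupported-file", f"cannot process dependency: {path}"))

problems: list[DependencyProblem] = []
build_dependency_graph(["notebook.ipynb", "data.bin"], problems.append)
print(problems)  # one problem collected for data.bin
```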
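Finally, for `WorkspacePath`, a usage sketch built from the methods
listed above; the import path and constructor signature are assumptions
taken from the prose:

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.ucx.mixins.wspath import WorkspacePath

ws = WorkspaceClient()
folder = WorkspacePath(ws, "/Users/someone@example.com/reports")
if not folder.exists():
    folder.mkdir()  # raises on non-absolute paths
note = folder.joinpath("summary.txt")
note.write_text("hello from pathlib-style workspace access")
print(note.read_text())
```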

Dependency updates:

* Bump actions/checkout from 3 to 4
([#1191](https://github.com/databrickslabs/ucx/pull/1191)).
* Bump actions/setup-python from 4 to 5
([#1189](https://github.com/databrickslabs/ucx/pull/1189)).
* Bump codecov/codecov-action from 1 to 4
([#1190](https://github.com/databrickslabs/ucx/pull/1190)).
* Bump softprops/action-gh-release from 1 to 2
([#1188](https://github.com/databrickslabs/ucx/pull/1188)).
* Bump databricks-sdk from 0.23.0 to 0.24.0
([#1223](https://github.com/databrickslabs/ucx/pull/1223)).
* Updated databricks-labs-lsql requirement from ~=0.3.0 to >=0.3,<0.5
([#1387](https://github.com/databrickslabs/ucx/pull/1387)).
* Updated sqlglot requirement from ~=23.9.0 to >=23.9,<23.11
([#1409](https://github.com/databrickslabs/ucx/pull/1409)).
* Updated sqlglot requirement from <23.11,>=23.9 to >=23.9,<23.12
([#1486](https://github.com/databrickslabs/ucx/pull/1486)).
nfx added a commit that referenced this issue May 1, 2024
## Changes
Implement resolvers as a stack
Implements an initial version of SysPathProvider
Eliminates horrible hacks introduced in previous PRs 

### Linked issues
#1202
Resolves #1499
Resolves #1421


### Functionality 

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`

### Tests

- [ ] manually tested
- [x] added unit tests
- [ ] added integration tests
- [ ] verified on staging environment (screenshot attached)

@nfx this PR does not address the tactical `problem_collector`
parameter, I have created a specific issue for that
#1559

---------

Co-authored-by: Serge Smertin <serge.smertin@databricks.com>
nfx pushed a commit that referenced this issue May 3, 2024
## Changes
Add support for cwd to SysPathProvider
Implement LocalNotebookLoader
Provide tests for local files and folders

### Linked issues
#1202
Resolves #1499
Resolves #1287

replaces #1593

---------

Co-authored-by: Eric Vergnaud <eic.vergnaud@databricks.com>
@nfx nfx closed this as completed in #1633 May 7, 2024
nfx added a commit that referenced this issue May 7, 2024
Simulate loading of local files or notebooks after manipulation of `sys.path` (#1633)

## Changes
Update PathLookup in sequence during building of dependency graph

### Linked issues
closes #1202
Resolves #1468

### Functionality 

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`

### Tests
- [ ] manually tested
- [x] added unit tests
- [ ] added integration tests
- [ ] verified on staging environment (screenshot attached)

---------

Co-authored-by: Serge Smertin <serge.smertin@databricks.com>
@nfx nfx mentioned this issue May 8, 2024
nfx added a commit that referenced this issue May 8, 2024
* Added DBSQL queries & dashboard migration ([#1532](#1532)). The Databricks Labs Unified Command Extensions (UCX) project has been updated with two new experimental commands: `migrate-dbsql-dashboards` and `revert-dbsql-dashboards`. These commands are designed for migrating and reverting the migration of Databricks SQL dashboards in the workspace. The `migrate-dbsql-dashboards` command transforms all Databricks SQL dashboards in the workspace after table migration, tagging migrated dashboards and queries with `migrated by UCX` and backing up original queries. The `revert-dbsql-dashboards` command returns migrated Databricks SQL dashboards to their original state before migration. Both commands accept a `--dashboard-id` flag for migrating or reverting a specific dashboard. Additionally, two new functions, `migrate_dbsql_dashboards` and `revert_dbsql_dashboards`, have been added to the `cli.py` file, and new classes have been added to interact with Redash for data visualization and querying. The `make_dashboard` fixture has been updated to enhance testing capabilities, and new unit tests have been added for migrating and reverting DBSQL dashboards.
* Added UDFs assessment ([#1610](#1610)). A User Defined Function (UDF) assessment feature has been introduced, addressing issue [#1610](#1610). A new method, DESCRIBE_FUNCTION, has been implemented to retrieve detailed information about UDFs, including function description, input parameters, and return types. This method has been integrated into existing test cases, enhancing the validation of UDF metadata and associated privileges, and ensuring system reliability. The UDF constructor has been updated with a new parameter 'comment', initially left blank in the test function. Additionally, two new columns, `success` and 'failures', have been added to the udf table in the inventory database to store assessment data for UDFs. The UdfsCrawler class has been updated to return a list of UDF objects, and the assertions in the test have been updated accordingly. Furthermore, a new SQL file has been added to calculate the total count of UDFs in the $inventory.udfs table, with a widget displaying this information as a counter visualization named "Total UDF Count".
* Added `databricks labs ucx create-missing-principals` command to create the missing UC roles in AWS ([#1495](#1495)). The `databricks labs ucx` tool now includes a new command, `create-missing-principals`, which creates missing Universal Catalog (UC) roles in AWS for S3 locations that lack a UC compatible role. This command is implemented using `IamRoleCreation` from `databricks.labs.ucx.aws.credentials` and updates `AWSRoleAction` with the corresponding `role_arn` while adding `AWSUCRoleCandidate`. The new command only supports AWS and does not affect Azure. The existing `migrate_credentials` function has been updated to handle Azure Service Principals migration. Additionally, new classes and methods have been added, including `AWSUCRoleCandidate` in `aws.py`, and `create_missing_principals` and `list_uc_roles` methods in `access.py`. The `create_uc_roles_cli` method in `access.py` has been refactored and renamed to `list_uc_roles`. New unit tests have been implemented to test the functionality of `create_missing_principals` for AWS and Azure, as well as testing the behavior when the command is not approved.
* Added baseline for workflow linter ([#1613](#1613)). This change introduces the `WorkflowLinter` class in the `application.py` file of the `databricks.labs.ucx.source_code.jobs` package. The class is used to lint workflows by checking their dependencies and ensuring they meet certain criteria, taking in arguments such as `workspace_client`, `dependency_resolver`, `path_lookup`, and `migration_index`. Several properties have been moved from `dependency_resolver` to the `CliContext` class, and the `NotebookLoader` class has been moved to a new location. Additionally, several classes and methods have been introduced to build a dependency graph, resolve dependencies, and manage allowed dependencies, site packages, and supported programming languages. The `generic` and `redash` modules from `databricks.labs.ucx.workspace_access` and the `GroupManager` class from `databricks.labs.ucx.workspace_access.groups` are used. The `VerifyHasMetastore`, `UdfsCrawler`, and `TablesMigrator` classes from `databricks.labs.ucx.hive_metastore` and the `DeployedWorkflows` class from `databricks.labs.ucx.installer.workflows` are also used. This commit is part of a larger effort to improve workflow linting and addresses several related issues and pull requests.
* Added linter to check for RDD use and JVM access ([#1606](#1606)). A new `AstHelper` class has been added to provide utility functions for working with abstract syntax trees (ASTs) in Python code, including methods for extracting attribute and function call node names. Additionally, a linter has been integrated to check for RDD use and JVM access, utilizing the `AstHelper` class, which has been moved to a separate module. A new file, 'spark_connect.py', introduces a linter with three matchers to ensure conformance to best practices and catch potential issues early in the development process related to RDD usage and JVM access. The linter is environment-aware, accommodating shared cluster and serverless configurations, and includes new test methods to validate its functionality. These improvements enhance codebase quality, promote reusability, and ensure performance and stability in Spark cluster environments.
* Added non-Delta DBFS table migration (What.DBFS_ROOT_NON_DELTA) in migrate_table workflow ([#1621](#1621)). The `migrate_tables` workflow in `workflows.py` has been enhanced to support a new scenario, DBFS_ROOT_NON_DELTA, which covers non-delta tables stored in DBFS root from the Hive Metastore to the Unity Catalog using CTAS. Additionally, the ACL migration strategy has been updated to include the AclMigrationWhat.PRINCIPAL strategy. The `migrate_external_tables_sync`, `migrate_dbfs_root_delta_tables`, and `migrate_views` tasks now incorporate the new ACL migration strategy. These changes have been thoroughly tested through unit tests and integration tests, ensuring the continued functionality of the existing workflow while expanding its capabilities.
* Added "seen tables" feature ([#1465](#1465)). The `seen tables` feature has been introduced, allowing for better handling of existing tables in the hive metastore and supporting their migration to UC. This enhancement includes the addition of a `snapshot` method that fetches and crawls table inventory, appending or overwriting records based on assessment results. The `_crawl` function has been updated to check for and skip existing tables in the current workspace. New methods such as '_get_tables_paths_from_assessment', '_overwrite_records', and `_get_table_location` have been included to facilitate these improvements. In the testing realm, a new test `test_mount_listing_seen_tables` has been implemented, replacing 'test_partitioned_csv_jsons'. This test checks the behavior of the TablesInMounts class when enumerating tables in mounts for a specific context, accounting for different table formats and managing external and managed tables. The diff modifies the 'locations.py' file in the databricks/labs/ucx directory, related to the hive metastore.
* Added support for `migrate-tables-ctas` workflow in the `databricks labs ucx migrate-tables` CLI command ([#1660](#1660)). This commit adds support for the `migrate-tables-ctas` workflow in the `databricks labs ucx migrate-tables` command, which checks for external tables that cannot be synced and prompts the user to run the `migrate-tables-ctas` workflow. Two new methods, `test_migrate_external_tables_ctas(ws)` and `migrate_tables(ws, prompts, ctx=ctx)`, have been added. The first method checks if the `migrate-external-tables-ctas` workflow is called correctly, while the second method runs the workflow after prompting the user. The method `test_migrate_external_hiveserde_tables_in_place(ws)` has been modified to test if the `migrate-external-hiveserde-tables-in-place-experimental` workflow is called correctly. No new methods or significant modifications to existing functionality have been made in this commit. The changes include updated unit tests and user documentation. The target audience for this feature are software engineers who adopt the project.
* Added support for migrating external location permissions from interactive cluster mounts ([#1487](#1487)). This commit adds support for migrating external location permissions from interactive cluster mounts in Databricks Labs' UCX project, enhancing security and access control. It retrieves interactive cluster locations and user mappings from the AzureACL class, granting necessary permissions to each cluster principal for each location. The existing `databricks labs ucx` command is modified, with the addition of the new method `create_external_locations` and thorough testing through manual, unit, and integration tests. This feature is developed by vuong-nguyen and Vuong and addresses issues [#1192](#1192) and [#1193](#1193), ensuring a more robust and controlled user experience with interactive clusters.
* Added uber principal spn details in SQL warehouse data access configuration when creating uber-SPN ([#1631](#1631)). In this release, we've implemented new features to enhance the security and control over data access during the migration process for the SQL warehouse data access configuration. The `databricks labs ucx create-uber-principal` command now creates a service principal with read-only access to all the storage used by tables in the workspace. The UCX Cluster Policy and SQL Warehouse data access configuration will be updated to use this service principal for migration workflows. A new method, `_update_sql_dac_with_instance_profile`, has been introduced in the `access.py` file to update the SQL data access configuration with the provided AWS instance profile, ensuring a more streamlined management of instance profiles within the SQL data access configuration during the creation of an uber service principal (SPN). Additionally, new methods and tests have been added to the sql module of the databricks.sdk.service package to improve Azure resource permissions, handling different scenarios related to creating a global SPN in the presence or absence of various conditions, such as storage, cluster policies, or secrets.
* Addressed issue with disabled features in certain regions ([#1618](#1618)). In this release, we have implemented improvements to address an issue where certain features were disabled in specific regions. We have added error handling when listing serving endpoints to raise a NotFound error if a feature is disabled, preventing the code from failing silently and providing better error messages. A new method, test_serving_endpoints_not_enabled, has been added, which creates a mock WorkspaceClient and raises a NotFound error if serving endpoints are not enabled for a shard. The GenericPermissionsSupport class uses this method to get crawler tasks, and if serving endpoints are not enabled, an error message is logged. These changes increase the reliability and robustness of the codebase by providing better error handling and messaging for this particular issue. Additionally, the change includes unit tests and manual testing to ensure the proper functioning of the new features.
* Aggregate UCX output across workspaces with CLI command ([#1596](#1596)). A new `report-account-compatibility` command has been added to the `databricks labs ucx` tool, enabling users to evaluate the compatibility of an entire Azure Databricks account with UCX (Unified Client Context). This command generates a readiness report for an Azure Databricks account, specifically for evaluating compatibility with UCX, by querying various aspects of the account such as clusters, configurations, and data formats. It uses Azure CLI authentication with AAD tokens for authentication and accepts a profile as an argument. The output includes warnings for workspaces that do not have UCX installed, and provides information about unsupported cluster types, unsupported configurations, data format compatibility, and more. Additionally, a new feature has been added to aggregate UCX output across workspaces in an account through a new CLI command, "report-account-compatibility", which can be run at the account level. The existing `manual-workspace-info` command remains unchanged. These changes will help assess the readiness and compatibility of an Azure Databricks account for UCX integration and simplify the process of checking compatibility across an entire account.
* Assert if group name is in cluster policy ([#1665](#1665)). In this release, we have implemented a change to ensure the presence of the display name of a specific workspace group (ws_group_a) in the cluster policy. This is to prevent a key error previously encountered. The cluster policy is now loaded as a dictionary, and the group name is checked to confirm its presence. If the group is not found, a message is raised alerting users. Additionally, the permission level for the group is verified to ensure it is set to CAN_USE. No new methods have been added, and existing functionality remains unchanged. The test file test_ext_hms.py has been updated to include the new assertion and has undergone both unit tests and manual testing to ensure proper implementation. This change is intended for software engineers who adopt the project.
* Automatically retrying with `auth_type=azure-cli` when constructing `workspace_clients` on Azure ([#1650](#1650)). This commit introduces automatic retrying with 'auth_type=azure-cli' when constructing `workspace_clients` on Azure, resolving TODO items for `AccountWorkspaces` and adding relevant suggestions in 'troubleshooting.md'. It closes issues [#1574](#1574) and [#1430](#1430), and includes new methods for generating readiness reports in `AccountAggregate` and testing the `get_accessible_workspaces` method in 'test_workspaces.py'. User documentation has been updated and the changes have been manually verified in a staging environment. For macOS and Windows users, explicit auth type settings are required for command line utilities.
* Changes to identify service principal with custom roles on Azure storage account for principal-prefix-access ([#1576](#1576)). This release introduces several enhancements to the identification of service principals with custom roles on Azure storage accounts for principal-prefix-access. New methods such as `_get_permission_level`, `_get_custom_role_privilege`, and `_get_role_privilege` have been added to improve the functionality of the module. Additionally, two new classes, AzureRoleAssignment and AzureRoleDetails, have been added to enable more detailed management and access control for custom roles on Azure storage accounts. The 'test_access.py' file has been updated to include tests for saving custom roles in Azure storage accounts and ensuring the correct identification of service principals with custom roles. A new unit test function, test_role_assignments_custom_storage(), has also been added to verify the behavior of custom roles in Azure storage accounts. Overall, these changes provide a more efficient and fine-grained way to manage and control custom roles on Azure storage accounts.
* Clarified unsupported config in compute crawler ([#1656](#1656)). In this release, we have made significant changes to clarify and improve the handling of unsupported configurations in our compute crawler related to the Hive metastore. We have expanded error messages for unsupported configurations and provided detailed recommendations for remediation. Additionally, we have added relevant user documentation and manually tested the changes. The changes include updates to the configuration for external Hive metastore and passthrough security model for Unity Catalog, which are incompatible with the current configurations. We recommend removing or altering the configs while migrating existing tables and views using UCX or other compatible clusters, and mapping the passthrough security model to a security model compatible with Unity Catalog. The code modifications include the addition of new methods for checking cluster init script and Spark configurations, as well as refining the error messages for unsupported configurations. We also added a new assertion in the `test_cluster_with_multiple_failures` unit test to check for the presence of a specific message regarding the use of the `spark.databricks.passthrough.enabled` configuration. This release is not yet verified on the staging environment.
* Created a unique default schema when External Hive Metastore is detected ([#1579](#1579)). A new default database `ucx` is introduced for storing inventory in the hive metastore, with a suffix consisting of the workspace's client ID to ensure uniqueness when an external hive metastore is detected. The `has_ext_hms()` method is added to the `InstallationPolicy` class to detect external HMS and thereby create a unique default schema. The `_prompt_for_new_installation` method's default value for the `Inventory Database stored in hive_metastore` prompt is updated to use the new default database name, modified to include the workspace's client ID if external HMS is detected. Additionally, a test function `test_save_config_ext_hms` is implemented to demonstrate the `WorkspaceInstaller` class's behavior with external HMS, creating a unique default schema for improved system functionality and customization. This change is part of issue [#1579](#1579).
* Extend service principal migration to create storage credentials for access connectors created for each storage account ([#1426](#1426)). This commit extends the service principal migration to create storage credentials for access connectors associated with each storage account, resolving issues [#1384](#1384) and [#875](#875). The update includes modifications to the existing `databricks labs ucx` command for creating access connectors, adds a new CLI command for creating storage credentials, and updates the documentation. A new workflow has been added for creating credentials for access connectors and service principals, and updates have been made to existing workflows. The commit includes manual, unit, and integration tests, and no new or modified methods are specified in the diff. The focus is on the feature description and its impact on the project's functionality. The commit has been co-authored by Serge Smertin and vuong-nguyen.
* Suggest users to create Access Connector(s) with Managed Identity to access Azure Storage Accounts behind firewall ([#1589](#1589)). In this release, we have introduced a new feature to improve access to Azure Storage Accounts that are protected by firewalls. Due to limitations with service principals in such scenarios, we have developed Access Connectors with Managed Identities for more reliable connectivity. This change includes updates to the 'credentials.py' file, which introduces new methods for managing the migration of service principals to Access Connectors using Managed Identities. Users are warned that migrating to this new feature may cause issues when transitioning to UC, and are advised to validate external locations after running the migration command. This update enhances the security and functionality of the system, providing a more dependable method for accessing Azure Storage Accounts protected by firewalls.
* Fixed catalog/schema grants when tables with same source schema have different target schemas ([#1581](#1581)). In this release, we have implemented a fix to address an issue where catalog/schema grants were not being handled correctly when tables with the same source schema had different target schemas. This was causing problems with granting appropriate permissions to users. We have modified the prepare_test function to include an additional test case with a different target schema for the same source table. Furthermore, we have updated the test_catalog_schema_acl function to ensure that grants are being created correctly for all catalogs, schemas, and tables. We have also added an extra query to grant use schema permissions for catalog2.schema3 to user1. Additionally, we have introduced a new `SchemaInfo` class to store information about catalogs and schemas, and refactored the `_get_database_source_target_mapping` method to return a dictionary that maps source databases to a list of `SchemaInfo` objects instead of a single dictionary. These changes ensure that grants are being handled correctly for catalogs, schemas, and tables, even when tables with the same source schema have different target schemas. This will improve the overall functionality and reliability of the system, making it easier for users to manage their catalogs and schemas.
* Fixed Spark configuration parameter referencing secret ([#1635](#1635)). In this release, the code related to the Spark configuration parameter reference for a secret has been updated in the `access.py` file, specifically within the `_update_cluster_policy_definition` method. The change modifies the method to retrieve the OAuth client secret for a given storage account using an f-string to reference the secret, replacing the previous concatenation operator. This enhancement is aimed at improving the readability and maintainability of the code while preserving its functionality. Furthermore, the commit includes additional changes, such as new methods `test_create_global_spn` and "cluster_policies.edit", which may be related to this fix. These changes address the secret reference issue, ensuring secure access control and improved integration, particularly with the Spark configuration, benefiting engineers utilizing this project for handling sensitive information and managing clusters securely and effectively.
* Fixed `migration-locations` and `assign-metastore` definitions in `labs.yml` ([#1627](#1627)). In this release, the `migration-locations` command in the `labs.yml` file has been updated to include new flags `subscription-id` and `aws-profile`. The `subscription-id` flag allows users to specify the subscription to scan the storage account in, and the `aws-profile` flag allows for authentication using a specified AWS Profile. The `assign-metastore` command has also been updated with a new description: "Enable Unity Catalog features on a workspace by assigning a metastore to it." The `is_account_level` parameter remains unchanged, and the new optional flag `workspace-id` has been added, allowing users to specify the Workspace ID to assign a metastore to. This change enhances the functionality of the `migration-locations` and `assign-metastore` commands, providing more options for users to customize their storage scanning and metastore assignment processes. The `migration-locations` and `assign-metastore` definitions in the `labs.yml` file have been fixed in this release.
* Fixed prompt for using external metastore ([#1668](#1668)). A fix has been implemented in the `create` function of the `policy.py` file to correctly prompt users for using an external metastore. Previously, a missing period and space in the prompt caused potential confusion. The updated prompt now includes a clarifying sentence and the `_prompts.confirm` method has been modified to check if the user wants to set UCX to connect to an external metastore in two scenarios: when one or more cluster policies are set up for an external metastore, and when the workspace warehouse is configured for an external metastore. If the user chooses to set up an external metastore, an informational message will be recorded in the logger. This change ensures clear and precise communication with users during the external metastore setup process.
* Fixed storage account network ACLs retrieved from properties ([#1620](#1620)). This release includes a fix to the storage account network ACLs retrieval in the open-source library, addressing issue [#1](#1). Previously, the network ACLs were being retrieved from an incorrect location, but this commit corrects that by obtaining the network ACLs from the storage account's properties.networkAcls field. The `StorageAccount` class has been updated to modify the way default network action is retrieved, with a new value `Unknown` added to the previous values `Deny` and "Allow". The `from_raw_resource` class method has also been updated to retrieve the default network action from the `properties.networkAcls` field instead of the `networkAcls` field. This change may affect any functionality that relies on network ACL information and impacts the existing command `databricks labs ucx ...`. Relevant tests, including a new test `test_azure_resource_storage_accounts_list_non_zero`, have been added and manually and unit tested to ensure the fix is functioning correctly.
* Fully refresh table migration status in table migration workflow ([#1630](#1630)). This release introduces a new method, `index_full_refresh()`, to the table migration workflow for fully refreshing the migration status, addressing an oversight from a previous commit ([#1623](#1623)) and resolving issue [#1628](#1628). The new method resets the `_migration_status_refresher` before computing the index, ensuring the latest migration status is used for determining whether view dependencies have been migrated. The `index()` method was previously used to refresh the migration status, but it only provided a partial refresh. With this update, `index_full_refresh()` is utilized for a comprehensive refresh, affecting the `refresh_migration_status` task in multiple workflows such as `migrate_views`, `scan_tables_in_mounts_experimental`, and others. This change ensures a more accurate migration report, presenting the updated migration status.
* Ignore existing corrupted installations when refreshing ([#1605](#1605)). A recent update has enhanced the error handling during the loading of installations in the `install.py` file. Specifically, the `installation.load` function now handles certain errors, including `PermissionDenied`, `SerdeError`, `ValueError`, and `AttributeError`, by logging a warning message and skipping the corrupted installation instead of raising an error. This behavior has been incorporated into both the `configure` and `_check_inventory_database_exists` functions, allowing the installation process to continue even in the presence of issues with existing installations, while providing improved error messages. This change resolves issue [#1601](#1601) and introduces a new test case for a corrupted installation configuration, as well as an updated existing test case for `test_save_config` that includes a mock installation.
* Improved exception handling ([#1584](#1584)). In this release, the exception handling during the upload of a wheel file to DBFS has been significantly improved. Previously, only PermissionDenied errors were caught and handled. Now, both BadRequest and PermissionDenied exceptions will be caught and logged as a warning. This change enhances the robustness of the code by handling a wider range of exceptions during the upload process. In addition, cluster overrides have been configured and DBFS write permissions have been set up. The specific changes made to the code include updating the import statement for NotFound to include BadRequest and modifying the except block in the _get_init_script_data method to catch both NotFound and BadRequest exceptions. These improvements ensure that the code can handle more types of errors, providing more helpful error messages and preventing crash scenarios, thereby enhancing the reliability and robustness of the code.
* Improved exception handling for `migrate_acl` ([#1590](#1590)). In this release, the `migrate_acl` functionality has been enhanced to improve exception handling, addressing a flakiness issue in the `test_migrate_managed_tables_with_acl` test. Previously, unhandled `not found` exceptions during parallel test execution caused the flakiness. This release resolves this issue ([#1549](#1549)) by introducing error handling in the `test_migrate_acls_should_produce_proper_queries` test. A controlled error is now introduced to simulate a failed grant migration due to a `TABLE_OR_VIEW_NOT_FOUND` error. This enhancement allows for precise testing of error handling and logging mechanisms when migration fails for specific objects, ensuring a more reliable testing environment for the `migrate_acl` functionality.
* Improved reliability of table migration status refresher ([#1623](#1623)). This release introduces improvements to the table migration status refresher in the open-source library, enhancing its reliability and robustness. The `table_migrate` function has been updated to ensure that the table migration status is always reset when requesting the latest snapshot, addressing issues [#1623](#1623), [#1622](#1622), and [#1615](#1615). Additionally, the function now handles `NotFound` errors when refreshing migration status. The `get_seen_tables` function has been modified to convert the returned iterator to a list and raise a `NotFound` exception if the schema does not exist, which is then caught and logged as a warning. Furthermore, the migration status reset behavior has been improved, and the `migration_status_refresher` parameter type in the `TableMigrate` class constructor has been modified. New private methods `_index_with_reset()` and updated `_migrate_views()` and `_view_can_be_migrated()` methods have been added to ensure a more accurate and consistent table migration process. The changes have been thoroughly tested and are ready for review.
* Refresh migration status at the end of the `migrate_tables` workflows ([#1599](#1599)). In this release, updates have been made to the migration status at the end of the `migrate_tables` workflows, with no new or modified tables or methods introduced. The `_migration_status_refresher.reset()` method has been added in two locations to ensure accurate migration status updates. A new `refresh_migration_status` method has been included in the `RuntimeContext` class in the `databricks.labs.ucx.hive_metastore.workflows` module, which refreshes the migration status for presentation in the dashboard. The changes also include the addition of the `refresh_migration_status` task in `migrate_views`, `migrate_views_with_acl`, and `scan_tables_in_mounts_experimental` workflows, and the `migration_report` method is now dependent on the `refresh_migration_status` task. Thorough testing has been conducted, including the creation of a new integration test in the file `tests/integration/hive_metastore/test_workflows.py` to verify that the migration status is refreshed after the migration job is run. These changes aim to ensure that the migration status is up-to-date and accurately presented in the dashboard.
* Removed DBFS library installations ([#1554](#1554)). In this release, the "configure.py" file has been removed, which previously contained the `ConfigureClusterOverrides` class with methods for validating cluster IDs, distinguishing between classic and Table Access Control (TACL) clusters, and building a prompt for users to select a valid active cluster ID. The removal of this file signifies that these functionalities are no longer available. This change is part of a larger commit that also removes DBFS library installations and updates the Estimates Dashboard to remove metastore assignment, addressing issue [#1098](#1098). The commit has been tested via integration tests and manual installation and running of UCX on a no-uc environment. Please note that the `create_jobs` method in the `install.py` file has been updated to reflect these changes, ensuring a more straightforward installation experience and usage of the Estimates Dashboard.
* Removed the `Is Terraform used` prompt ([#1664](#1664)). In this release, we have removed the `is_terraform_used` prompt from the configuration file and the installation process in the ucx package. This prompt was not being utilized and had been a source of confusion for some users. Although the variable that stored its outcome will be retained for backwards compatibility, no new methods or modifications to existing functionality have been introduced. No tests have been added or modified as part of this change. The removal of this prompt simplifies the configuration process and aligns with the project's future plans to eliminate the use of Terraform state for ucx migration. Manual testing has been conducted to ensure that the removal of the prompt does not affect the functionality of other properties in the configuration file or the installation process.
* Resolve relative paths when building dependency graph ([#1608](#1608)). This commit introduces support for resolving relative paths when building a dependency graph in the UCX project, addressing issues [#1202](#1202), [#1499](#1499), and [#1287](#1287). The SysPathProvider now includes a `cwd` attribute, and a new class, LocalNotebookLoader, has been implemented to handle local files and folders. The PathLookup class is used to resolve paths, and new methods have been added to support these changes. Unit tests have been provided to ensure the correct functioning of the new functionality. This commit replaces issue [#1593](#1593) and enhances the project's ability to handle local files and folders, resulting in a more robust and reliable dependency graph.
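As a rough illustration of the path-resolution idea (the real `PathLookup`/`SysPathProvider` classes carry more context), a relative path can be probed against the current working directory first and then against the registered search paths:

```python
from pathlib import Path


class PathLookup:
    """Simplified sketch: cwd plus sys.path-like roots, probed in order."""

    def __init__(self, cwd: Path, paths: list[Path]):
        self.cwd = cwd
        self.paths = paths

    def resolve(self, candidate: str) -> Path | None:
        for root in (self.cwd, *self.paths):
            probe = root / candidate
            if probe.exists():
                return probe.resolve()
        return None


lookup = PathLookup(Path.cwd(), [Path("/Workspace/Repos/me/project")])
print(lookup.resolve("utils/helpers.py"))  # a Path if found under some root, else None
```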
* Show tables migration status in migration dashboard ([#1507](#1507)). A migration dashboard has been added to display the status of data object migrations, addressing issue [#323](#323). This new feature includes a query to show the migration status of tables, a new CLI command, and a modification to an existing command. The `migration-*` workflows have been updated to include a refresh migration dashboard option. The `mock_installation` function has been modified with an updated state.json file. The changes were manually tested and include a new SQL query file in the `migrations/main` directory. This migration dashboard provides users with an easier way to monitor the progress and status of their data migration tasks.
* Simulate loading of local files or notebooks after manipulation of `sys.path` ([#1633](#1633)). This commit updates the PathLookup process during the construction of the dependency graph, addressing issues [#1202](#1202) and [#1468](#1468). It simplifies dependency graph construction by using the DependencyResolver directly, with resolvers and lookup passed as arguments, and removes the now-redundant DependencyGraphBuilder. The changes include new methods for handling compatibility checks, but no new user-facing features or changes to command-line interfaces or existing workflows are introduced. Unit tests are included to ensure correct behavior. The modifications aim to improve the internal handling of dependency resolution and compatibility checks.
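Because this epic revolves around detecting `sys.path` manipulation in user code, a compact illustration of the kind of AST matching involved may help. This is a hedged sketch of the general technique, not the linter's actual code:

```python
import ast

CODE = """
import sys
sys.path.append('/Workspace/Users/me/libs')
sys.path.insert(0, '../shared')
"""


def appended_sys_paths(source: str) -> list[str]:
    """Collect literal string arguments of sys.path.append/insert calls."""
    found: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        attr = node.func
        if attr.attr not in ("append", "insert"):
            continue
        value = attr.value
        if (isinstance(value, ast.Attribute) and value.attr == "path"
                and isinstance(value.value, ast.Name) and value.value.id == "sys"):
            for arg in node.args:
                if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                    found.append(arg.value)
    return found


print(appended_sys_paths(CODE))  # ['/Workspace/Users/me/libs', '../shared']
```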
* Test if `create-catalogs-schemas` works with tables defined as mount paths ([#1578](#1578)). This release includes a new unit test for the `create-catalogs-schemas` logic that verifies the correct creation and management of catalogs and schemas defined as mount paths. The test checks the storage location of catalogs, ensures non-existing schemas are properly created, and prevents the creation of catalogs without a storage location. It also verifies the catalog schema ACL is set correctly. Using the `CatalogSchema` class and various test functions, the test creates and grants permissions to catalogs and schemas. This change resolves issue [#1039](#1039) without modifying any existing commands or workflows. The release contains no new CLI commands or user documentation, but includes unit tests and assertion calls to validate the behavior of the `create_all_catalogs_schemas` method.
* Upgraded `databricks-sdk` to 0.27 ([#1626](#1626)). In this release, the `databricks-sdk` package has been upgraded to version 0.27, bringing updated methods for Redash objects. The `_install_query` method in the `dashboards.py` file has been updated to include a `tags` parameter, set to `None`, when calling `self._ws.queries.update` and `self._ws.queries.create`. This ensures that the updated SDK version is used and that tags are not applied during query updates and creation. Additionally, the `databricks-labs-lsql` and `databricks-labs-blueprint` packages have been updated to versions 0.4.0 and 0.4.3 respectively, and the dependency for PyYAML has been updated to a version between 6.0.0 and 7.0.0. These updates may impact the functionality of the project. The changes have been manually tested, but there is no verification on a staging environment.
* Use stack of dependency resolvers ([#1560](#1560)). This pull request introduces a stack-based implementation of resolvers, resolving issues [#1202](#1202), [#1499](#1499), and [#1421](#1421), and implements an initial version of SysPathProvider, while eliminating previous hacks. The new functionality includes modified existing commands, a new workflow, and the addition of unit tests. No new documentation or CLI commands have been added. The `problem_collector` parameter is not addressed in this PR and has been moved to a separate issue. The changes include renaming and moving a Python file, as well as modifications to the `Notebook` class and its related methods for handling notebook dependencies and dependency checking. The code is covered by unit tests, but manual testing and integration tests are still pending.
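The stack idea can be pictured with a minimal, hypothetical chain: each resolver gets a chance to locate a dependency, and the first hit wins (the project's real resolvers are considerably richer):

```python
class MappingResolver:
    """Toy resolver backed by a fixed name-to-path mapping."""

    def __init__(self, known: dict[str, str]):
        self._known = known

    def resolve(self, name: str) -> str | None:
        return self._known.get(name)


class ResolverStack:
    """Walk the stack top-down; the first resolver that answers wins."""

    def __init__(self, *resolvers: MappingResolver):
        self._resolvers = resolvers

    def resolve(self, name: str) -> str | None:
        for resolver in self._resolvers:
            located = resolver.resolve(name)
            if located is not None:
                return located
        return None


stack = ResolverStack(
    MappingResolver({"notebook_a": "/Workspace/notebook_a"}),
    MappingResolver({"helpers": "/site-packages/helpers"}),
)
assert stack.resolve("helpers") == "/site-packages/helpers"
```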
nfx added a commit that referenced this issue May 8, 2024
* Added DBSQL queries & dashboard migration
([#1532](#1532)). The
Databricks Labs Unified Command Extensions (UCX) project has been
updated with two new experimental commands: `migrate-dbsql-dashboards`
and `revert-dbsql-dashboards`. These commands are designed for migrating
and reverting the migration of Databricks SQL dashboards in the
workspace. The `migrate-dbsql-dashboards` command transforms all
Databricks SQL dashboards in the workspace after table migration,
tagging migrated dashboards and queries with `migrated by UCX` and
backing up original queries. The `revert-dbsql-dashboards` command
returns migrated Databricks SQL dashboards to their original state
before migration. Both commands accept a `--dashboard-id` flag for
migrating or reverting a specific dashboard. Additionally, two new
functions, `migrate_dbsql_dashboards` and `revert_dbsql_dashboards`,
have been added to the `cli.py` file, and new classes have been added to
interact with Redash for data visualization and querying. The
`make_dashboard` fixture has been updated to enhance testing
capabilities, and new unit tests have been added for migrating and
reverting DBSQL dashboards.
* Added UDFs assessment
([#1610](#1610)). A User
Defined Function (UDF) assessment feature has been introduced,
addressing issue
[#1610](#1610). A new
method, DESCRIBE_FUNCTION, has been implemented to retrieve detailed
information about UDFs, including function description, input
parameters, and return types. This method has been integrated into
existing test cases, enhancing the validation of UDF metadata and
associated privileges, and ensuring system reliability. The UDF
constructor has been updated with a new parameter 'comment', initially
left blank in the test function. Additionally, two new columns,
`success` and 'failures', have been added to the udf table in the
inventory database to store assessment data for UDFs. The UdfsCrawler
class has been updated to return a list of UDF objects, and the
assertions in the test have been updated accordingly. Furthermore, a new
SQL file has been added to calculate the total count of UDFs in the
$inventory.udfs table, with a widget displaying this information as a
counter visualization named "Total UDF Count".
* Added `databricks labs ucx create-missing-principals` command to
create the missing UC roles in AWS
([#1495](#1495)). The
`databricks labs ucx` tool now includes a new command,
`create-missing-principals`, which creates missing Universal Catalog
(UC) roles in AWS for S3 locations that lack a UC compatible role. This
command is implemented using `IamRoleCreation` from
`databricks.labs.ucx.aws.credentials` and updates `AWSRoleAction` with
the corresponding `role_arn` while adding `AWSUCRoleCandidate`. The new
command only supports AWS and does not affect Azure. The existing
`migrate_credentials` function has been updated to handle Azure Service
Principals migration. Additionally, new classes and methods have been
added, including `AWSUCRoleCandidate` in `aws.py`, and
`create_missing_principals` and `list_uc_roles` methods in `access.py`.
The `create_uc_roles_cli` method in `access.py` has been refactored and
renamed to `list_uc_roles`. New unit tests have been implemented to test
the functionality of `create_missing_principals` for AWS and Azure, as
well as testing the behavior when the command is not approved.
* Added baseline for workflow linter
([#1613](#1613)). This
change introduces the `WorkflowLinter` class in the `application.py`
file of the `databricks.labs.ucx.source_code.jobs` package. The class is
used to lint workflows by checking their dependencies and ensuring they
meet certain criteria, taking in arguments such as `workspace_client`,
`dependency_resolver`, `path_lookup`, and `migration_index`. Several
properties have been moved from `dependency_resolver` to the
`CliContext` class, and the `NotebookLoader` class has been moved to a
new location. Additionally, several classes and methods have been
introduced to build a dependency graph, resolve dependencies, and manage
allowed dependencies, site packages, and supported programming
languages. The `generic` and `redash` modules from
`databricks.labs.ucx.workspace_access` and the `GroupManager` class from
`databricks.labs.ucx.workspace_access.groups` are used. The
`VerifyHasMetastore`, `UdfsCrawler`, and `TablesMigrator` classes from
`databricks.labs.ucx.hive_metastore` and the `DeployedWorkflows` class
from `databricks.labs.ucx.installer.workflows` are also used. This
commit is part of a larger effort to improve workflow linting and
addresses several related issues and pull requests.
* Added linter to check for RDD use and JVM access
([#1606](#1606)). A new
`AstHelper` class has been added to provide utility functions for
working with abstract syntax trees (ASTs) in Python code, including
methods for extracting attribute and function call node names.
Additionally, a linter has been integrated to check for RDD use and JVM
access, utilizing the `AstHelper` class, which has been moved to a
separate module. A new file, 'spark_connect.py', introduces a linter
with three matchers to ensure conformance to best practices and catch
potential issues early in the development process related to RDD usage
and JVM access. The linter is environment-aware, accommodating shared
cluster and serverless configurations, and includes new test methods to
validate its functionality. These improvements enhance codebase quality,
promote reusability, and ensure performance and stability in Spark
cluster environments.
* Added non-Delta DBFS table migration (What.DBFS_ROOT_NON_DELTA) in
migrate_table workflow
([#1621](#1621)). The
`migrate_tables` workflow in `workflows.py` has been enhanced to support
a new scenario, DBFS_ROOT_NON_DELTA, which covers non-delta tables
stored in DBFS root from the Hive Metastore to the Unity Catalog using
CTAS. Additionally, the ACL migration strategy has been updated to
include the AclMigrationWhat.PRINCIPAL strategy. The
`migrate_external_tables_sync`, `migrate_dbfs_root_delta_tables`, and
`migrate_views` tasks now incorporate the new ACL migration strategy.
These changes have been thoroughly tested through unit tests and
integration tests, ensuring the continued functionality of the existing
workflow while expanding its capabilities.
* Added "seen tables" feature
([#1465](#1465)). The `seen
tables` feature has been introduced, allowing for better handling of
existing tables in the hive metastore and supporting their migration to
UC. This enhancement includes the addition of a `snapshot` method that
fetches and crawls table inventory, appending or overwriting records
based on assessment results. The `_crawl` function has been updated to
check for and skip existing tables in the current workspace. New methods
such as '_get_tables_paths_from_assessment', '_overwrite_records', and
`_get_table_location` have been included to facilitate these
improvements. In the testing realm, a new test
`test_mount_listing_seen_tables` has been implemented, replacing
'test_partitioned_csv_jsons'. This test checks the behavior of the
TablesInMounts class when enumerating tables in mounts for a specific
context, accounting for different table formats and managing external
and managed tables. The diff modifies the 'locations.py' file in the
databricks/labs/ucx directory, related to the hive metastore.
* Added support for `migrate-tables-ctas` workflow in the `databricks
labs ucx migrate-tables` CLI command
([#1660](#1660)). This
commit adds support for the `migrate-tables-ctas` workflow in the
`databricks labs ucx migrate-tables` command, which checks for external
tables that cannot be synced and prompts the user to run the
`migrate-tables-ctas` workflow. Two new methods,
`test_migrate_external_tables_ctas(ws)` and `migrate_tables(ws, prompts,
ctx=ctx)`, have been added. The first method checks if the
`migrate-external-tables-ctas` workflow is called correctly, while the
second method runs the workflow after prompting the user. The method
`test_migrate_external_hiveserde_tables_in_place(ws)` has been modified
to test if the `migrate-external-hiveserde-tables-in-place-experimental`
workflow is called correctly. No new methods or significant
modifications to existing functionality have been made in this commit.
The changes include updated unit tests and user documentation. The
target audience for this feature are software engineers who adopt the
project.
* Added support for migrating external location permissions from
interactive cluster mounts
([#1487](#1487)). This
commit adds support for migrating external location permissions from
interactive cluster mounts in Databricks Labs' UCX project, enhancing
security and access control. It retrieves interactive cluster locations
and user mappings from the AzureACL class, granting necessary
permissions to each cluster principal for each location. The existing
`databricks labs ucx` command is modified, with the addition of the new
method `create_external_locations` and thorough testing through manual,
unit, and integration tests. This feature is developed by vuong-nguyen
and Vuong and addresses issues
[#1192](#1192) and
[#1193](#1193), ensuring a
more robust and controlled user experience with interactive clusters.
* Added uber principal spn details in SQL warehouse data access
configuration when creating uber-SPN
([#1631](#1631)). In this
release, we've implemented new features to enhance the security and
control over data access during the migration process for the SQL
warehouse data access configuration. The `databricks labs ucx
create-uber-principal` command now creates a service principal with
read-only access to all the storage used by tables in the workspace. The
UCX Cluster Policy and SQL Warehouse data access configuration will be
updated to use this service principal for migration workflows. A new
method, `_update_sql_dac_with_instance_profile`, has been introduced in
the `access.py` file to update the SQL data access configuration with
the provided AWS instance profile, ensuring a more streamlined
management of instance profiles within the SQL data access configuration
during the creation of an uber service principal (SPN). Additionally,
new methods and tests have been added to the sql module of the
databricks.sdk.service package to improve Azure resource permissions,
handling different scenarios related to creating a global SPN in the
presence or absence of various conditions, such as storage, cluster
policies, or secrets.
* Addressed issue with disabled features in certain regions
([#1618](#1618)). In this
release, we have implemented improvements to address an issue where
certain features were disabled in specific regions. We have added error
handling when listing serving endpoints to raise a NotFound error if a
feature is disabled, preventing the code from failing silently and
providing better error messages. A new method,
test_serving_endpoints_not_enabled, has been added, which creates a mock
WorkspaceClient and raises a NotFound error if serving endpoints are not
enabled for a shard. The GenericPermissionsSupport class uses this
method to get crawler tasks, and if serving endpoints are not enabled,
an error message is logged. These changes increase the reliability and
robustness of the codebase by providing better error handling and
messaging for this particular issue. Additionally, the change includes
unit tests and manual testing to ensure the proper functioning of the
new features.
* Aggregate UCX output across workspaces with CLI command
([#1596](#1596)). A new
`report-account-compatibility` command has been added to the `databricks
labs ucx` tool, enabling users to evaluate the compatibility of an
entire Azure Databricks account with UCX (Unified Client Context). This
command generates a readiness report for an Azure Databricks account,
specifically for evaluating compatibility with UCX, by querying various
aspects of the account such as clusters, configurations, and data
formats. It uses Azure CLI authentication with AAD tokens for
authentication and accepts a profile as an argument. The output includes
warnings for workspaces that do not have UCX installed, and provides
information about unsupported cluster types, unsupported configurations,
data format compatibility, and more. Additionally, a new feature has
been added to aggregate UCX output across workspaces in an account
through a new CLI command, "report-account-compatibility", which can be
run at the account level. The existing `manual-workspace-info` command
remains unchanged. These changes will help assess the readiness and
compatibility of an Azure Databricks account for UCX integration and
simplify the process of checking compatibility across an entire account.
* Assert if group name is in cluster policy
([#1665](#1665)). In this
release, we have implemented a change to ensure the presence of the
display name of a specific workspace group (ws_group_a) in the cluster
policy. This is to prevent a key error previously encountered. The
cluster policy is now loaded as a dictionary, and the group name is
checked to confirm its presence. If the group is not found, a message is
raised alerting users. Additionally, the permission level for the group
is verified to ensure it is set to CAN_USE. No new methods have been
added, and existing functionality remains unchanged. The test file
test_ext_hms.py has been updated to include the new assertion and has
undergone both unit tests and manual testing to ensure proper
implementation. This change is intended for software engineers who adopt
the project.
* Automatically retrying with `auth_type=azure-cli` when constructing
`workspace_clients` on Azure
([#1650](#1650)). This
commit introduces automatic retrying with 'auth_type=azure-cli' when
constructing `workspace_clients` on Azure, resolving TODO items for
`AccountWorkspaces` and adding relevant suggestions in
'troubleshooting.md'. It closes issues
[#1574](#1574) and
[#1430](#1430), and includes
new methods for generating readiness reports in `AccountAggregate` and
testing the `get_accessible_workspaces` method in 'test_workspaces.py'.
User documentation has been updated and the changes have been manually
verified in a staging environment. For macOS and Windows users, explicit
auth type settings are required for command line utilities.
* Changes to identify service principal with custom roles on Azure
storage account for principal-prefix-access
([#1576](#1576)). This
release introduces several enhancements to the identification of service
principals with custom roles on Azure storage accounts for
principal-prefix-access. New methods such as `_get_permission_level`,
`_get_custom_role_privilege`, and `_get_role_privilege` have been added
to improve the functionality of the module. Additionally, two new
classes, AzureRoleAssignment and AzureRoleDetails, have been added to
enable more detailed management and access control for custom roles on
Azure storage accounts. The 'test_access.py' file has been updated to
include tests for saving custom roles in Azure storage accounts and
ensuring the correct identification of service principals with custom
roles. A new unit test function, test_role_assignments_custom_storage(),
has also been added to verify the behavior of custom roles in Azure
storage accounts. Overall, these changes provide a more efficient and
fine-grained way to manage and control custom roles on Azure storage
accounts.
* Clarified unsupported config in compute crawler
([#1656](#1656)). In this
release, we have made significant changes to clarify and improve the
handling of unsupported configurations in our compute crawler related to
the Hive metastore. We have expanded error messages for unsupported
configurations and provided detailed recommendations for remediation.
Additionally, we have added relevant user documentation and manually
tested the changes. The changes include updates to the configuration for
external Hive metastore and passthrough security model for Unity
Catalog, which are incompatible with the current configurations. We
recommend removing or altering the configs while migrating existing
tables and views using UCX or other compatible clusters, and mapping the
passthrough security model to a security model compatible with Unity
Catalog. The code modifications include the addition of new methods for
checking cluster init script and Spark configurations, as well as
refining the error messages for unsupported configurations. We also
added a new assertion in the `test_cluster_with_multiple_failures` unit
test to check for the presence of a specific message regarding the use
of the `spark.databricks.passthrough.enabled` configuration. This
release is not yet verified on the staging environment.
* Created a unique default schema when External Hive Metastore is
detected ([#1579](#1579)). A
new default database `ucx` is introduced for storing inventory in the
hive metastore, with a suffix consisting of the workspace's client ID to
ensure uniqueness when an external hive metastore is detected. The
`has_ext_hms()` method is added to the `InstallationPolicy` class to
detect external HMS and thereby create a unique default schema. The
`_prompt_for_new_installation` method's default value for the `Inventory
Database stored in hive_metastore` prompt is updated to use the new
default database name, modified to include the workspace's client ID if
external HMS is detected. Additionally, a test function
`test_save_config_ext_hms` is implemented to demonstrate the
`WorkspaceInstaller` class's behavior with external HMS, creating a
unique default schema for improved system functionality and
customization. This change is part of issue
[#1579](#1579).
* Extend service principal migration to create storage credentials for
access connectors created for each storage account
([#1426](#1426)). This
commit extends the service principal migration to create storage
credentials for access connectors associated with each storage account,
resolving issues
[#1384](#1384) and
[#875](#875). The update
includes modifications to the existing `databricks labs ucx` command for
creating access connectors, adds a new CLI command for creating storage
credentials, and updates the documentation. A new workflow has been
added for creating credentials for access connectors and service
principals, and updates have been made to existing workflows. The commit
includes manual, unit, and integration tests, and no new or modified
methods are specified in the diff. The focus is on the feature
description and its impact on the project's functionality. The commit
has been co-authored by Serge Smertin and vuong-nguyen.
* Suggest users to create Access Connector(s) with Managed Identity to
access Azure Storage Accounts behind firewall
([#1589](#1589)). In this
release, we have introduced a new feature to improve access to Azure
Storage Accounts that are protected by firewalls. Due to limitations
with service principals in such scenarios, we have developed Access
Connectors with Managed Identities for more reliable connectivity. This
change includes updates to the 'credentials.py' file, which introduces
new methods for managing the migration of service principals to Access
Connectors using Managed Identities. Users are warned that migrating to
this new feature may cause issues when transitioning to UC, and are
advised to validate external locations after running the migration
command. This update enhances the security and functionality of the
system, providing a more dependable method for accessing Azure Storage
Accounts protected by firewalls.
* Fixed catalog/schema grants when tables with same source schema have
different target schemas
([#1581](#1581)). In this
release, we have implemented a fix to address an issue where
catalog/schema grants were not being handled correctly when tables with
the same source schema had different target schemas. This was causing
problems with granting appropriate permissions to users. We have
modified the prepare_test function to include an additional test case
with a different target schema for the same source table. Furthermore,
we have updated the test_catalog_schema_acl function to ensure that
grants are being created correctly for all catalogs, schemas, and
tables. We have also added an extra query to grant use schema
permissions for catalog2.schema3 to user1. Additionally, we have
introduced a new `SchemaInfo` class to store information about catalogs
and schemas, and refactored the `_get_database_source_target_mapping`
method to return a dictionary that maps source databases to a list of
`SchemaInfo` objects instead of a single dictionary. These changes
ensure that grants are being handled correctly for catalogs, schemas,
and tables, even when tables with the same source schema have different
target schemas. This will improve the overall functionality and
reliability of the system, making it easier for users to manage their
catalogs and schemas.
* Fixed Spark configuration parameter referencing secret
([#1635](#1635)). In this
release, the code related to the Spark configuration parameter reference
for a secret has been updated in the `access.py` file, specifically
within the `_update_cluster_policy_definition` method. The change
modifies the method to retrieve the OAuth client secret for a given
storage account using an f-string to reference the secret, replacing the
previous concatenation operator. This enhancement is aimed at improving
the readability and maintainability of the code while preserving its
functionality. Furthermore, the commit includes additional changes, such
as new methods `test_create_global_spn` and "cluster_policies.edit",
which may be related to this fix. These changes address the secret
reference issue, ensuring secure access control and improved
integration, particularly with the Spark configuration, benefiting
engineers utilizing this project for handling sensitive information and
managing clusters securely and effectively.
* Fixed `migration-locations` and `assign-metastore` definitions in
`labs.yml` ([#1627](#1627)).
In this release, the `migration-locations` command in the `labs.yml`
file has been updated to include new flags `subscription-id` and
`aws-profile`. The `subscription-id` flag allows users to specify the
subscription to scan the storage account in, and the `aws-profile` flag
allows for authentication using a specified AWS Profile. The
`assign-metastore` command has also been updated with a new description:
"Enable Unity Catalog features on a workspace by assigning a metastore
to it." The `is_account_level` parameter remains unchanged, and the new
optional flag `workspace-id` has been added, allowing users to specify
the Workspace ID to assign a metastore to. This change enhances the
functionality of the `migration-locations` and `assign-metastore`
commands, providing more options for users to customize their storage
scanning and metastore assignment processes. The `migration-locations`
and `assign-metastore` definitions in the `labs.yml` file have been
fixed in this release.
* Fixed prompt for using external metastore
([#1668](#1668)). A fix has
been implemented in the `create` function of the `policy.py` file to
correctly prompt users for using an external metastore. Previously, a
missing period and space in the prompt caused potential confusion. The
updated prompt now includes a clarifying sentence and the
`_prompts.confirm` method has been modified to check if the user wants
to set UCX to connect to an external metastore in two scenarios: when
one or more cluster policies are set up for an external metastore, and
when the workspace warehouse is configured for an external metastore. If
the user chooses to set up an external metastore, an informational
message will be recorded in the logger. This change ensures clear and
precise communication with users during the external metastore setup
process.
* Fixed storage account network ACLs retrieved from properties
([#1620](#1620)). This
release includes a fix to the storage account network ACLs retrieval in
the open-source library, addressing issue
[#1](#1). Previously, the
network ACLs were being retrieved from an incorrect location, but this
commit corrects that by obtaining the network ACLs from the storage
account's properties.networkAcls field. The `StorageAccount` class has
been updated to modify the way default network action is retrieved, with
a new value `Unknown` added to the previous values `Deny` and "Allow".
The `from_raw_resource` class method has also been updated to retrieve
the default network action from the `properties.networkAcls` field
instead of the `networkAcls` field. This change may affect any
functionality that relies on network ACL information and impacts the
existing command `databricks labs ucx ...`. Relevant tests, including a
new test `test_azure_resource_storage_accounts_list_non_zero`, have been
added and manually and unit tested to ensure the fix is functioning
correctly.
* Fully refresh table migration status in table migration workflow
([#1630](#1630)). This
release introduces a new method, `index_full_refresh()`, to the table
migration workflow for fully refreshing the migration status, addressing
an oversight from a previous commit
([#1623](#1623)) and
resolving issue
[#1628](#1628). The new
method resets the `_migration_status_refresher` before computing the
index, ensuring the latest migration status is used for determining
whether view dependencies have been migrated. The `index()` method was
previously used to refresh the migration status, but it only provided a
partial refresh. With this update, `index_full_refresh()` is utilized
for a comprehensive refresh, affecting the `refresh_migration_status`
task in multiple workflows such as `migrate_views`,
`scan_tables_in_mounts_experimental`, and others. This change ensures a
more accurate migration report, presenting the updated migration status.
* Ignore existing corrupted installations when refreshing
([#1605](#1605)). A recent
update has enhanced the error handling during the loading of
installations in the `install.py` file. Specifically, the
`installation.load` function now handles certain errors, including
`PermissionDenied`, `SerdeError`, `ValueError`, and `AttributeError`, by
logging a warning message and skipping the corrupted installation
instead of raising an error. This behavior has been incorporated into
both the `configure` and `_check_inventory_database_exists` functions,
allowing the installation process to continue even in the presence of
issues with existing installations, while providing improved error
messages. This change resolves issue
[#1601](#1601) and
introduces a new test case for a corrupted installation configuration,
as well as an updated existing test case for `test_save_config` that
includes a mock installation.
* Improved exception handling
([#1584](#1584)). In this
release, the exception handling during the upload of a wheel file to
DBFS has been significantly improved. Previously, only PermissionDenied
errors were caught and handled. Now, both BadRequest and
PermissionDenied exceptions will be caught and logged as a warning. This
change enhances the robustness of the code by handling a wider range of
exceptions during the upload process. In addition, cluster overrides
have been configured and DBFS write permissions have been set up. The
specific changes made to the code include updating the import statement
for NotFound to include BadRequest and modifying the except block in the
_get_init_script_data method to catch both NotFound and BadRequest
exceptions. These improvements ensure that the code can handle more
types of errors, providing more helpful error messages and preventing
crash scenarios, thereby enhancing the reliability and robustness of the
code.
* Improved exception handling for `migrate_acl`
([#1590](#1590)). In this
release, the `migrate_acl` functionality has been enhanced to improve
exception handling, addressing a flakiness issue in the
`test_migrate_managed_tables_with_acl` test. Previously, unhandled `not
found` exceptions during parallel test execution caused the flakiness.
This release resolves this issue
([#1549](#1549)) by
introducing error handling in the
`test_migrate_acls_should_produce_proper_queries` test. A controlled
error is now introduced to simulate a failed grant migration due to a
`TABLE_OR_VIEW_NOT_FOUND` error. This enhancement allows for precise
testing of error handling and logging mechanisms when migration fails
for specific objects, ensuring a more reliable testing environment for
the `migrate_acl` functionality.
* Improved reliability of table migration status refresher
([#1623](#1623)). This
release introduces improvements to the table migration status refresher
in the open-source library, enhancing its reliability and robustness.
The `table_migrate` function has been updated to ensure that the table
migration status is always reset when requesting the latest snapshot,
addressing issues
[#1623](#1623),
[#1622](#1622), and
[#1615](#1615).
Additionally, the function now handles `NotFound` errors when refreshing
migration status. The `get_seen_tables` function has been modified to
convert the returned iterator to a list and raise a `NotFound` exception
if the schema does not exist, which is then caught and logged as a
warning. Furthermore, the migration status reset behavior has been
improved, and the `migration_status_refresher` parameter type in the
`TableMigrate` class constructor has been modified. New private methods
`_index_with_reset()` and updated `_migrate_views()` and
`_view_can_be_migrated()` methods have been added to ensure a more
accurate and consistent table migration process. The changes have been
thoroughly tested and are ready for review.
* Refresh migration status at the end of the `migrate_tables` workflows
([#1599](#1599)). In this
release, updates have been made to the migration status at the end of
the `migrate_tables` workflows, with no new or modified tables or
methods introduced. The `_migration_status_refresher.reset()` method has
been added in two locations to ensure accurate migration status updates.
A new `refresh_migration_status` method has been included in the
`RuntimeContext` class in the
`databricks.labs.ucx.hive_metastore.workflows` module, which refreshes
the migration status for presentation in the dashboard. The changes also
include the addition of the `refresh_migration_status` task in
`migrate_views`, `migrate_views_with_acl`, and
`scan_tables_in_mounts_experimental` workflows, and the
`migration_report` method is now dependent on the
`refresh_migration_status` task. Thorough testing has been conducted,
including the creation of a new integration test in the file
`tests/integration/hive_metastore/test_workflows.py` to verify that the
migration status is refreshed after the migration job is run. These
changes aim to ensure that the migration status is up-to-date and
accurately presented in the dashboard.
* Removed DBFS library installations
([#1554](#1554)). In this
release, the "configure.py" file has been removed, which previously
contained the `ConfigureClusterOverrides` class with methods for
validating cluster IDs, distinguishing between classic and Table Access
Control (TACL) clusters, and building a prompt for users to select a
valid active cluster ID. The removal of this file signifies that these
functionalities are no longer available. This change is part of a larger
commit that also removes DBFS library installations and updates the
Estimates Dashboard to remove metastore assignment, addressing issue
[#1098](#1098). The commit
has been tested via integration tests and manual installation and
running of UCX on a no-uc environment. Please note that the
`create_jobs` method in the `install.py` file has been updated to
reflect these changes, ensuring a more straightforward installation
experience and usage of the Estimates Dashboard.
* Removed the `Is Terraform used` prompt
([#1664](#1664)). In this
release, we have removed the `is_terraform_used` prompt from the
configuration file and the installation process in the ucx package. This
prompt was not being utilized and had been a source of confusion for
some users. Although the variable that stored its outcome will be
retained for backwards compatibility, no new methods or modifications to
existing functionality have been introduced. No tests have been added or
modified as part of this change. The removal of this prompt simplifies
the configuration process and aligns with the project's future plans to
eliminate the use of Terraform state for ucx migration. Manual testing
has been conducted to ensure that the removal of the prompt does not
affect the functionality of other properties in the configuration file
or the installation process.
* Resolve relative paths when building dependency graph
([#1608](#1608)). This
commit introduces support for resolving relative paths when building a
dependency graph in the UCX project, addressing issues 1202, 1499, and
1287. The SysPathProvider now includes a `cwd` attribute, and a new
class, LocalNotebookLoader, has been implemented to handle local files
and folders. The PathLookup class is used to resolve paths, and new
methods have been added to support these changes. Unit tests have been
provided to ensure the correct functioning of the new functionality.
This commit replaces issue 1593 and enhances the project's ability to
handle local files and folders, resulting in a more robust and reliable
dependency graph.
* Show tables migration status in migration dashboard
([#1507](#1507)). A
migration dashboard has been added to display the status of data object
migrations, addressing issue
[#323](#323). This new
feature includes a query to show the migration status of tables, a new
CLI command, and a modification to an existing command. The
`migrataion-*` workflow has been updated to include a refresh migration
dashboard option. The `mock_installation` function has been modified
with an updated state.json file. The changes consist of manual testing
and can be found in the `migrations/main` directory as a new SQL query
file. This migration dashboard provides users with an easier way to
monitor the progress and status of their data migration tasks.
* Simulate loading of local files or notebooks after manipulation of
`sys.path` ([#1633](#1633)).
This commit updates the PathLookup process during the construction of
the dependency graph, addressing issues
[#1202](#1202) and
[#1468](#1468). It
simplifies the DependencyGraphBuilder by directly using the
DependencyResolver with resolvers and lookup passed as arguments, and
removes the DependencyGraphBuilder. The changes include new methods for
handling compatibility checks, but no new user-facing features or
changes to command-line interfaces or existing workflows are introduced.
Unit tests are included to ensure correct behavior. The modifications
aim to improve the internal handling of dependency resolution and
compatibility checks.
* Test if `create-catalogs-schemas` works with tables defined as mount
paths ([#1578](#1578)). This
release includes a new unit test for the `create-catalogs-schemas` logic
that verifies the correct creation and management of catalogs and
schemas defined as mount paths. The test checks the storage location of
catalogs, ensures non-existing schemas are properly created, and
prevents the creation of catalogs without a storage location. It also
verifies the catalog schema ACL is set correctly. Using the
`CatalogSchema` class and various test functions, the test creates and
grants permissions to catalogs and schemas. This change resolves issue
[#1039](#1039) without
modifying any existing commands or workflows. The release contains no
new CLI commands or user documentation, but includes unit tests and
assertion calls to validate the behavior of the
`create_all_catalogs_schemas` method.
* Upgraded `databricks-sdk` to 0.27
([#1626](#1626)). In this
release, the `databricks-sdk` package has been upgraded to version 0.27,
bringing updated methods for Redash objects. The `_install_query` method
in the `dashboards.py` file has been updated to include a `tags`
parameter, set to `None`, when calling `self._ws.queries.update` and
`self._ws.queries.create`. This ensures that the updated SDK version is
used and that tags are not applied during query updates and creation.
Additionally, the `databricks-labs-lsql` and `databricks-labs-blueprint`
packages have been updated to versions 0.4.0 and 0.4.3 respectively, and
the dependency for PyYAML has been updated to a version between 6.0.0
and 7.0.0. These updates may impact the functionality of the project.
The changes have been manually tested, but there is no verification on a
staging environment.
* Use stack of dependency resolvers
([#1560](#1560)). This pull
request introduces a stack-based implementation of resolvers, resolving
issues [#1202](#1202),
[#1499](#1499), and
[#1421](#1421), and
implements an initial version of SysPathProvider, while eliminating
previous hacks. The new functionality includes modified existing
commands, a new workflow, and the addition of unit tests. No new
documentation or CLI commands have been added. The `problem_collector`
parameter is not addressed in this PR and has been moved to a separate
issue. The changes include renaming and moving a Python file, as well as
modifications to the `Notebook` class and its related methods for
handling notebook dependencies and dependency checking. The code has
been tested, but manual testing and integration tests are still pending.
@JCZuurmond mentioned this issue May 14, 2024
github-merge-queue bot pushed a commit that referenced this issue May 14, 2024
…1685)

## Changes
Change the SitePackageResolver logic so that it looks for
`_package_/__init__.py` rather than relying on `dist-info` metadata.
Wire the resolver back into the global resolver chain.
Update the calling code.
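Roughly, the new lookup amounts to probing candidate roots for the package folder itself; a hypothetical sketch, not this PR's actual code:

```python
from pathlib import Path


def locate_package(name: str, roots: list[Path]) -> Path | None:
    # Probe each root for <name>/__init__.py instead of reading dist-info.
    for root in roots:
        if (root / name / "__init__.py").is_file():
            return root / name
    return None
```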


### Linked issues
#1202 

### Functionality 

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`

### Tests
- [ ] manually tested
- [x] added unit tests
- [ ] added integration tests
- [ ] verified on staging environment (screenshot attached)

---------

Co-authored-by: Eric Vergnaud <eic.vergnaud@databricks.com>
github-merge-queue bot pushed a commit that referenced this issue May 15, 2024
## Changes
Specialize resolvers for notebooks and imports

### Linked issues
#1202

### Functionality 

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`

### Tests
- [x] manually tested
- [ ] added unit tests
- [ ] added integration tests
- [ ] verified on staging environment (screenshot attached)

---------

Co-authored-by: Eric Vergnaud <eic.vergnaud@databricks.com>
Co-authored-by: Cor <jczuurmond@protonmail.com>
github-merge-queue bot pushed a commit that referenced this issue May 17, 2024
## Changes
Builds a child dependency graph of libraries resolved via PipResolver,
using DistInfo data.
Changed some tests that would otherwise take minutes to run.

### Linked issues
#1202
Resolves #1642 

### Functionality 

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`

### Tests
- [ ] manually tested
- [x] added unit tests
- [ ] added integration tests
- [ ] verified on staging environment (screenshot attached)

---------

Co-authored-by: Eric Vergnaud <eic.vergnaud@databricks.com>
nfx added a commit that referenced this issue May 27, 2024
* Added `%pip` cell resolver ([#1697](#1697)). A newly developed pip resolver has been integrated into the ImportResolver for future use, addressing issue [#1642](#1642) and following up on [#1694](#1694). The resolver installs libraries and modifies the path lookup to make them available for import. This change affects existing workflows but does not introduce new CLI commands, tables, or files. The commit includes modifications to the build_dependency_graph method and the addition of unit tests to verify the new functionality. The resolver has been manually tested and passes the unit tests, ensuring better compatibility and accessibility for libraries used in the project.
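Conceptually, such a resolver installs the requested libraries into a known folder and then extends the path lookup so subsequent imports resolve. A hedged sketch under those assumptions (the project's actual `%pip` cell handling differs in detail):

```python
import subprocess
import sys
from pathlib import Path


def install_and_expose(cell: str, target: Path) -> list[str]:
    """Install libraries named in a `%pip install ...` cell into `target`
    and make them importable by extending the interpreter path."""
    tokens = cell.replace("%pip", "", 1).split()
    if not tokens or tokens[0] != "install":
        return []
    libraries = [token for token in tokens[1:] if not token.startswith("-")]
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--target", str(target), *libraries],
        check=True,
    )
    sys.path.append(str(target))  # make the freshly installed packages importable
    return libraries
```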
* Added local download of `requirements.txt` dependencies to register them in the dependency graph ([#1753](#1753)). This commit introduces support for linting job tasks that specify their dependencies in a `requirements.txt` file. It resolves issue [#1644](#1644) and is similar to [#1704](#1704). The changes include the addition of a new CLI command, modification of the existing `databricks labs ucx ...` command, and modification of the `experimental-workflow-linter` workflow. The `lint_job` method has been updated to handle dependencies specified in a `requirements.txt` file, checking for their presence in the job's libraries list and flagging any missing dependencies. The code changes include modifications to the `jobs.py` file to register libraries specified in a `requirements.txt` file to the dependency graph. Unit and integration tests have been added to verify the new functionality. The changes also include handling of jar libraries. The code includes TODO comments for future enhancements such as downloading the library wheel and adding it to the virtual system path, and handling references to other requirements files and constraints files.
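A simplified take on the `requirements.txt` handling this entry describes: keep distribution requirements, skip comments and pip options, and leave nested `-r`/`-c` references for the TODOs mentioned above (illustrative only):

```python
from pathlib import Path


def libraries_from_requirements(requirements: Path) -> list[str]:
    names: list[str] = []
    for raw in requirements.read_text().splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and options like -r/-c/--hash
            continue
        names.append(line)
    return names
```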
* Added ability to install UCX on workspaces without Public Internet connectivity ([#1566](#1566)). A new flag, `upload_dependencies`, has been added to the WorkspaceConfig to enable users to upload dependencies to air-gapped workspaces without public internet connectivity. This flag is a boolean value that is set to False by default and can be set by the user through the installation prompt. This feature resolves issue [#573](#573) and was co-authored by hari-selvarajan_data. When this flag is set to True, it triggers the upload of specified dependencies during installation, which allows for the installation of UCX on workspaces without public internet access. This change also includes updating the version of `databricks-labs-blueprint` from `<0.7.0` to `>=0.6.0`, which may include changes to existing functionality. Additionally, new test functions have been added to test the functionality of uploading dependencies when the `upload_dependencies` flag is set to True.
* Added initial interface for data comparison framework ([#1695](#1695)). This commit introduces the initial interface for a data comparison framework, which includes classes and methods for managing metadata, profiling data, and comparing schema and data for tables. A new `StandardDataComparator` class has been implemented for comparing the data of two tables, and a `StandardSchemaComparator` class tests the comparison of table schemas. The framework also includes the `DatabricksTableMetadataRetriever` class for retrieving metadata about a given table using a SQL backend. Additional classes and methods will be implemented in future work to provide a robust data comparison framework, such as `StandardDataProfiler` for profiling data, `SchemaComparator` and `DataComparator` for comparing schema and data, and test fixtures and functions for testing the framework. This release lays the groundwork for enabling users to perform comprehensive data comparisons effectively, enhancing the project's capabilities and versatility.
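The interface might be pictured along these lines; the class names follow the entry, but the exact signatures and fields are assumptions:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class DataComparisonResult:
    schema_matches: bool
    source_row_count: int
    target_row_count: int
    num_missing_rows: int


class SchemaComparator(ABC):
    @abstractmethod
    def compare_schema(self, source: str, target: str) -> bool: ...


class DataComparator(ABC):
    @abstractmethod
    def compare_data(self, source: str, target: str) -> DataComparisonResult: ...
```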
* Added lint local code command ([#1710](#1710)). A new `lint local code` command has been added to the databricks labs ucx tool, allowing users to assess required migrations in a local directory or file. This command detects dependencies and analyzes them, currently supporting Python and SQL files, with an expected runtime of under a minute for code bases up to 50,000 lines of code. The command generates output that includes file links opening the file at the problematic line in modern IDEs, providing a quick and easy way to identify necessary migrations. The `lint-local-code` command is implemented in the `application.py` file, with supporting methods and classes added to the `workspace_cli.py` and `databricks.labs.ucx.source_code` packages, enhancing the linting process and providing valuable feedback for maintaining high code quality standards.
* Added table in mount migration ([#1225](#1225)). This commit introduces new functionality to migrate tables in mounts to the Unity Catalog, including creating a table in the Unity Catalog based on a table mapping CSV file, fixing an issue with `include_paths_in_mount` not being present in `workflows.py`, and adding the ability to set default ownership on each created table. A new method `ScanTablesInMounts` has been added to scan tables in mounts, and a `TableMigration` class creates tables in the Unity Catalog based on the table mapping. Two new methods, `Rule` and `TableMapping`, have been added to manage mappings of tables, and `TableToMigrate` is used to represent a table that needs to be migrated to Unity Catalog. The commit includes manual, unit, and integration testing to ensure the changes work as expected. The diff shows changes to the `workflows.py` file and the addition of several new methods, including `Rule`, `TableMapping`, `TableToMigrate`, `create_autospec`, and `MockBackend`.
* Added workflows to trigger table reconciliations ([#1721](#1721)). In this release, we've introduced several enhancements to our table migration workflow, focusing on data reconciliation and consistency. We've added a new post-migration data reconciliation task that validates migrated table integrity by comparing the schema, row count, and individual row content of the source and target tables. The new task stores and displays the number of missing rows in the Migration dashboard's `$inventory_database.reconciliation_results` view. Additionally, new workflows have been implemented to automatically trigger table reconciliations, ensuring consistency and integrity between different data sources. These workflows involve modifying relevant functions and modules, and may include new methods for data processing, scheduling, or monitoring based on the project's architecture. Furthermore, new configuration options for table reconciliation are now available in the WorkspaceConfig class, allowing for greater control and flexibility over migration processes. By incorporating these improvements, users can expect enhanced data consistency and more efficient table reconciliation management.
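In PySpark terms, the three checks this entry describes (schema, row count, row content) reduce to something like the following sketch, assuming a live `SparkSession`; the workflow's actual implementation is more involved:

```python
def reconcile(spark, source: str, target: str) -> dict:
    src, dst = spark.table(source), spark.table(target)
    schema_matches = src.schema == dst.schema
    # A row-content check only makes sense once the schemas line up.
    missing_rows = src.exceptAll(dst).count() if schema_matches else None
    return {
        "schema_matches": schema_matches,
        "source_rows": src.count(),
        "target_rows": dst.count(),
        "missing_rows": missing_rows,
    }
```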
* Always refresh HMS stats when getting table size ([#1713](#1713)). A change has been implemented in the hive_metastore library to enhance the precision of table size calculations by ensuring that HMS stats are always refreshed before being retrieved. This has been achieved by calling the `ANALYZE TABLE` command with the `COMPUTE STATISTICS NOSCAN` option before computing the table size, thus preventing the use of stale stats. Specifically, the `backend.queries` list has been updated to include two ANALYZE statements for tables `db1.table1` and `db1.table2`, ensuring that their statistics are updated and accurate. The test case `test_table_size_crawler` in the `test_table_size.py` file has been revised to validate the presence of the two ANALYZE statements in the `backend.queries` list and confirm the size of the results for both tables. This commit also includes manual testing, added unit tests, and verification on the staging environment to ensure the functionality.
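The refresh-then-read pattern looks roughly like this; reading the size via `DESCRIBE DETAIL` is an assumption that holds for Delta tables and is not necessarily the crawler's exact mechanism:

```python
def table_size_bytes(spark, full_name: str) -> int:
    # Refresh HMS statistics first so the size does not come from a stale snapshot.
    spark.sql(f"ANALYZE TABLE {full_name} COMPUTE STATISTICS NOSCAN")
    # For Delta tables, DESCRIBE DETAIL exposes the refreshed sizeInBytes.
    return spark.sql(f"DESCRIBE DETAIL {full_name}").collect()[0]["sizeInBytes"]
```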
* Automatically retrieve `aws_account_id` from aws profile instead of prompting ([#1715](#1715)). This commit introduces several improvements to the library's AWS integration, enhancing automation and user experience. It eliminates the need for manual input of `aws_account_id` by automatically retrieving it from the AWS profile. An optional `kms-key` flag has been documented for creating roles, providing more flexibility. The `create-missing-principals` command now accepts optional parameters such as KMS Key, Role Name, Policy Name, and allows creating a single role for all S3 locations, with a default behavior of creating one role per S3 location. These changes have been manually tested and verified in a staging environment, and resolve issue [#1714](#1714). Additionally, tests have been conducted to ensure the changes do not introduce regressions. A new method simulating a successful AWS CLI call has been added, replacing `aws_cli_run_command`, ensuring automated retrieval of `aws_account_id`. A test has also been added to raise an error when AWS CLI is not found in the system path.
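Retrieving the account id from a profile boils down to a caller-identity lookup. A sketch via the AWS CLI (which the entry indicates the project shells out to), with the not-found error mirroring the behaviour described above:

```python
import json
import shutil
import subprocess


def aws_account_id(profile: str) -> str:
    if shutil.which("aws") is None:
        raise FileNotFoundError("aws CLI not found in PATH")
    completed = subprocess.run(
        ["aws", "sts", "get-caller-identity", "--profile", profile, "--output", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)["Account"]
```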
* Detect dependencies of libraries installed via pip ([#1703](#1703)). This commit introduces a child dependency graph for libraries resolved via pip using DistInfo data, addressing issues [#1642](#1642) and [#1202](#1202). It modifies certain tests and reduces their execution time. The PipResolver class in `databricks.labs.ucx.source_code.graph` is used to detect and resolve library dependencies installed via pip, with methods to locate, install, and register libraries in a specified folder. A new Whitelist feature and updated DistInfoPackage class are also included. Although unit tests have been added, no new user documentation, CLI commands, workflows, or tables have been added or modified. The previous site_packages attribute has been removed from the GlobalContext class.
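For a pip-installed distribution, the direct dependencies recorded in its `dist-info` metadata can be read with the standard library, which is the kind of data such a child graph is built from (illustrative, not the `DistInfoPackage` implementation):

```python
import importlib.metadata


def direct_dependencies(distribution: str) -> list[str]:
    # Returns the Requires-Dist entries from the installed dist-info metadata.
    return importlib.metadata.requires(distribution) or []


print(direct_dependencies("pip"))
```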
* Emit problems with code belonging to job ([#1730](#1730)). In this release, the `jobs.py` file has been updated with new functionality in the JobProblem class, enabling it to convert itself into a string message using the new as_message() method. The refresh_report() method has been modified to call a new _lint_job() method when provided with a job object, which returns a list of JobProblem instances. The lint_job() method has also been updated to call _lint_job() and return a list of JobProblem instances, with a new behavior to log warning messages when problems are found. The changes include the addition of a new method, `lint_job`, for linting a job and returning any problems found. The changes have been tested through the addition of a new integration test, `test_job_linter_some_notebook_graph_with_problems`, and are manually tested and covered with unit and integration tests. This release addresses issue [#1542](#1542) and improves the job linter functionality, specifically detecting and emitting problems related to code belonging to a job during the linting of that job. The new `JobProblem` class has an `as_message()` method that returns a string representation of the problem, and a unit test for this method has been added. The `DependencyResolver` in the `DependencyGraph` constructor has also been modified.
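A plausible shape for `JobProblem` and its `as_message()` method; the field set and message format here are assumptions based on this entry, not the actual class:

```python
from dataclasses import dataclass


@dataclass
class JobProblem:
    job_id: int
    task_key: str
    path: str
    code: str
    message: str
    start_line: int

    def as_message(self) -> str:
        return f"{self.path}:{self.start_line} [{self.code}] {self.message}"


problem = JobProblem(123, "main", "notebooks/etl.py", "dbfs-usage", "Deprecated DBFS path", 42)
print(problem.as_message())  # notebooks/etl.py:42 [dbfs-usage] Deprecated DBFS path
```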
* Fixed `create-catalogs-schemas` to allow more than one level of nesting beyond the external location ([#1701](#1701)). The `create-catalogs-schemas` logic has been updated to allow more than one level of nesting beyond the external location, addressing issue [#1700](#1700). This release includes a new CLI command, as well as modifications to the existing `databricks labs ucx ...` command. A new workflow has been added and existing functionality has been changed to support the additional nesting levels. The changes have been thoroughly tested through manual testing, unit tests, and integration tests that use the `fnmatch.fnmatch` method to validate location patterns (see the matching sketch after this list).
* Fixed local file resolver logic with relative paths and site-packages ([#1685](#1685)). This commit addresses an issue ([#1685](#1685)) related to the local file resolver logic for relative paths and site-packages. The resolver's logic has been updated to look for `_package_/__init__.py` instead of relying on `dist-info` metadata (a sketch of this rule follows this list), and the resolver has been wired back into the global resolver chain with updated calling code. No changes have been made to user documentation, CLI commands, workflows, or tables. No new methods have been added; existing functionality has been modified to improve local file resolution. Unit tests have been added and manually verified to ensure proper functionality.
* Fixed look up logic where instance profile name does not match role name ([#1716](#1716)). A fix has been implemented to improve the robustness of the instance profile lookup mechanism. Previously, the code relied on the role name being the same as the instance profile name, which caused issues when the names did not match ([#1716](#1716), [#1711](#1711)). This has been addressed by updating the `role_name` method in the `AWSRoleAction` class to use a new regex pattern, `AWSResources.ROLE_NAME_REGEX` (an ARN-parsing sketch follows this list), and by renaming the `get_instance_profile` method in the `AWSResources` class to `get_instance_profile_arn` to reflect the change in return type from a string to an ARN. A new method, `get_instance_profile_role_arn`, has also been added to the `AWSResources` class to retrieve the role ARN from the instance profile, and new methods `get_instance_profile_arn` and `instance_lookup` improve testing capabilities.
* Fixed pip install in a multiline cell ([#1728](#1728)). This release includes a fix for an issue where pip install commands spread over multiple lines were not handled correctly (issue [#1728](#1728), issue [#1642](#1642)). The `build_dependency_graph` function of the `PipCell` class has been updated to properly register the library specified in the pip install command, even if it is spread over multiple lines: the function now splits the original code by spaces or new lines, allowing it to extract the library name correctly (a tokenising sketch follows this list). These changes have been thoroughly tested through manual testing and unit tests to ensure that multiline pip install commands now result in the library being installed and registered properly.
* README update about Standard workspaces ([#1734](#1734)). In this release, the README of our open-source library has been updated with documentation on compatibility with Standard workspaces on Databricks, including a new section outlining the incompatibility: a Databricks Premium or Enterprise workspace is now a prerequisite for using this library. These updates are purely informational and do not change any commands, workflows, tables, or functionality within the code; accordingly, no automated tests are included, and the documentation changes were manually reviewed for accuracy. The target audience is software engineers adopting the project who need guidance on Standard workspace compatibility.
* Show code problems found by workflow linter in the migration dashboard ([#1741](#1741)). This commit introduces a new feature to the migration dashboard: an experimental workflow linter that identifies code compatibility problems for Unity Catalog integration. The feature includes a new CLI command, `migration_report`, which refreshes the migration dashboard after all previous tasks are completed, and modifies the existing `databricks labs ucx ...` command. The `experimental-workflow-linter` workflow has also been changed, and new functionality has been added in the form of a new workflow. A new SQL query for displaying code compatibility problems is located in the file "02_1_code_compatibility_problems.sql". User documentation has been added, and the changes have been manually tested. This feature improves the migration dashboard and helps software engineers identify and resolve code compatibility issues during the migration process.
* Support for s3a/s3n protocols when using mount point ([#1765](#1765)). In this release, we have added support for the s3a and s3n protocols when using mount points in the metastore locations. A new static method, `_get_ext_location_definitions`, has been introduced, which generates a name for a resource defined by the location and now supports the additional prefixes "s3a://" and "s3n://" for defining resources in S3 (a prefix-handling sketch follows this list). For Azure Blob Storage, the container name is extracted from the location and included in the resource name. If the location does not match the supported formats, a warning is logged and the script is not generated. These changes offer more flexibility in defining resources and improve the system's ability to handle various cloud storage solutions. Additionally, the `test_save_external_location_mapping_missing_location` function in `test_locations.py` has been updated to include test cases for the s3a and s3n protocols.
* Support joining an existing collection when installing UCX ([#1675](#1675)). The AccountInstaller class has been updated to allow users to join an existing collection during UCX installation. This is achieved by presenting the user with a list of workspaces they have access to, letting them select one, and then checking whether existing workspace IDs are present in the selected workspace: if so, the installation joins the corresponding collection; otherwise, a new collection is created. This feature simplifies UCX migration for large organizations with multiple workspaces by allowing them to manage collections instead of individual workspaces. Relevant user documentation and CLI commands have been updated. The commit adds new methods, `join_collection` and `is_account_install`, updates the `install_on_account` method to call `join_collection` if specified, and includes unit and integration tests to ensure the feature works as intended.
* Updated UCX job cluster policy AWS zone_id to `auto` ([#1735](#1735)). In this release, the UCX job cluster policy for AWS has been updated to use `auto` for the zone_id, allowing Databricks to choose the zone based on a default value in the region. This change, which resolves issue [#533](#533), affects the definition method in the policy.py file, where a check has been added to remove `aws_attributes.zone_id` if an instance pool ID is provided (a sketch of this rule follows this list). The tests for this change include manual testing and new unit tests, with modifications to existing workflows; the diff shows updates to the test_policy.py file, where `aws_attributes.zone_id` is set to `auto` in several functions. No new CLI commands or documentation are part of this update.
* Updated assessment.md - `spark.catalog.x` guidance needed updating ([#1708](#1708)). With the release of DBR 14+, the `spark.catalog.*` functions, which were previously not recommended for use on shared compute clusters for security reasons, are now considered safe to use. This change in guidance is reflected in the updated assessment.md document, which also notes that `spark.sql("<sql command>")` may still be a more suitable alternative for certain common spark.catalog functions such as tableExists, listTables, and setDefaultCatalog; the corresponding `spark._jsparkSession.catalog` methods are also mentioned as potential alternatives on DBR 14.1 and above (a side-by-side sketch follows this list). No new methods or functionality have been added, and no existing functionality has been changed - only the documentation guidance has been updated. The update has been manually tested for accuracy.
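
The HMS stats entry above describes a refresh-then-measure pattern. A minimal sketch, assuming a `backend` stand-in with `execute`/`fetch` methods (not UCX's actual interface) and a Delta table that supports `DESCRIBE DETAIL`:

```python
def table_size_in_bytes(backend, database: str, table: str) -> int:
    full_name = f"{database}.{table}"
    # Refresh HMS statistics first so the size read below is never stale;
    # NOSCAN updates metadata without re-reading the table's data files.
    backend.execute(f"ANALYZE TABLE {full_name} COMPUTE STATISTICS NOSCAN")
    # Then read the freshly computed details; `sizeInBytes` is a column
    # in the DESCRIBE DETAIL output for Delta tables.
    rows = list(backend.fetch(f"DESCRIBE DETAIL {full_name}"))
    return rows[0]["sizeInBytes"]
```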
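
The `aws_account_id` lookup can be done with the standard AWS CLI call `aws sts get-caller-identity`; a hedged sketch, since UCX's own command runner is shaped differently:

```python
import json
import shutil
import subprocess

def aws_account_id(profile: str) -> str:
    # Mirror the "CLI not found" error path mentioned in the entry above.
    if shutil.which("aws") is None:
        raise FileNotFoundError("aws CLI not found on PATH")
    result = subprocess.run(
        ["aws", "sts", "get-caller-identity", "--profile", profile, "--output", "json"],
        capture_output=True, check=True, text=True,
    )
    # The call returns {"UserId": ..., "Account": ..., "Arn": ...}.
    return json.loads(result.stdout)["Account"]
```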
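
For the pip dependency detection, an installed package's dist-info metadata lists its direct dependencies; a sketch using the standard library plus `packaging` (UCX's DistInfoPackage is richer than this):

```python
from importlib.metadata import requires
from packaging.requirements import Requirement

def direct_dependencies(library: str) -> list[str]:
    # requires() reads the Requires-Dist lines from the installed
    # package's dist-info; Requirement parses each into name + specifier.
    return [Requirement(line).name for line in (requires(library) or [])]

# Example: direct_dependencies("requests") includes "idna" and "urllib3".
```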
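
A hedged sketch of a `JobProblem`-style record with an `as_message()` method; the field names are illustrative, not UCX's exact schema:

```python
from dataclasses import dataclass

@dataclass
class JobProblem:
    job_id: int
    job_name: str
    path: str
    start_line: int
    code: str
    message: str

    def as_message(self) -> str:
        # One log-friendly line per problem, pointing at the offending file.
        return (f"{self.job_name} ({self.job_id}): "
                f"{self.path}:{self.start_line} [{self.code}] {self.message}")
```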
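
The `create-catalogs-schemas` tests rely on the fact that `fnmatch`'s `*` also matches `/`, so a pattern rooted at the external location matches arbitrarily deep nesting; the paths below are invented:

```python
from fnmatch import fnmatch

external_location = "s3://bucket/landing/*"
# `*` in fnmatch matches "/" too, so a table several levels below the
# external location still matches the pattern.
assert fnmatch("s3://bucket/landing/bronze/sales/2024/table1", external_location)
```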
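
The local file resolver's package rule reduces to a filesystem check; a sketch with illustrative names:

```python
from pathlib import Path

def resolve_local_package(root: Path, name: str) -> Path | None:
    # A directory is treated as an importable package when it contains
    # __init__.py, rather than consulting dist-info metadata.
    candidate = root / name / "__init__.py"
    return candidate.parent if candidate.is_file() else None
```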
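
For the instance-profile fix, the essential move is extracting the role name from the role ARN instead of assuming it equals the profile name; the regex below is an assumption about the shape of `ROLE_NAME_REGEX`, not a copy of it:

```python
import re

# Hypothetical pattern: capture the resource name from an IAM role ARN.
ROLE_NAME_REGEX = re.compile(r"arn:aws:iam::\d+:role/([\w+=,.@-]+)")

def role_name(role_arn: str) -> str | None:
    match = ROLE_NAME_REGEX.match(role_arn)
    return match.group(1) if match else None

assert role_name("arn:aws:iam::123456789012:role/my-role") == "my-role"
```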
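
The multiline `%pip install` fix amounts to whitespace-insensitive tokenising; a sketch, with simplified flag handling:

```python
import re

def library_from_pip_cell(code: str) -> str | None:
    # Fold "\"-continued lines, then split on any whitespace so
    # `%pip install foo` spread over several lines still yields "foo".
    tokens = [t for t in re.split(r"\s+", code.replace("\\\n", " ")) if t]
    if "install" in tokens:
        for token in tokens[tokens.index("install") + 1:]:
            if not token.startswith("-"):  # skip flags like --quiet
                return token
    return None

assert library_from_pip_cell("%pip install \\\n  databricks-sdk") == "databricks-sdk"
```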
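
The s3a/s3n support boils down to accepting more prefixes when deriving a resource name from a location; the naming scheme below is invented for illustration:

```python
SUPPORTED_PREFIXES = ("s3://", "s3a://", "s3n://")

def resource_name_for(location: str) -> str | None:
    for prefix in SUPPORTED_PREFIXES:
        if location.startswith(prefix):
            # Derive a flat resource name from the bucket and key path.
            return location.removeprefix(prefix).rstrip("/").replace("/", "_")
    return None  # caller logs a warning and skips script generation
```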
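
The zone_id policy rule, sketched as a function over cluster-policy JSON keys (the surrounding policy structure is simplified):

```python
def aws_zone_policy_settings(instance_pool_id: str | None) -> dict[str, dict]:
    if instance_pool_id:
        # The pool already determines the zone, so a fixed
        # aws_attributes.zone_id would conflict and is dropped.
        return {"instance_pool_id": {"type": "fixed", "value": instance_pool_id}}
    # Otherwise "auto" lets Databricks pick a zone for the region.
    return {"aws_attributes.zone_id": {"type": "fixed", "value": "auto"}}
```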
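
The two styles from the assessment guidance side by side; `spark` is the notebook's session and the table name is made up:

```python
# On DBR 14+ this is now considered safe on shared clusters:
exists = spark.catalog.tableExists("main.sales.orders")

# SQL alternative that also works where spark.catalog.* is restricted:
exists_via_sql = spark.sql("SHOW TABLES IN main.sales LIKE 'orders'").count() > 0
```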

Dependency updates:

 * Updated sqlglot requirement from <23.15,>=23.9 to >=23.9,<23.16 ([#1681](#1681)).
 * Updated databricks-labs-blueprint requirement from <0.6.0,>=0.4.3 to >=0.4.3,<0.7.0 ([#1688](#1688)).
 * Updated sqlglot requirement from <23.16,>=23.9 to >=23.9,<23.18 ([#1724](#1724)).
 * Updated sqlglot requirement from <23.18,>=23.9 to >=23.9,<24.1 ([#1745](#1745)).
 * Updated databricks-sdk requirement from ~=0.27.0 to >=0.27,<0.29 ([#1756](#1756)).
 * Bump databrickslabs/sandbox from acceptance/v0.2.1 to 0.2.2 ([#1769](#1769)).
@nfx nfx mentioned this issue May 27, 2024
nfx added a commit that referenced this issue May 27, 2024