Release v0.24.0 #1775

Merged: nfx merged 1 commit into main from prepare/0.24.0 on May 27, 2024
Conversation

@nfx (Collaborator) commented May 27, 2024

* Added `%pip` cell resolver ([#1697](#1697)). A newly developed pip resolver has been integrated into the `ImportResolver` for future use, addressing issue [#1642](#1642) and following up on [#1694](#1694). The resolver installs libraries and modifies the path lookup to make them available for import. This change affects existing workflows but does not introduce new CLI commands, tables, or files. The commit includes modifications to the `build_dependency_graph` method and the addition of unit tests to verify the new functionality. The resolver has been manually tested and passes the unit tests, ensuring better compatibility and accessibility for libraries used in the project.
* Added local download of `requirements.txt` dependencies to register them in the dependency graph ([#1753](#1753)). This commit introduces support for linting job tasks that specify their dependencies in a `requirements.txt` file. It resolves issue [#1644](#1644) and is similar to [#1704](#1704). The changes include the addition of a new CLI command, modification of the existing `databricks labs ucx ...` command, and modification of the `experimental-workflow-linter` workflow. The `lint_job` method has been updated to handle dependencies specified in a `requirements.txt` file, checking for their presence in the job's libraries list and flagging any missing dependencies. The `jobs.py` file has been modified to register libraries specified in a `requirements.txt` file to the dependency graph, and handling of jar libraries is included. Unit and integration tests have been added to verify the new functionality. The code includes TODO comments for future enhancements, such as downloading the library wheel and adding it to the virtual system path, and handling references to other requirements and constraints files.
* Added ability to install UCX on workspaces without public internet connectivity ([#1566](#1566)). A new flag, `upload_dependencies`, has been added to the `WorkspaceConfig` to let users upload dependencies to air-gapped workspaces. The flag is a boolean, `False` by default, and can be set through the installation prompt; when set to `True`, it triggers the upload of the specified dependencies during installation, allowing UCX to be installed on workspaces without public internet access. This feature resolves issue [#573](#573) and was co-authored by hari-selvarajan_data. This change also updates the `databricks-labs-blueprint` requirement from `>=0.4.3,<0.6.0` to `>=0.4.3,<0.7.0`, which may include changes to existing functionality. Additionally, new test functions cover uploading dependencies when the `upload_dependencies` flag is set to `True`.
* Added initial interface for data comparison framework ([#1695](#1695)). This commit introduces the initial interface for a data comparison framework, including classes and methods for managing metadata, profiling data, and comparing schema and data for tables. A new `StandardDataComparator` class has been implemented for comparing the data of two tables, and a `StandardSchemaComparator` class for comparing table schemas. The framework also includes the `DatabricksTableMetadataRetriever` class for retrieving metadata about a given table using a SQL backend. Additional classes and methods will be implemented in future work to round out the framework, such as `StandardDataProfiler` for profiling data, `SchemaComparator` and `DataComparator` for comparing schema and data, and test fixtures and functions for testing the framework. This release lays the groundwork for performing comprehensive data comparisons effectively.
* Added lint local code command ([#1710](#1710)). A new `lint local code` command has been added to the databricks labs ucx tool, allowing users to assess required migrations in a local directory or file. This command detects dependencies and analyzes them, currently supporting Python and SQL files, with an expected runtime of under a minute for code bases up to 50,000 lines of code. The command generates output that includes file links opening the file at the problematic line in modern IDEs, providing a quick and easy way to identify necessary migrations. The `lint-local-code` command is implemented in the `application.py` file, with supporting methods and classes added to the `workspace_cli.py` and `databricks.labs.ucx.source_code` packages, enhancing the linting process and providing valuable feedback for maintaining high code quality standards.
* Added table in mount migration ([#1225](#1225)). This commit introduces new functionality to migrate tables in mounts to the Unity Catalog, including creating a table in the Unity Catalog based on a table mapping CSV file, fixing an issue with `include_paths_in_mount` not being present in `workflows.py`, and adding the ability to set default ownership on each created table. A new method, `ScanTablesInMounts`, has been added to scan tables in mounts, and a `TableMigration` class creates tables in the Unity Catalog based on the table mapping. Two new methods, `Rule` and `TableMapping`, have been added to manage table mappings, and `TableToMigrate` represents a table that needs to be migrated to Unity Catalog. The commit includes manual, unit, and integration testing to ensure the changes work as expected. The diff shows changes to the `workflows.py` file and the addition of several new methods, including `Rule`, `TableMapping`, `TableToMigrate`, `create_autospec`, and `MockBackend`.
* Added workflows to trigger table reconciliations ([#1721](#1721)). This release introduces several enhancements to the table migration workflow, focusing on data reconciliation and consistency. A new post-migration data reconciliation task validates migrated table integrity by comparing the schema, row count, and individual row content of the source and target tables (a sketch of these checks follows the list). The task stores and displays the number of missing rows in the Migration dashboard's `$inventory_database.reconciliation_results` view. Additionally, new workflows automatically trigger table reconciliations, ensuring consistency and integrity between different data sources. Furthermore, new configuration options for table reconciliation are now available in the `WorkspaceConfig` class, allowing greater control and flexibility over migration processes. Together, these improvements give users enhanced data consistency and more efficient table reconciliation management.
* Always refresh HMS stats when getting table size ([#1713](#1713)). A change has been implemented in the `hive_metastore` library to improve the precision of table size calculations by ensuring that HMS stats are always refreshed before being retrieved. This is achieved by calling the `ANALYZE TABLE` command with the `COMPUTE STATISTICS NOSCAN` option before computing the table size, preventing the use of stale stats (a sketch of the pattern follows the list). Specifically, the `backend.queries` list now includes two `ANALYZE` statements for tables `db1.table1` and `db1.table2`, ensuring their statistics are updated and accurate. The test case `test_table_size_crawler` in `test_table_size.py` has been revised to validate the presence of the two `ANALYZE` statements in the `backend.queries` list and to confirm the size of the results for both tables. This commit also includes manual testing, added unit tests, and verification on the staging environment.
* Automatically retrieve `aws_account_id` from the AWS profile instead of prompting ([#1715](#1715)). This commit introduces several improvements to the library's AWS integration, enhancing automation and user experience. It eliminates the need for manual input of `aws_account_id` by automatically retrieving it from the AWS profile. An optional `kms-key` flag has been documented for creating roles, providing more flexibility. The `create-missing-principals` command now accepts optional parameters such as KMS key, role name, and policy name, and allows creating a single role for all S3 locations, with a default behavior of creating one role per S3 location. These changes have been manually tested and verified in a staging environment, and resolve issue [#1714](#1714). A new method simulating a successful AWS CLI call has been added, replacing `aws_cli_run_command` to ensure automated retrieval of `aws_account_id`, and a test has been added that raises an error when the AWS CLI is not found in the system path.
* Detect dependencies of libraries installed via pip ([#1703](#1703)). This commit introduces a child dependency graph for libraries resolved via pip using `DistInfo` data, addressing issues [#1642](#1642) and [#1202](#1202). It modifies certain tests and reduces their execution time. The `PipResolver` class in `databricks.labs.ucx.source_code.graph` is used to detect and resolve library dependencies installed via pip, with methods to locate, install, and register libraries in a specified folder. A new `Whitelist` feature and an updated `DistInfoPackage` class are also included. Although unit tests have been added, no new user documentation, CLI commands, workflows, or tables have been added or modified. The previous `site_packages` attribute has been removed from the `GlobalContext` class.
* Emit problems with code belonging to job ([#1730](#1730)). In this release, the `jobs.py` file has been updated so that the `JobProblem` class can convert itself into a string message via a new `as_message()` method (a hypothetical shape for this is sketched after the list). The `refresh_report()` method has been modified to call a new `_lint_job()` method when provided with a job object, returning a list of `JobProblem` instances, and the `lint_job()` method has been updated to do the same, logging warning messages when problems are found. The changes have been tested through a new integration test, `test_job_linter_some_notebook_graph_with_problems`, a unit test for `as_message()`, and manual testing. This release addresses issue [#1542](#1542) and improves the job linter's ability to detect and emit problems related to code belonging to a job during job linting. The `DependencyResolver` in the `DependencyGraph` constructor has also been modified.
* Fixed `create-catalogs-schemas` to allow more than one level of nesting beyond the external location ([#1701](#1701)). The `create-catalogs-schemas` logic has been updated to allow more than one level of nesting beyond the external location, addressing issue [#1700](#1700). This release includes a new CLI command as well as modifications to the existing `databricks labs ucx ...` command. A new workflow has been added and existing functionality has been changed to support the additional nesting levels. The changes have been thoroughly tested through manual testing, unit tests, and integration tests using the `fnmatch.fnmatch` method for validating location patterns (illustrated in a sketch after the list).
* Fixed local file resolver logic with relative paths and site-packages ([#1685](#1685)). This commit addresses an issue ([#1685](#1685)) related to the local file resolver logic for relative paths and site-packages. The resolver's logic has been updated to look for `_package_/__init__.py` instead of relying on `dist-info` metadata, and the resolver has been wired back into the global resolver chain with updated calling code. No changes have been made to user documentation, CLI commands, workflows, or tables. New methods have not been added, but existing functionality has been modified to enhance local file resolution handling. Unit tests have been added and manually verified to ensure proper functionality.
* Fixed lookup logic where the instance profile name does not match the role name ([#1716](#1716)). A fix has been implemented to improve the robustness of the instance profile lookup mechanism. Previously, the code relied on the role name being the same as the instance profile name, which caused issues when the names did not match ([#1716](#1716), [#1711](#1711)). This has been addressed by updating the `role_name` method in the `AWSRoleAction` class to use a new regex pattern, `AWSResources.ROLE_NAME_REGEX` (a generic sketch of this extraction follows the list), and renaming the `get_instance_profile` method in the `AWSResources` class to `get_instance_profile_arn` to reflect the change in return type from a string to an ARN. A new method, `get_instance_profile_role_arn`, has also been added to the `AWSResources` class to retrieve the role ARN from the instance profile, and new `get_instance_profile_arn` and `instance_lookup` methods improve testing capabilities.
* Fixed pip install in a multiline cell ([#1728](#1728)). This release fixes an issue where `pip install` commands spread over multiple lines were not handled correctly ([#1728](#1728), [#1642](#1642)). The `build_dependency_graph` function of the `PipCell` class has been updated to properly register the library specified in the `pip install` command even when it spans multiple lines: the function now splits the original code by spaces or new lines, allowing it to extract the library name correctly (a sketch of this parsing follows the list). These changes have been tested through manual testing and unit tests, ensuring that multiline `pip install` commands now install and register the library properly.
* README update about Standard workspaces ([#1734](#1734)). In this release, the README has been updated to document compatibility with Standard workspaces on Databricks, including an incompatibility section aimed at users of Standard workspaces. These updates are purely informational: no commands, workflows, tables, or functionality have changed, and the commit includes no tests since the changes are limited to documentation, which has been manually checked for accuracy. Please note that a Databricks Premium or Enterprise workspace is now a prerequisite for using this library.
* Show code problems found by workflow linter in the migration dashboard ([#1741](#1741)). This commit introduces a new feature to the migration dashboard: an experimental workflow linter that identifies code compatibility problems for Unity Catalog integration. The feature includes a new CLI command, `migration_report`, which refreshes the migration dashboard after all previous tasks are completed, and modifies the existing `databricks labs ucx ...` command. The `experimental-workflow-linter` workflow has also been changed, and a new workflow has been added. A new SQL query for displaying code compatibility problems lives in `02_1_code_compatibility_problems.sql`. User documentation has been added, and the changes have been manually tested. This feature improves the migration dashboard and helps identify and resolve code compatibility issues during the migration process.
* Support for s3a/s3n protocols when using mount points ([#1765](#1765)). In this release, we have added support for the s3a and s3n protocols when using mount points in the metastore locations. A new static method, `_get_ext_location_definitions`, generates a name for a resource defined by the location and now supports the additional prefixes `s3a://` and `s3n://` for defining resources in S3 (see the prefix-handling sketch after the list). For Azure Blob Storage, the container name is extracted from the location and included in the resource name. If the location does not match the supported formats, a warning is logged and the script is not generated. Additionally, the `test_save_external_location_mapping_missing_location` function in `test_locations.py` has been updated to include test cases for the s3a and s3n protocols.
* Support joining an existing collection when installing UCX ([#1675](#1675)). The `AccountInstaller` class has been updated to allow users to join an existing collection during UCX installation. This is achieved by presenting the user with a list of workspaces they have access to, letting them select one, and then checking whether existing workspace IDs are present in the selected workspace: if so, the installation joins the corresponding collection; otherwise, a new collection is created. This feature simplifies UCX migration for large organizations with multiple workspaces by letting them manage collections instead of individual workspaces. Relevant user documentation and CLI commands have been updated, and the commit adds new methods, `join_collection` and `is_account_install`, as well as updates to the `install_on_account` method to call `join_collection` if specified. Unit and integration tests have been added to ensure the proper functioning of the new feature.
* Updated UCX job cluster policy AWS `zone_id` to `auto` ([#1735](#1735)). In this release, the UCX job cluster policy for AWS has been updated to use `auto` for the `zone_id`, allowing Databricks to choose the zone based on a default value in the region. This change, which resolves issue [#533](#533), affects the definition method in the `policy.py` file, where a check has been added to remove `aws_attributes.zone_id` if an instance pool ID is provided. The tests for this change include manual testing and new unit tests, with modifications to existing workflows. The diff shows updates to the `test_policy.py` file, where `aws_attributes.zone_id` is set to `auto` in several functions. No new CLI commands or documentation accompany this update.
* Updated assessment.md - `spark.catalog.x` guidance needed updating ([#1708](#1708)). With the release of DBR 14+, the `spark.catalog.*` functions, which were previously not recommended for use on shared compute clusters for security reasons, are now considered safe to use. This change in guidance is reflected in the updated assessment.md document, which also notes that `spark.sql("<sql command>")` may still be a more suitable alternative for certain common `spark.catalog` functions such as `tableExists`, `listTables`, and `setDefaultCatalog` (compared in a sketch after the list). The corresponding `spark._jsparkSession.catalog` methods are also mentioned as potential alternatives on DBR 14.1 and above. No new methods or functionality have been added and no existing functionality has changed; only the documentation guidance has been updated. The update has been manually tested for accuracy.
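
A few illustrative sketches follow for the items above; none reproduce the actual ucx implementation, and any names not mentioned in the notes are assumptions. First, the three reconciliation checks (schema, row count, row-level content) from [#1721](#1721), written against the PySpark DataFrame API with hypothetical table names and the ambient `spark` session of a Databricks notebook:

```python
# Hypothetical source/target pair; `spark` is the SparkSession that
# Databricks notebooks provide implicitly.
source, target = "hive_metastore.db1.table1", "main.db1.table1"

schema_ok = spark.table(source).schema == spark.table(target).schema
count_ok = spark.table(source).count() == spark.table(target).count()
# Rows present in the source but absent from the target (duplicates counted),
# i.e. the "missing rows" figure surfaced in reconciliation_results.
missing_rows = spark.table(source).exceptAll(spark.table(target)).count()
```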
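
The refresh-then-measure pattern from [#1713](#1713), with a hypothetical `query` helper standing in for the SQL backend and `totalSize` assumed as one property where HMS exposes the refreshed statistic:

```python
def table_size_in_bytes(query, full_table_name: str) -> int | None:
    # Refresh HMS statistics first; NOSCAN updates metadata without
    # reading file contents, so the value read below is never stale.
    query(f"ANALYZE TABLE {full_table_name} COMPUTE STATISTICS NOSCAN")
    rows = query(f"SHOW TBLPROPERTIES {full_table_name} ('totalSize')")
    return int(rows[0]["value"]) if rows else None
```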
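
A plausible shape for `JobProblem.as_message()` from [#1730](#1730); the field names are assumptions rather than the ucx definition:

```python
from dataclasses import dataclass

@dataclass
class JobProblem:
    job_id: int
    job_name: str
    path: str
    code: str
    message: str

    def as_message(self) -> str:
        # One human-readable line per problem, suitable for the warning
        # logs emitted when linting finds issues.
        return f"{self.job_name} (job {self.job_id}): [{self.code}] {self.path}: {self.message}"
```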
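
How `fnmatch.fnmatch` accepts arbitrarily deep nesting below an external location ([#1701](#1701)): `*` in fnmatch matches across `/`, unlike filesystem globbing. Paths are hypothetical:

```python
from fnmatch import fnmatch

external_location = "s3://bucket/prefix"
# Two levels below the external location still match the pattern.
assert fnmatch("s3://bucket/prefix/catalog/schema", external_location + "/*")
```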
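
A generic version of extracting a role name from a role or instance-profile ARN with a regex ([#1716](#1716)); the actual `AWSResources.ROLE_NAME_REGEX` is not reproduced here:

```python
import re

# Accepts both role and instance-profile ARNs; the character class follows
# AWS role-name rules.
ROLE_NAME_RE = re.compile(r"arn:aws:iam::\d+:(?:role|instance-profile)/(?P<name>[\w+=,.@-]+)")

def role_name(arn: str) -> str | None:
    match = ROLE_NAME_RE.match(arn)
    return match["name"] if match else None

assert role_name("arn:aws:iam::123456789012:role/my-ucx-role") == "my-ucx-role"
```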
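
The whitespace-splitting approach to multiline `%pip install` cells ([#1728](#1728)), as a simplified stand-in for `PipCell.build_dependency_graph`:

```python
def extract_library(cell_code: str) -> str | None:
    # str.split() with no arguments splits on spaces *and* newlines;
    # dropping bare "\" tokens handles shell-style line continuations.
    tokens = [t for t in cell_code.split() if t != "\\"]
    for i, token in enumerate(tokens):
        if token == "install" and i + 1 < len(tokens):
            return tokens[i + 1]
    return None

assert extract_library("%pip install \\\n    databricks-sdk") == "databricks-sdk"
```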
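
Prefix handling along the lines described for `_get_ext_location_definitions` ([#1765](#1765)); the method name comes from the notes, but the resource-naming scheme here is an assumption:

```python
import logging

logger = logging.getLogger(__name__)
SUPPORTED_PREFIXES = ("s3://", "s3a://", "s3n://")

def resource_name_for(location: str) -> str | None:
    for prefix in SUPPORTED_PREFIXES:
        if location.startswith(prefix):
            # e.g. "s3a://bucket/a/b" -> "bucket_a_b" (illustrative scheme)
            return location.removeprefix(prefix).rstrip("/").replace("/", "_")
    # Unsupported format: warn and generate nothing, as described above.
    logger.warning("unsupported location format, skipping: %s", location)
    return None
```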
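
Finally, the two equivalent lookups from the updated assessment.md guidance ([#1708](#1708)), runnable on DBR 14+ where `spark` is the notebook's ambient session:

```python
# Safe on DBR 14+ shared clusters per the updated guidance.
exists_via_catalog = spark.catalog.tableExists("hive_metastore.db1.table1")

# The spark.sql(...) alternative the guidance still suggests for some cases.
exists_via_sql = bool(
    spark.sql("SHOW TABLES IN hive_metastore.db1 LIKE 'table1'").count()
)
```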

Dependency updates:

 * Updated `sqlglot` requirement from `>=23.9,<23.15` to `>=23.9,<23.16` ([#1681](#1681)).
 * Updated `databricks-labs-blueprint` requirement from `>=0.4.3,<0.6.0` to `>=0.4.3,<0.7.0` ([#1688](#1688)).
 * Updated `sqlglot` requirement from `>=23.9,<23.16` to `>=23.9,<23.18` ([#1724](#1724)).
 * Updated `sqlglot` requirement from `>=23.9,<23.18` to `>=23.9,<24.1` ([#1745](#1745)).
 * Updated `databricks-sdk` requirement from `~=0.27.0` to `>=0.27,<0.29` ([#1756](#1756)).
 * Bump `databrickslabs/sandbox` from `acceptance/v0.2.1` to `0.2.2` ([#1769](#1769)).
@nfx nfx requested review from a team and gcwang-db May 27, 2024 12:25
@nfx nfx merged commit 9b83666 into main May 27, 2024
5 of 6 checks passed
@nfx nfx deleted the prepare/0.24.0 branch May 27, 2024 12:26

❌ 177/178 passed, 1 failed, 25 skipped, 2h36m56s total

❌ test_compare_remote_local_install_versions: Failed: DID NOT RAISE (38.484s)
Failed: DID NOT RAISE <class 'RuntimeWarning'>
[gw8] linux -- Python 3.10.14 /home/runner/work/ucx/ucx/.venv/bin/python
12:35 INFO [databricks.labs.ucx.mixins.fixtures] Schema hive_metastore.ucx_sezkz: https://DATABRICKS_HOST/explore/data/hive_metastore/ucx_sezkz
12:35 DEBUG [databricks.labs.ucx.mixins.fixtures] added schema fixture: SchemaInfo(browse_only=None, catalog_name='hive_metastore', catalog_type=None, comment=None, created_at=None, created_by=None, effective_predictive_optimization_flag=None, enable_predictive_optimization=None, full_name='hive_metastore.ucx_sezkz', metastore_id=None, name='ucx_sezkz', owner=None, properties=None, schema_id=None, storage_location=None, storage_root=None, updated_at=None, updated_by=None)
12:35 DEBUG [databricks.labs.ucx.install] Cannot find previous installation: Path (/Users/0a330eb5-dd51-4d97-b6e4-c474356b1d5d/.O6xq/config.yml) doesn't exist.
12:35 INFO [databricks.labs.ucx.install] Please answer a couple of questions to configure Unity Catalog migration
12:35 INFO [databricks.labs.ucx.installer.hms_lineage] HMS Lineage feature creates one system table named system.hms_to_uc_migration.table_access and helps in your migration process from HMS to UC by allowing you to programmatically query HMS lineage data.
12:35 INFO [databricks.labs.ucx.install] Fetching installations...
12:36 INFO [databricks.labs.ucx.installer.policy] Creating UCX cluster policy.
12:36 DEBUG [tests.integration.conftest] Waiting for clusters to start...
12:36 DEBUG [tests.integration.conftest] Waiting for clusters to start...
12:36 INFO [databricks.labs.ucx.install] Installing UCX v0.23.2+5020240527123602
12:36 INFO [databricks.labs.ucx.install] Creating ucx schemas...
12:36 INFO [databricks.labs.ucx.installer.workflows] Creating new job configuration for step=migrate-data-reconciliation
12:36 INFO [databricks.labs.ucx.installer.workflows] Creating new job configuration for step=migrate-external-tables-ctas
12:36 INFO [databricks.labs.ucx.installer.workflows] Creating new job configuration for step=validate-groups-permissions
12:36 INFO [databricks.labs.ucx.installer.workflows] Creating new job configuration for step=remove-workspace-local-backup-groups
12:36 INFO [databricks.labs.ucx.installer.workflows] Creating new job configuration for step=migrate-tables-in-mounts-experimental
12:36 INFO [databricks.labs.ucx.installer.workflows] Creating new job configuration for step=failing
12:36 INFO [databricks.labs.ucx.installer.workflows] Creating new job configuration for step=migrate-external-hiveserde-tables-in-place-experimental
12:36 INFO [databricks.labs.ucx.installer.workflows] Creating new job configuration for step=experimental-workflow-linter
12:36 INFO [databricks.labs.ucx.installer.workflows] Creating new job configuration for step=assessment
12:36 INFO [databricks.labs.ucx.installer.workflows] Creating new job configuration for step=scan-tables-in-mounts-experimental
12:36 INFO [databricks.labs.ucx.installer.workflows] Creating new job configuration for step=migrate-groups-experimental
12:36 INFO [databricks.labs.ucx.installer.workflows] Creating new job configuration for step=migrate-groups
12:36 INFO [databricks.labs.ucx.installer.workflows] Creating new job configuration for step=migrate-tables
12:36 INFO [databricks.labs.ucx.install] Installation completed successfully! Please refer to the https://DATABRICKS_HOST/#workspace/Users/0a330eb5-dd51-4d97-b6e4-c474356b1d5d/.O6xq/README for the next steps.
12:36 INFO [databricks.labs.ucx.install] Deleting UCX v0.23.2+5020240527123602 from https://DATABRICKS_HOST
12:36 INFO [databricks.labs.ucx.install] Deleting inventory database ucx_sezkz
12:36 INFO [databricks.labs.ucx.install] Deleting jobs
12:36 INFO [databricks.labs.ucx.install] Deleting migrate-data-reconciliation job_id=1065212200526656.
12:36 INFO [databricks.labs.ucx.install] Deleting migrate-external-tables-ctas job_id=191386804586092.
12:36 INFO [databricks.labs.ucx.install] Deleting validate-groups-permissions job_id=432783329453859.
12:36 INFO [databricks.labs.ucx.install] Deleting remove-workspace-local-backup-groups job_id=1118958600147088.
12:36 INFO [databricks.labs.ucx.install] Deleting migrate-tables-in-mounts-experimental job_id=1065086719537745.
12:36 INFO [databricks.labs.ucx.install] Deleting failing job_id=884695278792137.
12:36 INFO [databricks.labs.ucx.install] Deleting migrate-external-hiveserde-tables-in-place-experimental job_id=914752947872916.
12:36 INFO [databricks.labs.ucx.install] Deleting experimental-workflow-linter job_id=316798349399292.
12:36 INFO [databricks.labs.ucx.install] Deleting assessment job_id=587267059474907.
12:36 INFO [databricks.labs.ucx.install] Deleting scan-tables-in-mounts-experimental job_id=64063881324415.
12:36 INFO [databricks.labs.ucx.install] Deleting migrate-groups-experimental job_id=844462813714812.
12:36 INFO [databricks.labs.ucx.install] Deleting migrate-groups job_id=895919055495002.
12:36 INFO [databricks.labs.ucx.install] Deleting migrate-tables job_id=354297573412843.
12:36 INFO [databricks.labs.ucx.install] Deleting cluster policy
12:36 INFO [databricks.labs.ucx.install] Deleting secret scope
12:36 INFO [databricks.labs.ucx.install] UnInstalling UCX complete
12:36 DEBUG [databricks.labs.ucx.mixins.fixtures] clearing 0 workspace user fixtures
12:36 DEBUG [databricks.labs.ucx.mixins.fixtures] clearing 0 account group fixtures
12:36 DEBUG [databricks.labs.ucx.mixins.fixtures] clearing 0 workspace group fixtures
12:36 DEBUG [databricks.labs.ucx.mixins.fixtures] clearing 0 table fixtures
12:36 DEBUG [databricks.labs.ucx.mixins.fixtures] clearing 0 table fixtures
12:36 DEBUG [databricks.labs.ucx.mixins.fixtures] clearing 1 schema fixtures
12:36 DEBUG [databricks.labs.ucx.mixins.fixtures] removing schema fixture: SchemaInfo(browse_only=None, catalog_name='hive_metastore', catalog_type=None, comment=None, created_at=None, created_by=None, effective_predictive_optimization_flag=None, enable_predictive_optimization=None, full_name='hive_metastore.ucx_sezkz', metastore_id=None, name='ucx_sezkz', owner=None, properties=None, schema_id=None, storage_location=None, storage_root=None, updated_at=None, updated_by=None)

Running from acceptance #3529
