
fix(ibis): implement workaround for empty json result #1013

Merged
merged 3 commits into Canner:main on Dec 25, 2024

Conversation

goldmedal
Contributor

@goldmedal goldmedal commented Dec 25, 2024

Close #909

Description

This PR implements a workaround for the issue to avoid query failures in these special cases.

If the type mapping fails, it uses the native BigQuery client to get the schema and creates an empty pandas DataFrame instead.
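For context, a condensed sketch of that fallback path, adapted from the connector change in this PR (SimpleConnector, DataSource, and the shape of connection_info are assumed from the existing ibis-server codebase and are not defined here):

import base64
from json import loads

import pandas as pd
from google.cloud import bigquery
from google.oauth2 import service_account

# SimpleConnector and DataSource are assumed to come from the ibis-server app package.

class BigQueryConnector(SimpleConnector):
    def __init__(self, connection_info):
        super().__init__(DataSource.bigquery, connection_info)
        self.connection_info = connection_info

    def query(self, sql: str, limit: int) -> pd.DataFrame:
        try:
            return super().query(sql, limit)
        except ValueError as e:
            # The Arrow conversion path raises "Must pass schema" for empty results
            # with JSON/INTERVAL columns (see #909); re-raise anything else.
            if "Must pass schema" not in str(e):
                raise
            # Fallback: run the query with the native BigQuery client, map its schema
            # to ibis, and return an empty DataFrame with the mapped column names.
            import ibis.backends.bigquery

            credentials = service_account.Credentials.from_service_account_info(
                loads(
                    base64.b64decode(
                        self.connection_info.credentials.get_secret_value()
                    ).decode("utf-8")
                )
            )
            client = bigquery.Client(credentials=credentials)
            bq_result = client.query(sql).result()
            ibis_schema = ibis.backends.bigquery.BigQuerySchema().to_ibis(bq_result.schema)
            return pd.DataFrame(columns=ibis_schema.names)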

Summary by CodeRabbit

  • New Features

    • Introduced a new BigQuery connector that enhances query handling, especially for empty results.
  • Bug Fixes

    • Improved error handling for schema issues in SQL queries.
  • Tests

    • Added two new asynchronous tests to validate the response of the BigQuery connector for empty JSON queries and custom data types.


coderabbitai bot commented Dec 25, 2024

Walkthrough

The pull request introduces a new BigQueryConnector class in the ibis-server/app/model/connector.py file to handle specific BigQuery query scenarios, particularly when dealing with empty result sets or queries involving INTERVAL and JSON columns. The implementation adds robust error handling to retrieve the schema and return an empty DataFrame when the initial query execution fails due to schema-related issues. Additionally, new asynchronous tests are added to verify the behavior of the connector with these specific cases.

Changes

  • ibis-server/app/model/connector.py: Added a BigQueryConnector class extending SimpleConnector with a custom query method for BigQuery-specific error handling
  • ibis-server/tests/routers/v2/connector/test_bigquery.py: Added test_query_empty_json and test_custom_datatypes_no_overrides async test functions to validate BigQuery connector behavior with empty JSON queries and custom data types

Assessment against linked issues

  • Handle ValueError for empty BigQuery queries with INTERVAL or JSON columns (#909)
  • Provide schema-based resolution for empty result sets

Poem

🐰 In the realm of BigQuery's might,
A connector springs to coding light
Empty schemas no longer fright
With pandas and error handling tight
Our data flows with rabbit's delight! 🚀


@github-actions github-actions bot added the bigquery, ibis, and python (Pull requests that update Python code) labels on Dec 25, 2024
@goldmedal goldmedal marked this pull request as ready for review December 25, 2024 09:19

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
ibis-server/app/model/connector.py (1)

120-127: String-matching the error message can be fragile.
While searching for "Must pass schema" solves this issue, consider whether a more structured check (e.g., checking exception types or error codes) might be more robust in the future.
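One low-risk way to contain that brittleness is to isolate the check behind a small helper; the sketch below uses a hypothetical _is_missing_schema_error function (not part of this PR), so a structured exception-type or error-code check can later replace the string match in a single place:

def _is_missing_schema_error(error: Exception) -> bool:
    # Hypothetical helper (not in the PR): keeps the brittle "Must pass schema"
    # string match in one place until a structured check becomes available.
    return isinstance(error, ValueError) and "Must pass schema" in str(error)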

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0d0d531 and be86e85.

📒 Files selected for processing (2)
  • ibis-server/app/model/connector.py (3 hunks)
  • ibis-server/tests/routers/v2/connector/test_bigquery.py (1 hunks)
🔇 Additional comments (7)
ibis-server/tests/routers/v2/connector/test_bigquery.py (1)

201-214: Well-structured test for empty JSON queries.
This new async test verifies that the query successfully returns no data while still including the correct dtypes. The logic is concise and reflects the intended functionality of returning an empty DataFrame in these edge cases.

ibis-server/app/model/connector.py (6)

1-15: Imports support new BigQuery functionality.
The introduction of BigQuery-specific imports (ibis.backends.bigquery, google.cloud.bigquery, google.oauth2.service_account) is appropriate and necessary for handling the new BigQueryConnector class.


32-33: Seamlessly integrating the new BigQueryConnector.
Selecting BigQueryConnector when data_source == DataSource.bigquery is a standard approach that cleanly extends the existing connector logic.


111-115: BigQueryConnector initialization looks good.
Properly calling super().__init__(DataSource.bigquery, connection_info) ensures you inherit the fundamental logic from SimpleConnector.


128-131: Ensure credential decoding remains secure.
Decoding credentials from base64 is valid. Ensure that logs and error messages do not accidentally capture sensitive details.


133-140: Schema reconstruction approach is logical.
Retrieving the schema from the BigQuery client query result and creating an empty DataFrame addresses the ibis library’s limitation for special types like JSON or Interval. This is a clear and effective workaround.


141-143: Re-raising other exceptions preserves expected behavior.
Unrelated errors are rightfully allowed to propagate, which maintains error transparency for non-schema-related issues.

@goldmedal goldmedal requested a review from wwwy3y3 December 25, 2024 09:24
@wwwy3y3
Member

wwwy3y3 commented Dec 25, 2024

Try if only json column is empty

@goldmedal
Contributor Author

Try if only json column is empty

Sounds great.
Added tests to assert this case in af5c332
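For reference, the NULL-JSON scenario covered by that commit looks roughly like the sketch below (adapted from the review suggestion further down; client, manifest_str, base_url, and connection_info are fixtures and constants from the existing test module):

async def test_query_null_json(client, manifest_str):
    # A NULL JSON value next to a non-NULL integer should still return one row.
    response = await client.post(
        url=f"{base_url}/query",
        json={
            "manifestStr": manifest_str,
            "connectionInfo": connection_info,
            "sql": "select cast(null as JSON), 1",
        },
    )
    assert response.status_code == 200
    result = response.json()
    assert len(result["data"]) == 1
    assert result["data"][0][0] is None
    assert result["data"][0][1] == 1
    assert result["dtypes"] == {"f0_": "object", "f1_": "int64"}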


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
ibis-server/tests/routers/v2/connector/test_bigquery.py (1)

201-231: LGTM! Consider enhancing test coverage.

The test cases effectively validate the workaround for empty JSON results and NULL handling. However, consider these improvements:

  1. Add docstring explaining the test purpose and scenarios:
 async def test_query_empty_json(client, manifest_str):
+    """
+    Test BigQuery connector's handling of JSON columns in special cases:
+    1. Empty result set with JSON column
+    2. NULL JSON value with non-NULL integer
+    """
  2. Split into separate test functions for better isolation:
async def test_query_empty_json_result(client, manifest_str):
    """Test empty result set with JSON column."""
    response = await client.post(
        url=f"{base_url}/query",
        json={
            "manifestStr": manifest_str,
            "connectionInfo": connection_info,
            "sql": "select json_object('a', 1, 'b', 2) limit 0",
        },
    )
    assert response.status_code == 200
    result = response.json()
    assert len(result["data"]) == 0
    assert result["dtypes"] == {"f0_": "object"}

async def test_query_null_json(client, manifest_str):
    """Test NULL JSON value with non-NULL integer."""
    response = await client.post(
        url=f"{base_url}/query",
        json={
            "manifestStr": manifest_str,
            "connectionInfo": connection_info,
            "sql": "select cast(null as JSON), 1",
        },
    )
    assert response.status_code == 200
    result = response.json()
    assert len(result["data"]) == 1
    assert result["data"][0][0] is None
    assert result["data"][0][1] == 1
    assert result["dtypes"] == {"f0_": "object", "f1_": "int64"}
  3. Add test cases for:
    • Multiple JSON columns in empty result
    • Error scenarios (e.g., invalid JSON)
    • Schema validation
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between be86e85 and af5c332.

📒 Files selected for processing (1)
  • ibis-server/tests/routers/v2/connector/test_bigquery.py (1 hunks)

@goldmedal
Contributor Author

There are some issues with the BigQuery CI, so I tested it locally instead. All tests passed.

~/git/wren-engine/ibis-server ▓▒░ just test bigquery
poetry run pytest -m 'bigquery'
============================= test session starts =============================
platform darwin -- Python 3.11.11, pytest-8.3.4, pluggy-1.5.0
rootdir: /Users/jax/git/wren-engine/ibis-server
configfile: pyproject.toml
plugins: anyio-4.7.0
collected 221 items / 189 deselected / 32 selected

tests/routers/v2/connector/test_bigquery.py ...................        [ 59%]
tests/routers/v3/connector/bigquery/test_functions.py ....             [ 71%]
tests/routers/v3/connector/bigquery/test_query.py .........


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
ibis-server/app/model/connector.py (2)

115-129: Consider adding error handling for BigQuery client creation

The error handling for schema issues is good, but we should also handle potential errors during BigQuery client creation.

 def query(self, sql: str, limit: int) -> pd.DataFrame:
     try:
         return super().query(sql, limit)
     except ValueError as e:
         # Import here to avoid overriding the custom datatypes
         import ibis.backends.bigquery

         # Try to match the error message from the google-cloud-bigquery library for the Arrow type error.
         # If the error message matches, we need to get the schema from the result and generate an empty pandas DataFrame with the mapped schema
         #
         # It's a workaround for the issue that the ibis library does not support empty results for some special types (e.g. JSON or Interval)
         # see details:
         # - https://github.com/Canner/wren-engine/issues/909
         # - https://github.com/ibis-project/ibis/issues/10612
         if "Must pass schema" in str(e):
+            try:
                 credits_json = loads(
                     base64.b64decode(
                         self.connection_info.credentials.get_secret_value()
                     ).decode("utf-8")
                 )
                 credentials = service_account.Credentials.from_service_account_info(
                     credits_json
                 )
                 client = bigquery.Client(credentials=credentials)
                 ibis_schema_mapper = ibis.backends.bigquery.BigQuerySchema()
                 bq_fields = client.query(sql).result()
                 ibis_fields = ibis_schema_mapper.to_ibis(bq_fields.schema)
                 return pd.DataFrame(columns=ibis_fields.names)
+            except Exception as client_error:
+                raise UnprocessableEntityError(f"Failed to create BigQuery client: {client_error}")
+            finally:
+                if 'client' in locals():
+                    client.close()
         else:
             raise e

130-142: Consider extracting BigQuery client creation to a separate method

The client creation logic could be extracted to improve readability and reusability.

+    def _create_bigquery_client(self) -> bigquery.Client:
+        credits_json = loads(
+            base64.b64decode(
+                self.connection_info.credentials.get_secret_value()
+            ).decode("utf-8")
+        )
+        credentials = service_account.Credentials.from_service_account_info(
+            credits_json
+        )
+        return bigquery.Client(credentials=credentials)

     def query(self, sql: str, limit: int) -> pd.DataFrame:
         try:
             return super().query(sql, limit)
         except ValueError as e:
             if "Must pass schema" in str(e):
-                credits_json = loads(
-                    base64.b64decode(
-                        self.connection_info.credentials.get_secret_value()
-                    ).decode("utf-8")
-                )
-                credentials = service_account.Credentials.from_service_account_info(
-                    credits_json
-                )
-                client = bigquery.Client(credentials=credentials)
+                client = self._create_bigquery_client()
                 ibis_schema_mapper = ibis.backends.bigquery.BigQuerySchema()
                 bq_fields = client.query(sql).result()
                 ibis_fields = ibis_schema_mapper.to_ibis(bq_fields.schema)
                 return pd.DataFrame(columns=ibis_fields.names)
ibis-server/tests/routers/v2/connector/test_bigquery.py (1)

201-231: LGTM! Well-structured test cases

The test cases effectively cover empty results and null handling for JSON columns. Consider adding these additional test cases:

  1. Multiple JSON columns in the same query
  2. JSON column with complex nested structures
# Example additional test cases:
"""Test multiple JSON columns."""
response = await client.post(
    url=f"{base_url}/query",
    json={
        "manifestStr": manifest_str,
        "connectionInfo": connection_info,
        "sql": "select json_object('a', 1) as j1, json_object('b', 2) as j2 limit 0",
    },
)

"""Test nested JSON structures."""
response = await client.post(
    url=f"{base_url}/query",
    json={
        "manifestStr": manifest_str,
        "connectionInfo": connection_info,
        "sql": "select json_object('a', json_object('b', array[1,2,3])) limit 0",
    },
)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af5c332 and ff2257f.

📒 Files selected for processing (2)
  • ibis-server/app/model/connector.py (3 hunks)
  • ibis-server/tests/routers/v2/connector/test_bigquery.py (2 hunks)
🔇 Additional comments (4)
ibis-server/app/model/connector.py (2)

31-32: LGTM! Clean integration of BigQueryConnector

The new condition follows the existing pattern for connector initialization.


110-114: LGTM! Clean class initialization

The class properly inherits from SimpleConnector and stores connection info for potential fallback scenarios.

ibis-server/tests/routers/v2/connector/test_bigquery.py (2)

263-291: LGTM! Comprehensive type handling verification

The test effectively verifies the interaction between official and custom BigQuery types, ensuring that:

  1. Empty JSON results work with official types
  2. Custom type handling is properly restored for INTERVAL data

Line range hint 201-291: Verify the impact on query performance

The implementation adds an additional query execution when schema issues occur. While this is necessary for handling empty results, we should verify the performance impact.

✅ Verification successful

No performance concerns found with the additional query execution

Based on the codebase analysis, the implementation follows the existing pattern in app/model/connector.py where an additional query is executed only in exceptional cases (schema issues with empty results). This is not a frequent operation and serves as a fallback mechanism rather than the primary execution path.

Key findings:

  • The additional query is only triggered when schema issues occur with empty results
  • Similar pattern is already established in the connector implementation
  • No performance-related issues or complaints found in the codebase regarding this approach
  • The implementation maintains consistency with the existing error handling patterns
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential performance impact by analyzing query patterns
# Look for similar workarounds or patterns in the codebase that might indicate performance considerations

# Search for similar patterns of executing additional queries for schema retrieval
rg -A 5 "client\.query.*\.result\(\)" --type py

# Look for performance-related comments or issues
rg -i "performance|slow query|optimization" --type py

Length of output: 691

@goldmedal goldmedal merged commit f79314f into Canner:main Dec 25, 2024
7 checks passed
@goldmedal goldmedal deleted the 909-workaround branch December 25, 2024 10:22
Labels
bigquery, ibis, python (Pull requests that update Python code)
Development

Successfully merging this pull request may close these issues.

ValueError: Must pass schema, or at least one RecordBatch in BigQuery
2 participants