
fix(ibis): implement workaround for empty json result #1013

Merged
merged 3 commits into Canner:main on Dec 25, 2024

Conversation

goldmedal
Contributor

@goldmedal goldmedal commented Dec 25, 2024

Close #909

Description

This PR implements a workaround for the issue to avoid query failures in these special cases.

If the type mapping fails, it uses the native BigQuery client to get the schema and creates an empty pandas DataFrame instead.
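For context, a condensed sketch of that fallback path, adapted from the connector change in this PR (SimpleConnector, DataSource, and the shape of connection_info are assumed from the existing ibis-server codebase and are not defined here):

import base64
from json import loads

import pandas as pd
from google.cloud import bigquery
from google.oauth2 import service_account

# SimpleConnector and DataSource are assumed to come from the ibis-server app package.

class BigQueryConnector(SimpleConnector):
    def __init__(self, connection_info):
        super().__init__(DataSource.bigquery, connection_info)
        self.connection_info = connection_info

    def query(self, sql: str, limit: int) -> pd.DataFrame:
        try:
            return super().query(sql, limit)
        except ValueError as e:
            # The Arrow conversion path raises "Must pass schema" for empty results
            # with JSON/INTERVAL columns (see #909); re-raise anything else.
            if "Must pass schema" not in str(e):
                raise
            # Fallback: run the query with the native BigQuery client, map its schema
            # to ibis, and return an empty DataFrame with the mapped column names.
            import ibis.backends.bigquery

            credentials = service_account.Credentials.from_service_account_info(
                loads(
                    base64.b64decode(
                        self.connection_info.credentials.get_secret_value()
                    ).decode("utf-8")
                )
            )
            client = bigquery.Client(credentials=credentials)
            bq_result = client.query(sql).result()
            ibis_schema = ibis.backends.bigquery.BigQuerySchema().to_ibis(bq_result.schema)
            return pd.DataFrame(columns=ibis_schema.names)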

Summary by CodeRabbit

  • New Features

    • Introduced a new BigQuery connector that enhances query handling, especially for empty results.
  • Bug Fixes

    • Improved error handling for schema issues in SQL queries.
  • Tests

    • Added two new asynchronous tests to validate the response of the BigQuery connector for empty JSON queries and custom data types.


coderabbitai bot commented Dec 25, 2024

Walkthrough

The pull request introduces a new BigQueryConnector class in the ibis-server/app/model/connector.py file to handle specific BigQuery query scenarios, particularly when dealing with empty result sets or queries involving INTERVAL and JSON columns. The implementation adds robust error handling to retrieve the schema and return an empty DataFrame when the initial query execution fails due to schema-related issues. Additionally, new asynchronous tests are added to verify the behavior of the connector with these specific cases.

Changes

  • ibis-server/app/model/connector.py: Added a BigQueryConnector class extending SimpleConnector with a custom query method for BigQuery-specific error handling
  • ibis-server/tests/routers/v2/connector/test_bigquery.py: Added test_query_empty_json and test_custom_datatypes_no_overrides async test functions to validate BigQuery connector behavior with empty JSON queries and custom data types

Assessment against linked issues

  • Handle ValueError for empty BigQuery queries with INTERVAL or JSON columns (#909)
  • Provide schema-based resolution for empty result sets

Poem

🐰 In the realm of BigQuery's might,
A connector springs to coding light
Empty schemas no longer fright
With pandas and error handling tight
Our data flows with rabbit's delight! 🚀


@github-actions github-actions bot added the bigquery, ibis, and python (Pull requests that update Python code) labels on Dec 25, 2024
@goldmedal goldmedal marked this pull request as ready for review December 25, 2024 09:19

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
ibis-server/app/model/connector.py (1)

120-127: String-matching the error message can be fragile.
While searching for "Must pass schema" solves this issue, consider whether a more structured check (e.g., checking exception types or error codes) might be more robust in the future.
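One low-risk way to contain that brittleness is to isolate the check behind a small helper; the sketch below uses a hypothetical _is_missing_schema_error function (not part of this PR), so a structured exception-type or error-code check can later replace the string match in a single place:

def _is_missing_schema_error(error: Exception) -> bool:
    # Hypothetical helper (not in the PR): keeps the brittle "Must pass schema"
    # string match in one place until a structured check becomes available.
    return isinstance(error, ValueError) and "Must pass schema" in str(error)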

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0d0d531 and be86e85.

📒 Files selected for processing (2)
  • ibis-server/app/model/connector.py (3 hunks)
  • ibis-server/tests/routers/v2/connector/test_bigquery.py (1 hunks)
🔇 Additional comments (7)
ibis-server/tests/routers/v2/connector/test_bigquery.py (1)

201-214: Well-structured test for empty JSON queries.
This new async test verifies that the query successfully returns no data while still including the correct dtypes. The logic is concise and reflects the intended functionality of returning an empty DataFrame in these edge cases.

ibis-server/app/model/connector.py (6)

1-15: Imports support new BigQuery functionality.
The introduction of BigQuery-specific imports (ibis.backends.bigquery, google.cloud.bigquery, google.oauth2.service_account) is appropriate and necessary for handling the new BigQueryConnector class.


32-33: Seamlessly integrating the new BigQueryConnector.
Selecting BigQueryConnector when data_source == DataSource.bigquery is a standard approach that cleanly extends the existing connector logic.


111-115: BigQueryConnector initialization looks good.
Properly calling super().__init__(DataSource.bigquery, connection_info) ensures you inherit the fundamental logic from SimpleConnector.


128-131: Ensure credential decoding remains secure.
Decoding credentials from base64 is valid. Ensure that logs and error messages do not accidentally capture sensitive details.


133-140: Schema reconstruction approach is logical.
Retrieving the schema from the BigQuery client query result and creating an empty DataFrame addresses the ibis library’s limitation for special types like JSON or Interval. This is a clear and effective workaround.


141-143: Re-raising other exceptions preserves expected behavior.
Unrelated errors are rightfully allowed to propagate, which maintains error transparency for non-schema-related issues.

@goldmedal goldmedal requested a review from wwwy3y3 December 25, 2024 09:24
@wwwy3y3
Member

wwwy3y3 commented Dec 25, 2024

Try if only json column is empty

@goldmedal
Contributor Author

Try if only json column is empty

Sounds great.
Added tests to assert this case in af5c332
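For reference, the NULL-JSON scenario covered by that commit looks roughly like the sketch below (adapted from the review suggestion further down; client, manifest_str, base_url, and connection_info are fixtures and constants from the existing test module):

async def test_query_null_json(client, manifest_str):
    # A NULL JSON value next to a non-NULL integer should still return one row.
    response = await client.post(
        url=f"{base_url}/query",
        json={
            "manifestStr": manifest_str,
            "connectionInfo": connection_info,
            "sql": "select cast(null as JSON), 1",
        },
    )
    assert response.status_code == 200
    result = response.json()
    assert len(result["data"]) == 1
    assert result["data"][0][0] is None
    assert result["data"][0][1] == 1
    assert result["dtypes"] == {"f0_": "object", "f1_": "int64"}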


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
ibis-server/tests/routers/v2/connector/test_bigquery.py (1)

201-231: LGTM! Consider enhancing test coverage.

The test cases effectively validate the workaround for empty JSON results and NULL handling. However, consider these improvements:

  1. Add docstring explaining the test purpose and scenarios:
 async def test_query_empty_json(client, manifest_str):
+    """
+    Test BigQuery connector's handling of JSON columns in special cases:
+    1. Empty result set with JSON column
+    2. NULL JSON value with non-NULL integer
+    """
  2. Split into separate test functions for better isolation:
async def test_query_empty_json_result(client, manifest_str):
    """Test empty result set with JSON column."""
    response = await client.post(
        url=f"{base_url}/query",
        json={
            "manifestStr": manifest_str,
            "connectionInfo": connection_info,
            "sql": "select json_object('a', 1, 'b', 2) limit 0",
        },
    )
    assert response.status_code == 200
    result = response.json()
    assert len(result["data"]) == 0
    assert result["dtypes"] == {"f0_": "object"}

async def test_query_null_json(client, manifest_str):
    """Test NULL JSON value with non-NULL integer."""
    response = await client.post(
        url=f"{base_url}/query",
        json={
            "manifestStr": manifest_str,
            "connectionInfo": connection_info,
            "sql": "select cast(null as JSON), 1",
        },
    )
    assert response.status_code == 200
    result = response.json()
    assert len(result["data"]) == 1
    assert result["data"][0][0] is None
    assert result["data"][0][1] == 1
    assert result["dtypes"] == {"f0_": "object", "f1_": "int64"}
  3. Add test cases for:
    • Multiple JSON columns in empty result
    • Error scenarios (e.g., invalid JSON)
    • Schema validation
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between be86e85 and af5c332.

📒 Files selected for processing (1)
  • ibis-server/tests/routers/v2/connector/test_bigquery.py (1 hunks)

@goldmedal
Contributor Author

There are some issues with the BigQuery CI, so I tested it locally instead. All tests passed.

~/git/wren-engine/ibis-server ▓▒░ just test bigquery
poetry run pytest -m 'bigquery'
============================= test session starts =============================
platform darwin -- Python 3.11.11, pytest-8.3.4, pluggy-1.5.0
rootdir: /Users/jax/git/wren-engine/ibis-server
configfile: pyproject.toml
plugins: anyio-4.7.0
collected 221 items / 189 deselected / 32 selected

tests/routers/v2/connector/test_bigquery.py ...................        [ 59%]
tests/routers/v3/connector/bigquery/test_functions.py ....             [ 71%]
tests/routers/v3/connector/bigquery/test_query.py .........


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
ibis-server/app/model/connector.py (2)

115-129: Consider adding error handling for BigQuery client creation

The error handling for schema issues is good, but we should also handle potential errors during BigQuery client creation.

 def query(self, sql: str, limit: int) -> pd.DataFrame:
     try:
         return super().query(sql, limit)
     except ValueError as e:
         # Import here to avoid overriding the custom datatypes
         import ibis.backends.bigquery

         # Try to match the error message from the google-cloud-bigquery library for the Arrow type error.
         # If the error message matches, we need to get the schema from the result and generate an empty pandas DataFrame with the mapped schema
         #
         # It's a workaround for the issue that the ibis library does not support empty results for some special types (e.g. JSON or Interval)
         # see details:
         # - https://github.com/Canner/wren-engine/issues/909
         # - https://github.com/ibis-project/ibis/issues/10612
         if "Must pass schema" in str(e):
+            try:
                 credits_json = loads(
                     base64.b64decode(
                         self.connection_info.credentials.get_secret_value()
                     ).decode("utf-8")
                 )
                 credentials = service_account.Credentials.from_service_account_info(
                     credits_json
                 )
                 client = bigquery.Client(credentials=credentials)
                 ibis_schema_mapper = ibis.backends.bigquery.BigQuerySchema()
                 bq_fields = client.query(sql).result()
                 ibis_fields = ibis_schema_mapper.to_ibis(bq_fields.schema)
                 return pd.DataFrame(columns=ibis_fields.names)
+            except Exception as client_error:
+                raise UnprocessableEntityError(f"Failed to create BigQuery client: {client_error}")
+            finally:
+                if 'client' in locals():
+                    client.close()
         else:
             raise e

130-142: Consider extracting BigQuery client creation to a separate method

The client creation logic could be extracted to improve readability and reusability.

+    def _create_bigquery_client(self) -> bigquery.Client:
+        credits_json = loads(
+            base64.b64decode(
+                self.connection_info.credentials.get_secret_value()
+            ).decode("utf-8")
+        )
+        credentials = service_account.Credentials.from_service_account_info(
+            credits_json
+        )
+        return bigquery.Client(credentials=credentials)

     def query(self, sql: str, limit: int) -> pd.DataFrame:
         try:
             return super().query(sql, limit)
         except ValueError as e:
             if "Must pass schema" in str(e):
-                credits_json = loads(
-                    base64.b64decode(
-                        self.connection_info.credentials.get_secret_value()
-                    ).decode("utf-8")
-                )
-                credentials = service_account.Credentials.from_service_account_info(
-                    credits_json
-                )
-                client = bigquery.Client(credentials=credentials)
+                client = self._create_bigquery_client()
                 ibis_schema_mapper = ibis.backends.bigquery.BigQuerySchema()
                 bq_fields = client.query(sql).result()
                 ibis_fields = ibis_schema_mapper.to_ibis(bq_fields.schema)
                 return pd.DataFrame(columns=ibis_fields.names)
ibis-server/tests/routers/v2/connector/test_bigquery.py (1)

201-231: LGTM! Well-structured test cases

The test cases effectively cover empty results and null handling for JSON columns. Consider adding these additional test cases:

  1. Multiple JSON columns in the same query
  2. JSON column with complex nested structures
# Example additional test cases:
"""Test multiple JSON columns."""
response = await client.post(
    url=f"{base_url}/query",
    json={
        "manifestStr": manifest_str,
        "connectionInfo": connection_info,
        "sql": "select json_object('a', 1) as j1, json_object('b', 2) as j2 limit 0",
    },
)

"""Test nested JSON structures."""
response = await client.post(
    url=f"{base_url}/query",
    json={
        "manifestStr": manifest_str,
        "connectionInfo": connection_info,
        "sql": "select json_object('a', json_object('b', array[1,2,3])) limit 0",
    },
)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af5c332 and ff2257f.

📒 Files selected for processing (2)
  • ibis-server/app/model/connector.py (3 hunks)
  • ibis-server/tests/routers/v2/connector/test_bigquery.py (2 hunks)
🔇 Additional comments (4)
ibis-server/app/model/connector.py (2)

31-32: LGTM! Clean integration of BigQueryConnector

The new condition follows the existing pattern for connector initialization.


110-114: LGTM! Clean class initialization

The class properly inherits from SimpleConnector and stores connection info for potential fallback scenarios.

ibis-server/tests/routers/v2/connector/test_bigquery.py (2)

263-291: LGTM! Comprehensive type handling verification

The test effectively verifies the interaction between official and custom BigQuery types, ensuring that:

  1. Empty JSON results work with official types
  2. Custom type handling is properly restored for INTERVAL data

Line range hint 201-291: Verify the impact on query performance

The implementation adds an additional query execution when schema issues occur. While this is necessary for handling empty results, we should verify the performance impact.

✅ Verification successful

No performance concerns found with the additional query execution

Based on the codebase analysis, the implementation follows the existing pattern in app/model/connector.py where an additional query is executed only in exceptional cases (schema issues with empty results). This is not a frequent operation and serves as a fallback mechanism rather than the primary execution path.

Key findings:

  • The additional query is only triggered when schema issues occur with empty results
  • Similar pattern is already established in the connector implementation
  • No performance-related issues or complaints found in the codebase regarding this approach
  • The implementation maintains consistency with the existing error handling patterns
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential performance impact by analyzing query patterns
# Look for similar workarounds or patterns in the codebase that might indicate performance considerations

# Search for similar patterns of executing additional queries for schema retrieval
rg -A 5 "client\.query.*\.result\(\)" --type py

# Look for performance-related comments or issues
rg -i "performance|slow query|optimization" --type py

Length of output: 691

@goldmedal goldmedal merged commit f79314f into Canner:main Dec 25, 2024
7 checks passed
@goldmedal goldmedal deleted the 909-workaround branch December 25, 2024 10:22
Labels
bigquery, ibis, python (Pull requests that update Python code)
Development

Successfully merging this pull request may close these issues.

ValueError: Must pass schema, or at least one RecordBatch in BigQuery
2 participants