fix(ibis): implement workaround for empty json result #1013
Conversation
Actionable comments posted: 0
🧹 Nitpick comments (1)
ibis-server/app/model/connector.py (1)
120-127: String-matching the error message can be fragile.
While searching for "Must pass schema" solves this issue, consider whether a more structured check (e.g., checking exception types or error codes) might be more robust in the future.
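As a rough sketch of what a structured check could look like (the `SchemaMappingError` class below is a hypothetical placeholder, not a real ibis or google-cloud-bigquery exception type):

```python
# Hypothetical sketch: detecting the failure by exception type instead of
# by substring-matching the message. SchemaMappingError is a placeholder
# name, not a real ibis/google-cloud-bigquery exception.

class SchemaMappingError(ValueError):
    """Placeholder for a library-specific schema-mapping error."""


def run_query(execute, fallback):
    try:
        return execute()
    except SchemaMappingError:
        # Structured check: the type alone identifies the case, so this
        # handler keeps working even if the message wording changes.
        return fallback()
    except ValueError as e:
        # String matching remains only as a last resort when no
        # dedicated exception type exists.
        if "Must pass schema" in str(e):
            return fallback()
        raise
```

Until the library exposes such a type, the string match is the pragmatic option; isolating it in one place limits the blast radius if the wording ever changes upstream.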
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
ibis-server/app/model/connector.py (3 hunks)
ibis-server/tests/routers/v2/connector/test_bigquery.py (1 hunks)
🔇 Additional comments (7)
ibis-server/tests/routers/v2/connector/test_bigquery.py (1)
201-214: Well-structured test for empty JSON queries.
This new async test verifies that the query successfully returns no data while still including the correct dtypes. The logic is concise and reflects the intended functionality of returning an empty DataFrame in these edge cases.
ibis-server/app/model/connector.py (6)
1-15: Imports support new BigQuery functionality.
The introduction of BigQuery-specific imports (ibis.backends.bigquery, google.cloud.bigquery, google.oauth2.service_account) is appropriate and necessary for handling the new BigQueryConnector class.
32-33: Seamless integration of the new BigQueryConnector.
Selecting BigQueryConnector when data_source == DataSource.bigquery is a standard approach that cleanly extends the existing connector logic.
111-115: BigQueryConnector initialization looks good.
Properly calling super().__init__(DataSource.bigquery, connection_info) ensures you inherit the fundamental logic from SimpleConnector.
128-131: Ensure credential decoding remains secure.
Decoding credentials from base64 is valid. Ensure that logs and error messages do not accidentally capture sensitive details.
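A minimal sketch of decoding the credential defensively, assuming the payload is base64-encoded service-account JSON as in this PR; the error message deliberately omits the secret and the chained exception so neither can leak into logs:

```python
import base64
from json import loads


def decode_service_account(secret_b64: str) -> dict:
    """Decode a base64-encoded service-account JSON payload.

    On failure, raise a generic error that does not echo the payload or
    the original exception text, so credentials never reach the logs.
    """
    try:
        return loads(base64.b64decode(secret_b64).decode("utf-8"))
    except Exception:
        # 'from None' suppresses the chained traceback, which could
        # otherwise include fragments of the raw credential.
        raise ValueError("invalid credential payload") from None
```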
133-140: Schema reconstruction approach is logical.
Retrieving the schema from the BigQuery client query result and creating an empty DataFrame addresses the ibis library's limitation for special types like JSON or Interval. This is a clear and effective workaround.
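The core of the workaround can be illustrated in isolation: once the column names are recovered from the native client's schema, an empty DataFrame built from them still reports dtypes in the response (pandas defaults empty columns to `object`). The field names below are illustrative stand-ins for what the real schema mapping would produce:

```python
import pandas as pd

# Column names as they would come from mapping the BigQuery result schema
# (illustrative values; in the PR they come from BigQuerySchema.to_ibis).
field_names = ["f0_", "f1_"]

# An empty DataFrame that still carries the schema: zero rows, but the
# column names (and therefore dtypes) survive into the API response.
empty = pd.DataFrame(columns=field_names)

print(len(empty))  # 0 rows
print(dict(empty.dtypes.astype(str)))
```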
141-143: Re-raising other exceptions preserves expected behavior.
Unrelated errors are rightfully allowed to propagate, which maintains error transparency for non-schema-related issues.
Try the case where only the JSON column is empty.
Sounds great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
ibis-server/tests/routers/v2/connector/test_bigquery.py (1)
201-231: LGTM! Consider enhancing test coverage.
The test cases effectively validate the workaround for empty JSON results and NULL handling. However, consider these improvements:
- Add a docstring explaining the test purpose and scenarios:

```diff
 async def test_query_empty_json(client, manifest_str):
+    """
+    Test BigQuery connector's handling of JSON columns in special cases:
+    1. Empty result set with JSON column
+    2. NULL JSON value with non-NULL integer
+    """
```
- Split into separate test functions for better isolation:

```python
async def test_query_empty_json_result(client, manifest_str):
    """Test empty result set with JSON column."""
    response = await client.post(
        url=f"{base_url}/query",
        json={
            "manifestStr": manifest_str,
            "connectionInfo": connection_info,
            "sql": "select json_object('a', 1, 'b', 2) limit 0",
        },
    )
    assert response.status_code == 200
    result = response.json()
    assert len(result["data"]) == 0
    assert result["dtypes"] == {"f0_": "object"}


async def test_query_null_json(client, manifest_str):
    """Test NULL JSON value with non-NULL integer."""
    response = await client.post(
        url=f"{base_url}/query",
        json={
            "manifestStr": manifest_str,
            "connectionInfo": connection_info,
            "sql": "select cast(null as JSON), 1",
        },
    )
    assert response.status_code == 200
    result = response.json()
    assert len(result["data"]) == 1
    assert result["data"][0][0] is None
    assert result["data"][0][1] == 1
    assert result["dtypes"] == {"f0_": "object", "f1_": "int64"}
```
- Add test cases for:
- Multiple JSON columns in empty result
- Error scenarios (e.g., invalid JSON)
- Schema validation
There are some issues with the BigQuery CI, so I tested it locally instead. All tests passed.
Actionable comments posted: 0
🧹 Nitpick comments (3)
ibis-server/app/model/connector.py (2)
115-129: Consider adding error handling for BigQuery client creation.
The error handling for schema issues is good, but we should also handle potential errors during BigQuery client creation.

```diff
 def query(self, sql: str, limit: int) -> pd.DataFrame:
     try:
         return super().query(sql, limit)
     except ValueError as e:
         # Import here to avoid override the custom datatypes
         import ibis.backends.bigquery

         # Try to match the error message from the google cloud bigquery library matching Arrow type error.
         # If the error message matches, requries to get the schema from the result and generate a empty pandas dataframe with the mapped schema
         #
         # It's a workaround for the issue that the ibis library does not support empty result for some special types (e.g. JSON or Interval)
         # see details:
         # - https://github.com/Canner/wren-engine/issues/909
         # - https://github.com/ibis-project/ibis/issues/10612
         if "Must pass schema" in str(e):
+            try:
                 credits_json = loads(
                     base64.b64decode(
                         self.connection_info.credentials.get_secret_value()
                     ).decode("utf-8")
                 )
                 credentials = service_account.Credentials.from_service_account_info(
                     credits_json
                 )
                 client = bigquery.Client(credentials=credentials)
                 ibis_schema_mapper = ibis.backends.bigquery.BigQuerySchema()
                 bq_fields = client.query(sql).result()
                 ibis_fields = ibis_schema_mapper.to_ibis(bq_fields.schema)
                 return pd.DataFrame(columns=ibis_fields.names)
+            except Exception as client_error:
+                raise UnprocessableEntityError(f"Failed to create BigQuery client: {client_error}")
+            finally:
+                if 'client' in locals():
+                    client.close()
         else:
             raise e
```
130-142: Consider extracting BigQuery client creation to a separate method.
The client creation logic could be extracted to improve readability and reusability.

```diff
+    def _create_bigquery_client(self) -> bigquery.Client:
+        credits_json = loads(
+            base64.b64decode(
+                self.connection_info.credentials.get_secret_value()
+            ).decode("utf-8")
+        )
+        credentials = service_account.Credentials.from_service_account_info(
+            credits_json
+        )
+        return bigquery.Client(credentials=credentials)

     def query(self, sql: str, limit: int) -> pd.DataFrame:
         try:
             return super().query(sql, limit)
         except ValueError as e:
             if "Must pass schema" in str(e):
-                credits_json = loads(
-                    base64.b64decode(
-                        self.connection_info.credentials.get_secret_value()
-                    ).decode("utf-8")
-                )
-                credentials = service_account.Credentials.from_service_account_info(
-                    credits_json
-                )
-                client = bigquery.Client(credentials=credentials)
+                client = self._create_bigquery_client()
                 ibis_schema_mapper = ibis.backends.bigquery.BigQuerySchema()
                 bq_fields = client.query(sql).result()
                 ibis_fields = ibis_schema_mapper.to_ibis(bq_fields.schema)
                 return pd.DataFrame(columns=ibis_fields.names)
```

ibis-server/tests/routers/v2/connector/test_bigquery.py (1)
201-231: LGTM! Well-structured test cases.
The test cases effectively cover empty results and null handling for JSON columns. Consider adding these additional test cases:
- Multiple JSON columns in the same query
- JSON column with complex nested structures
```python
# Example additional test cases:

"""Test multiple JSON columns."""
response = await client.post(
    url=f"{base_url}/query",
    json={
        "manifestStr": manifest_str,
        "connectionInfo": connection_info,
        "sql": "select json_object('a', 1) as j1, json_object('b', 2) as j2 limit 0",
    },
)

"""Test nested JSON structures."""
response = await client.post(
    url=f"{base_url}/query",
    json={
        "manifestStr": manifest_str,
        "connectionInfo": connection_info,
        "sql": "select json_object('a', json_object('b', array[1,2,3])) limit 0",
    },
)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
ibis-server/app/model/connector.py (3 hunks)
ibis-server/tests/routers/v2/connector/test_bigquery.py (2 hunks)
🔇 Additional comments (4)
ibis-server/app/model/connector.py (2)
31-32: LGTM! Clean integration of BigQueryConnector.
The new condition follows the existing pattern for connector initialization.
110-114: LGTM! Clean class initialization.
The class properly inherits from SimpleConnector and stores connection info for potential fallback scenarios.
ibis-server/tests/routers/v2/connector/test_bigquery.py (2)
263-291: LGTM! Comprehensive type handling verification.
The test effectively verifies the interaction between official and custom BigQuery types, ensuring that:
- Empty JSON results work with official types
- Custom type handling is properly restored for INTERVAL data
Line range hint 201-291: Verify the impact on query performance.
The implementation adds an additional query execution when schema issues occur. While this is necessary for handling empty results, we should verify the performance impact.
✅ Verification successful
No performance concerns found with the additional query execution
Based on the codebase analysis, the implementation follows the existing pattern in app/model/connector.py, where an additional query is executed only in exceptional cases (schema issues with empty results). This is not a frequent operation and serves as a fallback mechanism rather than the primary execution path.
Key findings:
- The additional query is only triggered when schema issues occur with empty results
- Similar pattern is already established in the connector implementation
- No performance-related issues or complaints found in the codebase regarding this approach
- The implementation maintains consistency with the existing error handling patterns
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Description: Check for potential performance impact by analyzing query patterns
# Look for similar workarounds or patterns in the codebase that might indicate performance considerations

# Search for similar patterns of executing additional queries for schema retrieval
rg -A 5 "client\.query.*\.result\(\)" --type py

# Look for performance-related comments or issues
rg -i "performance|slow query|optimization" --type py
```
Length of output: 691
Close #909
Description
This PR implements a workaround for this issue to avoid query failures in these special cases.
If the type mapping fails, we use the native BigQuery client to get the schema and create an empty pandas DataFrame instead.
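A simplified sketch of this control flow, with the ibis call and the native-client schema lookup abstracted into callables (the real implementation lives in `ibis-server/app/model/connector.py` and uses the google-cloud-bigquery client):

```python
import pandas as pd


def query_with_fallback(run_ibis, fetch_schema_names, sql: str) -> pd.DataFrame:
    """Run a query through ibis; on the known empty-result schema failure,
    fall back to an empty DataFrame shaped by the native client's schema.

    run_ibis and fetch_schema_names are placeholder callables standing in
    for the ibis backend and the native BigQuery schema lookup.
    """
    try:
        return run_ibis(sql)
    except ValueError as e:
        # Only the specific "Must pass schema" failure is handled;
        # everything else propagates unchanged.
        if "Must pass schema" not in str(e):
            raise
        return pd.DataFrame(columns=fetch_schema_names(sql))
```

For example, a query like `select json_object('a', 1) limit 0` would trigger the fallback and yield a zero-row DataFrame whose single column has dtype `object`.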