
feat(datasets): support setting Ibis table schemas #833

Closed
deepyaman opened this issue Sep 14, 2024 · 8 comments · Fixed by #909

Comments

@deepyaman (Member)

Description

I want to set the database schema for the Ibis table referenced by the ibis.TableDataset.

Context

From @mark-druffel on Slack:

Has anyone used ibis.TableDataset with DuckDB schemas? If I set a schema on a data catalog entry, I get the error: Invalid Input Error: Could not set option "schema" as a global option.

bronze_x:
  type: ibis.TableDataset
  filepath: x.csv
  file_format: csv
  table_name: x
  backend: duckdb
  database: data.duckdb
  schema: bronze

I can reproduce this error with vanilla ibis:

con = ibis.duckdb.connect(database="data.duckdb", schema = "bronze")

I found a related question on Ibis's GitHub; it sounds like DuckDB can't set the schema globally, so it has to be done in the table functions. Wondering if this would require a change to ibis.TableDataset, and if so, would this pattern work the same with other backends?
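
In other words, per that related discussion, the schema would be given per table call rather than at connect time. A minimal sketch with vanilla Ibis (assuming the bronze schema and table x already exist in data.duckdb):

# Sketch only: specify the schema per table() call instead of at connect time
import ibis

con = ibis.duckdb.connect(database="data.duckdb")
x = con.table("x", database="bronze")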

Possible Implementation

Wondering if this would require a change to ibis.TableDataset,
If I understand correctly, this would be a request to pass schema (actually, database, since schema is deprecated as an argument to table) as a table_arg or something in the dataset.

@deepyaman (Member Author) commented Sep 14, 2024

and if so, would this pattern work the same with other backends?

I think so, because https://github.com/ibis-project/ibis/blob/main/ibis/backends/sql/__init__.py#L47, for example (called in the table() function), is generic to SQL backends.
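
For instance, a hypothetical sketch against the Postgres backend (connection details are placeholders), where the same database keyword on table() selects the schema:

# Sketch only: the same table(database=...) pattern on another SQL backend
import ibis

con = ibis.postgres.connect(
    host="localhost", user="me", password="secret", database="analytics"
)
x = con.table("x", database="bronze")  # here `database` names the schema within `analytics`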

@deepyaman (Member Author)

From @mark-druffel:

Yeah, I was just looking through some of the other backends; agreed. I'm trying to check pyspark now to recall whether table(database) is actually equivalent to catalog, database, or schema in Hive.
On the Ibis side, it feels like do_connect using a database parameter is confusing. For example:

import ibis

con = ibis.duckdb.connect(database="data/db/spotify/spotify.duckdb")
con.create_database(name="bronze")
con.create_database(name="silver")
con.create_database(name="gold")
con.table("x", database="bronze")

The create_database and table calls use database to mean something completely different.
On the kedro-datasets side, my question becomes: would it make sense to accept an argument called "schema" and just pass that to table(database=schema), since the "database" argument is already used in the connection string for do_connect?

I'd be inclined to match the Ibis name, but create a section for table_args (i.e. arguments that get passed to the underlying table() call), since that's more in line with how most Kedro datasets are structured in my experience, but I haven't thought about it that much.

@deepyaman (Member Author)

I'd be inclined to match the Ibis name, but create a section for table_args (i.e. arguments that get passed to the underlying table() call)

Actually, looking at the code, it's more like load_args and save_args. We already pass save_args to create_table or create_view, which is where you can specify database="bronze" (the same will hold true for inserts, pending #834). The table call only happens on load, so it makes sense for this to be a load_args entry that we pass there (right now, load_args aren't passed to the table call, only the read_* call, but I expect read_* to move into FileDataset after #828).

So this would be very trivial to do. It just seems a bit dumb that you'd pass the schema name twice (to load_args and save_args), so I guess we can just specify database as a top-level key (not under connection) and feed it to load_args/save_args in the __init__ method.
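
As a rough sketch of that wiring (hypothetical class and attribute names, not the actual TableDataset code):

# Hypothetical sketch of feeding a top-level `database` key into load_args/save_args
class _SketchTableDataset:
    def __init__(self, *, table_name, connection, database=None, load_args=None, save_args=None):
        self._table_name = table_name
        self._connection = connection  # an Ibis backend created from the `connection` config
        self._load_args = dict(load_args or {})
        self._save_args = dict(save_args or {})
        if database is not None:
            # Written once in the catalog entry, used for both the table() call (load)
            # and create_table()/create_view() (save).
            self._load_args.setdefault("database", database)
            self._save_args.setdefault("database", database)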

@mark-druffel curious to know what you think of this approach? This is what it would look like in your example:

bronze_x:
  type: ibis.TableDataset
  filepath: x.csv
  file_format: csv
  table_name: x
  connection:  # Nit: Moved the keys under `connection`; assume that was just an oversight in your example
    backend: duckdb
    database: data.duckdb
  database: bronze  # Won't bother with `schema`, to be consistent with Ibis

@mark-druffel (Contributor)

@deepyaman Staying consistent with ibis definitely makes sense. I do think the differing use of database in do_connect() and table() is confusing, but that's an ibis question/discussion and I'm not sure what parameter names would be better...

@mark-druffel (Contributor)

@deepyaman I've been looking back through this given my pyspark example over Slack. I'm still thrown by the duplicative use of database as a parameter in Ibis. I've always used language aligned with the image below and find myself getting a bit confused in Ibis because schema is different, but the confluence of database and data frame language makes things a bit more complicated in Ibis, I think 🤷
[image: database/schema/table terminology diagram, from a referenced Stack Overflow post]

Given that, I think using your prior suggestion of table_args might be a bit clearer to end users. I could be wrong, but it also feels more akin to FileDataset's load_args / save_args as you mentioned, so I think it would be more natural (see the sketch after these examples):

# Duckdb
bronze_x:
  type: ibis.TableDataset
  table_name: x
  connection:  
    backend: duckdb
    database: data.duckdb
  table_args:
    database: schema

# Databricks w/ Unity Catalog
bronze_x:
  type: ibis.TableDataset
  table_name: x
  connection:  
    backend: pyspark
    session: session_name #Not sure if this actually works yet, databricks default works with no named session but will test
  table_args:
    database: catalog.schema
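
For illustration, a minimal sketch (hypothetical, not the dataset's actual code) of what forwarding table_args to the underlying table() call amounts to:

# Sketch only: everything under `table_args` forwards straight to Ibis' table()
import ibis

con = ibis.pyspark.connect()                    # the dataset's backend connection
table_args = {"database": "my_catalog.bronze"}  # from the `table_args` block above
x = con.table("x", **table_args)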

I'm updating TableDataset and doing some testing on my side, but would love to open a PR if the change makes sense to others?

@deepyaman (Member Author)

I'm updating TableDataset and doing some testing on my side, but would love to open a PR if the change makes sense to others?

That makes sense and would be much appreciated! Appreciate any help in making the dataset as useful as possible for people such as yourself. :)

I'm sorry for not getting back to you on the Slack thread (persistent reference: https://kedro.hall.community/loading-ibis-tabledataset-from-unity-catalog-fails-with-cannot-be-found-error-0m1YHF44RYjD), but would the update you're making address this? I know you mentioned some nuances with DuckDB, which I haven't parsed myself.

@mark-druffel (Contributor) commented Oct 22, 2024

I'm updating TableDataset and doing some testing on my side, but would love to open a PR if the change makes sense to others?

That makes sense and would be much appreciated! Appreciate any help in making the dataset as useful as possible for people such as yourself. :)

I'm sorry for not getting back to you on the Slack thread (persistent reference: https://kedro.hall.community/loading-ibis-tabledataset-from-unity-catalog-fails-with-cannot-be-found-error-0m1YHF44RYjD), but would the update you're making address this? I know you mentioned some nuances with DuckDB, which I haven't parsed myself.

@deepyaman no worries, you've been so helpful with every question I've had. It's so appreciated!

To answer your direct question, yes. My main goal for the PR would be to make working with pyspark in Databricks easier.

Additional Context

Basically, using the table() method works fine in Databricks, but do_connect() does not.

Using table method

We're using the pyspark backend with Unity Catalog (UC), so we need to be able to call table() with the database parameter. table() works fine with UC:

# Ibis
import ibis
from ibis import _
ibis.options.interactive = True
con = ibis.pyspark.connect()
con.table(name="raw_tracks", database="comms_media_dev.dart_extensions")

# pyspark equivalent
spark.read.table("comms_media_dev.dart_extensions.raw_tracks")

[image: output showing table() successfully returning the raw_tracks table]

Using do_connect method

do_connect() accepts **kwargs, "used to configure the SparkSession", but the required configs spark.sql.defaultCatalog and spark.sql.catalog.spark_catalog.defaultDatabase are not accepted by Databricks.

# Error: "Cannot modify the value of a static config: spark.sql.catalog.spark_catalog.defaultDatabase."
con = ibis.pyspark.connect(
    **{
        "spark.sql.defaultCatalog": "comms_media_dev",
        "spark.sql.catalog.spark_catalog.defaultDatabase": "dart_extensions",
    }
)

If I use the spark.sql.catalog.spark_catalog configuration, the code runs but doesn't work:

con = ibis.pyspark.connect(**{"spark.sql.catalog.spark_catalog": "comms_media_dev"})

[image: output showing the catalog configuration not taking effect]

Databricks documentation says:

Serverless compute does not support setting most Spark properties for notebooks or jobs. The following are the properties you can configure:

  • spark.sql.legacy.timeParserPolicy (Default value is EXCEPTION)
  • spark.sql.session.timeZone (Default value is Etc/UTC)
  • spark.sql.shuffle.partitions (Default value is auto)
  • spark.sql.ansi.enabled (Default value is true)

Given that, it seems the only way to update Unity Catalog's default catalog and database would be to do so in the cluster configuration. That would mean pipelines couldn't use multiple catalogs and/or databases (maybe they could with hooks, but that seems overly complicated given table() works without any of that).

@mark-druffel (Contributor) commented Oct 24, 2024

@deepyaman I added table_args to TableDataset and got it working, but I'm having second thoughts. I think it's confusing where things go between table_args and save_args. I'm wondering, though: will TableDataset keep save_args and load_args in the future, or is this just to avoid a breaking change on the initial release of FileDataset?

# With save_args
bronze_x:
  type: ibis.TableDataset
  table_name: x
  connection:
    backend: pyspark
  table_args:
    database:  my_catalog.bronze
  save_args:
    materialized: table
    overwrite: True

# Without save_args
bronze_x:
  type: ibis.TableDataset
  connection:
    backend: pyspark
  table_args:
    name: x # Maybe move table_name into table_args if there's no save_args just so that all arguments passed are in one place? 
    database: my_catalog.bronze
    materialized: table
    overwrite: True    
