Commit

Convert df to pyspark DataFrame if it is koalas before writing (dbt-labs#474)

* Temporarily update dev-requirements.txt

* Changelog entry

* Temporarily update dev-requirements.txt

* Convert df to pyspark DataFrame if it is koalas before writing

* Restore original version of dev-requirements.txt

* Preferentially convert Koalas DataFrames to pandas-on-Spark DataFrames first

* Fix explanation

Co-authored-by: Takuya UESHIN <ueshin@databricks.com>
dbeatty10 and ueshin authored Sep 27, 2022
1 parent 23d17a0 commit 80dc029
Showing 2 changed files with 18 additions and 1 deletion.
7 changes: 7 additions & 0 deletions .changes/unreleased/Under the Hood-20220924-143713.yaml
@@ -0,0 +1,7 @@
+kind: Under the Hood
+body: Convert df to pyspark DataFrame if it is koalas before writing
+time: 2022-09-24T14:37:13.100404-06:00
+custom:
+  Author: dbeatty10 ueshin
+  Issue: "473"
+  PR: "474"
12 changes: 11 additions & 1 deletion dbt/include/spark/macros/materializations/table.sql
@@ -46,6 +46,7 @@ import importlib.util
 
 pandas_available = False
 pyspark_available = False
+koalas_available = False
 
 # make sure pandas exists before using it
 if importlib.util.find_spec("pandas"):
@@ -57,17 +58,26 @@ if importlib.util.find_spec("pyspark.pandas"):
     import pyspark.pandas
     pyspark_available = True
 
-# preferentially convert pandas DataFrames to pandas-on-Spark DataFrames first
+# make sure databricks.koalas exists before using it
+if importlib.util.find_spec("databricks.koalas"):
+    import databricks.koalas
+    koalas_available = True
+
+# preferentially convert pandas DataFrames to pandas-on-Spark or Koalas DataFrames first
 # since they know how to convert pandas DataFrames better than `spark.createDataFrame(df)`
 # and converting from pandas-on-Spark to Spark DataFrame has no overhead
 if pyspark_available and pandas_available and isinstance(df, pandas.core.frame.DataFrame):
     df = pyspark.pandas.frame.DataFrame(df)
+elif koalas_available and pandas_available and isinstance(df, pandas.core.frame.DataFrame):
+    df = databricks.koalas.frame.DataFrame(df)
 
 # convert to pyspark.sql.dataframe.DataFrame
 if isinstance(df, pyspark.sql.dataframe.DataFrame):
     pass  # since it is already a Spark DataFrame
 elif pyspark_available and isinstance(df, pyspark.pandas.frame.DataFrame):
     df = df.to_spark()
+elif koalas_available and isinstance(df, databricks.koalas.frame.DataFrame):
+    df = df.to_spark()
 elif pandas_available and isinstance(df, pandas.core.frame.DataFrame):
     df = spark.createDataFrame(df)
 else:
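
Note on the availability checks above: `importlib.util.find_spec` imports the parent package before looking up a dotted name, so probing "databricks.koalas" or "pyspark.pandas" raises ModuleNotFoundError (rather than returning None) when the parent package is not installed. The following is a minimal sketch of a more defensive probe, not the code in this commit; the helper name `module_available` and the flag name `pyspark_pandas_available` are hypothetical.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located, False otherwise (hypothetical helper)."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # ImportError covers ModuleNotFoundError, which find_spec raises for a
        # dotted name whose parent package is missing (e.g. no "databricks").
        # ValueError is raised when an already-imported module has __spec__ = None.
        return False


# Assumed flag names, mirroring the ones used in the diff above.
pandas_available = module_available("pandas")
pyspark_pandas_available = module_available("pyspark.pandas")
koalas_available = module_available("databricks.koalas")
```

With a helper like this, each flag ends up False instead of the script failing during the probe, and the `isinstance` chain in the diff can stay as it is.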