
java.io.NotSerializableException when running df.show() #845

Closed
couchotato opened this issue Jan 3, 2023 · 7 comments
couchotato commented Jan 3, 2023

Running the BigQuery connector on Spark 2.4 in cluster mode.

The code below executes without any issues:

val df = spark
        .read
        .format("bigquery")
        .option("table", "xxx")
        .option("dataset", "xxx")
        .option("parentProject", "xxx")
        .option("credentials", "base 64 encoded json file")
        .load()

The following also work fine, yielding the expected results:

df.schema
df.count()
df.filter(...).count()

However, df.show() throws the exception below, and it is very difficult to tell from it what fails upstream. Here's the full stack trace:

java.lang.RuntimeException: java.io.NotSerializableException: com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.AwsCredentials$AwsCredentialSource
  at com.google.cloud.bigquery.connector.common.BigQueryUtil.getCredentialsByteArray(BigQueryUtil.java:111)
  at com.google.cloud.bigquery.connector.common.BigQueryClientFactory.hashCode(BigQueryClientFactory.java:88)
  at java.util.HashMap.hash(HashMap.java:340)
  at java.util.HashMap.containsKey(HashMap.java:597)
  at com.google.cloud.bigquery.connector.common.BigQueryClientFactory.getBigQueryReadClient(BigQueryClientFactory.java:53)
  at com.google.cloud.bigquery.connector.common.ReadSessionCreator.create(ReadSessionCreator.java:79)
  at com.google.cloud.spark.bigquery.v2.context.BigQueryDataSourceReaderContext.planBatchInputPartitionContexts(BigQueryDataSourceReaderContext.java:197)
  at com.google.cloud.spark.bigquery.v2.BigQueryDataSourceReader.planBatchInputPartitions(BigQueryDataSourceReader.java:66)
  at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.batchPartitions$lzycompute(DataSourceV2ScanExec.scala:84)
  at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.batchPartitions(DataSourceV2ScanExec.scala:80)
  at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.outputPartitioning(DataSourceV2ScanExec.scala:60)
  at org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$ensureDistributionAndOrdering$1(EnsureRequirements.scala:149)
  at scala.collection.immutable.List.map(List.scala:293)
  at org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:148)
  at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:312)
  at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:291)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:291)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:288)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:339)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:197)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:337)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
  at org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:304)
  at org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:37)
  at org.apache.spark.sql.execution.QueryExecution.$anonfun$prepareForExecution$1(QueryExecution.scala:108)
  at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
  at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
  at scala.collection.immutable.List.foldLeft(List.scala:91)
  at org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:108)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:92)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3442)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2627)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2841)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:752)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:711)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:720)
  ... 49 elided
Caused by: java.io.NotSerializableException: com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.AwsCredentials$AwsCredentialSource
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
  at com.google.cloud.bigquery.connector.common.BigQueryUtil.getCredentialsByteArray(BigQueryUtil.java:108)
  ... 89 more

@couchotato couchotato changed the title java.io.NotSerializableException: com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.AwsCredentials$AwsCredentialSource when running df.show() java.io.NotSerializableException when running df.show() Jan 3, 2023
@couchotato
Author

One update: I turned on extendedDebugInfo and got another piece of information that might shed some light:

Caused by: java.io.NotSerializableException: com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.AwsCredentials$AwsCredentialSource
        - field (class "com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.ExternalAccountCredentials", name: "credentialSource", type: "class com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.ExternalAccountCredentials$CredentialSource")
        - root object (class "com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.AwsCredentials", AwsCredentials{requestMetadata={Authorization=[Bearer ... ] }})
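For anyone hitting the same wall: the extendedDebugInfo output above comes from a standard JVM flag, `-Dsun.io.serialization.extendedDebugInfo=true`, which makes `NotSerializableException` report the field path that failed. A minimal sketch of enabling it for a Spark job (option names are standard Spark configs; the app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: turn on the JVM's extended serialization debug info so that
// NotSerializableException includes the chain of fields that failed.
// The exception in this issue is thrown on the driver, so the driver
// option is the relevant one; the executor option is added for symmetry.
val spark = SparkSession.builder()
  .appName("bq-debug")
  .config("spark.driver.extraJavaOptions",
    "-Dsun.io.serialization.extendedDebugInfo=true")
  .config("spark.executor.extraJavaOptions",
    "-Dsun.io.serialization.extendedDebugInfo=true")
  .getOrCreate()
```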

@davidrabinowitz
Member

We send the credentials to the executors so that they can read/write the data. AwsCredentials is not serializable, hence the issue. I'll check what can be done here.
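To illustrate the failure mode: Java serialization walks the entire object graph, so a single field whose type is not `Serializable` fails the write even when the enclosing class is. A minimal sketch with hypothetical stand-ins for `AwsCredentialSource` and the enclosing credentials class:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-ins: CredentialSource plays the role of
// AwsCredentialSource (not Serializable), Credentials the role of the
// enclosing Serializable credentials object.
class CredentialSource // does NOT extend Serializable
class Credentials(val source: CredentialSource) extends Serializable

val out = new ObjectOutputStream(new ByteArrayOutputStream())
try out.writeObject(new Credentials(new CredentialSource))
catch {
  // Serialization of the graph fails at the non-Serializable field,
  // and the exception names the offending class.
  case e: NotSerializableException =>
    println(s"NotSerializableException: ${e.getMessage}")
}
```

This is exactly what `BigQueryUtil.getCredentialsByteArray` runs into when it tries to serialize the credentials for shipping to the executors.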

@ghifar

ghifar commented Feb 14, 2023

Hello, I was wondering whether there have been any updates or solutions regarding this issue. I have encountered the same error as described above and am still searching for a solution or workaround.

@davidrabinowitz
Member

I've nudged the relevant team, as we depend on googleapis/google-auth-library-java#1113.

@ghifar

ghifar commented Apr 18, 2023

Hi @davidrabinowitz, it seems the upstream fix has already been released. Can you confirm when a version of this library containing the fix will be released?

@davidrabinowitz
Member

Fixed in version 0.31.1.
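For readers landing here later, upgrading means pulling in connector release 0.31.1 or newer. A sketch of pinning it via `spark.jars.packages` (the Maven coordinate below, `spark-bigquery-with-dependencies` for Scala 2.12, is an assumption; check the connector's README for the artifact matching your Spark and Scala build, especially on older Spark versions like 2.4):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: resolve the fixed connector release at startup instead of
// shipping an older jar with the application. The coordinate is assumed;
// verify it against the connector's compatibility matrix.
val spark = SparkSession.builder()
  .appName("bq")
  .config("spark.jars.packages",
    "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.31.1")
  .getOrCreate()
```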
