
java.io.NotSerializableException when running df.show() #845

Closed
couchotato opened this issue Jan 3, 2023 · 7 comments
couchotato commented Jan 3, 2023

Running the BigQuery connector on Spark 2.4 in cluster mode.

The code below executes without any issues:

val df = spark
        .read
        .format("bigquery")
        .option("table", "xxx")
        .option("dataset", "xxx")
        .option("parentProject", "xxx")
        .option("credentials", "base 64 encoded json file")
        .load()

The following also work fine, yielding the expected results:

df.schema
df.count()
df.filter(...).count()

However, df.show() throws the exception below, and it is very difficult to tell from it what fails upstream. Here's the full stack trace:

java.lang.RuntimeException: java.io.NotSerializableException: com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.AwsCredentials$AwsCredentialSource
  at com.google.cloud.bigquery.connector.common.BigQueryUtil.getCredentialsByteArray(BigQueryUtil.java:111)
  at com.google.cloud.bigquery.connector.common.BigQueryClientFactory.hashCode(BigQueryClientFactory.java:88)
  at java.util.HashMap.hash(HashMap.java:340)
  at java.util.HashMap.containsKey(HashMap.java:597)
  at com.google.cloud.bigquery.connector.common.BigQueryClientFactory.getBigQueryReadClient(BigQueryClientFactory.java:53)
  at com.google.cloud.bigquery.connector.common.ReadSessionCreator.create(ReadSessionCreator.java:79)
  at com.google.cloud.spark.bigquery.v2.context.BigQueryDataSourceReaderContext.planBatchInputPartitionContexts(BigQueryDataSourceReaderContext.java:197)
  at com.google.cloud.spark.bigquery.v2.BigQueryDataSourceReader.planBatchInputPartitions(BigQueryDataSourceReader.java:66)
  at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.batchPartitions$lzycompute(DataSourceV2ScanExec.scala:84)
  at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.batchPartitions(DataSourceV2ScanExec.scala:80)
  at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.outputPartitioning(DataSourceV2ScanExec.scala:60)
  at org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$ensureDistributionAndOrdering$1(EnsureRequirements.scala:149)
  at scala.collection.immutable.List.map(List.scala:293)
  at org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:148)
  at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:312)
  at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:291)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:291)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:288)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:339)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:197)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:337)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
  at org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:304)
  at org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:37)
  at org.apache.spark.sql.execution.QueryExecution.$anonfun$prepareForExecution$1(QueryExecution.scala:108)
  at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
  at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
  at scala.collection.immutable.List.foldLeft(List.scala:91)
  at org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:108)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:92)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3442)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2627)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2841)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:752)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:711)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:720)
  ... 49 elided
Caused by: java.io.NotSerializableException: com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.AwsCredentials$AwsCredentialSource
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
  at com.google.cloud.bigquery.connector.common.BigQueryUtil.getCredentialsByteArray(BigQueryUtil.java:108)
  ... 89 more

@couchotato couchotato changed the title java.io.NotSerializableException: com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.AwsCredentials$AwsCredentialSource when running df.show() java.io.NotSerializableException when running df.show() Jan 3, 2023
@couchotato
Author

One update: I turned on extendedDebugInfo and got another piece of information that might shed some light:

Caused by: java.io.NotSerializableException: com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.AwsCredentials$AwsCredentialSource
        - field (class "com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.ExternalAccountCredentials", name: "credentialSource", type: "class com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.ExternalAccountCredentials$CredentialSource")
        - root object (class "com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.AwsCredentials", AwsCredentials{requestMetadata={Authorization=[Bearer ... ] }})
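For anyone hitting the same wall: the extendedDebugInfo output above comes from a standard JVM flag, `-Dsun.io.serialization.extendedDebugInfo=true`, which makes `NotSerializableException` report the field path that failed. A minimal sketch of enabling it for a Spark job (option names are standard Spark configs; the app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: turn on the JVM's extended serialization debug info so that
// NotSerializableException includes the chain of fields that failed.
// The exception in this issue is thrown on the driver, so the driver
// option is the relevant one; the executor option is added for symmetry.
val spark = SparkSession.builder()
  .appName("bq-debug")
  .config("spark.driver.extraJavaOptions",
    "-Dsun.io.serialization.extendedDebugInfo=true")
  .config("spark.executor.extraJavaOptions",
    "-Dsun.io.serialization.extendedDebugInfo=true")
  .getOrCreate()
```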

@davidrabinowitz
Member

We send the credentials to the executors so that they can read/write the data. AwsCredentials is not serializable, hence the issue. I'll check what can be done here.
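To illustrate the failure mode: Java serialization walks the entire object graph, so a single field whose type is not `Serializable` fails the write even when the enclosing class is. A minimal sketch with hypothetical stand-ins for `AwsCredentialSource` and the enclosing credentials class:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-ins: CredentialSource plays the role of
// AwsCredentialSource (not Serializable), Credentials the role of the
// enclosing Serializable credentials object.
class CredentialSource // does NOT extend Serializable
class Credentials(val source: CredentialSource) extends Serializable

val out = new ObjectOutputStream(new ByteArrayOutputStream())
try out.writeObject(new Credentials(new CredentialSource))
catch {
  // Serialization of the graph fails at the non-Serializable field,
  // and the exception names the offending class.
  case e: NotSerializableException =>
    println(s"NotSerializableException: ${e.getMessage}")
}
```

This is exactly what `BigQueryUtil.getCredentialsByteArray` runs into when it tries to serialize the credentials for shipping to the executors.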

@ghifar

ghifar commented Feb 14, 2023

Hello, I was wondering whether there have been any updates or solutions regarding this issue. I have encountered the same error as described above and am still searching for a solution or workaround.

@davidrabinowitz
Member

I've nudged the relevant team, as we depend on googleapis/google-auth-library-java#1113.

@ghifar

ghifar commented Apr 18, 2023

Hi @davidrabinowitz, it seems the upstream fix has already been released. Can you confirm when a version of this library containing the fix will be released?

@davidrabinowitz
Member

Fixed in version 0.31.1.
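For readers landing here later, upgrading means pulling in connector release 0.31.1 or newer. A sketch of pinning it via `spark.jars.packages` (the Maven coordinate below, `spark-bigquery-with-dependencies` for Scala 2.12, is an assumption; check the connector's README for the artifact matching your Spark and Scala build, especially on older Spark versions like 2.4):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: resolve the fixed connector release at startup instead of
// shipping an older jar with the application. The coordinate is assumed;
// verify it against the connector's compatibility matrix.
val spark = SparkSession.builder()
  .appName("bq")
  .config("spark.jars.packages",
    "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.31.1")
  .getOrCreate()
```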
