
NullPointerException during sync when source-s3 CSV -> destination-s3 Parquet #6871

Closed
amorskoy opened this issue Oct 7, 2021 · 7 comments

Comments


amorskoy commented Oct 7, 2021

Environment

  • Airbyte version: fresh master, commit 11645689431a69c689a15b620e4a2b6bc7b045c3
  • OS Version / Instance: Ubuntu 18.04
  • Deployment: Docker
  • Source Connector and version: source-s3:0.1.5
  • Destination Connector and version: destination-s3:0.1.12
  • Severity: Critical
  • Step where error happened: Sync job

Current Behavior

I have a small CSV file on S3 (3.7 MB, 4k rows x 150 columns) generated with the Python Faker library.
The file is attached:
sample_synth_4K_150.csv

I want to save it to S3 as a Parquet file using destination-s3.
Instead, I get a java.lang.NullPointerException at io.airbyte.integrations.destination.s3.avro.JsonToAvroSchemaConverter.getAvroSchema(JsonToAvroSchemaConverter.java:139)

Expected Behavior

The sync should infer the schema correctly and finish successfully with a Parquet file as output.

Logs

Please see the attached log file:
logs-2-0.txt

amorskoy added the type/bug label on Oct 7, 2021
sherifnada added the area/connectors label on Oct 7, 2021
tuliren (Contributor) commented Oct 15, 2021

Line 139 in JsonToAvroSchemaConverter is actually this line:
https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-s3/src/main/java/io/airbyte/integrations/destination/s3/avro/JsonToAvroSchemaConverter.java#L119

The line number is wrong due to the license update.

This means the JSON schema passed into the S3 destination is missing a properties field.

Still investigating.

tuliren (Contributor) commented Oct 16, 2021

@Phlair, is it possible that the JSON schema generated by the S3 source is missing the properties field for some objects?

It looks like this line could be the root cause:

https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-s3/source_s3/source_files_abstract/stream.py#L184

Phlair (Contributor) commented Oct 18, 2021

@tuliren the self.ab_additional_col field is where any additional columns/values that appear over time are put, to keep the schema consistent. In that sense it has no defined properties, but additionalProperties defaults to True so it can hold anything. Does that cause problems with the Parquet/Avro destination because of the typing?
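For reference, the schema entry being described would look roughly like the sketch below (Python; the field name and exact layout are assumptions, not taken from the connector code):

```python
# Hypothetical shape of the catch-all column in the stream's JSON schema:
# an object-typed field that allows additionalProperties but declares no
# "properties" key of its own.
stream_json_schema = {
    "type": "object",
    "properties": {
        "id": {"type": ["null", "string"]},
        "_ab_additional_properties": {       # assumed name for the ab_additional_col field
            "type": "object",
            "additionalProperties": True,    # note: no "properties" key at all
        },
    },
}

# A converter that walks the "properties" of every object node finds nothing
# for the nested object above, which matches the NPE seen in getAvroSchema.
nested = stream_json_schema["properties"]["_ab_additional_properties"]
print(nested.get("properties"))  # -> None
```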

tuliren (Contributor) commented Oct 18, 2021

I see. An Avro schema requires a definitive type for each field, and our JSON-to-Avro schema converter does not support additionalProperties yet. So this should be the root cause of the NPE.
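One possible way to give such a catch-all field a definitive Avro type is to declare it as a nullable map of strings. The sketch below only illustrates the idea and is not necessarily how the converter was eventually changed:

```python
# Illustrative Avro record schema, written as a plain dict so it is runnable
# without extra libraries. The catch-all column gets a concrete type
# (a map of strings) instead of an open-ended object, so every field has a
# definitive Avro type.
avro_record_schema = {
    "type": "record",
    "name": "example_stream",        # hypothetical stream name
    "fields": [
        {"name": "id", "type": ["null", "string"], "default": None},
        {
            "name": "_ab_additional_properties",   # assumed field name
            "type": ["null", {"type": "map", "values": "string"}],
            "default": None,
        },
    ],
}

print(avro_record_schema)
```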

VitaliiMaltsev self-assigned this on Dec 1, 2021
VitaliiMaltsev (Contributor) commented

I cannot reproduce this at the moment.
I believe this issue was fixed within the scope of #7288.
@tuliren, please verify.

VitaliiMaltsev (Contributor) commented

@tuliren can we close this issue?

tuliren (Contributor) commented Dec 7, 2021

Yes, we can close. Sorry that I missed your comment.
