This repository has been archived by the owner on Nov 11, 2022. It is now read-only.

Dataflow jobs using the SDK for Java 1.6.0 and reading compressed files from TextIO with compression mode set may be subject to data loss. #356

Open
dhalperi opened this issue Aug 5, 2016 · 5 comments

dhalperi commented Aug 5, 2016

We have identified an issue in which Dataflow jobs that read from TextIO with the compression type set to GZIP or BZIP2 may lose data during processing.

Specifically, using TextIO:

  • TextIO.Read.from(...).withCompressionType(TextIO.CompressionType.GZIP) or
  • TextIO.Read.from(...).withCompressionType(TextIO.CompressionType.BZIP2)

This issue is silent: you will not see any error messages or other visible symptoms. It occurs only when all of the following hold: the job uses the Dataflow SDK for Java 1.6.0, it reads compressed files, and it sets the compression mode with withCompressionType to either GZIP or BZIP2.

Current known workarounds:

  • Recommended option: Use AUTO mode instead of GZIP or BZIP2 mode.

    Use withCompressionType(CompressionType.AUTO) with the TextIO source, or leave the compression type unset (AUTO is the default). NOTE: compressed files must have a .gz or .bz2 extension (case-insensitive) for this to work.

  • Switch to version 1.5.1 of the Dataflow SDK for Java. If you are using Maven, this can be done by specifying version 1.5.1 in your pom.xml.
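Because the AUTO workaround depends on file extensions, it may help to check input file names before switching modes. The following is a minimal sketch using only the Java standard library. The class `CompressionByExtension` and its `guess` method are hypothetical names, not part of the Dataflow SDK, and the matching rule is an assumption based only on the note above (a `.gz` or `.bz2` extension, case-insensitive), not the SDK's actual implementation.

```java
import java.util.Locale;

/** Hypothetical pre-flight helper; not part of the Dataflow SDK. */
final class CompressionByExtension {
    /** Modes this sketch distinguishes; the names are illustrative only. */
    enum Mode { GZIP, BZIP2, UNCOMPRESSED }

    private CompressionByExtension() {}

    /**
     * Guesses how a file would be treated under extension-based detection,
     * assuming a case-insensitive match on ".gz" and ".bz2".
     */
    static Mode guess(String filename) {
        String lower = filename.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".gz")) {
            return Mode.GZIP;
        }
        if (lower.endsWith(".bz2")) {
            return Mode.BZIP2;
        }
        // Any other extension would not be recognized as compressed.
        return Mode.UNCOMPRESSED;
    }
}
```

Under this assumption, a compressed file whose name maps to `UNCOMPRESSED` (for example, a gzip file named `part-0001.dat`) would need to be renamed before relying on AUTO mode.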

We are actively working to resolve this and will update this issue with all developments.

dhalperi added the bug label Aug 5, 2016

polleyg commented Aug 5, 2016

Hi Dan.

How can we identify which pipelines, if any, have lost data? We have many pipelines reading GZIP files from GCS.

Graham


dhalperi commented Aug 5, 2016

Hi Graham,

The vast majority of customers will not be affected, because the default TextIO.Read.from("filepattern") will automatically notice .gz files and decompress them.

Affected jobs are only those using version 1.6.0 and manually calling withCompressionType(CompressionType.GZIP) or withCompressionType(CompressionType.BZIP2).

If you use the Cloud Console, you can inspect the Display Data of the TextIO.Read to see the compression mode.

An example of a TextIO.Read that is affected (Compression Mode is GZIP):
[Screenshot: Display Data of an affected TextIO.Read]

An example of a normal TextIO.Read that is not affected (AUTO mode shows up as DecompressAccordingToFilename):
[Screenshot: Display Data of an unaffected TextIO.Read]

Our support team is tracking affected jobs submitted to the Cloud Dataflow service and has already reached out to affected customers.

The DirectPipelineRunner also exhibits this bug locally.

Dan
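For a job that does match the affected configuration, one possible spot check (not something suggested in this thread) is to count the lines in an input file independently and compare the result with the number of elements the job's TextIO.Read step produced. The sketch below uses only `java.util.zip.GZIPInputStream` from the Java standard library. The class name `GzipLineCount` is hypothetical, and the sketch assumes the file has been copied locally, is UTF-8 text, and that one line corresponds to one element.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

/** Hypothetical verification helper; not part of the Dataflow SDK. */
final class GzipLineCount {
    private GzipLineCount() {}

    /** Counts the text lines in a gzip-compressed stream (assumes UTF-8). */
    static long countLines(InputStream gzipped) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        }
    }

    /** Convenience overload for a local file. */
    static long countLines(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return countLines(in);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: GzipLineCount <file.gz>");
            return;
        }
        System.out.println(countLines(Paths.get(args[0])));
    }
}
```

A count that exceeds the element count reported for the read step would suggest the job dropped records. A matching count for one file does not prove that other files were read completely.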


polleyg commented Aug 8, 2016

Thanks Dan. As recommended, we're going to roll out a hotfix with the compression type set to AUTO for now.

@dhalperi

Cloud Dataflow SDK for Java 1.6.1 has been released with a fix for this issue.

See Downloads for instructions on how to obtain and install the Cloud Dataflow SDK for Java.


polleyg commented Aug 11, 2016

Awesome stuff Dan. Great turnaround time.
