This repository has been archived by the owner on Nov 11, 2022. It is now read-only.
Dataflow jobs using the SDK for Java 1.6.0 and reading compressed files from TextIO with compression mode set may be subject to data loss. #356
We have identified an issue with Dataflow jobs reading from `TextIO` with the compression type set to `GZIP` or `BZIP2`, potentially losing data during processing. Specifically, using TextIO:
`TextIO.from(...).withCompressionType(CompressionType.GZIP)`
or
`TextIO.from(...).withCompressionType(CompressionType.BZIP2)`
This is a silent issue, so you will not see any error messages or visible symptoms. The problem occurs under the following circumstances: using the Dataflow SDK for Java 1.6.0, reading compressed files, and setting the compression mode using `withCompressionType` to either `GZIP` or `BZIP2`.

Current known workarounds:
- Recommended option: Use `AUTO` mode instead of `GZIP` or `BZIP2` mode. Use `withCompressionType(CompressionType.AUTO)` or leave it unset (it is the default) with the `TextIO` source. NOTE: compressed files must have a `.gz` or `.bz2` (case-insensitive) extension for this to work.
- Switch to version 1.5.1 of the Dataflow SDK for Java. If you are using `mvn`, this can be done by specifying version `1.5.1` in your pom.xml.

We are actively working to resolve this and will update this issue with all developments.
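
Because the recommended workaround depends on file extensions, it may help to check input file names before switching to `AUTO` mode. The following is a minimal sketch, not part of the Dataflow SDK: the class `AutoCompressionCheck` and its method `hasRecognizedExtension` are hypothetical names, and the check assumes only what the note above states, namely that a compressed file needs a `.gz` or `.bz2` extension (case-insensitive) to be handled in `AUTO` mode.

```java
import java.util.Locale;

/**
 * Hypothetical helper (not part of the Dataflow SDK). It only mirrors the
 * extension rule described in this issue; it does not inspect file contents.
 */
public final class AutoCompressionCheck {
    private AutoCompressionCheck() {}

    /** Returns true if the name ends in ".gz" or ".bz2", ignoring case. */
    public static boolean hasRecognizedExtension(String fileName) {
        // Locale.ROOT keeps lower-casing independent of the default locale.
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".gz") || lower.endsWith(".bz2");
    }

    public static void main(String[] args) {
        // Report each file name passed on the command line.
        for (String name : args) {
            String verdict = hasRecognizedExtension(name)
                    ? "extension recognized"
                    : "extension NOT recognized";
            System.out.println(name + ": " + verdict);
        }
    }
}
```

Compressed inputs flagged as not recognized would need to be renamed before relying on `AUTO` mode; otherwise the second workaround (SDK version 1.5.1) may be the safer choice.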