Releases: GoogleCloudPlatform/DataflowJavaSDK
Version 2.1.0
Version 2.1.0 is based on a subset of Apache Beam 2.1.0. See the Apache Beam
2.1.0 release notes for additional change information.
Issues
Known issue: When running in batch mode, Gauge metrics are not reported.
Updates and improvements
- Added `Metrics` support for `DataflowRunner` in streaming mode.
- Added `OnTimeBehavior` to `WindowingStrategy` to control emitting of `ON_TIME` panes.
- Added a default file name policy for windowed `FileBasedSink`s that consume windowed input.
- Fixed an issue in which processing time timers for expired windows were ignored.
- Fixed an issue in which `DatastoreIO` failed to make progress when Datastore was slow to respond.
- Fixed an issue in which `bzip2` files were being partially read; added support for concatenated `bzip2` files.
- Improved stability, performance, and documentation.
Version 1.9.1
- Fixed an issue in which Dataflow jobs reading from `CompressedSource`s with compression type set to `BZIP2` could potentially lose data during processing. For more information, see Issue #596.
Version 2.0.0
The Dataflow SDK for Java 2.0.0 is the first stable 2.x release of the Dataflow SDK for Java, based on a subset of Apache Beam 2.0.0. See the Apache Beam 2.0.0 release notes for additional change information.
Note for users upgrading from version 1.x
This is a new major version, and therefore comes with the following caveats:
- Breaking Changes: The Dataflow SDK 2.x for Java has a number of breaking changes from the 1.x series of releases.
- Update Incompatibility: The Dataflow SDK 2.x for Java is update-incompatible with Dataflow 1.x. Streaming jobs using a Dataflow 1.x SDK cannot be updated to use a Dataflow 2.x SDK. Dataflow 2.x pipelines may only be updated across versions starting with SDK version 2.0.0.
Updates and improvements since 2.0.0-beta3
Version 2.0.0 is based on a subset of Apache Beam 2.0.0. The most relevant changes in this release for Cloud Dataflow customers include:
- Added a new API in `BigQueryIO` for writing into multiple tables, possibly with different schemas, based on data. See `BigQueryIO.Write.to(SerializableFunction)` and `BigQueryIO.Write.to(DynamicDestinations)`.
- Added a new API for writing windowed and unbounded collections to `TextIO` and `AvroIO`. For example, see `TextIO.Write.withWindowedWrites()` and `TextIO.Write.withFilenamePolicy(FilenamePolicy)`.
- Added `TFRecordIO` to read and write TensorFlow TFRecord files.
- Added the ability to automatically register `CoderProvider`s in the default `CoderRegistry`. `CoderProvider`s are registered by a `ServiceLoader` via concrete implementations of a `CoderProviderRegistrar`.
- Changed the order of parameters for `ParDo` with side inputs and outputs.
- Changed the order of parameters for `MapElements` and `FlatMapElements` transforms when specifying an output type.
- Changed the pattern for reading and writing custom types to `PubsubIO` and `KafkaIO`.
- Changed the syntax for reading from and writing to `TextIO`, `AvroIO`, `TFRecordIO`, `KinesisIO`, and `BigQueryIO`.
- Changed the syntax for configuring windowing parameters other than the `WindowFn` itself using the `Window` transform.
- Consolidated `XmlSource` and `XmlSink` into `XmlIO`.
- Renamed `CountingInput` to `GenerateSequence` and unified the syntax for producing bounded and unbounded sequences.
- Renamed `BoundedSource#splitIntoBundles` to `#split`.
- Renamed `UnboundedSource#generateInitialSplits` to `#split`.
- Output from `@StartBundle` is no longer possible. Instead of accepting a parameter of type `Context`, this method may optionally accept an argument of type `StartBundleContext` to access `PipelineOptions`.
- Output from `@FinishBundle` now always requires an explicit timestamp and window. Instead of accepting a parameter of type `Context`, this method may optionally accept an argument of type `FinishBundleContext` to access `PipelineOptions` and emit output to specific windows.
- `XmlIO` is no longer part of the SDK core and must be added manually using the new `xml-io` package.
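As a rough illustration of the windowed-write API named above, a sketch assuming the 2.x SDK (the bucket path, window size, and shard count are hypothetical; a fixed shard count or filename policy is typically required for windowed writes):

```java
// Write an unbounded PCollection<String> to windowed text files.
PCollection<String> lines = …;  // e.g. read from Cloud Pub/Sub
lines
    .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(10))))
    .apply(TextIO.write()
        .to("gs://my-bucket/output/")  // hypothetical output location
        .withWindowedWrites()          // one set of files per window/pane
        .withNumShards(3));            // explicit sharding for windowed writes
```

For per-destination routing, `BigQueryIO.Write.to(DynamicDestinations)` follows the same pattern: the destination is computed from each element rather than fixed at construction time.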
More information
Please see Cloud Dataflow documentation and release notes for version 2.0.
Version 2.0.0-beta3
The Dataflow SDK for Java 2.0.0-beta3 is the third 2.x release of the Dataflow SDK for Java, based on a subset of the Apache Beam code base.
- Breaking Changes: The Dataflow SDK 2.x for Java releases have a number of breaking changes from the 1.x series of releases and from earlier 2.x beta releases. Please see below for details.
- Update Incompatibility: The Dataflow SDK 2.x for Java is update-incompatible with Dataflow 1.x. Streaming jobs using a Dataflow 1.x SDK cannot be updated to use a Dataflow 2.x SDK. Additionally, beta releases of 2.x may not be update-compatible with each other or with 2.0.0.
Beta
This is a Beta release of the Dataflow SDK 2.x for Java and includes the following caveats:
- No API Stability: This release does not guarantee a stable API. The next release in the 2.x series may make breaking API changes that require you to modify your code when you upgrade. API stability guarantees will begin with the 2.0.0 release.
- Limited Support Timeline: This release is an early preview of the upcoming 2.0.0 release. It’s intended to let you start the eventual transition to the 2.x series at your convenience. Beta releases are supported by the Dataflow service, but obtaining bugfixes and new features will require you to upgrade to a newer release that may have backwards-incompatible changes. Once 2.0.0 is released, you should plan to upgrade from any 2.0.0-betaX releases within 3 months.
- Documentation and Code Samples: The SDK documentation on the Dataflow site continues to use code samples from the original 1.x SDKs. For the time being, please see the Apache Beam Documentation for background on the APIs in this release.
Updates since 2.0.0-beta2
Version 2.0.0-beta3 is based on a subset of Apache Beam 0.6.0. The most relevant changes in this release for Cloud Dataflow customers include:
- Changed `TextIO` to only operate on strings.
- Changed `KafkaIO` to specify type parameters explicitly.
- Renamed factory functions of `ToString`.
- Changed the `Count`, `Latest`, `Sample`, and `SortValues` transforms.
- Renamed `Write.Bound` to `Write`.
- Renamed `Flatten` transform classes.
- Split the `GroupByKey.create` method into `create` and `createWithFewKeys` methods.
Additional breaking changes
Please see the official Dataflow SDK 2.x for Java release notes for an updated list of additional breaking changes and updated information on the Dataflow SDK 2.x for Java releases.
Version 2.0.0-beta2
The Dataflow SDK for Java 2.0.0-beta2 is the second 2.x release of the Dataflow SDK for Java, based on a subset of the Apache Beam code base.
- Breaking Changes: The Dataflow SDK 2.x for Java releases have a number of breaking changes from the 1.x series of releases and from earlier 2.x beta releases. Please see below for details.
- Update Incompatibility: The Dataflow SDK 2.x for Java is update-incompatible with Dataflow 1.x. Streaming jobs using a Dataflow 1.x SDK cannot be updated to use a Dataflow 2.x SDK. Additionally, beta releases of 2.x may not be update-compatible with each other or with 2.0.0.
Beta
This is a Beta release of the Dataflow SDK 2.x for Java and includes the following caveats:
- No API Stability: This release does not guarantee a stable API. The next release in the 2.x series may make breaking API changes that require you to modify your code when you upgrade. API stability guarantees will begin with the 2.0.0 release.
- Limited Support Timeline: This release is an early preview of the upcoming 2.0.0 release. It’s intended to let you start the eventual transition to the 2.x series at your convenience. Beta releases are supported by the Dataflow service, but obtaining bugfixes and new features will require you to upgrade to a newer release that may have backwards-incompatible changes. Once 2.0.0 is released, you should plan to upgrade from any 2.0.0-betaX releases within 3 months.
- Documentation and Code Samples: The SDK documentation on the Dataflow site continues to use code samples from the original 1.x SDKs. For the time being, please see the Apache Beam Documentation for background on the APIs in this release.
Updates since 2.0.0-beta1
This release is based on a subset of Apache Beam 0.5.0. The most relevant changes in this release for Cloud Dataflow customers include:
- New `PubsubIO` functionality: `Read` and `Write` now provide access to Cloud Pub/Sub message attributes.
- New scenario: support for stateful pipelines via the new State API.
- New scenario: support for timers via the new Timer API (limited to the `DirectRunner` in this release).
- Changed `PubsubIO` construction: `PubsubIO.Read` and `PubsubIO.Write` must now be instantiated using `PubsubIO.<T>read()` and `PubsubIO.<T>write()` instead of the static factory methods such as `PubsubIO.Read.topic(String)`. Specifying a coder via `.withCoder(Coder)` for the output type is required for `Read`. Specifying a coder for the input type or specifying a format function via `.withAttributes(SimpleFunction<T, PubsubMessage>)` is required for `Write`.
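The new construction pattern can be sketched as follows (based on the beta2 API described above; the project and topic names are hypothetical, and exact import paths varied across beta releases):

```java
// Old (beta1 and 1.x): static factory, e.g. PubsubIO.Read.topic("...").
// New (beta2): instantiate with an explicit type parameter and coder.
PCollection<String> messages = pipeline.apply(
    PubsubIO.<String>read()
        .topic("projects/my-project/topics/my-topic")  // hypothetical topic
        .withCoder(StringUtf8Coder.of()));             // coder now required for Read
```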
Additional breaking changes
Please see the official Dataflow SDK 2.x for Java release notes for an updated list of additional breaking changes and updated information on the Dataflow SDK 2.x for Java releases.
Version 2.0.0-beta1
The Dataflow SDK for Java 2.0.0-beta1 is the first 2.x release of the Dataflow SDK for Java, based on a subset of the Apache Beam code base.
- Breaking Changes: The Dataflow SDK 2.x for Java has a number of breaking changes from the 1.x series of releases. Please see below for details.
- Update Incompatibility: The Dataflow SDK 2.x for Java is update-incompatible with Dataflow 1.x. Streaming jobs using a Dataflow 1.x SDK cannot be updated to use a Dataflow 2.x SDK. Additionally, beta releases of 2.x may not be update-compatible with each other or with 2.0.0.
Beta
This is a Beta release of the Dataflow SDK 2.x for Java and includes the following caveats:
- No API Stability: This release does not guarantee a stable API. The next release in the 2.x series may make breaking API changes that require you to modify your code when you upgrade. API stability guarantees will begin with the 2.0.0 release.
- Limited Support Timeline: This release is an early preview of the upcoming 2.0.0 release. It’s intended to let you start the eventual transition to the 2.x series at your convenience. Beta releases are supported by the Dataflow service, but obtaining bugfixes and new features will require you to upgrade to a newer release that may have backwards-incompatible changes. Once 2.0.0 is released, you should plan to upgrade from any 2.0.0-betaX releases within 3 months.
- Documentation and Code Samples: The SDK documentation on the Dataflow site continues to use code samples from the original 1.x SDKs. For the time being, please see the Apache Beam Documentation for background on the APIs in this release.
New Functionality
This release is based on a subset of Apache Beam 0.4.0. The most relevant changes for Cloud Dataflow customers include:
- Improvements to compression: `CompressedSource` supports reading ZIP-compressed files. `TextIO.Write` and `AvroIO.Write` support compressed output.
- New `AvroIO` functionality: `Write` supports the addition of custom user metadata.
- New `BigQueryIO` functionality: `Write` now splits large (> 12 TiB) bulk imports into multiple BigQuery load jobs, enabling it to handle very large datasets.
- New `BigtableIO` functionality: `Write` supports unbounded `PCollection`s and so can be used in the `DataflowRunner` in streaming mode.
For complete details, please see the release notes for Apache Beam 0.3.0-incubating and 0.4.0.
Other Apache Beam modules from version 0.4.0 can be used with this distribution, including additional I/O connectors like Java Message Service (JMS), Apache Kafka, Java Database Connectivity (JDBC), MongoDB, and Amazon Kinesis. Please see the Apache Beam site for details.
Major changes from Dataflow SDK 1.x for Java
This release has a number of significant changes from the 1.x series of releases.
All users need to read and understand these changes in order to upgrade to 2.x versions.
Package rename and restructuring
As part of generalizing Apache Beam to work well with environments beyond Google Cloud Platform, the SDK code has been renamed and restructured.
- Renamed `com.google.cloud.dataflow` to `org.apache.beam`

  The SDK is now declared in the package `org.apache.beam` instead of `com.google.cloud.dataflow`. You need to update all of your import statements to reflect this change.

- New subpackages: `runners.dataflow`, `runners.direct`, and `io.gcp`

  Runners have been reorganized into their own packages, so many things from `com.google.cloud.dataflow.sdk.runners` have moved into either `org.apache.beam.runners.direct` or `org.apache.beam.runners.dataflow`.

  Pipeline options specific to running on the Dataflow service have moved from `com.google.cloud.dataflow.sdk.options` to `org.apache.beam.runners.dataflow.options`.

  Most I/O connectors to Google Cloud Platform services have been moved into subpackages. For example, `BigQueryIO` has moved from `com.google.cloud.dataflow.sdk.io` to `org.apache.beam.sdk.io.gcp.bigquery`.

  Most IDEs will be able to help identify the new locations. To verify the new location for specific files, you can press `t` to search the code on GitHub. The Dataflow SDK 1.x for Java releases are built from the GoogleCloudPlatform/DataflowJavaSDK repository. The Dataflow SDK 2.x for Java releases correspond to code from the apache/beam repository.
Runners
- Removed `Pipeline` from Runner names

  The names of all Runners have been shortened by removing `Pipeline` from the names. For example, `DirectPipelineRunner` is now `DirectRunner`, and `DataflowPipelineRunner` is now `DataflowRunner`.

- Require setting `--tempLocation` to a Google Cloud Storage path

  Instead of allowing you to specify only one of `--stagingLocation` or `--tempLocation` and having Dataflow infer the other, the Dataflow service now requires `--gcpTempLocation` to be set to a Google Cloud Storage path, although it can be inferred from the more general `--tempLocation`. Unless overridden, this will also be used for the `--stagingLocation`.

- `DirectRunner` supports unbounded `PCollection`s

  The `DirectRunner` continues to run on a user's local machine, but now additionally supports multithreaded execution, unbounded `PCollection`s, and triggers for speculative and late outputs. It aligns more closely with the documented Beam model, and may (correctly) cause additional unit test failures.

  As this functionality is now in the `DirectRunner`, the `InProcessPipelineRunner` (Dataflow SDK 1.6+ for Java) has been removed.

- Replaced `BlockingDataflowPipelineRunner` with `PipelineResult.waitUntilFinish()`

  The `BlockingDataflowPipelineRunner` has been removed. If your code programmatically expects to run a pipeline and wait until it has terminated, it should use the `DataflowRunner` and explicitly call `pipeline.run().waitUntilFinish()`.

  If you used `--runner BlockingDataflowPipelineRunner` on the command line to interactively induce your main program to block until the pipeline terminated, this is now a concern of the main program; it should provide an option such as `--blockOnRun` that will induce it to call `waitUntilFinish()`.
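A main program that preserves the old blocking behavior might be sketched like this (`--blockOnRun` is a hypothetical option name, not an SDK flag):

```java
// Run with DataflowRunner, then block explicitly instead of relying on
// the removed BlockingDataflowPipelineRunner.
PipelineResult result = pipeline.run();
if (options.getBlockOnRun()) {   // hypothetical flag defined in your own options
  result.waitUntilFinish();      // blocks until the job reaches a terminal state
}
```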
`ParDo` and `DoFn`

- `DoFn`s use annotations instead of method overrides

  To allow for more flexibility and customization, `DoFn` now uses method annotations to customize processing instead of requiring users to override specific methods.

  The differences between the new `DoFn` and the old are demonstrated in the following code samples. Previously (with Dataflow SDK 1.x for Java), your code would look like this:

  ```java
  new DoFn<Foo, Baz>() {
    @Override
    public void processElement(ProcessContext c) { … }
  }
  ```

  Now (with Dataflow SDK 2.x for Java), your code will look like this:

  ```java
  new DoFn<Foo, Baz>() {
    @ProcessElement  // <-- This is the only difference
    public void processElement(ProcessContext c) { … }
  }
  ```

  If your `DoFn` accessed `ProcessContext#window()`, then there is a further change. Instead of this:

  ```java
  public class MyDoFn extends DoFn<Foo, Baz> implements RequiresWindowAccess {
    @Override
    public void processElement(ProcessContext c) {
      … (MyWindowType) c.window() …
    }
  }
  ```

  you will write this:

  ```java
  public class MyDoFn extends DoFn<Foo, Baz> {
    @ProcessElement
    public void processElement(ProcessContext c, MyWindowType window) {
      … window …
    }
  }
  ```

  or:

  ```java
  return new DoFn<Foo, Baz>() {
    @ProcessElement
    public void processElement(ProcessContext c, MyWindowType window) {
      … window …
    }
  };
  ```

  The runtime will automatically provide the window to your `DoFn`.

- `DoFn`s are reused across multiple bundles

  To allow for performance improvements, the same `DoFn` may now be reused to process multiple bundles of elements, instead of guaranteeing a fresh instance per bundle. Any `DoFn` that keeps local state (e.g., instance variables) beyond the end of a bundle may encounter behavioral changes, as the next bundle will now start with that state instead of a fresh copy.

  To manage the lifecycle, new `@Setup` and `@Teardown` methods have been added. The full lifecycle is as follows (a failure may truncate the lifecycle at any point):

  - `@Setup`: per-instance initialization of the `DoFn`, such as opening reusable connections.
  - Any number of repetitions of the sequence:
    - `@StartBundle`: per-bundle initialization, such as resetting the state of the `DoFn`.
    - `@ProcessElement`: the usual element processing.
    - `@FinishBundle`: per-bundle concluding steps, such as flushing side effects.
  - `@Teardown`: per-instance teardown of the resources held by the `DoFn`, such as closing reusable connections.

  Note: This change is expected to have limited impact in practice. However, it does not generate a compile-time error and has the potential to silently cause unexpected results.
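The lifecycle methods above can be sketched together in one `DoFn` (the `MyClient` connection type is hypothetical, used only to illustrate per-instance vs. per-bundle state):

```java
class WriteFn extends DoFn<String, Void> {
  private transient MyClient client;  // reused across bundles on one instance
  private List<String> buffer;        // must be reset per bundle, not assumed fresh

  @Setup
  public void setup() {
    client = MyClient.connect();      // once per DoFn instance
  }

  @StartBundle
  public void startBundle() {
    buffer = new ArrayList<>();       // reset per-bundle state explicitly
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    buffer.add(c.element());
  }

  @FinishBundle
  public void finishBundle() {
    client.flush(buffer);             // flush side effects once per bundle
  }

  @Teardown
  public void teardown() {
    client.close();                   // once per DoFn instance
  }
}
```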
PTransforms
- Removed `.named()`

  Removed the `.named()` methods from `PTransform`s and sub-classes. Instead, use `PCollection.apply("name", PTransform)`.

- Renamed `PTransform.apply()` to `PTransform.expand()`

  `PTransform.apply()` was renamed to `PTransform.expand()` to avoid confusion with `PCollection.apply()`. All user-written composite transforms will need to rename the overridden `apply()` method to `expand()`. There is no change to how pipelines are constructed.
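A user-written composite transform under the renamed API might be sketched as follows (`CountWords` is an illustrative name):

```java
class CountWords extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
  @Override
  public PCollection<KV<String, Long>> expand(PCollection<String> words) {
    // Previously this logic lived in an overridden apply() method.
    return words.apply(Count.perElement());
  }
}

// Pipeline construction is unchanged:
// lines.apply("CountWords", new CountWords());
```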
Additional breaking changes
Please see the official [Dataflow SDK 2.x for Java release notes](https://cloud.google.com/dataflow/release-notes/re...
Version 1.9.0
- Added the `ValueProvider` interface for use in pipeline options. Making an option of type `ValueProvider<T>` instead of `T` allows its value to be supplied at runtime (rather than at pipeline construction time) and enables Dataflow templates. Support for `ValueProvider` has been added to `TextIO`, `PubSubIO`, and `BigQueryIO`, and can be added to arbitrary `PTransform`s as well.
- Added the ability to automatically save profiling information to Google Cloud Storage using the `--saveProfilesToGcs` pipeline option. For more information on profiling pipelines executed by the `DataflowPipelineRunner`, see issue #72.
- Deprecated the `--enableProfilingAgent` pipeline option, which saved profiles to the individual worker disks. For more information on profiling pipelines executed by the `DataflowPipelineRunner`, see issue #72.
- Changed `FileBasedSource` to throw an exception when reading from a file pattern that has no matches. Pipelines will now fail at runtime rather than silently reading no data in this case. This change affects `TextIO.Read` or `AvroIO.Read` when configured `withoutValidation`.
- Enhanced `Coder` validation in the `DirectPipelineRunner` to catch coders that cannot properly encode and decode their input.
- Improved display data throughout core transforms, including properly handling arrays in `PipelineOptions`.
- Improved performance for pipelines using the `DataflowPipelineRunner` in streaming mode.
- Improved scalability of the `InProcessRunner`, enabling testing with larger datasets.
- Improved the cleanup of temporary files created by `TextIO`, `AvroIO`, and other `FileBasedSource` implementations.
- Modified the default version range in the archetypes to exclude beta releases of the Dataflow SDK for Java, version 2.0.0 and later.
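A template-friendly option using `ValueProvider` might be declared like this (a sketch; the interface and option names are hypothetical):

```java
public interface MyOptions extends PipelineOptions {
  ValueProvider<String> getInputFile();   // value may be supplied at runtime
  void setInputFile(ValueProvider<String> value);
}

// TextIO can consume the runtime-provided value directly:
// p.apply(TextIO.Read.from(options.getInputFile()));
```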
Version 1.8.1
- Improved the performance of bounded side inputs in the `DataflowPipelineRunner`.
Version 1.8.0
- Added support to `BigQueryIO.Read` for queries in the new BigQuery Standard SQL dialect using `.withStandardSQL()`.
- Added support in `BigQueryIO` for the new `BYTES`, `TIME`, `DATE`, and `DATETIME` types.
- Added support to `BigtableIO.Read` for reading from a restricted key range using `.withKeyRange(ByteKeyRange)`.
- Improved initial splitting of large uncompressed files in `CompressedSource`, leading to better performance when executing batch pipelines that use `TextIO.Read` on the Cloud Dataflow service.
- Fixed a performance regression when using `BigQueryIO.Write` in streaming mode.
Version 1.7.0
- Added support for Cloud Datastore API v1 in the new `com.google.cloud.dataflow.sdk.io.datastore.DatastoreIO`. Deprecated the old `DatastoreIO` class, which supported only the deprecated Cloud Datastore API v1beta2.
- Improved `DatastoreIO.Read` to support dynamic work rebalancing, and added an option to control the number of query splits using `withNumQuerySplits`.
- Improved `DatastoreIO.Write` to work with an unbounded `PCollection`, supporting writing to Cloud Datastore when using the `DataflowPipelineRunner` in streaming mode.
- Added the ability to delete Cloud Datastore `Entity` objects directly using `Datastore.v1().deleteEntity` or to delete entities by key using `Datastore.v1().deleteKey`.
- Added support for reading from a `BoundedSource` to the `DataflowPipelineRunner` in streaming mode. This enables the use of `TextIO.Read`, `AvroIO.Read`, and other bounded sources in these pipelines.
- Added support for optionally writing a header and/or footer to text files produced with `TextIO.Write`.
- Added the ability to control the number of output shards produced when using a `Sink`.
- Added `TestStream` to enable testing of triggers with multiple panes and late data with the `InProcessPipelineRunner`.
- Added the ability to control the rate at which `UnboundedCountingInput` produces elements using `withRate(long, Duration)`.
- Improved performance and stability for pipelines using the `DataflowPipelineRunner` in streaming mode.
- To support `TestStream`, reimplemented `DataflowAssert` to use `GroupByKey` instead of `sideInputs` to check assertions. This is an update-incompatible change to `DataflowAssert` for pipelines run on the `DataflowPipelineRunner` in streaming mode.
- Fixed an issue in which a `FileBasedSink` would produce no files when writing an empty `PCollection`.
- Fixed an issue in which `BigQueryIO.Read` could not query a table in a non-`US` region when using the `DirectPipelineRunner` or the `InProcessPipelineRunner`.
- Fixed an issue in which the combination of timestamps near the end of the global window and a large `allowedLateness` could cause an `IllegalStateException` for pipelines run in the `DirectPipelineRunner`.
- Fixed a `NullPointerException` that could be thrown during pipeline submission when using an `AfterWatermark` trigger with no late firings.