Releases · GoogleCloudPlatform/DataflowJavaSDK
Version 1.6.1
- Fixed an issue with Dataflow jobs reading from `TextIO` with the compression type set to `GZIP` or `BZIP2`. For more information, see Issue #356. (See the sketch below.)
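For illustration, a minimal sketch of an SDK 1.x pipeline that reads gzip-compressed text; the bucket path is a placeholder.

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class GzipReadExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read text files, explicitly marking them as gzip-compressed.
    PCollection<String> lines = p.apply(
        TextIO.Read.from("gs://your-bucket/logs/*.gz")
            .withCompressionType(TextIO.CompressionType.GZIP));

    p.run();
  }
}
```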
Version 1.6.0
- Added `InProcessPipelineRunner`, an improvement over the `DirectPipelineRunner` that better implements the Dataflow model. `InProcessPipelineRunner` runs on a user's local machine and supports multithreaded execution, unbounded `PCollection`s, and triggers for speculative and late outputs. (A sketch of selecting this runner follows this list.)
- Added display data, which allows annotating user functions (`DoFn`, `CombineFn`, and `WindowFn`), `Source`s, and `Sink`s with static metadata to be displayed in the Dataflow Monitoring Interface. Display data has been implemented for core components and is automatically applied to all `PipelineOptions`.
- Added the ability to compose multiple `CombineFn`s into a single `CombineFn` using `CombineFns.compose` or `CombineFns.composeKeyed`.
- Added the methods `getSplitPointsConsumed` and `getSplitPointsRemaining` to the `BoundedReader` API to improve Dataflow's ability to automatically scale a job reading from these sources. Default implementations of these methods have been provided, but reader implementers should override them to provide better information when available.
- Improved performance of side inputs when using workers with many cores.
- Improved efficiency when using `CombineFnWithContext`.
- Fixed several issues related to stability in streaming mode.
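A minimal sketch of selecting the new runner, assuming the 1.6.x package layout (`com.google.cloud.dataflow.sdk.runners.inprocess`):

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.inprocess.InProcessPipelineRunner;

public class LocalRunExample {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    // Execute on the local machine with multithreaded execution, unbounded
    // PCollections, and speculative/late triggering support.
    options.setRunner(InProcessPipelineRunner.class);

    Pipeline p = Pipeline.create(options);
    // ... apply transforms here ...
    p.run();
  }
}
```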
Version 1.5.1
- Fixed an issue that hid `BigtableIO.Read.withRowFilter`, which allows Cloud Bigtable rows to be filtered in the `Read` transform.
- Fixed support for concatenated GZip files.
- Fixed an issue that prevented `Write.to` from being used with merging windows.
- Fixed an issue that caused excessive triggering with repeated composite triggers.
- Fixed an issue with merging windows and triggers that finish before the end of the window.
Version 1.5.0
With this release, we have begun preparing the Dataflow SDK for Java for an eventual move to Apache Beam (incubating). Specifically, we have refactored a number of internal APIs and removed from the SDK the classes that are used only within the worker; those classes will now be provided by the Google Cloud Dataflow Service during job execution. This refactoring should not affect any user code.
Additionally, the 1.5.0 release includes the following changes:
- Enabled an indexed side input format for batch pipelines executed on the Google Cloud Dataflow service. Indexed side inputs significantly increase performance for `View.asList`, `View.asMap`, `View.asMultimap`, and any non-globally-windowed `PCollectionView`s.
- Upgraded to Protocol Buffers version `3.0.0-beta-1`. If you use custom Protocol Buffers, you should recompile them with the corresponding version of the `protoc` compiler. You can continue using both version 2 and version 3 of the Protocol Buffers syntax, and no user pipeline code needs to change.
- Added `ProtoCoder`, a `Coder` for Protocol Buffers messages that supports both version 2 and version 3 of the Protocol Buffers syntax. This coder can detect when messages can be encoded deterministically. `Proto2Coder` is now deprecated; we recommend that all users switch to `ProtoCoder` (see the sketch after this list).
- Added `withoutResultFlattening` to `BigQueryIO.Read` to disable flattening of query results when reading from BigQuery.
- Added `BigtableIO`, enabling support for reading from and writing to Google Cloud Bigtable.
- Improved `CompressedSource` to detect the compression format from the file extension. Added support for reading `.gz` files that are transparently decompressed by the underlying transport logic.
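A rough sketch of migrating from the deprecated `Proto2Coder`, assuming `ProtoCoder` lives in `com.google.cloud.dataflow.sdk.coders.protobuf` and using `MyMessage` as a placeholder for a `protoc`-generated message class:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.protobuf.ProtoCoder;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class ProtoCoderExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Register ProtoCoder for the generated message class so every
    // PCollection<MyMessage> in the pipeline uses it by default.
    // MyMessage is a placeholder for your protoc output.
    p.getCoderRegistry().registerCoder(MyMessage.class, ProtoCoder.of(MyMessage.class));

    // ... build and run the pipeline ...
    p.run();
  }
}
```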
Version 1.4.0
- Added a series of batch and streaming example pipelines in a mobile gaming domain that illustrate some advanced topics, including windowing and triggers.
- Added support for `Combine` functions to access pipeline options and side inputs through a context. See `GlobalCombineFn` and `PerKeyCombineFn` for further details.
- Modified `ParDo.withSideInputs()` such that successive calls are cumulative (see the sketch after this list).
- Modified automatic coder detection of Protocol Buffer messages; such classes now have their coders provided automatically.
- Added support for limiting the number of results returned by `DatastoreIO.Source`. However, when this limit is set, the operation that reads from Cloud Datastore is performed by a single worker rather than executing in parallel across the worker pool.
- Modified the definition of `PaneInfo.{EARLY, ON_TIME, LATE}` so that panes with only late data are always `LATE`, and an `ON_TIME` pane can never cause a later computation to yield a `LATE` pane.
- Modified `GroupByKey` to drop late data when that late data arrives for a window that has expired. A window is expired when the end of the window has passed by more than the allowed lateness.
- When using `GlobalWindows`, you are no longer required to specify `withAllowedLateness()`, since no data is ever dropped.
- Added support for obtaining the default project ID from the default project configuration produced by newer versions of the `gcloud` utility. If the default project configuration does not exist, Dataflow reverts to using the old project configuration generated by older versions of the `gcloud` utility.
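A sketch of the cumulative behavior, assuming two hypothetical singleton views `sideA` and `sideB` of type `PCollectionView<Integer>` and an input `PCollection<String>` named `input`:

```java
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;

// sideA and sideB are assumed PCollectionView<Integer> singletons created
// elsewhere, e.g. via View.asSingleton().
PCollection<String> output = input.apply(
    ParDo.withSideInputs(sideA)
         .withSideInputs(sideB)  // since 1.4.0 this adds sideB instead of replacing sideA
         .of(new DoFn<String, String>() {
           @Override
           public void processElement(ProcessContext c) {
             int total = c.sideInput(sideA) + c.sideInput(sideB);
             c.output(c.element() + ":" + total);
           }
         }));
```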
Version 1.3.0
- Improved `IterableLikeCoder` to efficiently encode small values. This change is backward compatible; however, if you have a running pipeline that was constructed with SDK version 1.3.0 or later, it may not be possible to "update" that pipeline with a replacement that was constructed using SDK version 1.2.1 or older. Updating a running pipeline with a pipeline constructed using a newer SDK version, however, should be successful.
- When `TextIO.Write` or `AvroIO.Write` outputs to a fixed number of files, added a reshard (shuffle) step immediately prior to the write step. The cost of this reshard is often exceeded by the additional parallelism available to the preceding stage.
- Added support for RFC 3339 timestamps in `PubsubIO`. This allows reading from Cloud Pub/Sub topics published by Cloud Logging without losing timestamp information.
- Improved memory management to help prevent pipelines in the streaming execution mode from stalling when running with high memory utilization. This particularly benefits pipelines with large `GroupByKey` results.
- Added the ability to customize the timestamps of emitted windows. Previously, the watermark was held to the earliest timestamp of any buffered input. With this change, you can choose a later time to allow the watermark to progress further. For example, using the end of the window prevents long-lived sessions from holding up the output. See `Window.Bound.withOutputTime()`.
- Added a simplified syntax for early and late firings with an `AfterWatermark` trigger: `AfterWatermark.pastEndOfWindow().withEarlyFirings(...).withLateFirings(...)` (see the sketch after this list).
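A sketch of the new syntax on ten-minute fixed windows; the early/late firing choices, allowed lateness, and the input collection `events` are assumptions for illustration:

```java
import org.joda.time.Duration;

import com.google.cloud.dataflow.sdk.transforms.windowing.AfterPane;
import com.google.cloud.dataflow.sdk.transforms.windowing.AfterProcessingTime;
import com.google.cloud.dataflow.sdk.transforms.windowing.AfterWatermark;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.PCollection;

PCollection<String> windowed = events.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(10)))
        .triggering(AfterWatermark.pastEndOfWindow()
            // Speculative pane one minute after the first element of the pane arrives.
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            // One additional pane for each late element.
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardHours(1))
        .accumulatingFiredPanes());
```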
Version 1.2.1
- Fixed a regression in `BigQueryIO` that unnecessarily printed a lot of messages when executed using `DirectPipelineRunner`.
Version 1.2.0
- Added Java 8 support. Added new `MapElements` and `FlatMapElements` transforms that accept Java 8 lambdas, for those cases when the full power of `ParDo` is not required. `Filter` and `Partition` accept lambdas as well. Java 8 functionality is demonstrated in a new `MinimalWordCountJava8` example (see the sketch after this list).
- Enabled `@DefaultCoder` annotations for generic types. Previously, a `@DefaultCoder` annotation on a generic type was ignored, resulting in diminished functionality and confusing error messages. It now works as expected.
- `DatastoreIO` now supports (parallel) reads within namespaces. Entities can be written to namespaces by setting the namespace in the `Entity` key.
- Limited the `slf4j-jdk14` dependency to the `test` scope. When a Dataflow job is executing, the `slf4j-api`, `slf4j-jdk14`, `jcl-over-slf4j`, `log4j-over-slf4j`, and `log4j-to-slf4j` dependencies will be provided by the system.
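A sketch of the Java 8 lambda style with `MapElements`, assuming a `PCollection<String>` named `words` built earlier in the pipeline:

```java
import com.google.cloud.dataflow.sdk.transforms.MapElements;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.TypeDescriptor;

// words is an assumed PCollection<String>; compute the length of each word.
PCollection<Integer> lengths = words.apply(
    MapElements.via((String word) -> word.length())
        // Lambdas erase the output type, so it must be supplied explicitly.
        .withOutputType(new TypeDescriptor<Integer>() {}));
```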
Version 1.1.0
- Added a coder for type `Set<T>` to the coder registry, for use when type `T` has its own registered coder.
- Added `NullableCoder`, which can be used in conjunction with other coders to encode a `PCollection` whose elements may contain null values (see the sketch after this list).
- Added `Filter` as a composite `PTransform`. Deprecated the static methods in the old `Filter` implementation that return `ParDo` transforms.
- Added `SourceTestUtils`, a set of helper functions and test harnesses for testing the correctness of `Source` implementations.
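A sketch of wrapping an existing coder with `NullableCoder`, assuming a `PCollection<String>` named `maybeNullStrings` whose elements may be null:

```java
import com.google.cloud.dataflow.sdk.coders.NullableCoder;
import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;

// Allow null elements by wrapping the element coder in NullableCoder.
maybeNullStrings.setCoder(NullableCoder.of(StringUtf8Coder.of()));
```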
Version 1.0.0
- The initial General Availability (GA) version, open to all developers, and considered stable and fully qualified for production use. It coincides with the General Availability of the Dataflow Service.
- Removed the default values for `numWorkers`, `maxNumWorkers`, and similar settings. If these are unspecified, the Dataflow Service will pick an appropriate value.
- Added checks to `DirectPipelineRunner` to help ensure that `DoFn`s obey the existing requirement that inputs and outputs must not be modified.
- Added support in `AvroCoder` for `@Nullable` fields with deterministic encoding (see the sketch after this list).
- Added a requirement that anonymous `CustomCoder` subclasses override the `getEncodingId` method.
- Changed `Source.Reader`, `BoundedSource.BoundedReader`, and `UnboundedSource.UnboundedReader` to be abstract classes instead of interfaces. `AbstractBoundedReader` has been merged into `BoundedSource.BoundedReader`.
- Renamed `ByteOffsetBasedSource` and `ByteOffsetBasedReader` to `OffsetBasedSource` and `OffsetBasedReader`, introducing `getBytesPerOffset` as a translation layer.
- Changed `OffsetBasedReader` such that the subclass now has to override `startImpl` and `advanceImpl` rather than `start` and `advance`. The protected variable `rangeTracker` is now hidden and updated by the base class automatically. To indicate split points, use the method `isAtSplitPoint`.
- Removed methods for adjusting watermark triggers.
- Removed an unnecessary generic parameter from `TimeTrigger`.
- Removed generation of empty panes unless explicitly requested.
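A sketch of a value class whose nullable field `AvroCoder` can now encode deterministically; the class and field names are placeholders:

```java
import org.apache.avro.reflect.Nullable;

import com.google.cloud.dataflow.sdk.coders.AvroCoder;
import com.google.cloud.dataflow.sdk.coders.DefaultCoder;

// Placeholder value type encoded with AvroCoder via Avro reflection.
@DefaultCoder(AvroCoder.class)
public class UserEvent {
  int userId;
  @Nullable String referrer;  // may be null; still encoded deterministically

  // Avro reflection requires a no-argument constructor.
  public UserEvent() {}
}
```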