- New `tft.word_count` mapper to identify the number of tokens for each row (for pre-tokenized strings).
- All `tft.scale_to_*` mappers now have per-key variants, along with analyzers for `mean_and_var_per_key` and `min_and_max_per_key`.
- New `tft_beam.AnalyzeDatasetWithCache` allows analyzing ranges of data while producing and utilizing cache. `tft.analyzer_cache` can help read and write such cache to a filesystem between runs. This caching feature is worth using when analyzing a rolling range in a continuous pipeline manner. This is an experimental feature.
- Added `reduce_instance_dims` support to `tft.quantiles` and `elementwise` to `tft.bucketize`, while avoiding separate beam calls for each feature.
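
As a rough illustration of what a per-key scaling variant computes, here is a plain-Python sketch. It is not the library's implementation: `scale_0_1_per_key_sketch` is a hypothetical helper, and mapping a constant group to 0.5 is an assumption made for this example.

```python
def scale_0_1_per_key_sketch(values, keys):
    """Scale each value to [0, 1] using the min and max of its own key group."""
    lo, hi = {}, {}
    # First pass: per-key min and max (the role of a min_and_max_per_key analyzer).
    for v, k in zip(values, keys):
        lo[k] = min(v, lo.get(k, v))
        hi[k] = max(v, hi.get(k, v))
    scaled = []
    # Second pass: scale every value with the statistics of its key.
    for v, k in zip(values, keys):
        span = hi[k] - lo[k]
        # Assumption: a group whose values are all identical maps to 0.5.
        scaled.append(0.5 if span == 0 else (v - lo[k]) / span)
    return scaled


values = [1.0, 3.0, 5.0, 10.0, 20.0, 7.0]
keys = ["a", "a", "a", "b", "b", "c"]
print(scale_0_1_per_key_sketch(values, keys))
```

Each key is scaled independently, so 5.0 (the largest value for key "a") and 20.0 (the largest for key "b") both map to 1.0.
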
- `sparse_tensor_to_dense_with_shape` now accepts an optional `default_value` parameter.
- `tft.vocabulary` and `tft.compute_and_apply_vocabulary` now support `fingerprint_shuffle` to sort the vocabularies by fingerprint instead of counts. This is useful for load balancing the training parameter servers. This is an experimental feature.
- Fix numerical instability in `tft.vocabulary` mutual information calculations.
- `tft.vocabulary` and `tft.compute_and_apply_vocabulary` now support computing vocabularies over integer categoricals and multivalent input features, and computing mutual information for non-binary labels.
- New numeric normalization method available: `tft.apply_buckets_with_interpolation`.
- Changes to make this library more compatible with TensorFlow 2.0.
- Fix sanitizing of vocabulary filenames.
- Emit a friendly error message when context isn't set.
- Analyzer output dtypes are enforced to be TensorFlow dtypes, and by extension `ptransform_analyzer`'s `output_dtypes` is enforced to be a list of TensorFlow dtypes.
- Make `tft.apply_buckets_with_interpolation` support SparseTensors.
- Adds an experimental API for analyzers to annotate the post-transform schema.
- `TFTransformOutput.transform_raw_features` now accepts an optional `drop_unused_features` parameter to exclude unused features in output.
- If not specified, the `min_diff_from_avg` parameter of `tft.vocabulary` now defaults to a reasonable value based on the size of the dataset (relevant only if computing vocabularies using mutual information).
- Convert some `tf.contrib` functions to be compatible with TF2.0.
- New `tft.bag_of_words` mapper to compute the unique set of ngrams for each row (for pre-tokenized strings).
- Fixed a bug in `tf_utils.reduce_batch_count_mean_and_var` that caused it, and as a result the `mean_and_var` analyzer, to miscalculate variance for the sparse `elementwise=True` case.
- Add test utility `tft_unit.cross_named_parameters` for creating parameterized tests that involve the cartesian product of various parameters.
- Depends on `tensorflow-metadata>=0.14,<0.15`.
- Depends on `apache-beam[gcp]>=2.14,<3`.
- Depends on `numpy>=1.16,<2`.
- Depends on `absl-py>=0.7,<2`.
- Allow `preprocessing_fn` to emit a `tf.RaggedTensor`. In this case, the output `Schema` proto cannot be converted to a feature spec, and so the output data cannot be materialized with `tft.coders`.
- Ability to directly set exact `num_buckets` with new parameter `always_return_num_quantiles` for `analyzers.quantiles` and `mappers.bucketize`, defaulting to False in general but True when `reduce_instance_dims` is False.
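
To make "the unique set of ngrams for each row" concrete, the following plain-Python sketch processes one pre-tokenized row. The helper names, the `ngram_range` tuple and the `separator` argument are assumptions made for illustration; the real mapper operates on tensors and its output ordering is not modelled here.

```python
def ngrams_sketch(tokens, ngram_range, separator=" "):
    """All n-grams of a token list for n in [ngram_range[0], ngram_range[1]]."""
    lo, hi = ngram_range
    out = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            out.append(separator.join(tokens[i:i + n]))
    return out


def bag_of_words_sketch(tokens, ngram_range, separator=" "):
    """Unique n-grams of one row; a set, so no ordering is implied."""
    return set(ngrams_sketch(tokens, ngram_range, separator))


def word_count_sketch(tokens):
    """Number of tokens in one pre-tokenized row."""
    return len(tokens)


row = ["a", "b", "a", "b"]
print(word_count_sketch(row))
print(sorted(bag_of_words_sketch(row, (1, 2))))
```
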
- `tf_utils.reduce_batch_count_mean_and_var`, which feeds into `tft.mean_and_var`, now returns 0 instead of inf for empty columns of a sparse tensor.
- `tensorflow_transform.tf_metadata.dataset_schema.Schema` class is removed. Wherever a `dataset_schema.Schema` was used, users should now provide a `tensorflow_metadata.proto.v0.schema_pb2.Schema` proto. For backwards compatibility, `dataset_schema.Schema` is now a factory method that produces a `Schema` proto. Updating code should be straightforward because the `dataset_schema.Schema` class was already a wrapper around the `Schema` proto.
- Only explicitly public analyzers are exported to the `tft` module, e.g. combiners are no longer exported and have to be accessed directly through `tft.analyzers`.
- Requires pre-installed TensorFlow >=1.14,<2.
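
The empty-column behaviour can be pictured with a small plain-Python sketch. This is only an illustration of the guard, not the library code: the dictionary-of-columns input and the use of population variance are assumptions.

```python
def column_count_mean_and_var_sketch(columns):
    """Per-column (count, mean, variance) for the values present in a sparse batch.

    `columns` maps a column index to the list of values present in that column.
    """
    stats = {}
    for col, vals in columns.items():
        n = len(vals)
        if n == 0:
            # Guard: an empty column yields 0 rather than inf or a division error.
            stats[col] = (0, 0.0, 0.0)
            continue
        mean = sum(vals) / n
        var = sum((v - mean) ** 2 for v in vals) / n  # population variance (assumption)
        stats[col] = (n, mean, var)
    return stats


print(column_count_mean_and_var_sketch({0: [1.0, 3.0], 1: []}))
```
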
- `DatasetSchema` is now a deprecated factory method (see above).
- `tft.tf_metadata.dataset_schema.from_feature_spec` is now deprecated. Equivalent functionality is provided by `tft.tf_metadata.schema_utils.schema_from_feature_spec`.
- Now `AnalyzeDataset`, `TransformDataset` and `AnalyzeAndTransformDataset` can accept input data that only contains columns needed for that operation as opposed to all columns defined in schema. Utility methods to infer the list of needed columns are added to `tft.inspect_preprocessing_fn`. This makes it easier to take advantage of columnar projection when data is stored in columnar storage formats.
- Python 3.5 is supported.
- Version is now accessible as `tensorflow_transform.__version__`.
- Depends on `apache-beam[gcp]>=2.11,<3`.
- Depends on `protobuf>=3.7,<4`.
- Coders now return index and value features rather than a combined feature for `SparseFeature`.
- Requires pre-installed TensorFlow >=1.13,<2.
- Python 3.5 readiness complete (all tests pass). Full Python 3.5 compatibility is expected to be available with the next version of Transform (after Apache Beam 2.11 is released).
- Performance improvements for vocabulary generation when using top_k.
- New optimized highly experimental API for analyzing a dataset was added, `AnalyzeDatasetWithCache`, which allows reading and writing analyzer cache.
- Update `DatasetMetadata` to be a wrapper around the `tensorflow_metadata.proto.v0.schema_pb2.Schema` proto. TensorFlow Metadata will be the schema used to define data parsing across TFX. The serialized `DatasetMetadata` is now the `Schema` proto in ascii format, but the previous format can still be read.
- Change `ApplySavedModel` implementation to use `tf.Session.make_callable` instead of `tf.Session.run` for improved performance.
- `tft.vocabulary` and `tft.compute_and_apply_vocabulary` now support filtering based on adjusted mutual information when `use_adjusted_mutual_info` is set to True.
- `tft.vocabulary` and `tft.compute_and_apply_vocabulary` now take a regularization term `min_diff_from_avg` that adjusts mutual information to zero whenever the difference between the count of the feature with any label and its expected count is lower than the threshold.
- Added an option to `tft.vocabulary` and `tft.compute_and_apply_vocabulary` to compute a coverage vocabulary, using the new `coverage_top_k`, `coverage_frequency_threshold` and `key_fn` parameters.
- Added `tft.ptransform_analyzer` for advanced use cases.
- Modified `QuantilesCombiner` to use `tf.Session.make_callable` instead of `tf.Session.run` for improved performance.
- ExampleProtoCoder now also supports non-serialized Example representations.
- `tft.tfidf` now accepts a scalar Tensor as `vocab_size`.
- `assertItemsEqual` in unit tests are replaced by `assertCountEqual`.
- `NumPyCombiner` now outputs TF dtypes in output_tensor_infos instead of numpy dtypes.
- Adds function `tft.apply_pyfunc` that provides limited support for `tf.pyfunc`. Note that this is incompatible with serving. See documentation for more details.
- `CombinePerKey` now adds a dimension for the key.
- Depends on `numpy>=1.14.5,<2`.
- Depends on `apache-beam[gcp]>=2.10,<3`.
- Depends on `protobuf==3.7.0rc2`.
- `ExampleProtoCoder.encode` now converts a feature whose value is `None` to an empty value, where before it did not accept `None` as a valid value.
- `AnalyzeDataset`, `AnalyzeAndTransformDataset` and `TransformDataset` can now accept dictionaries which contain `None`, and which will be interpreted the same as an empty list. They will never produce an output containing `None`.
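
The `None` handling described above amounts to treating a missing value as an empty list on input. A minimal sketch, assuming instances are plain dictionaries (`normalize_instance_sketch` is a hypothetical helper, not part of the library):

```python
def normalize_instance_sketch(instance):
    """Return a copy of the instance in which every None value becomes an empty list."""
    return {key: ([] if value is None else value) for key, value in instance.items()}


raw = {"varlen": None, "ids": [1, 2], "score": 3.0}
print(normalize_instance_sketch(raw))
```
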
- `ColumnSchema` and related classes (`Domain`, `Axis` and `ColumnRepresentation` and their subclasses) have been removed. In order to create a schema, use `from_feature_spec`. In order to inspect a schema use the `as_feature_spec` and `domains` methods of `Schema`. The constructors of these classes are replaced by functions that still work when creating a `Schema` but this usage is deprecated.
- Requires pre-installed TensorFlow >=1.12,<2.
- `ExampleProtoCoder.decode` now converts a feature with empty value (e.g. `features { feature { key: "varlen" value { } } }`) or missing key for a feature (e.g. `features { }`) to a `None` in the output dictionary. Before it would represent these with an empty list. This better reflects the original example proto and is consistent with TensorFlow Data Validation.
- Coders now return a `list` instead of an `ndarray` for a `VarLenFeature`.
- `tft.vocabulary` and `tft.compute_and_apply_vocabulary` now support filtering based on mutual information when `labels` is provided.
- Export all package level exports of `tensorflow_transform`, from the `tensorflow_transform.beam` subpackage. This allows users to just import the `tensorflow_transform.beam` subpackage for all functionality.
- Adding API docs.
- Fix bug where Transform returned a different dtype for a VarLenFeature with 0 elements.
- Depends on `apache-beam[gcp]>=2.8,<3`.
- Requires pre-installed TensorFlow >=1.11,<2.
- All functions in `tensorflow_transform.saved.input_fn_maker` are deprecated. See the examples for how to construct the `input_fn` for training and serving. Note that the examples demonstrate the use of the `tf.estimator` API. The functions named `*_serving_input_fn` were for use with the `tf.contrib.estimator` API which is now deprecated. We do not provide examples of usage of the `tf.contrib.estimator` API, instead users should upgrade to the `tf.estimator` API.
- Performance improvements for vocabulary generation when using top_k.
- Utility to deep-copy Beam `PCollection`s was added to avoid unnecessary materialization.
- Utilize deep_copy to avoid unnecessary materialization of pcollections when the input data is immutable. This feature is currently off by default and can be enabled by setting `tft.Context.use_deep_copy_optimization=True`.
- Add bucketize_per_key which computes separate quantiles for each key and then bucketizes each value according to the quantiles computed for its key.
- `tft.scale_to_z_score` is now implemented with a single pass over the data.
- Export schema_utils package to convert from the `tensorflow-metadata` package to the (soon to be deprecated) `tf_metadata` subpackage of `tensorflow-transform`.
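
The per-key bucketization idea can be sketched in plain Python as follows. This is a hedged illustration only: it uses exact quantiles from the standard library (Python 3.8+), whereas the library computes quantiles approximately; `bucketize_per_key_sketch` is a hypothetical helper, and each key needs at least two values for `statistics.quantiles` on older Python versions.

```python
import bisect
import statistics
from collections import defaultdict


def bucketize_per_key_sketch(values, keys, num_buckets):
    """Bucketize each value using quantile boundaries computed for its own key."""
    grouped = defaultdict(list)
    for v, k in zip(values, keys):
        grouped[k].append(v)
    # One set of num_buckets - 1 boundaries per key.
    boundaries = {
        k: statistics.quantiles(vs, n=num_buckets) for k, vs in grouped.items()
    }
    # A value's bucket is the number of its key's boundaries that are <= the value.
    return [bisect.bisect_right(boundaries[k], v) for v, k in zip(values, keys)]


values = [1, 100, 2, 200, 3, 300, 4, 400]
keys = ["a", "b", "a", "b", "a", "b", "a", "b"]
print(bucketize_per_key_sketch(values, keys, num_buckets=2))
```

Because boundaries are computed per key, 4 (large for key "a") lands in the upper bucket while 100 (small for key "b") lands in the lower one.
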
- Memory reduction during vocabulary generation.
- Clarify documentation on return values from `tft.compute_and_apply_vocabulary` and `tft.string_to_int`.
- `tft.unit` now explicitly creates Beam PCollections and validates the transformed dataset by writing and then reading it from disk.
- `tft.min`, `tft.size`, `tft.sum`, `tft.scale_to_z_score` and `tft.bucketize` now support `tf.SparseTensor`.
- Fix to `tft.scale_to_z_score` so it no longer attempts to divide by 0 when the variance is 0.
- Fix bug where internal graph analysis didn't handle the case where an operation has control inputs that are operations (as opposed to tensors).
- `tft.sparse_tensor_to_dense_with_shape` added which allows densifying a `SparseTensor` while specifying the resulting `Tensor`'s shape.
- Add `load_transform_graph` method to `TFTransformOutput` to load the transform graph without applying it. This has the effect of adding variables to the checkpoint when calling it from the training `input_fn` when using `tf.Estimator`.
- `tft.vocabulary` and `tft.compute_and_apply_vocabulary` now accept an optional `weights` argument. When `weights` is provided, weighted frequencies are used instead of frequencies based on counts.
- `tft.quantiles` and `tft.bucketize` now accept an optional `weights` argument. When `weights` is provided, weighted count is used for quantiles instead of the counts themselves.
- Updated examples to construct the schema using `dataset_schema.from_feature_spec`.
- Updated the census example to allow the 'education-num' feature to be missing and fill in a default value when it is.
- Depends on `tensorflow-metadata>=0.9,<1`.
- Depends on `apache-beam[gcp]>=2.6,<3`.
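
The zero-variance fix for z-score scaling can be illustrated with a plain-Python sketch. It is not the library's code: `scale_to_z_score_sketch` is a hypothetical helper, and returning the centered value (0) when the variance is 0 is an assumption about the intended behaviour.

```python
import math


def scale_to_z_score_sketch(values):
    """Standardize values to mean 0 and variance 1, guarding against zero variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var)
    # Assumption: with zero variance, return the centered value instead of dividing by 0.
    return [(v - mean) / std if std > 0 else v - mean for v in values]


print(scale_to_z_score_sketch([1.0, 3.0]))
print(scale_to_z_score_sketch([2.0, 2.0, 2.0]))
```
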
- We now validate a `Schema` in its constructor to make sure that it can be converted to a feature spec. In particular only `tf.int64`, `tf.string` and `tf.float32` types are allowed.
- We now disallow default values for `FixedColumnRepresentation`.
- It is no longer possible to set a default value in the Schema, and validation of shape parameters will occur earlier.
- Removed Schema.as_batched_placeholders() method.
- Removed all components of DatasetMetadata except the schema, and removed all related classes and code.
- Removed the merge method for DatasetMetadata and related classes.
- read_metadata can now only read from a single metadata directory and read_metadata and write_metadata no longer accept the `versions` parameter. They now only read/write the JSON format.
- Requires pre-installed TensorFlow >=1.9,<2.
- `apply_function` is no longer needed and is deprecated. `apply_function(fn, *args)` is now equivalent to `fn(*args)`. tf.Transform is able to handle while loops and tables without the user wrapping the function call in `apply_function`.
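
The stated equivalence between `apply_function(fn, *args)` and `fn(*args)` can be pictured as a thin deprecation shim. The sketch below is a hypothetical stand-in written for illustration; it is not the library's source, and the warning text is invented.

```python
import warnings


def apply_function_shim(fn, *args):
    """Hypothetical stand-in: warn about the deprecation, then call fn directly."""
    warnings.warn(
        "apply_function is deprecated; call fn(*args) directly.",
        DeprecationWarning,
        stacklevel=2,
    )
    return fn(*args)


def add(a, b):
    return a + b


# Both calls produce the same result; only the first emits a warning.
print(apply_function_shim(add, 1, 2) == add(1, 2))
```
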
- Add TFTransformOutput utility class that wraps the output of tf.Transform for use in training. This makes it easier to consume the output written by tf.Transform (see updated examples for usage).
- Increase efficiency of `quantiles` (and therefore `bucketize`).
- Change `tft.sum`/`tft.mean`/`tft.var` to only support basic numeric types.
- Widen the output type of `tft.sum` for some input types to avoid overflow and/or to preserve precision.
- For int32 and int64 input types, change the output type of `tft.mean`/`tft.var`/`tft.scale_to_z_score` from float64 to float32.
- Change the output type of `tft.size` to be always int64.
- `Context` now accepts passthrough_keys which can be used when additional information should be attached to dataset instances in the pipeline which should not be part of the transformation graph, for example: instance keys.
- In addition to using TFTransformOutput, the examples demonstrate new workflows where a vocabulary is computed, but not applied, in the `preprocessing_fn`.
- Added dependency on the absl-py package.
- `TransformTestCase` test cases can now be parameterized.
- Add support for partitioned variables when loading a model.
- Export the `coders` subpackage so that users can access it as `tft.coders`, e.g. `tft.coders.ExampleProtoCoder`.
- Setting dtypes for numpy arrays in `tft.coders.ExampleProtoCoder` and `tft.coders.CsvCoder`.
- `tft.mean`, `tft.max` and `tft.var` now support `tf.SparseTensor`.
- Update examples to use "core" TensorFlow estimator API (`tf.estimator`).
- Depends on `protobuf>=3.6.0,<4`.
- `apply_saved_transform` is removed. See note on `partially_apply_saved_transform` in the `Deprecations` section.
- No longer set `vocabulary_file` in `IntDomain` when using `tft.compute_and_apply_vocabulary` or `tft.apply_vocabulary`.
- Requires pre-installed TensorFlow >=1.8,<2.
- The `expected_asset_file_contents` of `TransformTestCase.assertAnalyzeAndTransformResults` has been deprecated, use `expected_vocab_file_contents` instead.
- `transform_fn_io.TRANSFORMED_METADATA_DIR` and `transform_fn_io.TRANSFORM_FN_DIR` should not be used, they are now aliases for `TFTransformOutput.TRANSFORMED_METADATA_DIR` and `TFTransformOutput.TRANSFORM_FN_DIR` respectively.
- `partially_apply_saved_transform` is deprecated, users should use the `transform_raw_features` method of `TFTransformOutput` instead. These differ in that `partially_apply_saved_transform` can also be used to return both the input placeholders and the outputs. But users do not need this functionality because they will typically create the input placeholders themselves based on the feature spec.
- Renamed `tft.uniques` to `tft.vocabulary`, `tft.string_to_int` to `tft.compute_and_apply_vocabulary` and `tft.apply_vocab` to `tft.apply_vocabulary`. The existing methods will remain for a few more minor releases but are now deprecated and users should migrate away from them.
- Depends on `apache-beam[gcp]>=2.4,<3`.
- Trim min/max value in `tft.bucketize` where the computed number of bucket boundaries is more than requested. Updated documentation to clearly indicate that the number of buckets is computed using approximate algorithms, and that computed number can be more or less than requested.
- Change the namespace used for Beam metrics from `tensorflow_transform` to `tfx.Transform`.
- Update Beam metrics to also log vocabulary sizes.
- `CsvCoder` updated to support unicode.
- Update examples to not use the `coder` argument for IO, and instead use a separate `beam.Map` to encode/decode data.
- Requires pre-installed TensorFlow >=1.6,<2.
- Batching of input instances is now done automatically and dynamically.
- Added analyzers to compute covariance matrices (`tft.covariance`) and principal components for PCA (`tft.pca`).
- CombinerSpec and combine_analyzer now accept multiple inputs/outputs.
- Depends on `apache-beam[gcp]>=2.3,<3`.
- Fixes a bug where TransformDataset would not return correct output if the output DatasetMetadata contained deferred values (such as vocabularies).
- Added checks that the preprocessing function's outputs all have the same size in the batch dimension.
- Added `tft.apply_buckets` which takes an input tensor and a list of bucket boundaries, and returns bucketized data.
- `tft.bucketize` and `tft.apply_buckets` now set metadata for the output tensor, which means the resulting tf.Metadata for the output of these functions will contain min and max values based on the number of buckets, and also be set to categorical.
- Testing helper function assertAnalyzeAndTransformResults can now also test the content of vocabulary files and other assets.
- Reduces the number of beam stages needed for certain analyzers, which can be a performance bottleneck when transforming many features.
- Performance improvements in `tft.uniques`.
- Fix a bug in `tft.bucketize` where the bucket boundary could be same as a min/max value, and was getting dropped.
- Allows scaling individual components of a tensor independently with `tft.scale_by_min_max`, `tft.scale_to_0_1`, and `tft.scale_to_z_score`.
- Fix a bug where `apply_saved_transform` could only be applied in the global name scope.
- Add a warning when `frequency_threshold` is <= 1. Such a threshold is a no-op and generally reflects mistaking `frequency_threshold` for a relative frequency where in fact it is an absolute frequency.
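
Applying a fixed list of bucket boundaries to values can be sketched in plain Python with `bisect`. This is an illustration, not the library's implementation: `apply_buckets_sketch` is a hypothetical helper, and the convention that a value equal to a boundary falls into the upper bucket is an assumption borrowed from TensorFlow's bucketize op.

```python
import bisect


def apply_buckets_sketch(values, bucket_boundaries):
    """Map each value to a bucket index in [0, len(bucket_boundaries)].

    bucket_boundaries must be sorted in ascending order.
    """
    return [bisect.bisect_right(bucket_boundaries, v) for v in values]


boundaries = [0, 10, 100]
print(apply_buckets_sketch([-5, 10000, 150, 10, 5, 100], boundaries))
```

With three boundaries the output takes one of four values, 0 through 3, which is the kind of min/max range the output metadata can record for a categorical feature.
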
- The interfaces of CombinerSpec and combine_analyzer have changed to allow for multiple inputs/outputs.
- Requires pre-installed TensorFlow >=1.5,<2.
- Added a combine_analyzer() that supports a user-provided combiner, conforming to beam.CombineFn(). This allows users to implement custom combiners (e.g. median), to complement analyzers (like min, max) that are prepackaged in TFT.
- Quantiles Analyzer (`tft.quantiles`), with a corresponding `tft.bucketize` mapper.
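
A custom median combiner of the kind mentioned above might look like the sketch below. It follows the four-method shape of a Beam `CombineFn` (`create_accumulator`, `add_input`, `merge_accumulators`, `extract_output`) but is written as a plain class so it runs without Beam installed; in a real pipeline it would need to conform to whatever interface `combine_analyzer` expects, which is not shown here. Keeping every value in the accumulator is a simplification that would not scale to large datasets.

```python
import statistics


class MedianCombinerSketch:
    """Exact median via a list accumulator (illustrative, memory grows with input)."""

    def create_accumulator(self):
        return []

    def add_input(self, accumulator, element):
        accumulator.append(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        merged = []
        for acc in accumulators:
            merged.extend(acc)
        return merged

    def extract_output(self, accumulator):
        return statistics.median(accumulator)


# Drive the combiner by hand, simulating two shards of data.
combiner = MedianCombinerSketch()
shards = [[1, 9], [5]]
partials = []
for shard in shards:
    acc = combiner.create_accumulator()
    for x in shard:
        acc = combiner.add_input(acc, x)
    partials.append(acc)
print(combiner.extract_output(combiner.merge_accumulators(partials)))
```
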
- Depends on `apache-beam[gcp]>=2.2,<3`.
- Fixes some KeyError issues that appeared in certain circumstances when one would call AnalyzeAndTransformDataset (due to a now-fixed Apache Beam [bug](https://issues.apache.org/jira/projects/BEAM/issues/BEAM-2966)).
- Allow all functions that accept and return tensors, to accept an optional name scope, in line with TensorFlow coding conventions.
- Update examples to construct input functions by hand instead of using helper functions.
- Change scale_by_min_max/scale_to_0_1 to return the average(min, max) of the range in case all values are identical.
- Added export of serving model to examples.
- Use "core" version of feature columns (tf.feature_column instead of tf.contrib) in examples.
- A few bug fixes and improvements for coders regarding Python 3.
- Requires pre-installed TensorFlow >= 1.4.
- No longer distributing a WHL file in PyPI. Only doing a source distribution which should however be compatible with all platforms (i.e. you are still able to `pip install tensorflow-transform` and use `requirements.txt` or `setup.py` files for environment setup).
- Some functions now introduce a new name scope when they did not before so the names of tensors may change. This will only affect you if you directly lookup tensors by name in the graph produced by tf.Transform.
- Various Analyzer Specs (_NumericCombineSpec, _UniquesSpec, _QuantilesSpec) are now private. Analyzers are accessible only via the top-level TFT functions (min, max, sum, size, mean, var, uniques, quantiles).
- The `serving_input_fn`s in `tensorflow_transform/saved/input_fn_maker.py` will be removed in a future version and should not be used in new code. See the `examples` directory for details on how to migrate your code to define your own serving functions.
- We now provide helper methods for creating `serving_input_receiver_fn` for use with tf.estimator. These mirror the existing functions targeting the legacy tf.contrib.learn.estimators, i.e. for each `*_serving_input_fn()` in input_fn_maker there is now also a `*_serving_input_receiver_fn()`.
- Introduced `tft.apply_vocab`; this allows users to separately apply a single vocabulary (as generated by `tft.uniques`) to several different columns.
- Provide a source distribution tar `tensorflow-transform-X.Y.Z.tar.gz`.
- The default prefix for `tft.string_to_int` `vocab_filename` changed from `vocab_string_to_int` to `vocab_string_to_int_uniques`. To make your pipelines resilient to implementation details please set `vocab_filename` if you are using the generated vocab_filename on a downstream component.
- Added hash_strings mapper.
- Write vocabularies as asset files instead of constants in the SavedModel.
- 'tft.tfidf' now adds 1 to idf values so that terms in every document in the corpus have a non-zero tfidf value.
- Performance and memory usage improvement when running with Beam runners that use multi-threaded workers.
- Performance optimizations in ExampleProtoCoder.
- Depends on `apache-beam[gcp]>=2.1.1,<3`.
- Depends on `protobuf>=3.3,<4`.
- Depends on `six>=1.9,<1.11`.
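
The effect of adding 1 to the idf values can be seen in a small plain-Python sketch. The smoothed formula used here, `log((N + 1) / (df + 1)) + 1`, is an assumption made for illustration; the note above only states that 1 is added so that terms occurring in every document keep a non-zero weight. `tfidf_sketch` is a hypothetical helper.

```python
import math


def tfidf_sketch(documents):
    """Per-document {term: tf * idf} weights for pre-tokenized documents."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    result = []
    for doc in documents:
        weights = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)
            # Assumed smoothing; the trailing + 1 keeps idf > 0 for ubiquitous terms.
            idf = math.log((n_docs + 1) / (doc_freq[term] + 1)) + 1
            weights[term] = tf * idf
        result.append(weights)
    return result


docs = [["a", "b"], ["a", "c", "c"]]
for weights in tfidf_sketch(docs):
    print(weights)
```

Here "a" appears in every document, so its log term is 0 and its weight reduces to its term frequency rather than vanishing.
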
- Requires pre-installed TensorFlow >= 1.3.
- Removed `tft.map`; use `tft.apply_function` instead (as needed).
- Removed `tft.tfidf_weights`; use `tft.tfidf` instead.
- `beam_metadata_io.WriteMetadata` now requires a second `pipeline` argument (see examples).
- A Beam bug will now affect users who call AnalyzeAndTransformDataset in certain circumstances. Roughly speaking, if you call `beam.Pipeline()` at some point (as all our examples do) you will not experience this bug. The bug is characterized by an error similar to `KeyError: (u'AnalyzeAndTransformDataset/AnalyzeDataset/ComputeTensorValues/Extract[Maximum:0]', None)`. This bug will be fixed in Beam 2.2.
- Add json-example serving input functions to TF.Transform.
- Add variance analyzer to tf.transform.
- Remove duplication in output of `tft.tfidf`.
- Ensure ngrams output dense_shape is greater than or equal to 0.
- Alters the behavior and interface of tensorflow_transform.mappers.ngrams.
- Depends on `apache-beam[gcp]>=2,<3`.
- Making TF Parallelism runner-dependent.
- Fixes issue with csv serving input function.
- Various performance and stability improvements.
- `tft.map` will be removed in version 0.2.0; see the `examples` directory for instructions on how to use `tft.apply_function` instead (as needed).
- `tft.tfidf_weights` will be removed in version 0.2.0; use `tft.tfidf` instead.
- Refactor internals to remove Column and Statistic classes
- Remove collections from graph to avoid warnings
- Return float32 from `tfidf_weights`
- Update tensorflow_transform to use `tf.saved_model` APIs.
- Add default values on example proto coder.
- Various performance and stability improvements.