Add read_avro and list_avro_columns for rework on Splittable Avro support #399

yongtang · 2019-07-31T04:55:37Z

This PR is part of the effort to rework Dataset so that large files are first read into Tensors, which speeds up performance. See #382 and #366 for related discussions.

Summary:

1. read_avro can read an Avro file within the range [offset, offset+length], which makes it splittable.
2. We use the primitive read_avro C++ ops to read big chunks, then wire them up with tf.data.Dataset.
3. read_avro could also be used in other places.
4. AvroDataset automatically infers the dtype in eager mode; in graph mode, the user has to specify the dtype in kwargs.
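
To illustrate the splittable read in item 1, here is a minimal sketch of how a large file could be carved into (offset, length) ranges that are then handed to a range-based reader one chunk at a time. The helper name `split_ranges` and the chunk size are hypothetical and not part of this PR; the sketch does not call read_avro itself, since its exact signature is not shown here.

```python
from typing import Iterator, Tuple


def split_ranges(file_size: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that cover [0, file_size) without overlap.

    Hypothetical helper for illustration only; it is not part of this PR.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    offset = 0
    while offset < file_size:
        # The last chunk may be shorter than chunk_size.
        length = min(chunk_size, file_size - offset)
        yield offset, length
        offset += length


# Example: a 10-byte file split into chunks of at most 4 bytes.
ranges = list(split_ranges(file_size=10, chunk_size=4))
print(ranges)
```

Each pair could then be passed to a reader that accepts an offset and a length, so that chunks are read independently and combined afterwards, for example through tf.data.Dataset.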

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>


yongtang · 2019-08-04T15:49:05Z

I also plan to merge this PR, as it exposes a primitive op (read_avro) that could be more useful than a dataset (unless it is passed directly to tf.keras).


yongtang force-pushed the avro branch from 98bfe0f to 1de1d12 Compare July 31, 2019 06:16

yongtang mentioned this pull request Jul 31, 2019

Discuss Batch Standards in TFIO with Keras #382

Open

yongtang force-pushed the avro branch from 1de1d12 to aada0a5 Compare August 4, 2019 00:41

yongtang merged commit 77ee1da into tensorflow:master Aug 4, 2019

yongtang deleted the avro branch August 4, 2019 20:11
