Support coalescing reading for avro #5306
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build
A follow-up issue #5312
private lazy val hasAvroJar = ExternalSource.hasSparkAvroJar

test("Use coalescing reading for local files") {
Looks like a lot of test code is duplicated between the Avro tests and MultiReaderTypeSuite?
I think you can move the shared code into MultiReaderTypeSuite and pass the "assume condition" to it.
This implementation is clear and easy to understand, and these tests will be removed once multi-threaded reading is done.
val singleFileInfo = try {
  filterHandler.filterBlocks(file)
} catch {
  case e: FileNotFoundException if ignoreMissingFiles =>
Could you add the related tests for this function?
Do you mean a unit test or an IT?
I could not figure out how to create such a case. Also, there are no related tests for ORC and Parquet.
Do you have any suggestions for this?
Sorry, I commented in the wrong place. I mean we should add a related test for corrupt files. Please refer to test_parquet_read_with_corrupt_files.
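For reference, the kind of behavior such a test would exercise can be sketched roughly as below. This is only an illustration, not the code in this PR: `FilterBlocksSketch`, `FileBlocks` and `filterOrSkip` are hypothetical names, and the `ignoreCorruptFiles` flag is assumed to work alongside `ignoreMissingFiles` in the usual Spark way (skip the file instead of failing the task).

```scala
import java.io.{FileNotFoundException, IOException}

object FilterBlocksSketch {
  // Hypothetical stand-in for the per-file block metadata produced by the filter handler.
  final case class FileBlocks(path: String, blockCount: Int)

  // Runs `filterBlocks` on one file and returns None when the file should be skipped.
  def filterOrSkip(
      path: String,
      ignoreMissingFiles: Boolean,
      ignoreCorruptFiles: Boolean)(
      filterBlocks: String => FileBlocks): Option[FileBlocks] = {
    try {
      Some(filterBlocks(path))
    } catch {
      // A missing file is skipped only when the user asked for it.
      case _: FileNotFoundException if ignoreMissingFiles => None
      // Otherwise re-throw it here, so the corrupt-file case below cannot
      // swallow it (FileNotFoundException is itself an IOException).
      case e: FileNotFoundException => throw e
      // Any other read failure is treated as a corrupt file (assumption).
      case (_: RuntimeException | _: IOException) if ignoreCorruptFiles => None
    }
  }
}
```

A corrupt-file test would then feed a truncated or garbage file and assert that the read either fails or silently skips the file, depending on the flag.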
val headerAndBlocks = BlockInfo(0, header.firstBlockStart, 0, 0) +: blocks
copyBlocksData(headerAndBlocks, in, out)
// check we didn't go over memory
if (out.getPos > estOutSize) {
Could we re-allocate here instead of throwing an exception?
I just copied the code from ORC and Parquet. If this is a corner case, I will do it in a follow-up PR.
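To make the re-allocation suggestion concrete, here is a minimal sketch of growing the destination buffer when the size estimate turns out too small. It uses a plain JVM byte array purely for illustration; `BufferGrowthSketch` and `copyWithGrowth` are hypothetical names, and the actual reader in this PR writes into a pre-sized buffer and throws when the estimate is exceeded.

```scala
import java.util.Arrays

object BufferGrowthSketch {
  // Copies all chunks into a buffer that starts at `estOutSize` bytes and is
  // grown on demand instead of failing when the estimate is too small.
  def copyWithGrowth(chunks: Seq[Array[Byte]], estOutSize: Int): Array[Byte] = {
    var buf = new Array[Byte](math.max(estOutSize, 1))
    var pos = 0
    for (chunk <- chunks) {
      val needed = pos + chunk.length
      if (needed > buf.length) {
        // Grow to at least the needed size, doubling to amortize repeated copies.
        // (Overflow for buffers near 2GB is not handled in this sketch.)
        buf = Arrays.copyOf(buf, math.max(buf.length * 2, needed))
      }
      System.arraycopy(chunk, 0, buf, pos, chunk.length)
      pos = needed
    }
    // Trim to the number of bytes actually written.
    Arrays.copyOf(buf, pos)
  }
}
```

The trade-off is an extra copy on growth versus a hard failure, which is why it is reasonable to defer this to a follow-up if underestimation is a rare corner case.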
    out: OutputStream): Seq[BlockInfo] = {
  val copyRanges = computeCopyRanges(blocks)
  // copy cache: 8MB
  val copyCache = new Array[Byte](8 * 1024 * 1024)
Would you consider doing this the same way as Parquet?
val copyBufferSize = conf.getInt("parquet.read.allocation.size", 8 * 1024 * 1024)
Yes, nice catch, I am doing it and it will come with the multi-threaded reading code.
Please file a follow-up issue, just in case you forget this.
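A rough sketch of what the configurable copy buffer could look like is below. It is an assumption-laden illustration rather than the follow-up implementation: a plain `Map[String, String]` stands in for the Hadoop configuration, and `CopyBufferSketch`, `copyBufferSize` and `copyRange` are hypothetical names. Only the 8MB default and the `parquet.read.allocation.size` key come from the thread above; whether Avro would reuse that key or define its own is left open.

```scala
import java.io.{InputStream, OutputStream}
import scala.util.Try

object CopyBufferSketch {
  // Default taken from the hard-coded copy cache in the diff: 8MB.
  val DefaultCopyBufferSize: Int = 8 * 1024 * 1024

  // Reads the buffer size from a config map, falling back to the default
  // when the key is absent, not a number, or not positive.
  def copyBufferSize(conf: Map[String, String], key: String): Int =
    conf.get(key)
      .flatMap(v => Try(v.trim.toInt).toOption)
      .filter(_ > 0)
      .getOrElse(DefaultCopyBufferSize)

  // Copies up to `length` bytes from `in` to `out` through `buffer`
  // (which must be non-empty) and returns the number of bytes copied.
  def copyRange(in: InputStream, out: OutputStream, length: Long, buffer: Array[Byte]): Long = {
    var remaining = length
    var eof = false
    while (remaining > 0 && !eof) {
      val toRead = math.min(remaining, buffer.length.toLong).toInt
      val n = in.read(buffer, 0, toRead)
      if (n < 0) {
        eof = true // the stream ended before `length` bytes were available
      } else {
        out.write(buffer, 0, n)
        remaining -= n
      }
    }
    length - remaining
  }
}
```

Allocating the buffer once per reader and passing it to `copyRange` keeps the per-range copy free of allocations, whatever size the configuration selects.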
This PR enables coalescing reading for Avro.
It has mainly
Performance on Local
CPU 12 cores, and one GPU (Titan V, with 12GB memory)
Non-partitioned 2000 avro files, 4.4GB in total in LOCAL storage
closes #5149
Signed-off-by: Firestarman <firestarmanllc@gmail.com>