-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature Request]: Reduce number of byte[] copies in TextSource #23193
Labels
Comments
lukecwik
added a commit
to lukecwik/incubator-beam
that referenced
this issue
Sep 12, 2022
…e copied (fixes apache#23193) This makes TextSource take about 2.3x less CPU resources during decoding. Before this change: ``` TextSourceBenchmark.benchmarkTextSource thrpt 5 0.248 ± 0.029 ops/s ``` After this change: ``` TextSourceBenchmark.benchmarkHadoopLineReader thrpt 5 0.465 ± 0.064 ops/s TextSourceBenchmark.benchmarkTextSource thrpt 5 0.575 ± 0.059 ops/s ```
lukecwik
added a commit
to lukecwik/incubator-beam
that referenced
this issue
Sep 12, 2022
…e copied (fixes apache#23193) This makes TextSource take about 2.3x less CPU resources during decoding. Before this change: ``` TextSourceBenchmark.benchmarkTextSource thrpt 5 0.248 ± 0.029 ops/s ``` After this change: ``` TextSourceBenchmark.benchmarkHadoopLineReader thrpt 5 0.465 ± 0.064 ops/s TextSourceBenchmark.benchmarkTextSource thrpt 5 0.575 ± 0.059 ops/s ```
CC: @bhisevishal |
lukecwik
added a commit
to lukecwik/incubator-beam
that referenced
this issue
Sep 12, 2022
…e copied (fixes apache#23193) This makes TextSource take about 2.3x less CPU resources during decoding. Before this change: ``` TextSourceBenchmark.benchmarkTextSource thrpt 5 0.248 ± 0.029 ops/s ``` After this change: ``` TextSourceBenchmark.benchmarkHadoopLineReader thrpt 5 0.465 ± 0.064 ops/s TextSourceBenchmark.benchmarkTextSource thrpt 5 0.575 ± 0.059 ops/s ```
lukecwik
added a commit
to lukecwik/incubator-beam
that referenced
this issue
Sep 12, 2022
…e copied (fixes apache#23193) This makes TextSource take about 2.3x less CPU resources during decoding. Before this change: ``` TextSourceBenchmark.benchmarkTextSource thrpt 5 0.248 ± 0.029 ops/s ``` After this change: ``` TextSourceBenchmark.benchmarkHadoopLineReader thrpt 5 0.465 ± 0.064 ops/s TextSourceBenchmark.benchmarkTextSource thrpt 5 0.575 ± 0.059 ops/s ```
Merged
4 tasks
lukecwik
added a commit
that referenced
this issue
Sep 15, 2022
…e copied (fixes #23193) (#23196) * Improve the performance of TextSource by reducing how many byte[]s are copied (fixes #23193) This makes TextSource take about 2.3x less CPU resources during decoding. Before this change: ``` TextSourceBenchmark.benchmarkTextSource thrpt 5 0.248 ± 0.029 ops/s ``` After this change: ``` TextSourceBenchmark.benchmarkHadoopLineReader thrpt 5 0.465 ± 0.064 ops/s TextSourceBenchmark.benchmarkTextSource thrpt 5 0.575 ± 0.059 ops/s ``` * Write file in pieces instead of pre-allocating entire buffer * Address PR comments
cushon
pushed a commit
to cushon/beam
that referenced
this issue
Oct 17, 2022
…e copied (fixes apache#23193) (apache#23196) * Improve the performance of TextSource by reducing how many byte[]s are copied (fixes apache#23193) This makes TextSource take about 2.3x less CPU resources during decoding. Before this change: ``` TextSourceBenchmark.benchmarkTextSource thrpt 5 0.248 ± 0.029 ops/s ``` After this change: ``` TextSourceBenchmark.benchmarkHadoopLineReader thrpt 5 0.465 ± 0.064 ops/s TextSourceBenchmark.benchmarkTextSource thrpt 5 0.575 ± 0.059 ops/s ``` * Write file in pieces instead of pre-allocating entire buffer * Address PR comments
johnjcasey
added a commit
to johnjcasey/beam
that referenced
this issue
Nov 21, 2023
…te[]s are copied (fixes apache#23193) (apache#23196)" This reverts commit 30a48f0
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
What would you like to happen?
The current TextSource implementation is spending a lot of time during
byte[]
copying:Hadoop
LineReader.java
implementation is signficantly faster (~2x) when handling typical files due to an implementation that reduces how manybyte[]
s are copied. A simple benchmark reading 10 million lines (60-120 characters long) shows that it takes about ~2.05 seconds to process such a file while the Apache Beam TextSource takes ~4.03 seconds.Issue Priority
Priority: 2
Issue Component
Component: io-java-text
The text was updated successfully, but these errors were encountered: