
Reading data using the executeSelect API is slow #2764

Open
o-shevchenko opened this issue Nov 11, 2024 · 16 comments

o-shevchenko commented Nov 11, 2024

We use the executeSelect API to run a SQL query and read the results from BigQuery. We expected good throughput based on this article.

Reading data using the executeSelect API is extremely slow: reading 100,000 rows takes 23,930 ms. Profiling showed no obvious hotspots where most of the time is spent.

Are there any recent changes that might have caused a performance degradation for this API?
Do you have a benchmark that shows what performance we should expect?
Thanks!

Environment details

  1. com.google.cloud:google-cloud-bigquery:2.43.3
  2. macOS Sonoma (Apple M1)
  3. Java version: 17

Code example

Mono.fromCallable { bigQueryOptionsBuilder.build().service }
            .flatMap { context ->
                val connectionSettings = ConnectionSettings.newBuilder()
                    .setRequestTimeout(10L)
                    .setUseReadAPI(true)
                    .setMaxResults(1000)
                    .setNumBufferedRows(1000)
                    .setUseQueryCache(true)
                    .build()
                val connection = context.createConnection(connectionSettings)
                val bqResult = connection.executeSelect(sql)
                val result = Flux.usingWhen(
                    Mono.just(bqResult.resultSet),
                    { resultSet -> resultSet.toFlux(bqResult.schema) },
                    { _ -> Mono.fromRunnable<Unit> { connection.close() } }
                )
                Mono.just(Data(result, bqResult.schema.toSchema()))
            }
            ...
            
fun ResultSet.toFlux(schema:Schema): Flux<DataRecord> {
    return Flux.generate<DataRecord> { sink ->
        if (next()) {
            sink.next(toDataRecord(schema))
        } else {
            sink.complete()
        }
    }
}
@product-auto-label product-auto-label bot added the api: bigquerystorage Issues related to the googleapis/java-bigquerystorage API. label Nov 11, 2024
@o-shevchenko o-shevchenko changed the title Read data via executeSelect API is slow Read data with executeSelect API is slow Nov 11, 2024
@o-shevchenko o-shevchenko changed the title Read data with executeSelect API is slow Reading data using the executeSelect API is slow Nov 11, 2024
@o-shevchenko
Copy link
Author

o-shevchenko commented Nov 14, 2024

I've created a simplified test to show performance:

@Test
    fun `test read`() {
        val sql =
            """
            SELECT *
            FROM `pr`
            """.trimIndent().replace("\n", " ")
        val connectionSettings = ConnectionSettings.newBuilder()
            .setRequestTimeout(300)
            .setUseReadAPI(true)
            .setMaxResults(5000)
            .setUseQueryCache(true)
            .build()
        val connection = bigQueryOptionsBuilder.build().service.createConnection(connectionSettings)
        val bqResult = connection.executeSelect(sql)
        val resultSet = bqResult.resultSet

        var n = 1
        var lastTime = Instant.now()
        while (++n < 1_000_000 && resultSet.next()) {
            if (n % 30_000 == 0) {
                val now = Instant.now()
                val duration = Duration.between(lastTime, now)
                println("ROW $n Time: ${duration.toMillis()} ms ${DateTimeFormatter.ISO_INSTANT.format(now)}")
                lastTime = now
            }
        }
    }
ROW 30000 Time: 5516 ms 2024-11-14T12:35:54.354169Z
ROW 60000 Time: 11230 ms 2024-11-14T12:36:05.585005Z
ROW 90000 Time: 5645 ms 2024-11-14T12:36:11.230378Z
ROW 120000 Time: 5331 ms 2024-11-14T12:36:16.561915Z
ROW 150000 Time: 5458 ms 2024-11-14T12:36:22.019994Z
ROW 180000 Time: 5391 ms 2024-11-14T12:36:27.411807Z

~5 seconds to read 30,000 rows

o-shevchenko (Author)

Related issue with benchmark: googleapis/java-bigquery#3574


o-shevchenko commented Nov 15, 2024

After fixing the test, I got the following results:

Benchmark                                            (rowLimit)  Mode  Cnt       Score       Error  Units
ConnImplBenchmark.iterateRecordsUsingReadAPI             500000  avgt    3   76549.893 ± 14496.839  ms/op
ConnImplBenchmark.iterateRecordsUsingReadAPI            1000000  avgt    3  154957.127 ± 25916.110  ms/op
ConnImplBenchmark.iterateRecordsWithBigQuery_Query       500000  avgt    3   82508.807 ± 17930.275  ms/op
ConnImplBenchmark.iterateRecordsWithBigQuery_Query      1000000  avgt    3  165717.219 ± 86960.648  ms/op
ConnImplBenchmark.iterateRecordsWithoutUsingReadAPI      500000  avgt    3   84504.175 ± 36823.590  ms/op
ConnImplBenchmark.iterateRecordsWithoutUsingReadAPI     1000000  avgt    3  165142.367 ± 99899.991  ms/op

That's not what we expected after reading the doc: https://cloud.google.com/blog/topics/developers-practitioners/introducing-executeselect-client-library-method-and-how-use-it/
[Chart from the blog post: read throughput by API]

Comparison with the chart estimates for 1,000,000 rows:

- Storage Read API: the chart shows ~50,000 rows/sec, but I measured 6,453 rows/sec.
- tabledata.list API: estimated at ~5,000 rows/sec; I got a similar result, 5,917 rows/sec.

Is there anything I missed?
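As a quick sanity check (my own arithmetic, not part of the benchmark), the JMH ms/op scores convert to rows/sec like this:

```kotlin
// Convert a JMH avgt score (ms/op, where one op reads `rows` rows)
// into an approximate rows-per-second throughput figure.
fun rowsPerSec(rows: Long, msPerOp: Double): Long =
    (rows * 1000.0 / msPerOp).toLong()

fun main() {
    // iterateRecordsUsingReadAPI, 1,000,000 rows at 154957.127 ms/op
    println(rowsPerSec(1_000_000, 154957.127)) // ≈ 6453 rows/sec
    // iterateRecordsWithoutUsingReadAPI, 1,000,000 rows at 165142.367 ms/op
    println(rowsPerSec(1_000_000, 165142.367)) // ≈ 6055 rows/sec
}
```

So even with the Read API enabled, throughput is an order of magnitude below the ~50,000 rows/sec the chart suggests.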

o-shevchenko (Author)

@Neenu1995 Could you please help address this issue?
Thanks!

o-shevchenko (Author)

@leahecole, do you have any ideas on how to proceed here?
Thanks!

@PhongChuong PhongChuong self-assigned this Dec 2, 2024
PhongChuong (Contributor)

Thanks for fixing the benchmark tests. The result is definitely unexpected. Did you see better results for the Read API in a previous version of com.google.cloud:google-cloud-bigquery?

AFAIK, the changes to executeSelect/Connection in the com.google.cloud:google-cloud-bigquery repo should not have caused any performance degradation.

@yirutang , do you know if there are any recent changes/configuration changes that might cause the performance slowdown?

@o-shevchenko
Copy link
Author

Thanks for the reply, @PhongChuong!
We are trying to build a fast and reliable integration with BigQuery to read and write data using 2.44.0, so I can't say when the performance degradation started.
I've just tested 2.40.0 and 2.30.0 and saw the same speed.
Let me know if I should test any other versions.


o-shevchenko commented Dec 2, 2024

I've noticed that the first run is usually much faster.
Are there any BigQuery configurations or limits we need to set?
I've checked all the docs I could find and don't see anything required beyond enabling the BigQuery Storage Read API:
https://cloud.google.com/blog/topics/developers-practitioners/introducing-executeselect-client-library-method-and-how-use-it/
https://cloud.google.com/bigquery/docs/reference/storage
I don't see any limits or quotas being exceeded either.
Do you see the same result when you run the perf test?

o-shevchenko (Author)

@PhongChuong Do you have any insights on that?
Were you able to reproduce such a performance problem?


yirutang commented Dec 9, 2024

@kmjung


kmjung commented Dec 9, 2024

@o-shevchenko do you have the query job IDs and read session IDs involved in these benchmark runs? @PhongChuong do you know if it's possible to have the client log these?

Also, what benchmark configuration are you using here? Where (e.g. in which GCP region or multi-region) is your BigQuery data, where are you running the benchmark, etc.?
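For reference, one way to surface more client-side detail is to raise java.util.logging verbosity for the client packages. This is only a sketch: the package names below are assumptions, and whether the client actually emits job/read-session IDs at FINE level is not confirmed here.

```kotlin
import java.util.logging.ConsoleHandler
import java.util.logging.Level
import java.util.logging.Logger

// Raise JUL verbosity for the (assumed) BigQuery client logger packages.
// Adjust the package names to match your dependency tree; the client is
// not guaranteed to log job or read-session IDs at this level.
fun enableClientDebugLogging() {
    val handler = ConsoleHandler().apply { level = Level.ALL }
    listOf("com.google.cloud.bigquery", "com.google.cloud.bigquery.storage").forEach { name ->
        Logger.getLogger(name).apply {
            level = Level.FINE
            addHandler(handler)
        }
    }
}
```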

o-shevchenko (Author)

Hi @kmjung,
I use this BQ benchmark with the public dataset bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2017.
I fixed the test in the java-bigquery repo.
The issue always reproduces; you can run the benchmark as-is to see whether you get the same read speed.
I just use my credentials via GOOGLE_APPLICATION_CREDENTIALS=path/to/service_account.json.
googleapis/java-bigquery#3574

Let me know if you need any other details


kmjung commented Dec 9, 2024

Where (e.g in what GCP region or multi-region) is your BigQuery data, where are you running the benchmark, etc?

Can you provide this information also? e.g. are you using a GCE VM in the same region as your BQ data? Running in AWS?

o-shevchenko (Author)

[screenshot: dataset location details]

My service is deployed in AWS, in the same region. I don't think network latency contributes that much; the read itself is very slow.


kmjung commented Dec 9, 2024

I'm less convinced that this is unrelated to AWS <-> GCP network latency. I'll reach out offline to collect more data from you.

o-shevchenko (Author)

Thanks @kmjung, I've created a support case.
I also ran benchmarks from my local machine, reading data directly from BQ, and it's 10x slower than the Snowflake JDBC connector for the same setup.
So I assumed something was wrong with the BQ API or our project configuration.
