
[C] SQLite driver is very slow when trying to bulk ingest with autocommit = true #466

Closed
paleolimbot opened this issue Feb 16, 2023 · 1 comment · Fixed by #910

@paleolimbot (Member)
Encountered while working with the SQLite driver in R. I ran:

# remotes::install_github("apache/arrow-nanoarrow/r")
# remotes::install_github("apache/arrow-adbc/r/adbcdrivermanager")
# remotes::install_github("paleolimbot/arrow-adbc/r/adbcsqlite@r-sqlite-driver", build = FALSE)

library(adbcdrivermanager)

# Use the driver manager to connect to a database
db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = "test.db")
con <- adbc_connection_init(db)

# Write a table
flights <- nycflights13::flights
# (timestamp not supported yet)
flights$time_hour <- NULL

stmt <- adbc_statement_init(con, adbc.ingest.target_table = "flights")
adbc_statement_bind(stmt, flights)
adbc_statement_execute_query(stmt)
adbc_statement_release(stmt)

stmt <- adbc_statement_init(con) |> 
  adbc_statement_set_sql_query("COMMIT TRANSACTION")
adbc_statement_execute_query(stmt)
adbc_statement_release(stmt)

# Clean up
adbc_connection_release(con)
adbc_database_release(db)

...which resulted in what looks like a "hang" on sqlite3_step(). The debugger indicates that SQLite is spending its time synchronizing the -journal file (I can see the -journal file being created and deleted rapidly). If I run the same thing with adbc.connection.autocommit = "true", the whole thing runs rather quickly.

I'm not sure this is an actual "hang"; I think it's just a lot of unnecessary commits (maybe even one after each row?). The table I'm trying to insert is ~300000 rows. This also only occurs when using a file; with :memory: the bulk ingest completes without issue.

@lidavidm (Member)
Oof, that makes sense. A commit after each row is definitely not necessary. It should explicitly BEGIN/COMMIT in this case.

@lidavidm lidavidm added this to the ADBC Libraries 0.6.0 milestone Jul 20, 2023
lidavidm pushed a commit that referenced this issue Jul 25, 2023
…910)

Fixes #466 by having a single begin/commit txn for ingesting tables,
instead of committing once per row.
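The cost of committing once per row versus once per ingest is easy to reproduce outside the driver. The sketch below is not the ADBC driver code; it uses Python's stdlib `sqlite3` module with `isolation_level=None` (true autocommit), so every INSERT pays its own journal sync unless an explicit BEGIN/COMMIT wraps the loop. The `ingest` function and the toy one-column schema are illustrative assumptions:

```python
import os
import sqlite3
import tempfile
import time

def ingest(path, n, batched):
    # isolation_level=None puts the connection in true autocommit mode:
    # each INSERT commits (and syncs the journal file) on its own unless
    # an explicit transaction is open -- analogous to the pre-fix driver.
    con = sqlite3.connect(path, isolation_level=None)
    con.execute("CREATE TABLE t (x INTEGER)")
    if batched:
        con.execute("BEGIN")  # one transaction around the whole ingest
    for i in range(n):
        con.execute("INSERT INTO t VALUES (?)", (i,))
    if batched:
        con.execute("COMMIT")
    count = con.execute("SELECT count(*) FROM t").fetchone()[0]
    con.close()
    return count

with tempfile.TemporaryDirectory() as d:
    t0 = time.perf_counter()
    ingest(os.path.join(d, "per_row.db"), 1000, batched=False)
    per_row = time.perf_counter() - t0

    t0 = time.perf_counter()
    ingest(os.path.join(d, "batched.db"), 1000, batched=True)
    one_txn = time.perf_counter() - t0

print(f"per-row commits: {per_row:.3f}s, single transaction: {one_txn:.3f}s")
```

On an on-disk database the per-row run is dramatically slower, which matches the original report's observation that the slowdown disappears for `:memory:` databases (no journal file to sync).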

# Testing
## R
I first installed the [SQLite R
driver](https://arrow.apache.org/adbc/main/driver/sqlite.html) and
measured the time it takes to bulk ingest the `nycflights13` dataset,
noting the 336776 rowcount:
```r
library(adbcdrivermanager)
db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = "test.db")
con <- adbc_connection_init(db)

flights <- nycflights13::flights
flights$time_hour <- NULL

stmt <- adbc_statement_init(con, adbc.ingest.target_table = "flights")
adbc_statement_bind(stmt, flights)

start_time <- Sys.time()
adbc_statement_execute_query(stmt)
#> [1] 336776
end_time <- Sys.time()
```
As well as the time it takes to execute the query:
```r
end_time - start_time
#> Time difference of 1.711345 mins
```

As a follow-up, I noticed it takes significantly longer (~30 minutes) to
execute the query on my XPS 15 9520:
- Ubuntu 22.04.2
- kernel 5.19.0
- i9-12900HK @ 4.90 GHz
- Intel Alder Lake-P
- 64 GB RAM

Compared to my MacBook Air (1.711345 minutes):
- macOS Ventura 13.4.1
- Apple M2
- 16 GB RAM

Both machines ran the same version of R, 4.3.1 (Beagle Scouts).

After making my changes, I ran `R CMD INSTALL . --preclean` in
`arrow-adbc/r/adbcdrivermanager`. I also installed the following R
packages:
```r
# install.packages("devtools")
# install.packages("pkgbuild")
```

I then ran the following commands to validate that no build or compile
issues appeared as R packaged my changes:
```r
devtools::build()
devtools::check()
devtools::install()
```
I confirmed that the file I changed showed up among the vendored files:
```
Vendoring files from arrow-adbc to src/:
- ../../adbc.h
- ../../c/driver/sqlite/sqlite.c
- ../../c/driver/sqlite/statement_reader.c
- ../../c/driver/sqlite/statement_reader.h
- ../../c/driver/sqlite/types.h
- ../../c/driver/common/utils.c
- ../../c/driver/common/utils.h
- ../../c/vendor/nanoarrow/nanoarrow.h
- ../../c/vendor/nanoarrow/nanoarrow.c
- ../../c/vendor/sqlite3/sqlite3.h
- ../../c/vendor/sqlite3/sqlite3.c
All files successfully copied to src/
```

After packaging and installing my changes, I ran through the same bulk
ingest commands for the `nycflights13` dataset and verified that the
table contained the same number of rows as the previous run, also noting
the speedup from 1.7 minutes to 0.2 seconds:
```r
...
start_time <- Sys.time()
adbc_statement_execute_query(stmt)
#> [1] 336776
end_time <- Sys.time()
end_time - start_time
#> Time difference of 0.2236128 secs
```