-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C] SQLite driver is very slow when trying to bulk ingest with autocommit = true #466
Milestone
Comments
Oof, that makes sense. A commit after each row is definitely not necessary. It should explicitly BEGIN/COMMIT in this case. |
lidavidm
pushed a commit
that referenced
this issue
Jul 25, 2023
…910) Fixes #466 by having a single begin/commit txn for ingesting tables, instead of committing once per row. # Testing ## R I first installed the [SQLite R driver](https://arrow.apache.org/adbc/main/driver/sqlite.html) and measured the time it takes to bulk ingest the `nycflights13` dataset, noting the 336776 rowcount: ```r library(adbcdrivermanager) db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = "test.db") con <- adbc_connection_init(db) flights <- nycflights13::flights flights$time_hour <- NULL stmt <- adbc_statement_init(con, adbc.ingest.target_table = "flights") adbc_statement_bind(stmt, flights) start_time <- Sys.time() adbc_statement_execute_query(stmt) end_time <- Sys.time() [1] 336776 ``` As well as the time it takes to execute the query: ```r end_time - start_time Time difference of 1.711345 mins ``` As a followup, I noticed it takes significantly longer (~30 minutes) to execute the query on my XPS 15 9520: - Ubuntu 22.04.2 - kernel 5.19.0 - i9-12900HK @ 4.90 GHz - Intel Alder Lake-P - 64 GB RAM Compared to my Macbook Air (1.711345 minutes): - macOS Ventura 13.4.1 - Apple M2 - 16 GB RAM Both on the same version of R 4.3.1 (Beagle Scout). After making my changes, I ran `R CMD INSTALL . --preclean` in `arrow-adbc/r/adbcdrivermanager`. I also installed the following R packages: ```r # install.packages("devtools") # install.packages("pkgbuild") ``` After which I ran the following commands to validate no build / compile issues showing up as R packaged my changes: ```r devtools::build() devtools::check() devtools::install() ``` Noting that the file I made changes to showed up as a vendored file: ```r Vendoring files from arrow-adbc to src/: - ../../adbc.h - ../../c/driver/sqlite/sqlite.c - ../../c/driver/sqlite/statement_reader.c - ../../c/driver/sqlite/statement_reader.h - ../../c/driver/sqlite/types.h - ../../c/driver/common/utils.c - ../../c/driver/common/utils.h - ../../c/vendor/nanoarrow/nanoarrow.h - ../../c/vendor/nanoarrow/nanoarrow.c - ../../c/vendor/sqlite3/sqlite3.h - ../../c/vendor/sqlite3/sqlite3.c All files successfully copied to src/ ``` After packaging and installing my changes, I ran through the same bulk ingest commands for the `nycflights13` dataset and verified that the table contained the same number of rows as the previous run, also noting the speedup from 1.7 minutes to 0.2 seconds: ```r ... > start_time <- Sys.time() adbc_statement_execute_query(stmt) end_time <- Sys.time() [1] 336776 > end_time - start_time Time difference of 0.2236128 secs ```
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Encountered when working with the SQLite driver in R. I ran
...which resulted in a "hang" on
sqlite3_step()
...the debugger seems to indicate that in SQLite3 this is happening when synchronizing the -journal file (I think I see the -journal file being created and deleted rapidly). If I run the same thing withadbc.connection.autocommit = "true"
the whole thing runs rather quickly.I'm not sure this is an actual "hang"...I think it's just a lot of unnecessary commits (maybe even one after each row?). The table I'm trying to insert is ~300000 rows. This also only occurs when using a file...(
:memory:
does the bulk ingest no problem).The text was updated successfully, but these errors were encountered: