
[C] SQLite driver is very slow when trying to bulk ingest with autocommit = true #466

Closed
paleolimbot opened this issue Feb 16, 2023 · 1 comment · Fixed by #910

@paleolimbot (Member)
Encountered while working with the SQLite driver in R. I ran:

# remotes::install_github("apache/arrow-nanoarrow/r")
# remotes::install_github("apache/arrow-adbc/r/adbcdrivermanager")
# remotes::install_github("paleolimbot/arrow-adbc/r/adbcsqlite@r-sqlite-driver", build = FALSE)

library(adbcdrivermanager)

# Use the driver manager to connect to a database
db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = "test.db")
con <- adbc_connection_init(db)

# Write a table
flights <- nycflights13::flights
# (timestamp not supported yet)
flights$time_hour <- NULL

stmt <- adbc_statement_init(con, adbc.ingest.target_table = "flights")
adbc_statement_bind(stmt, flights)
adbc_statement_execute_query(stmt)
adbc_statement_release(stmt)

stmt <- adbc_statement_init(con) |> 
  adbc_statement_set_sql_query("COMMIT TRANSACTION")
adbc_statement_execute_query(stmt)
adbc_statement_release(stmt)

# Clean up
adbc_connection_release(con)
adbc_database_release(db)

...which resulted in what looks like a "hang" on sqlite3_step(). The debugger indicates that SQLite is spending its time synchronizing the -journal file (I can see the -journal file being created and deleted rapidly). If I run the same thing with adbc.connection.autocommit = "true", the whole thing runs rather quickly.

I'm not sure this is an actual "hang"; I think it's just a lot of unnecessary commits (maybe even one after each row?). The table I'm trying to insert is ~300000 rows. This also only occurs when using a file; with :memory: the bulk ingest completes without issue.

@lidavidm (Member)
Oof, that makes sense. A commit after each row is definitely not necessary. It should explicitly BEGIN/COMMIT in this case.

@lidavidm lidavidm added this to the ADBC Libraries 0.6.0 milestone Jul 20, 2023
lidavidm pushed a commit that referenced this issue Jul 25, 2023
…910)

Fixes #466 by having a single begin/commit txn for ingesting tables,
instead of committing once per row.
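The cost of committing once per row versus once per ingest is easy to reproduce outside the driver. The sketch below is not the ADBC driver code; it uses Python's stdlib `sqlite3` module with `isolation_level=None` (true autocommit), so every INSERT pays its own journal sync unless an explicit BEGIN/COMMIT wraps the loop. The `ingest` function and the toy one-column schema are illustrative assumptions:

```python
import os
import sqlite3
import tempfile
import time

def ingest(path, n, batched):
    # isolation_level=None puts the connection in true autocommit mode:
    # each INSERT commits (and syncs the journal file) on its own unless
    # an explicit transaction is open -- analogous to the pre-fix driver.
    con = sqlite3.connect(path, isolation_level=None)
    con.execute("CREATE TABLE t (x INTEGER)")
    if batched:
        con.execute("BEGIN")  # one transaction around the whole ingest
    for i in range(n):
        con.execute("INSERT INTO t VALUES (?)", (i,))
    if batched:
        con.execute("COMMIT")
    count = con.execute("SELECT count(*) FROM t").fetchone()[0]
    con.close()
    return count

with tempfile.TemporaryDirectory() as d:
    t0 = time.perf_counter()
    ingest(os.path.join(d, "per_row.db"), 1000, batched=False)
    per_row = time.perf_counter() - t0

    t0 = time.perf_counter()
    ingest(os.path.join(d, "batched.db"), 1000, batched=True)
    one_txn = time.perf_counter() - t0

print(f"per-row commits: {per_row:.3f}s, single transaction: {one_txn:.3f}s")
```

On an on-disk database the per-row run is dramatically slower, which matches the original report's observation that the slowdown disappears for `:memory:` databases (no journal file to sync).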

# Testing
## R
I first installed the [SQLite R
driver](https://arrow.apache.org/adbc/main/driver/sqlite.html) and
measured the time it takes to bulk ingest the `nycflights13` dataset,
noting the 336776 rowcount:
```r
library(adbcdrivermanager)
db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = "test.db")
con <- adbc_connection_init(db)

flights <- nycflights13::flights
flights$time_hour <- NULL

stmt <- adbc_statement_init(con, adbc.ingest.target_table = "flights")
adbc_statement_bind(stmt, flights)

start_time <- Sys.time()
adbc_statement_execute_query(stmt)
#> [1] 336776
end_time <- Sys.time()
```
As well as the time it takes to execute the query:
```r
end_time - start_time
#> Time difference of 1.711345 mins
```

As a follow-up, I noticed it takes significantly longer (~30 minutes) to
execute the query on my XPS 15 9520:
- Ubuntu 22.04.2
- kernel 5.19.0
- i9-12900HK @ 4.90 GHz
- Intel Alder Lake-P
- 64 GB RAM

Compared to my MacBook Air (1.711345 minutes):
- macOS Ventura 13.4.1
- Apple M2
- 16 GB RAM

Both machines ran the same version of R, 4.3.1 (Beagle Scouts).

After making my changes, I ran `R CMD INSTALL . --preclean` in
`arrow-adbc/r/adbcdrivermanager`. I also installed the following R
packages:
```r
# install.packages("devtools")
# install.packages("pkgbuild")
```

I then ran the following commands to validate that no build or compile
issues appeared as R packaged my changes:
```r
devtools::build()
devtools::check()
devtools::install()
```
I confirmed that the file I changed showed up among the vendored files:
```
Vendoring files from arrow-adbc to src/:
- ../../adbc.h
- ../../c/driver/sqlite/sqlite.c
- ../../c/driver/sqlite/statement_reader.c
- ../../c/driver/sqlite/statement_reader.h
- ../../c/driver/sqlite/types.h
- ../../c/driver/common/utils.c
- ../../c/driver/common/utils.h
- ../../c/vendor/nanoarrow/nanoarrow.h
- ../../c/vendor/nanoarrow/nanoarrow.c
- ../../c/vendor/sqlite3/sqlite3.h
- ../../c/vendor/sqlite3/sqlite3.c
All files successfully copied to src/
```

After packaging and installing my changes, I ran through the same bulk
ingest commands for the `nycflights13` dataset and verified that the
table contained the same number of rows as the previous run, also noting
the speedup from 1.7 minutes to 0.2 seconds:
```r
...
start_time <- Sys.time()
adbc_statement_execute_query(stmt)
#> [1] 336776
end_time <- Sys.time()
end_time - start_time
#> Time difference of 0.2236128 secs
```