Speed up GuideImage insert using insert_all (#193)
* Use insert_all to speed up imports.

* Include barrier.

* Add some more logging.

* Add logging.

* Try doing import synchronously, maybe it's fast enough.

* Don't track coverage of s3_disk_list.

* Update time estimate.
tpendragon authored Oct 13, 2023
1 parent 4455c1c commit 6fc5b60
Showing 5 changed files with 42 additions and 30 deletions.
4 changes: 4 additions & 0 deletions .rubocop.yml
@@ -27,3 +27,7 @@ Metrics/AbcSize:

Style/HashSyntax:
  EnforcedShorthandSyntax: never

+ Rails/SkipsModelValidations:
+   Exclude:
+     - "app/services/card_image_loading_service.rb"
2 changes: 1 addition & 1 deletion README.md
@@ -52,7 +52,7 @@ To list all import services for the application: `rake -T | grep import`
To import the GuideCard records (takes about 3 minutes): `rake import:import_guide_cards`
To import the SubGuideCard records (takes about 2 minutes): `rake import:import_sub_guide_cards`

- The CardImage records are the images that are included in the GuideCard and SubGuideCard records. There are 5,786,727 images. These are estimated to take about 1 day to import.
+ The CardImage records are the images that are included in the GuideCard and SubGuideCard records. There are 5,780,170 images. These are estimated to take about 9 minutes to import.

To import the CardImage records: `rake import:import_card_images`

54 changes: 26 additions & 28 deletions app/services/card_image_loading_service.rb
@@ -18,62 +18,60 @@ def import
barrier = Async::Barrier.new
Sync do
semaphore = Async::Semaphore.new(22, parent: barrier)

- (1..22).map do |disk|
+ # Fetch all the files first - there's a lot, but not so many that it can't
+ # sit in memory, and async 22 makes this fast enough.
+ all_files = (1..22).map do |disk|
semaphore.async do
- import_disk(disk)
+ disk_array(disk)
end
- end.map(&:wait)
+ end.flat_map(&:wait)
+ progress_bar.total = all_files.count
+ progress_bar.progress = 0
+ import_files(all_files)
ensure
barrier.stop
end
end

- def import_disk(disk)
- logger.info("Fetching disk #{disk} file list")
- filenames = disk_array(disk)
- progress_bar.total += filenames.count
- Sync do
- semaphore = Async::Semaphore.new(10_000)
- filenames.map do |file_name|
- semaphore.async do
- progress_bar.increment
- find_or_create_card_image(file_name)
- end
- end.map(&:wait)
+ def import_files(all_files)
+ # insert_all in batches of 1000
+ all_files.each_slice(1000) do |slice|
+ import_slice(slice)
end
end

+ def import_slice(slice)
+ # Create an array of hashes that represent what we want to insert.
+ insert_slice = slice.map do |file_name|
+ path = file_name.gsub('imagecat-disk', '').split('-')[0..-2].join('/')
+ { path: path, image_name: file_name }
+ end
+ result = CardImage.insert_all(insert_slice)
+ logger.info("Created #{result.count} rows")
+ progress_bar.progress += slice.count
+ end

private

# returns something like
# ["imagecat-disk9-0091-A3037-1358.0110.tif", "imagecat-disk9-0091-A3037-1358.0111.tif"]
def disk_array(disk)
+ logger.info("Fetching disk #{disk} file list")
s3_disk_list(disk).split("\n").map(&:split).map(&:last)
end

# returns something like
# "2023-07-19 14:39:38 3422 imagecat-disk9-0091-A3037-1358.0110.tif\n2023-07-19 14:39:38 7010 imagecat-disk9-0091-A3037-1358.0111.tif\n"
+ # :nocov:
def s3_disk_list(disk)
`aws s3 ls s3://puliiif-production/imagecat-disk#{disk}-`
end

- def find_or_create_card_image(file_name)
- path = file_name.gsub('imagecat-disk', '').split('-')[0..-2].join('/')
- ci = CardImage.find_by(path: path, image_name: file_name)
- return if ci

- CardImage.create(path: path, image_name: file_name)
- end
+ # :nocov:

def progress_bar
@progress_bar ||= ProgressBar.create(format: '%a %e %P% Loading: %c from %C', output: progress_output, total: 0, title: 'Image import')
end

- def progress_bar_old(total)
- ProgressBar.create(format: '%a %e %P% Loading: %c from %C', total: total, output: progress_output)
- end

def progress_output
ProgressBar::Outputs::Null if suppress_progress
end
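
A minimal sketch of the string handling in this file, run against the sample listing quoted in the `s3_disk_list` comment rather than a real bucket. It reuses the same expressions as `disk_array` and `import_slice`; the helper name `path_for` and the local variables are assumptions made for illustration and do not appear in the commit. No database or `aws` call is involved.

```ruby
# Sample listing in the shape shown in the s3_disk_list comment.
listing = "2023-07-19 14:39:38 3422 imagecat-disk9-0091-A3037-1358.0110.tif\n" \
          "2023-07-19 14:39:38 7010 imagecat-disk9-0091-A3037-1358.0111.tif\n"

# Same chain as disk_array: one line per object, and the last
# whitespace-separated field on each line is the file name.
file_names = listing.split("\n").map(&:split).map(&:last)

# Hypothetical helper wrapping the expression used in import_slice:
# drop the "imagecat-disk" prefix, drop the final dash-separated
# segment, and join what remains with "/".
def path_for(file_name)
  file_name.gsub('imagecat-disk', '').split('-')[0..-2].join('/')
end

# Build the array of hashes that would be handed to insert_all.
rows = file_names.map { |name| { path: path_for(name), image_name: name } }

# each_slice yields arrays of at most 1000 rows, the batch size in the commit.
batches = rows.each_slice(1000).to_a

puts file_names.length    # 2
puts rows.first[:path]    # 9/0091/A3037
puts batches.length       # 1
```

With only two sample rows there is a single batch; a full import of several million file names would yield one batch per 1000 rows, and therefore one `insert_all` call per batch.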
9 changes: 9 additions & 0 deletions db/migrate/20231012212118_add_card_image_unique_index.rb
@@ -0,0 +1,9 @@
+ # frozen_string_literal: true
+
+ # Adds a unique index for image_name to card_images so that `insert_all` for
+ # bulk ingest will work.
+ class AddCardImageUniqueIndex < ActiveRecord::Migration[7.0]
+   def change
+     add_index :card_images, :image_name, unique: true
+   end
+ end
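
As I understand Rails' `insert_all`, rows that would violate a unique index are skipped rather than raising, which is presumably why this index lets the bulk import replace the old `find_by` then `create` pair. The sketch below models that skip-on-conflict behaviour in plain Ruby. `FakeCardImageTable` is a hypothetical in-memory stand-in, not ActiveRecord, and it ignores everything else a real database does.

```ruby
require 'set'

# Hypothetical in-memory stand-in for a table with a unique index on
# image_name. This is a model of the behaviour, not ActiveRecord.
class FakeCardImageTable
  attr_reader :rows

  def initialize
    @names = Set.new
    @rows = []
  end

  # Adds every row whose image_name is not already present and returns
  # only the rows that were actually added; duplicates are skipped.
  def insert_all(new_rows)
    new_rows.select do |row|
      # Set#add? returns nil when the element is already in the set.
      next false unless @names.add?(row[:image_name])

      @rows << row
      true
    end
  end
end

table = FakeCardImageTable.new
batch = [
  { path: '9/0091/A3037', image_name: 'imagecat-disk9-0091-A3037-1358.0110.tif' },
  { path: '9/0091/A3037', image_name: 'imagecat-disk9-0091-A3037-1358.0111.tif' }
]

first_run = table.insert_all(batch)
second_run = table.insert_all(batch)

puts "first run created #{first_run.count} rows"   # first run created 2 rows
puts "second run created #{second_run.count} rows" # second run created 0 rows
```

Under this model a repeated import is a no-op for rows that already exist, and the count logged per batch can be smaller than the batch size.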
3 changes: 2 additions & 1 deletion db/schema.rb

Some generated files are not rendered by default.
