Speed up GuideImage insert using insert_all #193

Merged on Oct 13, 2023 (7 commits)
4 changes: 4 additions & 0 deletions .rubocop.yml
@@ -27,3 +27,7 @@ Metrics/AbcSize:
 
 Style/HashSyntax:
   EnforcedShorthandSyntax: never
+
+Rails/SkipsModelValidations:
+  Exclude:
+    - "app/services/card_image_loading_service.rb"
2 changes: 1 addition & 1 deletion README.md
@@ -52,7 +52,7 @@ To list all import services for the application: `rake -T | grep import`
 To import the GuideCard records (takes about 3 minutes): `rake import:import_guide_cards`
 To import the SubGuideCard records (takes about 2 minutes): `rake import:import_sub_guide_cards`
 
-The CardImage records are the images that are included in the GuideCard and SubGuideCard records. There are 5,786,727 images. These are estimated to take about 1 day to import.
+The CardImage records are the images that are included in the GuideCard and SubGuideCard records. There are 5,780,170 images. These are estimated to take about 9 minutes to import.
 
 To import the CardImage records: `rake import:import_card_images`

54 changes: 26 additions & 28 deletions app/services/card_image_loading_service.rb
@@ -18,62 +18,60 @@ def import
     barrier = Async::Barrier.new
     Sync do
       semaphore = Async::Semaphore.new(22, parent: barrier)
-
-      (1..22).map do |disk|
+      # Fetch all the files first - there's a lot, but not so many that it can't
+      # sit in memory, and async 22 makes this fast enough.
+      all_files = (1..22).map do |disk|
         semaphore.async do
-          import_disk(disk)
+          disk_array(disk)
         end
-      end.map(&:wait)
+      end.flat_map(&:wait)
+      progress_bar.total = all_files.count
+      progress_bar.progress = 0
+      import_files(all_files)
     ensure
       barrier.stop
     end
   end
 
-  def import_disk(disk)
-    logger.info("Fetching disk #{disk} file list")
-    filenames = disk_array(disk)
-    progress_bar.total += filenames.count
-    Sync do
-      semaphore = Async::Semaphore.new(10_000)
-      filenames.map do |file_name|
-        semaphore.async do
-          progress_bar.increment
-          find_or_create_card_image(file_name)
-        end
-      end.map(&:wait)
+  def import_files(all_files)
+    # insert_all in batches of 1000
+    all_files.each_slice(1000) do |slice|
+      import_slice(slice)
     end
   end
 
+  def import_slice(slice)
+    # Create an array of hashes that represent what we want to insert.
+    insert_slice = slice.map do |file_name|
+      path = file_name.gsub('imagecat-disk', '').split('-')[0..-2].join('/')
+      { path: path, image_name: file_name }
+    end
+    result = CardImage.insert_all(insert_slice)
+    logger.info("Created #{result.count} rows")
+    progress_bar.progress += slice.count
+  end
+
   private
 
   # returns something like
   # ["imagecat-disk9-0091-A3037-1358.0110.tif", "imagecat-disk9-0091-A3037-1358.0111.tif"]
   def disk_array(disk)
+    logger.info("Fetching disk #{disk} file list")
     s3_disk_list(disk).split("\n").map(&:split).map(&:last)
   end
 
   # returns something like
   # "2023-07-19 14:39:38 3422 imagecat-disk9-0091-A3037-1358.0110.tif\n2023-07-19 14:39:38 7010 imagecat-disk9-0091-A3037-1358.0111.tif\n"
+  # :nocov:
   def s3_disk_list(disk)
     `aws s3 ls s3://puliiif-production/imagecat-disk#{disk}-`
   end
-
-  def find_or_create_card_image(file_name)
-    path = file_name.gsub('imagecat-disk', '').split('-')[0..-2].join('/')
-    ci = CardImage.find_by(path: path, image_name: file_name)
-    return if ci
-
-    CardImage.create(path: path, image_name: file_name)
-  end
+  # :nocov:
 
   def progress_bar
     @progress_bar ||= ProgressBar.create(format: '%a %e %P% Loading: %c from %C', output: progress_output, total: 0, title: 'Image import')
   end
 
-  def progress_bar_old(total)
-    ProgressBar.create(format: '%a %e %P% Loading: %c from %C', total: total, output: progress_output)
-  end
-
   def progress_output
     ProgressBar::Outputs::Null if suppress_progress
   end
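
Reviewer note: the file-name handling in `disk_array` and `import_slice` can be exercised without S3 or a database. The sketch below is plain Ruby under stated assumptions: `LISTING`, `file_names`, and `row_for` are hypothetical names introduced only for illustration, and the sample listing is copied from the comments in this file's diff. It shows how each listing line becomes a `{ path:, image_name: }` hash and how the rows would be grouped into batches of 1000.

```ruby
# Sample `aws s3 ls` output, copied from the comment above s3_disk_list.
LISTING = "2023-07-19 14:39:38 3422 imagecat-disk9-0091-A3037-1358.0110.tif\n" \
          "2023-07-19 14:39:38 7010 imagecat-disk9-0091-A3037-1358.0111.tif\n"

# Mirrors disk_array: keep the last whitespace-separated field of each line.
# (Hypothetical helper name; the service calls the AWS CLI instead.)
def file_names(listing)
  listing.split("\n").map(&:split).map(&:last)
end

# Mirrors the mapping in import_slice: strip the 'imagecat-disk' prefix,
# drop the final dash-separated segment, and join the rest with '/'.
def row_for(file_name)
  path = file_name.gsub('imagecat-disk', '').split('-')[0..-2].join('/')
  { path: path, image_name: file_name }
end

rows = file_names(LISTING).map { |name| row_for(name) }
puts rows.first[:path] # => 9/0091/A3037

# Batching as in import_files: each slice would become one insert_all call.
rows.each_slice(1000) { |batch| puts "would insert #{batch.size} rows" }
```

With the two sample lines there is a single batch of 2 rows; the full listing would produce one batch per 1000 file names.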
9 changes: 9 additions & 0 deletions db/migrate/20231012212118_add_card_image_unique_index.rb
@@ -0,0 +1,9 @@
+# frozen_string_literal: true
+
+# Adds a unique index for image_name to card_images so that `insert_all` for
+# bulk ingest will work.
+class AddCardImageUniqueIndex < ActiveRecord::Migration[7.0]
+  def change
+    add_index :card_images, :image_name, unique: true
+  end
+end
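
Reviewer note: Rails documents that `insert_all` skips rows that collide with a unique index, which is presumably what lets the import be re-run safely now that the per-row `find_by` check is gone. The sketch below only models that skip-on-conflict behaviour in plain Ruby. `ToyTable` is a hypothetical in-memory stand-in, not ActiveRecord, and it never touches a database.

```ruby
require 'set'

# Hypothetical in-memory stand-in for a table with a unique index on one column.
class ToyTable
  attr_reader :rows

  def initialize(unique_on:)
    @unique_on = unique_on
    @seen = Set.new
    @rows = []
  end

  # Stores rows whose unique key has not been seen, skips the rest,
  # and returns only the rows that were actually stored.
  def insert_all(new_rows)
    new_rows.select do |row|
      # Set#add? returns nil when the key is already present.
      next false unless @seen.add?(row[@unique_on])

      @rows << row
      true
    end
  end
end

table = ToyTable.new(unique_on: :image_name)
batch = [
  { path: '9/0091/A3037', image_name: 'imagecat-disk9-0091-A3037-1358.0110.tif' },
  { path: '9/0091/A3037', image_name: 'imagecat-disk9-0091-A3037-1358.0111.tif' }
]

first_run = table.insert_all(batch)
second_run = table.insert_all(batch)
puts "first run inserted #{first_run.size}, second run inserted #{second_run.size}"
```

In this model the second run stores nothing because both `image_name` values are already present, which is the property the unique index on `card_images.image_name` is meant to give the real bulk insert.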
3 changes: 2 additions & 1 deletion db/schema.rb

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.