Tips on speeding up database build times? #1110
-
I've been doing some image classification using Ludwig, and so far I've gotten around 75% accuracy. I want to try more configurations for the model, but every time I change my configuration file it gets stuck on the dataset build step for a long time. An example of what one of my config files looks like:
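Roughly, it's just an image input feature feeding a category output; a representative sketch of that shape via the Python API (placeholder names, not my actual file):

```python
from ludwig.api import LudwigModel

# Placeholders: "image_path" and "label" stand in for my real columns.
config = {
    "input_features": [{"name": "image_path", "type": "image"}],
    "output_features": [{"name": "label", "type": "category"}],
}

model = LudwigModel(config)
model.train(dataset="mydata.csv")  # placeholder CSV of image paths + labels
```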
For reference, I have about 105 12MP images that I'm preprocessing each run. Should it take this long?
-
@why-does-ie-still-exist there are a few things you can do to improve this.

Ludwig actually builds a cache of the processed data after the first run, specifically to avoid this phenomenon. There is an open issue (#1078, which will be solved soon) about a bug that makes it recreate the cache when it is not needed; once that is fixed, this should not happen anymore (unless you change the preprocessing in your model definition).

When Ludwig runs preprocessing, it creates a .hdf5 file and a .json file with the same name as the dataset. If in subsequent runs you provide those instead of the CSV as inputs, you won't pay the preprocessing cost, as those files are the actual caches.

Additionally, you can speed up the process by configuring preprocessing to run multiple workers in parallel (see the sketch below).

Finally, 12MP images may be quite big; until the issue I mentioned above is fixed, I would suggest resizing them to the desired size beforehand.

Hope this helps!
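To make the caching and the parallelism concrete, a minimal sketch with the Python API (the paths and feature names are placeholders, and `num_processes` assumes a Ludwig version where that image preprocessing parameter is available):

```python
from ludwig.api import LudwigModel

# Placeholder config: one image input, one category output.
# "num_processes" (assumption: supported by your Ludwig version)
# runs image preprocessing with several workers in parallel.
config = {
    "input_features": [
        {
            "name": "image_path",
            "type": "image",
            "preprocessing": {"num_processes": 4},
        }
    ],
    "output_features": [{"name": "label", "type": "category"}],
}

model = LudwigModel(config)

# First run: pays the preprocessing cost and writes mydata.hdf5 and
# mydata.json next to the CSV.
model.train(dataset="mydata.csv")

# Later runs: point at the cached files instead of the CSV to skip
# preprocessing entirely.
model.train(
    dataset="mydata.hdf5",
    training_set_metadata="mydata.json",
)
```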
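And for shrinking the images beforehand, a throwaway script along these lines would do it (Pillow-based; directory names and the target size are placeholders):

```python
from pathlib import Path
from PIL import Image

SRC = Path("images_full")     # placeholder: folder with the 12MP originals
DST = Path("images_resized")  # placeholder: folder for the smaller copies
DST.mkdir(exist_ok=True)

for path in SRC.glob("*.jpg"):
    with Image.open(path) as img:
        # thumbnail() resizes in place, keeps the aspect ratio,
        # and never upscales.
        img.thumbnail((256, 256))
        img.save(DST / path.name)
```

Then point the dataset's image paths at the resized copies.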
-
Lol, just realized I've been running Colab without GPU acceleration. The time went from 25 minutes to less than 2 🤦. That's one way to speed up your database build+training times.