Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change tootctl search deploy algorithm #14300

Merged
merged 1 commit into from
Jul 14, 2020
Merged

Conversation

Gargron
Copy link
Member

@Gargron Gargron commented Jul 13, 2020

First, performance killer fixes:

  • StatusesIndex definition was missing eager load of preloadable_poll, n+1 queries issue
  • StatusesIndex delete_if was calling searchable_by without crutches, causing lots of n+1 queries **

The actual command change is a divergence from Chewy-provided tasks, to do things our own way.

  • First, we want to control the order in which indices are processed. Before, it went from AccountsIndex, to StatusesIndex, to TagsIndex, which means that the comparatively small TagsIndex was not ready to go until the gigantic StatusesIndex has finished importing. Now we manually define the order: accounts, tags, statuses.
  • Second, when Chewy detected an index schema change, it not only re-created the index, but went ahead to import it at the same time. Realistically, we want indices to be ready for accepting live changes as soon as possible, because any actual import will take a long time, which is what we do now.
  • Third, Chewy fetched all IDs from the database and all IDs from ElasticSearch into memory to calculate which to index and which to de-index. Given the size of our tables, this makes no sense from a memory usage perspective. It's also a long preamble to starting the actual work. Now we just go over what's in the database in batches to index it *
  • Last, Chewy did not provide enough feedback on what it was busy doing. What we want is a progress bar that updates regularly. Now we have that, even with ETA estimation

* We still de-index some stuff, as long as it's in the database. The delete_if check is respected, additionally, we have a special optimization for the statuses index, de-indexing statuses when their searchable_by field is empty. What we don't try to de-index is records that have been removed from the database completely. Having those linger in ElasticSearch is junk, but does not truly impact search results since any ElasticSearch results have to be hydrated from the database. However, their presence in ElasticSearch presumes that the live updates through Sidekiq have failed somehow, which is not a normal situation.

** Removing the delete_if means that live updates through Sidekiq will always index, and never de-index statuses that are not searchable by anyone. This has no search results implications but is suboptimal from an index size perspective. Unfortunately, from my reading of Chewy code, it would be difficult if not impossible to monkey-patch it to pass crutches when calling delete_if.

@Gargron Gargron added the performance Runtime performance label Jul 13, 2020
@Gargron Gargron force-pushed the fix-tootctl-search-deploy branch 3 times, most recently from 1f68824 to d91a9c3 Compare July 14, 2020 12:28
@Gargron Gargron merged commit 4abe3be into master Jul 14, 2020
@Gargron Gargron deleted the fix-tootctl-search-deploy branch July 14, 2020 16:10
shouo1987 pushed a commit to CrossGate-Pawoo/mastodon that referenced this pull request Dec 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Runtime performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants