load testing ocis tag handling causes 90% failed requests #9821
There are some things I've noticed around the tag handling.

I'm not sure what the expected scenario is, but I don't think the current approach scales well. If we assume 1 million files, with 500k files tagged and only 30 different tags (which seem like reasonable numbers), we'd need to gather the information from 500k files in the search service, send it to the graph service, and extract the 30 different tags from those files. On the other hand, if we expect to have only 100-200 files tagged, the current approach might be good enough: much less data transferred from one service to another, less memory usage and less processing needed.

Going deeper into the search service, I see some potential problems in the search implementation. The search is divided into up to 20 workers, each one doing a piece of the search. This seems fine on paper, but I'm not sure how well it works in practice, mostly because of the additional work we're doing. If we assume 5 workers and each worker returns 100k results, we're copying those results (I assume we're copying the pointers, so probably not so bad, but still 500k pointers) and sorting those 500k results (likely expensive).

It's unclear to me whether splitting the search work into multiple requests is better than letting bleve handle one big request. I'm not sure, but it seems bleve will go through the same data several times (once per request), and that could be slower than going through the data only once, despite not doing parallel requests. I'd vote for letting bleve do its work unless there is a good reason not to.

In any case, there is another problem with the workers: the maximum of 20 workers applies only to the specific request. The worker pool isn't shared among the requests. This means that, in the worst scenario, each request would spawn 20 workers; assuming 20 requests in parallel, that could be 400 goroutines just for searching. Some ideas to improve the situation here:
Taking into account both previous points, I only see a couple of options to reduce the load caused by the tag handling.
For the add and remove operations for tags, I'll need a deeper look. I assume that bleve has some write locks somewhere and there could be delays on some operations, but the ones I've found seem to be unlocked fast (I don't think the operations inside the write locks should take a lot of time), so maybe they collide with some read locks from the searches.
We might use another index for the tags. While it can improve performance, it also has some drawbacks, and we might need to reach a compromise. Advantages of a new index for the tags:
Disadvantages:
A reasonable compromise could be to only let the admin completely remove tags.
Step 4 could take some time to complete, but after that users won't be able to see the "company-only" tag as part of the available tags (although they could still create it again). Note that the admin flow could be implemented through the command line.
When running k6 tests against a Kubernetes deployment, I see tons of requests failing in the add-remove-tag scenario: