GpuShuffleCoalesceIterator acquire semaphore after host concat #4396
Closes #4395
This is a small optimization that was spotted while looking into NDS Q64 traces. With the change, Q64 saves up to 3 seconds, though the gain varies quite a bit from run to run. Across all of NDS, it saved ~1 minute for the whole run, chipping away anywhere from a few hundred ms per query up to 5 seconds for q94.
I saw some queries get slower, with the worst case being q42, which was 2x slower in 1 sample out of 10. I have not been able to reproduce this case, with all subsequent runs at 1x or above. q42 took 3.8 seconds in the last weekly run; with the patch it hovers between 3.6 and 4.5 seconds.
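For readers skimming the description, the reordering named in the title can be illustrated with a toy model. This is only a sketch in Python, not the plugin's code: `coalesce_then_acquire`, the `log` list, and the use of `threading.Semaphore` as a stand-in for the GPU semaphore are all assumptions made for illustration. The semaphore is released right after the copy purely to keep the example self-contained; when the real iterator releases it is outside the scope of this sketch.

```python
import threading


def coalesce_then_acquire(host_buffers, gpu_semaphore, log):
    """Hypothetical helper: concatenate on the host, then take the semaphore."""
    # Host-only work: nothing here needs the GPU, so the semaphore is not
    # held yet and other tasks can keep using the device in the meantime.
    log.append("host_concat")
    concatenated = b"".join(host_buffers)

    # Acquire only once the concatenated data is ready to be copied.
    with gpu_semaphore:
        log.append("acquire")
        log.append("copy_to_device")
        # Stand-in for a host-to-device copy (assumption, not a real GPU call).
        device_batch = bytearray(concatenated)

    return device_batch


if __name__ == "__main__":
    events = []
    semaphore = threading.Semaphore(1)
    batch = coalesce_then_acquire([b"ab", b"cd"], semaphore, events)
    print(events, bytes(batch))
```

The point of the ordering is that the time spent in the host concat step no longer counts against the time a task holds the semaphore.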