Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Round up for the batch size to improve par_iter performance #9814

Merged
merged 2 commits into from
Sep 18, 2023

Conversation

MJohnson459
Copy link
Contributor

Objective

The default division for a usize rounds down which means the batch sizes were too small when the max_size isn't exactly divisible by the batch count.

Solution

Changing the division to round up fixes this which can dramatically improve performance when using par_iter.

I created a small example to proof this out and measured some results. I don't know if it's worth committing this permanently so I left it out of the PR for now.

use std::{thread, time::Duration};

use bevy::{
    prelude::*,
    window::{PresentMode, WindowPlugin},
};

fn main() {
    App::new()
        .add_plugins((DefaultPlugins.set(WindowPlugin {
            primary_window: Some(Window {
                present_mode: PresentMode::AutoNoVsync,
                ..default()
            }),
            ..default()
        }),))
        .add_systems(Startup, spawn)
        .add_systems(Update, update_counts)
        .run();
}

#[derive(Component, Default, Debug, Clone, Reflect)]
pub struct Count(u32);

fn spawn(mut commands: Commands) {
    // Worst case
    let tasks = bevy::tasks::available_parallelism() * 5 - 1;
    // Best case
    // let tasks = bevy::tasks::available_parallelism() * 5 + 1;
    for _ in 0..tasks {
        commands.spawn(Count(0));
    }
}

// changing the bounds of the text will cause a recomputation
fn update_counts(mut count_query: Query<&mut Count>) {
    count_query.par_iter_mut().for_each(|mut count| {
        count.0 += 1;
        thread::sleep(Duration::from_millis(10))
    });
}

Results

I ran this four times, with and without the change, with best case (should favour the old maths) and worst case (should favour the new maths) task numbers.

Worst case

Before the change the batches were 9 on each thread, plus the 5 remainder ran on one of the threads in addition. With the change its 10 on each thread apart from one which has 9. The results show a decrease from ~140ms to ~100ms which matches what you would expect from the maths (10 * 10ms vs (9 + 4) * 10ms).

Screenshot from 2023-09-14 20-24-36

Best case

Before the change the batches were 10 on each thread, plus the 1 remainder ran on one of the threads in addition. With the change its 11 on each thread apart from one which has 5. The results slightly favour the new change but are basically identical as the total time is determined by the worse case which is 11 * 10ms for both tests.

Screenshot from 2023-09-14 20-48-51

The default usize division rounds down which leads to faulty batch sizes
when the max_size isn't exactly divisible.
@alice-i-cecile alice-i-cecile added A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times A-Tasks Tools for parallel and async work labels Sep 14, 2023
Copy link
Member

@alice-i-cecile alice-i-cecile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR description and clear demonstration of the effect. Thanks!

Copy link
Member

@alice-i-cecile alice-i-cecile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, can you please update the docs to describe this behaviour?

@MJohnson459
Copy link
Contributor Author

Thanks! I added a small clarification to the docs to be explicit about the rounding. Is there anywhere else the rough maths is mentioned? That is the docs the bevy 0.10 release notes point to and the only place I could find it mentioned.

@james7132 james7132 self-requested a review September 14, 2023 20:29
Copy link
Member

@james7132 james7132 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been trying to construct a degenerate case where this might cause a perf regression but cannot think of one at this moment.

@alice-i-cecile alice-i-cecile added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Sep 15, 2023
@MJohnson459
Copy link
Contributor Author

I've been trying to construct a degenerate case where this might cause a perf regression but cannot think of one at this moment.

The worst case I can come up with is if I change the test to have multiple different par_iter systems in the same set and each has a different time. The batch calculation above assumes the executor will evenly spread out the tasks across threads and while this is true for the single case, it isn't true for multiple. It seems that we can hit a case where two of the longer tasks are assigned the same thread.

Here is an example update which shows a sub-optimal allocation. I struggled to check tracy to show this nicely, I don't know why the long ones are shown under the short task.
image
image

Anyway, in this case, this change may make things worse (two 10ms vs two 9ms going back to the original example) but the scheduling noise is quite high so I wasn't able to show this well. It also seems the root of the issue here is with the executor, although I don't know if that's something that could even be fixed (I guess it assumes all tasks are equal?). The only thing I can think of would be somehow tell it all tasks from a single par_iter loop are likely to be a similar duration, a hairy assumption.

@alice-i-cecile alice-i-cecile added this pull request to the merge queue Sep 18, 2023
Merged via the queue into bevyengine:main with commit 68fa81e Sep 18, 2023
@cart cart mentioned this pull request Oct 13, 2023
43 tasks
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request Jan 9, 2024
…ne#9814)

# Objective

The default division for a `usize` rounds down which means the batch
sizes were too small when the `max_size` isn't exactly divisible by the
batch count.

## Solution

Changing the division to round up fixes this which can dramatically
improve performance when using `par_iter`.

I created a small example to proof this out and measured some results. I
don't know if it's worth committing this permanently so I left it out of
the PR for now.

```rust
use std::{thread, time::Duration};

use bevy::{
    prelude::*,
    window::{PresentMode, WindowPlugin},
};

fn main() {
    App::new()
        .add_plugins((DefaultPlugins.set(WindowPlugin {
            primary_window: Some(Window {
                present_mode: PresentMode::AutoNoVsync,
                ..default()
            }),
            ..default()
        }),))
        .add_systems(Startup, spawn)
        .add_systems(Update, update_counts)
        .run();
}

#[derive(Component, Default, Debug, Clone, Reflect)]
pub struct Count(u32);

fn spawn(mut commands: Commands) {
    // Worst case
    let tasks = bevy::tasks::available_parallelism() * 5 - 1;
    // Best case
    // let tasks = bevy::tasks::available_parallelism() * 5 + 1;
    for _ in 0..tasks {
        commands.spawn(Count(0));
    }
}

// changing the bounds of the text will cause a recomputation
fn update_counts(mut count_query: Query<&mut Count>) {
    count_query.par_iter_mut().for_each(|mut count| {
        count.0 += 1;
        thread::sleep(Duration::from_millis(10))
    });
}
```

## Results

I ran this four times, with and without the change, with best case
(should favour the old maths) and worst case (should favour the new
maths) task numbers.

### Worst case

Before the change the batches were 9 on each thread, plus the 5
remainder ran on one of the threads in addition. With the change its 10
on each thread apart from one which has 9. The results show a decrease
from ~140ms to ~100ms which matches what you would expect from the maths
(`10 * 10ms` vs `(9 + 4) * 10ms`).

![Screenshot from 2023-09-14
20-24-36](https://github.com/bevyengine/bevy/assets/1353401/82099ee4-83a8-47f4-bb6b-944f1e87a818)

### Best case

Before the change the batches were 10 on each thread, plus the 1
remainder ran on one of the threads in addition. With the change its 11
on each thread apart from one which has 5. The results slightly favour
the new change but are basically identical as the total time is
determined by the worse case which is `11 * 10ms` for both tests.

![Screenshot from 2023-09-14
20-48-51](https://github.com/bevyengine/bevy/assets/1353401/4532211d-ab36-435b-b864-56af3370d90e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-ECS Entities, components, systems, and events A-Tasks Tools for parallel and async work C-Performance A change motivated by improving speed, memory usage or compile times S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants