Add `ParticleIDWrapper::make_invalid()` #3735

ax3l · 2024-01-31T01:48:30Z

Summary

A cheaper way to swap validity sign on particle ids, as needed to select and track particles from one kernel to another (e.g., boundary condition treatment, re-emission physics, scraping of particles, etc.).

With our current encoding, calling `ParticleIDWrapper::make_invalid()` on a valid id gives the same result as `id = -id`, but is cheaper.

Improvements:

- less code emitted, faster execution
- less code emitted, fewer instructions to store for occupancy limits on GPU
- faster, streamlined code: no jump calls
- explicit valid/invalid calls instead of swaps with input-dependent outcome
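The difference between a sign swap and the explicit calls can be illustrated with a minimal sketch. It assumes a packed 64-bit word with the validity/sign bit in bit 63, 39 bits of id magnitude below it and 24 cpu bits at the bottom; the free functions and constants here are hypothetical stand-ins for illustration, not the actual `ParticleIDWrapper` members.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout for this sketch: [63] valid/sign bit, [24..62] |id|, [0..23] cpu.
constexpr std::uint64_t valid_bit = std::uint64_t(1) << 63;
constexpr std::uint64_t id_mask   = (std::uint64_t(1) << 39) - 1;
constexpr std::uint64_t cpu_mask  = (std::uint64_t(1) << 24) - 1;

// Encode a signed id: bit 63 is set for id >= 0, the magnitude goes in bits 24..62.
inline void set_id (std::uint64_t& idcpu, std::int64_t id)
{
    std::uint64_t const sign = (id >= 0) ? 1u : 0u;
    std::uint64_t const mag  = static_cast<std::uint64_t>(sign ? id : -id);
    assert(mag <= id_mask);   // magnitude must fit in 39 bits
    idcpu &= cpu_mask;        // keep only the cpu bits
    idcpu |= (sign << 63) | (mag << 24);
}

// Decode the signed id again.
inline std::int64_t get_id (std::uint64_t idcpu)
{
    auto const mag = static_cast<std::int64_t>((idcpu >> 24) & id_mask);
    return (idcpu >> 63) ? mag : -mag;
}

// Swap pattern: decode, negate, re-encode. The outcome depends on the input sign.
inline void old_invert_sign (std::uint64_t& idcpu) { set_id(idcpu, -get_id(idcpu)); }

// Explicit pattern: one bit operation each, with a fixed outcome.
inline void make_invalid (std::uint64_t& idcpu) { idcpu &= ~valid_bit; }
inline void make_valid   (std::uint64_t& idcpu) { idcpu |= valid_bit; }
inline bool is_valid     (std::uint64_t idcpu)  { return (idcpu >> 63) != 0; }
```

In this sketch, calling `make_invalid` twice leaves the id invalid, while calling `old_invert_sign` twice restores the original sign; `make_valid` reverses `make_invalid` without touching the magnitude or cpu bits.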

Additional background

Host Code

https://godbolt.org/z/KPjzExWz1

CUDA Device Code

PTX: https://godbolt.org/z/6En5rK14o
SASS for SM_80: https://godbolt.org/z/d6zYfxaKG

- validation (`is_valid`): old and new have the same register usage and cost; the new code uses a few fewer logical ops
- `make_(in)valid` vs. `id = -id`: now saves 4 registers 🎉
  - this is the one we often use in already heavy physics kernels, so it is a win for occupancy

Interesting: there are still no 64bit shifts / but shuffles in CUDA hardware...

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z12new_is_validPmPd' for 'sm_80'
ptxas info    : Function properties for _Z12new_is_validPmPd
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 368 bytes cmem[0]

ptxas info    : Compiling entry function '_Z12old_is_validPmPd' for 'sm_80'
ptxas info    : Function properties for _Z12old_is_validPmPd
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 368 bytes cmem[0]

ptxas info    : Compiling entry function '_Z14new_make_validPm' for 'sm_80'
ptxas info    : Function properties for _Z14new_make_validPm
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 360 bytes cmem[0]

ptxas info    : Compiling entry function '_Z16new_make_invalidPm' for 'sm_80'
ptxas info    : Function properties for _Z16new_make_invalidPm
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 360 bytes cmem[0]

ptxas info    : Compiling entry function '_Z15old_invert_signPm' for 'sm_80'
ptxas info    : Function properties for _Z15old_invert_signPm
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 12 registers, 360 bytes cmem[0]

Checklist

The proposed changes:

fix a bug or incorrect behavior in AMReX
add new capabilities to AMReX
changes answers in the test suite to more than roundoff level
are likely to significantly affect the results of downstream AMReX users
include documentation in the code and/or rst files, if appropriate

Src/Particle/AMReX_Particle.H

A cheaper and explicit way to swap validity sign on particle ids. Not the same as `id = -id`, but also reversible.

ax3l · 2024-02-01T00:50:12Z

@WeiqunZhang @atmyers @AlexanderSinn ready for review now - let me know if this looks legit

WeiqunZhang · 2024-02-01T00:52:14Z

We will wait till the next release tomorrow.

ax3l · 2024-02-01T00:58:06Z

It's just adding and not changing stuff, so it should be pretty safe, but I will also only need it after the release tomorrow, so no rush.

ax3l · 2024-02-01T06:39:42Z

As another optimization, I explored using 32bit registers via tricks like:

    bool is_valid () const noexcept
    {
        // the leftmost bit is our id's inverse sign;
        // on little-endian hosts, the upper 32 bits are the second uint32_t
        auto const * const i32 = (uint32_t*)&m_idata;
        return i32[1] >> 31;
    }

This does what one expects on CPU (DWORD over QWORD, a 32-bit register used instead of a 64-bit one), and on CUDA GPUs (SM_80) it demotes a 64-bit load to a 32-bit one and removes one ISETP.GT.* op in the SASS code. But since this buys in endianness handling (the upper 32 bits sit at index 1 on little-endian and at index 0 on big-endian systems) and does not reduce register or cmem usage in our test according to ptxas, I would leave it be for now.

AlexanderSinn · 2024-02-01T12:55:11Z

Replacing a 64 bit load with a 32 bit one? Don’t do this, the 64 bit load would be coalesced but the 32 bit load not because there would be a gap to the next thread. The 32 bit version might be slower.

ax3l · 2024-02-01T23:06:16Z

> Replacing a 64 bit load with a 32 bit one? Don’t do this, the 64 bit load would be coalesced but the 32 bit load not because there would be a gap to the next thread. The 32 bit version might be slower.

You are right. Yeah, I thought to load coalesced and then copy into a 32-bit register and do the rest of the ops there... but this micro-optimization does not seem worth it.
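For reference, the variant described here (keep the full 64-bit load, then do the test on the upper half in a 32-bit value) could look like the sketch below. The helper name `is_valid_upper32` is hypothetical, and no performance benefit is claimed for it; it only shows that shifting avoids pointer casts and endianness handling.

```cpp
#include <cstdint>

// Hypothetical helper: the caller loads the whole 64-bit word (so neighboring
// threads still read contiguous 8-byte elements); only the test is done in 32 bits.
inline bool is_valid_upper32 (std::uint64_t idcpu)
{
    // Shifts operate on the value, not on memory, so this is endianness-independent.
    auto const hi = static_cast<std::uint32_t>(idcpu >> 32);
    return (hi >> 31) != 0;
}
```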

ax3l added the performance label Jan 31, 2024

ax3l requested review from atmyers and WeiqunZhang January 31, 2024 01:48

ax3l mentioned this pull request Jan 31, 2024

Particle Container to Pure SoA Again ECP-WarpX/WarpX#4653

Merged

4 tasks

ax3l assigned atmyers and WeiqunZhang Jan 31, 2024

ax3l force-pushed the topic-negate-id branch from 28c76a7 to 5ef1a66 Compare January 31, 2024 02:30

ax3l commented Jan 31, 2024

Src/Particle/AMReX_Particle.H

ax3l force-pushed the topic-negate-id branch 3 times, most recently from c9b0c39 to c5ef706 Compare January 31, 2024 07:33

ax3l mentioned this pull request Jan 31, 2024

[Draft] Simplify idcpu #3737

Closed

5 tasks

ax3l force-pushed the topic-negate-id branch from c5ef706 to eb5d920 Compare January 31, 2024 08:30

ax3l commented Jan 31, 2024

Src/Particle/AMReX_Particle.H (outdated)

ax3l changed the title ~~Add ParticleIDWrapper::negate()~~ Add ParticleIDWrapper::flip_valid() and ::is_valid() Jan 31, 2024

ax3l force-pushed the topic-negate-id branch 3 times, most recently from 1e5d663 to 2d9510f Compare January 31, 2024 16:56

ax3l mentioned this pull request Jan 31, 2024

Update Particle Container to Pure SoA ECP-WarpX/impactx#348

Merged

8 tasks

ax3l changed the title ~~Add ParticleIDWrapper::flip_valid() and ::is_valid()~~ Add ParticleIDWrapper::make_valid() Jan 31, 2024

ax3l changed the title ~~Add ParticleIDWrapper::make_valid()~~ Add ParticleIDWrapper::make_invalid() Jan 31, 2024

ax3l force-pushed the topic-negate-id branch from 2d9510f to a69770b Compare January 31, 2024 18:58

Add ParticleIDWrapper::make_invalid()

9833d7e

ax3l force-pushed the topic-negate-id branch from a69770b to 9833d7e Compare January 31, 2024 19:05

ax3l requested a review from AlexanderSinn January 31, 2024 19:14

WeiqunZhang approved these changes Feb 1, 2024

WeiqunZhang merged commit 296ed40 into AMReX-Codes:development Feb 1, 2024
69 checks passed

ax3l deleted the topic-negate-id branch February 1, 2024 23:05

ax3l mentioned this pull request Feb 2, 2024

ParticleContainer::RedistributeCPU for Pure SoA #3744

Open

2 tasks

ax3l mentioned this pull request Feb 10, 2024

convert to pure SoA particle containers quokka-astro/quokka#515

Open

AlexanderSinn mentioned this pull request Feb 13, 2024

Use id().is_valid() Hi-PACE/hipace#1066

Merged

9 tasks
