Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ParticleIDWrapper::make_invalid() #3735

Merged
merged 1 commit into from
Feb 1, 2024

Conversation

ax3l
Copy link
Member

@ax3l ax3l commented Jan 31, 2024

Summary

A cheaper way to swap validity sign on particle ids, as needed to select and track particles from one kernel to another (e.g., boundary condition treatment, re-emission physics, scraping of particles, etc.).

With our current encoding, ParticleIDWrapper::make_invalid() is the same as id = -id, but cheaper.

Improvements:

  • less code emitted, faster execution
  • less code emitted, less instructions to store for occupancy limits on GPU
  • faster, streamlined code: no jump calls
  • explicit valid/invalid calls instead of swaps with input-dependent outcome

Additional background

Host Code

https://godbolt.org/z/KPjzExWz1

id_flips

CUDA Device Code

PTX: https://godbolt.org/z/6En5rK14o
SASS for SM_80: https://godbolt.org/z/d6zYfxaKG

  • validation: old/new are same register usage and cost; new code uses a few less logical ops
  • mark_(in)valid vs. id = -id: now saves 4 registers 🎉
    • this is the one we often use in already heavy physics kernels - a win for occupancy

Interesting: there are still no 64bit shifts / but shuffles in CUDA hardware...

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z12new_is_validPmPd' for 'sm_80'
ptxas info    : Function properties for _Z12new_is_validPmPd
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 368 bytes cmem[0]

ptxas info    : Compiling entry function '_Z12old_is_validPmPd' for 'sm_80'
ptxas info    : Function properties for _Z12old_is_validPmPd
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 368 bytes cmem[0]

ptxas info    : Compiling entry function '_Z14new_make_validPm' for 'sm_80'
ptxas info    : Function properties for _Z14new_make_validPm
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 360 bytes cmem[0]

ptxas info    : Compiling entry function '_Z16new_make_invalidPm' for 'sm_80'
ptxas info    : Function properties for _Z16new_make_invalidPm
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 360 bytes cmem[0]

ptxas info    : Compiling entry function '_Z15old_invert_signPm' for 'sm_80'
ptxas info    : Function properties for _Z15old_invert_signPm
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 12 registers, 360 bytes cmem[0]

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • changes answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • include documentation in the code and/or rst files, if appropriate

@ax3l ax3l force-pushed the topic-negate-id branch 3 times, most recently from c9b0c39 to c5ef706 Compare January 31, 2024 07:33
@ax3l ax3l mentioned this pull request Jan 31, 2024
5 tasks
Src/Particle/AMReX_Particle.H Outdated Show resolved Hide resolved
@ax3l ax3l changed the title Add ParticleIDWrapper::negate() Add ParticleIDWrapper::flip_valid() and ::is_valid() Jan 31, 2024
@ax3l ax3l force-pushed the topic-negate-id branch 3 times, most recently from 1e5d663 to 2d9510f Compare January 31, 2024 16:56
@ax3l ax3l changed the title Add ParticleIDWrapper::flip_valid() and ::is_valid() Add ParticleIDWrapper::make_valid() Jan 31, 2024
@ax3l ax3l changed the title Add ParticleIDWrapper::make_valid() Add ParticleIDWrapper::make_invalid() Jan 31, 2024
A cheaper and explicit way to swap validity sign on particle ids.
Not the same as `id = -id`, but also reversible.
@ax3l
Copy link
Member Author

ax3l commented Feb 1, 2024

@WeiqunZhang @atmyers @AlexanderSinn ready for review now - let me know if this looks legit

@WeiqunZhang
Copy link
Member

We will wait till the next release tomorrow.

@ax3l
Copy link
Member Author

ax3l commented Feb 1, 2024

It's just adding and now changing stuff, so it should be pretty safe, but I will also just need it after the release tomorrow, so no rush.

@ax3l
Copy link
Member Author

ax3l commented Feb 1, 2024

As another optimization, I explored using 32bit registers via tricks like:

    bool is_valid () const noexcept
    {
        // the leftmost bit is our id's inverse sign
        auto const * const i32 = (uint32_t*)&m_idata;
        return *i32 >> 31;

    }

This does what one expects on CPU (DWORD over QWORD, 32bit register used over 64bit one) and on CUDA GPUs (SM_80) it demotes a 64bit load to a 32bit one & reduces one ISETP.GT.* op in SASS code... but since this buys in endianness handling (increment pointer +1 for big endian?), and does not reduce registers and cmem usage in our test according to ptxas, I would leave it be for now.

@AlexanderSinn
Copy link
Member

Replacing a 64 bit load with a 32 bit one? Don’t do this, the 64 bit load would be coalesced but the 32 bit load not because there would be a gap to the next thread. The 32 bit version might be slower.

@WeiqunZhang WeiqunZhang merged commit 296ed40 into AMReX-Codes:development Feb 1, 2024
69 checks passed
@ax3l ax3l deleted the topic-negate-id branch February 1, 2024 23:05
@ax3l
Copy link
Member Author

ax3l commented Feb 1, 2024

Replacing a 64 bit load with a 32 bit one? Don’t do this, the 64 bit load would be coalesced but the 32 bit load not because there would be a gap to the next thread. The 32 bit version might be slower.

You are right. Yeah, I though to load coalesced and then copy into a 32bit register, do rest of ops there... but this micro-optimization seems not worth it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants