
Correct a technical flaw with the spinlock locking #4201

Closed · wants to merge 1 commit

Conversation

nbougalis (Contributor):

The existing spinlock code, used to protect SHAMapInnerNode child lists, has a mistake that can allow the same child to be repeatedly locked under some circumstances.

The bug was in the SpinBitLock::lock loop condition check and would result in the loop terminating early.

This commit fixes this and further simplifies the lock loop making the correctness of the code easier to verify without sacrificing performance.

It also promotes the spinlock class from an implementation detail to a more general-purpose, easier-to-use lock class with clearer semantics. Two different lock types now allow developers to easily grab either a single spinlock from a group of spinlocks (packed in an unsigned integer), or to grab all of the spinlocks at once.

While this commit makes spinlocks more widely available to developers, they are rarely the best tool for the job. Use them judiciously and only after careful consideration.
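As a rough sketch of the idea (illustrative only, not the PR's actual class; the name `PackedSpinlockSketch` and the 16-bit width are assumptions made here for the example), each bit of a shared atomic integer serves as an independent lock:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Illustrative sketch only: each bit of `bits` is an independent lock.
class PackedSpinlockSketch
{
    std::atomic<std::uint16_t>& bits_;
    std::uint16_t const mask_;

public:
    PackedSpinlockSketch(std::atomic<std::uint16_t>& bits, int index)
        : bits_(bits), mask_(static_cast<std::uint16_t>(1u << index))
    {
        assert(index >= 0 && index < 16);
    }

    bool
    try_lock()
    {
        // fetch_or atomically sets our bit and returns the previous value;
        // if our bit was already set, another holder owns this lock.
        return (bits_.fetch_or(mask_, std::memory_order_acquire) & mask_) == 0;
    }

    void
    unlock()
    {
        // fetch_and clears only our bit, leaving the other 15 locks alone.
        bits_.fetch_and(
            static_cast<std::uint16_t>(~mask_), std::memory_order_release);
    }
};
```

Two such objects built over the same atomic with different indices can be held concurrently; two with the same index contend for the same bit.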

High Level Overview of Change

Context of Change

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking change that only restructures code)
  • Tests (you added tests for code that already exists, or tests for your new feature are included in this PR)
  • Documentation Updates
  • Release

@scottschurr (Collaborator) left a comment:

I'm glad you found the bug and fixed it.

I audited the std::memory_order* stuff the best I could using information I found on the web. It looks like what you've done aligns with the best practices that I found. That said, using anything other than memory_order_seq_cst makes me nervous since I've seen world-renowned experts get this wrong. And, as far as I can tell, the "best practices" I found were not written by world-renowned experts. All I can do is hope you got it right.

I made some suggestions that I hope you'll consider. But I'm not confident enough about any of them to mandate a change. Let me know what you think. Thanks.

std::void_t<
decltype(std::declval<std::atomic<T>&>().fetch_and(0)),
decltype(std::declval<std::atomic<T>&>().fetch_or(0))>>>
class packed_spinlock
@scottschurr:
If we follow current naming conventions this class should be named PackedSpinlock, right? And the filename should be fixed to PackedSpinlock.h?

@nbougalis (Contributor, author):
I'm fine with this rename, although the capitalized types "grind" at my eyes a bit. Still, the code style is the code style.

src/ripple/basics/spinlock.h (resolved)
src/ripple/basics/spinlock.h (outdated; resolved)
std::is_unsigned_v<T> && std::atomic<T>::is_always_lock_free,
std::void_t<
decltype(std::declval<std::atomic<T>&>().fetch_and(0)),
decltype(std::declval<std::atomic<T>&>().fetch_or(0))>>>
@scottschurr:

See enable_if comments on packed_spinlock.

src/ripple/basics/spinlock.h (outdated; resolved)
src/ripple/basics/spinlock.h (outdated; resolved)
Comment on lines 93 to 94
while (!try_lock())
detail::spin_pause();
@scottschurr:
I'm a bit surprised that you switched the implementation to a pure busy wait and lost the yield that was in the previous implementation. The lack of a yield seems especially suspect since you've made the packed_spinlock a general facility rather than a tool used only in one specific place.

I'd expect the lock implementation to look more like this:

        // A spin lock is only likely to be helpful for a small number of tries.
        for (int i = 0; i < 100; ++i)
        {
            if (try_lock())
                return;
            detail::spin_pause();
        }

        // If the spinlock didn't work then yield.
        while (!try_lock())
        {
            std::this_thread::yield();
        }

If you don't include a yield in the lock it seems like you should at least include a comment with some measurements justifying why falling back to a yield is never appropriate.

@nbougalis (Contributor, author):
The primary justification was to simplify the codepath. There's also some weird (very low-level) interactions of spin_pause and comparisons that might actually serve to increase latency, which I was hoping to avoid.

I suspect that the yield isn't going to really help, but at the very least I ought to try and collect some additional metrics.

@HowardHinnant (Contributor):

> I audited the std::memory_order* stuff the best I could using information I found on the web. It looks like what you've done aligns with the best practices that I found. That said, using anything other than memory_order_seq_cst makes me nervous since I've seen world-renowned experts get this wrong.

Fwiw, I also reviewed this and will leave a similar comment. I believe you have it right, not because I know what I'm doing, but because I've seen similar uses elsewhere. This is tricky stuff.

@nbougalis (Contributor, author):

> I audited the std::memory_order* stuff the best I could using information I found on the web. It looks like what you've done aligns with the best practices that I found. That said, using anything other than memory_order_seq_cst makes me nervous since I've seen world-renowned experts get this wrong.

> Fwiw, I also reviewed this and will leave a similar comment. I believe you have it right, not because I know what I'm doing, but because I've seen similar uses elsewhere. This is tricky stuff.

I'm not a world-renowned expert in memory ordering (although I did read a lot to try and get this right). I would normally agree with you, but the intention here is to squeeze every last bit of performance out of the hardware, and I think that the memory ordering will have a (likely small) impact on that.

@HowardHinnant, @scottschurr: who would be a renowned expert that we could ask to review this code?
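For context on the ordering being discussed, a minimal sketch of the acquire/release pairing (illustrative only, not the PR's code; the names here are made up for the example): the acquire on lock acquisition synchronizes-with the release on unlock, so writes made inside the critical section are visible to the next lock holder.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> flag{false};  // simple test-and-set lock
int shared_counter = 0;         // protected by flag

void
locked_increment(int times)
{
    for (int i = 0; i < times; ++i)
    {
        // Acquire: once we win the exchange, the previous holder's
        // writes inside the critical section are visible to us.
        while (flag.exchange(true, std::memory_order_acquire))
            ;  // spin

        ++shared_counter;  // critical section

        // Release: publish our write to whoever acquires the lock next.
        flag.store(false, std::memory_order_release);
    }
}
```

With two threads each calling `locked_increment(10000)`, the counter reliably reaches 20000; with relaxed ordering on both operations, that guarantee disappears.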

@seelabs (Collaborator) left a comment:
👍 nice fix; @scottschurr had some good comments; I'm fine with however those are resolved.

 packed_spinlock(std::atomic<T>& lock, int index)
-    : bits_(lock), mask_(1 << index)
+    : bits_(lock), mask_(detail::masks<T>[index])
@seelabs:
I'm skeptical this makes a difference. I don't object, but I prefer the bitshift (unless you have a benchmark that shows this matters).

@nbougalis (Contributor, author) commented on Jun 16, 2022:
It doesn't really matter, but it is cool, isn't it? 😀

Collaborator:
I'm with @seelabs on this one. I think an explicit bit shift will make the code easier for the bulk of developers to read. Even though, yes, the constexpr stuff is both fun and cool.

@scottschurr (Collaborator):

Today I stumbled across this article on spinlocks from 2019:

https://probablydance.com/2019/12/30/measuring-mutexes-spinlocks-and-how-bad-the-linux-scheduler-really-is/

2019 is recent enough that the information is probably still relevant. That author appears to be knowledgeable. The timings seem useful and instructive. Notice in particular the implementation of spinlock. Personally, I like that implementation better than the one I suggested. And the author's timings back it up.

Just FYI.
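The article's central pattern is "test-and-test-and-set": spin on a cheap plain load, and only attempt the expensive atomic read-modify-write once the lock looks free. A hedged sketch of that pattern (a paraphrase of the idea, not the article's exact code):

```cpp
#include <atomic>
#include <cassert>

// Illustrative "test-and-test-and-set" lock; roughly the shape the
// article advocates, simplified for the example.
class TTASLock
{
    std::atomic<bool> locked_{false};

public:
    bool
    try_lock()
    {
        // First test: a relaxed load avoids hammering the cache line with
        // read-modify-write traffic while another thread holds the lock.
        return !locked_.load(std::memory_order_relaxed) &&
            !locked_.exchange(true, std::memory_order_acquire);
    }

    void
    lock()
    {
        while (!try_lock())
            ;  // a production version would pause and eventually yield here
    }

    void
    unlock()
    {
        locked_.store(false, std::memory_order_release);
    }
};
```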

@HowardHinnant (Contributor):

I've tried but failed to get an outside review on the memory ordering. I still believe it is correct.

@nbougalis (Contributor, author):

> Today I stumbled across this article on spinlocks from 2019:
>
> https://probablydance.com/2019/12/30/measuring-mutexes-spinlocks-and-how-bad-the-linux-scheduler-really-is/
>
> 2019 is recent enough that the information is probably still relevant. That author appears to be knowledgeable. The timings seem useful and instructive. Notice in particular the implementation of spinlock. Personally, I like that implementation better than the one I suggested. And the author's timings back it up.
>
> Just FYI.

This was a great article. The biggest takeaway is that std::mutex is, at least on Linux, ridiculously good, which is something I found out with my own testing. HOWEVER, I feel the need to point out that the spinlocks here aren't meant to be general-purpose locking primitives used in place of std::mutex and other std-provided primitives. The latest version of the code documents this pretty clearly, I thought:

Packed spinlocks allow for tremendously space-efficient lock-sharding but they come at a cost.

First, the implementation is necessarily low-level and uses advanced features like memory ordering and highly platform-specific tricks to maximize performance. This imposes a significant and ongoing cost to developers.

Second, and perhaps most important, is that the packing of multiple spinlocks into a single integer, while efficient, can result in poor performance, as it may translate to additional cache-coherency traffic between processors and place heavy load on a processor's load/store units, potentially affecting performance.

To be sure, these locks can have advantages, but they are not general-purpose locks and should not be thought of or used as such. The use cases for such a construct are likely few and far between, and in the absence of a compelling reason to use one, it's probably best to stay away from spinlocks and use one of the standard locking primitives.

So yes, they are meant for very specific situations, with a good example being the packed_spinlock used by SHAMapInnerNode to provide 16 individual locks, one per child node, without taking up any space at runtime, since the locks fit inside the structure's padding. A single std::mutex would require increasing the size of every instance of the structure by 40 bytes and the equivalent construct with one std::mutex per node would need 160 bytes.
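To make the space argument concrete, here is a hedged sketch (the struct names are hypothetical, and exact sizes vary by platform and standard library; the 40-byte mutex figure above is an observation about one platform, not a guarantee):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical layouts, for size comparison only.
struct NodeWithPackedLocks
{
    void* children[16];
    std::atomic<std::uint16_t> locks;  // 16 one-bit locks in two bytes,
                                       // which may fit in existing padding
};

struct NodeWithMutexes
{
    void* children[16];
    std::mutex locks[16];  // one full mutex per child
};
```

Under these assumed layouts, the packed variant adds at most two bytes plus padding per node, while the mutex variant grows by sizeof(std::mutex) for each of the 16 children.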

Beyond that, I should point out that my implementation does use the "test-and-test-and-set" mechanism described here (right down to using the right std::memory_order specifications!) although obviously I can't quite use the precise code since I want to allow for "packed" spinlocks.

Ultimately, if you feel strongly enough that this code should just be purely contained within SHAMapInnerNode and only used in that one place for that one very specific and niche case, I can do that.

@scottschurr (Collaborator):

FWIW, I wrote a smallish test to exercise packed_spinlock today. It fires up some number of threads on my 6 core (with hyperthreading) machine. All threads are contending for a single packed_spinlock. As long as I limit the test to use at most seven threads it runs and shuts down fine (for ten attempts).

However when I use eight threads about half of the time the test hangs when it attempts to join the threads. The hang suggests to me that this implementation of packed_spinlock is not ready for prime time. But maybe I have an unrealistic expectation of a spin lock?

Ultimately I'd like to see this test (or something like it) run for 24 consecutive hours and shutdown with no problems before we call the packed_spinlock implementation adequately exercised.

The test is kind of big (and hacked together). But I'm pasting it here for completeness' sake.

int
spinlock_test()
{
    // We need something that gets messed up when it is torn, but is still
    // relatively easy to compute.  We'll use an array of 64 bools that
    // represent an unsigned 64-bit value.  That should get thoroughly hosed
    // if two threads cut loose in there at the same time.
    std::array<bool, 64> total{};

    // Each of the threads keeps track of how many times it has incremented
    // total.  If we sum these values we should get the same value that is
    // represented by total.
    constexpr int threadCount = 8;
    std::array<std::uint64_t, threadCount> locals{};
    std::cout << "spinlock test with " << threadCount << " threads"<< std::endl;

    // Use stop to signal the threads to stop running.
    std::atomic<bool> stop = false;

    // Store the threads in a vector so they are easy to join.
    std::vector<std::thread> threads;
    threads.reserve(threadCount);

    // The lock that all of the threads will share.
    std::atomic<std::uint16_t> lock;

    // Start the threads.  Each thread has its own index.
    for (int localsIndex = 0; localsIndex < threadCount; ++localsIndex)
    {
        threads.emplace_back([&total, &locals, &lock, &stop, localsIndex] () {

            // The general usage model seems to be
            //  1. create the spinlock
            //  2. lock the spinlock
            //  3. release the spinlock
            //  4. destroy the spinlock
            //  5. repeat.
            // So we'll use that same model.
            while (!stop)
            {
                // Create and acquire the spinlock.
                static constexpr int lockIndex = 13;
                ripple::packed_spinlock sl(lock, lockIndex);
                std::lock_guard guard(sl);

                // Read out total and make sure it matches the sum of locals.
                std::uint64_t t = 0;
                for (unsigned int i = 0; i < total.size(); ++i)
                {
                    if (total[i])
                        t |= std::uint64_t{1} << i;
                }

                std::uint64_t l = 0;
                for (std::uint64_t const& local : locals)
                    l += local;

                // Total and locals don't agree.  Probably a locking problem.
                if (t != l)
                {
                    stop = true;
                    std::cout << "Thread " << localsIndex
                        << " detects problem.  Total: " << t
                        << "; Locals: " << l << std::endl;
                }
                else
                {
                    // Advance our local and write out the new total.
                    locals[localsIndex] += 1;

                    ++t;
                    for (unsigned int i = 0; i < total.size(); ++i)
                        total[i] = (t & (std::uint64_t{1} << i)) != 0;
                }
                // Leaving scope releases the lock.
            }
        });
    }

    // Let the test run for some time and then kill it.
    using namespace std::chrono_literals;
    auto const stopTime = std::chrono::steady_clock::now() + 30s;

    while (!stop && std::chrono::steady_clock::now() < stopTime)
    {
        std::this_thread::sleep_for(1s);
    }
    stop = true;

    // Join all the threads.
    std::cout << "Joining threads" << std::endl;
    for (std::thread& thread : threads)
    {
        if (thread.joinable())
            thread.join();
    }

    std::uint64_t t = 0;
    for (unsigned int i = 0; i < total.size(); ++i)
    {
        if (total[i])
            t |= std::uint64_t{1} << i;
    }

    std::uint64_t l = 0;
    for (std::uint64_t const& local : locals)
        l += local;

    std::cout << "Total: " << t
        << "; Locals: " << l << std::endl;

    return t == l ? 0 : 1;
}

@scottschurr (Collaborator) left a comment:
I left a few more notes for consideration. But I think the bigger question is about the test sometimes hanging on shutdown with 8 threads.


/** Array representing all possible N-bit values with a single bit set. */
template <class T>
constexpr inline auto const masks = [] constexpr
@scottschurr:
FWIW, clang tells me that it wants a () before the trailing constexpr declaration. But lambdas are implicitly constexpr when they can be. So I think the trailing constexpr is not necessary.

Comment on lines 53 to 64
/** Array representing all possible N-bit values with a single bit set. */
template <class T>
constexpr inline auto const masks = [] constexpr
{
std::array<T, sizeof(T) * 8> ret{1};

for (std::size_t i = 1; i != ret.size(); ++i)
ret[i] = ret[i - 1] << 1;

return ret;
}
();
@scottschurr:
The masks array is in a detail namespace, which is good. But, as far as I can tell, masks is really a detail that belongs only to the packed_spinlock class. So I would prefer to see masks as a static constexpr member of packed_spinlock. It might look like this...

private:
    std::atomic<T>& bits_;
    T mask_;

    // Create a compile-time array of valid masks for the spinlock.
    static constexpr auto const masks_ = []()
    {
        std::array<T, sizeof(T) * 8> ret{1};

        for (std::size_t i = 1; i != ret.size(); ++i)
            ret[i] = ret[i - 1] << 1;

        return ret;
    }();

Then the packed_spinlock constructor would look like this:

    packed_spinlock(std::atomic<T>& lock, int index)
        : bits_(lock), mask_(masks_[index])
    {
        assert(index >= 0 && (mask_ != 0));
    }


private:
std::atomic<T>& bits_;
T mask_;
@scottschurr:
I'm guessing that your intention is that the packed_spinlock is immovable? I can't currently see any reason for moving it, but I could be missing something.

At any rate, if it is okay for packed_spinlock to be immovable, then consider making T const mask_. It may improve some of the generated code if the optimizer knows the value of mask_ can't change.


@scottschurr (Collaborator):

Well, there was a bug in the test that I pasted above. I had misremembered that an std::atomic<std::uint16_t> would initialize itself with the default value. It doesn't. Thankfully @nbougalis pointed that bug out to me. With that fix, my test no longer fails joining with the proposed spinlock.

I understand that @nbougalis is running more extensive tests now. Thanks!

There are a few other comments I have left on this pull request, and those still stand. But the primary thing I was concerned about is now resolved. Sorry for writing a buggy test, and thanks for figuring out the problem!

The existing spinlock code, used to protect SHAMapInnerNode
child lists, has a mistake that can allow the same child to
be repeatedly locked under some circumstances.

The bug was in the `SpinBitLock::lock` loop condition check
and would result in the loop terminating early.

This commit fixes this and further simplifies the lock loop
making the correctness of the code easier to verify without
sacrificing performance.

It also promotes the spinlock class from an implementation
detail to a more general purpose, easier to use lock class
with clearer semantics. Two different lock types now allow
developers to easily grab either a single spinlock from a
group of spinlocks (packed in an unsigned integer), or to
grab all of the spinlocks at once.

While this commit makes spinlocks more widely available to
developers, they are rarely the best tool for the job. Use
them judiciously and only after careful consideration.
@scottschurr (Collaborator):

Looks like all of my comments have been addressed other than the naming issues. For naming compatibility with this code base the new classes should probably be named PackedSpinlock and Spinlock. And the new file should probably be named Spinlock.h.

@nbougalis (Contributor, author):

Merged as 7e46f53

@nbougalis nbougalis closed this Jul 19, 2022
@nbougalis nbougalis deleted the spinlock branch October 16, 2023 06:03