
Correct a technical flaw with the spinlock locking #4201

Closed · wants to merge 1 commit

Conversation

nbougalis (Contributor):

The existing spinlock code, used to protect SHAMapInnerNode child lists, has a mistake that can allow the same child to be repeatedly locked under some circumstances.

The bug was in the SpinBitLock::lock loop condition check and would result in the loop terminating early.

This commit fixes this and further simplifies the lock loop making the correctness of the code easier to verify without sacrificing performance.

It also promotes the spinlock class from an implementation detail to a more general-purpose, easier-to-use lock class with clearer semantics. Two different lock types now allow developers to easily grab either a single spinlock from a group of spinlocks (packed in an unsigned integer), or to grab all of the spinlocks at once.

While this commit makes spinlocks more widely available to developers, they are rarely the best tool for the job. Use them judiciously and only after careful consideration.
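As a rough sketch of the idea (illustrative only, not the PR's actual class; the name `PackedSpinlockSketch` and the 16-bit width are assumptions made here for the example), each bit of a shared atomic integer serves as an independent lock:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Illustrative sketch only: each bit of `bits` is an independent lock.
class PackedSpinlockSketch
{
    std::atomic<std::uint16_t>& bits_;
    std::uint16_t const mask_;

public:
    PackedSpinlockSketch(std::atomic<std::uint16_t>& bits, int index)
        : bits_(bits), mask_(static_cast<std::uint16_t>(1u << index))
    {
        assert(index >= 0 && index < 16);
    }

    bool
    try_lock()
    {
        // fetch_or atomically sets our bit and returns the previous value;
        // if our bit was already set, another holder owns this lock.
        return (bits_.fetch_or(mask_, std::memory_order_acquire) & mask_) == 0;
    }

    void
    unlock()
    {
        // fetch_and clears only our bit, leaving the other 15 locks alone.
        bits_.fetch_and(
            static_cast<std::uint16_t>(~mask_), std::memory_order_release);
    }
};
```

Two such objects built over the same atomic with different indices can be held concurrently; two with the same index contend for the same bit.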

High Level Overview of Change

Context of Change

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking change that only restructures code)
  • Tests (you added tests for code that already exists, or tests for your new feature are included in this PR)
  • Documentation Updates
  • Release

@scottschurr (Collaborator) left a comment:

I'm glad you found the bug and fixed it.

I audited the std::memory_order* stuff the best I could using information I found on the web. It looks like what you've done aligns with the best practices that I found. That said, using anything other than memory_order_seq_cst makes me nervous since I've seen world-renowned experts get this wrong. And, as far as I can tell, the "best practices" I found were not written by world-renowned experts. All I can do is hope you got it right.

I made some suggestions that I hope you'll consider. But I'm not confident enough about any of them to mandate a change. Let me know what you think. Thanks.

std::void_t<
decltype(std::declval<std::atomic<T>&>().fetch_and(0)),
decltype(std::declval<std::atomic<T>&>().fetch_or(0))>>>
class packed_spinlock
@scottschurr:
If we follow current naming conventions this class should be named PackedSpinlock, right? And the filename should be fixed to PackedSpinlock.h?

@nbougalis (Contributor, author):
I'm fine with this rename, although the capitalized types "grind" at my eyes a bit. Still, the code style is the code style.

src/ripple/basics/spinlock.h (resolved)
src/ripple/basics/spinlock.h (outdated; resolved)
std::is_unsigned_v<T> && std::atomic<T>::is_always_lock_free,
std::void_t<
decltype(std::declval<std::atomic<T>&>().fetch_and(0)),
decltype(std::declval<std::atomic<T>&>().fetch_or(0))>>>
@scottschurr:

See enable_if comments on packed_spinlock.

src/ripple/basics/spinlock.h (outdated; resolved)
src/ripple/basics/spinlock.h (outdated; resolved)
Comment on lines 93 to 94
while (!try_lock())
detail::spin_pause();
@scottschurr:
I'm a bit surprised that you switched the implementation to a pure busy wait and lost the yield that was in the previous implementation. The lack of a yield seems especially suspect since you've made the packed_spinlock a general facility rather than a tool used only in one specific place.

I'd expect the lock implementation to look more like this:

        // A spin lock is only likely to be helpful for a small number of tries.
        for (int i = 0; i < 100; ++i)
        {
            if (try_lock())
                return;
            detail::spin_pause();
        }

        // If the spinlock didn't work then yield.
        while (!try_lock())
        {
            std::this_thread::yield();
        }

If you don't include a yield in the lock it seems like you should at least include a comment with some measurements justifying why falling back to a yield is never appropriate.

@nbougalis (Contributor, author):
The primary justification was to simplify the codepath. There's also some weird (very low-level) interactions of spin_pause and comparisons that might actually serve to increase latency, which I was hoping to avoid.

I suspect that the yield isn't going to really help, but at the very least I ought to try and collect some additional metrics.

@HowardHinnant (Contributor):

> I audited the std::memory_order* stuff the best I could using information I found on the web. It looks like what you've done aligns with the best practices that I found. That said, using anything other than memory_order_seq_cst makes me nervous since I've seen world-renowned experts get this wrong.

Fwiw, I also reviewed this and will leave a similar comment. I believe you have it right, not because I know what I'm doing, but because I've seen similar uses elsewhere. This is tricky stuff.

@nbougalis (Contributor, author):

> I audited the std::memory_order* stuff the best I could using information I found on the web. It looks like what you've done aligns with the best practices that I found. That said, using anything other than memory_order_seq_cst makes me nervous since I've seen world-renowned experts get this wrong.

> Fwiw, I also reviewed this and will leave a similar comment. I believe you have it right, not because I know what I'm doing, but because I've seen similar uses elsewhere. This is tricky stuff.

I'm not a world-renowned expert in memory ordering (although I did read a lot to try and get this right). I would normally agree with you, but the intention here is to squeeze every last bit of performance out of the hardware, and I think that the memory ordering will have a (likely small) impact on that.

@HowardHinnant, @scottschurr: who would be a renowned expert that we could ask to review this code?
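For context on the ordering being discussed, a minimal sketch of the acquire/release pairing (illustrative only, not the PR's code; the names here are made up for the example): the acquire on lock acquisition synchronizes-with the release on unlock, so writes made inside the critical section are visible to the next lock holder.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> flag{false};  // simple test-and-set lock
int shared_counter = 0;         // protected by flag

void
locked_increment(int times)
{
    for (int i = 0; i < times; ++i)
    {
        // Acquire: once we win the exchange, the previous holder's
        // writes inside the critical section are visible to us.
        while (flag.exchange(true, std::memory_order_acquire))
            ;  // spin

        ++shared_counter;  // critical section

        // Release: publish our write to whoever acquires the lock next.
        flag.store(false, std::memory_order_release);
    }
}
```

With two threads each calling `locked_increment(10000)`, the counter reliably reaches 20000; with relaxed ordering on both operations, that guarantee disappears.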

@seelabs (Collaborator) left a comment:
👍 nice fix; @scottschurr had some good comments; I'm fine with however those are resolved.

 packed_spinlock(std::atomic<T>& lock, int index)
-    : bits_(lock), mask_(1 << index)
+    : bits_(lock), mask_(detail::masks<T>[index])
@seelabs:
I'm skeptical this makes a difference. I don't object, but I prefer the bitshift (unless you have a benchmark that shows this matters).

@nbougalis (Contributor, author) commented on Jun 16, 2022:
It doesn't really matter, but it is cool, isn't it? 😀

Collaborator:
I'm with @seelabs on this one. I think an explicit bit shift will make the code easier for the bulk of developers to read. Even though, yes, the constexpr stuff is both fun and cool.

@scottschurr (Collaborator):

Today I stumbled across this article on spinlocks from 2019:

https://probablydance.com/2019/12/30/measuring-mutexes-spinlocks-and-how-bad-the-linux-scheduler-really-is/

2019 is recent enough that the information is probably still relevant. That author appears to be knowledgeable. The timings seem useful and instructive. Notice in particular the implementation of spinlock. Personally, I like that implementation better than the one I suggested. And the author's timings back it up.

Just FYI.
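The article's central pattern is "test-and-test-and-set": spin on a cheap plain load, and only attempt the expensive atomic read-modify-write once the lock looks free. A hedged sketch of that pattern (a paraphrase of the idea, not the article's exact code):

```cpp
#include <atomic>
#include <cassert>

// Illustrative "test-and-test-and-set" lock; roughly the shape the
// article advocates, simplified for the example.
class TTASLock
{
    std::atomic<bool> locked_{false};

public:
    bool
    try_lock()
    {
        // First test: a relaxed load avoids hammering the cache line with
        // read-modify-write traffic while another thread holds the lock.
        return !locked_.load(std::memory_order_relaxed) &&
            !locked_.exchange(true, std::memory_order_acquire);
    }

    void
    lock()
    {
        while (!try_lock())
            ;  // a production version would pause and eventually yield here
    }

    void
    unlock()
    {
        locked_.store(false, std::memory_order_release);
    }
};
```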

@HowardHinnant (Contributor):

I've tried but failed to get an outside review on the memory ordering. I still believe it is correct.

@nbougalis (Contributor, author):

> Today I stumbled across this article on spinlocks from 2019:
>
> https://probablydance.com/2019/12/30/measuring-mutexes-spinlocks-and-how-bad-the-linux-scheduler-really-is/
>
> 2019 is recent enough that the information is probably still relevant. That author appears to be knowledgeable. The timings seem useful and instructive. Notice in particular the implementation of spinlock. Personally, I like that implementation better than the one I suggested. And the author's timings back it up.
>
> Just FYI.

This was a great article. The biggest takeaway is that std::mutex is, at least on Linux, ridiculously good, which is something I found out with my own testing. HOWEVER, I feel the need to point out that the spinlocks here aren't meant to be general-purpose locking primitives used in place of std::mutex and other std-provided primitives. The latest version of the code documents this pretty clearly, I thought:

Packed spinlocks allow for tremendously space-efficient lock-sharding but they come at a cost.

First, the implementation is necessarily low-level and uses advanced features like memory ordering and highly platform-specific tricks to maximize performance. This imposes a significant and ongoing cost to developers.

Second, and perhaps most important, is that the packing of multiple spinlocks into a single integer, while efficient, can result in poor performance, as it may translate to additional cache-coherency traffic between processors and place heavy load on a processor's load/store units, potentially affecting performance.

To be sure, these locks can have advantages, but they are not general-purpose locks and should not be thought of or used as such. The use cases for such a construct are likely few and far between, and in the absence of a compelling reason to use one, it's probably best to stay away from spinlocks and use one of the standard locking primitives.

So yes, they are meant for very specific situations, with a good example being the packed_spinlock used by SHAMapInnerNode to provide 16 individual locks, one per child node, without taking up any space at runtime, since the locks fit inside the structure's padding. A single std::mutex would require increasing the size of every instance of the structure by 40 bytes and the equivalent construct with one std::mutex per node would need 160 bytes.
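To make the space argument concrete, here is a hedged sketch (the struct names are hypothetical, and exact sizes vary by platform and standard library; the 40-byte mutex figure above is an observation about one platform, not a guarantee):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical layouts, for size comparison only.
struct NodeWithPackedLocks
{
    void* children[16];
    std::atomic<std::uint16_t> locks;  // 16 one-bit locks in two bytes,
                                       // which may fit in existing padding
};

struct NodeWithMutexes
{
    void* children[16];
    std::mutex locks[16];  // one full mutex per child
};
```

Under these assumed layouts, the packed variant adds at most two bytes plus padding per node, while the mutex variant grows by sizeof(std::mutex) for each of the 16 children.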

Beyond that, I should point out that my implementation does use the "test-and-test-and-set" mechanism described here (right down to using the right std::memory_order specifications!) although obviously I can't quite use the precise code since I want to allow for "packed" spinlocks.

Ultimately, if you feel strongly enough that this code should just be purely contained within SHAMapInnerNode and only used in that one place for that one very specific and niche case, I can do that.

@scottschurr (Collaborator):

FWIW, I wrote a smallish test to exercise packed_spinlock today. It fires up some number of threads on my 6 core (with hyperthreading) machine. All threads are contending for a single packed_spinlock. As long as I limit the test to use at most seven threads it runs and shuts down fine (for ten attempts).

However when I use eight threads about half of the time the test hangs when it attempts to join the threads. The hang suggests to me that this implementation of packed_spinlock is not ready for prime time. But maybe I have an unrealistic expectation of a spin lock?

Ultimately I'd like to see this test (or something like it) run for 24 consecutive hours and shutdown with no problems before we call the packed_spinlock implementation adequately exercised.

The test is kind of big (and hacked together). But I'm pasting it here for completeness' sake.

int
spinlock_test()
{
    // We need something that gets messed up when it is torn, but is still
    // relatively easy to compute.  We'll use an array of 64 bools that
    // represent an unsigned 64-bit value.  That should get thoroughly hosed
    // if two threads cut loose in there at the same time.
    std::array<bool, 64> total{};

    // Each of the threads keeps track of how many times it has incremented
    // total.  If we sum these values we should get the same value that is
    // represented by total.
    constexpr int threadCount = 8;
    std::array<std::uint64_t, threadCount> locals{};
    std::cout << "spinlock test with " << threadCount << " threads"<< std::endl;

    // Use stop to signal the threads to stop running.
    std::atomic<bool> stop = false;

    // Store the threads in a vector so they are easy to join.
    std::vector<std::thread> threads;
    threads.reserve(threadCount);

    // The lock that all of the threads will share.
    std::atomic<std::uint16_t> lock;

    // Start the threads.  Each thread has its own index.
    for (int localsIndex = 0; localsIndex < threadCount; ++localsIndex)
    {
        threads.emplace_back([&total, &locals, &lock, &stop, localsIndex] () {

            // The general usage model seems to be
            //  1. create the spinlock
            //  2. lock the spinlock
            //  3. release the spinlock
            //  4. destroy the spinlock
            //  5. repeat.
            // So we'll use that same model.
            while (!stop)
            {
                // Create and acquire the spinlock.
                static constexpr int lockIndex = 13;
                ripple::packed_spinlock sl(lock, lockIndex);
                std::lock_guard guard(sl);

                // Read out total and make sure it matches the sum of locals.
                std::uint64_t t = 0;
                for (unsigned int i = 0; i < total.size(); ++i)
                {
                    if (total[i])
                        t |= std::uint64_t{1} << i;
                }

                std::uint64_t l = 0;
                for (std::uint64_t const& local : locals)
                    l += local;

                // Total and locals don't agree.  Probably a locking problem.
                if (t != l)
                {
                    stop = true;
                    std::cout << "Thread " << localsIndex
                        << " detects problem.  Total: " << t
                        << "; Locals: " << l << std::endl;
                }
                else
                {
                    // Advance our local and write out the new total.
                    locals[localsIndex] += 1;

                    ++t;
                    for (unsigned int i = 0; i < total.size(); ++i)
                        total[i] = (t & (std::uint64_t{1} << i)) != 0;
                }
                // Leaving scope releases the lock.
            }
        });
    }

    // Let the test run for some time and then kill it.
    using namespace std::chrono_literals;
    auto const stopTime = std::chrono::steady_clock::now() + 30s;

    while (!stop && std::chrono::steady_clock::now() < stopTime)
    {
        std::this_thread::sleep_for(1s);
    }
    stop = true;

    // Join all the threads.
    std::cout << "Joining threads" << std::endl;
    for (std::thread& thread : threads)
    {
        if (thread.joinable())
            thread.join();
    }

    std::uint64_t t = 0;
    for (unsigned int i = 0; i < total.size(); ++i)
    {
        if (total[i])
            t |= std::uint64_t{1} << i;
    }

    std::uint64_t l = 0;
    for (std::uint64_t const& local : locals)
        l += local;

    std::cout << "Total: " << t
        << "; Locals: " << l << std::endl;

    return t == l ? 0 : 1;
}

@scottschurr (Collaborator) left a comment:
I left a few more notes for consideration. But I think the bigger question is about the test sometimes hanging on shutdown with 8 threads.


/** Array representing all possible N-bit values with a single bit set. */
template <class T>
constexpr inline auto const masks = [] constexpr
@scottschurr:
FWIW, clang tells me that it wants a () before the trailing constexpr declaration. But lambdas are implicitly constexpr when they can be. So I think the trailing constexpr is not necessary.

Comment on lines 53 to 64
/** Array representing all possible N-bit values with a single bit set. */
template <class T>
constexpr inline auto const masks = [] constexpr
{
std::array<T, sizeof(T) * 8> ret{1};

for (std::size_t i = 1; i != ret.size(); ++i)
ret[i] = ret[i - 1] << 1;

return ret;
}
();
@scottschurr:
The masks array is in a detail namespace, which is good. But, as far as I can tell, masks is really a detail that belongs only to the packed_spinlock class. So I would prefer to see masks as a static constexpr member of packed_spinlock. It might look like this...

private:
    std::atomic<T>& bits_;
    T mask_;

    // Create a compile-time array of valid masks for the spinlock.
    static constexpr auto const masks_ = []()
    {
        std::array<T, sizeof(T) * 8> ret{1};

        for (std::size_t i = 1; i != ret.size(); ++i)
            ret[i] = ret[i - 1] << 1;

        return ret;
    }();

Then the packed_spinlock constructor would look like this:

    packed_spinlock(std::atomic<T>& lock, int index)
        : bits_(lock), mask_(masks_[index])
    {
        assert(index >= 0 && (mask_ != 0));
    }


private:
std::atomic<T>& bits_;
T mask_;
@scottschurr:
I'm guessing that your intention is that the packed_spinlock is immovable? I can't currently see any reason for moving it, but I could be missing something.

At any rate, if it is okay for packed_spinlock to be immovable, then consider making T const mask_. It may improve some of the generated code if the optimizer knows the value of mask_ can't change.


@scottschurr (Collaborator):

Well, there was a bug in the test that I pasted above. I had misremembered that an std::atomic<std::uint16_t> would initialize itself with the default value. It doesn't. Thankfully @nbougalis pointed that bug out to me. With that fix, my test no longer fails joining with the proposed spinlock.

I understand that @nbougalis is running more extensive tests now. Thanks!

There are a few other comments I have left on this pull request, and those still stand. But the primary thing I was concerned about is now resolved. Sorry for writing a buggy test, and thanks for figuring out the problem!

The existing spinlock code, used to protect SHAMapInnerNode
child lists, has a mistake that can allow the same child to
be repeatedly locked under some circumstances.

The bug was in the `SpinBitLock::lock` loop condition check
and would result in the loop terminating early.

This commit fixes this and further simplifies the lock loop
making the correctness of the code easier to verify without
sacrificing performance.

It also promotes the spinlock class from an implementation
detail to a more general purpose, easier to use lock class
with clearer semantics. Two different lock types now allow
developers to easily grab either a single spinlock from a
group of spinlocks (packed in an unsigned integer), or to
grab all of the spinlocks at once.

While this commit makes spinlocks more widely available to
developers, they are rarely the best tool for the job. Use
them judiciously and only after careful consideration.
@scottschurr (Collaborator):

Looks like all of my comments have been addressed other than the naming issues. For naming compatibility with this code base the new classes should probably be named PackedSpinlock and Spinlock. And the new file should probably be named Spinlock.h.

@nbougalis (Contributor, author):

Merged as 7e46f53

@nbougalis nbougalis closed this Jul 19, 2022
@nbougalis nbougalis deleted the spinlock branch October 16, 2023 06:03