Spinlocks vs. Mutexes: When to Spin and When to Sleep
Posted by birdculture 3 days ago
Comments
Comment by charleslmunger 3 days ago
If your critical sections are that short then you can use a hybrid mutex and never actually park. And if you turn out to be wrong about how long things take, the fallback will save you.
>alignas(64) in C++
std::hardware_destructive_interference_size
Exists so you don't have to guess, although in practice it'll basically always be 64.
The code samples also don't follow the basic best practices for spinlocks on x86_64 or arm64. Spinlocks should perform a relaxed read in the loop, and only attempt a compare-and-set with acquire ordering if that first check shows the lock is unowned. This avoids hammering the CPU with cache coherency traffic.
Similarly, the x86 PAUSE instruction isn't mentioned, even though it exists specifically to signal spin sections to the CPU.
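A minimal sketch of that pattern (test-and-test-and-set with a pause hint), assuming C++17; the Spinlock class and its names are illustrative, not taken from the article's samples:

    #include <atomic>
    #include <new>          // std::hardware_destructive_interference_size
    #if defined(__x86_64__) || defined(_M_X64)
    #include <immintrin.h>  // _mm_pause
    #endif

    // Keep the lock on its own cache line to avoid false sharing.
    struct alignas(std::hardware_destructive_interference_size) Spinlock {
        std::atomic<bool> locked{false};

        void lock() {
            for (;;) {
                // Cheap relaxed read first: it stays in the local cache and
                // generates no coherency traffic while the lock is held.
                if (!locked.load(std::memory_order_relaxed) &&
                    !locked.exchange(true, std::memory_order_acquire))
                    return;  // test-and-set took the lock with acquire order
    #if defined(__x86_64__) || defined(_M_X64)
                _mm_pause();  // PAUSE: signals a spin-wait section to the CPU
    #endif
            }
        }

        void unlock() {
            locked.store(false, std::memory_order_release);
        }
    };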
Spinlocks outside the kernel are a bad idea in almost all cases, except in dedicated non-preemptible cases; use a hybrid mutex. Spinning for consumer threads can make sense in specialized exclusive thread-per-core setups where you want to minimize wakeup costs, but that's not the same as a spinlock, which would cause any contending thread to spin.
Comment by raggi 3 days ago
Very much this. Spins benchmark well but scale poorly.
Comment by magicalhippo 3 days ago
Yeah, pure spinlocks in user-space programs are a big no-no in my book. With a hybrid mutex, if you're on the happy path it costs you nothing extra in terms of performance, and if you for some reason slide off the happy path you have a sensible fallback.
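One way to picture the happy path / fallback split, as a rough sketch (the spin budget of 100 is an arbitrary illustrative value, and a production hybrid lock would also do a relaxed read before each attempt, per charleslmunger's point above):

    #include <mutex>

    class HybridMutex {
        std::mutex m_;
    public:
        void lock() {
            // Happy path: spin briefly, hoping the holder releases soon.
            for (int i = 0; i < 100; ++i) {
                if (m_.try_lock()) return;
            }
            // Fallback: park in the kernel instead of burning CPU.
            m_.lock();
        }
        void unlock() { m_.unlock(); }
    };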
Comment by charleshn 3 days ago
Unfortunately it's not quite true, due to e.g. spatial prefetching [0]. See e.g. Folly's definition [1].
[0] https://community.intel.com/t5/Intel-Moderncode-for-Parallel...
[1] https://github.com/facebook/folly/blob/d2e6fe65dfd6b30a9d504...
Comment by menaerus 2 days ago
Comment by surajrmal 3 days ago
Comment by menaerus 3 days ago
glibc pthread mutex uses a user-space spinlock to mitigate the syscall cost for uncontended cases.
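For illustration, glibc also exposes an explicitly adaptive type; a hedged sketch (PTHREAD_MUTEX_ADAPTIVE_NP is a non-portable glibc extension, and the default type already takes a user-space atomic fast path, with no syscall, when uncontended):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE 1  // needed for PTHREAD_MUTEX_ADAPTIVE_NP
    #endif
    #include <pthread.h>

    pthread_mutex_t mutex;

    void init_adaptive_mutex() {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        // Spin a bounded number of times in user space before parking.
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
        pthread_mutex_init(&mutex, &attr);
        pthread_mutexattr_destroy(&attr);
    }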
Comment by charleslmunger 3 days ago
Comment by surajrmal 19 hours ago
Comment by imtringued 2 days ago
Comment by surajrmal 19 hours ago
Spinning on a CAS is far more expensive than spinning on most other instructions, because it affects all cores that may try to access that cache line, which may include things other than the lock itself.
Also consider how the system behaves under high CPU load. You will end up with threads that hold locks while not running, so the majority of the time you miss the lock you spin all 100 times. This just exacerbates the CPU load issue even more. Hybrid locks are only helpful under lower CPU load.
Comment by nly 3 days ago
Comment by surajrmal 1 day ago
Comment by saagarjha 3 days ago
Of course, this is just the number the compiler thinks is good. It’s not necessarily the number that is actually good for your target machine.
Comment by nly 3 days ago
Most people using spinlocks really care about latency, and many will have hyperthreading disabled to reduce jitter.
Comment by SkiFire13 2 days ago
Comment by menaerus 2 days ago
~10 years ago, on Haswell, PAUSE took ~9 cycles to retire, and from Skylake onward, with some exceptions, it takes an order of magnitude more: ~140 cycles.
These numbers alone suggest that it really interferes with the CPU pipeline, perhaps with branch prediction (?) or speculative execution (?) or both, such that it basically forces the CPU to flush the whole pipeline. That is at least how I read this. I will remember this as a "damage control" instruction from now on.
Comment by nly 1 day ago
Lfence is the better choice these days.
Comment by haileys 3 days ago
Your code will look great in your synthetic benchmarks and then it will end up burning CPU for no good reason in the real world.
Comment by nly 3 days ago
Comment by imtringued 2 days ago
What I'm trying to express here is that the spinlock isn't some special tool that you pull out of the toolbox to make something faster and call it a day.
It's like a cryogenic superconductor that requires extreme caution to use properly. It's something you avoid because it's a pain in the ass.
Comment by gpderetta 2 days ago
Comment by bob1029 2 days ago
If you adjust the multimedia timer to its highest resolution (1 ms on Windows), sleeping is still a non-starter. Even if the sleep were magically 0 ms whenever needed, you still run the risk of a context switch wrecking your cache and jacking up memory bandwidth utilization.
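The Windows knob referenced here is the WinMM timer API; a small hedged sketch (link against winmm.lib):

    #include <windows.h>  // timeBeginPeriod / timeEndPeriod

    int main() {
        timeBeginPeriod(1);  // request ~1 ms scheduler timer resolution
        Sleep(1);            // now wakes after ~1 ms, not the default ~15.6 ms
        timeEndPeriod(1);    // restore the previous resolution
        return 0;
    }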
Comment by masklinn 2 days ago
Comment by gebdev 3 days ago
Notably the claim about how atomic operations clear the cache line in every CPU. Wow! Shared data can really be a performance limitation.
Comment by EdSchouten 3 days ago
Comment by senderista 3 days ago
Comment by nly 3 days ago
Comment by senderista 2 days ago
Comment by baobun 3 days ago
Comment by markisus 3 days ago
Comment by raggi 3 days ago
It also depends which lock free solutions you're evaluating.
Some are effectively higher-order spins (with similar high-level problems); others have different secondary costs (such as copying). A common overlap is the inter-core, inter-thread, and inter-package side effects of synchronization points: for a lot of designs with a strong atomic in the middle, that means sync instruction costs, pipeline impacts of barriers/fences, etc.
Comment by bluecalm 3 days ago
Comment by adrian_b 3 days ago
Lock-free algorithms for read-only access to shared data structures seldom have disadvantages (only when the shared data structure is modified so frequently by writers that readers never succeed in reading it between changes), so for read-only access they are typically the best choice.
On the other hand, lock-free algorithms for read/write access to shared data structures must be used with great caution, because they frequently have a higher cost than mutual exclusion. Such lock-free algorithms are based on the optimistic assumption that your thread will complete the access before the shared structure is touched by a competing thread. Whenever this assumption fails (which will happen under high contention), the transaction must be retried, which leads to much more wasted work than the work wasted in a spinlock.
Lock-free algorithms for read/write access are normally preferable only when it is certain that contention for the shared resource is low, but in that case a spinlock also wastes negligible time.
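A hedged sketch of the optimistic retry pattern being described, with an atomic counter standing in for the shared structure:

    #include <atomic>

    void optimistic_increment(std::atomic<long>& counter) {
        long seen = counter.load(std::memory_order_relaxed);
        // compare_exchange_weak reloads `seen` on failure, so each retry
        // recomputes the update; under high contention every failed CAS
        // is wasted work, which is exactly the cost discussed above.
        while (!counter.compare_exchange_weak(seen, seen + 1,
                                              std::memory_order_relaxed)) {
            // another writer won the race; loop and try again
        }
    }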
The term "lock-free" is properly applied only to the access methods based on optimistic access instead of mutual exclusion (which uses locks).
However, there is a third kind of access methods, which use neither optimistic access nor mutual exclusion with locks, so some authors may conflate such methods together with the lock-free algorithms based on optimistic access.
This third kind of access method has properties very different from the other two, so it is better considered separately. It is based on partitioning the shared data structure between the threads that access it concurrently, so it is applicable only to certain kinds of data structures, mainly arrays and queues.
Nevertheless, almost any kind of inter-thread communication can be reorganized around message queues and shared buffers, so most applications that use either mutual exclusion with locks or lock-free optimistic access can be rewritten to use concurrent access to dynamically partitioned shared data structures. There the access is deterministic, unlike with lock-free optimistic algorithms, and there are no explicit locks; the partitioning itself must be done with atomic instructions (usually atomic fetch-and-add), which contain implicit locks, but those are equivalent to extremely short locked critical sections.
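A minimal sketch of the partitioning idea, assuming a shared array where threads claim disjoint slots with a single fetch-and-add (names are illustrative):

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        constexpr int kSlots = 1000;
        std::vector<int> results(kSlots);
        std::atomic<int> next_slot{0};

        auto worker = [&](int id) {
            for (;;) {
                // fetch_add hands each thread a unique index, so no two
                // threads ever touch the same slot; this is the only
                // synchronization point in the whole loop.
                int i = next_slot.fetch_add(1, std::memory_order_relaxed);
                if (i >= kSlots) return;
                results[i] = id;  // exclusive access to results[i]
            }
        };

        std::thread t1(worker, 1), t2(worker, 2);
        t1.join();
        t2.join();
        std::printf("filled %d slots\n", kSlots);
        return 0;
    }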
Comment by gpderetta 2 days ago
Scalability is always going to be poor when writers attempt to modify the same object, no matter the solution you implement.
[1] of course you could imagine a lock-free algorithm where reads actually mutate, but that's not common.
Comment by menaerus 2 days ago
MVCC.
Comment by staticfloat 3 days ago
Comment by Tom1380 3 days ago
Comment by bitexploder 3 days ago
Comment by jcalvinowens 2 days ago
That's not accurate: the scalability improvements in Linux are a result of broadly eliminating serialization, not something as trivial as using a different locking primitive. The BKL didn't go away until 2.6.37! As much as "spinlock madness" might make a nice little story, it's just simply not true.