Put a ring on it: a lock-free MPMC ring buffer
Posted by signa11 21 hours ago
Comments
Comment by atq2119 20 hours ago
The terms relaxed, acquire, and release refer to how an atomic operation is ordered against other accesses to memory.
Counter to what the article states, a relaxed atomic is still atomic, meaning that it cannot tear and, for RMW atomics, no other access can go between the read and the write. But a relaxed atomic does not order other accesses, which can lead to unintuitive outcomes.
By contrast, once you've observed another thread's release store with an acquire load, you're guaranteed that your subsequent memory accesses "happen after" all of the other thread's accesses from before that release store. That's what you'd intuitively expect; it's just that in modern systems (which are really highly distributed systems, even on a single chip) there's a cost to establishing this kind of guarantee, which is why you can opt out of it with relaxed atomics if you know what you're doing.
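A minimal C11 sketch of that pairing (`data` and `ready` are invented names, not from the article):

```c
#include <stdatomic.h>
#include <stdbool.h>

static int data;                    /* plain, non-atomic */
static _Atomic bool ready = false;

void producer(void) {
    data = 42;                      /* ordinary store */
    /* release: everything above becomes visible to whoever
     * acquires `ready` and sees true */
    atomic_store_explicit(&ready, true, memory_order_release);
}

int consumer(void) {
    /* acquire: once we observe true, the store to `data`
     * happens-before everything below */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                           /* spin */
    return data;                    /* guaranteed to read 42 */
}
```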
Comment by viega 20 hours ago
Comment by withinboredom 19 hours ago
Comment by herodoturtle 18 hours ago
<3
Comment by Syzygies 17 hours ago
https://en.wikipedia.org/wiki/False_sharing
Rather than incrementing each counter by one, why not dither the counters to reduce cache conflicts? So what if the dequeue becomes a bit fuzzy; make the queue a bit longer, so everyone survives at least as long as they would have before.
Or simply use a prime-length queue and change what +1 means, so the stride is longer than the cache-conflict window. For a prime length, any stride generates the full cyclic group.
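As a hedged sketch of that prime-stride idea (constants invented; 2^32 ticket wrap-around handling elided):

```c
#define RING_LEN 127u  /* prime */
#define STRIDE   9u    /* chosen to span more than one cache line of slots */

/* With a prime length, any stride from 1 to RING_LEN-1 generates the
 * full cyclic group: every slot is visited once before any repeats,
 * and consecutive tickets land in well-separated slots. */
static inline unsigned slot_for_ticket(unsigned ticket) {
    return (ticket * STRIDE) % RING_LEN;
}
```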
Comment by gpderetta 17 hours ago
edit: the author claims the algorithm is linearizable, so I'll keep reading the article.
edit2: Right, the writer acquires a ticket and, when done, attempts to set the done flag on the node. The reader acquires a ticket and attempts to consume the node; on failure the writer will retry. Even with one producer and one consumer, couldn't you, theoretically, end up in a scenario where the writer never makes progress? I guess since technically no node was produced, this is still lock-free, but still... ...oh, I should have read further: the single-writer scenario (even in the dynamic case) is special-cased! It will still produce the node.
Nice, but it seems exceedingly complex.
Comment by loeg 16 hours ago
It looks like both enqueue and dequeue are RMWing the "epochs" variable? This can pretty easily lead to something like O(N^2) runtime if N processors are banging this cache line at the same time.
Comment by viega 14 hours ago
The epoch isn’t CAS’d; it’s FAA’d. The epoch is then used to determine whether contention is due to the tail meeting the head, or to a wrap-around caused by slow writes.
There’s also a back-off scheme to ease contention for a full queue.
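Roughly this shape, as a sketch rather than the article’s actual code (names invented):

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t epochs;

/* One FAA, no retry loop: every caller gets a unique epoch/ticket. */
uint64_t acquire_ticket(void) {
    return atomic_fetch_add_explicit(&epochs, 1, memory_order_acq_rel);
}

/* Truncated exponential back-off to ease contention on a full ring. */
void backoff(unsigned *delay) {
    for (volatile unsigned i = 0; i < *delay; i++)
        ;
    if (*delay < (1u << 16))
        *delay <<= 1;
}
```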
Though, I did originally have a variant that adds a fairly straightforward ‘help’ mechanism that makes the algorithm wait-free and reduces the computational complexity.
However, the extra overhead didn’t seem worth it, so I took it out pretty quickly. IIRC, the only cases where the ring wouldn’t outperform the wait-free version are tiny queues with a huge write imbalance.
If you run the tests in the repo associated with the article, you’ll probably see that a ring with only 16 entries tends to start being non-performant at about a 4:1 writer-to-reader ratio. But IIRC that effect goes away before 128 slots in the ring.
There, the ring still fits in a page, and even with a big imbalance I can’t remember seeing less than 1M ops per second on my Mac laptop.
Real world observational data beats worst case analysis, and I’ve never seen an issue for scenarios I consider reasonable.
But if my "unrealistic" is realistic to you, let me know and I can break out the wait-free version for you.
Comment by loeg 13 hours ago
fetch_add requires taking a cache line exclusive. If producers and consumers are both doing this on the same cache line, it is very expensive and does not scale.
(withinboredom and gpderetta have raised the same or similar concerns.)
Comment by qbane 20 hours ago
[0]: https://h4x0r.org/futex/ (discussion: https://news.ycombinator.com/item?id=44951563)
Comment by bitexploder 17 hours ago
Comment by xavierxwang 20 hours ago
Maybe that is what you want.
Comment by qbane 17 hours ago
In the JS ecosystem, buffers that allow data loss (aka ring buffers) are more common, but ringbuf.js [1] is the only complete implementation to my knowledge. In my use case, I/O between WASM modules where data must be transferred as-is, the process must block on buffer overrun/underrun in a synchronous manner. Therefore a circular buffer is required. I could not find such a niche library written in JS, so I decided to bite the bullet and reinvent the wheel [2].
Comment by xavierxwang 13 hours ago
Btw, kaze-core uses a `used` atomic variable to avoid reading both readPos/writePos in its routines; the position variables themselves are not atomic at all.
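The shape of that idea, sketched in C11 rather than kaze-core's actual code (all names invented):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CAP 1024

/* A single atomic `used` counter tracks occupancy, so neither side
 * ever reads the other side's (non-atomic) position variable. SPSC only. */
typedef struct {
    int buf[CAP];
    size_t read_pos;        /* consumer-private */
    size_t write_pos;       /* producer-private */
    _Atomic size_t used;    /* the only shared state */
} spsc_ring;

bool push(spsc_ring *r, int v) {
    if (atomic_load_explicit(&r->used, memory_order_acquire) == CAP)
        return false;                       /* full */
    r->buf[r->write_pos] = v;
    r->write_pos = (r->write_pos + 1) % CAP;
    /* release: publishes the slot write along with the new count */
    atomic_fetch_add_explicit(&r->used, 1, memory_order_release);
    return true;
}

bool pop(spsc_ring *r, int *out) {
    if (atomic_load_explicit(&r->used, memory_order_acquire) == 0)
        return false;                       /* empty */
    *out = r->buf[r->read_pos];
    r->read_pos = (r->read_pos + 1) % CAP;
    /* release: tells the producer this slot may be overwritten */
    atomic_fetch_sub_explicit(&r->used, 1, memory_order_release);
    return true;
}
```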
Comment by drob518 18 hours ago
Comment by WeaselNo7 21 hours ago
Comment by viega 20 hours ago
Comment by atq2119 20 hours ago
Comment by bob1029 20 hours ago
Virtually zero? I have to go out of my way to remind HN it exists while everyone is 300+ comments deep into reinventing the wheel on a ~quarterly basis.
Comment by dpc_01234 15 hours ago
> have to go out of my way
Yeah, that's exactly the annoying part. Can't ever mention a ring buffer without someone bringing up LMAX. "But did you know that some Java developers somewhere once wrote something that didn't completely ruin the hardware performance?! It's stunning and amazing."
Comment by bob1029 13 hours ago
Comment by CyberDildonics 10 hours ago
Comment by withinboredom 21 hours ago
However, this is well written and very easy to read.
Comment by viega 21 hours ago
However, I definitely had not seen the paper you cited, but I just spent a few minutes with it. Section 5.2 seems to cover their version. It's far from clear what their algorithm is, but they are talking about page eviction; it doesn't seem like they're using even a single fixed array for the ring, though I'm not 100% sure because the paper is pretty light on detail.
Comment by jcalvinowens 2 hours ago
Comment by binarycoffee 17 hours ago
Comment by nly 14 hours ago
There are definitely designs that can deal with non-POD data of variable size, although that does imply heterogeneous types, which will need some sort of type erasure to destroy safely.
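As a sketch of what that erasure could look like in C (struct and names invented):

```c
#include <stdlib.h>
#include <string.h>

/* Each entry carries its own destructor, so the queue can dispose of
 * heterogeneous payloads without knowing their concrete types. */
typedef struct {
    void *payload;
    void (*destroy)(void *);
} erased_item;

static void destroy_string(void *p) { free(p); }

/* e.g. enqueue(&q, (erased_item){ strdup("hello"), destroy_string });
 * and on teardown: item.destroy(item.payload); */
```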
Comment by auxym 20 hours ago
Which is based on: https://ieeexplore.ieee.org/document/9490347
Comment by JonChesterfield 20 hours ago
Comment by elcritch 18 hours ago
Comment by JonChesterfield 12 hours ago
Comment by viega 20 hours ago
Comment by nly 16 hours ago
Even though SPSC ring buffers are simple, they are not particularly efficient when the consumer is keeping up with the producer (the ideal case for a queue), due to all the cacheline ping-pong. They are not the lowest-latency solution.
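For illustration, one common mitigation is index caching; a generic sketch (not any particular library's design):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    _Alignas(64) _Atomic size_t head;  /* advanced by the consumer */
    _Alignas(64) _Atomic size_t tail;  /* advanced by the producer */
    _Alignas(64) size_t cached_head;   /* producer-private snapshot */
} spsc_idx;

/* `tail` here is the producer's private count of items pushed so far. */
static bool producer_has_room(spsc_idx *r, size_t tail, size_t cap) {
    if (tail - r->cached_head < cap)
        return true;  /* served from the cached copy: no shared traffic */
    /* Only on apparent fullness do we touch the consumer's line;
     * this load is exactly the ping-pong being complained about. */
    r->cached_head = atomic_load_explicit(&r->head, memory_order_acquire);
    return tail - r->cached_head < cap;
}
```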
Comment by travisa 8 hours ago
Comment by nly 1 hour ago
Comment by xavierxwang 20 hours ago
Comment by viega 20 hours ago
Comment by viega 20 hours ago
Comment by j_seigh 16 hours ago
Comment by gpderetta 16 hours ago
A CAS implemented with LL/SC (ARM, POWER) is weak, as LL/SC can spuriously fail, so it always needs to be retried in a loop. Such a weak CAS might only be lock-free, not wait-free, as it might not provide global progress guarantees; in practice some platforms give stronger progress guarantees, as they might convert an LL/SC to a strong CAS via idiom recognition.
A strong CAS (x86, SPARC I think) is implemented directly in the architecture and cannot fail spuriously. It also usually gives strong fairness guarantees.
If your algorithm needs to CAS in a loop anyway, you might as well use a weak CAS and avoid a loop-of-loops. Otherwise, a strong CAS might generate better code on some architectures.
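For example (C11; the saturating counter is just an invented use case):

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t counter;

/* The loop already retries on contention, so a spurious weak-CAS
 * failure just costs one extra iteration; no need for the strong form. */
void add_saturating(uint64_t n, uint64_t limit) {
    uint64_t cur = atomic_load_explicit(&counter, memory_order_relaxed);
    uint64_t next;
    do {
        next = (cur + n > limit) ? limit : cur + n;
    } while (!atomic_compare_exchange_weak_explicit(
                 &counter, &cur, next,
                 memory_order_acq_rel, memory_order_relaxed));
}
```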
> 32 bits is not enough space for the epoch if we are building something general-purpose.
Note that as long as your buffer can contain fewer than 2^31 items, 32 bits is usually enough (that's how TCP works, for example): even after overflow you can still order "before" and "after", unless you can have stale in-flight messages from more than one wrap-around ago.
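The comparison trick, sketched (RFC 1982-style serial arithmetic, as TCP sequence numbers use):

```c
#include <stdbool.h>
#include <stdint.h>

/* Valid as long as live epochs never span more than half the 32-bit
 * space, i.e. the 2^31 limit mentioned above. */
static bool epoch_before(uint32_t a, uint32_t b) {
    return (int32_t)(a - b) < 0;  /* still correct across wrap-around */
}
```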
> However, the num_cells field and last_slot field are not tagged _Atomic. That’s because these should be set by one thread during initialization, and then never changed. As long as the memory has synced before other threads start to use these fields, we definitely don’t need them to be treated specially. Usually, if we do initialization in a proper function, the call boundary is going to be a memory barrier that makes sure they’re sync’d when other threads start getting a handle on our ring.
Your threading library likely guarantees that anything sequenced before the start of your thread happens-before the first instruction of the new thread is executed. So you do not need explicit memory barriers. In any case, a function call is at best a compiler barrier, not a full barrier as required on many architectures.
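For example, with POSIX threads (the struct is a stand-in for the article's ring, not its actual definition):

```c
#include <pthread.h>
#include <stddef.h>

struct ring { unsigned num_cells; unsigned last_slot; };

static void *worker(void *arg) {
    struct ring *r = arg;
    /* Safe without atomics: pthread_create() synchronizes memory, so
     * the initialization below happens-before anything this thread does. */
    (void)r->num_cells;
    (void)r->last_slot;
    return NULL;
}

int start(struct ring *r) {
    r->num_cells = 1024;  /* plain stores, sequenced before the create */
    r->last_slot = 1023;
    pthread_t t;
    return pthread_create(&t, NULL, worker, r);
}
```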
[sorry, I wasn't really going to do a review, these were my notes when reading the algo].
The algo is quite interesting, a lot of corner cases covered. The biggest issue is that the ticketing system is a single memory location where all producers and consumers attempt to write, so it is never going to scale.
If you really really need a lock-free MPMC queue that guarantees a total order, then it can be useful, but outside some hard-realtime scenarios, I can't really see the benefits. Wouldn't a collection of SPSC queues work for the logging scenario given in the introduction?
Comment by tialaramex 10 hours ago
Without immersion in the "Why" of each technological niche it can be hard to judge whether you're reading advice that really hasn't been relevant in decades ("The ASCII character set is not supported on all computers") or that's still important to your work today ("The file naming conventions may vary from one system to another")
Comment by gpderetta 1 hour ago
Good question, actually! ARM64 has a proper CAS it seems, but I think smaller ARMs still only have LL/SC, and that's still relevant for embedded. I can't find a definitive answer for POWER; it's possible that it still only has LL/SC (I leave it to you whether POWER is still relevant, but certainly IBM cares that the standard supports it, and it was still relevant during C++11 standardization). Can't find a definite answer for RISC-V either: it seems that originally it only had LL/SC, but there are AMO extensions that add it.
I think most GPUs have native CAS instructions.
There are probably other embedded processors that are still relevant for C++, who knows what they support.