Jepsen: NATS 2.12.1
Posted by aphyr 1 day ago
Comments
Comment by stmw 1 day ago
Comment by PeterCorless 1 day ago
> 2. Delayed Sync Mode (Default)
> In the default mode, writes are batched and marked with needSync = true for later synchronization filestore.go:7093-7097 . The actual sync happens during the next syncBlocks() execution.
However, if you read DeepWiki's conclusion, it is far more optimistic than what Aphyr uncovered in real-world testing.
> Durability Guarantees
> Even with delayed fsyncs, NATS provides protection against data loss through:
> 1. Write-Ahead Logging: Messages are written to log files before being acknowledged
> 2. Periodic Sync: The sync timer ensures data is eventually flushed to disk
> 3. State Snapshots: Full state is periodically written to index.db files filestore.go:9834-9850
> 4. Error Handling: If sync operations fail, NATS attempts to rebuild state from existing data filestore.go:7066-7072
https://deepwiki.com/search/will-nats-lose-uncommitted-wri_b...
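To make the failure window concrete, here is a generic Go sketch of the delayed-sync pattern the quote describes (illustrative only, not the actual NATS filestore code): writes return, and can be acknowledged, as soon as the data reaches the OS page cache; only a background loop makes them durable, so a crash between the two loses acknowledged writes.

    // Generic delayed-sync store, for illustration only (not the NATS filestore).
    package lazystore

    import (
    	"os"
    	"sync"
    	"time"
    )

    type lazyStore struct {
    	mu       sync.Mutex
    	f        *os.File
    	needSync bool // set on write, cleared by the sync loop
    }

    // Write returns (and the caller may ack) before the data is durable.
    func (s *lazyStore) Write(p []byte) error {
    	s.mu.Lock()
    	defer s.mu.Unlock()
    	if _, err := s.f.Write(p); err != nil { // reaches the OS page cache only
    		return err
    	}
    	s.needSync = true
    	return nil // data is NOT yet on disk
    }

    // syncLoop is the only thing that makes writes crash-safe; everything
    // written since the last successful Sync is the potential loss window.
    func (s *lazyStore) syncLoop(interval time.Duration) {
    	for {
    		time.Sleep(interval)
    		s.mu.Lock()
    		if s.needSync {
    			if err := s.f.Sync(); err == nil {
    				s.needSync = false
    			}
    		}
    		s.mu.Unlock()
    	}
    }

With a two-minute interval, everything acknowledged since the last sync can vanish on a power loss or kernel panic.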
Comment by traceroute66 22 hours ago
Well, it's an LLM ... of course it's going to be optimistic. ;-)
Comment by PeterCorless 16 hours ago
Comment by 63stack 21 hours ago
Comment by staticassertion 17 hours ago
What you draw from that seems entirely up to you. They don't seem to be making any claims or implying anything by doing so, just showing the result.
Comment by PeterCorless 16 hours ago
Comment by asa400 6 hours ago
Comment by awesome_dude 1 day ago
People always think "theory is overrated" or "hacking is better than having a school education"
And then proceed to shoot themselves in the foot with "workarounds" that break well-known, well-documented, well-traversed problem spaces.
Comment by whimsicalism 1 day ago
Comment by johncolanduoni 1 day ago
Comment by ownagefool 23 hours ago
I was engaged after one of the world's biggest data leaks. The Security org was hyper-worried about the cloud environment, which was in its infancy, despite the fact that their data leak was from an on-prem mainframe-style system and they hadn't really improved their posture in any significant way despite spending £40m.
As an aside, I use NATS for some workloads where I've obviously spent low effort validating whether it's a great idea, and I'm pretty horrified with the report. (=
Comment by _zoltan_ 1 day ago
Comment by whimsicalism 1 day ago
Comment by awesome_dude 1 day ago
The things that have been "disrupted" haven't delivered: blockchains are still a scam, food delivery services are worse than before (restaurants are worse off, the people making the deliveries are worse off), and taxis still needed to go back and vet drivers to ensure that they weren't fiends.
Comment by hbbio 1 day ago
Did you actually look at the blockchain node implementations as of 2025 and what's on the roadmap? Ethereum nodes/L2s with optimistic or zk-proofs are probably the most advanced distributed databases that actually work.
(not talking about "coins" and stuff obviously, another debate)
Comment by otterley 1 day ago
What are you comparing against? Aren't they slower, less convenient, and less available than, say, DynamoDB or Spanner, both of which have been in full-service, reliable operation since 2012?
Comment by derefr 1 day ago
A big DynamoDB/Spanner deployment is great while you can guarantee some benevolent (or just not-malevolent) org around to host the deployment for everyone else. But technologies of this type do not have any answer for the key problem of "ensure the infra survives its own founding/maintaining org being co-opted + enshittified by parties hostile to the central purpose of the network."
Blockchains — and all the overhead and pain that comes with them — are basically what you get when you take the classical small-D distributed database design, and add the components necessary to get that extra property.
Comment by hbbio 23 hours ago
DynamoDB and Spanner are both great, but they're meant to be run by a single admin. It's a considerably simpler problem to solve.
Comment by Agingcoder 1 day ago
Comment by drdrey 1 day ago
Comment by charcircuit 1 day ago
Comment by drdrey 1 day ago
Comment by j16sdiz 1 day ago
You can have multiple replicas without the extra computation for hashes and such.
Comment by whimsicalism 15 hours ago
Comment by MrDarcy 1 day ago
Comment by colechristensen 1 day ago
Comment by johncolanduoni 1 day ago
Comment by stmw 1 day ago
Comment by LaGrange 1 day ago
Comment by staticassertion 18 hours ago
Comment by mzl 18 hours ago
Comment by awesome_dude 14 hours ago
1. School-based study is supposed to cover all the basics; self-directed, you have to know what the basics are, or find out, and then cover them.
2. With school-based study, the teachers/lecturers are supposed to have checked all the available texts on the subject and then share the best with the students (the teachers are the ones who ensure nobody goes down unproductive rabbit holes).
3. People can see from the qualifications that a person has met a certain standard, understands the subject, has the knowledge, and can communicate that to a prescribed level.
Personal note: I have done both in different careers, and being "self taught" I realised that whilst I definitely knew more about one topic in the field than qualified individuals, I never knew what the complete set of study for the field was (I never knew how much they really knew, so could never fill the gaps I had).
In CS I gained my qualification in 2010; when I went to find work, a lot of places were placing emphasis on self-taught people, who were deemed to be more creative, more motivated, etc. When I did work with these individuals, without fail they were missing a basic understanding of fundamentals, like data structures, well-known algorithms, and so on.
Comment by staticassertion 15 hours ago
Comment by belter 21 hours ago
- ACKed messages can be silently lost due to minority-node corruption.
- A single-bit corruption can cause some replicas to lose up to 78% of stored messages.
- Snapshot corruption can propagate and lead to entire stream deletion across the cluster.
- The default lazy-fsync mode can drop minutes of acknowledged writes on a crash.
- A crash combined with network delay can cause persistent split-brain and divergent logs.
- Data loss even with “sync_interval = always” in presence of membership changes or partitions.
- Self-healing and replica convergence did not always work reliably after corruption.
…was not downvoted, but flagged. That is telling: documented failure modes are apparently controversial. It also raises the question of what level of technical due diligence was performed by organizations like Mastercard, Volvo, PayPal, Baidu, Alibaba, or AT&T before adopting this system.
So what is next? Nominate NATS for the Silent Failure Peace Prize?
Comment by traceroute66 20 hours ago
One or two of the comments on GitHub by the NATS team in response to Issues opened by Kyle are also more than a bit cringeworthy.
Such as this one:
"Most of our production setups, and in fact Synadia Cloud as well is that each replica is in a separate AZ. These have separate power, networking etc. So the possibility of a loss here is extremely low in terms of due to power outages."
Which Kyle had to call them out on:
"Ah, I have some bad news here--placing nodes in separate AZs does not mean that NATS' strategy of not syncing things to disk is safe. See #7567 for an example of a single node failure causing data loss (and split-brain!)."
https://github.com/nats-io/nats-server/issues/7564#issuecomm...
Comment by to11mtm 11 hours ago
I have to note the following as a NATS fan:
- I am horrified by Jepsen's reliability findings; however, they do vindicate certain design decisions I made in the past.
- 'Core NATS' is really mostly 'Redis pub/sub but better', and Core NATS is honestly awesome, low-friction middleware. I've used it as part of eventing systems in the past and it works great.
- FWIW, there's an MQTT bridge that requires JetStream, but if you're just doing QoS 0 you can work around the other warts.
- If you use JetStream KV as a cache layer without real persistence (i.e. closer to how one uses Redis KV, where it's just memory-backed), you don't care about any of this. And again, JetStream KV IMO is better than Redis KV since they added TTL.
All of that is a way to say I'd bet a lot of them are using Core NATS or other specific features versus something like JetStream. tl;dr: JetStream's reliability is apparently horrifying, but I stand by the statement that Core NATS and ephemeral KV are amazing.
Comment by dboreham 1 day ago
Comment by johncolanduoni 1 day ago
Comment by lnenad 22 hours ago
Comment by jwr 20 hours ago
It provides a dictionary of terms that we can use to have educated discussions, rather than throwing around terms like "ACID".
Comment by plandis 16 hours ago
There is also this [1] which Aphyr collabed on which you might find interesting if you haven’t seen it yet.
Comment by jessekv 1 day ago
https://github.com/nats-io/nats-server/discussions/3312#disc...
(I opened this discussion 2.5 years ago and have gotten an email from GitHub every once in a while ever since. I had given up hope, TBH.)
Comment by johncolanduoni 1 day ago
Comment by vrnvu 1 day ago
Comment by merb 1 day ago
Why? Why do some databases do that? To have better performance in benchmarks? It might be OK to do that if you had a better default, or at least wrote a lot about it. But especially when you run stuff in a small cluster, you get bitten by things like that.
Comment by aaronbwebber 1 day ago
Many applications do not require true durability and it is likely that many applications benefit from lazy fsync. Whether it should be the default is a lot more questionable though.
Comment by johncolanduoni 1 day ago
Comment by traceroute66 1 day ago
Yeah, it should use safe-defaults.
Then you can always go read the corners of the docs for the "go faster" mode.
Just like Postgres's infamous "non-durable settings" page... https://www.postgresql.org/docs/18/non-durability.html
Comment by semiquaver 1 day ago
Comment by tybit 21 hours ago
Comment by senderista 1 day ago
Comment by otabdeveloper4 20 hours ago
Pretty much no application requires true durability.
Comment by staticassertion 15 hours ago
When your system doesn't do things like fsync, you can't do that at all. X is 1. That is not what people expect.
Most people probably don't require X == Y, but they may have requirements that X > 1.
Comment by millipede 1 day ago
Comment by aphyr 1 day ago
Comment by to11mtm 1 day ago
I must note that the default for Postgres is that there is NO delay, which is a sane default.
> You can batch up multiple operations into a single call to fsync.
I've done this in various messaging implementations for throughput, and it's actually fairly easy to do in most languages.
Basically, set up 1-N writers (depending on how you're storing data, really) that take a set of items containing the data to be written alongside a TaskCompletionSource (a Promise, in Java terms). When your code wants to write, it pushes to that local queue; the worker(s) on the queue write out messages in batches based on whatever criteria make sense (i.e. tuned for write size, number of records, etc., for both throughput and guaranteeing forward progress), and when the write completes you either complete or fail the TCS/Promise. (A rough sketch follows below.)
If you've got the right 'glue' with your language/libraries it's not that hard; this example [0] from Akka.NET's SQL persistence layer shows how simple the actual write processor's logic can be. Yeah, you have to think about queueing a little bit, but I've found this basic pattern very adaptable (i.e. the queueing op can just send a bunch of ready-to-go bytes and you work off that for the threshold instead, add framing if needed, etc.).
[0] https://github.com/akkadotnet/Akka.Persistence.Sql/blob/7bab...
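A minimal Go sketch of the same pattern (hypothetical names; the real-world version is the Akka.NET code linked above): each caller hands the single writer its bytes plus a channel to be completed, the writer drains whatever has queued up, does one write + fsync for the whole batch, then completes every caller's "promise".

    // Group-commit writer sketch: many callers, one fsync per batch.
    package groupcommit

    import "os"

    type writeReq struct {
    	data []byte
    	done chan error // the "promise": completed once the batch is durable
    }

    type Writer struct {
    	reqs chan writeReq
    	f    *os.File
    }

    // Append queues data and blocks until the batch containing it has been fsynced.
    func (w *Writer) Append(data []byte) error {
    	req := writeReq{data: data, done: make(chan error, 1)}
    	w.reqs <- req
    	return <-req.done
    }

    // run is the single writer goroutine: take one request, opportunistically
    // drain everything else already queued, write + fsync once, then complete
    // every request in the batch (success or failure alike).
    func (w *Writer) run() {
    	for req := range w.reqs {
    		batch := []writeReq{req}
    	drain:
    		for len(batch) < 1024 { // cap batch size to guarantee forward progress
    			select {
    			case r := <-w.reqs:
    				batch = append(batch, r)
    			default:
    				break drain
    			}
    		}
    		var err error
    		for _, r := range batch {
    			if _, werr := w.f.Write(r.data); werr != nil && err == nil {
    				err = werr
    			}
    		}
    		if serr := w.f.Sync(); serr != nil && err == nil {
    			err = serr
    		}
    		for _, r := range batch {
    			r.done <- err
    		}
    	}
    }

No caller gets its ack until the fsync covering its write has returned, which is exactly the property the lazy-fsync default gives up.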
Comment by aphyr 1 day ago
Comment by to11mtm 1 day ago
Just wanted to clarify that the default is still at least safe, in case people perusing this for things to worry about were, well, thinking about worrying.
Love all of your work and writings, thank you for all you do!
Comment by loeg 1 day ago
Comment by kbenson 1 day ago
Comment by senderista 1 day ago
Comment by millipede 13 hours ago
Comment by thinkharderdev 1 day ago
Yes, exactly.
Comment by mrkeen 1 day ago
The kind of failure that a system can tolerate with strict fsync but can't tolerate with lazy fsync (i.e. the software 'confirms' a write to its caller but then crashes) is probably not the kind of failure you'd expect to encounter on a majority of your nodes all at the same time.
Comment by johncolanduoni 1 day ago
Comment by loeg 1 day ago
I'm sorry, tail latencies are high for SSDs? In my experience, the tail latencies are much higher for traditional rotating media (tens of seconds, vs 10s of milliseconds for SSDs).
Comment by johncolanduoni 1 day ago
Comment by loeg 1 day ago
Comment by senderista 1 day ago
Comment by dilyevsky 1 day ago
Comment by speedgoose 1 day ago
Comment by formerly_proven 1 day ago
Comment by onionisafruit 1 day ago
Comment by orthoxerox 1 day ago
Comment by cnlwsu 1 day ago
Comment by mysfi 1 day ago
Comment by aphyr 1 day ago
Comment by bsaul 1 day ago
Comment by andersmurphy 1 day ago
> They will refuse, of course, and ever so ashamed, cite a lack of culture fit. Alight upon your cloud-pine, and exit through the window. This place could never contain you.
https://aphyr.com/posts/340-reversing-the-technical-intervie...
Comment by dangoodmanUT 1 day ago
Just use redpanda.
Comment by maxmcd 1 day ago
Is the performance warning in the NATS docs possible to improve on? Couldn't you still run fsync on an interval and queue up a certain number of writes to be flushed at once? I could imagine latency suffering, but batched throughput could be preserved to some extent?
Comment by scottlamb 1 day ago
Yes, and you shouldn't even need a fixed interval. Just queue up any writes while an `fsync` is pending; then do all those in the next batch. This is the same approach you'd use for rounds of Paxos, particularly between availability zones or regions where latency is expected to be high. You wouldn't say "oh, I'll ack and then put it in the next round of Paxos", or "I'll wait until the next round in 2 seconds then ack"; you'd start the next batch as soon as the current one is done.
Comment by ADefenestrator 1 day ago
Comment by clemlesne 1 day ago
Comment by belter 1 day ago
Comment by Thaxll 1 day ago
https://archive.fosdem.org/2019/schedule/event/postgresql_fs...
It did not prevent people from using it. You won't find a database that has perfect durability, ease of use, performance, etc. It's all about tradeoffs.
Comment by dijit 1 day ago
PostgreSQL was able to fix their bug in 3 lines of code; how many would it take for the parent system?
I understand your core thesis (sometimes durability guarantees aren't as needed as we think), but in PostgreSQL's case the edge was incredibly thin: it would have required a failed call to fsync followed by a system-level failure of the host before another call to fsync (and calls to fsync are reasonably common).
It's far too apples-to-oranges to be meaningful to bring up, I'm afraid.
Comment by Thaxll 1 day ago
Comment by mring33621 1 day ago
The persistence stuff is kinda new and it's not a surprise that there are limitations and bugs.
You should see this report as a good thing, as it will add pressure for improvements.
Comment by njuw 1 day ago
It's not really that new. The precursor to JetStream was NATS Streaming Server [1], which was first tagged almost 10 years ago [2].
[1] https://github.com/nats-io/nats-streaming-server
[2] https://github.com/nats-io/nats-streaming-server/releases/ta...
Comment by hurturue 1 day ago
as they would say, NATS is a terrible message bus system, but all the others are worse
Comment by johncolanduoni 1 day ago
Comment by adhamsalama 1 day ago
Comment by cedws 1 day ago
Comment by rockwotj 1 day ago
Comment by tptacek 1 day ago
Comment by KaiserPro 1 day ago
Comment by rdtsc 1 day ago
I am getting strong early MongoDB vibes. "Look how fast it is, it's web-scale!". Well, if you don't fsync, you'll go fast, but you'll go even faster piping customer data to /dev/null, too.
Coordinated failures shouldn't be a novelty or a surprise any longer these days.
I wouldn't trust a product that doesn't default to safest options. It's fine to provide relaxed modes of consistency and durability but just don't make them default. Let the user configure those themselves.
Comment by Thaxll 1 day ago
One of the most used DBs in the world is Redis, and by default it fsyncs every second, not on every operation.
Comment by andersmurphy 1 day ago
The problem is it has terrible defaults for performance (in the context of web servers). Just bad legacy options, not ones that make it less robust: cache size ridiculously small, temp tables not in memory, WAL off so no concurrent reads/writes, etc.
Comment by jwr 20 hours ago
Comment by hxtk 1 day ago
Comment by hobs 1 day ago
Comment by lubesGordi 1 day ago
Comment by akshayshah 1 day ago
The docs explicitly state that clusters do not provide strong consistency and can lose acknowledged data.
Comment by sk5t 1 day ago
Comment by KaiserPro 1 day ago
I like that, and it allows me to build things around it.
For us when we used it back in 2018, it performed well and was easy to administer. The multi-language APIs were also good.
Comment by traceroute66 1 day ago
Not so fast.
Their docs make some pretty bold claims about JetStream....
They talk about JetStream addressing the "fragility" of other streaming technology.
And "This functionality enables a different quality of service for your NATS messages, and enables fault-tolerant and high-availability configurations."
And one of their big selling points for JetStream is the whole "store and replay" thing. Which implies the storage bit should be trustworthy, no?
Comment by KaiserPro 1 day ago
Comment by billywhizz 1 day ago
Comment by KaiserPro 1 day ago
smoke bomb
Comment by gopalv 1 day ago
The trouble is that you need to specifically optimize for fsyncs, because usually it is either no brakes or hand-brake.
The middle-ground of multi-transaction group-commit fsync seems to not exist anymore because of SSDs and massive IOPS you can pull off in general, but now it is about syscall context switches.
Two minutes is a bit too much (also fdatasync vs fsync).
Comment by senderista 1 day ago
tl;dr "multi-transaction group-commit fsync" is alive and well
Comment by wseqyrku 1 day ago
Wait, isn't that the whole point of acknowledgments? This is not acknowledgment, it's I'm a teapot.
Comment by rdtsc 14 hours ago
Comment by CuriouslyC 1 day ago
Comment by nchmy 1 day ago
Comment by traceroute66 1 day ago
Dude ... the guy was testing JetStream.
Which, I quote from the first phrase from the first paragraph on the NATS website:
NATS has a built-in persistence engine called JetStream which enables messages to be stored and replayed at a later time.
Comment by petre
Comment by KaiserPro 1 day ago
Comment by 0xbadcafebee 1 day ago
> I wouldn't trust a product that doesn't default to safest options
This would make most products suck, and require a crap-ton of manual fixes and tuning that most people would hate, if they even got the tuning right. You have to actually do some work yourself to make a system behave the way you require.
For example, Postgres' isolation level is weak by default, leading to race conditions. You have to explicitly enable serialization to avoid it, which is a performance penalty. (https://martin.kleppmann.com/2014/11/25/hermitage-testing-th...)
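For example, here is a hedged sketch in Go using the standard database/sql package (the lib/pq driver and DSN are placeholder choices): Postgres gives you READ COMMITTED unless you explicitly ask for SERIALIZABLE on the transaction.

    // Sketch: explicitly requesting SERIALIZABLE isolation for one transaction.
    package main

    import (
    	"context"
    	"database/sql"
    	"log"

    	_ "github.com/lib/pq" // placeholder Postgres driver choice
    )

    func main() {
    	db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer db.Close()

    	// Postgres defaults to READ COMMITTED; opt in to SERIALIZABLE per transaction.
    	tx, err := db.BeginTx(context.Background(), &sql.TxOptions{
    		Isolation: sql.LevelSerializable,
    	})
    	if err != nil {
    		log.Fatal(err)
    	}
    	// ... reads and writes that must not race ...
    	if err := tx.Commit(); err != nil {
    		// Serialization failures (SQLSTATE 40001) surface here; retry the transaction.
    		log.Fatal(err)
    	}
    }

The performance penalty mentioned above is exactly why it's opt-in rather than the default.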
Comment by TheTaytay 1 day ago
Woah, those are _really_ strong claims. "Lost writes are accepted"? Assuming we are talking about "acknowledged writes", which the article is discussing, I don't think it's true that this is a common default for databases and filesystems. Perhaps databases or K/V stores that are marketed as in-memory caches might have defaults like this, but I'm not familiar with other systems that do.
I'm also getting MongoDB vibes from deciding not to flush except once every two minutes. Even deciding to wait a second would be pretty long, but two minutes? A lot happens in a busy system in 120 seconds...
Comment by 0xbadcafebee 1 day ago
Most (all?) NoSQL solutions are also eventually consistent by default, which means they can lose data. That's how Mongo works: it syncs a journal every 30-100 ms, and it syncs full writes at a configurable delay. Mongo is terrible, but not because it behaves like a filesystem.
Note that this is not "bad", it's just different. Lots of people use these systems specifically because they need performance more than durability. There are other systems you can use if you need those guarantees.
Comment by andersmurphy 1 day ago
Comment by zbentley 1 day ago
Even if most users do turn out to want “fast_and_dangerous = true”, that’s not a particularly onerous burden to place on users: flip one setting, and hopefully learn from the setting name or the documentation consulted when learning about it that it poses operational risk.
Comment by hxtk 1 day ago
If you need performance and you pick data integrity, you find out when your latency gets too high. In the reverse case, you find out when a customer asks where all their data went.
Comment by to11mtm 1 day ago
- Read Committed default with MVCC (Oracle, Postgres, Firebird versions with MVCC, I -think- SQLite with WAL falls under this)
- Read committed with write locks one way or another (MSSQL default, SQLite default, Firebird pre MVCC, probably Sybase given MSSQL's lineage...)
I'm not aware of any RDBMS that treats 'serializable' as the default transaction level OOTB (I'd love to learn though!)
....
All of that said, 'inconsistent read because you don't know your RDBMS and did not pay attention to the transaction model' has a very different blame direction than 'we YOLO fsync on a timer to improve throughput'.
If anything, it scares me that there are no other tuning options involved, such as number of bytes or number of events.
If I get a write-ack from a middleware I expect it to be written one way or another. Not 'It is written within X seconds'.
AFAIK there's no RDBMS that will just 'lose a write' unless the disk happens to be corrupted (or, IDK, maybe someone YOLOing with chaos mode on DB2?)
Comment by hansihe 1 day ago
Comment by ncruces 1 day ago
No. SQLite is serializable. There's no configuration where you'd get read committed or repeatable read.
In WAL mode you may read stale data (depending on how you define stale data), but if you try to write in a transaction that has read stale data, you get a conflict, and need to restart your transaction.
There's one obscure configuration no one uses that's read uncommitted. But really, no one uses it.
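A hedged Go sketch of that restart loop (the mattn/go-sqlite3 driver, table, and retry policy are illustrative; real code would check specifically for SQLite's busy/snapshot error codes rather than retrying on any error):

    // Sketch: restart an SQLite write transaction whose read snapshot went stale.
    package main

    import (
    	"database/sql"
    	"log"

    	_ "github.com/mattn/go-sqlite3" // illustrative driver choice
    )

    func bumpCounter(db *sql.DB) error {
    	const maxRetries = 5
    	var err error
    	for i := 0; i < maxRetries; i++ {
    		err = func() error {
    			tx, err := db.Begin()
    			if err != nil {
    				return err
    			}
    			defer tx.Rollback() // no-op if Commit succeeded
    			var n int
    			if err := tx.QueryRow(`SELECT n FROM counter WHERE id = 1`).Scan(&n); err != nil {
    				return err
    			}
    			// If another connection wrote since our snapshot, this write conflicts.
    			if _, err := tx.Exec(`UPDATE counter SET n = ? WHERE id = 1`, n+1); err != nil {
    				return err
    			}
    			return tx.Commit()
    		}()
    		if err == nil {
    			return nil
    		}
    		// Simplification: retry on any error; check the SQLite error code in real code.
    	}
    	return err
    }

    func main() {
    	db, err := sql.Open("sqlite3", "app.db")
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer db.Close()
    	db.Exec(`PRAGMA journal_mode=WAL`)
    	db.Exec(`CREATE TABLE IF NOT EXISTS counter (id INTEGER PRIMARY KEY, n INTEGER)`)
    	db.Exec(`INSERT OR IGNORE INTO counter (id, n) VALUES (1, 0)`)
    	if err := bumpCounter(db); err != nil {
    		log.Fatal(err)
    	}
    }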
Comment by williamstein 1 day ago
Comment by williamstein 1 day ago
I got very deep into using NATS last year, and then realized the choices it makes for persistence are really surprising. Another horrible example is that server startup time is O(number of streams), with a big constant; this is extremely painful to hit in production.
I ended up implementing from scratch something with the same functionality (for me, as NATS server + JetStream), but based on socket.io and SQLite. It works vastly better for my use cases, since socket.io and SQLite are so mature.
Comment by PaoloBarbolini 1 day ago
https://github.com/nats-io/nats.rs/issues/1253#issuecomment-...
Comment by shikhar 1 day ago
Pros: unlimited streams with the durability of object storage – JetStream can only do a few K topics
Cons: no consumer groups yet, it's on the agenda
Comment by embedding-shape 1 day ago
Comment by shikhar 1 day ago
https://s2.dev/blog/dst https://s2.dev/blog/linearizability
We have also adopted Antithesis for a more thorough DST environment, and plan to do more with it.
One day we will engage Kyle to Jepsen, too. I'm not sure when though.
Comment by embedding-shape 19 hours ago
If everyone who was making a database/message queue/whatever distributed system shared their projects on every Jepsen submission, we'd never have any discussions about the actual software in question.
Comment by shikhar 3 hours ago
Comment by Kinrany 21 hours ago
Comment by pdimitar 1 day ago
I understand that you need to make money. But you'll have to have a proper self-hosting offering with paid support as well before you're considered, at least by me.
I'm not looking to have even more stuff in the cloud.
Comment by shikhar 3 hours ago
Comment by hmans 22 hours ago
Comment by sreekanth850 1 day ago
Comment by akshayshah 1 day ago
[0]: https://www.redpanda.com/blog/why-fsync-is-needed-for-data-s...
[1]: https://jack-vanlightly.com/blog/2023/4/24/why-apache-kafka-...
Comment by sreekanth850 1 day ago
Comment by menaerus 1 hour ago
Comment by lionkor 1 day ago
Comment by dzonga 1 day ago
Comment by ViewTrick1002 1 day ago
To implement backpressure without relying on out-of-band signals (distributed systems beware), you need a deep understanding of the entire Redis Streams architecture and how the pending entries list, consumer groups, consumers, etc. work and interact, so as not to lose data by overwriting yourself.
Unbounded would have been fine if we could spill to disk and periodically clean up the data, but this is Redis.
Not sure if that has improved.
Comment by ubercore 21 hours ago
Comment by gostsamo 1 day ago
Comment by selectodude 1 day ago
Comment by the__alchemist 1 day ago
Comment by Infiltrator 1 day ago
Comment by crote 1 day ago
Comment by loeg 1 day ago
Comment by selectodude 1 day ago
Comment by sam_lowry_ 1 day ago
Comment by t0i7a1r1a 1 day ago