Scaling PostgreSQL to power 800M ChatGPT users
Posted by mustaphah 1 day ago
Comments
Comment by ninkendo 1 day ago
Idle transactions have been a huge footgun at $DAYJOB… our code base is full of “connect, start a transaction, do work, if successful, commit.” It means you’re consuming a connection slot for all work, even while you’re not using the database, and not releasing it until you’re done. We had to bump the Postgres connection limits by an order of magnitude, multiple times, and before you know it Postgres takes up more RAM than anything else just to support the number of connections we need.
The problem permeated enough of our (rust) codebase that I had to come up with a compile time check that makes sure you’re not awaiting any async functions while a Postgres connection is in your scope. Using the .await keyword on an async function call, but not passing the pg connection to that function, ends up being a nearly perfect proxy for “doing unrelated work while not releasing a connection”. It worked extremely well, the compiler now just straight up tells us where we’re doing it wrong (in 100+ places in fact.)
Actually getting away from that pattern has been the hard part, but we’re almost rid of every place we’re doing it, and I can now run with a 32-connection pool in load testing instead of a 10,000 connection pool and there’s no real slowdowns. (Not that we’d go that low in production but it’s nice to know we can!)
Just decreasing the timeout for idle transactions would have probably been the backup option, but some of the code that holds long transactions is very rarely hit, and it would have taken a lot of testing to eliminate all of it if we didn’t have the static check.
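For illustration, a minimal sketch of the footgun and of the idle-transaction-timeout backstop, written against sqlx (the crate mentioned downthread). The table, the helper function, the 5s value, and the pool handling are made up for the example, and the exact executor-passing syntax varies between sqlx versions:

    use sqlx::PgPool;

    // Anti-pattern: the transaction (and its pooled connection) stays checked
    // out while we await unrelated, potentially slow work.
    async fn update_profile(pool: &PgPool, user_id: i64) -> sqlx::Result<()> {
        let mut tx = pool.begin().await?;                         // slot checked out here
        let profile = fetch_profile_from_service(user_id).await;  // connection sits "idle in transaction"
        sqlx::query("UPDATE users SET profile = $1 WHERE id = $2")
            .bind(profile.as_str())
            .bind(user_id)
            .execute(&mut *tx)
            .await?;
        tx.commit().await                                         // slot released only here
    }

    // The server-side backstop: have Postgres kill sessions that sit idle inside
    // a transaction (normally set in postgresql.conf or per role, not per session).
    async fn set_idle_timeout(pool: &PgPool) -> sqlx::Result<()> {
        sqlx::query("SET idle_in_transaction_session_timeout = '5s'")
            .execute(pool)
            .await?;
        Ok(())
    }

    // Stand-in for "unrelated work": an HTTP call, a queue publish, etc.
    async fn fetch_profile_from_service(_user_id: i64) -> String {
        String::new()
    }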
Comment by MaxGabriel 5 hours ago
1. Nested database transactions could exhaust the transaction pool and deadlock
2. Same as you described, e.g. doing HTTP during transactions
We now have a compile time guarantee that no IO can be done outside of whitelisted things, like logging or getting the current time. It’s worked great! Definitely a good amount of work though.
Comment by ninkendo 17 minutes ago
I had this realization while writing the rustc plugin that this is basically another shade of “function coloring”, but done intentionally. Now I wish I could have a language that lets me intentionally “color” my functions such that certain functions can only be called from certain blessed contexts… not unlike how async functions can only be awaited by other async functions, but for arbitrary domain-specific abstractions, in particular database connections in this case. I want to make it so HTTP calls are “purple”, and any function that gets a database connection is “pink”, and make it so purple can call pink but not vice-versa.
The rule I ended up with in the lint is basically “if you have a connection in scope, you can only .await a function if you’re passing said connection to that function” (either by reference or by moving it.) It works with Rust’s knowledge of lifetimes and drop semantics, so that if you call txn.commit() (which moves the connection out of scope, marking the storage as dead) you’re now free to do unrelated async calls after that line of code. It’s not perfect though… if you wrap the connection in a struct and hold that in your scope, the lint can’t see that you’re holding a connection. Luckily we’re not really doing that anywhere: connections are always passed around explicitly. But even if we did, you can also configure the lint with a list of “connection types” that will trigger the lint.
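Roughly, the shapes the lint distinguishes look like this (hypothetical function names, sqlx-style types; the real check is the custom rustc plugin described above, not anything in sqlx itself):

    use sqlx::{Postgres, Transaction};

    async fn allowed(tx: &mut Transaction<'_, Postgres>) -> sqlx::Result<()> {
        // OK: the connection is passed into the awaited call, so the lint
        // treats it as database work.
        insert_audit_row(tx).await
    }

    async fn rejected(tx: &mut Transaction<'_, Postgres>) -> sqlx::Result<()> {
        // Lint error: a connection is in scope, but the awaited call doesn't
        // take it, i.e. unrelated work while holding a pool slot.
        refresh_cache().await;
        insert_audit_row(tx).await
    }

    async fn fine_after_commit(tx: Transaction<'_, Postgres>) -> sqlx::Result<()> {
        tx.commit().await?;    // connection moved out of scope / storage dead from here on
        refresh_cache().await; // OK: nothing is held any more
        Ok(())
    }

    async fn insert_audit_row(_tx: &mut Transaction<'_, Postgres>) -> sqlx::Result<()> {
        Ok(())
    }

    async fn refresh_cache() {}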
Comment by csiegert 1 day ago
Comment by ninkendo 1 day ago
If the codebase is large, and full of that kind of pattern (interleaving db writes with other work), the compiler plugin is nice for (a) giving you a TODO list of all the places you’re doing it wrong, and (b) preventing any new code from doing this while you’re fixing all the existing cases.
One idea was to bulk-replace everything so that we pass a reference to the pool itself around, instead of a checked-out connection/transaction, and then we would only use a connection for each query on-demand, but that’s dangerous… some of these functions are doing writes, and you may be relying on transaction rollback behavior if something fails. So if you were doing 3 pieces of “work” with a single db transaction before, and the third one failed, the transaction was getting rolled back for all 3. But if you split that into 3 different short-lived connections, now only the last of the 3 db operations is rolled back. So you can’t just find/replace, you need to go through and consider how to re-order the code so that the database calls happen “logically last”, but are still grouped together into a single transaction as before, to avoid subtle consistency bugs.
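A sketch of that reordering (the work functions are placeholders): the non-database work runs first with no connection held, and the three writes still share one transaction, so the rollback semantics are preserved:

    use sqlx::{PgPool, Postgres, Transaction};

    async fn handle_request(pool: &PgPool) -> sqlx::Result<()> {
        // 1. Slow, non-database work happens up front, with no connection held.
        let a = compute_piece_a().await;
        let b = compute_piece_b().await;
        let c = compute_piece_c().await;

        // 2. Only now check out a connection. The three writes stay in a single
        //    transaction, so a failure in the third still rolls back all three.
        let mut tx = pool.begin().await?;
        write_a(&mut tx, &a).await?;
        write_b(&mut tx, &b).await?;
        write_c(&mut tx, &c).await?;
        tx.commit().await
    }

    async fn compute_piece_a() -> String { String::new() }
    async fn compute_piece_b() -> String { String::new() }
    async fn compute_piece_c() -> String { String::new() }

    async fn write_a(_tx: &mut Transaction<'_, Postgres>, _v: &str) -> sqlx::Result<()> { Ok(()) }
    async fn write_b(_tx: &mut Transaction<'_, Postgres>, _v: &str) -> sqlx::Result<()> { Ok(()) }
    async fn write_c(_tx: &mut Transaction<'_, Postgres>, _v: &str) -> sqlx::Result<()> { Ok(()) }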
Comment by yobbo 1 day ago
Comment by maherbeg 1 day ago
Comment by tasn 1 day ago
Comment by ninkendo 1 day ago
Hopefully it’s something I can open source soon (I may upstream it to the sqlx project, as that is what we’re using for db connections.)
Comment by deepbluev7 20 minutes ago
Comment by szastamasta 1 day ago
There’s also a lot of repetition. Maybe it was AI generated…?
Comment by PUSH_AX 1 day ago
Comment by embedding-shape 1 day ago
I remember coming across an article from NYCMesh that looked interesting ("Administrating the Mesh" - https://www.nycmesh.net/blog/datadog/). It made sense all the way until they put Datadog on top of everything, and I asked myself:
> What the hell, how is using a centralized service for managing a decentralized mesh a suitable solution? Did the author get employed by Datadog or what happened?
Then I got curious and lo and behold: the author was indeed hired by Datadog (and still works there AFAIK), effectively compromising the entire article and the project itself because of their new employer.
Comment by jll29 1 day ago
Comment by lenkite 1 day ago
Comment by salmonik 1 day ago
Comment by dbuser99 1 day ago
Comment by sbstp 1 day ago
There's nothing novel about optimizing queries, sharding and using read replicas.
Comment by ramraj07 1 day ago
Comment by jll29 1 day ago
For that reason, I find it actually bold that they disclosed it, and I appreciate it.
The article reminded me of a similar post about MySQL use for Facebook from the Meta team, which had the same message: big database servers are powerful workhorses that scale and are very cost-effective (and simpler to manage than distributed setups where writes need to be carefully orchestrated - a very hard task).
The two core messages of both articles combined could be read as: 1. big DB servers are your friend and 2. keep it simple, unless you can't avoid the extra complexity any more.
Comment by evanelias 23 hours ago
Comment by spullara 1 day ago
Comment by ants_a 1 day ago
Comment by hu3 1 day ago
They had huge problems with VACUUM at high tps. Basically the database never had space to breathe and clean up.
Comment by everfrustrated 1 day ago
Comment by zozbot234 1 day ago
Comment by dmix 1 day ago
> The primary rationale is that sharding existing application workloads would be highly complex and time-consuming, requiring changes to hundreds of application endpoints and potentially taking months or even years
Comment by manquer 1 day ago
On one hand, OAI sells coding agents and constantly hypes how easily they will replace developers and how most of the code written is by agents; on the other hand, they claim it will take years to refactor.
Both cannot be true at the same time.
Comment by simonw 1 day ago
Comment by gloflo 1 day ago
Comment by csto12 1 day ago
Comment by Ozzie_osman 1 day ago
Comment by aisuxmorethanhn 1 day ago
Comment by zozbot234 1 day ago
Comment by EB66 1 day ago
Comment by zozbot234 1 day ago
But how would any of that change by going outside Postgres itself to begin with? That's the part that doesn't make much sense to me.
Comment by londons_explore 1 day ago
Ie. if you shard by userId, then a "share" feature which allows a user to share data with another user by having a "SharedDocuments" table cannot be consistent.
That in turn means you're probably going to have to rewrite the application to handle cases like a shared document having one or other user attached to it disappear or reappear. There are loads of bugs that can happen with weak consistency like this, and at scale every very rare bug is going to happen and need dealing with.
Comment by zozbot234 1 day ago
Not necessarily? You can have two-phase commit for cross-shard writes, which ought to be rare anyway.
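For anyone unfamiliar, the Postgres primitives for this are PREPARE TRANSACTION / COMMIT PREPARED (they require max_prepared_transactions > 0 on each shard). A rough sketch of a cross-shard write, with made-up tables and transaction ids, and omitting the failure/recovery bookkeeping a real coordinator needs:

    use sqlx::PgPool;

    async fn share_document(shard_a: &PgPool, shard_b: &PgPool, doc_id: i64, user_id: i64) -> sqlx::Result<()> {
        let mut a = shard_a.acquire().await?;
        let mut b = shard_b.acquire().await?;

        // Phase 1: do the writes on each shard and PREPARE them. After PREPARE,
        // each shard has durably promised it can commit. (In real code the gids
        // would be unique per logical transaction, not hard-coded.)
        sqlx::query("BEGIN").execute(&mut *a).await?;
        sqlx::query("INSERT INTO documents_owned (doc_id, owner_id) VALUES ($1, $2)")
            .bind(doc_id).bind(user_id).execute(&mut *a).await?;
        sqlx::query("PREPARE TRANSACTION 'share-42-a'").execute(&mut *a).await?;

        sqlx::query("BEGIN").execute(&mut *b).await?;
        sqlx::query("INSERT INTO shared_documents (doc_id, user_id) VALUES ($1, $2)")
            .bind(doc_id).bind(user_id).execute(&mut *b).await?;
        sqlx::query("PREPARE TRANSACTION 'share-42-b'").execute(&mut *b).await?;

        // Phase 2: commit both prepared transactions. If anything failed before
        // this point, ROLLBACK / ROLLBACK PREPARED undoes whatever was prepared.
        sqlx::query("COMMIT PREPARED 'share-42-a'").execute(&mut *a).await?;
        sqlx::query("COMMIT PREPARED 'share-42-b'").execute(&mut *b).await?;
        Ok(())
    }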
Comment by londons_explore 1 day ago
Other clients (readers) have to be able to deal with inconsistencies in the meantime.
Also, 2PC in postgres is incompatible with temporary tables, which rules out use with long-running batch analysis jobs which might use temporary tables for intermediate work and then save results. Eg. "We want to send this marketing campaign to the top 10% of users" doesn't work with the naive approach.
Comment by ants_a 1 day ago
Comment by awesome_dude 19 hours ago
The nanosecond the system has the concept of readers and writers being different processes/people/whatever, it has multiple copies: the one held by the database, and the copies held by the readers from when they last read.
It does not matter if there is a single DB lock, or a multi shared distributed lock.
Comment by awesome_dude 1 day ago
If you're having trouble there then a proxy "layer" between your application and the sharded database makes sense, meaning your application still keeps its naive understanding of the data (as it should) and the proxy/database access layer handles that messiness... shirley
Comment by mkleczek 1 day ago
Comment by 9rx 1 day ago
Can you, though? OpenAI is haemorrhaging money like it is going out of style and, according to the news cycle over the last couple of days, will likely be bankrupt by 2027.
Comment by londons_explore 1 day ago
You suddenly have literally thousands of internal users of a datastore, and "We want to shard by userId, so please don't do joins on userId anymore" becomes an impossible ask.
Comment by AlisdairO 1 day ago
This would be a particularly nice-to-have feature for Postgres - the option to have heavyweight locks just proactively cancel any conflicting workload. For any case where you have a high-throughput table, the damage of the heavyweight lock sitting there waiting (and blocking all new traffic) is generally much larger than just cancelling some running transactions.
Comment by zozbot234 1 day ago
Comment by AlisdairO 1 day ago
SELECT or DML operations take a weak lock on the table that doesn't conflict with most other work, but it does conflict with the lock these schema changes need. While the schema change is waiting to acquire its table lock, all new operations (new SELECTs, for example) queue up behind it until the schema change completes.
So the following scenario can be pretty disastrous:
* Start a long-running SELECT operation on table
* Attempt to apply schema change to the table
* All new work on the table is blocked until the SELECT completes and the schema change can apply.
* Production outage
What the ChatGPT folks do is set a lock timeout when applying the schema change to make it 'give up' after a few seconds. This works to avoid truly excessive impact (in their case, they may have micro-outages of up to 5s while trying to apply schema), but has problems - firstly, they then need to retry, which may lead to more micro-outages, and secondly there's no guarantee on a system with mixed workload that they will be able to force the change through, and the schema change just ends up getting starved out.
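A sketch of that lock_timeout-plus-retry pattern (the DDL, the timeout, and the retry policy are placeholders):

    use sqlx::PgPool;

    // Try to apply a schema change without letting it queue behind long-running
    // queries: a short lock_timeout makes the ALTER fail fast instead of blocking
    // all new traffic on the table while it waits for the lock.
    async fn alter_with_lock_timeout(pool: &PgPool) -> sqlx::Result<bool> {
        for attempt in 1..=10 {
            let mut tx = pool.begin().await?;
            sqlx::query("SET LOCAL lock_timeout = '2s'").execute(&mut *tx).await?;
            let res = sqlx::query("ALTER TABLE conversations ADD COLUMN archived boolean DEFAULT false")
                .execute(&mut *tx)
                .await;
            match res {
                Ok(_) => {
                    tx.commit().await?;
                    return Ok(true);
                }
                Err(e) => {
                    // Lock timeout hit: roll back, back off, retry. Each failed
                    // attempt is still a brief micro-outage for queries queued behind us.
                    drop(tx);
                    eprintln!("attempt {attempt}: schema change timed out waiting for lock: {e}");
                    tokio::time::sleep(std::time::Duration::from_secs(5)).await;
                }
            }
        }
        Ok(false) // starved out; caller decides what to do next
    }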
A better alternative for most workloads is to build a system that detects what workload is blocking your schema change and kills it, allowing the schema change to go through quickly and unblock all the work behind it. You'd still use a lock timeout with this to be on the safe side, but it shouldn't be necessary in most cases.
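A sketch of that detect-and-kill approach, assuming you already know the backend pid of the waiting DDL session (a real version would also filter which blockers are safe to cancel):

    use sqlx::PgPool;

    // Find whatever is blocking the waiting schema change and cancel it, so the
    // ALTER can take its lock instead of queueing everything else behind it.
    async fn kill_blockers_of(pool: &PgPool, ddl_pid: i32) -> sqlx::Result<()> {
        let blockers: Vec<i32> = sqlx::query_scalar("SELECT unnest(pg_blocking_pids($1))")
            .bind(ddl_pid)
            .fetch_all(pool)
            .await?;

        for pid in blockers {
            // pg_cancel_backend cancels just the running query; pg_terminate_backend
            // drops the whole session if cancelling isn't enough.
            sqlx::query("SELECT pg_cancel_backend($1)")
                .bind(pid)
                .execute(pool)
                .await?;
        }
        Ok(())
    }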
Side note on transactional DDL - for Postgres systems with high throughput, most people just use autocommit. Table level locks that get taken to perform the schema change get held for the duration of the transaction, and you generally want to really minimize the amount of time you hold them for.
Comment by winterrx 1 day ago
Comment by londons_explore 1 day ago
I think they handled the massive growth with a lot of 2am emergencies and editing config files directly in production in the hope of putting out fires.
Comment by n_u 1 day ago
> We added nearly 50 read replicas, while keeping replication lag near zero
I wonder what those replication lag numbers are exactly and how they deal with stragglers. It seems likely that at any given moment at least one of the 50 read replicas is lagging because of a CPU/mem usage spike. Then presumably that would slow down the primary, since it has to wait for the TCP acks before sending more of the WAL.
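For reference, per-replica lag is visible from the primary in pg_stat_replication (with plain asynchronous streaming replication the primary doesn't wait on these acks to commit unless synchronous replication is configured). A quick sketch of reading it:

    use sqlx::PgPool;

    async fn report_replication_lag(primary: &PgPool) -> sqlx::Result<()> {
        // replay_lag is NULL until a standby has reported progress at least once.
        let rows: Vec<(Option<String>, Option<f64>)> = sqlx::query_as(
            "SELECT application_name, EXTRACT(EPOCH FROM replay_lag)::float8
             FROM pg_stat_replication",
        )
        .fetch_all(primary)
        .await?;

        for (name, lag_secs) in rows {
            println!("replica {:?}: replay lag {:?} s", name, lag_secs);
        }
        Ok(())
    }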
Comment by tomnipotent 1 day ago
Other than keeping around more WAL segments, I'm not sure why it would slow down the primary?
Comment by bostik 1 day ago
You could use asynchronous WAL shipping, where the WAL files are uploaded to an object store (S3 / Azure Blob) and the streaming connections are only used to signal the position of the WAL head to the replicas. The replicas then fetch the WAL files from the object store and replay them independently. This is what wal-g does, for a real-life example.
The tradeoffs when using that mechanism are pretty funky, though. For one, the strategy imposes a hard lower bound on replication delay, because even the happy path is now "primary writes WAL file; primary updates WAL head position; primary uploads WAL file to object store; replica downloads WAL file from object store; replica replays WAL file". On the unhappy path, write bursts can push the delay up significantly. You are also subject to any object store and/or API rate limits. The setup makes replication delays slightly more complex to monitor, but for a competent engineering team that shouldn't be an issue.
But it is rather hilarious (in retrospect only) when an object store performance degradation takes all your replicas effectively offline and the readers fail over to getting their up-to-date data from the single primary.
Comment by ants_a 1 day ago
Comment by stemchar 1 day ago
I'd like to know more, since I don't understand how this could happen. When you say "block", what do you mean exactly?
Comment by bostik 1 day ago
Combination of: streaming replication; long-running reads on a replica; lots[þ] of writes to the primary. While the read on the replica is running, it will generate a temporary table under the hood (because the read "holds the table open by point in time"). Something in this scenario leaked state from the replica to the primary, because after several hours the primary would error out, and the logs showed that it failed to write because the old table was held in place on the replica and the two tables had deviated too far apart in time / versions.
It has seared itself into my memory because the thing just did not make any sense, and even figuring out WHY the writes had stopped at the primary took quite a bit of digging. I do remember that when the read on the replica was forcefully terminated, the primary was eventually released.
þ: The ballpark would have been tens of millions of rows.
Comment by ants_a 6 hours ago
Comment by maherbeg 1 day ago
Comment by huksley 1 day ago
So it's not really scaling much now; rather, they're maintaining the current state of things and new features go to a different DB?
Comment by CodeCompost 1 day ago
Comment by londons_explore 1 day ago
Comment by freakynit 1 day ago
Comment by ggregoire 1 day ago
I always wondered what kind of instance companies at that level of scalability are using. Anyone here have some ideas? How much cpu/ram? Do they use the same instance types available to everyone, or does AWS and co offer custom hardware for these big customers?
Comment by jiggawatts 1 day ago
For example, Azure Standard_E192ibds_v6 is 96 cores with 1.8 TB of memory and 10 TB of local SSD storage with 3 million IOPS.
Past those "general purpose" VMs you get the enormous machines with 8, 16, or even 32 sockets.[1] These are almost exclusively used for SAP HANA in-memory databases or similar ERP workloads.
Azure Standard_M896ixds_24_v3 provides 896 cores, 32 TB of memory, and 185 Gbps Ethernet networking. This is generally available, but you have to allocate the quota through a support ticket and you may have to wait and/or get your finances "approved" by Microsoft. Something like this will set you back [edited] $175K per month[/edited]. (I suspect OpenAI is getting a huge effective discount.)
Personally, I'm a fan of "off label" use of the High Performance Compute (HPC) sizes[2] for database servers.
The Standard_HX176rs HPC VM size gives you 176 cores and 1.4 TB of memory. That's similar to the E-series VM above, but with a higher compute-to-memory ratio. The memory throughput is also way better because it has some HBM chips for L3 (or L4?) cache. In my benchmarks it absolutely smoked the general-purpose VMs at a similar price point.
[1] https://learn.microsoft.com/en-us/azure/virtual-machines/siz...
[2] https://learn.microsoft.com/en-us/azure/virtual-machines/siz...
Comment by tetha 51 minutes ago
> For example, Azure Standard_E192ibds_v6 is 96 cores with 1.8 TB of memory and 10 TB of local SSD storage with 3 million IOPS.
That's a well-stocked Dell server going for ~50-60K capex without storage, before the RAM prices exploded. I'm wondering a bit about the CPU in there, but the storage + RAM is fairly normal and nothing crazy. I'm pretty sure you could have that in a rack for 100k in hardware pricing.
Comment by antonkochubey 1 day ago
lol, no, cloud is nowhere near that good value. It’s $3.5M annually.
> The Standard_HX176rs HPC VM size gives you 176 cores and 1.4 TB of memory
This one is $124k per year.
Comment by jiggawatts 1 day ago
I noticed that the M896i is so obscure and rarely used that there are typos associated with it everywhere, including the official docs! In one place it says it has 23 TB of memory when it actually has 32 TB.
Comment by manquer 1 day ago
https://docs.aws.amazon.com/sap/latest/general/sap-hana-aws-...
Comment by jiggawatts 1 day ago
That AWS instance uses these 60-core processors: https://www.intel.com/content/www/us/en/products/sku/231747/...
To anyone wondering about these huge memory systems: avoid them if at all possible! Only ever use these if you absolutely must.
For one, these systems have specialised parts that are more expensive per unit compute: $283 per CPU core instead of something like $85 for a current-gen AMD EPYC, which are also about 2x as fast as the older Intel Scalable Xeons that need to go into this chassis! So the cost efficiency ratio is something like 6:1 in favour of AMD processors. (The cost of the single large host system vs multiple smaller ones can get complicated.)
The second effect is that 32-way systems have huge inter-processor cache synchronisation overheads. Only very carefully coded software can scale to use thousands of cores without absolutely drowning in cache line invalidations.
At these scales you're almost always better off scaling out "medium" sized boxes. A single writer and multiple read-only secondary replicas will take you very far, up to hundreds of gigabits of aggregate database traffic.
Comment by cuu508 1 day ago
Comment by jiggawatts 1 day ago
Multiple of these can be linked together with “NUMALink” cables, which carry the same protocol as the traces that go between sockets on the motherboard. You end up with a single kernel running across multiple chassis.
Comment by zallarak 6 hours ago
ChatGPT is definitely the snappiest web UI of any LLM.
Comment by ahachete 18 hours ago
However, I'm still surprised about the reasons for not sharding. They have been mentioned before, but I haven't seen a substantial rationale.
Sharding is almost exclusively analyzed from the perspective of write scaling. But sharding is not only about write scaling; it's also a path to reducing blast radius. And that need triggers much earlier than write scaling does (especially given how well Postgres scales vertically and for reads).
When you shard your database, you end up having N clusters (for HA purposes, each "shard" must be a primary-replica(s) cluster itself), each holding 1/Nth of the data.
There are certain scenarios which, while unlikely, may hit you: data corruption in the WAL replication stream, a problem during a major upgrade, a situation that requires a whole cluster restore from a backup, you name it. For those cases, the whole cluster may experience notable downtime.
If you have a single cluster, 100% of your users experience downtime. If you sharded into N clusters, only 1/Nth of your users experience downtime. For a company servicing 800M users, the difference between those two scenarios is dramatic. It is even for much, much smaller companies.
I'm puzzled why this is not perceived as a risk, and if it is not, how it is mitigated.
While I wouldn't advocate sharding "too early", given that it comes with notable caveats, I believe more and more in sharding your workloads, when possible, earlier rather than later - way before truly needing it from a write-scaling perspective. Apart from reducing the blast radius, it implicitly applies the principle of "divide and conquer", and your problems become much more manageable: the number of connections per cluster decreases at will, backup restore times can be a fraction of what they were, logical replication becomes a true option for replication/upgrades/etc if you keep shards relatively small, and many other operational procedures are greatly simplified once you have much smaller databases, even if you have many more of them.
Comment by mrweasel 1 day ago
This is, however, the most down-to-earth "how we scale PostgreSQL" I've read in a long time. No weird hacks, no messing around with the source code or tweaking the Linux kernel. Running on Azure PostgreSQL, it's not like OpenAI has those options anyway, but it still seems a lot more relatable than "we wrote our own driver/filesystem/database-hack in Javascript".
Comment by bzmrgonz 1 day ago
Comment by doodlesdev 1 day ago
Honestly, only us nerds in Hacker News care about this kind of stuff :) (and that's why I love it here).
edit: also, the article notes that OpenAI did adopt Azure Cosmos DB for new stuff they want to shard. Still shows how far you can take PostgreSQL though.
Comment by bdcravens 1 day ago
Comment by DLA 1 day ago
Comment by beoberha 1 day ago
Comment by csto12 1 day ago
Comment by bzmrgonz 1 day ago
Comment by Tostino 1 day ago
Comment by tormeh 1 day ago
Comment by atonse 22 hours ago
I've lost count of the number of times I've read about some new Postgres or MySQL thing only to find out that Oracle or SQL Server implemented it 20 years ago. Yes, they always have it behind expensive SKUs. But they're hardly slouches in the technical competence department.
I found Oracle to just be a lot more unwieldy from a tooling perspective than SQL Server (which IMO had excellent tools like SSMS and the query planner/profiler to do all your DB management).
But overall, these paid databases have been very technically sound and have been solving some of these problems many, many years ago. It's still nice to see the rest of us benefit from these features in free databases nowadays.
As others have said, the query planners I used 25 years ago with Oracle (cost based, rule based, etc) were amazing. The Oracle one wasn't visual, but the MSSQL one was fully visual and actually gave you a whole graph of how the query was assembled. And I last used the MSSQL one 15 years ago.
Maybe pgAdmin does that now (I haven't used pgAdmin), but I miss the polished tools that came with SQL Server.
Comment by Tostino 1 day ago
It offers heap tables, as well as index organized tables depending on what you need.
The protocol supports running multiple queries and getting multiple resultsets back at once saving some round-trips and resources.
Also supports things like global temp tables, and in memory tables, which are helpful for some use cases.
The parallelism story for a single query is still stronger with SQL Server.
I'm sure I could think of more, but it's been a few years since I've used it myself and I've forgotten a bit.
It is a good database. I just wouldn't use it for my startup. I could never justify that license cost, and how it restricts how you design your infrastructure due to the cost and license terms.
Comment by bzmrgonz 1 day ago
Comment by beoberha 1 day ago
Comment by everfrustrated 1 day ago
Comment by esjeon 1 day ago
Comment by bzmrgonz 1 day ago
Comment by hu3 1 day ago
I wonder, is there another popular OLTP database solution that does this better?
> For write traffic, we’ve migrated shardable, write-heavy workloads to sharded systems such as Azure CosmosDB.
> Although PostgreSQL scales well for our read-heavy workloads, we still encounter challenges during periods of high write traffic. This is largely due to PostgreSQL’s multiversion concurrency control (MVCC) implementation, which makes it less efficient for write-heavy workloads. For example, when a query updates a tuple or even a single field, the entire row is copied to create a new version. Under heavy write loads, this results in significant write amplification. It also increases read amplification, since queries must scan through multiple tuple versions (dead tuples) to retrieve the latest one. MVCC introduces additional challenges such as table and index bloat, increased index maintenance overhead, and complex autovacuum tuning.
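A quick way to see the dead-tuple/bloat side of this on a running system is the per-table counters Postgres keeps; a small sketch (the top-10 cutoff is arbitrary):

    use sqlx::PgPool;

    async fn report_dead_tuples(pool: &PgPool) -> sqlx::Result<()> {
        // Every UPDATE under MVCC leaves behind a dead row version until
        // (auto)vacuum reclaims it; n_dead_tup is a rough gauge of that backlog.
        let rows: Vec<(String, i64, i64)> = sqlx::query_as(
            "SELECT relname::text, COALESCE(n_live_tup, 0), COALESCE(n_dead_tup, 0)
             FROM pg_stat_user_tables
             ORDER BY n_dead_tup DESC
             LIMIT 10",
        )
        .fetch_all(pool)
        .await?;

        for (table, live, dead) in rows {
            println!("{table}: {live} live, {dead} dead tuples");
        }
        Ok(())
    }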
Comment by 0xdeafbeef 1 day ago
Comment by anonzzzies 1 day ago
When did you get your results? Might be time to re-evaluate.
Comment by 0xdeafbeef 1 day ago
Comment by menaerus 1 day ago
So, this is the part that actually left me wondering why.
Comment by kuon 1 day ago
I don't really get the point here. What is novel and great? It feels like they followed the first "how to scale pg" article.
Comment by _ink_ 1 day ago
Comment by easton 1 day ago
(not that that's an excuse, but i've seen similar things before)
Comment by nasretdinov 1 day ago
The main point of the article is that it's actually not that hard to live with a single primary Postgres for your transactional workloads (emphasis on _transactional_), and if OpenAI with their 800M+ users can still survive on a single primary (with 50(!) read replicas), so could you, especially before you've reached your first 100M users.
Any non-distributed database or setup is orders of magnitude easier to design for, and it's also typically much more cost-efficient, in terms of both hardware and software.
There are some curious details, e.g.:
- you can ship WAL to 50 read replicas simultaneously from a single primary and be fine
- you can even be using an ORM and still get decent performance
- schema changes are possible, and you can just cancel a slow ALTER to prevent production impact
- pgbouncer is ok even for OpenAI scale
There are so many things that contradict current "conventional wisdom" based on the experience from what was possible with the hardware 10+ (or even 20+) years ago. Times finally changed and I really welcome articles like these that show how you can greatly simplify your production setup by leveraging the modern hardware.
Comment by voidd 1 day ago
Comment by neya 1 day ago
Months have passed since this application was developed (a simple Phoenix/Elixir backend), and yesterday I was casually checking my database to see how many rows it had - roughly 500,000+. I haven't noticed a single hint of strain from the volume Postgres is handling. Granted, I'm the only user, but there's always a lot going on - mostly RAG, which requires searching the database for context before multiple agents send you a response (and respond amongst themselves). Absolutely zero performance degradation.
I'm convinced that Postgres is a killer database that doesn't get the attention it deserves over the others (for chat). I'm already managing some high-traffic websites (with 500M+ requests) with no issues, so I'm extremely unsurprised that it works really well for chat apps at scale too.
Comment by QuiCasseRien 1 day ago
Comment by hahahahhaah 1 day ago
Comment by KellyCriterion 1 day ago
How do they store all the other stuff related to operating the service? This must be a combination of several components? (Yes, including some mass-data storage, I'd guess?)
This would be cool to understand, as I've absolutely no idea how this is done (or could be done :-)
Comment by ASalazarMX 1 day ago
Comment by ph4evers 1 day ago
Comment by kachapopopow 1 day ago
Comment by ed_mercer 1 day ago
Comment by noxs 1 day ago
> Author Bohan Zhang
> Acknowledgements Special thanks to Jon Lee, Sicheng Liu, Chaomin Yu, and Chenglong Hao, who contributed to this post, and to the entire team that helped scale PostgreSQL. We’d also like to thank the Azure PostgreSQL team for their strong partnership.
Comment by Natfan 1 day ago
e: and the link points to en-us at time of writing. I frankly don't see the value in your comment.
Comment by PunchyHamster 1 day ago
.
Comment by killingtime74 1 day ago
Comment by ramraj07 1 day ago
Comment by killingtime74 1 day ago
Comment by mannyv 1 day ago
I'm not sure that's the answer people are looking for.
Comment by ahmetozer 1 day ago
Comment by HenriTEL 1 day ago
Comment by bhouston 1 day ago
I mentioned that as the right solution to the problem last time they posted about Postgres performance issues:
https://news.ycombinator.com/item?id=44072645
But the response from an OpenAI engineer (who is the author of this article) was that sharding isn't the solution:
Comment by qaq 1 day ago
Comment by DeathArrow 1 day ago
Sure, but choosing from the start a DB that can scale with ease would have taken far less time and effort.
You can bend any software into doing anything, but is it worth it?
Comment by PunchyHamster 1 day ago
Comment by trhway 1 day ago
If there is a read replica that has reached the required snapshot - usually (depending on your task, of course) it's enough for that to be the snapshot from the start of your transaction - and the read query doesn't need to see your transaction's uncommitted data, then that replica can serve the read query.
Comment by MarginalGainz 1 day ago
If you treat Postgres strictly as a Key-Value store for transactional integrity, it flies. But if you try to run complex discovery or tsvector search on that same primary, the locking/CPU load kills it instantly.
The real hero here is aggressive segregation: keep Postgres for 'Truth', and offload all discovery to specialized indices that don't block the write-head.
Comment by lighthouse1212 1 day ago
Comment by fbotelho 1 day ago
> It may sound surprising that a single-primary architecture can meet the demands of OpenAI’s scale; however, making this work in practice isn’t simple.
And it also says that this approach has cornered them into a solution that isn't trivial to change. They now use different database deployments (the single primary one that is the focus of the post and *multiple* other systems, such as Azure CosmosDB, to which some of the write traffic is being directed).
> To mitigate these limitations and reduce write pressure, we’ve migrated, and continue to migrate, shardable (i.e. workloads that can be horizontally partitioned), write-heavy workloads to sharded systems such as Azure Cosmos DB, optimising application logic to minimise unnecessary writes. We also no longer allow adding new tables to the current PostgreSQL deployment. New workloads default to the sharded systems.
I wonder how easy it is for developers to maintain and evolve this mix of miscellaneous database systems.
So yes, you can go far with a single primary, but you can also potentially never easily get away from it.
Comment by poemxo 1 day ago
Comment by Etheryte 1 day ago
Comment by oofbey 1 day ago
That said big beef is so simple to start with. And this story is a strong example that YAGNI is a practical reality for almost everybody wrt “distributed everything”.
Comment by iamlintaoz 1 day ago
Comment by kevincox 1 day ago
Quite possibly they would have been better off staying purely Postgres but with sharding. But it's impossible to know.
Comment by msp26 1 day ago
Comment by LudwigNagasena 1 day ago
Comment by resdev 1 day ago
/s