Sharding to Contain the Blast Radius of Data Breaches

Posted by jboutwell 3 days ago


Comments

Comment by mhitza 1 day ago

Has anyone here who has built SaaS software tried a one-database-per-customer approach?

It's been on my mind for a while as an easy way to isolate customer data, make migrations easier, and build per-customer resource usage monitoring.

I haven't seen this approach used in practice yet, and I haven't had the chance to architect a new SaaS product in the last few years to try it out long term.

Comment by evanelias 20 hours ago

Yes, a decent number of SaaS businesses do this. It requires a lot of automation on the ops side, but it's definitely doable.

I primarily work on schema management (migration) software for MySQL/MariaDB, and some companies leverage my product's built-in sharding support [1] for this type of per-customer sharding scheme.

When using per-customer shards, it's particularly important to avoid shard schema drift. Once your number of shards gets relatively high, you can end up with problems where an ALTER TABLE has failed on just one shard (due to a network blip, for example).

For this reason, at large scale with many shards, a declarative approach to schema management often works quite well: you express the desired state purely as a set of CREATE statements, and the tooling figures out the correct DDL to run to bring each shard into that desired state. This inherently solves drift because you can use a simple reconciliation loop to retry failures, i.e. running the tool repeatedly until there are no remaining changes to apply to any shard.
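
As a very rough sketch of that loop (the helpers here are placeholders for illustration, not my tool's actual API):

    import time

    # Placeholder helpers, not a real library: list_shards() enumerates shard
    # connections, diff_schema() compares a shard's live schema against the
    # desired CREATE statements and returns the DDL needed to converge, and
    # apply_ddl() runs one statement on one shard.
    from schema_tooling import list_shards, diff_schema, apply_ddl  # assumed helpers

    DESIRED = open("schema/desired.sql").read()   # assumed path to the CREATE statements

    def reconcile_all(max_passes=5, pause_seconds=30):
        # A failed ALTER on one shard (network blip, lock timeout, ...) is not
        # fatal: the next pass re-diffs that shard and retries only whatever is
        # still missing, which is what makes drift self-healing.
        for _ in range(max_passes):
            failures = 0
            for shard in list_shards():
                for ddl in diff_schema(shard, DESIRED):
                    try:
                        apply_ddl(shard, ddl)
                    except Exception as err:
                        print(f"{shard}: failed, will retry next pass: {err}")
                        failures += 1
            if failures == 0:
                return True           # every shard matches the desired state
            time.sleep(pause_seconds)
        return False                  # still drifting after max_passes; investigate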

It's also important to support a flexible layout/mapping between shards and database servers. Initially you can likely fit all the shards on the same server / DBMS daemon, but as the number of customers increases, you'll need to span multiple servers.
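
For example, a small placement catalog is enough to keep that mapping flexible, since moving a shard to another server is just a row update (table layout and names invented for illustration):

    import sqlite3  # stand-in catalog store for the example

    # Catalog of shard placements: one row per customer, pointing at whichever
    # server currently hosts that customer's database. Spreading customers
    # across more servers later only means updating rows here.
    catalog = sqlite3.connect(":memory:")
    catalog.execute("""
        CREATE TABLE shard_map (
            customer_id TEXT PRIMARY KEY,
            server_dsn  TEXT NOT NULL,   -- e.g. mysql://db-01.internal:3306
            db_name     TEXT NOT NULL    -- e.g. tenant_acme
        )
    """)
    catalog.executemany(
        "INSERT INTO shard_map VALUES (?, ?, ?)",
        [
            ("acme",       "mysql://db-01.internal:3306", "tenant_acme"),
            ("initech",    "mysql://db-01.internal:3306", "tenant_initech"),
            ("whale-corp", "mysql://db-02.internal:3306", "tenant_whale_corp"),
        ],
    )

    def locate_shard(customer_id):
        # Return (server_dsn, db_name) for a customer, or raise if unknown.
        row = catalog.execute(
            "SELECT server_dsn, db_name FROM shard_map WHERE customer_id = ?",
            (customer_id,),
        ).fetchone()
        if row is None:
            raise KeyError(f"no shard registered for {customer_id}")
        return row

    print(locate_shard("whale-corp"))  # ('mysql://db-02.internal:3306', 'tenant_whale_corp')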

[1] https://www.skeema.io/docs/features/sharding/

Comment by eugenekay 1 day ago

Yes, using Microsoft SQL Server for Linux; hosted both on-premises with VMware and in Azure Virtual Machines - later migrated to Azure SQL Managed Instances. It worked great for the business' needs. The major architectural advantage was that each Customer had a completely isolated Tablespace, easing compliance auditing. Each DB could be exported/migrated to a different Instance or Region, and migration scripts running slowly for "whale" customers had no effect on the small fish. Monitoring of the Servers and individual Instances was straightforward, albeit very verbose due to the eventual scale.

There were a few administrative drawbacks, largely because the MS-SQL Server Management Studio tools do not scale well to hundreds of active connections from a single workstation; we worked around that with lots of Azure Functions runs instead. Costs and instance sizing were a constant struggle, though other engines like Postgres or even SQLite would likely be more efficient.

I have also seen this used in other forms quite successfully - Fandom/Wikia (used to?) use a MySQL database for each sub-site.

Comment by edwhitesell 1 day ago

I came in and worked on a SaaS product that did this in 2000 (it had been around since '97/'98). I was doing new customer deploys and support, not direct development. It was running on MSSQL 7.0, I think, then moved to MSSQL 2000. It worked okay, but we moved away from that model in a "next gen" build around 2001/2002.

The biggest hurdles are in things like configuration. You'll probably want one code base, and maybe even one deploy/package for the web servers. However, you'll need a different configuration for each customer (DB name, credentials, etc.), a way to manage those configurations, and a way to identify which customer an HTTP request belongs to before you can process it. You can use things like host names in your web app, but you'll really end up wanting some kind of "request router" to manage everything... at that point, it's far easier to put everything in one DB and move on with revenue generation.
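
For illustration only (names and config shape invented here), the host-based part of that routing boils down to something like:

    # Resolve the customer from the Host header, then attach that customer's
    # DB settings to the request before any application code runs.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class TenantConfig:
        db_host: str
        db_name: str
        db_user: str        # credentials would come from a secret store in practice

    TENANTS = {
        "acme.example.com":    TenantConfig("db-01.internal", "acme_prod", "acme_app"),
        "initech.example.com": TenantConfig("db-02.internal", "initech_prod", "initech_app"),
    }

    def resolve_tenant(host_header):
        # Map an incoming Host header to a per-customer database configuration.
        host = host_header.split(":")[0].lower()    # strip any :port suffix
        try:
            return TENANTS[host]
        except KeyError:
            raise LookupError(f"unknown tenant host: {host!r}")

    # e.g. inside a request handler:
    #   cfg = resolve_tenant(request.headers["Host"])
    #   conn = connect(host=cfg.db_host, dbname=cfg.db_name, user=cfg.db_user)
    print(resolve_tenant("acme.example.com"))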

Comment by mhitza 1 day ago

Thanks for the perspective. I actually think the complex parts you mention are relatively easy nowadays.

If I were to implement it today, I would probably use a centralized authorization service ("authentication gateway") with something like forward_auth in Caddy to "tag along" configuration data with the request (teams, instance landing page, etc., including encrypted database configuration, encoded as a JWT): https://caddyserver.com/docs/caddyfile/directives/forward_au...
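
As a rough sketch of the token such a gateway could issue for the proxy to attach to the request (the claim names and encryption scheme are just assumptions for illustration, not a prescribed format):

    import time
    import jwt                               # PyJWT
    from cryptography.fernet import Fernet

    SIGNING_KEY = "replace-with-a-real-secret"   # shared with the app servers
    CONFIG_KEY = Fernet.generate_key()           # in practice: a stable key from a secret store
    fernet = Fernet(CONFIG_KEY)

    def build_tenant_token(user_id, tenant, db_dsn):
        # Issue a short-lived JWT carrying the tenant's encrypted DB settings.
        encrypted_dsn = fernet.encrypt(db_dsn.encode()).decode()
        claims = {
            "sub": user_id,
            "tenant": tenant,
            "db_cfg": encrypted_dsn,             # only the app servers hold CONFIG_KEY
            "exp": int(time.time()) + 300,       # keep the token short-lived
        }
        return jwt.encode(claims, SIGNING_KEY, algorithm="HS256")

    # The gateway would return this token in a response header, the proxy would
    # copy it onto the upstream request, and the web app would verify it with
    # jwt.decode() and decrypt db_cfg before opening a connection.
    print(build_tenant_token("user-42", "acme", "postgres://db-01/acme_prod")[:40], "...")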

I think the hard part is having enough discipline within a team to mostly work with backwards-compatible database changes, the automation to make that seamless, and the will to be proactive with the possibilities of this setup.
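
One common shape for those backwards-compatible changes is the expand/contract pattern; roughly (table and column names made up):

    # Expand/contract sketch: old and new application code can run side by side
    # at every step, so many per-customer databases can be upgraded on a rolling
    # basis rather than in one coordinated big-bang migration.
    EXPAND = [
        # 1. Add the new column as nullable; existing code just ignores it.
        "ALTER TABLE orders ADD COLUMN shipped_at DATETIME NULL",
        # 2. Backfill while both old and new app versions are running.
        "UPDATE orders SET shipped_at = legacy_ship_date WHERE shipped_at IS NULL",
    ]
    CONTRACT = [
        # 3. Only after every app server reads and writes shipped_at.
        "ALTER TABLE orders DROP COLUMN legacy_ship_date",
    ]

    def run_phase(conn, statements):
        # Apply one phase to a single customer database; loop this over tenants.
        for sql in statements:
            conn.execute(sql)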

Comment by bob1029 1 day ago

We installed one server per customer and used SQLite. If someone wanted to breach all of our clients, they'd have to attack each one directly. With a simple single-server architecture, we can easily manage individual on-prem deployments of our software.
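
Schematically (path and schema invented for the example), each installed server only ever opens its own file:

    import sqlite3
    from pathlib import Path

    DB_PATH = Path("/var/lib/myapp/app.sqlite3")   # one file per installed server

    def open_db():
        # Open this deployment's single database, creating the schema if needed.
        # There is no shared database and no other customer's credentials on the box.
        conn = sqlite3.connect(DB_PATH)
        conn.execute("""
            CREATE TABLE IF NOT EXISTS documents (
                id   INTEGER PRIMARY KEY,
                body TEXT NOT NULL
            )
        """)
        return conn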

Comment by shizcakes 1 day ago

DBMSes usually struggle with a very high cardinality of databases. Also, easier migration control is a double-edged sword: the other edge is needing to coordinate many migrations.

You probably don’t want to do what you are proposing except in extreme, carefully evaluated cases.

Comment by nrhrjrjrjtntbt 1 day ago

Yes, using TiDB.
