Ask HN: How do you keep system context from rotting over time?
Posted by kennethops 4 days ago
Former SRE here, looking for advice.
I know there are a lot of tools focused on root cause analysis after things break. Cool, but that’s not what’s wearing me down. What actually hurts is the constant context switching while trying to understand how a system fits together, what depends on what, and what changed recently.
As systems grow, this feels like it gets exponentially harder. Add logs and now you’ve created a million new events to reason about. Add another database and suddenly you’re dealing with subnet constraints or a DB choice that’s expensive as hell, and no one noticed until later. Everyone knows their slice, but the full picture lives nowhere, so bit rot just keeps creeping in.
This feels even worse now that AI agents are pushing large amounts of code and config changes quickly. Things move faster, but shared understanding falls behind even faster.
I’m honestly stuck on how people handle this well in practice. For folks dealing with real production systems, what’s actually helped? Diagrams, docs, tribal knowledge, tooling, something else? Where does it break down?
Comments
Comment by dlcarrier 3 days ago
A laptop computer is extremely complex, but is actively developed and maintained by a small number of people, built on parts themselves developed by a small number of people, many of which are themselves built on parts themselves developed by a small number of people, and so on and so forth.
This works well in electronics design because everything is documented and tested for compliance with that documentation. You'd think this would slow things down, but developing a new generation of a laptop takes fewer man-hours and less calendar time than developing a new generation of any software of similar complexity running on it, even though the laptop is pushing up against the limits of physics. Technical debt adds up really fast.
The top-level designers only have access to what the component manufacturers have published, and not to their internal designs, but that doesn't matter because the publications include correct and relevant data. When a component manufacturer comes out with something new, they use documentation from their own suppliers to design the new product.
As long as each component's documentation is complete and accurate, it will meet the needs of anyone using that component. Diving deeper is only necessary if something is incomplete or inaccurate.
Comment by nitwit005 3 days ago
You then eventually have that same pattern happen with services, where people give up on mapping the full thing out as well.
What I've done for my current team is to list the "downstream" services, what we use them for, who to contact, etc. It only goes one level deep, but it's something that someone can read quickly during an incident.
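A minimal sketch of what such a one-level list could look like if kept as data next to the code rather than in a wiki. This is only an illustration: the service names, purposes, and contact channels below are hypothetical, and the `Downstream` type and `incident_sheet` helper are made up for this example, not taken from any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Downstream:
    """One directly used downstream service (one level deep only)."""
    name: str
    used_for: str
    contact: str


# Hypothetical entries; replace with your team's real dependencies.
DOWNSTREAMS = [
    Downstream("search-index", "product search results", "#search-oncall"),
    Downstream("payments-api", "charging cards at checkout", "#payments-oncall"),
]


def incident_sheet(services):
    """Render a short, alphabetized cheat sheet to read during an incident."""
    ordered = sorted(services, key=lambda s: s.name)
    return "\n".join(
        f"{s.name}: {s.used_for} (contact: {s.contact})" for s in ordered
    )


if __name__ == "__main__":
    print(incident_sheet(DOWNSTREAMS))
```

Keeping the list in the repo means changes to it show up in code review, which may help it stay current, though it still only covers one level of dependencies.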
Comment by kennethops 3 days ago
Comment by gnabgib 3 days ago
ERD / Entity Relationship Diagram https://www.lucidchart.com/pages/er-diagrams
ERM / Entity-Relationship Model https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_mo...
(same-same, ERD is the more common acronym)
Comment by kennethops 3 days ago
Comment by htrp 3 days ago
I'm not sure I've seen any good vendors but I remember seeing a reverse devops tool posted a few days ago that would reverse engineer your VMs into Ansible code. If that got extended to your entire environment, that would almost be an auto documenting process.
Comment by dexdal 3 days ago
Comment by kennethops 3 days ago
I will check that tool out.
Comment by liveoneggs 3 days ago
All of those endpoints should be documented in an environment variable or similar as well.
The breakdown is when you don't instrument the same tooling everywhere.
Documentation is generally out of date by the time you finish writing it, so I don't really bother with much detail there.
Comment by kennethops 3 days ago
Comment by liveoneggs 3 days ago
Always start at the head (what a customer sees -- actually load the website) and work down into each layer.
If something is breaking way downstream and customers don't see it then it doesn't actually matter right now.
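The head-first walk described above could be sketched roughly like this. It is an assumption-heavy illustration, not a real tool: the layer names are hypothetical, each check is just a callable returning True or False (in practice it might load the website or ping a service), and `triage` is a made-up helper name.

```python
from typing import Callable, List, Tuple

# A layer is a (name, check) pair; the check returns True when healthy.
Layer = Tuple[str, Callable[[], bool]]


def _passes(check: Callable[[], bool]) -> bool:
    """Treat an exception raised by a check as a failure."""
    try:
        return bool(check())
    except Exception:
        return False


def triage(layers: List[Layer]) -> List[str]:
    """Check layers ordered from customer-facing head to deepest dependency.

    If the head is healthy, customers see nothing wrong, so return an empty
    list and skip the rest. Otherwise walk down and return every failing
    layer, in top-down order.
    """
    if not layers:
        return []
    head_name, head_check = layers[0]
    if _passes(head_check):
        return []
    failing = [head_name]
    for name, check in layers[1:]:
        if not _passes(check):
            failing.append(name)
    return failing


if __name__ == "__main__":
    # Hypothetical layers with hard-coded results, for illustration only.
    layers = [
        ("website", lambda: False),
        ("api", lambda: True),
        ("database", lambda: False),
    ]
    print(triage(layers))
```

The early return encodes the "doesn't matter right now" rule: a failing downstream layer is only reported when the head itself is failing.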
Comment by amadeuswoo 4 days ago
Comment by kennethops 3 days ago
Comment by amadeuswoo 3 days ago
Comment by linux4dummies 3 days ago
Comment by kennethops 3 days ago
Comment by IceCoffe 3 days ago
Comment by canhdien_15 3 days ago
Comment by BOOSTERHIDROGEN 3 days ago
Comment by canhdien_15 3 days ago
Comment by kennethops 3 days ago