Do monitoring tools still miss early signals before incidents?
Posted by gabdiax 3 hours ago
I'm curious how teams detect early signals of infrastructure problems before they turn into incidents.
In some environments I've seen cases where monitoring alerts arrive only after the system is already degrading.
Examples:
- disk usage spikes faster than expected
- network latency gradually increases
- services degrade slowly before failing
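To make the first example concrete, here's a minimal sketch of the "predict before it breaks" idea: fit a linear trend to recent disk-usage samples and extrapolate the time until the disk is full. All names and numbers are illustrative, not from any particular tool.

```python
def hours_until_full(usage_pct, interval_hours=1.0, capacity=100.0):
    """Least-squares slope over recent usage samples, extrapolated to capacity.

    usage_pct: disk-usage percentages, one sample per interval_hours.
    Returns None if usage is flat or shrinking (illustrative helper).
    """
    n = len(usage_pct)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(usage_pct) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, usage_pct))
    den = sum((x - x_mean) ** 2 for x in xs)
    slope = num / den  # percentage points per sample
    if slope <= 0:
        return None
    return (capacity - usage_pct[-1]) / slope * interval_hours

# Growing 2% per hour, currently at 80%: 10 hours until full.
print(hours_until_full([70, 72, 74, 76, 78, 80]))  # → 10.0
```

Alerting on "time until full < 24h" instead of "usage > 90%" is the basic shift from reactive thresholds to a (crude) forecast.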
Tools like Datadog, Zabbix, and Prometheus are great at alerting, but they still feel mostly reactive.
How do you deal with this in your infrastructure?
Do you rely more on:
- anomaly detection
- predictive monitoring
- custom scripts
- or just good incident response?
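For contrast with plain threshold alerts, the anomaly-detection option above can be sketched in a few lines: flag a sample when its z-score against a trailing window is extreme, so the "normal" level is learned from recent data rather than hard-coded. This is a toy sketch on a flat list of latency samples; the window and threshold values are assumptions.

```python
from statistics import mean, stdev

def zscore_anomalies(samples, window=30, threshold=3.0):
    """Indices whose z-score vs. the trailing window exceeds threshold."""
    anomalies = []
    for i in range(window, len(samples)):
        recent = samples[i - window:i]
        mu, sigma = mean(recent), stdev(recent)
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Latency stable around 20-22 ms, then a sudden jump to 80 ms.
latencies = [20.0 + (i % 3) for i in range(60)] + [80.0]
print(zscore_anomalies(latencies))  # → [60]
```

A fixed threshold like "latency > 100 ms" would have stayed silent here; the relative deviation is what makes the jump visible early.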
I'm trying to understand what actually works in real-world environments.
Comments
Comment by zippyman55 3 hours ago
Comment by gabdiax 3 hours ago
In your experience, were there usually early signals in metrics before job failures increased? For example, patterns like latency changes, resource saturation, or network anomalies.
I'm trying to understand whether those signals appear consistently enough to detect issues before incidents actually happen.
Comment by gabdiax 3 hours ago
For example:
- unusual latency patterns
- slow resource saturation
- network anomalies
Do people actively monitor these patterns or mostly rely on threshold alerts?