Do monitoring tools still miss early signals before incidents?

Posted by gabdiax 3 hours ago


I'm curious how teams detect early signals of infrastructure problems before they turn into incidents.

In some environments I've seen cases where monitoring alerts arrive only after the system is already degrading.

Examples:

- disk usage spikes faster than expected
- network latency gradually increases
- services degrade slowly before failing
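For the "spikes faster than expected" case, the usual trick is to alert on the projected trend rather than the current value. A minimal sketch (with invented sample data) that fits a linear trend to recent disk-usage readings and estimates time-to-full:

```python
# Sketch: project disk usage forward with a simple least-squares line
# so a "will fill in N hours" alert can fire before any static
# threshold does. The sample history below is made up.

def hours_until_full(samples, capacity_pct=100.0):
    """samples: list of (hour, used_pct) observations, oldest first.
    Returns estimated hours until the fitted trend reaches capacity,
    or None if usage is flat or decreasing."""
    n = len(samples)
    xs = [h for h, _ in samples]
    ys = [u for _, u in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    den = sum((x - mean_x) ** 2 for x in xs)
    slope = num / den
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    return (capacity_pct - intercept) / slope - xs[-1]

# Usage: disk climbing roughly 2%/hour from 70% full
history = [(0, 70.0), (1, 72.1), (2, 74.0), (3, 76.2), (4, 78.0)]
print(hours_until_full(history))  # roughly 11 hours of headroom left
```

Prometheus users get the same idea built in via `predict_linear()` in an alert rule; the point is that the signal is the slope, not the level.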

Tools like Datadog, Zabbix, Prometheus, etc. are great for alerting, but they still feel mostly reactive.

How do you deal with this in your infrastructure?

Do you rely more on:

- anomaly detection
- predictive monitoring
- custom scripts
- or just good incident response?
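To make the first option concrete: one of the simplest "anomaly detection" approaches is a rolling z-score, which flags points that drift well outside recent behavior instead of crossing a fixed threshold. A self-contained sketch (window size, cutoff, and the latency series are arbitrary choices, not anyone's production config):

```python
# Sketch: rolling z-score anomaly detection over a latency series.
# A point is flagged when it sits more than `cutoff` standard
# deviations from the trailing window's mean.
import statistics
from collections import deque

def zscore_anomalies(series, window=10, cutoff=3.0):
    """Yield (index, value) for points far outside the trailing window."""
    recent = deque(maxlen=window)
    for i, v in enumerate(series):
        if len(recent) == recent.maxlen:
            mean = statistics.fmean(recent)
            stdev = statistics.pstdev(recent)
            if stdev > 0 and abs(v - mean) / stdev > cutoff:
                yield i, v
        recent.append(v)

latency_ms = [20, 21, 19, 20, 22, 20, 21, 19, 20, 21, 20, 55, 21, 20]
print(list(zscore_anomalies(latency_ms)))  # flags the 55 ms spike
```

The trade-off versus threshold alerts: this catches "unusual for this service" without per-service tuning, but a slow, steady degradation inflates the window's own baseline, so it pairs best with a trend check like the slope idea above.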

I'm trying to understand what actually works in real-world environments.

Comments

Comment by zippyman55 3 hours ago

My team was responsible for system administration of a large-scale HPC center. We seemed to get blamed, incorrectly, for a lot of sloppy user code. I implemented statistical process controls for job aborts and reported the results as mean-time-to-failure rates over the years. It was pretty cool, as I could respond with failure rates for each of several thousand different programs.

What did not work was changing the culture to get people to improve their code. But I was able to push back hard when my team was arbitrarily blamed for someone else's bad code. It was easy to show that a job's failure rate was increasing and link it to a recent upgrade or change. Still, I felt I was often just shining a flashlight at an issue and trying to encourage a responsible party to take ownership.
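The classic SPC tool for this kind of per-program abort tracking is a p-chart: each period's abort fraction is compared against a 3-sigma upper control limit derived from the overall rate. A rough sketch of the idea (the weekly counts and the upgrade scenario are invented, not from the comment above):

```python
# Sketch: p-chart over per-period job abort fractions. A period is
# flagged when its abort rate exceeds the 3-sigma upper control limit,
# e.g. the week right after an upgrade or environment change.
import math

def p_chart_violations(counts):
    """counts: list of (aborts, total_jobs) per period.
    Returns indices of periods above the 3-sigma upper control limit."""
    total_aborts = sum(a for a, _ in counts)
    total_jobs = sum(t for _, t in counts)
    p_bar = total_aborts / total_jobs  # center line: overall abort rate
    flagged = []
    for i, (aborts, total) in enumerate(counts):
        # limit widens for small samples, narrows for large ones
        ucl = p_bar + 3 * math.sqrt(p_bar * (1 - p_bar) / total)
        if aborts / total > ucl:
            flagged.append(i)
    return flagged

# e.g. weekly abort counts for one program; week 4 follows an upgrade
weekly = [(3, 200), (2, 180), (4, 210), (3, 190), (19, 205)]
print(p_chart_violations(weekly))  # flags week 4
```

Because the limit is computed from the process's own history, "this program got worse after the change" becomes a statistical statement rather than an argument, which is exactly the push-back described above.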

Comment by gabdiax 3 hours ago

That's really interesting. Using statistical process control for failure rates in HPC systems sounds like a very solid approach.

In your experience, were there usually early signals in the metrics before job failures increased? For example, patterns like latency changes, resource saturation, or network anomalies.

I'm trying to understand whether those signals appear consistently enough to detect issues before incidents actually happen.

Comment by gabdiax 3 hours ago

One thing I'm particularly curious about is whether teams see early signals in metrics or logs before incidents actually happen.

For example:

- unusual latency patterns
- slow resource saturation
- network anomalies

Do people actively monitor these patterns or mostly rely on threshold alerts?