DO Ideas 2

[Monitoring] Different metrics aggregations

Currently, all metrics seem to be aggregated by averaging over the time period. However, an average can hide many different patterns (see the book "How to Lie with Statistics" for in-depth examples).

Let's take one example among many. Suppose you have a cron job that processes data every hour, and each run lasts about 1 minute. As your database grows, the job eats up more and more RAM: at first it peaked at 50% of RAM, but now it peaks at 90%. Soon it will need more RAM than is available and jam the server (you know how an OOM goes: basically everything freezes for a long time, and if cron keeps starting new jobs it just never ends).

If things go this way, you're only going to see a tiny increase on your RAM graph, especially on the daily or weekly graph, because each point is an average of RAM usage over its time window. Yet the situation is critical, because the server is about to fall over.
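To put rough numbers on this, here is a minimal sketch in Python. It assumes one RAM sample per minute and an idle baseline of 30% RAM; both are made-up values for illustration, not something taken from the actual monitoring service.

```python
from statistics import mean

# Assumption for illustration only: the server idles at 30% RAM
# and the monitoring agent records one sample per minute.
BASELINE = 30.0


def hourly_average(peak, baseline=BASELINE):
    """Average RAM over one hour: 59 idle minutes plus 1 minute at `peak`."""
    samples = [baseline] * 59 + [peak]
    return mean(samples)


before = hourly_average(50.0)  # the cron job used to peak at 50%
after = hourly_average(90.0)   # now it peaks at 90%

print(f"hourly average before: {before:.2f}%")
print(f"hourly average after:  {after:.2f}%")
```

Under these assumptions the peak nearly doubles while the hourly average moves by less than one percentage point, which is easy to miss on a graph.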

As a general rule, just look at your graphs: the coarser the granularity, the lower the peaks.

What would help is for the graphs to have a few more lines: min, 5th percentile, median, 95th percentile and max are the ones I have in mind.
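As a rough sketch of what those extra lines would contain, the following Python computes each aggregate for a single time window. The `summarize` and `percentile` helpers, the nearest-rank percentile method, the 30% baseline and the one-sample-per-minute rate are all assumptions for illustration; the real service may sample and aggregate differently.

```python
import math
from statistics import mean, median


def percentile(ordered, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Return the aggregates proposed above for one time window."""
    ordered = sorted(samples)
    return {
        "min": ordered[0],
        "p5": percentile(ordered, 5),
        "median": median(ordered),
        "mean": mean(ordered),
        "p95": percentile(ordered, 95),
        "max": ordered[-1],
    }


# Hypothetical hour of data: 59 idle minutes at 30% RAM, 1 minute at 90%.
samples = [30.0] * 59 + [90.0]
summary = summarize(samples)

for name, value in summary.items():
    print(f"{name:>6}: {value:.1f}%")
```

With this made-up data, a spike that lasts 1 minute out of 60 only shows up on the max line: the mean barely moves and even the 95th percentile stays at the baseline. The percentile lines would help more with spikes that cover a larger share of the window.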

Of course, the average currently shown is already helpful, but only for a few use cases such as detecting a steady load. Many other problems you'd want to catch early are completely left out.

  • Rémy Sanchez
  • Sep 11 2018