Let’s talk about failure. Why MTTR is more than a four-letter word

    • When (not if) things go wrong, MTTR is more important than MTBF.

      Mean Time Between Failures (MTBF) is a common indicator of system quality. The longer the system can run without failing, the better it is, right? Well, yes and no.

      Back in the day, when fixing a bug required sending a floppy disk or a CD to all your customers, this was certainly an important factor. But today, when software can be fixed by downloading a patch, the cost of fixing bugs is much lower. The more important metric to watch today is Mean Time to Recovery (MTTR): how quickly your system can be fixed when something goes wrong. Because things will go wrong. Even the most mission-critical software has an allowance for error, with high availability provided at great cost.
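      To make the two metrics concrete, here is a minimal sketch of how MTTR and MTBF could be computed from an outage log. The outage timestamps are hypothetical, and the observation window is an assumption for illustration.

      ```python
      from datetime import datetime, timedelta

      # Hypothetical outage log: (failure start, service restored) pairs.
      outages = [
          (datetime(2021, 6, 8, 9, 47), datetime(2021, 6, 8, 10, 32)),    # ~45 minutes
          (datetime(2021, 7, 22, 15, 10), datetime(2021, 7, 22, 17, 10)), # ~2 hours
      ]

      def mttr(outages):
          """Mean Time to Recovery: average time from failure to restoration."""
          total_downtime = sum((end - start for start, end in outages), timedelta())
          return total_downtime / len(outages)

      def mtbf(outages, observation_window):
          """Mean Time Between Failures: total uptime divided by failure count."""
          total_downtime = sum((end - start for start, end in outages), timedelta())
          return (observation_window - total_downtime) / len(outages)

      window = timedelta(days=60)  # assumed measurement period
      print(f"MTTR: {mttr(outages)}")
      print(f"MTBF: {mtbf(outages, window)}")
      ```

      Note how halving the outage durations halves MTTR while barely moving MTBF: for most modern systems, the first number is the one worth optimising.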

      Recently there have been two large-scale outages that made front-page news. First, in June 2021, a bug at the content delivery provider Fastly brought down dozens of media websites around the globe; then, the following month, another content delivery provider (Akamai) had an issue that caused outages for major banks and airlines.

      No matter how robust you think your system is, or how much you have tested it, sooner or later it will need to be repaired. And when it does, you want the fix to be quick and easy. Some organisations still struggle to release to production even once every three months, which means a bug fix can take three months to reach customers. High-performing teams, by contrast, release to production multiple times a day, so a fix can ship before most customers even notice that anything is amiss. More frequent releases also make it easier to isolate the root cause of a failure: if you release three months’ worth of work in a single go, it is much harder to identify the guilty code.

      The issue at Akamai was fixed within a couple of hours, and Fastly had their issue fixed within 45 minutes. There were obvious and widespread consequences from both outages, but imagine if the timeframe had been days or weeks rather than hours. The ability to recover is so important that Google regularly conducts Disaster Recovery Testing (DiRT) exercises, in which they deliberately (but in a controlled manner) bring down parts of their production environment to practise recovering from failure.

      One of the key attributes of a system when it comes to investigating failures is observability (often abbreviated as o11y) and this is an area where the Appian low-code automation platform shines. When something goes wrong, a system admin can immediately find the process instance that failed, see the values of all the variables at the time of the failure, and even trace the process all the way back to the start and see how the variables were updated along the way. Compared with filtering through log files, the time savings are massive. And as we’ve seen from the Akamai and Fastly experiences, it’s not only time that’s saved, but organisational reputation.

  • About the Author

    Oskar Kindbom


    Oskar is the lead for Procensol’s test practice. He is based in Brisbane, Australia and has extensive Test Lead experience for government and commercial enterprises in Australia and Sweden.
