Let’s talk about failure. Why MTTR is more than a four-letter word

Mean Time Between Failures (MTBF) is an indicator that is often used to measure the quality of a system. The longer the system can run without failing, the better it is, right? Well, yes and no.


Back in the day when fixing a bug required sending a floppy disk or a CD to all your customers, this was certainly an important factor. But today, when software can be fixed by downloading a patch, the cost of fixing bugs is a lot lower. The important metric to keep an eye on today is instead the Mean Time to Recovery (MTTR), i.e. how quickly your system can be fixed when something goes wrong. Because things will go wrong. Even the most mission-critical software has an allowance for error, with high availability provided at great cost.
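To make the trade-off concrete, here is a minimal sketch using the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR), with MTBF taken as the mean uptime between failures. The function name and the figures for the two systems are hypothetical, chosen only for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a system is up, given mean uptime between
    failures (MTBF) and mean time to recovery (MTTR), both in hours."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system A: fails about once every 90 days, takes a day to fix.
rare_but_slow = availability(mtbf_hours=90 * 24, mttr_hours=24)

# Hypothetical system B: fails about once every 30 days, fixed in 30 minutes.
frequent_but_fast = availability(mtbf_hours=30 * 24, mttr_hours=0.5)

print(f"Rare failures, slow recovery:     {rare_but_slow:.3%}")
print(f"Frequent failures, fast recovery: {frequent_but_fast:.3%}")
```

Under these assumed numbers, the system that fails three times as often still comes out ahead (roughly 99.93% against 98.9%), because each failure costs so little downtime.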

Recently there have been two large-scale outages that have made front-page news. First, there was the bug in the content delivery provider Fastly that brought down dozens of media websites around the globe, and then another content delivery provider (Akamai) had an issue that caused outages for major banks and airlines.

No matter how robust you think your system is, or how much you have tested it, sooner or later it will need to be repaired. And when it does, you want it to be quick and easy to fix. Some organisations are still struggling to do releases to production every three months. That means you will have to wait three months to get a bug fixed. But high-performing teams do multiple releases to production every day. This means that a fix can be released before most customers even notice that anything is amiss. With more frequent releases it is also easier to isolate the root cause of a failure. If you release three months’ worth of work in a single go, it will be much harder to identify the guilty code.

The issue at Akamai was fixed within a couple of hours and Fastly had their issue fixed within 45 minutes. There were obvious and widespread consequences from both outages, but imagine if the timeframe had been days or weeks rather than hours. The ability to recover is so important that Google regularly conduct Disaster Recovery Testing (DiRT) exercises, where they deliberately (but in a controlled manner) bring down certain parts of their production environment to practise their ability to recover.

One of the key attributes of a system when it comes to investigating failures is observability (often abbreviated as o11y), and this is an area where the Appian low-code automation platform shines. When something goes wrong, a system admin can immediately find the process instance that failed, see the values of all the variables at the time of the failure, and even trace the process all the way back to the start to see how the variables were updated along the way. Compared with sifting through log files, the time savings are massive. And as we’ve seen from the Akamai and Fastly experiences, it’s not only time that’s saved, but organisational reputation.
