I was meeting a startup CTO for coffee the other morning. As he arrived, he apologised and said that for the next 10 minutes or so he’d have to ignore me.
Why?
“Prod is down.”
I wasn’t offended. I’ve been there plenty of times.
When your production environment is down, when customers can’t access the service they’re paying you for, only one thing matters: getting it back up and running.
It’s not the time to refactor.
It’s not the time to get the deployment process into a better pipeline.
It’s not the time to release the new version of the API and see if that fixes everything.
And it’s certainly not the time to enjoy a coffee and have a chat.
He sent a few emails, and then he was back in the room. His team were on it. They knew what the problem was, and had some clear steps to improve their system for next time.
Here’s the thing.
Prod being down sucks, but it’s also an important opportunity to level up your product.
Those changes that were inappropriate while the app was broken might well be essential once it’s back up and running.
Running a post-mortem, or holding a 5 Whys session, gives you the opportunity to develop a deeper understanding of how you operate on a socio-technical level.
Rather than failures, we can think of issues like these as tears that offer a chance to repair and build back stronger, just as muscles do after exercise. It’s an essential step in creating an antifragile system.
It’s not what you do when prod is down that matters.
It’s what you do after that makes all the difference.