The one with the canary - Engine Room 2020

++ This story is about a much-celebrated failure in the history of FT.com, but at the time it was actually pretty embarrassing for me, a bit distressing, something that put me on the back foot for a few weeks - these sorts of incidents tend to make me overly cautious and keen not to repeat the mistake. Having been in this situation a few times, you realise that over time the feeling goes away; you forget how you felt in the moment and instead carry forward the lessons you learnt. I think sometimes when talking about failure, perspective is a useful thing to keep in mind.

++ I have a mental model of how badly I've messed up. So: I haven't accidentally made Europe uninhabitable for a few thousand years, never started a war by taking a wrong turn, never crashed a $500m spaceship. Each of those started with a simple, innocent oversight or a noble attempt to do one thing that had some unintended consequences. So, in the scheme of things, I always think the thing I've just broken PROBABLY isn't that bad in the history of world events - but it can still be personally, if temporarily, devastating.

++ So, anyway, what I meant to do was this. We'd built a new FT website - it had a front page, articles, a paywall - and a few people had started using it - it was going great. And we'd made some recent changes to the release process - every pull request would fire up a temporary new server, the changes would be deployed to it, the system would run some tests over the new stack, and the build would be marked as a 'pass' if everything was OK, ready to be deployed into production by an engineer. But we thought we could do better: instead of testing this temporary build in a QA environment, we could ship that version automatically into production and test it out for real with a small number of users. This is known as canary releasing, after the bird that miners took down the mine to act as an early warning system. Our canary release would be sent into production to see how it survived. Exciting! What could possibly go wrong?

+++ So, as you can see - you can simulate what the CDN is doing by sending a header containing the canary build hostname to ft.com, which will forward your request to that Heroku server and return whatever it finds.
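The command was something along these lines - the header name and the Heroku app hostname below are made-up placeholders for illustration, not the exact values we used:

    # Ask the ft.com edge to route this one request to a canary build
    # instead of the production backend.
    # (Header name and app hostname are illustrative placeholders only.)
    curl -s https://www.ft.com/ \
      -H 'X-Canary-Host: ft-front-page-pr-123.herokuapp.com'

The header value is just the hostname of the temporary Heroku app that the pull request spun up, so the CDN knows which backend to proxy the request to.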
++ And… voilà - your canary version of the front page staring back at you. We could automate our whole release process like this - every commit would automatically make its way into production.

++ I was quite pleased, and I probably went to the canteen for a treat at that point. But before I celebrated too much I thought it would be a good idea to test some of the safety catches. Being able to proxy any website into the ft.com domain was a bit of a security concern, so my first attempt at mitigation was to rely on the CDN's Vary headers. The canary hostname would form part of the cache key, keeping it separate from the production version of the page - in theory safe enough, as long as we controlled the list of hostnames.

++ So we can test that again with our curl command.

++ But this time something different happened. The next thing I knew, someone from the newsroom was at my desk, wondering why the ft.com homepage had disappeared and been replaced with the Guardian's one. It was all a bit of a panic. Of course I knew my little test was responsible, and I became quite flustered. Obviously it's best to be open and honest about these things, so I admitted that my testing had caused it.

But the key thing here, as with many incidents, is not to worry too much about why exactly something went wrong - it's to get it back into an acceptable state - so flushing the CDN cache seemed like the best thing to do. I'd messed up the cache key somehow. I think the incident lasted a couple of minutes.

++ As with lots of incidents, sometime afterwards, when things have calmed down, it's good to reflect on the lessons. In retrospect, my lack of comms and not using the front page to test experimental stuff should have been at the forefront of my thinking. But every organisation that makes software makes mistakes. We can't help occasionally releasing a bug into production. And when that happens customers are confused or angry, stakeholders are panicking, everyone wants to know what's going on. It can be very stressful, but it's always worth remembering you are in a supportive environment. There's a great blog post by an engineer called Matt Wynne. He talks about the two ways to react in the aftermath of such an incident. The choice you make speaks volumes about the culture of the team and the place you work.

++ The first reaction, he says, is often instinctive and defensive: how can we possibly avoid this happening in the future? So we put safety nets in place - maybe create more roles dedicated to upholding quality, add additional checks and process, lock all the systems down, add more internal development environments to improve quality. All the things we can do to never make a mistake EVER again. But over time this adds up to a huge overhead - extra people, process, committees, rules that nobody can quite remember the reason for. These things become a tax. It creates a sense of distrust within the team - a team optimised for navigating bureaucracy, not for delivering value. And it creates a fear of failure, because we've invested so much in avoiding mistakes.

++ The second reaction, I think, is more pragmatic. People will always make mistakes - they're an inevitable byproduct of a confident, experimental business - so the challenge is really not to prevent them completely, but to make them easy to spot and quick and simple to fix, and to design the system architecture to accommodate mistakes and limit their impact. The decisions we took in the early days of rebuilding ft.com - and the decisions you can take in your teams today - were all made with this latter mindset. I'm talking about things like investing in isolated services to limit the impact of individual things breaking, automated failover to allow us to break a whole region, monitoring and alerts to spot technical and customer problems, and ensuring everyone has production access to make investigation easy. And teams like O&R have built a fantastic incident management process that supports you during the problem and brings people together to reflect once it's over. All in an attempt to balance risk and productivity - getting that right is a really key competitive advantage for a company. So in my experience the FT gets this: a supportive culture, one that is comfortable with failure.

++ So if you do end up replacing the front page with The Guardian's one, or even cause a nuclear reactor to melt down, remember - as the warm glow of radiation envelops your body - that it's natural to feel sheepish for a bit. But you aren't alone, people have been there before, and when it comes to summarising the lessons and actions, think about what you can do to make your next mistake easier to spot and fix. Thanks for listening.