The one with the canary - Engine Room 2020

++ This story is about a much-celebrated failure in the history of FT.com, but at the time it was actually pretty embarrassing for me, a bit distressing, something that put me on the back foot for a few weeks - these sorts of incidents tend to make me overly cautious and keen not to repeat the mistake. Having been in this situation a few times, you realise that over time the feeling goes away; you forget how you felt in the moment and instead carry forward the lessons you learnt. I think sometimes when talking about failure, perspective is a useful thing to keep in mind.

++ I have a mental model of how badly I've messed up. So: I haven't accidentally made Europe uninhabitable for a few thousand years, never started a war by taking a wrong turn, never crashed a $500m spaceship. Each of those started with a simple, innocent oversight or a noble attempt to do one thing that had some unintended consequences. So, in the scheme of things, I always think the thing I've just broken PROBABLY isn't that bad in the history of world events - but it can still be personally, if temporarily, devastating.

++ So, anyway, what I meant to do was this. We'd built a new FT website - it had a front page, articles, a paywall - and a few people had started using it - it was going great. And we'd made some recent changes to the release process - every pull request would fire up a temporary new server, the changes would be deployed to it, the system would run some tests over the new stack, and the build would be marked as a 'pass' if everything was OK, ready to be deployed into production by an engineer. But we thought we could do better: instead of testing this temporary build in a QA environment, we could ship that version automatically into production and test it out for real with a small number of users. This is known as canary releasing, after the bird that miners took down the mine to act as an early warning system. Our canary release would be sent into production to see how it survived. Exciting! What could possibly go wrong?

+++ So, as you can see - you can simulate what the CDN is doing by sending a header containing the canary build hostname to ft.com, which will forward your request to that Heroku server and return whatever it finds.
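The command was something along these lines - the header name and the Heroku app hostname below are made-up placeholders for illustration, not the exact values we used:

    # Ask the ft.com edge to route this one request to a canary build
    # instead of the production backend.
    # (Header name and app hostname are illustrative placeholders only.)
    curl -s https://www.ft.com/ \
      -H 'X-Canary-Host: ft-front-page-pr-123.herokuapp.com'

The header value is just the hostname of the temporary Heroku app that the pull request spun up, so the CDN knows which backend to proxy the request to.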
++ And… voilà - your canary version of the front page staring back at you. We could automate our whole release process like this - every commit would automatically make its way into production.

++ I was quite pleased, and I probably went to the canteen for a treat at that point. But before I celebrated too much I thought it would be a good idea to test some of the safety catches. Being able to proxy any website into the ft.com domain was a bit of a security concern, so my first attempt at mitigation was to rely on the CDN's Vary headers. The canary hostname would form part of the cache key, keeping it separate from the production version of the page - in theory safe enough, as long as we controlled the list of hostnames.

++ So we can test that again with our curl command.

++ But this time something different happened. The next thing I knew, someone from the newsroom was at my desk, wondering why the ft.com homepage had disappeared and been replaced with the Guardian's one. It was all a bit of a panic. Of course I knew my little test was responsible, and I became quite flustered. Obviously it's best to be open and honest about these things, so I admitted that my testing had caused it.

But the key thing here, as with many incidents, is not to worry too much about why exactly something went wrong - it's to get it back into an acceptable state - so flushing the CDN cache seemed like the best thing to do. I'd messed up the cache key somehow. I think the incident lasted a couple of minutes.

++ As with lots of incidents, sometime afterwards, when things have calmed down, it's good to reflect on the lessons. In retrospect, my lack of comms and not using the front page to test experimental stuff should have been at the forefront of my thinking. But every organisation that makes software makes mistakes. We can't help occasionally releasing a bug into production. And when that happens customers are confused or angry, stakeholders are panicking, everyone wants to know what's going on. It can be very stressful, but it's always worth remembering you are in a supportive environment. There's a great blog post by an engineer called Matt Wynne. He talks about the two ways to react in the aftermath of such an incident. The choice you make speaks volumes about the culture of the team and the place you work.

++ The first reaction, he says, is often instinctive and defensive: how can we possibly avoid this happening in the future? So we put safety nets in place - maybe create more roles dedicated to upholding quality, add additional checks and process, lock all the systems down, add more internal development environments to improve quality. All the things we can do to never make a mistake EVER again. But over time this adds up to a huge overhead - extra people, process, committees, rules that nobody can quite remember the reason for. These things become a tax. It creates a sense of distrust within the team - a team optimised for navigating bureaucracy, not for delivering value. And it creates a fear of failure, because we've invested so much in avoiding mistakes.

++ The second reaction, I think, is more pragmatic. People will always make mistakes - they're an inevitable byproduct of a confident, experimental business - so the challenge is really not to prevent them completely, but to make them easy to spot and quick and simple to fix, and to design the system architecture to accommodate mistakes and limit their impact. The decisions we took in the early days of rebuilding ft.com - and the decisions you can take in your teams today - were all made with this latter mindset. I'm talking about things like investing in isolated services to limit the impact of individual things breaking, automated failover to allow us to break a whole region, monitoring and alerts to spot technical and customer problems, and ensuring everyone has production access to make investigation easy. And teams like O&R have built a fantastic incident management process that supports you during the problem and brings people together to reflect once it's over. All in an attempt to balance risk and productivity - getting that right is a really key competitive advantage for a company. So in my experience the FT gets this: a supportive culture, one that is comfortable with failure.

++ So if you do end up replacing the front page with The Guardian's one, or even cause a nuclear reactor to melt down, remember - as the warm glow of radiation envelops your body - that it's natural to feel sheepish for a bit. But you aren't alone, people have been there before, and when it comes to summarising the lessons and actions, think about what you can do to make your next mistake easier to spot and fix. Thanks for listening.