Shipping code at the Guardian

August 2013

The Guardian frontend team deploys its infrastructure and software several times a day. We've invested quite a bit in tools to help us do that efficiently. Here's a rundown of how we do that.

Ship it

We wanted to make deploying software simple and accessible to everyone on the team.

For reference, we currently have around ten developers on the team and a set of around 12 applications (e.g. articles, frontpages) that make up the user interface. Over August 2013 we averaged about 29 deployments per day (effectively about 2.5 deployments of our entire stack), with an average of 5.2 commits per deployment. The general ethos is to keep changes small and incremental and get them running in production quickly.

Today, anyone can deploy the suite of applications running the public face of theguardian.com at any time by typing this simple command at their terminal,

gu deploy --prod

Several things happen after this point.

A request is made to our deployment API, which creates a fresh set of hardware for the deployment (we run on AWS). It then boots up the AMIs and runs the install scripts (puppet) and shuttles the latest software artifacts on to it (basically a single JAR). It then runs a health check and, if successful, takes the old boxes out of the load balancers.
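
A rough sketch of that sequence, in Python. It is an illustration only, not the deployment API itself: every name here (`Box`, `LoadBalancer`, `deploy`, and the `provision` and `health_check` callables) is hypothetical, and the real work of creating hardware on AWS and running puppet is left to whatever `provision` does.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Box:
    """Hypothetical stand-in for one freshly booted server."""
    name: str
    artifact: str


@dataclass
class LoadBalancer:
    """Hypothetical stand-in for a load balancer's pool of servers."""
    boxes: List[Box] = field(default_factory=list)


def deploy(
    lb: LoadBalancer,
    artifact: str,
    provision: Callable[[str], List[Box]],
    health_check: Callable[[Box], bool],
) -> bool:
    """Bring up new boxes and retire the old ones only if the new ones are healthy."""
    old_boxes = list(lb.boxes)
    new_boxes = provision(artifact)  # boot, install, copy the artifact on

    if not new_boxes or not all(health_check(box) for box in new_boxes):
        return False  # the old boxes keep serving traffic untouched

    lb.boxes.extend(new_boxes)  # put the new boxes in service
    for box in old_boxes:       # then take the old ones out
        lb.boxes.remove(box)
    return True
```

The point of the ordering is that a failed health check leaves production exactly as it was, so a bad build costs time rather than an outage.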

The end result is that 15 minutes after telling the system to deploy, your new feature is running in production in front of millions of people.

Likewise, reconfiguring the platform after updating the configuration is just as quick and as simple,

gu cloudformation update --prod --name frontend

And if you fancy, you can remove our production stack from the Internet in a snatch too,

gu cloudformation destroy --prod

Don't do that though.

This has eliminated the rigmarole and theatrics of a typical release process we've experienced in other projects and at previous organisations.

Along the way we've learnt several things,

The person writing the code is the person responsible for shepherding it in to the production environment. This means no throwing the code at other people to deal with, which means learning more about the whole delivery cycle. Developers, for the most part, like this additional responsibility.
It can eat in to your day - the hour you spent on Tuesday fixing a broken server, the time you spend nannying the deployment, the constant eye on the graphs hovering above your head, and so on...
Anything more than a simple bug fix must come wrapped in a feature flag and/or AB test so it can be enabled/disabled outside of a release cycle. It's permissible to break things, but you'll want to un-break them quicker than it takes to notice and roll out a fix, or you'll be falling over the person who is deploying after you. Rolling backwards is a worst case scenario.
Developers must have complete access to the environments in order to aid the deployment. It's no good being told to take responsibility for the platform and not feel in control of it. Having access to any nooks and crannies is essential, so everyone is granted sudoers access on production.
Just because you can deliver things quickly doesn't mean the rest of the business always wants that. Oddly, on several occasions, we've shipped features that have quietly sat behind a flag for some time while some other business function (a review, a conversation, an email ...) is carried out.
Visibility (logging, metrics etc.) is essential to help people understand what state the software and infrastructure are in.

I'll expand on visibility in the next section.
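
To make the feature flag point above concrete, here is a minimal sketch in Python. The `Switches` class, the `render_article` function and the `new-comments` flag name are all invented for illustration; a real flag store would live outside the application process so it can be changed without a deploy.

```python
from typing import Dict


class Switches:
    """Hypothetical in-memory store of on/off feature flags."""

    def __init__(self) -> None:
        self._state: Dict[str, bool] = {}

    def set(self, name: str, enabled: bool) -> None:
        self._state[name] = enabled

    def is_on(self, name: str) -> bool:
        # Unknown flags default to off, so new code ships dark.
        return self._state.get(name, False)


def render_article(switches: Switches) -> str:
    """Pick the new or old code path at request time, not at deploy time."""
    if switches.is_on("new-comments"):  # hypothetical flag name
        return "article with new comments"
    return "article with old comments"
```

Both code paths are in production at once, so turning a broken feature off is a flag change rather than another release.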

Insight

If you give everyone the responsibility of releasing the code, we've learnt that you need to invest in tools that give everyone in the team visibility of what is happening. Essentially, "has the change I've just made worked and have I left the system in a good state for the next deployment?"

Our strategy here is threefold,

Making the system state very obvious and accessible.
Having some thresholds to know when the system state has become upset.
Notifying people when those thresholds are crossed.

System state

Although we have access to thousands of different metrics it's important to distill general system health down to a handful of key indicators. More than anything else we care about the performance of the end-user experience, so the most critical errors are the user-facing ones. For example, quite often our back-end system dies or slows down without any user-facing consequences - these are not the things you want to wake the whole team up at 3am for.

Obviously the KPIs correspond with the key purpose of the software we've written - 'can someone arriving at the Guardian website read the news' - so the metrics are fairly self-selecting.

Primarily we use these three things to determine general system health:-

Load balancer latency (cloudwatch) - Time elapsed after the load balancer receives a request until it receives the corresponding response. The whole system will struggle if the load balancer can't process incoming requests quickly.
CDN errors from Europe & US (fastly) - Errors on our CDN, in two key commercial and editorial territories, means errors to the end-user.
CDN hit & miss ratio (fastly) - A high hit ratio means i) we have visitors, ii) traffic is being kept off the origin servers.
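
As a small worked example of the last indicator, a hit ratio can be computed like this in Python. It assumes the ratio is hits divided by hits plus misses; the `hit_ratio` function is made up for illustration, and the way the CDN itself defines and reports the figure may differ.

```python
def hit_ratio(hits: int, misses: int) -> float:
    """Fraction of cacheable requests answered from the CDN cache."""
    total = hits + misses
    if total == 0:
        return 0.0  # no traffic at all, which is itself worth noticing
    return hits / total
```

With 90 hits and 10 misses the ratio is 0.9, meaning only one request in ten reaches the origin servers.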

We have several secondary indicators that also help identify serious failures:-

Availability of hosts (pingdom) - Fairly standard external uptime and response time monitoring.
Production deployment health (internal) - Did the last deployment succeed?
Page views p/min (internal) - How many requests p/sec are being received? If this is much lower than average something isn't right.
JavaScript errors (internal) - Caught on the client and shipped back via a beacon. A growing proportion of our application is written in JavaScript so spikes in errors mean users are seeing broken features.
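
The "much lower than average" check on page views could be sketched like this in Python. The function name, the shape of the input and the 50% cut-off are all assumptions made for the example, not figures taken from our tooling.

```python
from statistics import mean
from typing import Sequence


def traffic_looks_low(
    recent_counts: Sequence[int],
    current_count: int,
    fraction: float = 0.5,  # hypothetical cut-off, not a figure from this article
) -> bool:
    """True if the current count is below a fraction of the recent average."""
    if not recent_counts:
        return False  # nothing to compare against yet
    return current_count < fraction * mean(recent_counts)
```

In practice the baseline would need to account for time of day and day of week, which this sketch ignores.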

Arriving at these metrics has been a process of understanding what matters over time. It has evolved along with the architecture and our experience of past incidents.

Dashboards

Sometimes the only fun part of a project is making a dashboard. Not this project though, but we still needed a dashboard.

Each service I mentioned above (fastly, cloudwatch, pingdom, and the several internal tools we use) has an API that lets us extract the information it collects and place it on a communal dashboard. Here's what our dashboard looked like yesterday at 10:30pm.

Aside from the failed production deploy everything looks calm. We decided some of the information works better as a red or green boolean state block (red bad, green good) and other information works better as a time series, where trends and spikes are what we are looking for.

The dashboard is placed prominently around the team's desks in the office and, importantly, is responsively designed and available outside the company network - you will need to check it in the park on a Sunday afternoon on your mobile phone.

Typically a wobble on a graph is a prompt to pay closer attention to the finer grain information we have available to us (graphite, RUM, logs, auto-scale notifications etc.)

Thresholds

While everyone is expected to pay attention to the dashboard and other data we have available, not everyone can do this all of the time. We need to eat and sleep, oversights happen etc., so it's helpful if the platform can report its own ill-health to us when we are otherwise engaged.

The primary metrics (above) each have Cloudwatch alarms configured at thresholds we deem useful to know about (a number you have to arrive at over time), and a few of the secondary indicators have higher thresholds against them.

Where feasible we've attempted to build a system that can survive minor abnormalities without human intervention (e.g. bursts in traffic can be countered by auto-scaling, serving stale content is acceptable to mask a temporary back-end outage etc.), so each system is given a bit of leeway (around 5 minutes) before it tells a human there's a problem. Likewise, the secondary indicators are given even longer to sort themselves out. Pingdom, for example, waits 15 minutes before telling us about a poorly host.
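
The idea of leeway can be sketched as "only alert when the threshold has been breached for several samples in a row". The Python below is an illustration under assumptions: `should_alert` is a made-up function, and it treats the 5 minutes as five consecutive one-minute readings, which is not necessarily how the alarms are actually configured.

```python
from typing import Sequence


def should_alert(
    samples: Sequence[float],
    threshold: float,
    leeway_samples: int = 5,  # assumed: five one-minute readings
) -> bool:
    """True only if the most recent `leeway_samples` readings all breach the threshold."""
    if leeway_samples <= 0 or len(samples) < leeway_samples:
        return False  # not enough history to judge
    return all(value > threshold for value in samples[-leeway_samples:])
```

A short spike that recovers on its own never produces five bad readings in a row, so nobody gets woken up for it.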

Alerts

Notifications are fairly straightforward. We use PagerDuty to help co-ordinate the team.

It works like this.

Everyone on the team that deploys software to production is added to our PagerDuty rota. This creates a sense of responsibility. It's bad form if you have team members who are deploying things and expecting other people to notice it's broken.

Upon receiving any incident PagerDuty sends the whole team an alert.

The type of alert is left to individual preference - some like SMS, some email, some push alerts - and if the message isn't acknowledged by someone within 30 minutes it's escalated to a phone call to me (from a spooky automated voice).
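
The escalation rule amounts to a small decision, sketched here in Python. This is not PagerDuty's API; `next_action` and its return values are invented purely to show the logic.

```python
from typing import Optional


def next_action(
    minutes_since_alert: float,
    acknowledged: bool,
    escalate_after_minutes: float = 30,
) -> Optional[str]:
    """Decide what should happen to an open incident right now."""
    if acknowledged:
        return None  # someone has picked it up, nothing more to do
    if minutes_since_alert >= escalate_after_minutes:
        return "phone-call"  # escalate to a call
    return "wait"  # still inside the acknowledgement window
```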

The system outlined above has served us well to date. We've striven to find a balance between keeping ourselves informed of problems and avoiding the classic overabundance of alert messages (that ultimately get marked as spam and ignored). A typical month sees us receive a couple of alerts.

Thoughts

Releasing code like this is great fun and relatively trouble free.

It allows us to push changes out to users quickly, but in a controlled manner. Everything from bugs to split tests to experimental code and larger user-facing features gets the same treatment. This means we learn quickly about what works and what doesn't - both in terms of technical and product performance.

It's free from the problems associated with a single, monolithic release. The cause and effect of problems is quite obvious with this release pattern.

It forces the team to think about priorities. On a 4 week drop of code - probably spanning many commits and features - you can lose the focus on what really matters. The longer the time between deploys, the easier it is to just add that little extra essential feature you think you need. Giving yourself a day or two between changes helps you focus repeatedly on the next-most-useful-thing, and the next, and the next.

Similarly, a dozen minor improvements over two weeks hugely reduces the cognitive load on the rest of the engineering team in reviewing and sanity checking the code.

One byproduct of this is the effect on the management of projects. Before, we might sit in a planning session for an hour or two, sketching out what we want to deliver in the near term, writing out a collective to-do list, tracking the cards in a ticketing system, then grouping things up at the end of the week and pushing them out. Now that people understand delivery is ad-hoc, planning is typically a short conversation and a whiteboard diagram, followed up by a short burst of deploys. Bugs get raised and dealt with in minutes, if serious. It's greatly reduced the overheads in our short term project planning.

One pleasing aspect of continual delivery is the number of itches that get scratched because of the low cost of doing something and releasing it. The annoying bit of code, that refactoring you do in between things, the bug that you care about but that hasn't really been prioritised, all actively get added to the project with very little fuss.

Most pleasing of all is how the team has grasped ownership of the whole software platform and delivery cycle.

Like going to the gym, it's a constant challenge to keep yourself and the team motivated, plugging away, but the rewards are plentiful.