Democratising data at the Financial Times

May 2016

This is the text of a talk I gave at CSV Conf.


View the slides


Hi! Thanks for coming.

The impostor

I feel like an impostor. Many of these talks are about openness, for the greater benefit of humanity, but my talk is about closed, private data, mostly for the benefit of one organisation. But a lot of the themes - hack-ability, simplicity, usability, interoperability - are directly applicable to private organisations like the FT. So this talk is about that.

The FT is a 125 year old news organisation with nearly 800k subscribers, over 70% of them digital, and over 5k companies on corporate licences. For a sense of scale, the central data team has around 30 people.

This is Tom Betts, our Chief Data Officer. He came up with the phrase 'democratising data' last year to describe a push to help the organisation make data central to people's day-to-day lives.

No matter what job you do, the better informed you are, the better the decisions you can take. So in democratising data, we are really talking about the accessibility of our data.

If your data is locked up in a warehouse, or (often) a dozen different warehouses, or it's stored in oblique formats, or has to be fished out with odd languages or protocols, then it's going to be less accessible, less democratic.

It's not dissimilar to the various open data manifestos. Here's a list of criteria from a site called Open Definition, part of the Open Knowledge Foundation. Is it available online? Can we use it? Is it machine-readable? Is it available in bulk? Is it free?

Although the FT and organisations like it won't make their data openly available, the needs of their internal communities are the same as those of users of, say, government data. You can ask yourself: does your internal data follow these rules?

I should say it’s not entirely universal at the FT - financial and personally identifiable information especially is kept more closed.

Data powers tools for the newsroom

This is Lantern, a tool used by our newsroom to gauge the performance of the stories we publish. These sorts of tools are quite commonplace; most publishers use them. Every time a user does something on FT.com they log a tracking pixel with metadata describing what they did, and Lantern aggregates this data and displays it in meaningful ways. It's partly education - what % of users found this type of story via another website - and partly a decision-making tool - e.g. which stories are going stale on the front page. This was a UK-focussed story, so we can see the peak of UK traffic for the article is around 8am, that people took 43 seconds to read it, and that the retention rate was 14% - that is, the proportion of people who did something else on the site after reading it.

Our news agenda isn't led by this, but in terms of understanding our audience, the impact of promotion and so on, it has its place in helping us understand the consequences of our actions.

Similarly, a key part of how we communicate with our users is via email. We have dozens of daily emails - some automated, some curated: newsletters, alerts triggered by keywords, breaking news, marketing - and we send several million each day.

Email has a life-cycle - from when you subscribe, to send, open, click and unsubscribe. So our editors have dashboards to see the performance of each email.

The analytics team

The FT has a central data analytics team. They use a mix of SQL, Excel, R, and more recently the hosted reporting product Chart.io. This is fairly typical of the sort of information being produced - here, describing the referral traffic to our web app.

In this graph you can see social media referrals in red against search referrals in green - roughly a 50/50 split.

Data in our products

Aside from reporting, we also power parts of our website purely from data. Here's a standard 'top 10' most popular stories, generated from a simple rolling count of who is reading what over the last 12 hours or so. It's one of the most popular parts of the front page.

Our journalists also tag all their stories with topics so we can display popularity on a more thematic level - what was in the news today, or this week, or what have we written most about.

To power that sort of thing internally we have a rich graph of interconnected topics. We can link stories to companies, and companies to their industry, their board members, their stock listings and so on.
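To give a flavour of what that looks like to query, here's a minimal sketch using the Neo4j Python driver (we use Neo4j for recommendations, as I'll mention later). The node labels, relationship types and property names are purely illustrative, not our real model.

```python
# A minimal sketch of querying a topic graph with the Neo4j Python driver.
# Labels, relationship types and property names are illustrative, not our real model.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

def companies_in_story(story_id):
    """Find companies linked to a story, plus the industry each belongs to."""
    query = """
        MATCH (s:Story {id: $story_id})-[:MENTIONS]->(c:Company)
              -[:IN_INDUSTRY]->(i:Industry)
        RETURN c.name AS company, i.name AS industry
    """
    with driver.session() as session:
        return [record.data() for record in session.run(query, story_id=story_id)]
```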

On the FT site you can ‘follow’ any of these topics.

By linking our internal model of the news to our model of the user activity we can generate personalised journeys through the information we publish.

Here's just one simple example of how that looks. Rather than recommend based on what everyone else is doing, we can find things that relate to you personally, which we think is more valuable.

So we've got quite a data-led, multi-layered approach to recommendations at the moment, and these different ways of helping people find what they are interested in all complement our journalists' editorial line.

Data is propaganda :)

We have an internal communications team. They build dashboards to display on the screens around the office - in the canteen, reception and so on. Typically these are things that give a sense of who is using the FT and what they are reading. They need to be visual, something that explains itself within a few seconds as you walk past.

These are article stats cut by country. You can see from this that most countries seem to like reading about themselves.

Data for growing our audiences

Marketing is one of the most interesting parts of the FT when it comes to data.

The paywall, pricing and so on, are all underpinned by AB testing, and projections based on behavioural and financial data.

So on the screen is an example of the output from what we call the Propensity API. It's a predictive model of an anonymous person's likelihood to subscribe to the FT. A score of 0 means very unlikely, a score of 1 very likely, and within that the person's propensity to take up a particular offer, e.g. a trial for £1.

It's built by our data scientists, distilling 500 variables into a dozen or so key indicators of future behaviour.

Having this in API form means we can adapt the experience for different types of user - for example, what would it take for a person we think has a high propensity to subscribe to actually do so: marketing, discounts, showcasing and so on.
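To make that concrete, here's a rough sketch of how a front-end service might consume that kind of score. The endpoint, field names and thresholds are all made up - it's just to show the shape of the idea.

```python
# Hypothetical sketch of consuming a propensity score over HTTP.
# The endpoint, payload shape and thresholds are illustrative assumptions.
import requests

def choose_treatment(anonymous_user_id):
    """Pick a marketing treatment based on a 0-1 propensity-to-subscribe score."""
    resp = requests.get(
        f"https://propensity.example.internal/scores/{anonymous_user_id}", timeout=2)
    resp.raise_for_status()
    score = resp.json()["propensity_to_subscribe"]  # 0 = very unlikely, 1 = very likely

    if score > 0.7:
        return "subscribe_prompt"   # likely subscribers: nudge them over the line
    if score > 0.3:
        return "trial_offer"        # on the fence: e.g. the £1 trial
    return "showcase_content"       # unlikely for now: keep demonstrating value
```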

Customer research collects structured data

This is a form on the website, prompting the user for feedback. In previous companies I've worked at, customer research teams have used external third-party survey tools to collect qualitative data from the audience. These work well, but typically the information ends up disconnected from the rest of the data in your warehouse.

Collecting it ourselves and connecting it to the rest of our customer data means we have a much better handle on the value of each piece of feedback - for example, you can take each piece of negative feedback and look at what the user had done before, or did after, they sent it. We might want to know if negative feedback results in a lower subscription renewal rate.

Or we can split out feedback for loyal users, which we may want to prioritise or pay closer attention to.

The team who use this data like spreadsheets, so we pump it into Google Sheets via its API, which lets them do the analysis they need.
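As a rough illustration, pushing feedback rows into a sheet can be as small as this, sketched with the gspread library - the spreadsheet name and columns are made up.

```python
# Sketch of appending feedback rows to a Google Sheet with gspread.
# The spreadsheet name and column layout are illustrative assumptions.
import gspread

gc = gspread.service_account(filename="service-account.json")
worksheet = gc.open("Customer feedback").sheet1

def export_feedback(event):
    """Append one feedback event as a row the research team can filter and pivot."""
    worksheet.append_row([
        event["received_at"],
        event["user_id"],
        event["sentiment"],
        event["comment"],
    ])
```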

And because customer feedback is so important, we broadcast it to all staff on an internal Slack channel. It's an unhealthy mix of praise, disappointment and general abuse.
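The Slack broadcast is little more than an incoming webhook - something like this sketch, with the webhook URL as a placeholder.

```python
# Sketch of posting a feedback event to a Slack channel via an incoming webhook.
# The webhook URL and event fields are placeholders.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def broadcast_feedback(event):
    """Post the feedback text (praise, disappointment or abuse) to the channel."""
    requests.post(SLACK_WEBHOOK_URL, json={
        "text": f"Feedback from {event['user_type']}: {event['comment']}"
    }, timeout=2)
```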

Democracy = Diversity

It’s a very diverse set of use cases.

Lots of users - represented right across the whole business, from the newsroom, to commercial, to the digital product teams.

Different needs - the newsroom needs real-time data to make decisions, but the behaviour models for marketing evolve over a 3 month window.

Different levels of complexity - some, like the 'most popular' globe, are simple counters, whereas the recommendation algorithms require specialist graph databases.

Different skills - some people are comfortable processing raw data, but working with terabytes of it isn't easy, so others need abstractions or spreadsheets to make it simple to work with.

We need one system to do all of this, as fragmented data easily leads to multiple versions of the truth: this data warehouse says one thing, whereas that one tells a different story. That's a real problem as it creates conflict - and data should be used to give answers, not create confusion.

Nothing on the market really fits all of that, so we started building something ourselves.

Modelling events

Architecturally, an analytics system typically collects events from the world and puts them in a data warehouse. It's quite straightforward.

If we are going to make a usable system, people need to be able to understand the data they need to send to it. Almost all the things going on inside our ecosystem at the FT can be described as events. There’s a mix of things our users do on the website, as well as things that our back-office systems are doing. We express each event as a category and an action.

The category is the general domain - like 'page' or 'email' or 'signup' or 'infrastructure' - and the action is a verb that describes what happened - 'view', or in the context of email, 'opened'.

The idea is to get away from arcane descriptions of these things and make them more human.

We have several other concepts attached to this model.

The API into the data warehouse is a simple HTTP JSON one. You should be able to see all the concepts I just mentioned here serialised as JSON.

Again, the emphasis is on simplicity - there are no strange interfaces to learn, no limitations on key-value pairs, no dependency on particular libraries. It's essentially schema-less. Anything that can generate an HTTP request can start sending events, and importantly, anyone in the FT can read the API documentation and integrate the collection of events they care about in minutes. There are some edge cases where we just need a pixel with data sent via the querystring, but 99% of users use JSON.
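The slide isn't reproduced here, but a representative (made-up) event might look something like this - a category, an action, and whatever extra key-value pairs the sender cares about. The endpoint and field names are illustrative, not our real schema.

```python
# Sketch of sending an event to a simple HTTP JSON ingestion API.
# The endpoint and field names are illustrative assumptions.
import requests

event = {
    "category": "page",          # the general domain: page, email, signup, ...
    "action": "view",            # a verb describing what happened
    "context": {                 # arbitrary key-value pairs - essentially schema-less
        "url": "https://www.ft.com/content/some-article-id",
        "referrer": "https://www.facebook.com/",
    },
}

requests.post("https://ingest.example.internal/events", json=event, timeout=2)
```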

So we've got lots of systems generating events.

Client-side applications running in the browser, server-side events, web hooks from third-party systems, things like AMP - we even capture offline events, which are buffered and then sent in a batch.

One pipeline, many sinks

The very first version of the system simply took events from this API and put them all in the data warehouse. But all databases have limitations - Redshift is powerful but slow, Elasticsearch is a great cache but not suitable for 10 years' worth of analytics data. If you think back to some of the examples at the start of this presentation: for a simple 'top 10 news stories', Redis is a great, cheap choice; the newsroom analytics tool uses Elasticsearch - essentially a 30-day cache; and the recommendation system uses Neo4j as it needs to query relationships.
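As an illustration, the 'top 10' case really can be a handful of Redis commands - here's a sketch using sorted sets, one per hour, merged over a rolling 12-hour window. The key names are made up.

```python
# Sketch of a rolling 'most read' counter using Redis sorted sets,
# one set per hour, merged over the last 12 hours. Key names are illustrative.
import time
import redis

r = redis.Redis()

def record_read(article_id):
    """Increment the article's score in the current hour's bucket."""
    hour = int(time.time() // 3600)
    key = f"reads:{hour}"
    r.zincrby(key, 1, article_id)
    r.expire(key, 13 * 3600)  # let old buckets fall away on their own

def top_ten():
    """Merge the last 12 hourly buckets and return the ten most-read articles."""
    hour = int(time.time() // 3600)
    keys = [f"reads:{hour - i}" for i in range(12)]
    r.zunionstore("reads:rolling", keys)
    return r.zrevrange("reads:rolling", 0, 9, withscores=True)
```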

So we needed a way to let people consume the event stream in to their own specialist systems. We ended up with two options.

Kinesis is a natural choice for this problem. It's an ordered, seven-day record of all your events - so you can tap into a point in history and start replaying the events into your consumer. It's a shared stream.

SQS is a message queue. It's conceptually simpler than Kinesis - you pick up a message, process it, delete it, pick up the next one, and so on. Some people liked this simplicity.
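A consumer in the SQS style boils down to this loop - sketched with boto3, with the queue URL and the processing step as placeholders.

```python
# Sketch of the 'pick up, process, delete' loop over an SQS queue with boto3.
# The queue URL and the process() step are placeholders.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/ft-events"

def process(event):
    """Placeholder: write the event into this consumer's own specialist store."""
    print(event["category"], event["action"])

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)  # long polling
    for message in resp.get("Messages", []):
        process(json.loads(message["Body"]))
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=message["ReceiptHandle"])
```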

Enriching the events to add meaning

So we have this data pipeline. We've got a way for anyone to put something into it, and a way for anyone to take the collective pool of events out again. But what value can it add?

When we talked to people and looked at what they were using the data for, they often wanted to transform or annotate the events with extra information.

Sometimes they were doing the same thing over and over. Each variation introduces an opportunity for mistakes. So we want to centralise some of the generally useful things people were doing. We call these enrichments.

If your event contains a URL we tokenise it. Most analytics software does something like this with URLs.

Rarely do you need to perform operations on a full URL; you typically want a parameter from the query string (e.g. if you are analysing search trends), or the path or domain as a filter.

Each annotation is just appended to the event, so the original data exists alongside the enrichment.
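In code terms, the URL enrichment is roughly this - the output field names are illustrative rather than our real ones.

```python
# Sketch of a URL tokenisation enrichment appended alongside the original event.
# Output field names are illustrative.
from urllib.parse import urlsplit, parse_qs

def enrich_url(event):
    """Break the event's URL into domain, path and query parameters."""
    parts = urlsplit(event["context"]["url"])
    event["url_enrichment"] = {
        "domain": parts.hostname,
        "path": parts.path,
        "query_params": parse_qs(parts.query),  # e.g. search terms
    }
    return event
```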

We have a time annotation. This transforms the ISO timestamp recording when the event was received by the data pipeline into lots of useful properties.

Some of this is to speed up data processing for downstream systems with poor or slow date-handling capabilities - operating on numbers can be faster than parsing dates. We ensure everything is converted to UTC to avoid confusion.

Annotations like 'week' have a specific meaning at the FT when generating our weekly reports.
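A sketch of that kind of time enrichment - the property names are illustrative, and the ISO week number here stands in for the FT's own definition of a 'week'.

```python
# Sketch of a time enrichment: explode an ISO timestamp into handy properties.
# Property names and the 'week' rule are illustrative, not the FT's definitions.
from datetime import datetime, timezone

def enrich_time(event):
    # Assumes received_at is an ISO 8601 timestamp with a UTC offset.
    received = datetime.fromisoformat(event["received_at"]).astimezone(timezone.utc)
    event["time_enrichment"] = {
        "epoch_seconds": int(received.timestamp()),   # cheap to compare downstream
        "year": received.year,
        "month": received.month,
        "day_of_week": received.strftime("%A"),
        "hour_utc": received.hour,
        "week": received.isocalendar()[1],            # ISO week number
    }
    return event
```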

So enrichments help centralise this business logic.

Sometimes a transform needs to connect to an API. Each event, naturally, has an associated IP address, so we fire that at the MaxMind API. MaxMind is a service that geo-locates an IP address to a reasonable degree of accuracy - so our events can now all be geo-located down to city level in most parts of the world. You can also see we've hacked on support for FT office detection, to help people identify data generated by staff.
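Sketched with MaxMind's geoip2 client, it looks roughly like this - the credentials and the office network range are placeholders.

```python
# Sketch of a geo enrichment using MaxMind's geoip2 web service client.
# Credentials and the office network range are placeholders.
import ipaddress
import geoip2.webservice

client = geoip2.webservice.Client(12345, "LICENSE_KEY")
FT_OFFICE_NETWORK = ipaddress.ip_network("203.0.113.0/24")  # placeholder range

def enrich_geo(event):
    """Geo-locate the event's IP address and flag traffic from the office."""
    ip = event["ip_address"]
    response = client.city(ip)
    event["geo_enrichment"] = {
        "city": response.city.name,
        "country": response.country.iso_code,
        "is_ft_office": ipaddress.ip_address(ip) in FT_OFFICE_NETWORK,
    }
    return event
```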

Enrichments allow us to take data from any other FT API or external API to make the event richer and more meaningful.

Sometimes we need to invent an API.

Given a domain name, like facebook.com, this one classifies it into one of six buckets - social in this case, or in other cases search, news, partner and so on.

This enrichment makes it much easier to do a quick analysis on our referral traffic.
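The classification itself can be as simple as a lookup table - here's a sketch with a tiny, obviously incomplete mapping.

```python
# Sketch of classifying a referrer domain into a handful of buckets.
# The mapping is tiny and obviously incomplete; a real service knows far more domains.
REFERRER_BUCKETS = {
    "facebook.com": "social",
    "twitter.com": "social",
    "google.com": "search",
    "bing.com": "search",
    "news.ycombinator.com": "news",
}

def classify_referrer(domain):
    """Return one of the buckets (social, search, news, partner, ...) or 'other'."""
    return REFERRER_BUCKETS.get(domain.removeprefix("www."), "other")
```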

Enrichment results in this huge JSON document being constructed for every event. And consumers around the business can pick and choose the events they want to store in their own systems, then chuck the rest away.

There’s a long list of these enrichments yet to be implemented.

Some of them, like Freebase, point at connecting our data to external vocabularies.

Some involve hitting external APIs to gather information - sharedcount.com stores social media activity around URLs.

There’s some interesting, experimental ideas - market prices, weather - to help find correlations with the rest of our data.

It's working ok for us. As of March this year we no longer use a third party analytics system at the FT.

The most painful thing at the moment is the alignment of event schemas across multiple products.

If you are tracking something - say, how people use video - in three different ways, it's painful to use that data, so we've started writing schemas for each type of event and validating events against them inside the pipeline.
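Validation inside the pipeline looks roughly like this, sketched with the jsonschema library against a made-up schema for video events.

```python
# Sketch of validating an event against a per-type JSON Schema inside the pipeline.
# The schema itself is a made-up example for video events.
from jsonschema import validate, ValidationError

VIDEO_EVENT_SCHEMA = {
    "type": "object",
    "required": ["category", "action", "context"],
    "properties": {
        "category": {"const": "video"},
        "action": {"enum": ["play", "pause", "complete"]},
        "context": {
            "type": "object",
            "required": ["video_id"],
            "properties": {"video_id": {"type": "string"}},
        },
    },
}

def is_valid_video_event(event):
    """Return True if the event matches the agreed video schema."""
    try:
        validate(instance=event, schema=VIDEO_EVENT_SCHEMA)
        return True
    except ValidationError:
        return False
```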

Democracy?

So have we reached the original vision of democracy?

None of the ideas I've talked about were really planned as such - we didn't draw up an 18-month plan of all the things people wanted to do with data, we just tried to make it usable and accessible and watched how people used it.

Which I think is testament to the principles of Open Data: if it's made accessible, it allows curious people to take it and play with it; it stimulates ideas; they learn something, then share it and create value from it.

And that freedom is of real benefit for organisations like the FT.

Thanks!