Mobile app marketers need access to increasingly granular data — which inevitably requires processing huge volumes of data on the fly. This can dramatically slow down database query times, creating bottlenecks for marketers and preventing them from optimizing as quickly and as often as they’d like to. Call it the Catch-22 of modern-day mobile marketing: an ever-increasing need for speed amid an ever-increasing flood of data.
At Singular, the issue came to the fore ahead of our latest analytics offering for marketers, Publisher ROI. Publisher ROI allows marketers to quickly expose a breakdown of an ad network’s inventory by publisher and determine the individual sites and apps driving the best performance.
Early testing of the feature showed that customer queries for publisher-level data required an enormous increase in computing power — to the tune of 50-100X the data normally being ingested and processed for queries in Singular. As we saw query times spike during testing, it became clear that we had hit a major startup milestone: We had outgrown our database technologies.
In order to launch our latest innovation, and continue offering mobile app marketers fast and flexible access to increasingly granular marketing data, we would need to introduce a new data pipeline and datastore — one that was capable of enabling ad-hoc queries on a billion rows with sub-second performance.
In this post, we’ll describe how we rebuilt certain components of our database technologies and dramatically increased the speed of our customers’ queries — in some cases by a factor of 150X.
But before we dive into how we accelerated our customers’ queries, it’s worth giving some background on the evolving needs of mobile app marketers, and the sheer size of the data challenges we face in providing them with an analytics platform as flexible as Singular.
From the early days of Singular, we kept a close eye on the kinds of queries our customers were running in Singular. We tried to spot patterns in order to build systems that anticipated the dimensions a marketer might use in their queries. But we quickly found that there was a huge diversity in the reports and queries customers were running, with countless combinations of different dimensions and metrics.
Think about it in terms of rows in a table. Say you advertise on an ad network that runs your ads on 100 different publishers. You might check the overall ROI of the network — with Singular, that’s just one line of data. To then expose a breakdown of each individual publisher’s performance in order to segment high-performing from under-performing publishers, you would need to render 100 rows of data. Now say you’re running 4 multi-country campaigns and you want to see if certain publishers are performing particularly well in certain Geos — now you’re looking at 400 rows. Want to a break out data by iOS and Android? — that’s 800 rows.
This kind of complexity led to our first conclusion about the database architecture that could support our needs. Because usage of Singular is so diverse, with countless combinations of query dimensions, we needed to support ad-hoc queries, or queries that cannot be determined prior to the moment they are issued. As most data engineers know, databases that are truly optimized for ad-hoc querying are usually not good at updating data after it is ingested. We therefore needed to ingest data into our database in its final form.
Data Ingestion at Singular
At Singular, we run around 10,000 data collection tasks a day. Each collection task is responsible for collecting data for a specific customer from one of its third-party marketing vendors from a window of 1-30 days back. Our ingestion pipeline is distinctive in the volume of data each task collects (granular Facebook stats for 30 days is huge) and because we collect data from vendors who tend to update data retroactively on a regular basis, which requires Singular to constantly swap out old statistics with newly updated data.
Our ingestion pipeline previously loaded data into MySQL and kept running various ingestion logic by querying the data and using updates. After the data was available, the queries triggered by our dashboard and API would hit MySQL as well.
This has been working well for us — that is, until we introduced publisher-level collection and querying. With the large increase in data volume, loading to MySQL became slower, creating bottlenecks in our pipeline. In addition, running analytics queries on this volume of data was simply too slow.
Designing a New Ingestion Pipeline
To support an infinite number of tasks, with an ever-growing size of ingested data, we aimed to build a pipeline that was horizontally scalable (unlike MySQL). The choice of Amazon Simple Storage Service or “S3” to support these dependencies was obvious. We had already been using S3 for backing up our data before inserting it into MySQL. Plus it is horizontally scalable, requires zero ops and offers fast download/upload speed when working within Amazon Web Services.
Thus, our new ingestion architecture relies on S3 with only the metadata stored in MySQL. Instead of querying and updating MySQL, each component of our pipeline receives an S3 file as input and passes on another S3 file containing the data it received together with the new information the component produces.
The end of our pipeline sends the data to an S3 bucket, containing the most up-to-date statistics per customer, marketing source and date. This enables running data collection tasks in parallel, and makes this bucket the source of truth for our system.
Towards a New Datastore
With our data in its final form, ready to be queried in S3, our choice of datastore was super flexible. We had previously built a small abstraction layer between our API and MySQL, which could be adapted to support any query language or schema. Thus, in evaluating new databases, we knew beforehand that the decision wasn’t final, as switching costs were so low. In the end, we selected Druid, an open-source data store developed for the exact need of aggregate queries over marketing analytics data.
Once implemented, we were thrilled with the results: With Druid, queries that once 60 seconds took 1-2 seconds, while queries that once took 30 seconds took less than a second. In certain cases, we saw improvements in database query times as high as 150X compared to the old system.
All these developments bring us to earlier this year, when we began switching on Publisher-Level ROI for select customers in a closed beta test of the feature. The beta would serve as the ultimate stress test for our new architecture.
Granularity & Speed for the Win
As a refresher on Publisher-Level ROI and why it’s so groundbreaking, here’s a little industry background. Ad networks typically purchase inventory consisting of ad slots in hundreds, sometimes thousands, of sites and apps. These sites and apps, known individually as “publishers”, are where marketers’ ads run.
In recent years, mobile app marketers have demanded more visibility into publisher-level performance in order to identify pockets of their most valuable traffic. Networks have responded by providing Publisher IDs in the performance data they expose to marketers. Marketers, in turn, use these Publisher or Site IDs to optimize, increasing spend in high-performing publishers and decreasing spend or “blacklisting” under-performers.
Historically, however, publisher optimization across multiple networks has been an arduous and error-prone process, requiring marketers to toggle between multiple dashboards and manually update unwieldy Excel files. The process is particularly painful for marketers who wish to analyze the performance of publishers not merely by click-through rates or raw install count but rather by the actual quality of those users, as measured by ROI.
Singular automates the process of collecting all your marketing data under one roof, before cleaning and combining the data with revenue and events retrieved from tracking links in order to expose app ROI and other full-funnel business metrics.
But, as we’ve shown here, once ingested, enriched and combined, making publisher-level data fast and flexible is a whole ‘nother beast. Which is precisely why we invested so heavily in our new pipeline and datastore. Mobile app marketers need both granularity and speed — and we’re proud to say that the results speak for themselves.
As we’ve rolled out Publisher-Level ROI to our beta test customers who are now running on our new Druid-based system, customers have reported lightning-fast load times, even with the massive increase in data volume.
And the performance improvements aren’t limited to publisher-level querying. With Singular, similarly data-intensive querying — to expose Campaign, Country, Creative and User-Level performance — is now faster than ever thanks to these advancements in our new database technologies.