June 8, 2016
Online ad innovators must process hundreds of terabytes a day at the lowest possible cost. How do they do it?
To be useful, sometimes big data just needs a good traffic cop. NoSQL [1] key-value stores and wide-column stores serve that function. For example, they enable fast, scalable, targeted, and more compelling online ad placement. Consider the following online ad by GumGum, a pioneer of in-image advertising:
You’re surfing the web, looking at home improvement and garden ideas. You land on an article about lawns. The first thing you see is a close-up photo, front and center, of a lush patch of green lawn with a pop-up sprinkler at work. Wait, now the photo fades to black.
A new ad type called a canvas starts to appear over the photo, showing pieces of lawn equipment popping up one at a time, until the whole canvas is complete. After a moment, it changes again.
The canvas collapses into an ad called a studio, which has an interactive bar. Out of curiosity, you click the Learn More button and watch a video that shows a modular approach to yard equipment: one chassis and engine for several equipment types, such as a lawn mower, a leaf blower, and a snow blower. The message: This modular solution takes up much less space in your garage than three larger, standalone pieces. [2]
The scenario just described is contextually relevant, in-image advertising. Each in-image ad complements the existing image on the web page. In this case, a canvas ad for a suite of yard-care equipment appears with a photo of a lawn in an article about lawn care. Now think about all the big data storage, processing, and retrieval requirements GumGum’s system must meet just to be able to do what you’ve watched online:
- Awareness and understanding of all the photos and pages on websites from thousands of publishers. These photos are the available inventory that GumGum matches ads to. Awareness and understanding at scale implies image recognition and text-mining capability—that is, a way for machines to read and recognize text and images.
- Awareness and understanding of the target sub-audiences for each ad as conveyed by an analysis of anonymous data from publishers about their readers.
- Inexpensive but highly capable cloud-based, petabyte-scale [3] storage, processing, and retrieval of the data previously described to support ad placement and serving decisions in near real time.
A wide-column or key-value store such as Cassandra, Amazon DynamoDB, Redis, or Riak is central to each of these requirements. “Latency is very important to any advertising,” says GumGum CTO Ken Weiner in an interview with PwC. “We must select and show ads to users in as little time as possible. GumGum also participates in real-time bidding integrations with other companies, where we have only milliseconds to make decisions and to figure out what ad we’ll serve. So we need to be able to access that data quickly.”
Wide-column stores and key-value stores also suit other use cases that involve storing, organizing, and quickly retrieving huge amounts of data with little need for analysis or modeling, such as personalizing retail website experiences and organizing Internet of Things (IoT) data from sensors.
Wide-column stores and key-value stores, the topic of this article, are among many innovations creating a sea change in database technology.
What are key-value and wide-column stores?
A key-value store is a highly simplified version of a database that is also highly optimized for a few primary capabilities. It stores individual elements, which could be a digital representation of any type of value, large or small. Key-value stores are optimized for speed and scalability and are often used in caching applications that require extremely high throughput. They derive their speed and scalability from a simple data model and low overall database complexity.
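The simple data model described above can be sketched in a few lines of Python. This is a toy illustration, not the design of any particular product: the class name `KeyValueStore`, its methods, and the sample keys are all hypothetical, and the store lives in a single process with no persistence or distribution.

```python
from typing import Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # The store never inspects the value; it only maps a key to bytes.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # A lookup is a single hash-table probe; there are no joins or scans.
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


store = KeyValueStore()
store.put("page:123:topic", b"lawn-care")                      # small text value
store.put("image:456:thumb", bytes([0x89, 0x50, 0x4E, 0x47]))  # arbitrary binary value
topic = store.get("page:123:topic")
missing = store.get("page:999:topic")  # unknown keys return None
```

Because the interface is only `put`, `get`, and `delete` on opaque values, there is little for the database to parse, plan, or lock, which is one way to read the claim that speed comes from low overall complexity.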
Wide-column stores are also fast and are nearly as simple as key-value stores. They include a primary key, an optional secondary key, and any grouping of digital bits stored as a value. A wide-column store can be considered part of the larger family of key-value stores. A wide-column store is also a row store, despite the name. Each row contains a different record.
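To make the two-level keying concrete, here is a minimal Python sketch of a wide-column layout, assuming the primary key identifies a row and the optional secondary key identifies a column within that row. The class `WideColumnStore`, its method names, and the sample data are hypothetical and do not reflect the API of Cassandra, HBase, or any other product.

```python
from typing import Dict, Optional


class WideColumnStore:
    """Toy wide-column layout: row key -> column key -> opaque value."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, bytes]] = {}

    def put(self, row_key: str, value: bytes, column: str = "") -> None:
        # The secondary (column) key is optional; it defaults to "".
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key: str, column: str = "") -> Optional[bytes]:
        return self._rows.get(row_key, {}).get(column)

    def get_row(self, row_key: str) -> Dict[str, bytes]:
        # Each row is its own record and may hold a different set of columns.
        return dict(self._rows.get(row_key, {}))


table = WideColumnStore()
table.put("page:123", b"lawn-care", column="topic")
table.put("page:123", b"4", column="image_count")
table.put("page:456", b"gardening", column="topic")  # this row has no image_count

row = table.get_row("page:123")
absent = table.get("page:456", column="image_count")  # None: rows need not share columns
```

Seen this way, a wide-column store is a key-value store whose values are themselves small maps, which is why it can be treated as part of the same family.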
Key-value and wide-column store designers see advantages inherent in the simplest, most stripped-down data models for certain use cases that require application development speed or flexibility. Consider the example of Basho Technologies Riak, a key-value store. Riak groups keys and values into “buckets” and stores each value as an opaque binary object, so a bucket can hold any data type. Basho Technologies asserts that a simple data model simplifies application development, and that “new features to the application will not require updating a schema or changing the data model, ideal for applications where rapid iterations are required and changes in the underlying data model are undesirable.” However, data ingested into a key-value store must be in the form of key-value pairs.
Data storage options for wide-column and key-value stores
Wide-column and key-value stores vary and are designed for particular classes of use cases that determine their cost and performance tradeoffs. Some of the main distinctions and tradeoffs include these:
- Hard disk drive: Hard disk drive (HDD) storage on large collections of distributed compute clusters has become quite inexpensive, and the viability of some companies’ business models depends on choosing the lowest-cost storage options. As a small startup, GumGum, for example, wants to minimize storage cost per bit and yet capture the essential big data about all the pages and photos in inventory produced by the publishers it represents. So it uses a public cloud infrastructure service with an open-source version of Apache Cassandra that is optimized for spinning disk and a distributed environment. In Cassandra, once data is flushed to disk, the files that hold it are immutable; later changes are written as new entries rather than as in-place updates. With this approach, writes to disk are non-blocking and faster.
- DRAM and SSD with a caching layer: In-memory databases are optimized to store data in random access memory, or RAM, for better performance. Databases can be more or less “in-memory” depending on the use case and storage architecture. The lower-cost options rely more on disk plus an in-memory caching layer. DataStax Enterprise offers a distribution of Cassandra with three options: spinning disk, solid-state disk (SSD), or dynamic RAM (DRAM) caching. Similarly, Apache Accumulo adds two caching layers to the spinning disk storage option, and Oracle Berkeley DB and Amazon DynamoDB include a caching layer with an SSD option.
- DRAM and/or SSD all in-memory: Some databases load the entire key-value store into main memory. The extra speed gained by optimizing for RAM comes at a cost, because RAM is much more expensive than disk, and big data analytics implies huge data volumes. Vendors of in-memory products such as Aerospike, Redis, and Riak often claim latency of a millisecond or less. Aerospike offers either DRAM with spinning disk persistence or a hybrid DRAM/SSD alternative to address the volatility of DRAM (that is, the potential for data loss if power is lost). The latter takes advantage of a proprietary file structure designed to access flash memory.
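Two of the ideas in the list above, immutable on-disk records and an in-memory caching layer in front of slower storage, can be sketched together in Python. This is a deliberately simplified illustration under stated assumptions: `AppendOnlyStore` and `CachedStore` are hypothetical names, the "disk" tier is just a Python list, and the cache uses a least-recently-used policy chosen for the example. It is not how Cassandra, DataStax Enterprise, or any other product is actually implemented.

```python
from collections import OrderedDict
from typing import List, Optional, Tuple


class AppendOnlyStore:
    """Stands in for the disk tier: records are appended, never modified."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, bytes]] = []
        self.reads = 0  # counts trips to the slow tier

    def put(self, key: str, value: bytes) -> None:
        # An update is a new record; earlier records stay untouched.
        self.log.append((key, value))

    def get(self, key: str) -> Optional[bytes]:
        self.reads += 1
        # Scan newest-first so the latest write for a key wins.
        for stored_key, value in reversed(self.log):
            if stored_key == key:
                return value
        return None


class CachedStore:
    """Read-through LRU cache (the memory tier) in front of the slow tier."""

    def __init__(self, backing: AppendOnlyStore, capacity: int) -> None:
        self.backing = backing
        self.capacity = capacity
        self.cache: "OrderedDict[str, bytes]" = OrderedDict()

    def put(self, key: str, value: bytes) -> None:
        self.backing.put(key, value)
        self._remember(key, value)

    def get(self, key: str) -> Optional[bytes]:
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as recently used
            return self.cache[key]
        value = self.backing.get(key)    # cache miss: go to the slow tier
        if value is not None:
            self._remember(key, value)
        return value

    def _remember(self, key: str, value: bytes) -> None:
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used key


disk = AppendOnlyStore()
store = CachedStore(disk, capacity=2)
store.put("ad:1", b"v1")
store.put("ad:1", b"v2")         # appended as a second record, not overwritten
hot = store.get("ad:1")          # served from memory; disk.reads stays at 0
store.put("ad:2", b"x")
store.put("ad:3", b"y")          # capacity is 2, so "ad:1" is evicted
cold = store.get("ad:1")         # cache miss: one read against the slow tier
```

The cost tradeoff in the list shows up in the `capacity` parameter: a larger cache means fewer trips to the slow tier but more of the expensive memory, and an all-in-memory design is the limiting case where the cache holds everything.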
There are many other factors to weigh when thinking about how a key-value or wide-column store might complement an existing data architecture. These include strategies to support overall data management goals and cost versus performance.
In-Hadoop database designs. With Hadoop, organizations can preserve many petabytes of heterogeneous data in its original, full-fidelity format in a unified, low-cost, distributed computing environment. This capability makes Hadoop attractive for a data lake scenario and enterprise-wide, exploratory analytics across silos. Some wide-column stores—such as Apache HBase, Apache Accumulo, and MapR-DB—are designed to work on top of the Hadoop Distributed File System (HDFS) and in conjunction with other services in the Hadoop stack, such as ZooKeeper and Thrift. In-Hadoop databases do introduce a level of management complexity that can be considerable, because they’re built on the rapidly evolving, multi-tier Hadoop stack.
GumGum, for instance, started with Apache HBase on Hadoop, but then encountered some perplexing troubleshooting challenges. “HBase uses HDFS and ZooKeeper,” says Vaibhav Puranik, director of engineering at GumGum. “HBase runs multiple processes on a node [region server], so whenever there was a problem, we didn’t know whether the HBase processes, the Hadoop processes, or something else caused the problem. To maintain HBase, you must maintain three or four pieces of software together, whereas with Cassandra, we have just one simple process running on every single node.”
After a single-point failure caused data loss (a known risk of using Hadoop 1.0, since rectified by version 2.0), GumGum decided to move to Cassandra alone. The main Hadoop distributors do claim newer HBase and Hadoop management functionality, which may address some of the other problems GumGum encountered. [4]
Cost-performance tradeoffs that affect database performance. Other related variables that have cost-performance tradeoffs include networked versus direct-attached storage, virtualized versus physical hardware, and how the data is written or ingested.
Conclusion: Match the use case to the key-value or wide-column store alternative
Since 2006, when Google published its paper on Bigtable, the database world has seen a proliferation of wide-column and key-value store options. Users can now match each specific use case to the database that best suits it.
The data model simplicity of the key-value store family helps companies that need to process very large volumes of perishable, structured data. They can spread out the processing over large, distributed, commodity computer clusters, and in-memory options lower the level of latency inherent in networked systems. Those clusters can scale out linearly, allowing companies to affordably process more and more data as the business scales up.
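One common way to spread keys across a cluster so that it can grow without reshuffling everything is consistent hashing. The article does not say which partitioning scheme any given product uses, so the following Python sketch is an assumption-laden illustration only: `HashRing`, the node names, and the choice of 64 virtual points per node are all hypothetical.

```python
import bisect
import hashlib
from typing import List, Tuple


def _hash(text: str) -> int:
    # Any stable hash works here; SHA-256 is used only for determinism.
    return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring that maps each key to one node."""

    def __init__(self, nodes: List[str], vnodes: int = 64) -> None:
        # Each node owns several points on the ring to even out the load.
        self._ring: List[Tuple[int, str]] = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # A key belongs to the first ring point clockwise from its hash.
        index = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]


keys = [f"page:{n}" for n in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])

# Keys whose owner changes when the fourth node joins the cluster.
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
```

In this sketch, every key in `moved` lands on the new node and all other keys stay where they were, which is the property that lets a cluster add commodity machines without rewriting most of its data.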
Case studies such as the GumGum example described earlier highlight how many different tradeoffs and capabilities factor into choosing a database. In a polyglot persistence environment, several database types are used, each matched to its workload. Users tolerate somewhat more latency in a shopping cart function than they would in a gaming session, for example. Some features of a database, such as the many different Redis data types, lend themselves to numerical data, whereas Cassandra seems more suited to non-numerical data. Atomicity, consistency, isolation, durability (ACID) compliance might make sense in situations where a key-value store is used to augment a relational database directly.
1. Structured query language, or SQL, is the dominant query language associated with relational databases. NoSQL stands for not only structured query language. In practice, the term NoSQL is used loosely to refer to non-relational databases designed for distributed environments, rather than the associated query languages. PwC uses the term NoSQL, despite its inadequacies, to refer to non-relational distributed stores because it has become the default term of art. See the section “Database evolution becomes a revolution” in the article “Enterprises hedge their bets with NoSQL databases,” PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/remapping-database-landscape/features/enterprises-nosql-databases.jhtml, for more information on relational versus non-relational database technology.
2. GumGum demo, April 3, 2015.
3. Digital advertising companies process huge amounts of data, and GumGum is no exception. In 2014 in VentureBeat, John Koetsier reported about AdRoll’s data storage and processing requirements: “Retargeting leader AdRoll announced that it is processing a massive 130 terabytes of advertising data daily and has reached 10 petabytes of stored data from the last 12 months. That’s 90 times the volume of data generated by all the stock exchanges in the U.S.” (John Koetsier, “AdRoll hits gigantic 130 terabytes of ad data processed daily, says size matters,” VentureBeat, November 12, 2014, http://venturebeat.com/2014/11/12/adroll-hits-gigantic-130-terabytes-of-ad-data-processed-daily-says-size-matters/, accessed April 3, 2015.)
4. For more information, see “MapR-DB In-Hadoop NoSQL Database” at https://www.mapr.com/products/mapr-db-in-hadoop-nosql, “Cloudera Manager Backup and Disaster Recovery” at http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-0-0/PDF/Cloudera-Manager-Backup-Data-Recovery.pdf, and “Apache HBase” at http://hortonworks.com/hadoop/hbase/, accessed April 7, 2015.