August 26, 2016
Vaibhav Puranik is director of engineering, big data at GumGum.
Ken Weiner is CTO of GumGum.
Vaibhav Puranik and Ken Weiner of GumGum discuss the challenges and benefits of open source databases for in-image advertising.
PwC: What does GumGum do?
Ken Weiner: GumGum sells advertising via its in-image ad platform to brand advertisers in the Fortune 500. In-image advertising is a hybrid between display and native advertising; it’s a way to overlay ads on a photo or an image. These ads are usually contextually targeted to complement the images.
GumGum works with a few thousand publishers, and that’s how we secure our digital inventory. We’re basically able to sell ad impressions on images to different advertisers and agencies.
PwC: How are NoSQL databases important to your business?
Ken Weiner: For GumGum to target ads properly, we need an understanding of all of the photos and images that we see on websites and of all of the pages that those photos appear on. We also need some anonymous targeting data that we might associate with all the users who look at all those photos and images. So we need a large database to look up information that we’ve already computed about each photo, each page, and each user. Our ad server uses that information in real time to make decisions about which ads to serve.
Vaibhav Puranik: To give an example, a photograph of actress Jennifer Lawrence might appear on a particular web page. Our software recognizes automatically that this is Jennifer Lawrence’s photograph. Once it does, we can display a trailer ad for The Hunger Games on that photograph.
PwC: And you store that information in the NoSQL database?
Vaibhav Puranik: In an Apache Cassandra database, we save the information that this particular photo is of Jennifer Lawrence. And then we can use that information in real time to serve the ads.
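[Editor's illustration: the lookup pattern described here, in which information computed ahead of time about an image is fetched by key when an ad must be chosen, might be sketched roughly as follows. This is not GumGum's code. A plain Python dictionary stands in for the Cassandra table, and the record layout, the example URL, the campaign mapping, and the function name `select_ad` are all hypothetical.]

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ImageRecord:
    """Precomputed facts about one image (hypothetical layout)."""
    image_url: str
    subjects: Tuple[str, ...]


# Stand-in for a Cassandra table keyed by image URL (assumption:
# the real store and its schema are not described in the interview).
IMAGE_TABLE = {
    "https://example.com/photos/premiere.jpg": ImageRecord(
        image_url="https://example.com/photos/premiere.jpg",
        subjects=("Jennifer Lawrence",),
    ),
}

# Hypothetical mapping from a recognized subject to an ad creative.
CAMPAIGNS = {
    "Jennifer Lawrence": "The Hunger Games trailer",
}


def select_ad(image_url: str) -> Optional[str]:
    """Return an ad for the image, or None if nothing is known about it."""
    record = IMAGE_TABLE.get(image_url)  # single lookup by key
    if record is None:
        return None
    for subject in record.subjects:
        ad = CAMPAIGNS.get(subject)
        if ad is not None:
            return ad
    return None
```

[The point of the sketch is that the expensive step, recognizing who is in the photo, happens before the ad request arrives, so serving an ad needs only a lookup by key.]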
PwC: Is latency another factor?
Ken Weiner: Yes, it’s definitely a factor. Low latency is very important to any advertising, because a user’s attention is fleeting. If the ad isn’t served really close in time to when the image appears, it may never be seen. So we must select and show ads to users in as little time as possible. GumGum also participates in real-time bidding integrations with other companies, where we have only milliseconds to make decisions and to figure out what ad we’ll serve.
PwC: What was the challenge GumGum faced that caused you to move to a NoSQL database such as Cassandra?
Vaibhav Puranik: In 2013, we were using another NoSQL database called HBase. HBase uses the Hadoop Distributed File System [HDFS] and ZooKeeper. HBase runs multiple processes on a node [region server], so whenever there was a problem, we didn’t know whether the HBase processes, the Hadoop processes, or something else caused the problem. To maintain HBase, you must maintain three or four pieces of software together, whereas with Cassandra, we have just one simple process running on every single node.
PwC: How do you query the data in Cassandra?
Vaibhav Puranik: We have apps that query the data programmatically. For ad hoc purposes, we use a tool called Presto, which allows us to write SQL [structured query language] queries.
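[Editor's illustration: an ad hoc SQL question of the kind mentioned here might look like the query below. To keep the example self-contained and runnable, Python's built-in `sqlite3` module stands in for Presto, so this shows only the shape of such a query, not Presto's or Cassandra's behavior. The table `image_subjects`, its columns, and the sample rows are hypothetical.]

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image_subjects (image_url TEXT, subject TEXT)")
conn.executemany(
    "INSERT INTO image_subjects VALUES (?, ?)",
    [
        ("https://example.com/a.jpg", "Jennifer Lawrence"),
        ("https://example.com/b.jpg", "Jennifer Lawrence"),
        ("https://example.com/c.jpg", "Example Subject"),
    ],
)

# Ad hoc question: how many images do we know about for each subject?
rows = conn.execute(
    """
    SELECT subject, COUNT(*) AS image_count
    FROM image_subjects
    GROUP BY subject
    ORDER BY image_count DESC, subject
    """
).fetchall()

for subject, image_count in rows:
    print(subject, image_count)
```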
PwC: Are you also looking at in-memory databases?
Vaibhav Puranik: One other thing we are looking into is how we could use Apache Spark in conjunction with Cassandra. Spark would allow ad hoc querying on top of Cassandra. Spark can load Cassandra data into memory and then execute really, really fast queries on top of it. Because Spark can work in memory, it can perform up to 100 times faster [than disk-based Hadoop MapReduce]. Spark can also provide a query processing engine for Cassandra.
PwC: Does Cassandra come with an in-memory capability to begin with?
Vaibhav Puranik: Cassandra does come with in-memory capability in its enterprise version [DataStax Enterprise]. Unfortunately, we are not using that enterprise version right now, but rather the open source, Apache-licensed version of Cassandra. I know people who are using the enterprise version, and they’re pretty happy with it.
PwC: If you went back in time 10 years and you didn’t have access to these NoSQL options, what would you have done? How dependent are you on the new big data technologies just to execute your business models?
Ken Weiner: I think it might have been possible 10 years ago, but it wouldn’t have been as cost-effective. There were solutions back then for big, vertically scaled databases, where you could get a really powerful, expensive, single machine. But beyond a certain point, I’m not sure exactly how that would have worked out.
Vaibhav Puranik: The reason data is growing so fast is that you can store it and process it in cheaper ways. Ten years ago, most companies would not process that much data because the cost of processing that data was too high. And now they are processing much more data, because they can do it less expensively.