August 26, 2016
Martin Van Ryswyk is an executive vice president at DataStax.
Marko Rodriguez is chief of engineering and a co-founder of Aurelius, acquired in February 2015 by DataStax.
Martin Van Ryswyk and Marko Rodriguez of DataStax explore the challenges and benefits of big data analytics with graphs.
PwC: Do customers still think of graph databases as a niche technology, or are attitudes changing?
Martin Van Ryswyk: The reason we acquired Aurelius was very customer driven. We had more than 30 customers telling us they had graph use cases they needed to scale. They wanted us to do something with the Titan graph database that Aurelius created—either support it commercially or come up with our own version. We took a long look and were really surprised at how mainstream graph databases had become.
The Aurelius team was seeing use cases in fraud detection and recommendation engines, evidence that our big enterprise customers had already identified graph as the right modeling framework to solve their problems. These enterprise customers were really just looking to us to make sure we could get them an enterprise-grade solution.
PwC: Are the use cases entirely different from other NoSQL database options, such as Cassandra?
Martin Van Ryswyk: They’re somewhat adjacent. Titan has the ability to use Cassandra underneath it as one way to persist the data. Our customers wanted to have both their wide-column database model and a graph model all in the same store. That was a common theme we heard.
PwC: Graph theory is quite old. What has been inhibiting adoption of graph technologies?
Marko Rodriguez: It took a long time for people to realize that many of the data problems they were trying to solve were graph problems. So although the theory is relatively old, enterprises just didn’t have the terminology to understand what they were getting themselves into or what their problem was. The graph is actually a nice way to represent enterprise data and metadata and to solve enduring data problems.
In addition to the conceptual challenge, graph technologies lacked a certain level of enterprise readiness. Take Titan, for example. Aurelius didn’t have enough resources for enterprise support and enterprise testing, and that really hindered adoption for very large customers. What’s nice about DataStax is that now we’re able to deliver the outreach that helps overcome the conceptual challenge while also providing the support our enterprise customers require. For very large customers with terabytes upon terabytes of data, there is no graph database that supports their needs right now.
PwC: What’s unique about the graph approach from an enterprise perspective?
Martin Van Ryswyk: One of the constraints and benefits of a graph is that it already has the precomputed join. In the SQL world you have tables and columns, and you can arbitrarily join tables based on various columns. In a graph database, we would say that this person knows that person, or this person is related to that person. It scales nicely in that sense, because every relationship already acts as a join. That’s why we can get better scaling with a graph database.
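The contrast between joining at query time and following a stored relationship can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names, the "knows" relationship, and both helper functions are hypothetical, and the sketch is not how DSE Graph or any relational engine is implemented (real relational databases use indexes rather than full scans).

```python
# Hypothetical data: the people and the "knows" relationship are illustrative only.
people = {1: "Alice", 2: "Bob", 3: "Carol"}

# Relational style: an edge table of (person_id, friend_id) rows.
knows_rows = [(1, 2), (1, 3), (2, 3)]


def friends_via_join(person_id):
    """Relational style: match rows of the edge table at query time."""
    return [people[dst] for src, dst in knows_rows if src == person_id]


# Graph style: each vertex already holds references to its neighbours,
# so the "join" was resolved when the edge was written.
adjacency = {}
for src, dst in knows_rows:
    adjacency.setdefault(src, []).append(dst)


def friends_via_adjacency(person_id):
    """Graph style: follow the edges stored on one vertex."""
    return [people[dst] for dst in adjacency.get(person_id, [])]


print(friends_via_join(1))
print(friends_via_adjacency(1))
```

Both functions return the same names; the difference is that the adjacency version touches only the edges stored with the starting vertex.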
PwC: How can established enterprises benefit from those advantages?
Martin Van Ryswyk: We’ve seen a number of good use cases across different sectors. For example, with improved relationship analytics, utilities can better predict when they will have peaks in usage or equipment failures. Large retailers can do better targeting for club cards and coupon recommendations. Banks can detect more instances of fraud or insider trading.
PwC: When we think about these kinds of use cases in relational technology, we typically look for querying and reporting capability. Is that how to think about graph databases?
Martin Van Ryswyk: There will be analysts who will run queries or who want some really nice graphics and visualizations out of graphs. For the most part, that’s not our target market. That’s the OLAP [online analytical processing] side of things.
Our functionality is meant to be accessed programmatically as part of an application in an OLTP [online transaction processing] context. If I am checking out at a grocery store, the store would want data about me plus what I just put in the cart. They would need to run the data through the graph database, so they could figure out all sorts of information in near real time about Martin. Let’s say he’s a football fan, he’s in California, and there’s a game this week. Maybe we’ll offer him a beer coupon. I’m making all of that up. But they’re taking a lot of pieces of data and trying to make very fast analytic decisions, and that’s the big thing with DataStax Enterprise (DSE) Graph. It’s the real-time component.
Marko Rodriguez: In the OLTP space Martin is talking about, when you perform a graph analysis, you’re just doing a particular traversal for a real-time query. You’re touching only a subset of the full data set. You’re starting at the Martin vertex and you’re walking around. You’re trying to solve a problem. And the less data you touch, the faster the traversal will be.
But in an OLAP query, you’re typically touching the whole graph or large subgraphs. There are multiple threads touching many things and, as a result, touching the disk heavily. [Retrieving more data from disk means more latency.] DSE Graph has both OLTP and OLAP capabilities.
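The difference in how much data each style touches can be made concrete with a small Python sketch. The graph, the vertex names, and the two functions below are hypothetical and chosen only to echo the grocery store example; they are not DSE Graph APIs.

```python
from collections import deque

# Hypothetical graph: vertex -> list of neighbouring vertices.
graph = {
    "martin": ["cart", "california"],
    "cart": ["beer"],
    "california": ["football_game"],
    "beer": [],
    "football_game": [],
    "other_customer": ["other_cart"],
    "other_cart": [],
}


def local_traversal(start, max_hops):
    """OLTP-style: walk outward from one vertex, at most max_hops away."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        vertex, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph[vertex]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return seen


def global_out_degrees():
    """OLAP-style: a computation that has to visit every vertex."""
    return {vertex: len(neighbours) for vertex, neighbours in graph.items()}


touched = local_traversal("martin", max_hops=2)
print(len(touched), "of", len(graph), "vertices touched by the local traversal")
print(len(global_out_degrees()), "vertices touched by the global scan")
```

The local traversal never reaches the vertices belonging to the other customer, while the global computation visits all of them.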
PwC: Does the OLTP approach help with the partitioning problem that graph databases have suffered from?
Marko Rodriguez: For sure. That’s the biggest problem in graphs. It’s impossible to get a perfect cut across machines, so what you’re trying to do is limit cross-machine communication.
The more you can place data that will be co-retrieved on the same machine, the better off you’ll be. That is typically not a general function of some abstract algorithm, but rather a function of understanding your data. For example, on a social network, people who communicate with each other tend to be geographically close. You can think of your machines as laid out like a world map; people in the same country will map to a particular machine.
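A rough Python sketch of this idea compares a data-agnostic placement with one that uses knowledge of the data. Everything here is an assumption made for illustration: the users, their countries, the two-machine cluster, and the country-to-machine mapping are hypothetical, and `zlib.crc32` merely stands in for whatever hash a real system would use.

```python
import zlib

# Hypothetical users tagged with a country, plus "communicates with" edges.
country = {
    "ana": "BR", "bruno": "BR", "caio": "BR",
    "dieter": "DE", "eva": "DE", "franz": "DE",
}
edges = [
    ("ana", "bruno"), ("bruno", "caio"), ("ana", "caio"),
    ("dieter", "eva"), ("eva", "franz"), ("caio", "dieter"),
]

NUM_MACHINES = 2


def hash_placement(user):
    """Data-agnostic placement: a stable hash of the user key."""
    return zlib.crc32(user.encode()) % NUM_MACHINES


# Data-aware placement: an assumed mapping of each country to one machine.
country_to_machine = {"BR": 0, "DE": 1}


def country_placement(user):
    """Place users from the same country on the same machine."""
    return country_to_machine[country[user]]


def cross_machine_edges(placement):
    """Count edges whose two endpoints land on different machines."""
    return sum(1 for a, b in edges if placement(a) != placement(b))


print("hash placement:", cross_machine_edges(hash_placement))
print("country placement:", cross_machine_edges(country_placement))
```

With the country-based placement, only the single edge between a user in one country and a user in the other crosses machines. The hash-based count depends on the hash values and is printed for comparison only.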
PwC: What are some other considerations to take into account when using graph databases?
Marko Rodriguez: A key concern is relationship density. Although it sounds counterintuitive, you might actually want to avoid relationship density as much as possible. Take a shopping site, for example. You’ve purchased a lot of products over the years, and so the overall graph is dense around you.
If you’re doing a query, most of that graph is irrelevant information. What you’re really interested in at the current moment is very, very specific. With graphs, you try to be very particular and filter, filter, filter to only certain types of relationships. You really want to contextualize your traversal, so it meets the semantics of your ultimate problem.
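As a minimal sketch of "filter, filter, filter," the Python below restricts a traversal to one relationship type and a property condition. The edge labels, products, years, and the `traverse` helper are all hypothetical; a real graph database would express this in its own query language.

```python
# Hypothetical labelled edges: (source, label, target, properties).
edges = [
    ("customer", "purchased", "tent", {"year": 2012}),
    ("customer", "purchased", "stroller", {"year": 2014}),
    ("customer", "purchased", "running_shoes", {"year": 2016}),
    ("customer", "viewed", "trail_map", {"year": 2016}),
    ("customer", "returned", "blender", {"year": 2016}),
]


def traverse(source, label=None, predicate=None):
    """Follow edges out of `source`, optionally filtered by label and properties."""
    results = []
    for src, edge_label, target, props in edges:
        if src != source:
            continue
        if label is not None and edge_label != label:
            continue  # wrong relationship type for this question
        if predicate is not None and not predicate(props):
            continue  # relationship exists but is out of context
        results.append(target)
    return results


everything = traverse("customer")
recent_purchases = traverse(
    "customer", label="purchased", predicate=lambda p: p["year"] >= 2016
)

print(len(everything), "edges without filtering")
print(recent_purchases)
```

The unfiltered call returns every neighbour of the customer vertex; the filtered call keeps only what matches the question being asked.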
PwC: Do users struggle with overly dense graphs?
Marko Rodriguez: Yes, they do. It’s the God node problem. Network science papers and graph theory papers have examined this problem. For example, we had a project with a customer who was parsing arbitrary text. They were looking at people communicating, and they were creating links between two words that both occurred in the same text. We realized that the word “time” became this super node. Everything linked to “time,” and when everything links to “time,” there is no information in the concept of time.
With too many linkages, there’s no information. But with no linkages, there is also no information. You want to have connectivity, but not too much connectivity.
You really want to have contextualized links between your nodes and various levels (or groupings) of nodes, because that will give you a more accurate representation of the world — where there are structures within structures.
If everything is connected to everything in every possible way, there is no form, and that is not an accurate representation of the reality that we share (though at some level of awareness, it is correct).
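The word co-occurrence example can be reproduced on a toy scale in Python. The texts, the helper functions, and the 50 percent threshold below are assumptions chosen only to make the effect visible; they are not taken from the customer project described above.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical texts; a link is created between two words in the same text.
texts = [
    "time ship order",
    "time meeting moved",
    "time invoice paid",
    "order invoice",
]


def cooccurrence_graph(documents):
    """Build word -> set of words that appear in the same text."""
    neighbours = defaultdict(set)
    for text in documents:
        words = sorted(set(text.split()))
        for a, b in combinations(words, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def super_nodes(neighbours, max_fraction=0.5):
    """Flag words linked to more than max_fraction of all other words."""
    limit = max_fraction * (len(neighbours) - 1)
    return sorted(w for w, linked in neighbours.items() if len(linked) > limit)


graph = cooccurrence_graph(texts)
print(super_nodes(graph))  # prints ['time']
```

In this toy data, "time" is linked to all six other words, so a link to it distinguishes nothing, which is the loss of information described above.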