August 25, 2016
Mike Franklin is director of the Algorithms, Machines and People Lab (AMPLab) and chair of the computer science division at the University of California, Berkeley.
Mike Franklin of the University of California, Berkeley, discusses the goals behind Spark and a more unified cloud-data ecosystem.
PwC: What do you see in the data technology that’s coming online, and how is the landscape changing?
Mike Franklin: I was trained as a database person. Most of my research and work in industry has been along the lines of traditional relational databases, object-oriented databases, and so on. That’s my perspective. I think things are in the process of changing in a fundamental way. In the 30 years or so that I’ve been involved in this area in various ways, this is the biggest change I’ve seen since the relational database revolution.
There wasn’t a generally accepted alternative to the relational database in its heyday. You either figured out how to use it, or you were on your own. There were attempts over the years to try to change that, such as object-oriented databases, XML-oriented databases, and the like. None of them really got the traction that we’re starting to see now in some of these newer systems.
More recently, the appreciation of the potential value in data has spread everywhere, across all industries. Basically, every department at UC Berkeley, for instance, is doing data-driven work. That’s led to an explosion in the sort of use cases and the types of data that people want to store and analyze. I’m not sure that the formats of the data have changed all that much, but the variety of the data that people want to store is what’s really changed.
PwC: How has that impacted the nature of databases?
Mike Franklin: Users need flexibility more than anything else, so you can take different approaches to answer the question. One approach is an idea expressed as store first, schema later. The first step in a traditional database environment is to analyze your application and organization, and then do a data design and schema design—where you figure out what each piece of data must look like and how all of the data are interrelated. Only after you’ve gone through that process can you start thinking about putting your system together and loading any data into it.
Now you just collect as much data as you can, because the alternatives in storage are incredibly cheap relative to what they used to be. So you just store everything you can get your hands on. Some people call this a data lake.
About ten years ago, my colleagues and I wrote a paper about something that we called data spaces. That’s exactly the same idea, where you don’t impose structure and don’t require data to conform to a structure to be stored. With that concept, you store the data and then figure out how much structure you need to make sense of the data and do what you need to with it.
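To make the store-first, schema-later idea concrete, here is a minimal Python sketch. It is an illustration only, not code from Spark or from the data spaces paper; the records, the field names, and the `read_with_schema` helper are all hypothetical. Records are kept exactly as they arrive, and structure is imposed only when a particular question is asked.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no schema enforced).
raw_store = [
    '{"user": "ana", "amount": "19.99", "ts": "2016-08-01"}',
    '{"user": "bo", "amount": 5, "note": "gift"}',
    '{"sensor": "t-17", "reading": 21.4}',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time.

    Keeps only records that contain every field named in `schema`
    and casts each kept value with the matching type function.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows

# The structure is chosen by the question being asked, not at load time.
purchases = read_with_schema(raw_store, {"user": str, "amount": float})
readings = read_with_schema(raw_store, {"sensor": str, "reading": float})
```

The same stored lines serve both reads; nothing had to be redesigned or reloaded when the second question came along.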
The biggest gain in the flexibility of data management is that you can store anything and then try to make sense of it and work with it.
PwC: For the past couple of years, the assumption has been that you have this heterogeneous environment and there are different tools you can use, depending on your needs. You could use a document store here, a graph database there, while continuing to use a relational database for critical transaction support. Is there a more unified approach, or must we accept database heterogeneity for the long term?
Mike Franklin: From my perspective, it’s obvious that heterogeneity is not the right way to do things. If you think about the history of the Spark project and the ecosystem we’ve built around it, Spark was originally a research project. We didn’t worry too much about things you would worry about if you were building a product from Day One.
Our view of the Spark ecosystem is that you don’t need all those isolated data systems. Once you’ve got the data into our Spark environment, you should be able to treat that data any way you want to. If you want to run SQL [structured query language] queries over it, we’ll let you do that. If you want to look at the data in the form of a graph, we have a system called GraphX that will let you do that. If you want to use computers that learn to process data better from their own experience, along with data clustering and recommendation systems, we have libraries that will let you do that. And if you want to write low-level code to do something to the data we haven’t even imagined, you can do that, too.
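The idea of one data set, many views can be sketched without Spark at all. The Python example below is only an analogy built on the standard library, and it does not use Spark SQL or GraphX APIs; the `edges` records and the `reachable` helper are hypothetical. It queries the same records once with SQL and once as a graph.

```python
import sqlite3
from collections import defaultdict

# Hypothetical "follows" records held in one place.
edges = [("ana", "bo"), ("bo", "cy"), ("ana", "cy")]

# View 1: relational. Load the records into an in-memory table and use SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE follows (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO follows VALUES (?, ?)", edges)
follower_counts = dict(
    conn.execute("SELECT dst, COUNT(*) FROM follows GROUP BY dst")
)

# View 2: graph. Build an adjacency map from the same records.
graph = defaultdict(set)
for src, dst in edges:
    graph[src].add(dst)

def reachable(start):
    """Return every node that can be reached from `start` by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

ana_reach = reachable("ana")
```

In this sketch the two views are built by hand from one Python list. The point Franklin makes is that a unified environment offers such views over the same data without copying it between separate systems.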
When we created the Spark ecosystem, we had the advantage of not needing to hit revenue targets to please impatient investors. We had the time to build out a comprehensive ecosystem. We did it in an open way, so we’re not the only ones contributing to and extending the ecosystem for the benefit of all.
Of course, any organization is limited by the technologies it has and knows how to use. But there’s no technical reason why you need a dozen different systems to do a dozen different things. There’s enough commonality in the patterns of data acquisition and analytics to enable more effective management of data with a single enterprise-wide data management framework.
PwC: What about meeting the requirements of operational data, which tends to need more structure?
Mike Franklin: Data structure poses interesting questions: When do you need it, and how much do you need? I look at structure as a spectrum—from completely unstructured data, like just a bag of bytes, all the way to a relational schema or maybe even something that has more sophisticated structure at the opposite extreme.
You can think of data structure options as a choice about what tradeoff you’ll accept to meet your needs. On the unstructured side, you get incredible flexibility. You get the ability to bring in whatever data you find, keep it and, hopefully, do something useful with it. But as you move toward the increasingly structured side of data management, you gain more confidence about what your data mean and how they can be applied to benefit your business.
As you move to operational systems that house valuable data, where consistency in the structure and management of that data is the top priority, you must give up flexibility in the way that you structure and manage the data. You apply rules and constraints to get more predictable results.
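The two ends of that spectrum can be placed side by side in a small Python sketch using the standard-library `sqlite3` module. The table names and the rule that a quantity must be positive are assumptions made up for illustration. The `landing` table accepts any text, while the `orders` table gives up that flexibility in exchange for a constraint that the database enforces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Flexible end of the spectrum: one untyped column that accepts any payload.
conn.execute("CREATE TABLE landing (payload TEXT)")

# Structured end: typed columns plus a rule the database itself enforces.
conn.execute(
    """CREATE TABLE orders (
           order_id INTEGER PRIMARY KEY,
           quantity INTEGER NOT NULL CHECK (quantity > 0)
       )"""
)

conn.execute("INSERT INTO landing VALUES ('anything at all')")
conn.execute("INSERT INTO orders VALUES (1, 3)")

# A row that breaks the rule is refused instead of being stored silently.
try:
    conn.execute("INSERT INTO orders VALUES (2, -5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The rejected insert is the predictable result the constraint buys: every row in `orders` is known to carry a positive quantity, which nothing guarantees about `landing`.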
What’s changed today is that the same system can support different points along the data-structure spectrum. With systems like Spark and some other NoSQL [not only structured query language] environments, you get to pick different points along the data-structure spectrum in the same environment. At least that’s the goal, because I’m not sure that capability fully exists yet.
PwC: How does the Spark ecosystem address the operational aspect?
Mike Franklin: Spark is focused on analytics, but we’re in the process of building out capabilities that will be more operational. Spark is one component of something we’re building called the Berkeley Data Analytics Stack [BDAS]. And Spark is the middle, or main, part of that stack.
We’re also building a new component of BDAS called Velox. If you look at the architecture diagram [see below], Velox sits right next to Spark in the middle of the architecture. The whole point of Velox is to handle more operational processes. It will handle data that gets updated as it is available, rather than loading data in bulk, as is common when it’s analyzed. Our models also get updated on a real-time basis when using computers that learn as they process data.
PwC: What are the problems you’re trying to solve on the operational side?
Mike Franklin: We’re trying to create a system that lets you move all your data from one environment to another as you go from the operational side to analytics. When we succeed, you’ll be able to move your data in a closed loop between operations and analytics, so analytics can directly inform your operations.
PwC: How do you support different user groups?
Mike Franklin: The users of our tool include both front-line analysts in a security operations center and more advanced security investigators and incident handlers. Often, certain sensitive types of data are not available to the front-line analysts, but the more advanced investigators would be able to see all the data.
Source: Dan Crankshaw, UC Berkeley AMPLab, 2015, Velox at SF Data Mining Meetup (Slideshare)
PwC: Some people think that business logic should not be contained in a database. For example, perhaps someone decides not to order supplies from a specific vendor for fear of violating compliance regulations, even though database systems have long been used to assert compliance. Now the practice of asserting compliance through a database is being questioned because the proof of business logic is in its application, rather than inside a database. Would you agree that putting business logic in the database itself was a bad idea to start with, something we should move away from?
Mike Franklin: Your question makes me wonder if there’s a third alternative, a shared system for business logic other than the database. Maybe the right idea all along was to pull business logic out of individual applications, so you had a single source of truth for business logic. In that case, putting business logic in the database would have been the wrong way to achieve that. Maybe there is an ideal shared repository for business logic that isn’t the database. I can see that. You may have just given me my new research project.
PwC: What caused the rise of NoSQL? Is NoSQL enough?
Mike Franklin: My view on the rise of NoSQL is that database systems require too much up-front work before you can get anything done. The first time I hand-typed an HTML [HyperText Markup Language] page and then pointed a browser at it, the page wasn’t quite right. But a lot of the page looked OK. That was a revelation to me as a database guy. Because if you did that to a database, it would just return a syntax error. It wouldn’t show you anything until you did it perfectly right.
The NoSQL systems are much more forgiving. But at some point you run into a wall with NoSQL. The truth is that once you reach a certain stage, structured query language is a really good tool for a lot of what people want to do with data.
A flexible and incremental approach to structure is the most valuable. You start by loading whatever data you want. Just throw it in. Make it super easy. Then, as you need more structure and definitions to get the results you need and to comply with any mandatory procedures, you start imposing control on your data.