May 8, 2016
NoSQL database technology is maturing, but the newest Apache analytics stacks have triggered another wave of database innovation.
Previous articles in Remapping the database landscape explore the nooks and crannies of NoSQL databases,1 how they are best used, and their benefits. This article examines the business implications of NoSQL and the evolution of non-relational database technology in big data analytics platforms.
NoSQL was a response to the scaling and flexibility demands of interactive online technology. Without the growth of an interactive online customer environment, there wouldn’t be a need for new database technology. The use of NoSQL especially benefits three steps in a single business process:
- Observe and diagnose problems customers are having.
- Respond quickly and effectively to changes in customers’ behavior and thus serve them better.
- Repeat steps 1 and 2 to improve responsiveness continually.
New distributed database technologies essentially enable better, faster, and cheaper feedback loops that can be installed in more places online to improve responsiveness to customers. Consider these examples described in depth in the previous five articles of this series:
Although NoSQL is maturing, advances in database technology promise new forms of heterogeneity in application platforms. For mobile and web developer teams, polyglot persistence and NoSQL heterogeneity have been the watchwords. For integrators and data scientists, unified big data platforms that have primarily had a more analytics slant are gaining additional general utility and making possible a new class of data-driven, more operational applications. At the same time, more NoSQL databases and database functionality are being incorporated into platforms.
Operational and analytic directions of evolution
Hadoop has evolved. The goal is to create a single architecture for stream, analytics, and transaction processing.
Two separate directions of development have existed for decades in big data architecture. The first was on the developer side and began with operational data, such as the data generated by large e-commerce sites.
E-commerce developers used MySQL or other relational databases for years. But more and more, they’ve found value in non-relational NoSQL databases. “If I use an OLTP [online transaction processing] database, such as Oracle, the result will be data spread across several tables. So when I write the transaction, I must worry about who might be trying to read it while I’m writing it,” says Ritesh Ramesh, chief technologist for PwC’s global data and analytics organization.
“In the case of NoSQL,” Ramesh points out, “consistency is implicit, because you’ll use the order ID as a key to write the entire receipt into the associated value field. It’s almost like you’re writing the transaction in a single action, because that’s how you’re going to retrieve it. It’s already ID’d and date-stamped. Enterprises will soon figure out that NoSQL delivers implicit consistency.”
Relational databases continue to be good for analytics of structured data that has a stable data model, but more and more, web and mobile app developers at retailers, for example, benefit from the time-to-market advantages of a key-value or a document store. “NoSQL doesn’t need a normalized data model, which makes it possible for developers to focus on building solutions and to own the process of storing and retrieving data themselves,” Ramesh says. Less time is consumed throughout the development and deployment process. And if developers want a flexible but comprehensive modeling capability to support a complex service that will need to scale, then graph stores, either standalone or embedded in a platform, provide an answer.
The latest twist in non-relational database technologies is the rise of so-called immutable databases such as Datomic, which can support high-volume event processing. Immutability isn’t a new concept, but decoupling reads and writes does make sense from a scalability perspective. Declarative, functional programming is more common in these applications that process many events. Immutable, log-centric data stores also lend themselves to microservices architectures and continuous delivery goals.
The second direction of database evolution has been Hadoop. Some form of Hadoop has existed for a decade, and the paradigm is promising for large-scale discovery analytics. As the next section will explain, Hadoop has evolved, and more complete big data architectures are emerging in open source and in the startup community. The toolsets and all the elements of these newest stream-plus-batch processing platforms aren’t mature, but the goal is to create a single architecture for stream, analytics, and transaction processing.
As these vectors of database evolution come closer together, there are two types of hybrids: hybrid multi-model databases and hybrid stream-plus-batch analytics stacks that have more operational capabilities.
Hybrids and the polyglot-driven evolution of NoSQL
It’s been six years since Scott Leberknight, then a chief architect and co-founder of Near Infinity, coined the term polyglot persistence. It is the database equivalent of polyglot programming, coined by Neal Ford in 2006 in reference to using the right programming language for the right job. As of July 2015, DB-Engines listed 279 databases on its monthly ranking website. Of those, 128 were NoSQL key-value, document, RDF graph, property graph, or wide-column databases and 35 were object-oriented, time-series, content, or event databases. The remainder—more than half—consisted of relational, multi-value, and search-specific data stores.
Database proliferation continues, and the questions become: Will polyglot persistence become the norm? Or will efforts to unify distributed data stores counter the polyglot trend? Mainstream enterprises often try to minimize the number of databases their developers use so as not to overwhelm their data management groups. In August 2014, Mike Bowers, enterprise data architect at the Church of Jesus Christ of Latter-day Saints (LDS), described the chilly reception LDS IT leadership initially gave MarkLogic. Over time, Bowers and the rest of the data management staff grew to appreciate the value of a NoSQL alternative in the mix, but they drew the line at a single hybrid NoSQL database. Afterward, rogue instances of other NoSQL databases started to appear that did not gain IT support.
NoSQL2 (non-relational, distributed) databases began as a more scalable alternative to MySQL and the other relational databases that developers had been using as the basis for web applications. Use cases for these databases tended to have a near-term operational or a tactical analytics slant. So far, the most popular NoSQL databases for these use cases have been single-model key-value, wide-column, document, and graph stores. Single-model databases offer simplicity for app developers who have one main use case in mind. In July 2015, the DB-Engines popularity analysis listed 12 single-model NoSQL databases among the top 50 databases; MongoDB, Cassandra, and Redis were in the top 10.
Hybrid multi-model databases emerged in the early 2010s. Examples include key-value and document store DynamoDB, document and graph store MarkLogic, and document and graph store OrientDB. In July 2015, these three multi-model databases ranked in the top 50 in popularity, according to DB-Engines.
The DB-Engines popularity analysis listed 12 single-model NoSQL databases among the top 50 databases; MongoDB, Cassandra, and Redis were in the top 10.
Multi-model databases will become more common. Cassandra distributor DataStax, for example, plans to offer a graph database successor to Aurelius Titan as a part of DataStax Enterprise. MongoDB points out that with its version 3.2, some graph traversal and relational joins will be possible with the help of the $lookup operator, used to bring data in from other collections. In general, MongoDB uses a single physical data model, but can expose multiple logical data models via its query engine.
Will the adoption of hybrids slow the growth of polyglot persistence? Given the pace of new database introduction and the increasing popularity of single-model databases, such a scenario seems unlikely. But that doesn’t mean the new unified stacks won’t prove useful.
Database technology in new application stacks
Previous articles in this PwC Technology Forecast series didn’t focus much on relational databases, because the new capabilities in NoSQL databases—such as data modeling features, scalability, and ease of use—are the key disruptors. With the exception of PostgreSQL, an object-relational database that can handle tables as large as 32 terabytes, the top NoSQL databases ranked by the DB-Engines popularity analysis saw the highest growth of any databases ranked in the top 10. MongoDB grew nearly 49 percent year over year, Redis grew 26 percent, and Cassandra grew 31 percent.
Relational technology is well understood, including the so-called NewSQL technologies that seek to address the shortfalls of conventional web databases in distributed computing environments. And relational technology continues to be the default. As of July 2015, most database users think primarily in terms of tightly linked tables and up-front schemas. Of the top 30 databases ranked in DB-Engines, relational databases received 86 percent of the ranking points. In contrast, nine NoSQL databases received 11 percent of the points. Dedicated search engines (called databases in the DB-Enginesanalysis) Elasticsearch, Solr, and Splunk together as a group received 3 percent of the points.
NoSQL is not the only growth area. Excluded from the DB-Engines ranking are Hadoop and other comparable distributed file systems, such as Ceph and GlusterFS and their associated data processing engines. These systems, as explained later, are becoming the foundation for unified analytics and operational platforms that incorporate not only other database functions but also stream and batch processing functions in a single environment.
Some unified analytics and operational platforms have NoSQL at their core. These unified platforms offer the ability to close the big data feedback loop for the first time, enabling near-real-time responsiveness—even in legacy systems that previously seemed hopelessly outdated and unresponsive.
The Berkeley Data Analytics Stack
Hadoop data infrastructure has been the starting point for many of these unified big data frameworks. The Berkeley Data Analytics Stack (BDAS)—an open-source framework created over several years at the University of California, Berkeley, AMPLab (Algorithms, Machines, and People) in collaboration with industry—takes the batch processing of Hadoop, accelerates it with the help of in-memory technologies, and places it in an analytics framework optimized for machine learning.
The capabilities of BDAS have been paying off in notable ways for more than a year. Consider the case of Joshua Osborn, a 14-year-old with an immune system deficiency who was suffering from brain swelling in 2014. Doctors couldn’t diagnose the cause of the swelling by using conventional bioinformatics systems. Eventually they turned to an application the AMPLab pioneered called Scalable Nucleotide Alignment Program (SNAP) that runs on BDAS.
Researchers at the University of California, San Francisco, took the available DNA from the patient, sequenced 3 million fragments of genomic information within two days, and put aside the human genome information with the help of SNAP. Then they matched the sequences that were left with the sequences of known pathogens within 90 minutes.
It turned out Joshua was infected with Leptospira, a potentially lethal bacteria that had caused his brain to swell. Doctors gave him doses of penicillin, which immediately reduced the swelling. Two weeks later, Joshua was walking, and soon he fully recovered.
Researchers sequenced 3 million fragments of a patient’s genomic information within two days. Then they matched the sequences that were left with the sequences of known pathogens within 90 minutes.
At the heart of BDAS is Apache Spark, which serves as a more accessible and accelerated in-memory replacement for MapReduce, the data processing engine for Hadoop. Key to Spark is the concept of a resilient distributed dataset, or RDD. An RDD is an abstraction that allows data to be manipulated in memory across a cluster. Spark tracks the lineage of data transformations and can rebuild data sets without replication if an RDD suffers a data loss. The input and lineage data can be stored in the Hadoop Distributed File System (HDFS) or a range of databases.
RDDs are one reason Spark can outperform MapReduce in traditional Hadoop clusters. In the 2014 Daytona GraySort competition, Apache Spark sorted 100 terabytes at 4.27 terabytes per minute on 207 nodes. In contrast, the 2013 winner was Yahoo, whose 2,100-node Hadoop cluster sorted 102 terabytes at 1.42 terabytes per minute.
Spark uses HDFS and other Hadoop-compatible disk-based file systems for disk storage. Tachyon, which the AMPLab announced in 2014, augments the capabilities of these file systems by allowing data sharing in memory, another way to reduce latency in complex machine-learning pipelines.20
In 2015, the AMPLab added Velox to BDAS. Velox allows BDAS to serve up predictions generated by machine-learning models and to manage the model retraining process, completing the model-driven feedback loop that Spark begins.
According to Michael Franklin, director of the AMPLab and chair of the computer science department at the University of California, Berkeley, “If you look at the architecture diagram [included here], Velox sits right next to Spark in the middle of the architecture. The whole point of Velox is to handle more operational processes. It will handle data that gets updated as it is available, rather than loading data in bulk, as is common when it’s analyzed. Our models also get updated on a real-time basis when using computers that learn as they process data.”
In other words, an entire test-driven, machine-learning-based services environment could be built using this stack.
Kafka- and Samza-based unified big data platforms
Interestingly, while BDAS relies heavily on in-memory technology up front and maximizes the use of random access memory (RAM) to reduce latency in various parts of the stack, the proponents of platforms that use Apache Kafka and Apache Samza think a disk-based storage or file system first approach is best, particularly in a Kafka, log-file-centric system. What’s stored on disk gets cached in memory without duplication, and the logic that manages the synchronization of the two resides entirely in the operating system. The main memory first approach, they assert, risks greater memory inefficiencies, slowness, and unreliability because of duplication and a lack of headroom.
Apache Kafka and Apache Samza were born of LinkedIn’s efforts to move to a less monolithic, microservices-based system architecture. Kafka is an event-log-based messaging system designed to maximize throughput and scalability and minimize dependencies. Services publish or subscribe to the message stream in Kafka, which can handle “hundreds of megabytes of reads and writes per second.” Samza processes the message stream and enables decoupled views of the writes in Kafka. Samza and Kafka together constitute a distributed, log-centric, immutable messaging queue and data store that can serve as the center of a big data analytics platform.
But many pieces of that platform are only now being put into place, including serving databases and associated querying capabilities. Confluent is a stream processing platform startup founded by former LinkedIn employees who had worked on Kafka, Samza, and other open-source projects. Kafka can synchronize the data from various sources, and Confluent has envisioned how those data sources could feed into a Kafka-centric enterprise architecture and how stream analytics could be accomplished with the help of Apache Flink, an emerging alternative to Apache Spark that claims true stream and batch processing using the same engine.
Others have also built their own stacks using Kafka and Samza. Metamarkets, a big data analytics service, has built what it calls a real-time analytics data (RAD) stack that includes Kafka, Samza, Hadoop, and its own open-sourced Druid time-series data store to query the Kafka data and serve up the results. Metamarkets says its RAD stack can ingest a million events per second and query trillions of events, but has found it necessary to instrument its stack and establish a separate monitoring data cluster to maintain such a high-throughput system in good working order.
The Confluent and Metamarkets platforms could be considered implementations of what have been called the Kappa and Lambda architectures. In the Kappa architecture, stream and batch coexist within the same infrastructure. Confluent asserts that the separate stream processing pipeline originally specified in the Lambda architecture was necessary only because stream processing was weak at the time. Now stream and batch can be together. Metamarkets continues to use a Lambda architecture, but has recently added Samza to the mix.
In either case, these are two very high-throughput analytics platforms designed for cost-effective scalability and flexibility. Metamarkets has a number of online and mobile advertising company clients, including OpenX, an online ad exchange that handles hundreds of billions of transactions each month. OpenX connects buyers and sellers in this online exchange. The Metamarkets service provides visibility into the ad exchange marketplace and includes custom dashboards built for OpenX on the front end. The Lambda-architecture-based back end queues up the streams to be analyzed, provides the processing, and serves up the results.
Conclusion: More capable databases and big data platforms
NoSQL databases emerged in the late 2000s out of necessity. As big data grew, the need arose to spread data processing across commodity clusters to make processing timely and cost-effective. As big data became more complex and interactive, developers and analysts needed access to a range of data models. As relational database technology improved, NewSQL responded to retake some of the territory lost earlier. Hybrids and immutable data stores are further expanding the choices available to enterprises.
But the mid-2010s saw expansion in a different, unanticipated direction—entirely new data-driven application stacks built on distributed file systems, particularly HDFS. Stacks such as BDAS promise to close the feedback-response loop in a machine-learning environment. Separate databases will continue to be used in these environments to provide persistence. Theoretically, machines doing unsupervised learning could continuously refine a model given a closed feedback loop. Doctors confronting illnesses difficult to diagnose could turn to a standard process of genomics testing to identify pathogens in a patient’s body, as described in the University of California, San Francisco, example. Similarly, enterprises could monitor operational processes, focus on those of particular concern, and use machine learning to identify elements that could cause disruption, helping management to diagnose a problem and prescribe a solution.
Just as BDAS has become more complete with the addition of Velox, other alternative stacks are emerging, such as the message-log-centric Kafka plus Samza environment. Stream processing is rapidly maturing with dataflow engines such as Apache Flink, so organizations can create very high-capacity online transactional environments that use the same cluster and engine for Hadoop-style batch analytics.
What’s possible with the newest technology is the ability to close the big data feedback loop for the first time, enabling near-real-time responsiveness—even in legacy systems that previously seemed hopelessly outdated and unresponsive. The distributed database can still be used on its own for a mobile or web app, but more and more, its real power comes when the data store is embedded as part of one of these newly built stacks. So-called deep or unsupervised machine learning pays off most when analytics and operational systems people work together to build one continuously improving, cloud-native system. The technology that makes such a system feasible no longer requires a forklift upgrade.
- Structured query language, or SQL, is the dominant query language associated with relational databases. NoSQL stands for not only structured query language. In practice, the term NoSQL is used loosely to refer to non-relational databases designed for distributed environments, rather than the associated query languages. PwC uses the term NoSQL, despite its inadequacies, to refer to non-relational distributed databases because it has become the default term of art.
- Structured query language, or SQL, is the dominant query language associated with relational databases. NoSQL stands for not only structured query language. In practice, the term NoSQL is used loosely to refer to non-relational databases designed for distributed environments, rather than the associated query languages. PwC uses the term NoSQL, despite its inadequacies, to refer to non-relational distributed databases because it has become the default term of art. See the section “Database evolution becomes a revolution” in the article “Enterprises hedge their bets with NoSQL databases,” PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/remapping-database-landscape/enterprises-nosql–databases.html, for more information on relational versus non-relational database technology.