September 8, 2015
Without a scalable data architecture, the customer experience suffers.
Imagine you’re a retailer offering tens of thousands of products online. You have rich descriptions that include numerous attributes for each product. In standard relational databases, these attributes exist in silos, are poorly described, and cannot be indexed for maximum usefulness. So, if you’re using only a standard relational database and a conventional enterprise search engine, customers who search “17-inch laptop” will retrieve many false positive results that aren’t laptops.
For example, a customer can filter on the brand name of the laptop, and the results will include only records labeled with that name. Each tag name, and its associated values, acts as its own column or subcategory without the complexities found in relational tables. “When you search using XQuery, it finds the tag values that are associated with those tag names,” Unak explains.
The result is a much improved, long overdue online shopping experience for the customer. The ability to filter on potentially thousands of product attributes is just one of the many benefits of using NoSQL.
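The attribute filtering described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the XQuery approach Unak describes: the product documents, tag names, and the helper functions `filter_products` and `facet_counts` are all hypothetical, chosen only to show how each tag name behaves like its own filterable column.

```python
from collections import Counter

# Hypothetical product documents; tag names and values are illustrative only.
PRODUCTS = [
    {"type": "laptop", "brand": "Acme", "screen_inches": 17},
    {"type": "laptop", "brand": "Globex", "screen_inches": 15},
    {"type": "laptop bag", "brand": "Acme", "fits_inches": 17},
    {"type": "laptop", "brand": "Acme", "screen_inches": 13},
]


def filter_products(products, **criteria):
    """Return the documents whose tags match every name/value pair given."""
    return [
        doc
        for doc in products
        if all(doc.get(name) == value for name, value in criteria.items())
    ]


def facet_counts(products, tag_name):
    """Count how many documents carry each value of one tag name."""
    return Counter(doc[tag_name] for doc in products if tag_name in doc)


# Filtering on tag values excludes the 17-inch laptop bag, the kind of
# false positive a plain keyword search for "17-inch laptop" would return.
laptops_17 = filter_products(PRODUCTS, type="laptop", screen_inches=17)

# Facet counts for one brand show what else a shopper could filter on.
acme_counts = facet_counts(filter_products(PRODUCTS, brand="Acme"), "type")
```

Because documents need not share the same tags, the bag record can carry `fits_inches` while the laptops carry `screen_inches`, and neither requires a change to a shared table definition.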
Why NoSQL suddenly became necessary
Part of NoSQL’s value derives from its ability to articulate views into less formally structured data in documents. But its value doesn’t end with document databases. NoSQL is a term that encompasses many types and styles of databases, and the totally stripped-down varieties also pay off in unexpected ways. Dave Duggal, CEO of graph overlay and full-stack integration provider EnterpriseWeb, says the company prefers to use a minimalist database, such as Apache Cassandra, for persistent storage because of its speed and lack of complexity, although he says EnterpriseWeb can use any database type.
Even minimalist database types can support and represent the richest, most meaningful semantics, Duggal notes. “We eliminate the database as a concern of an application developer,” he says. “We’d prefer the database introduce minimum overhead, because that lack of overhead allows maximum capability for our application logic.”
Why this minimalist approach now? Access to globally networked information via a few taps on mobile devices from anywhere has redefined the competitive landscape. Anyone who can harness the right information and apply it in the right way at the right time can have an impact. The obvious changes that organizations confront in this newly interactive and global online environment include these:
- Customers demand better information when and where they need it, and they vote with their feet (or their phones and tablets) when they don’t get it.
- Entirely new businesses are being created and old ones undermined solely on how well they deliver information advantage to end users.
- Information advantage can come from many different boundary-crossing organizations that have the right resources and vision. The number of competitors multiplies.
- Data is easier to gather and analyze than it’s ever been, but noise and errors are also much easier to generate.
- The ability to remove information bottlenecks and create information flow has become the difference between economic success and failure.
- Analyzing that information flow in near real time so the results can be shared with customers globally is a primary challenge.
What’s the best way to create information access and flow where obstacles and bottlenecks have been the norm? On the front end, enterprises need data-driven mobile apps that are easy to use and responsive to business needs. Apps like that require a fluid, scalable, and accessible data back end that handles big data and the associated metadata. Eliminating unnecessary complexity, including at the database layer, is the only way to achieve fluidity and scalability.
It’s easy to talk about these goals in the abstract, but how do enterprises anticipate the needs of the business and build these new capabilities into their existing data infrastructures? New database technologies such as NoSQL are one important enabler. The first step is to refresh the perspectives of IT and business units on what’s feasible now.
Every businessperson seems to grasp the importance of the front end, but many don’t think much about the back end. The back end, of course, is where most of the processing and much of the innovation must occur.
As PwC described in 2014, agile data warehouses and data lakes are the primary means of bringing data together for analytics purposes. But without innovations in databases and persistent storage, apps can’t deliver the flow of relevant information the business needs, whether that’s operational or analytic.
Database evolution becomes a revolution
For decades, to meet their needs for persistent storage, most enterprises used relational databases from a handful of established providers. CIOs focused on making core transactional systems rock solid, and relational databases evolved to meet that challenge. These databases consist of many different tables, which are “related” to one another with the help of foreign keys. The ability to query data that is spread across multiple tables depends on maintaining and updating these foreign key connections.
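As a rough illustration of how foreign keys relate tables, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `brands` and `products` tables and their contents are hypothetical; they are not drawn from any system discussed in this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE brands (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE products (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        brand_id INTEGER NOT NULL REFERENCES brands(id)
    );
    INSERT INTO brands VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO products VALUES (1, '17-inch laptop', 1), (2, '15-inch laptop', 2);
    """
)

# Answering "which brand makes each product?" means joining the two
# tables on the foreign key.
rows = conn.execute(
    """
    SELECT p.title, b.name
    FROM products AS p
    JOIN brands AS b ON b.id = p.brand_id
    ORDER BY p.id
    """
).fetchall()


def insert_orphan(connection):
    """Try to add a product that points at a brand that does not exist."""
    try:
        connection.execute("INSERT INTO products VALUES (3, 'Tablet', 99)")
    except sqlite3.IntegrityError:
        return False  # the database rejected the dangling reference
    return True


orphan_accepted = insert_orphan(conn)
```

The rejected insert shows the maintenance cost mentioned above: every write has to keep the key relationships valid, and every cross-table question has to be expressed as a join.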
In the meantime, an additional online environment, one centered on customers, started to emerge. That environment had completely different requirements. Instead of interacting online with just a few thousand employees, companies needed a way to interact with millions of customers, partners, and other stakeholders. Instead of gigabytes and terabytes, the environment needed to process hundreds of terabytes or even petabytes. In addition to five nines (99.999%) of reliability in transactions, it needed scale. In addition to enforcing precision for financial reporting, it needed to generate thousands of clues from billions of customer-related observations and somehow rank those clues. In addition to serving up definite answers, it needed to offer probabilities and allow users to explore.
Many enterprises are unfamiliar with the realm of big data discovery, analytics, and operations. Yet organizations are beginning to realize that without big data analytics and the ability to perform tasks such as social media discovery at scale, they aren’t responsive to the needs of online customers. Databases that have lots of smaller tables rigidly joined require extra overhead and change management investment, which hampers the ability of organizations to take advantage of new online analytics capabilities.
The new online environment was inspired by innovations from web companies that were confronting how to handle very large amounts of less structured data. Conventional relational databases didn’t scale or flex to meet their needs.
Traditional database design philosophy assumes that developers know what the structure of the data is and that it doesn’t change often. What if a team is in the early development phase of a project and is unsure about the stability of the data model, or might need the ability to change it? What if the team doesn’t require a lot of structure to begin with, or could put off defining a schema for a time? That’s one example of why an organization should think about using a NoSQL database.
Of course enterprises might have inertia and resist the adoption of any new technology, particularly one associated with the complexity of a big data environment. It’s appropriate to be cautious and diligent about evaluating alternatives. NoSQL databases offer clear alternatives and benefits that should be explored in various use cases and scenarios, particularly those involving customer and product data in interactive online environments. But they shouldn’t be evaluated in isolation.
NoSQL, NewSQL, and the needs of the expanded data lifecycle
The first NoSQL and NewSQL (relational with some big data capabilities) databases appeared in the 2000s, primarily from web-scale companies that literally had to invent data management capabilities to address their scale of operation. Many database types emerged, each with its own purpose and rationale, and in many cases the web-scale companies decided to make them open source projects.
Now in the 2010s, large enterprises in other industries are confronting unprecedented data volume growth and must analyze the data quickly to surface and act on new business opportunities or respond to new customer needs. To address new data handling requirements not well served by conventional relational database technology, vendors are adapting and improving open source databases for enterprise use. In the meantime, open source communities such as the Apache Software Foundation continue to tackle the needs of web developers.
Will you need a distributed NoSQL store, or could you just scale up?
Many questions surround NoSQL technology. One of the primary benefits, for instance, has been the ability to scale out across tens, hundreds, or even thousands of servers. One question is whether you need to scale out at all. Scaling up may become a better option than it has been of late. At the data layer, shared memory systems could make it possible to scale up relational and non-relational databases without sharding, that is, without partitioning the data set.
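To make the term concrete, sharding can be sketched as routing each key to one of several partitions. The example below is a simplified illustration: the `ShardedStore` class and `shard_for` function are hypothetical, each shard is just an in-process dictionary standing in for a separate server, and the hash-modulo scheme is one common partitioning choice rather than what any particular database does.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key-value store that partitions its data across several shards."""

    def __init__(self, num_shards: int):
        # Each dict stands in for a separate server (an assumption made
        # purely for illustration).
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
for n in range(100):
    store.put(f"customer:{n}", {"id": n})

# Every record lives on exactly one shard, so the shard sizes sum to the
# number of records written.
total_records = sum(len(shard) for shard in store.shards)
```

The application, or the database on its behalf, has to know which shard holds a key before it can read or write it. That routing step is the complexity a single large memory space would remove.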
Until recently, the ability to take advantage of very large, unpartitioned memory spaces has been limited to supercomputers, which are scaled-up unified systems that use custom microprocessors. But lately, shared memory systems based on common x86 processors have emerged, including those from Numascale, ScaleMP, and TidalScale. The capabilities of these systems vary, but some claim to enable software designed for a single server to be used across many servers by making the server cluster appear as a single, virtual server. This capability would simplify development and analytics tasks for a number of use cases involving datasets up to the hundreds of terabytes, that is, the smaller end of the big data range.
Impactful technologies such as reverse virtualization enter the fray every year, and large enterprises will find it increasingly challenging to track and assess each trend in the context of the many other trends and variables. Suffice it to say that NoSQL, or non-relational, database technology is a landing craft with substantial promise, but many obstacles are lined up in the shallow water near the beach. Those obstacles could prevent a successful landing for any company that is unaware of their location and nature. In the articles to follow in this series, PwC will describe the landscape and, equally important, a number of these obstacles and considerations.
There are pure database types associated with a particular data model—such as key-value, column, document, and graph—and then there are blended, hybrid approaches that offer users a choice of data models. The simplest databases lend themselves to initial app development, data generation, and data capture, along with flexible, developer-driven data modeling. The more sophisticated databases, by contrast, tend to be more appropriate for data reuse by analysts and business users later in the data lifecycle.
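The four data models named above can be compared by representing one product record in each. The sketch below uses plain Python structures as stand-ins. It is a deliberate simplification, the record and the `neighbors` helper are hypothetical, and real products in each category differ considerably in detail.

```python
import json

# Key-value: the store sees only an opaque value behind a single key, so
# the application must parse it to reach individual fields.
key_value = {"product:42": '{"title": "17-inch laptop", "brand": "Acme"}'}

# Column (wide-column): a row key maps to a sparse set of named columns,
# and different rows may carry different columns.
wide_column = {
    "product:42": {"title": "17-inch laptop", "brand": "Acme"},
    "product:43": {"title": "Laptop bag", "fits_inches": "17"},
}

# Document: a nested, self-describing structure the database can index
# and query by field.
document = {
    "_id": 42,
    "title": "17-inch laptop",
    "brand": {"name": "Acme"},
    "specs": {"screen_inches": 17},
}

# Graph: entities become nodes, and relationships become labeled edges
# that can be traversed directly.
nodes = {
    "product:42": {"title": "17-inch laptop"},
    "brand:acme": {"name": "Acme"},
}
edges = [("product:42", "MADE_BY", "brand:acme")]


def neighbors(node_id, label):
    """Follow edges with the given label outward from one node."""
    return [dst for src, lbl, dst in edges if src == node_id and lbl == label]


brand_from_key_value = json.loads(key_value["product:42"])["brand"]
brand_from_document = document["brand"]["name"]
brand_from_graph = nodes[neighbors("product:42", "MADE_BY")[0]]["name"]
```

All three lookups return the same brand, but the work shifts: with the key-value form the application does the parsing, while the document and graph forms expose structure that the database itself can query.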
When first introduced, many purely NoSQL data stores lacked features that have been considered essential for mission-critical core transactional data. They targeted specific, intended purposes and delivered on simplicity and scalability for those immediate purposes.
Now, relational database advocates offer so-called NewSQL databases that include improved scalability and other capabilities. Advocates say these capabilities address the shortfalls without sacrificing reliability guarantees.
Not only do the different classes of NoSQL and NewSQL databases fit distinct use cases, but they are intended for different categories of users. For example, key-value and document stores are popular with developers. The primary motivation behind the creation of MongoDB and Cassandra in the late 2000s was to meet the needs of web developers in distributed environments who had problems scaling or creating schemas up front for relational databases. Developers in this context often consider only a single use case for the data their apps will generate, while analysts later in the lifecycle of that data may have entirely different notions of how it could be reused.
The more complex varieties of databases are intended for users later in the data lifecycle, such as those in reporting and analytics who seek to make more involved queries or find questions to ask. Those who need to reuse data the most are on the right-hand side of the lifecycle in the illustration below.
Much depends on the business requirements at specific points in the data lifecycle. One common challenge enterprises encounter is that different constituencies demand different databases, because they’re focused on different points of the data lifecycle.
For example, developers working on mobile apps supported by a back end in the public cloud might advocate the use of something like MongoDB (a document store) or Cassandra (a key-value/wide column store) because of ease of use, relative schema-lessness, and a click-to-deploy capability from the public cloud provider. But when a business unit evaluates the app, that group may demand a sophisticated recommendation engine that NewSQL or the relationship analytics capabilities of a graph database might better serve. In this scenario, a hybrid database that includes a document and graph store or one that blends a key-value and relational store becomes an option, and ease of migration from the original store becomes a requirement.
It’s still early days for enterprise-ready versions of these new database types. Over the one- to three-year forecast period, CIOs and data architects must strike a balance between the need to rein in complexity and the need to encourage innovation. And they should consider how a new NoSQL capability fits into their existing data architectures.
How NoSQL could complement a conventional database analytics capability
Established enterprises have substantial conventional enterprise data warehousing (EDW) capability based on a stack that includes at least one relational database management system (RDBMS) and often parallel row, column, and in-memory stores that are also conventionally managed.
Consider large retailers as an example of an industry vertical with a substantial online presence that’s very customer focused. They will have dozens of data sources and an integration method that addresses each source, in addition to the EDW, complex application stack, presentation layer, and so on. A NoSQL capability is best added via a data lake architecture adjacent and complementary to the EDW as shown here.
Once the data lake section of the architecture with the NoSQL database is in place, then retailers have an opportunity to perform what conventional databases haven’t been able to accomplish easily: faceted search of product catalog data, for example.
Conclusion: For now, more database choice
Ensuring relevant and consistent information flow at scale to millions of customers presents many challenges. NoSQL databases have provided options to developers and analysts who’ve confronted scaling, speed, accurate search and retrieval, or other problems that conventional relational databases haven’t adequately addressed.
With NoSQL and NewSQL, developers now have a range of options for how they model and query their data. They can approach the modeling task through a series of quick iterations using the same languages they’re accustomed to using for their applications. They can avoid the complexities of managing foreign keys and joins of multiple tables and other actions that relational databases demand, or they can scale out in ways they haven’t been able to before or otherwise get beyond the constraints of relational databases. They can use databases designed for their purposes, whether it’s new, micro applications and services or advanced analytics.
The current era is one of fit-for-purpose databases, which have pros and cons. The next articles in this series will focus on the main NoSQL database types and their most suitable use cases. To conclude the series, articles and interviews will explore what’s emerging: hybrids, overlays, and other similar developments and how they promise to redefine the database once again.
- Structured query language, or SQL, is the dominant query language associated with relational databases. NoSQL stands for not only structured query language. In practice, the term NoSQL is used loosely to refer to non-relational databases designed for distributed environments, rather than the associated query languages. PwC uses the term NoSQL, despite its inadequacies, to refer to non-relational distributed databases because it has become the default term of art. See the section “Database evolution becomes a revolution” for more information on relational versus non-relational database technology.