August 31, 2016
Oliver Halter is a principal in the data and analytics practice at PwC.
Oliver Halter discusses how CEOs and CIOs are being forced to tolerate some data inconsistency.
PwC: A number of companies and external data services have been talking about how quickly they’re integrating many different customer data sets and delivering views of this data to end users. What’s going on here?
Oliver Halter: Initially, when companies were first trying to integrate large numbers of data sets, those involved wanted to rationalize the data. They said, “Let’s make a wonderful schema. Let’s make fully integrated data the goal.” And then they went into conference rooms and fought over definitions and data models for three years.
There was so much indecision and complexity that literally nothing got done. Should the service layer be this way? Should it be a granular service? Should it be a compound service? Don’t we need this? Don’t we need that? These were endless conversations that led to very little.
But then when NoSQL databases started to become more useful for enterprises, some companies finally created a simple way to help data owners share their data and get it into the database. “Just make it a bunch of name-value pairs and send it to us,” they said. Each of those records had some kind of an identifier in it, so they knew they could stitch everything together later.
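The pattern Halter describes can be sketched in a few lines. This is a minimal, hypothetical illustration (the field names and records are assumptions, not from the interview): each source sends raw name-value pairs carrying a common identifier, and the records are stitched together later without rationalizing anything.

```python
# Sketch of the "name-value pairs plus an identifier" approach:
# stitch raw records by a shared key, without homogenizing values.
from collections import defaultdict

# Hypothetical raw records from different source systems. Note the
# inconsistent field naming -- it is deliberately left as-is.
incoming = [
    {"customer_id": "C-1001", "FIRST_NM": "Ada"},
    {"customer_id": "C-1001", "acct_status": "A"},
    {"customer_id": "C-2002", "FIRST_NM": "Grace", "acct_status": "I"},
]

def stitch(records, key="customer_id"):
    """Merge raw records that share the same identifier, exactly as
    they arrived, with no attempt to rationalize names or values."""
    merged = defaultdict(dict)
    for rec in records:
        merged[rec[key]].update(rec)
    return dict(merged)

stitched = stitch(incoming)
```

After stitching, `stitched["C-1001"]` holds both the name and the status attribute, still in their original raw form.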
“Without a schema, application developers can store whatever they want, which speeds up the initial development process substantially. But if you need to make that data analyzable or interoperable with other data outside of that app, eventually you’ll need to make your data definitions consistent.”
They weren’t trying to homogenize or rationalize values. They just said, “Send me your raw data.” The first huge success was to stitch the data together and show the records as they came.
Then these companies looked at the results and thought about what customer service reps, for example, would be seeing. And they decided, “This first attribute has a really cryptic value. On the way out to the screen, let’s translate the value to something that a customer service rep can understand.” But they didn’t try to do a giant, rationalized enterprise data model. There’s a clear benefit from avoiding rationalization, but that’s also where a lot of danger lies.
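That “translate on the way out” step might look like the following sketch. The status codes and labels here are hypothetical; the point is that translation happens in the presentation path, while the stored record stays untouched.

```python
# Hedged sketch: map cryptic stored codes to rep-friendly labels at
# display time, rather than rewriting the underlying data.
STATUS_LABELS = {"A": "Active", "I": "Inactive", "P": "Pending"}  # assumed codes

def display_record(raw):
    """Return a copy with codes translated for the screen; unknown
    codes pass through unchanged, and the raw record is not modified."""
    out = dict(raw)
    if "acct_status" in out:
        out["acct_status"] = STATUS_LABELS.get(out["acct_status"], out["acct_status"])
    return out

record = {"customer_id": "C-1001", "acct_status": "A"}
shown = display_record(record)  # shown["acct_status"] is "Active"
```

Because the translation lives at the edge, no giant rationalized data model is needed, which is exactly the trade-off the next question probes.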
PwC: What is the primary danger?
Oliver Halter: The primary danger of letting every application do its own thing is that you end up with 45 mildly different versions of a customer or any other entity—definitions of the same things expressed in different ways, with different codes or formats. And then if you try to analyze the data outside of a particular application, it becomes a nightmare.
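A tiny, hypothetical illustration of that nightmare: two applications encode the same customer attribute differently, so any cross-application analysis first needs a per-source mapping layer. All field names and code values below are assumptions for the sake of the sketch.

```python
# Two apps describe the same customer with different names and codes.
app_a = {"cust_id": "C-1001", "gender": "F"}
app_b = {"customerId": "C-1001", "sex": "Female"}

# Per-source normalizers onto one analysis-time definition (assumed).
NORMALIZERS = {
    "app_a": lambda r: {"customer_id": r["cust_id"],
                        "gender": r["gender"]},
    "app_b": lambda r: {"customer_id": r["customerId"],
                        "gender": {"Female": "F", "Male": "M"}.get(r["sex"], "U")},
}

normalized = [NORMALIZERS["app_a"](app_a), NORMALIZERS["app_b"](app_b)]
```

With two sources the mapping is trivial; with 45 mildly different versions, maintaining these normalizers is exactly the governance burden Halter warns about.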
You just need good governance. You don’t need to impose a core relational data model. For example, you can allow certain groups of users to extend the schema, which works far better in NoSQL databases such as MongoDB or Apache Cassandra than it would in a relational database. Governance can then focus on quality, completeness, and capturing the right kind of information, rather than immediately diving into the semantics of how that information is captured and structured.
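One way to read that division of labor is a lightweight check like the sketch below: governance insists on a small set of core fields being present and non-empty (quality and completeness), while teams remain free to extend records with their own attributes. The required field names are hypothetical.

```python
# Sketch of lightweight governance: enforce core completeness,
# allow arbitrary schema extension. Field names are assumptions.
REQUIRED = {"customer_id", "source_system", "captured_at"}

def governance_check(record):
    """Accept any extra fields; reject only records that are missing a
    core field or that carry empty values."""
    missing = REQUIRED - record.keys()
    empty = [k for k, v in record.items() if v in ("", None)]
    return not missing and not empty

ok = governance_check({
    "customer_id": "C-1001",
    "source_system": "crm",
    "captured_at": "2016-08-31",
    "loyalty_tier": "gold",   # a team-specific extension; allowed
})
```

The check says nothing about what `loyalty_tier` means or how it is encoded—that semantic question is deliberately deferred, as Halter suggests.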
PwC: In general, when should enterprises use a flexible-schema NoSQL database such as Mongo or Cassandra?
Oliver Halter: Mongo and Cassandra are good for a variety of things. The best use cases involve large numbers of individual transactions that are not too big, where records are mostly created once and then read many times. From a design point of view, there’s much less overhead in these systems for storing blog posts or profiles, for example—you can set up a NoSQL database much faster than you could a relational one.
One of the main drivers for the use of these databases is scalability, the ability to do tens of thousands of transactions a second. The other driver, as we’ve discussed, is the flexibility of not having a schema. The lack of a schema deserves careful thought, though, because there’s no free lunch.
Without a schema, application developers can store whatever they want, which speeds up the initial development process substantially. But if you need to make that data analyzable or interoperable with other data outside of that app, eventually you’ll need to make your data definitions consistent.
For each case, there’s an art to determining how much governance you need and how much freedom you should allow.
PwC: With more freedom comes more responsibility, of course. What are some of the other factors contributing to the adoption of these databases?
“Many companies that have major online presences couldn’t do what they do if they relied solely on relational databases.”
Oliver Halter: There are growth drivers that many people overlook. For example, companies are making more compelling products and services by watching how their customers behave and then creating a segment of one, if you will, from an interaction, marketing, and customer service point of view. Those initiatives create tremendous amounts of data. If you get into managing hundreds of terabytes, that becomes very expensive very quickly. It’s not just that NoSQL is a different database technology that does certain things better. It’s also far cheaper, and it can sometimes be the only affordable choice for fast-growing, rapidly changing online customer data.
Many companies that have major online presences couldn’t do what they do if they relied solely on relational databases. They can’t wait to enforce data consistency. They can’t say, “Oh, be careful. We might have some inconsistent data here.” They must roll out functionality and use NoSQL databases to capture data in volume at scale. CEOs and CIOs are worried about data consistency, of course, but they really don’t have any choice. They will need to move into a world where they will have inconsistent data, and they’ll need to get used to it.