Filling in the gaps in NoSQL document stores and data lakes

August 26, 2016

Matthias Brantner is the former CTO of 28msec and is a consulting member of technical staff at Oracle. This interview was conducted in 2014.

Matthias Brantner describes the role database virtualization and a business-user query interface can play in heterogeneous environments.

 

PwC: How are companies using NoSQL or non-relational databases?

Matthias Brantner: Developers use NoSQL databases because those databases are relatively easy to set up and they’re mostly free to get started. NoSQL databases are a no-brainer for developers to get something up and running quickly. MongoDB is an example—it’s very easy to install, and developers jumped on it.

But it’s not really clear to me how these databases will be used in the mainstream enterprise and what they will be used for. You can certainly analyze long streams of data from your websites or all the clickstreams. But, as things currently stand, eventually you need a developer to help you deal with the flexibility that those databases give you. You can essentially put everything in there, but to figure out what is in the database, you need a developer.

The tools haven’t caught up. That’s the next thing that needs to happen to drive adoption of those systems and to make the data in data lakes or enterprise data hubs accessible by business users.

PwC: NoSQL query languages are still scarce.

Matthias Brantner: That’s right. Those NoSQL databases started without really having query languages on top of them. Some of them are just key-value stores, and you must write a program to get the data out. MongoDB, for example, has a slightly more sophisticated query language, but it’s still very developer focused. Now many vendors are bringing SQL back onto NoSQL or Hadoop. But generally the semantics are different because NoSQL and Hadoop do not use the SQL data model. So you really cannot use SQL semantics.
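One way to picture the mismatch in data models is the following minimal Python sketch. The sample documents and the helper names `to_rows` and `unnest` are hypothetical and are not taken from any vendor's SQL-on-NoSQL product. The sketch only shows what can be lost when JSON-like documents are forced into fixed columns and child rows.

```python
# Hypothetical JSON-like documents, as a document store might hold them.
docs = [
    {"user": "ann", "age": 31, "tags": ["a", "b"]},
    {"user": "bob", "tags": []},                  # no "age" field at all
    {"user": "cy", "age": None, "tags": ["a"]},   # "age" present but null
]


def to_rows(documents, columns):
    """Project documents onto fixed columns; absent fields become None."""
    return [tuple(doc.get(col) for col in columns) for doc in documents]


def unnest(documents, array_field):
    """Emit one (user, element) row per array element, like a child table."""
    return [
        (doc["user"], item)
        for doc in documents
        for item in doc.get(array_field, [])
    ]


rows = to_rows(docs, ["user", "age"])
tag_rows = unnest(docs, "tags")

# In `rows`, "bob" (field absent) and "cy" (field null) look the same.
# In `tag_rows`, "bob" vanishes because his array is empty.
print(rows)
print(tag_rows)
```

In this sketch, the distinction between an absent field and a null field disappears once the documents are projected onto columns, and a document with an empty array drops out of the unnested result. These are the kinds of choices a SQL layer has to make on top of a non-relational model.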

Everyone who’s trying to serve the market is currently cooking in their own kitchen. Vendors need to focus not only on developer ease of use but also on the ability of business users to look at the data. The technology makes sense, but just collecting data is expensive and doesn’t make sense. You need to know why you are collecting the data and what you’ll do with it.

PwC: One organization we’ve heard from seems to have a solid BI [business intelligence] group and said they’re able to integrate their data from NoSQL and relational sources fairly quickly, more quickly than they can via APIs [application programming interfaces] and application-style integration.

Matthias Brantner: Absolutely. Just because you have an API doesn’t mean you don’t need to maintain the database anymore. If you start having a lot of data silos, each with its own API, you’re gathering a lot of technical debt with lots of code for APIs on different data stores. Each data store does a very specific or very small thing, but only one or two developers might understand it and they need to continue maintaining it. And if new demands for the data come in, you’re gathering more and more code that you must maintain. Companies can have many different data repositories but not really understand how they contribute to the big picture.

PwC: Many developers nowadays keep their logic in the app itself. Is 28msec putting the logic into the database?

Matthias Brantner: Yes. In their virtual database approach, the database and the application server are one thing, and you can write your entire application in the declarative JSONiq language, which allows you to read your own writes and make calls to the outside world. The only thing you expose is an API. You essentially have one thing, which is the database/application server—and it’s using one language that directs both the application logic and the data management.

PwC: Presumably you could do that with a relational database also.

Matthias Brantner: Yes, you could do that as well. I think the problem with the relational database is that you must migrate your schema completely if you make modifications, which is very hard. You mostly need IT, because the person developing the application cannot migrate the schema, and the modifications might have a great performance impact on other parts of the system. So the use of a relational database should be considered very carefully.

PwC: How can executives reduce reliance on IT when it comes to databases generally?

Matthias Brantner: Let me give you an example of what we experienced in the disclosure management field with XBRL [eXtensible Business Reporting Language]. In this case, business users are trying to use relational database systems to explore the information coming from the XBRL filings. The problem is that XBRL filings contain a lot of dimensional metadata. And in relational databases, those dimensions are part of the database schema. So your database schema encapsulates the dimensions.

Dimensions often change, and only the business user understands the semantics of the dimensions and can add, modify, or remove them. The business user must talk to IT for every change. That certainly doesn’t make sense, and I think this barrier is the one we need to break.
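The contrast between dimensions baked into a schema and dimensions stored as data can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table names, the concept `Revenue`, the dimensions `period`, `segment`, and `scenario`, and the helpers `add_fact` and `find_facts` are all hypothetical, and nothing here describes how 28msec or any XBRL product actually stores filings.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Approach 1: every dimension is a column, so the schema encodes the dimensions.
conn.execute(
    "CREATE TABLE facts_wide (concept TEXT, value REAL, period TEXT, segment TEXT)"
)
conn.execute("INSERT INTO facts_wide VALUES ('Revenue', 120.0, 'FY2013', 'Europe')")
# A new dimension (hypothetically 'scenario') needs a schema change first.
conn.execute("ALTER TABLE facts_wide ADD COLUMN scenario TEXT")

# Approach 2: dimensions travel with each fact as data (a JSON string here).
conn.execute("CREATE TABLE facts_flex (concept TEXT, value REAL, dimensions TEXT)")


def add_fact(concept, value, **dimensions):
    """Store a fact together with whatever dimensions it carries."""
    conn.execute(
        "INSERT INTO facts_flex VALUES (?, ?, ?)",
        (concept, value, json.dumps(dimensions)),
    )


def find_facts(**wanted):
    """Return (concept, value) for facts whose dimensions include `wanted`."""
    rows = conn.execute("SELECT concept, value, dimensions FROM facts_flex").fetchall()
    return [
        (concept, value)
        for concept, value, dims in rows
        if wanted.items() <= json.loads(dims).items()
    ]


add_fact("Revenue", 120.0, period="FY2013", segment="Europe")
# The new dimension is simply more data; no DDL statement is involved.
add_fact("Revenue", 95.0, period="FY2013", segment="Europe", scenario="Forecast")

wide_columns = [row[1] for row in conn.execute("PRAGMA table_info(facts_wide)")]
forecast_facts = find_facts(scenario="Forecast")
```

In the first approach, adding the dimension is an `ALTER TABLE` that typically goes through IT. In the second, whoever defines the dimension can start using it immediately, at the cost of the database no longer enforcing which dimensions exist.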

A business user should be in control of the metadata, of the schema, because a business user is the only person who understands the domain. Enterprises don’t want business users to give a task to a developer and say, “Look, that’s what I want.” A developer comes back with a result, and a business user says, “That’s not exactly what I expected and now that I see the results, you might want to do it differently.” The communication between the business user and developers or DevOps is inefficient, and the problem is how the business user fits into the picture. The business user should be able to describe what is nowadays called the taxonomy. That taxonomy describes the metadata and should be reflected in the database. Business users should be in control of it.

We realized that the problem is the discrepancy between the business user and the developers and IT. And so we are looking into that discrepancy. The query language will not help with that. In the end, vendors must build tools that the business user can use, and the technology is only an enabling technology that helps you to support that usage. This approach resonates with business users, because they don’t have to go through IT to make changes to the schema.

With Hadoop or MongoDB, you don’t know what’s in the database. You might have an idea, but you don’t really know. With MongoDB, someone can do analysis on one collection, but not on two. Then a developer needs to treat the data as a join between the two collections. And the business user is already out of the picture. Same for Platfora. If your schema changes and you have different data formats in your Hadoop ecosystem, you again need a developer who bridges the gap.
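The kind of glue code a developer ends up writing for a two-collection analysis might look like the following minimal sketch. It uses plain Python lists of dictionaries to stand in for two collections. The collection names, fields, and the function `revenue_by_region` are hypothetical, and no MongoDB driver calls are shown.

```python
from collections import defaultdict

# Hypothetical sample data: two "collections" as lists of JSON-like documents.
customers = [
    {"_id": 1, "name": "Acme", "region": "EMEA"},
    {"_id": 2, "name": "Globex", "region": "APAC"},
]
orders = [
    {"_id": 101, "customer_id": 1, "total": 250.0},
    {"_id": 102, "customer_id": 1, "total": 100.0},
    {"_id": 103, "customer_id": 2, "total": 75.0},
    {"_id": 104, "customer_id": 3, "total": 40.0},  # no matching customer
]


def revenue_by_region(customer_docs, order_docs):
    """Join orders to customers on customer_id, then sum totals per region."""
    # Build a lookup table from the smaller collection (a simple hash join).
    region_of = {c["_id"]: c["region"] for c in customer_docs}
    totals = defaultdict(float)
    for order in order_docs:
        # Orders whose customer is missing are grouped under "unknown".
        region = region_of.get(order["customer_id"], "unknown")
        totals[region] += order["total"]
    return dict(totals)


result = revenue_by_region(customers, orders)
print(result)
```

Even in this small case, the code embeds decisions a business user would care about, such as how to treat orders with no matching customer, yet those decisions live in a developer's program and not in a tool the business user controls.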

PwC: What’s another approach, then?

Matthias Brantner: The data lake as the common denominator makes sense because it can actually maintain consistency. The data is in one place. Then you have different microservices on top and the tools to access the data.

With that approach, you maintain as much consistency as you need, and business users define what they need. And so microservices in that context make sense.
