August 25, 2016
Tom Foth is director of analytics application development at PwC.
Tom Foth describes how analytics platforms can benefit from a blend of database types.
PwC: One of the main reasons you and your team built the SocialMind platform was so PwC could help clients better monitor and respond to the social media that mentions their brands. But the system can handle data from many different sources, correct?
Tom Foth: That’s right. SocialMind is really a text analytics operating system, not just a social media listening platform. When we bring in unstructured text, we can bring along with it metadata of any type. Right now we mostly do social media listening, and that metadata is mostly about the actual social media document. But the data and the metadata we bring into SocialMind can be anything.
The data could just as easily be call records and agent notes from a call center, or a speech-to-text transcription. Likewise, it could be the output from customer complaint forms that customer service representatives must fill out on a website.
We want to capture with fidelity the context of all the metadata—structured and unstructured—that goes along with the document we’re bringing into SocialMind. Ultimately, we might carry that metadata, with a high degree of fidelity, all the way into a fully queryable format.
In essence, the metadata just consists of tags for the unstructured data. We’re providing context and structure to it through taxonomic classification.
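To make the idea concrete, here is a minimal sketch of taxonomic tagging: an unstructured document arrives with metadata, and classification attaches structured tags to it. The field names, taxonomy terms, and keyword-matching approach are all illustrative assumptions, not SocialMind's actual implementation.

```python
# Illustrative taxonomy: topic -> cue words. A real system would use a far
# richer classifier; this is only a sketch of the tagging concept.
SIMPLE_TAXONOMY = {
    "billing": ["invoice", "charge", "refund"],
    "outage": ["down", "offline", "unavailable"],
}

def classify(document):
    """Attach taxonomy tags to a document via naive keyword matching."""
    text = document["text"].lower()
    tags = [topic for topic, keywords in SIMPLE_TAXONOMY.items()
            if any(word in text for word in keywords)]
    # Tags become part of the metadata that travels with the document.
    return {**document, "metadata": {**document.get("metadata", {}), "tags": tags}}

doc = {"text": "My invoice shows a double charge.",
       "metadata": {"source": "twitter", "author_followers": 120}}
tagged = classify(doc)
print(tagged["metadata"]["tags"])  # ['billing']
```

The point of the sketch is that the document itself stays unstructured; only its metadata gains structure, which is what makes it queryable downstream.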
PwC: What kinds of databases do you use for SocialMind?
“In the document store we use, we want to capture the context of all the metadata that goes along with the document that we’re bringing in. And with the help of that same metadata, it’s possible to deliver a high degree of contextual relevance all the way out to a fully queryable, relational form.”
Tom Foth: Right now we use a document store for our ingestion engine. We tag and provide structure to the unstructured data through a classification and sentiment scoring tool. Then we reflect the output both back to the document store and forward to a relational database.
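The two-way flow described here can be sketched in a few lines. In this illustration, a plain dict stands in for the document store and SQLite stands in for the relational database; the sentiment scorer and schema are placeholder assumptions, not the actual classification and scoring tool.

```python
import sqlite3

doc_store = {}  # stand-in for a document store such as MongoDB or Couchbase
rel = sqlite3.connect(":memory:")  # stand-in for the relational database
rel.execute("CREATE TABLE mentions (doc_id TEXT, sentiment REAL, tag TEXT)")

def score_sentiment(text):
    """Placeholder scorer: +1 per positive cue word, -1 per negative one."""
    positive, negative = {"love", "great"}, {"hate", "broken"}
    words = set(text.lower().split())
    return len(words & positive) - len(words & negative)

def ingest(doc_id, text, metadata):
    # 1. Raw ingest into the document store.
    doc = {"text": text, "metadata": dict(metadata)}
    doc_store[doc_id] = doc
    # 2. Enrich with a sentiment score, reflecting it back to the store.
    doc["metadata"]["sentiment"] = score_sentiment(text)
    doc_store[doc_id] = doc
    # 3. Forward the structured output to the relational side for querying.
    for tag in metadata.get("tags", []):
        rel.execute("INSERT INTO mentions VALUES (?, ?, ?)",
                    (doc_id, doc["metadata"]["sentiment"], tag))

ingest("t1", "I love this brand", {"tags": ["praise"]})
row = rel.execute("SELECT sentiment, tag FROM mentions").fetchone()
print(row)  # (1, 'praise')
```

The design choice to mirror enriched output in both places matches the interview: the document store keeps the full-fidelity record, while the relational table serves tools that need clean, queryable rows.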
We might decide to use a document store for processing as well, but our data visualization tools don’t yet handle input from something like MongoDB, Couchbase, or MarkLogic very well.
Also, we needed to consider the possibility that we’d want to merge CRM data with our data at some point, and we figured that would be easier with a relational database.
In the document store, we want to capture the context of all the metadata that goes along with the document that we’re bringing in. And with the help of that same metadata, it’s possible to deliver a high degree of contextual relevance all the way out to a fully queryable, relational form.
PwC: In that case, is the document store helping to ingest lots and lots of documents and making it possible to capture and preserve both the metadata and the data associated with those documents?
Tom Foth: Correct. But as humans, we will not be able to grok a terabyte of JSON documents about web transactions. With the help of other database types, we’ll need to create viewports into the data. The tools we use to create those viewports will be determined by user needs.
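One way to read the "viewport" idea: nobody reads a terabyte of JSON, but a grouped summary of it is graspable. This illustrative sketch rolls raw transaction documents up into a per-hour count, the kind of view a dashboard tool could render; the field names and layout are assumptions, not a real schema.

```python
from collections import Counter

# Raw, document-shaped transaction records (assumed layout for illustration).
transactions = [
    {"ts": "2016-08-25T15:04", "action": "view"},
    {"ts": "2016-08-25T15:42", "action": "buy"},
    {"ts": "2016-08-25T16:10", "action": "view"},
]

# A viewport: events per hour. Per-document detail is traded for legibility.
per_hour = Counter(doc["ts"][:13] for doc in transactions)
print(dict(per_hour))  # {'2016-08-25T15': 2, '2016-08-25T16': 1}
```

Each viewport answers one class of user question; different users need different rollups of the same underlying documents.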
PwC: That’s why you’re using at least two different kinds of databases. Business purposes further downstream from the business event itself demand cleaner data and better querying capability, correct?
Tom Foth: That’s correct, and there can be a downside to that sort of refinement. With relational systems, we clean the data first, make sure the data is consistent, and make sure it’s all there before we put it into the system. We know the database won’t function well if the data is not consistent and complete. However, as soon as we clean the data, we’ve lost data, and that can be a problem.
Let me give you an example. For a power grid, let’s say data is coming in from a bunch of smart meters from people’s homes. That can be a really noisy environment. If I can analyze the noise, there’s a chance that I can have a lot of other insight about the operation of that grid because of the way the noise is generated.
If, all of a sudden, I find out that every day between 3 p.m. and 4 p.m., I can’t read from a set of meters in a certain neighborhood, then that indicates a couple of possible scenarios. There could be a really high-demand user, and maybe that user is creating a problem for transmission. Or somebody is on the line creating a lot of noise. If I throw away the hiccups (or noise) in the data, I can’t perform a certain type of diagnostic.
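The diagnostic described here, noticing that a set of meters goes silent in the same window every day, only works if the "dirty" gaps are kept rather than cleaned away. This sketch flags hours in which a meter reported fewer readings than expected; the expected rate and data layout are assumptions for illustration.

```python
from collections import Counter

EXPECTED_READS_PER_HOUR = 4  # assume one reading every 15 minutes

# (meter_id, hour_of_day) for each successful reading received.
reads = [("meter-7", hour) for hour in range(24) for _ in range(4)]
reads = [r for r in reads if r[1] != 15]  # simulate silence from 3-4 p.m.

counts = Counter(reads)
# Hours with missing reads are the diagnostic signal, not noise to discard.
gaps = [hour for hour in range(24)
        if counts[("meter-7", hour)] < EXPECTED_READS_PER_HOUR]
print(gaps)  # [15]
```

Had the missing readings been scrubbed out during a cleaning pass, the recurring 3 p.m. gap, and whatever grid problem causes it, would be invisible.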
Even in the dirtiness of data, there’s value that we can now recover using analytics. The relational database thinking that says this data must be clean from the beginning is causing a loss of fidelity that we could otherwise preserve and use.
PwC: A couple of decades ago, it would have been ridiculous to think we would have the storage and compute power we have available today, and now it’s so inexpensive. Imagine saying we’re going to keep a terabyte of data just because we might use it sometime in the next 10 years—when keeping a terabyte of data online could cost you $5 million.
“As long as you have the data, you can pretty much manage the rest with no problem.”
Tom Foth: I remember when I was working in government IT at the state level, and I bought an 8 megabyte upgrade for the mainframe. It cost $4 million in 1982. The funny thing about that is when we bought the memory, the data center director told me that we already had 8 megabytes in the current mainframe. He said, “Look, I don’t want you to turn on all that memory. Keep 4 megabytes of that memory offline so we have it when we need it.”
Hardware and software are less of a constraint, and a few lines of code can be powerful nowadays, but the primary constraint now is having the data. As long as you have the data, you can pretty much manage the rest with no problem.