Solving a familiar e-commerce search problem with a NoSQL document store

August 31, 2016



Mark Unak is CTO of Codifyd.

Sanjay Agarwal is CEO of Codifyd.

Mark Unak and Sanjay Agarwal explain how document stores can help deliver precise e-commerce catalog search results.


PwC: What does Codifyd do?

Mark Unak: Codifyd has built an understanding of product content and e-commerce over the last 15 years. It helps companies better articulate their product data. By articulate, I mean create the taxonomy and the attribution for those particular products, and then align the actual values of those attributes to make sure they’re consistent across the product catalog. We’re experts in how the data and the definition of the data are represented in e-commerce websites.

An example would be e-commerce sites that have industrial supply data—products such as screwdrivers, bearings, adhesives, and abrasives. We’ve been very successful with moving data and transforming it from a supplier’s definition into a distributor’s definition.
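To make the supplier-to-distributor transformation more concrete, here is a minimal Python sketch. The supplier field names, the synonym table, and the helper functions are all hypothetical illustrations, not Codifyd's actual mappings or tooling.

```python
from fractions import Fraction

# Hypothetical mapping from one supplier's field names to the distributor's
# attribute names. A real catalog would keep one mapping per supplier.
SUPPLIER_TO_DISTRIBUTOR = {
    "Tip Sz": "tip_size_in",
    "TipStyle": "tip_type",
    "Blade Len.": "blade_length_in",
}

# Hypothetical synonyms, so every record uses one spelling per value.
VALUE_SYNONYMS = {"flat": "slotted", "flathead": "slotted"}


def parse_inches(text):
    """Turn strings such as '1/8 in', '4 inches' or '0.125' into a float."""
    cleaned = text.lower().strip()
    for suffix in ("inches", "inch", "in.", "in", '"'):
        if cleaned.endswith(suffix):
            cleaned = cleaned[: -len(suffix)].strip()
            break
    return float(Fraction(cleaned))


def to_distributor_record(supplier_record):
    """Rename fields and align values to the distributor's definitions."""
    record = {}
    for field, raw in supplier_record.items():
        name = SUPPLIER_TO_DISTRIBUTOR.get(field)
        if name is None:
            continue  # unmapped fields are simply skipped in this sketch
        if name.endswith("_in"):
            record[name] = parse_inches(raw)
        else:
            value = raw.strip().lower()
            record[name] = VALUE_SYNONYMS.get(value, value)
    return record


supplier_a = {
    "Tip Sz": "1/8 in",
    "TipStyle": "Flat",
    "Blade Len.": "4 inches",
    "Internal Code": "X9",
}
normalized = to_distributor_record(supplier_a)
print(normalized)
```

The point of the sketch is only that names and values are aligned separately: the field mapping handles the supplier's vocabulary, while unit parsing and synonyms keep the values consistent across the catalog.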

PwC: What kinds of data problems related to that transformation can you help solve?

Sanjay Agarwal: Search is one main example. As retailers and marketplaces pursue the “endless aisle,” e-commerce sites are growing from hundreds of thousands of products to millions of products. Each product description has about 15 to 20 attributes, plus images, data sheets, and other data attached to each product. Consumers are looking across millions and millions of points of information. With conversion rates at 2.7 percent on average, a retailer must deliver the products that customers search for, or those customers simply abandon the website. Product information and search are core to any e-commerce website’s success. When consumers go to a website, they want to find the product intuitively and quickly.

For consumers to do that, a retailer must organize this product information logically and intuitively across this very large data set. That’s the front end of a complex problem. The back end is that the products retailers sell come from 3,000 to 5,000 different suppliers. Each supplier has the product information in its own format based on how the manufacturer of these products chooses to present that information.

A retailer (or e-tailer) must ingest these thousands of data streams and convert them into the retailer’s merchandising e-commerce format. For example, we assist retailers with a product strategy and an e-commerce strategy that help consumers search the retail product catalogs better. We’ll help them get the data so they can maintain it, and we’ll also get involved with some of the technology decisions.

PwC: What’s the typical situation you find regarding online product catalogs at customer sites?

Mark Unak: Most search functionality is text based and uses either Apache Solr, which is built on Apache Lucene, or a comparable proprietary product. When you have one of these, you try to stuff as many keywords as you can into the 40-character description field of a product, in the hope that the description matches what a person is actually trying to find. Where retailers have entered certain pieces of information into the description field, the text search engine finds those terms when they are typed into the search box.

But the problem is that those terms are not contextually based; they’re not found in the context of the product they’re associated with. One retailer we work with has more than 2 million products on its website. Tens of thousands of attributes describe those products, but with the typical search engine a consumer can only really search on fewer than 50 of those attributes. And the terms must be global across all the products. Those numbers alone explain why a customer can’t really find a specific item: most of those attributes are not available for search.
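The gap between global attributes and product-specific attributes can be illustrated with a small Python sketch. The three product documents, their attribute names, and the brand names are made up for illustration; a real catalog would have millions of documents.

```python
# Hypothetical product documents; each product type carries its own attributes.
products = [
    {"type": "screwdriver",
     "attributes": {"brand": "Acme", "tip_type": "slotted", "tip_size_in": 0.125}},
    {"type": "bearing",
     "attributes": {"brand": "Acme", "bore_mm": 8, "seal": "rubber"}},
    {"type": "adhesive",
     "attributes": {"brand": "Zenith", "cure_time_min": 30}},
]


def global_attributes(docs):
    """Attributes present on every product: all a 'global' search field can use."""
    keys = [set(d["attributes"]) for d in docs]
    return set.intersection(*keys)


def attributes_by_type(docs):
    """Attributes available when the search is scoped to one product type."""
    by_type = {}
    for d in docs:
        by_type.setdefault(d["type"], set()).update(d["attributes"])
    return by_type


shared = global_attributes(products)
scoped = attributes_by_type(products)
print("global:", shared)
print("screwdriver:", sorted(scoped["screwdriver"]))
```

In this toy catalog only `brand` is shared by every product, while each product type exposes attributes that a global field list would never cover.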

PwC: How does that translate to what the consumer searching for a product sees?

Mark Unak: On a typical retailer’s site, 80 percent of the people will go right to the text box and try to search. Let’s say you search for “17-inch laptop computer” in the text search. The search results show chill mats, backpacks, a power adapter, and a cooling system. Not one laptop computer is on the front page. The retailer just lost that customer and purchase.
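One way to see why this happens is a toy keyword matcher, sketched below in Python. The four product records and the simple term-overlap score are assumptions made for illustration; production engines such as Solr use far more sophisticated ranking, but the failure mode is similar when accessory descriptions repeat the query words and the laptop's own description does not.

```python
# Hypothetical catalog entries with short, keyword-stuffed descriptions.
products = [
    {"type": "accessory", "description": "Chill mat for 17-inch laptop computer"},
    {"type": "accessory", "description": "Backpack for 17-inch laptop"},
    {"type": "accessory", "description": "Power adapter for laptop computer"},
    {"type": "laptop", "description": "Notebook PC 17.3in display", "screen_in": 17.3},
]


def keyword_score(query, description):
    """Count how many query words appear verbatim in the description."""
    query_terms = set(query.lower().split())
    description_terms = set(description.lower().split())
    return len(query_terms & description_terms)


query = "17-inch laptop computer"

# Pure keyword ranking: accessories that mention the query words come first.
ranked = sorted(
    products,
    key=lambda p: keyword_score(query, p["description"]),
    reverse=True,
)

# Attribute-aware lookup: resolve the product type first, then the screen size.
typed = [
    p for p in products
    if p["type"] == "laptop" and 17 <= p.get("screen_in", 0) < 18
]

print([p["description"] for p in ranked])
print([p["description"] for p in typed])
```

Here the only actual laptop scores zero on keyword overlap and lands last, while the attribute-aware lookup returns it directly.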

Now let’s go through a similar example, but one that uses a document database—such as MarkLogic, MongoDB, or CouchDB—and its ability to filter on many different attributes and do complex querying. We’ll use screwdrivers instead of laptops. We have about 800,000 products in this industrial products database example that’s based on the work we’ve done.

Our search technology contextually parses each phrase the customer puts in the search box. For each phrase, it delivers a precise list of products that are relevant to the customer’s search. First, we can resolve screwdrivers to a particular department, meaning a particular type of product. And now you see all of the different attributes associated with a screwdriver. More than 20 attributes are unique to screwdrivers.

Each one has a tag and at least one value, or many values. When considering blade length, for example, you can have several different values. You can filter on blade length, then on the type of driver tip—if you’re looking for a slotted tip, that attribute reduces the number down to 384. Then you can filter on the size of the tips. One-eighth inch is used in a couple of different places, so you filter on one-eighth inch tip size and use that attribute when you run a text search. Finally, you pick an insulated screwdriver, and you find three different items.

You’re down to three. You can now type in “insulated, one-eighth inch tip size, slotted screwdriver” and the search will return three items.
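The narrowing steps above can be sketched in a few lines of Python. The five-item catalog, the attribute names, and the `apply_filters` and `facet_counts` helpers are hypothetical; a document database would run the equivalent query against indexed fields rather than scanning a list.

```python
from collections import Counter

# Hypothetical product documents; attribute names are illustrative only.
catalog = [
    {"sku": "SD-1", "type": "screwdriver", "tip_type": "slotted",
     "tip_size_in": 0.125, "insulated": True},
    {"sku": "SD-2", "type": "screwdriver", "tip_type": "slotted",
     "tip_size_in": 0.25, "insulated": True},
    {"sku": "SD-3", "type": "screwdriver", "tip_type": "phillips",
     "tip_size_in": 0.125, "insulated": False},
    {"sku": "SD-4", "type": "screwdriver", "tip_type": "slotted",
     "tip_size_in": 0.125, "insulated": False},
    {"sku": "BR-1", "type": "bearing", "bore_mm": 8},
]


def apply_filters(docs, filters):
    """Keep documents whose attributes match every selected filter value."""
    return [d for d in docs if all(d.get(k) == v for k, v in filters.items())]


def facet_counts(docs, attribute):
    """Count the values of one attribute among the remaining documents."""
    return dict(Counter(d[attribute] for d in docs if attribute in d))


# Add one filter at a time and record how many products remain.
filters = {}
remaining_counts = []
for attribute, value in [
    ("type", "screwdriver"),
    ("tip_type", "slotted"),
    ("tip_size_in", 0.125),
    ("insulated", True),
]:
    filters[attribute] = value
    remaining_counts.append(len(apply_filters(catalog, filters)))

screwdrivers = apply_filters(catalog, {"type": "screwdriver"})
tip_facets = facet_counts(screwdrivers, "tip_type")
final = [d["sku"] for d in apply_filters(catalog, filters)]
print(remaining_counts, tip_facets, final)
```

Each added filter shrinks the result set, and the facet counts are what a site would show next to each selectable value.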

It doesn’t matter what order you put those search terms in. The system will resolve them in context. For instance, “slotted” and “insulated” are valid only in the context of “screwdriver.” The use of “one-eighth inch” in this example is valid only in the context of tip size, which is in the context of screwdriver. So we’ve parsed the text search as if it were a filter type of search, and we’ve received the exact same results.
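A rough Python sketch of this kind of contextual parsing follows. It is not Codifyd's parser: the `VOCABULARY` table, the attribute names, and the naive substring matching are assumptions chosen to keep the example short.

```python
# Hypothetical vocabulary: each phrase is meaningful only within a product type.
VOCABULARY = {
    "screwdriver": {
        "slotted": ("tip_type", "slotted"),
        "phillips": ("tip_type", "phillips"),
        "insulated": ("insulated", True),
        "one-eighth inch": ("tip_size_in", 0.125),
        "one-quarter inch": ("tip_size_in", 0.25),
    },
    "bearing": {
        "sealed": ("sealed", True),
    },
}


def parse_query(query):
    """Resolve the product type first, then map phrases to attribute filters."""
    text = query.lower()
    product_type = next((t for t in VOCABULARY if t in text), None)
    if product_type is None:
        return None, {}
    filters = {}
    # Only phrases registered for this product type are considered, so
    # "sealed" means nothing in a screwdriver query.
    for phrase, (attribute, value) in VOCABULARY[product_type].items():
        if phrase in text:
            filters[attribute] = value
    return product_type, filters


first = parse_query("insulated, one-eighth inch tip size, slotted screwdriver")
second = parse_query("slotted screwdriver, insulated, one-eighth inch")
print(first)
print(first == second)
```

Because the result is a set of attribute filters rather than a ranked list of keyword hits, the two differently ordered queries resolve to the same filters. A real parser would need proper tokenization; substring matching, for instance, cannot tell "insulated" from "uninsulated".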

PwC: Would you call that faceted search?

Mark Unak: That’s what we call it now. That’s a great term. The power of XML or JSON document objects is that the terms used in our example are now scoped just to screwdrivers. We can have a recommendation based upon any of the attributes in that particular product type. In this situation, we have what we call an “is around me” type of recommendation, which is similar to what can happen when a customer goes into a store. If a customer looks at a particular screwdriver, a retailer can also show the customer screwdrivers that are slightly bigger or slightly smaller or part of a set.

One recommendation could be that this screwdriver is part of a set that has an alternative screwdriver, such as one that’s slightly bigger or smaller. Size is the attribute that we’re using in our Around Me algorithm in this case.

But any of the attributes could be used for the algorithm, not just tip size or blade length. That gives the merchandiser the ability to offer other items that are similar based upon those attributes and not make the customer perform another search.
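The idea of recommending neighbors on a chosen attribute can be sketched as a nearest-value lookup. This is only an illustration under stated assumptions, not the Around Me algorithm itself: the catalog, the `around_me` function, and the choice of absolute difference as the distance are all hypothetical.

```python
# Hypothetical catalog; sizes are in inches.
catalog = [
    {"sku": "SD-1", "type": "screwdriver", "tip_size_in": 0.125, "blade_length_in": 4},
    {"sku": "SD-2", "type": "screwdriver", "tip_size_in": 0.1875, "blade_length_in": 6},
    {"sku": "SD-3", "type": "screwdriver", "tip_size_in": 0.25, "blade_length_in": 4},
    {"sku": "SD-4", "type": "screwdriver", "tip_size_in": 0.09375, "blade_length_in": 3},
    {"sku": "BR-1", "type": "bearing", "bore_mm": 8},
]


def around_me(docs, sku, attribute, limit=2):
    """Return same-type products whose value for `attribute` is closest."""
    current = next(d for d in docs if d["sku"] == sku)
    candidates = [
        d for d in docs
        if d["sku"] != sku and d["type"] == current["type"] and attribute in d
    ]
    # Sort by distance on the chosen attribute; the SKU breaks ties predictably.
    candidates.sort(key=lambda d: (abs(d[attribute] - current[attribute]), d["sku"]))
    return [d["sku"] for d in candidates[:limit]]


by_tip = around_me(catalog, "SD-1", "tip_size_in")
by_blade = around_me(catalog, "SD-1", "blade_length_in")
print(by_tip, by_blade)
```

Passing a different attribute name changes which products count as neighbors, which is the flexibility described above for the merchandiser.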

PwC: How does this new capability relate to database evolution generally?

Mark Unak: In the next generation of data, unstructured and structured data will merge. You will need to work with less structured or differently structured data in ways that previously were reserved only for structured, transactional data types.

Just now we’ve described how product data doesn’t fit nicely into a traditional data structure. Product data is unique. The attributes are unique to specific products. They’re potentially unlimited in number. They have their own sets of values, and there’s very little commonality among those attributes and values across products.

That’s why old-style enterprise search fails. There isn’t a set of global attributes that are applicable across all these different products. If you organize your data with XML or JSON and put it through a NoSQL engine, you can really make use of the product data as it’s meant to be used. You’ll be able to know why a person purchased something, because you’ll see which actual attributes were important to them when they searched.
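As a small illustration of seeing which attributes mattered to buyers, the Python sketch below tallies the attributes used in searches that ended in a purchase. The `search_log` structure and its field names are hypothetical; the source does not describe how such events are recorded.

```python
from collections import Counter

# Hypothetical search events: the filters a shopper applied, and the outcome.
search_log = [
    {"filters": {"tip_type": "slotted", "insulated": True}, "purchased": True},
    {"filters": {"tip_size_in": 0.125}, "purchased": False},
    {"filters": {"insulated": True, "tip_size_in": 0.125}, "purchased": True},
]


def attributes_behind_purchases(log):
    """Count how often each attribute appears in searches that led to a purchase."""
    counts = Counter()
    for event in log:
        if event["purchased"]:
            counts.update(event["filters"].keys())
    return counts


usage = attributes_behind_purchases(search_log)
print(usage.most_common())
```

This works only because the search itself is expressed as attribute filters; a plain keyword log would not show which attribute a term referred to.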

When the purchase is made, you’ll also be able to distinguish between those attributes to give a customer additional features or make recommendations for additional products that are necessary. In the future, I think you’ll have the ability to use this unstructured format for transactional processing as well.


