The promise of graph databases in public health

June 8, 2016


One of the main advantages of a NoSQL graph store is web-scale discovery.

When a multinational biotech firm needed an advisory board member, it wanted a certain top US physician in its field—someone who was already a busy advisor to the firm’s parent company. “Placing him on the board in Latin America was a lower priority for the global group,” says Muriel Siadak, a medical affairs director who conducted the global search for the right person. “Obviously, it was a high priority for the Latin American affiliate.”

Siadak turned to Zephyr Health, which aggregates thousands of life science data sources about people, places, pills, and other things, and provides a software-as-a-service (SaaS) platform with search tools. Zephyr Health converts its sources into document database format, and then layers a graph store (or database) on top to easily traverse and search the underlying data. Siadak launched a search to find just the right fit for the Latin American affiliate. The richness of the diverse detail she could explore with relative ease was a key to quickly finding the perfect needle in the haystack—two, actually.

“I found two fairly frequent users of the company’s products, [pharmaceutical] investigators with a lot of experience who had done their medical training in Latin America, so they obviously spoke Spanish,” she says. “They would be able and would have interest, because they grew up there, to be a part of the Latin American effort. The affiliate was very pleased to have someone [Siadak] on a global level come back and say, ‘Look, it’s not just the top name that you need.’”



If you could continually integrate thousands of external enterprise data sources, add internal ones on a custom basis by request, and tailor the whole so it’s appropriately accessible to a range of business users through a single application platform, what could you do that you haven’t been able to? What Zephyr Health enables in the life sciences illustrates one of the myriad possibilities.

The power of this level of data aggregation is just now becoming apparent. In the biotech industry, specific skills and knowledge are always at a premium. Companies in public health are starting to use graph-facilitated SaaS to solve business problems. Some of the world’s most knowledge-intensive organizations, including multinational banks, media companies, space agencies, and logistics companies, are also using graph databases, and intelligence agencies have been using them for a decade. Others will follow.

The graph store is one of many innovations creating a sea change in database technology. This article provides a deeper look at graph technology and how it is similar to other NoSQL[1] data stores, which are explored in an earlier article.

What is a graph store?

At Zephyr Health, the pivotal technology is a Neo4j graph store used to find and traverse relationships between entities in data originally ingested using MongoDB, a document database. Document databases are useful for unstructured data, but Zephyr Health had trouble with indexing and latency at web scale, which is why it added the graph database.

The standard corporate data storehouse, the relational database management system (RDBMS), cannot begin to provide the speedy, flexible search support of a graph. An RDBMS needs absolute consistency among its rows and columns. The difference between a “join” in an RDBMS and in a graph store is like the difference between a precision dovetail joint in woodwork and a freeform Tinkertoy construct. Graphs only need to join or connect at a single point to have useful meaning in searches.

Effectively using graph technology poses its own challenges: the technology is relatively new and only recently capable of web-scale integration tasks. To take advantage of the technology, Zephyr Health needed to resolve some scalability and latency issues associated with web-scale graph environments.

A NoSQL graph store contains malleable maps of entities (named people, places, and things) and how they’re related. Entities become the nodes and relationships become the connections (developers call them edges) in these maps, which can take any shape. If you’re modeling your extended organization, for example, the relationships that appear in a native graph store can add to or become your model.
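The node-and-edge idea can be sketched in a few lines of Python. This is only an illustrative in-memory structure, not how Neo4j or any other product stores data; the class, the node names, and the relationship labels are all hypothetical.

```python
from collections import defaultdict


class Graph:
    """A tiny in-memory graph: named nodes joined by labeled, directed edges."""

    def __init__(self):
        self.nodes = {}                    # node id -> attribute dict
        self.outgoing = defaultdict(list)  # node id -> [(relationship, target id)]

    def add_node(self, node_id, **attributes):
        self.nodes.setdefault(node_id, {}).update(attributes)

    def add_edge(self, source, relationship, target):
        # Any node may connect to any other node, so the map can take any shape.
        self.add_node(source)
        self.add_node(target)
        self.outgoing[source].append((relationship, target))

    def related(self, source, relationship):
        """Return the targets reached from source over one kind of relationship."""
        return [t for r, t in self.outgoing[source] if r == relationship]


# Hypothetical extended organization.
org = Graph()
org.add_node("ana", kind="person")
org.add_node("acme_latam", kind="affiliate")
org.add_node("acme_global", kind="company")
org.add_edge("ana", "WORKS_FOR", "acme_latam")
org.add_edge("acme_latam", "SUBSIDIARY_OF", "acme_global")
org.add_edge("ana", "ADVISES", "acme_global")

print(org.related("ana", "WORKS_FOR"))
```

Nothing in the structure forces a hierarchy: the person node points at both the affiliate and its parent company.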

What’s different in a graph store from a database perspective is the sheer volume of connections, or relationships—how people, places, and things relate to one another through those interactions.

If your data is rich, you’ll see lots of relationships between the entities in native graph form. Older database technologies place less emphasis on relationships, resulting in less context. Graphs offer the chance for richer context through more connections and any-to-any data models rather than the usual tabular or hierarchical models. Graphs can store converted tabular and hierarchical document object information, too, but the graph’s uniqueness and power is in its ability to map any-to-any relationships. Relationship richness of this kind boosts the integration potential and the contextual relevance of the data being represented.

The verb in the sentence diagrammed in the graphic below is where the richness lies. Entities can also have attributes, such as keys to identify them, but verb-style relationships in this sense describe the entities in a richer way than simple identifiers or other labels do. And those relationships can be mined in ways most conventional analytics techniques haven’t explored yet, because they’re not optimized for graphs.



The verbs carry much of the power. The more that relationships can be mapped in the online world, the more high-level understanding can be tapped about those entities, the various clusters of those entities, and their nature. For example, social networks mapped in a graph can capture who’s married to whom, who works for whom, who went to school with whom, and so on.

Each added relationship enriches the profile of the persons, or entities, that are connected. Graphs can describe not only social relationships, but any relationship, whether it’s an investigator’s use of an experimental drug to treat a malignancy, the effect of that treatment on a given patient, or the cost of a single dosage of a given amount of the drug.
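To make the enrichment idea concrete, here is a minimal sketch that treats each relationship as a subject-verb-object statement and builds an entity's profile from every statement that mentions it. The entity names and verbs are invented for illustration.

```python
# Each statement is (subject, verb, object); all names are hypothetical.
statements = [
    ("investigator_a", "TREATED", "patient_1"),
    ("investigator_a", "PRESCRIBED", "drug_x"),
    ("drug_x", "TARGETS", "malignancy_m"),
    ("patient_1", "RESPONDED_TO", "drug_x"),
]


def profile(entity, statements):
    """Collect every relationship an entity takes part in, in either direction."""
    facts = []
    for subject, verb, obj in statements:
        if subject == entity:
            facts.append((verb, obj))
        elif obj == entity:
            facts.append((f"inverse:{verb}", subject))
    return facts


before = profile("investigator_a", statements)

# One more relationship makes the profile richer without any schema change.
statements.append(("investigator_a", "TRAINED_AT", "university_u"))
after = profile("investigator_a", statements)

print(len(before), len(after))
```

The profile of the drug is built the same way, from statements in which it appears as either subject or object.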



In the online environment, anything can be described in terms of entities and relationships, but design considerations factor in heavily because graphs can be computer-memory hungry and don’t like to be partitioned across separate machines. The most efficient way to analyze, traverse, or “walk” a big graph out to the nodes and connections you’re looking for can be to load the whole graph into the main memory (RAM) of a single physical or virtual server.

Because of the memory required, the scale of data amenable to whole graph analysis has historically been smaller than what other database structures are able to handle. However, some vendors have been developing workarounds that help users deal with just a piece of the graph at a time.
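Walking a graph that is held entirely in memory can be sketched as a breadth-first traversal. The adjacency map below is a hypothetical toy, and the hop limit is one simple way of dealing with just a piece of the graph at a time; it is not a description of any vendor's workaround.

```python
from collections import deque


def walk(adjacency, start, max_hops):
    """Breadth-first walk: return {node: hops from start} for nodes within max_hops."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


# Hypothetical graph held as an in-memory adjacency map.
adjacency = {
    "physician": ["hospital", "trial"],
    "hospital": ["city"],
    "trial": ["drug", "sponsor"],
    "drug": ["sponsor"],
}

reachable = walk(adjacency, "physician", max_hops=2)
print(reachable)
```

Every lookup here is a dictionary access in RAM, which is why the walk is fast when the graph fits on one machine and becomes harder once the adjacency map has to be split across several.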

The structural uniqueness of graphs

The main difference between document trees and graphs is the degree of structure. Graphs are a further step forward when more structure is needed. While document objects in JSON (JavaScript Object Notation) and XML (Extensible Markup Language) have parents and children, graphs also have other relatives, friends, and acquaintances.
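One rough way to see the difference is to count parents. In a document tree every node has at most one parent; in a graph a node can be reached from many others. The sketch below uses hypothetical (parent, child) pairs and checks only that one condition, so it is a simplification rather than a full tree test.

```python
from collections import Counter


def incoming_counts(edges):
    """Count how many edges point at each node."""
    return Counter(target for _, target in edges)


def is_tree_shaped(edges):
    """True when no node has more than one parent (necessary for a tree)."""
    return all(count <= 1 for count in incoming_counts(edges).values())


# Parent-child only, as in a JSON or XML document.
tree_edges = [("company", "affiliate"), ("affiliate", "advisor")]

# The same nodes with "other relatives, friends, and acquaintances" added.
graph_edges = tree_edges + [("company", "advisor"), ("advisor", "affiliate")]

print(is_tree_shaped(tree_edges), is_tree_shaped(graph_edges))
```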

Another way to think about structure is to remember that taxonomies, like document objects, are hierarchical and have just parents and children. If you need a richer classification scheme, you would use an ontology, which is a flexible schema, or data domain description, that articulates specific data contexts.

Document objects are thus taxonomic and tree-like in their parent-child hierarchies. The data model of a graph is a path to richer and more realistic descriptions. Graphs are ontological in nature, in the sense that meaning can reside in any described relationship between any two entities or nodes. In the semantic world of Resource Description Framework (RDF)[2] graphs, ontologies are stored alongside instance data in the same “web.”

And here’s why taxonomies and ontologies are important in “schemaless” NoSQL stores: the schema or classification scheme, taxonomy, or ontology can be part of or derived from a data environment of rich data and metadata, and it can evolve as that environment evolves.

Ontologies have historically been hand-built, but the potential is opening up for rich domain description discovered in the data and its relationship metadata through machine-assisted or inferred relationship mapping. The connections and interactions between things are where most of the contextually based meaning in data resides. The connections in a large data aggregation platform such as that of Zephyr Health might be sparse as initially constituted, but new relationships can be inferred over time, and machine learning could help to derive additional context.
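Inferring a new relationship from existing ones can be illustrated with a single hand-written rule. This is a deliberately simple stand-in, not machine learning and not Zephyr Health's method: if two subjects share the same object for a given verb, a new link between them is proposed. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def infer_shared(statements, verb, inferred_verb):
    """Propose a link between every pair of subjects that share an object for verb."""
    groups = defaultdict(set)
    for subject, v, obj in statements:
        if v == verb:
            groups[obj].add(subject)
    inferred = set()
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            inferred.add((a, inferred_verb, b))
    return inferred


statements = [
    ("dr_a", "TRAINED_AT", "univ_1"),
    ("dr_b", "TRAINED_AT", "univ_1"),
    ("dr_c", "TRAINED_AT", "univ_2"),
    ("dr_a", "USES", "drug_x"),
]

new_links = infer_shared(statements, "TRAINED_AT", "SHARED_TRAINING_WITH")
print(new_links)
```

A sparse graph gains connections this way over time, and each inferred link can be stored as an ordinary relationship alongside the original ones.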



As long as you have sufficient computing horsepower, a graph can model a multivariate problem with greater accuracy than a tree can. The relationship metadata provides the descriptive power. Conceivably, autonomous software agents designed to manage the data model could evolve the relationship-driven model as the use case and the data change.

How graphs are similar to other NoSQL stores

Instead of simply storing data as values with keys or as document objects or tables, graph stores contain nodes and connections. Fundamentally, keys (or identifiers) and values (which could be any groupings of data) are the atomic building blocks for key-value, wide-column, and document stores, and these can also be the building blocks for graphs. Tables, documents, and graphs provide additional structure, in different shapes, with an increasing number of interconnections. And, as stated earlier, graphs can have many more interconnections.
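The claim that keys and values can also be the building blocks for graphs can be sketched with a plain dictionary standing in for a key-value store. The key layout (`node:…` and `edge:…:…`) is an assumption made for this example, not the scheme of any particular product.

```python
store = {}  # stands in for any key-value store


def put_node(store, node_id, properties):
    """Store a node's properties under a single key."""
    store[f"node:{node_id}"] = properties


def put_edge(store, source, relationship, target):
    """Store outgoing edges as a list under a composite key."""
    store.setdefault(f"edge:{source}:{relationship}", []).append(target)


def neighbors(store, source, relationship):
    """One key lookup returns every target for a source and relationship."""
    return store.get(f"edge:{source}:{relationship}", [])


put_node(store, "dr_a", {"kind": "investigator"})
put_node(store, "drug_x", {"kind": "drug"})
put_edge(store, "dr_a", "PRESCRIBED", "drug_x")
put_edge(store, "dr_a", "PRESCRIBED", "drug_y")

print(neighbors(store, "dr_a", "PRESCRIBED"))
```

Only keys and values are stored; the interconnections come from how the keys are composed and followed.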

All data structures, from simple keys to hierarchies to graphs, can be represented to machines in the form of tags associated with the data.

With so much potential, why aren’t more people using graphs?

Why are other types of NoSQL stores predominant, while graph stores are sometimes considered a niche? Tables are familiar to most, trees are familiar to some, but graphs are unfamiliar and often puzzling to many.

That’s mainly due to a lack of maturity and familiarity, which can be overcome with more exposure to the different, more powerful kind of modeling that graphs offer for the systems-level optimization challenges at the core of business. Another part of the problem is that graph stores haven’t been easy to use. The current generation of graph stores is resolving that issue.

Graph store types

Three main types of graph stores have emerged during the past decade, each driven by a different data perishability and reuse category:

  • Property graphs, the most popular, consist of key-value pairs for each element. They don’t require standard semantics and are thus simpler to get started with. It is possible, however, to store semantic triples in property graphs.
  • Native RDF or semantic triple stores support the use of Internationalized Resource Identifiers (IRIs)[3] for detailed, standard semantics, which some developers have found difficult to use. RDF has a classic entity-relationship data model, which consists of subjects (entities), predicates or verbs (relationships), and objects (also entities). Adhering to the RDF standard offers the potential for global reuse, content mining, and a means of using the web for content management via dynamic semantic publishing, among other advantages.
  • Dynamic graphs can add new relationships at scale. These offer the most efficiency for labor-saving reusability and the chance to operationalize and integrate heterogeneous data environments, but they are the most immature.
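The contrast between the first two types can be sketched by expressing one property-graph relationship as a subject-predicate-object triple. The `http://example.org/` namespace, the element names, and the helper functions are hypothetical, and the output is a plain tuple of IRI strings rather than a complete RDF serialization.

```python
BASE = "http://example.org/"  # hypothetical namespace, not from the article


def iri(local_name):
    """Wrap a local name in the assumed namespace."""
    return f"<{BASE}{local_name}>"


# Property-graph style: the element carries its own key-value pairs.
property_edge = {
    "source": "investigator_a",
    "label": "PRESCRIBED",
    "target": "drug_x",
    "properties": {"dose_mg": 50},
}


def edge_to_triple(edge):
    """Express the relationship itself as one subject-predicate-object triple."""
    return (iri(edge["source"]), iri(edge["label"].lower()), iri(edge["target"]))


triple = edge_to_triple(property_edge)
print(triple)
```

The key-value pairs attached to the edge, such as the dose, have no slot in a single triple and would need additional modeling, which is one reason property graphs are often seen as simpler to start with.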


The less perishable the data, the more long-term investment comes into view for further articulating the data model to achieve better integration and reuse potential. As a general rule, data the furthest to the right on this lifecycle continuum would justify investment in RDF, a dynamic graph capability, or both. Large enterprises should explore all three kinds of graph store technology at this point and consider how use cases differ for each. They should remember that data-driven apps and big data analytics are only in their infancy, and that graph databases will play a much bigger role once those areas evolve.



Outlook: The least mature NoSQL type, but the most promise

Not every enterprise faces the challenge of a large-scale aggregation like the one at Zephyr Health. Depending on the use case, native graph stores can be overkill. If the immediate purpose is to capture or cache the data, then a key-value or column store is more appropriate. If the purpose is aggregation, then a document store may be best, at least for initial data ingestion. If transactional integrity and concurrency are critical requirements, then an RDBMS or a NewSQL store fits best.

Much depends on the business role played, the point in the data lifecycle, and the use case. If the challenge is to model or integrate large, networked systems and to monitor and optimize the interconnections, that’s when graph stores come into play. The current generation of graph stores is most helpful in thoughtful, considered systems-level analytics at the back half of the data lifecycle. Operational analytics and dynamic process integration along the lines of EnterpriseWeb are just emerging.
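The rules of thumb above can be summarized as a small lookup. The purpose labels are invented shorthand for the cases described in the text, and real selection decisions would weigh far more factors than a single key.

```python
# Purpose labels are hypothetical shorthand for the cases described in the text.
STORE_BY_PURPOSE = {
    "capture_or_cache": "key-value or column store",
    "aggregation": "document store",
    "transactions": "RDBMS or NewSQL store",
    "networked_systems": "graph store",
}


def suggest_store(purpose):
    """Return the store type the article associates with a purpose."""
    return STORE_BY_PURPOSE.get(purpose, "no single recommendation")


print(suggest_store("networked_systems"))
```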

In-memory technology provides one solution to a nagging latency problem for large-scale graph stores. Other solutions include cache sharding and other distributed graph store design innovations. Hybrid key-value/graph stores (such as Titan on Cassandra, Sqrrl on Accumulo, or OrientDB) or document/graph stores (such as MarkLogic) promise other advantages but may also introduce complexities.

 


  1. Structured query language, or SQL, is the dominant query language associated with relational databases. NoSQL stands for not only structured query language. In practice, the term NoSQL is used loosely to refer to non-relational databases designed for distributed environments, rather than the associated query languages. PwC uses the term NoSQL, despite its inadequacies, to refer to non-relational distributed stores because it has become the default term of art.
  2. RDF is Resource Description Framework, an official W3C Recommendation for Semantic Web data models.
  3. The IRI standard extends Uniform Resource Identifiers or URIs (a superset of URLs) with Universal Character Set support for languages such as Chinese, Japanese, Korean and Arabic.

Contacts

Chris Curran

Principal and Chief Technologist, PwC US Tel: +1 (214) 754 5055

Vicki Huff Eckert

Global New Business & Innovation Leader Tel: +1 (650) 387 4956

Mark McCaffery

US Technology, Media and Telecommunications (TMT) Leader Tel: +1 (408) 817 4199