July 16, 2014
As data piles up, organizations' ability to store, integrate and access it declines. That's because traditional data warehousing methods are ill-equipped to deal with today's data deluge. Conforming every tidbit of data to a single data model in order to make sense and use of it is simply unrealistic in the age of big data. We need a different way.
The data lake concept, which we describe in detail in "Data lakes and the promise of unsiloed data", is a fundamentally different approach to data management that is showing promise. To put it directly, you "dump" data for analytics into repositories—often Hadoop-based—without restructuring the data up front. The assumption is that users will discover, apply or reuse the structure they need when they're doing their explorations. By taking this unkempt approach, companies aren't trying to make the data accessible to a mass audience, and they don't always know what they'll find. And that's OK. The lake will evolve and gain broader utility later.
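The "structure on exploration" idea above is often called schema-on-read. As a minimal sketch (the record contents and field names here are hypothetical, purely for illustration), raw records land in the lake untouched, and each analyst applies only the structure their own question requires at query time:

```python
import json

# Hypothetical raw events "dumped" into the lake as-is, with no upfront schema.
# Records may have different fields; nothing is rejected on ingest.
raw_records = [
    '{"user": "alice", "action": "login", "ts": "2014-07-16T09:00:00"}',
    '{"user": "bob", "action": "purchase", "amount": 42.50}',
    '{"user": "alice", "action": "logout"}',
]

def schema_on_read(lines, fields):
    """Apply a structure chosen by the analyst at query time,
    tolerating records that lack some of the requested fields."""
    for line in lines:
        rec = json.loads(line)
        yield {f: rec.get(f) for f in fields}

# Two different explorations over the same raw data, each with its own "schema".
actions = list(schema_on_read(raw_records, ["user", "action"]))
purchases = [r for r in schema_on_read(raw_records, ["user", "amount"])
             if r["amount"] is not None]
```

The point of the sketch is that neither exploration forced a change to the stored data; a third analyst with a different question could impose a third structure tomorrow.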
Now that data storage is cheap, information is vast and newer database technologies don't require an agreed-upon schema up front, discovery analytics is finally possible. With data lakes, companies employ data scientists who are capable of making sense of untamed data as they trek through it. They can find correlations and insights within the data as they get to know it.
As data scientists, other statisticians and business experts traverse the terrain, they leave guideposts for those who follow. Think of them as modern-day Lewises and Clarks, scouting what's up ahead and gathering information so the rest of the colony can build the town and lay down track for the railroad.
Challenges and Opportunities of Data Lakes
A data lake is such a wildly different way of tackling the data dilemma that enterprises are grappling with how to adapt their processes and culture. Following are a few challenges that businesses need to overcome:
- Political Turf Wars: Even though Hadoop makes data accessibility possible, business owners can be resistant to sharing data for political reasons. Data is power.
- Complexity of Legacy Data: Many legacy systems contain a hodgepodge of software patches, workarounds and poor design. As a result, the raw data may provide limited value outside its legacy context. The data lake performs best when supplied with unadulterated data from source systems, but sometimes the owners of a source system have to make some effort to render their data understandable on its own.
- Metadata Lifecycle Management: Data lakes require advanced metadata management methods, including machine-assisted scans, characterizations of the data files and lineage tracking for each transformation. Should schema-on-read be the rule and predefined schema the exception? It depends on the sources. The former is ideal for working with rapidly changing data structures, while the latter is appropriate for sub-second query response on highly structured data.
- Desolate Data Islands: Business units often discover that data lakes are cheap and fast to build, so they spin up their own and then abandon them. By circumventing the centralized IT function, business units can create a chain of desolate data islands instead of a "land of lakes" that can flow into each other.
- The Issue of Integration: The integration required to turn that data into actionable insights is a substantial challenge. Integrating the data takes place at the Hadoop layer, while contextualizing the metadata (providing views of selected data, in other words) takes place at schema-creation time. A secure "integration fabric" is necessary to link the lakes and centralize data from multiple sources to provide a comprehensive, rich repository of enterprise-wide information for analysis and insight.
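The lineage tracking mentioned in the metadata point above can be sketched in a few lines. This is a toy illustration, not a real metadata tool: the fingerprinting scheme and record fields are assumptions, but they show the core idea of recording, for every transformation, what ran, on which input, and what it produced:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """A short content hash used to identify a dataset version."""
    return hashlib.sha256(data).hexdigest()[:12]

lineage = []  # the metadata catalog: one entry per transformation

def transform(name, fn, data: bytes) -> bytes:
    """Apply a transformation and record a lineage entry linking
    the input fingerprint to the output fingerprint."""
    out = fn(data)
    lineage.append({
        "step": name,
        "input": fingerprint(data),
        "output": fingerprint(out),
    })
    return out

raw = b"Alice,LOGIN\nBob,PURCHASE\n"
lowered = transform("lowercase", lambda d: d.lower(), raw)
```

With entries like these, an analyst who finds a file in the lake can trace back through the chain of fingerprints to see how it was derived.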
Some say the data lake is a dream, but we know of organizations that are making this approach a reality. Let us know in the comments if you’re one of the forward-thinking enterprises that are experimenting with data lakes.