5 leading practices for data lakes

June 1, 2016



The right approach to data lakes is essential to ensure you get the most useful insights from your massive data stores.

Mars’ surface is dotted with depressions that may have once held lakes, captivating scientists who can study the areas to find clues to the planet’s history. A new form of lake is emerging on Earth. Instead of water, it holds pools of data that give data scientists rich material for sharpening and speeding executive decisions.

Yet as more companies start to use data lakes, the right approach is essential. Without it, transformative insights sink like stones beneath the murk of a data swamp. These five leading practices can help you capitalize on the opportunity:

Take a business-centric approach

Many businesses dive in without a strategic perspective, which leads to mistakes and mismatched priorities. Start with a business-centric and, importantly, value-based and iterative approach to drive the lake architecture, design, and implementation. Ask these questions to help ensure your data lake has a business-centric operating model:

  • How will the data lake create and capture value?
  • What skills do we need to manage the lake and its technologies, and to derive information, insight, and value?
  • What incentive model will foster innovation and new insights?
  • How will we prioritize new demands?
  • What services will the lake provide?
  • Do we need data discovery and ideation services via self-service?
  • How will we divide responsibility between business and IT?
  • Will we allow external access from strategic partners and vendors?
  • How will the data lake co-exist with other ecosystems (e.g., data warehouse, analytical applications)?
  • What is our operations and support strategy?
  • Will the platform extend to multiple geographies?

The new “capability architect” is well-suited to take the lead in defining a holistic data lake operating model and services catalogue.

Talk to stakeholders early

Before building a data lake, talk to relevant stakeholders to understand priorities, and bake the findings into the strategy.

A Media & Entertainment client initially wanted to use a data lake to store massive volumes of digital data, perform complex computations, and deliver information to downstream applications. IT set to work without first connecting with business stakeholders to confirm their needs, which included self-service data discovery and other analytics capabilities. When that feedback arrived after work had begun, it delayed implementation by six months.

Identify the right technologies

Identify the desired native technology components, primarily open source and other third-party commercial technologies, with the right deployment model (cloud vs. on premises). Many capabilities, such as metadata management, semantic analytics, and visualization tools, are still evolving in the open source space. Too often, businesses try to design a modern data lake based on traditional approaches, for example by building cumbersome data models and selecting a data integration engine that doesn’t leverage native capabilities of Hadoop.

Consider ‘adaptive data preparation’ tools, which ingest a variety of data formats (e.g., clickstream data from websites and mobile devices, legacy transaction data, and third-party demographics) and use scalable data processing and in-memory engines like Spark, together with machine learning techniques, to automatically tag and catalog data and identify data relationships. Consider analytical tools that bring logic to the data stored in the lake, not vice versa.
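To make the tagging and cataloging idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular product works: simple rules stand in for the learned techniques such tools apply, plain Python lists stand in for data an engine like Spark would process at scale, and every name in it (infer_tag, catalog_dataset, shared_columns, the sample records) is hypothetical.

```python
from collections import defaultdict


def infer_tag(values):
    """Assign a coarse type tag to a column from its sample values."""
    def is_number(value):
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False

    non_null = [v for v in values if v not in (None, "")]
    if not non_null:
        return "empty"
    if all(is_number(v) for v in non_null):
        return "numeric"
    return "text"


def catalog_dataset(name, records):
    """Build a catalog entry: the dataset name plus a tag for every column seen."""
    columns = defaultdict(list)
    for record in records:
        for key, value in record.items():
            columns[key].append(value)
    return {"dataset": name,
            "columns": {col: infer_tag(vals) for col, vals in columns.items()}}


def shared_columns(entry_a, entry_b):
    """Flag candidate relationships: columns with the same name and the same tag."""
    return sorted(col for col, tag in entry_a["columns"].items()
                  if entry_b["columns"].get(col) == tag)


# Hypothetical sample data in two different shapes.
clickstream = [{"customer_id": "17", "page": "/home"},
               {"customer_id": "42", "page": "/cart"}]
transactions = [{"customer_id": 17, "amount": "19.99"},
                {"customer_id": 42, "amount": "5.00"}]

clicks_entry = catalog_dataset("clickstream", clickstream)
txn_entry = catalog_dataset("transactions", transactions)
candidate_keys = shared_columns(clicks_entry, txn_entry)
```

In this toy case the only column the two datasets share is customer_id, so it is surfaced as a candidate link between them. A real tool would also weigh value overlap and distributions rather than names alone.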

One of our financial services clients increased operational efficiency and performance tenfold by storing large quantities of financial transactions (more than 100 million rows) in memory to enable on-demand analytics. The age of denormalized datasets and on-demand analytics is here. Explore tapping into the ecosystem of such tools and techniques.
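As a rough illustration of what denormalized, in-memory, on-demand analytics means in practice, consider the small Python sketch below. It is not the client's implementation: the account and transaction records are invented, the function names are hypothetical, and a production system would hold the rows in a distributed in-memory engine, not a Python list.

```python
from collections import defaultdict

# Hypothetical reference data and transactions.
accounts = {
    "A1": {"segment": "retail"},
    "A2": {"segment": "corporate"},
}
transactions = [
    {"account_id": "A1", "amount": 120.0},
    {"account_id": "A2", "amount": 900.0},
    {"account_id": "A1", "amount": 30.0},
]


def denormalize(transactions, accounts):
    """Copy account attributes onto each transaction so later queries need no join."""
    return [{**txn, **accounts[txn["account_id"]]} for txn in transactions]


def total_by(rows, key):
    """Answer an ad hoc question by scanning the in-memory rows once."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)


wide_rows = denormalize(transactions, accounts)  # built once, kept in memory
totals = total_by(wide_rows, "segment")          # computed on demand
```

The trade-off is that the join cost is paid once, when the wide rows are built, so each later question becomes a single pass over data already in memory.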

Build the right organizational structure

Empower your capability architect to build a team that can smooth implementation, balance priorities and map to the vision and core capabilities. This role should also be able to communicate with an unconventional mix of stakeholders ranging from the Chief Marketing Officer to the Chief Financial Officer to the Chief Data Officer.

The capability architect should be knowledgeable and open-minded about emerging technologies to drive decisions that can make or break data lake implementation. As the lake will eventually connect ecosystems across an enterprise—such as data, analytics, digital, cloud and emerging technologies—the right organizational structure is critical.

Address risk and governance

Who will access the data lake? How will access be provided (e.g., SQL, visualization tools, APIs)? Address the compliance and regulatory requirements around security and access, and identify the tools and processes to catalog, link data sets, enable data provenance, and provide access to the right people. Many technologies are available to keep your lake from becoming a swamp.
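The governance pieces named above (a catalog, links between data sets, provenance, and access for the right people) can be pictured with a small Python sketch. It is a toy model, assuming a simple role-based scheme. The class names, roles, and data sets are all hypothetical, and real deployments would rely on dedicated catalog and security tooling.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    source: str                                        # provenance: where the data came from
    derived_from: list = field(default_factory=list)   # links to upstream data sets
    allowed_roles: set = field(default_factory=set)    # who may read this data set


class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def can_access(self, role, dataset):
        """Allow access only to registered data sets that list the role."""
        entry = self._entries.get(dataset)
        return entry is not None and role in entry.allowed_roles

    def lineage(self, dataset):
        """Walk upstream links and return every ancestor, nearest first."""
        ancestors = []
        queue = list(self._entries[dataset].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent not in ancestors:
                ancestors.append(parent)
                if parent in self._entries:
                    queue.extend(self._entries[parent].derived_from)
        return ancestors


catalog = Catalog()
catalog.register(CatalogEntry("raw_clicks", "web logs", [], {"data_engineer"}))
catalog.register(CatalogEntry("sessions", "batch job", ["raw_clicks"],
                              {"data_engineer", "analyst"}))
catalog.register(CatalogEntry("churn_scores", "model run", ["sessions"], {"analyst"}))
```

With these entries, an analyst can read churn_scores but not raw_clicks, and the lineage of churn_scores traces back through sessions to raw_clicks, which is the kind of answer a compliance review typically asks for.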

Data lakes present a tremendous opportunity to drive insights and make better decisions, but if they are not implemented properly, their potential can be missed. We consistently see these five practices within companies that are ahead of the data and analytics curve. If you want to capitalize on the data lake, look beyond technology alone and adopt a holistic business-driven approach. Are you among the leaders championing the data lake in your enterprise? What practices are paying off?




Chris Curran

Principal and Chief Technologist, PwC US. Tel: +1 (214) 754 5055

Vicki Huff Eckert

Global New Business & Innovation Leader. Tel: +1 (650) 387 4956

Mark McCaffery

US Technology, Media and Telecommunications (TMT) Leader. Tel: +1 (408) 817 4199