April 18, 2016
How data lakes can help reduce costs, increase efficiency, and boost innovation in the enterprise.
Business and technology leaders are struggling to keep pace with a massive glut of data from digitization, the internet of things, machine learning, and cybersecurity, to name just a few sources. A data lake, which combines data assets, technology, and analytics to create enterprise value at massive scale, can help businesses gain control over their data.
A few years ago, our colleague Oliver Halter wrote an article about how data lakes reflect the future promise of big data and analytics. Fast forward to 2016, and these notions are no longer a pipe dream. The data lakes ecosystem is now a multibillion-dollar economy.
Although the term ‘data lake’ is not new, mainstream enterprise adoption has been sluggish for two reasons. First, the core technology in areas like security and performance has evolved rapidly, making it hard for businesses to keep pace. Second, the Hadoop platform (often used to implement the data lake) lacks a human-centered design and an intuitive abstraction layer, which makes it difficult to adopt and sustain.
Fortunately, Hadoop is known for integration and, in combination with an ecosystem of tools and technologies, supports a wide variety of uses. The Hadoop platform is also maturing, shifting from an esoteric, technology-centric platform to a more user-friendly, business-centric one. The emergence of cloud platforms is helping in this regard.
The role of the data lake in today’s enterprise
Data lakes are high-performance computing repositories that allow low-cost storage of large quantities and varieties of raw data, both structured and unstructured. You can persist data on disk or stream it in memory. Unlike the rigid, predefined relational data model, the data schemas in a data lake evolve continuously with business needs. This built-in flexibility enables many uses within a dynamic environment.
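To make the contrast with a predefined relational model concrete, here is a minimal Python sketch of the idea often called schema-on-read: raw records are stored as they arrive, and each use case applies its own schema when it reads them. The sample events, the field names, and the `read_with_schema` helper are all hypothetical and chosen only for illustration; a real data lake would typically do this with a distributed engine rather than an in-memory list.

```python
import json

# Hypothetical raw events, landed in the lake exactly as they arrived.
# Note that the records do not share a common set of fields.
RAW_EVENTS = [
    '{"store_id": 17, "sku": "A-100", "qty": 4}',
    '{"store_id": 17, "sku": "B-220", "qty": 1, "channel": "mobile"}',
    '{"customer": "c-9", "survey_score": 8}',
]


def read_with_schema(raw_lines, fields):
    """Project each raw record onto the requested fields at read time.

    Records that lack any requested field are skipped for this view only;
    nothing was rejected or reshaped when the data was written.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if all(name in record for name in fields):
            rows.append({name: record[name] for name in fields})
    return rows


# Two different "schemas" applied to the same raw data.
inventory_view = read_with_schema(RAW_EVENTS, ["store_id", "sku", "qty"])
survey_view = read_with_schema(RAW_EVENTS, ["customer", "survey_score"])
```

The same raw data supports both an inventory view and a survey view, and a new view can be added later without reloading or restructuring what is already stored.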
Our clients use data lakes for many reasons: to deal with high data volumes and a wide variety of data structures, and to deploy scalable data management and analytical processes in an agile and cost-effective manner. Because these processes run on scalable computing infrastructure, data lakes can also deliver dramatic increases in speed.
One of our national retailer clients uses a data lake to monitor inventory across more than 2,000 retail outlets in real time, allowing it to rapidly replenish fast-moving products and drive sales. The retailer’s legacy enterprise data warehouse was inefficient and unable to handle large data volumes and complex data like weblogs, mobile interactions, and customer surveys. The data lake serves as a staging area or ‘cold storage’ for the data warehouse, reducing the cost of storing data that is rarely used or whose value is not yet proven. This storage can be helpful for annual regulatory reporting, or for mining a terabyte of weblogs to look for the needle in the haystack.
Data lakes can also boost innovation. Data scientists and analysts can explore ad hoc data and generate ideas within a rapid test-and-learn environment. One large software client leverages the NoSQL data store component in the data lake to recommend product and service bundles to its online customers. The recommendations are based on past transactional histories and interactions via analytics-driven, API-based web and mobile applications. The scalable and flexible nature of the NoSQL data store can make it easier to iteratively develop web and mobile applications in an agile manner.
The data lakes ecosystem
In considering whether you need a data lake, it’s helpful to remember that it’s a rapidly evolving ecosystem that is complementary to the data warehouse. Many businesses optimize their technology investments by offloading workloads related to data discovery, ad hoc analysis, and API-based applications to data lakes, which integrate with existing enterprise data warehouses.
In considering the technology landscape, most enterprises need a holistic framework to select the ‘right fit’ platform. Many businesses work with a handful of strategic technology vendors to avoid inertia during the selection phase.
While Hadoop is open source, major technology vendors provide pre-engineered, enterprise-grade solutions that can be deployed on premises. Leading cloud platforms are well suited to data lake deployments because they reduce complexity, free up resources, and promote agility. Many start-ups are disrupting this space by packaging all layers of Hadoop with data preparation, visualization, and analytics components.
Major technologies provide connectors and native integration into the Hadoop platform. Look under the hood and ensure that the Hadoop components you require have the right integration. For instance, many technologies are still being integrated with the in-memory Spark engine in the Hadoop ecosystem. Importantly, data lakes require a different mindset geared toward ‘test, learn, prove value or fail fast’, which is a departure from the traditional enterprise data warehouse.
We’ve seen tremendous advancements since we began writing about data lakes in the PwC technology forecast series. Looking ahead, external market forces will propel enterprises of all sizes to embrace and adopt the data lake as a centerpiece of their enterprise data and analytics strategy.
If you are among the leaders currently championing the data lake in your enterprise, we welcome your perspective.