Data Lake

Data lakes and data warehouses are both widely used for storing big data, but the terms are not interchangeable. A data lake is a vast pool of raw data whose purpose is not yet defined.

An agile approach to data-lake development can help companies launch analytics programs quickly and establish a data-friendly culture for the long term.

Increases in computer-processing power, cloud-storage capacity and usage, and network connectivity are turning the current flood of data in most companies into a tidal wave—an endless flow of detailed information about customers’ personal profiles, sales data, product specifications, process steps, and so on. The data arrive in all formats and from a range of sources, including Internet-of-Things devices, social-media sites, sales systems, and internal-collaboration systems.

Despite an increase in the number of tools and technologies designed to ease the collection, storage, and assessment of critical business information, many companies are still unsure how best to handle these data. Business and IT leaders have told us they remain overwhelmed by the sheer volume and variety of data at their disposal, the speed at which information is traversing internal and external networks, and the cost of managing all this business intelligence. Increasingly, they are also being charged with an even more complicated task: harnessing meaningful insights from all this business information.

These executives must expand their data-management infrastructures massively and quickly. An emerging class of data-management technologies holds significant promise in this regard: data lakes. These storage platforms are designed to hold, process, and analyze structured and unstructured data. They are typically used in conjunction with traditional enterprise data warehouses (EDWs), but in general, they cost less to operate than EDWs. Cost savings result because companies can use affordable, easy-to-obtain hardware and because data sets do not need to be indexed and prepped for storage at the time of ingestion. Data are held in their native formats and reconfigured only when needed, as needed. Relational databases may also need to be managed as part of the data-lake platform, but only to make it easier for end users to access some data sources.
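
Holding data in native formats and reconfiguring them only when needed is often called schema-on-read. The sketch below is one minimal way to picture it, not a description of any particular data-lake product. The names RAW_ZONE, read_sales, and total_by_customer, the field names, and the sample records are all hypothetical and chosen only for illustration.

```python
import csv
import io
import json

# Hypothetical "raw zone": each payload is stored exactly as it arrived,
# tagged only with its format. Nothing is indexed or reshaped on the way in.
RAW_ZONE = [
    ("json", '{"customer_id": "c-17", "amount": "19.99", "channel": "web"}'),
    ("csv", "customer_id,amount\nc-42,5.00\nc-17,12.50\n"),
]


def read_sales(raw_zone):
    """Apply a schema at read time, yielding (customer_id, amount) pairs."""
    for fmt, payload in raw_zone:
        if fmt == "json":
            record = json.loads(payload)
            yield record["customer_id"], float(record["amount"])
        elif fmt == "csv":
            for row in csv.DictReader(io.StringIO(payload)):
                yield row["customer_id"], float(row["amount"])
        # Payloads in formats this reader does not understand are skipped
        # here and remain in the raw zone untouched.


def total_by_customer(raw_zone):
    """Aggregate sales per customer from the raw payloads."""
    totals = {}
    for customer_id, amount in read_sales(raw_zone):
        totals[customer_id] = round(totals.get(customer_id, 0.0) + amount, 2)
    return totals


if __name__ == "__main__":
    print(total_by_customer(RAW_ZONE))
```

The design point is that the cost of interpreting the data is paid by the question being asked, not at load time. A different analysis could read the same raw payloads with a different schema, for example by using the channel field that this reader ignores.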

There is a lot for companies to like about data lakes. Because data are loaded in “raw” formats rather than preconfigured as they enter company systems, they can be used in ways that go beyond basic capture. For instance, data scientists who may not know exactly what they are looking for can find and access data quickly, regardless of format. Indeed, a well-maintained and governed “raw data zone” can be a gold mine for data scientists seeking to establish a robust advanced-analytics program. And as companies extend their use of data lakes beyond small pilot projects, they may be able to establish “self-service” options that let business users generate their own data analyses and reports.

We help companies to apply an agile approach to their design and rollout of data lakes—piloting a range of technologies and management approaches and testing and refining them before getting to optimal processes for data storage and access.