Introduction to the data lake concept

The term data lake is being tossed around a lot, and in many conversations I see it used interchangeably with a Hadoop cluster. While in most cases a data lake is indeed implemented using the Hadoop stack, I'd like to describe the conceptual idea behind the term.


What is it?

A data lake is a data repository that stores data in its raw form, or very close to it, alongside structured, normalized data.

Data items are tagged with metadata to make sense of the data available in the lake and to support data discovery, taxonomy, maintenance operations, and so on.

Data is made available for processing using a collection of data processing tools, each suitable for a different structure level of the data, e.g. SQL engines for relational data, and scripts or tools such as Spark, R, or pandas for unstructured data.
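
To make this concrete, here is a minimal sketch (the records and field names are made up for illustration) of handling semi-structured data with pandas, one of the tools mentioned above, where records don't share a fixed schema:

```python
import json
import pandas as pd

# Raw, semi-structured records as they might land in the lake:
# each line is a JSON object, and the fields vary between records.
raw_lines = [
    '{"event": "click", "page": "/home", "user": "a1"}',
    '{"event": "search", "query": "data lake", "user": "b2"}',
]

# pandas normalizes the varying fields into columns, filling missing
# values with NaN -- no upfront relational schema is required.
df = pd.DataFrame(json.loads(line) for line in raw_lines)
print(df.columns.tolist())  # ['event', 'page', 'user', 'query']
```

A relational engine would force us to declare those columns before loading; here the structure emerges from the data itself.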

Data lakes usually exist to serve as an organization's single data repository, enabling a centralized view of all of the data in the organization. While this sounds a lot like an Enterprise Data Warehouse (EDW), there are some key differences:


Enterprise Data Warehouse

  • EDW projects usually consist of ETL processes that transform, aggregate, and join data from the organization's transactional (OLTP) applications into a relational structure suitable for analytical queries (schema on write)
  • The data is stored in a relational database, optimized for analytical queries

Data lake

  • Data is loaded into raw storage; no attempt is made to put it into a relational or query-optimized structure at this stage
  • Data is put into structure per usage scenario (schema on read)
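
As an illustration of schema on read (a toy sketch using standard-library parsing, not any particular engine), the same raw bytes can be given a different structure per usage scenario, at read time:

```python
import csv
import io
from datetime import date

# The same raw data sits untouched in the lake.
raw = "2024-01-05,click,/home\n2024-01-06,search,/results\n"

# Usage scenario 1: one consumer only cares about the event names.
events = [row[1] for row in csv.reader(io.StringIO(raw))]
print(events)  # ['click', 'search']

# Usage scenario 2: another consumer needs typed dates.
# The structure is imposed when reading, not when writing.
dates = [date.fromisoformat(row[0]) for row in csv.reader(io.StringIO(raw))]
print(dates[0].year)  # 2024
```

Neither consumer changed the stored data; each applied its own schema on the way out.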

Why do we need it?

Using modern technologies, we are able to process data in many different forms and formats. In addition to relational data generated by OLTP applications, we can process:

  • Data generated by website click streams, sensors, application logs, and so on
  • Data in various formats e.g. text, images, audio
  • Data in batch as well as streaming mode

The EDW approach of storing and processing data in a relational form is not suited to processing the above data types and formats. Instead, the data lake approach provides the flexibility we need:

We can use the most suitable tool to process each type of data, and possibly build a pipeline of several complementary tools to process the same source data, e.g. extracting entities from text, audio, or video using relevant tools and then finding relations between those entities using a graph processing engine.
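
A toy sketch of such a pipeline (the documents, entity names, and extraction output below are all made up, and a plain adjacency dict stands in for a real graph engine):

```python
from collections import defaultdict
from itertools import combinations

# Stage 1 (assumed output of an entity-extraction tool run over
# text/audio/video sources): entities found per source document.
extracted = {
    "doc1": ["Acme Corp", "Jane Doe"],
    "doc2": ["Jane Doe", "Globex"],
}

# Stage 2: a graph-processing step relates entities that co-occur
# in the same source document, building an undirected graph.
graph = defaultdict(set)
for entities in extracted.values():
    for a, b in combinations(entities, 2):
        graph[a].add(b)
        graph[b].add(a)

print(sorted(graph["Jane Doe"]))  # ['Acme Corp', 'Globex']
```

The point is that each stage uses a tool suited to its data: extraction tools for the unstructured sources, a graph structure for the relations.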

We can apply different analytical techniques, tools, and approaches to the same source data, without the limitations that come from optimizing it for a particular use case, e.g. transforming the data into a relational structure or aggregating values, which is very common in EDW solutions.

Tools and technologies

The data lake is a concept; we don't have to use a specific technology to implement one. Still, it wouldn't be wise to ignore technology stacks that provide a set of complementary tools offering the functionality needed for a data lake implementation.

But what do we need?

  • A highly available, scalable raw storage repository
  • A set of processing engines that can easily access the stored data
  • The ability to tag and attach metadata to data items
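
To illustrate the third requirement, one simple (hypothetical) convention is a sidecar metadata record stored alongside each raw data item; the field names here are illustrative, not a standard:

```python
import json

# A hypothetical sidecar record describing one raw item in the lake.
metadata = {
    "path": "raw/clickstream/2024/01/05/part-0001.json",
    "source": "website clickstream",
    "format": "json-lines",
    "ingested_at": "2024-01-05T12:00:00Z",
    "tags": ["web", "events", "raw"],
}

# Serialized next to the data item, it supports discovery, taxonomy,
# and maintenance without touching the raw data itself.
sidecar = json.dumps(metadata, indent=2)
print("events" in json.loads(sidecar)["tags"])  # True
```
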

Hadoop is the best-known platform for implementing data lakes, but existing cloud offerings provide powerful alternatives.

In a follow-up blog post, I'll discuss how to leverage metadata to make data lake projects a success.

