The term "data lake" is being tossed around a lot, and in many conversations I see it used interchangeably with "Hadoop cluster." While in most cases a data lake is indeed implemented using the Hadoop stack, I'd like to describe the conceptual idea behind the term.
What is it?
A data lake is a data repository that stores data in its raw form, or very close to it, alongside structured, normalized data.
Data items are tagged with metadata to make sense of the data available in the lake and to support data discovery, taxonomy, maintenance operations, etc.
Data is made available for processing through a collection of data processing tools, each suited to a different structure level of the data, e.g. SQL engines for relational data, and scripts or tools such as Spark, R, or pandas for unstructured data.
Data lakes usually exist to serve as an organization's single data repository, enabling a centralized view of all of the data in the organization. While this sounds a lot like an Enterprise Data Warehouse (EDW), there are some key differences:
- EDW projects usually consist of ETL processes that transform, aggregate, and join data from the organization's transactional (OLTP) applications into a relational structure suitable for analytical queries (schema on write)
- In an EDW, the data is stored in a relational database, optimized for analytical queries
- In a data lake, data is loaded into raw storage; no attempt is made to put it into a relational or query-optimized structure at this stage
- In a data lake, data is put into structure per usage scenario (schema on read)
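The schema-on-read idea can be sketched in a few lines of Python. This is a minimal illustration, not a real lake: the raw events and the reader function are hypothetical, and the point is only that structure is imposed by the consumer at query time, while the raw storage keeps every record as it arrived.

```python
import json

# Hypothetical raw click-stream events, stored exactly as they arrived.
# Schema on write would have forced these into one table up front.
raw_events = [
    '{"user": "u1", "page": "/home", "ts": 1700000000}',
    '{"user": "u2", "page": "/pricing", "ts": 1700000005, "referrer": "ad"}',
    'malformed line that a strict loader would reject',
]

def read_page_views(lines):
    """Apply structure at read time: keep only the fields this
    analysis needs, skipping records that don't fit."""
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # raw storage keeps the line; this reader ignores it
        yield (event["user"], event["page"])

print(list(read_page_views(raw_events)))
# → [('u1', '/home'), ('u2', '/pricing')]
```

A different consumer could read the very same raw lines with a different schema, e.g. keeping the `referrer` field, without any change to how the data is stored.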
Why do we need it?
Using modern technologies, we are able to process data in many forms and formats. In addition to the relational data generated by OLTP applications, we can process:
- Data generated by website click streams, sensors, application logs, etc.
- Data in various formats e.g. text, images, audio
- Data in batch as well as streaming mode
The EDW approach – store and process data in relational form – is not suited to these data types and formats. Instead, the data lake approach provides the flexibility we need:
We can use the most suitable tool to process each type of data, and possibly build a pipeline of several complementary tools over the same source data, e.g. extracting entities from text, audio, or video using the relevant tools and then finding relations between those entities using a graph processing engine.
We can apply different analytical techniques, tools, and approaches to the same source data, without the limitations that come from optimizing it for a particular use case, e.g. transformation into a relational structure, aggregation of values, etc., which is very common in EDW solutions.
Tools and technologies
The data lake is a concept; we don't have to use a specific technology to implement one. But it wouldn't be wise to ignore technology stacks that provide a set of complementary tools covering the functionality a data lake implementation needs.
But what do we need?
- A highly available, scalable raw storage repository
- A set of processing engines that can easily access the stored data
- The ability to tag and attach metadata to data items
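The metadata requirement in particular is worth a tiny sketch. The catalog, paths, and tag names below are all hypothetical; in practice this role is played by a dedicated metadata store or catalog service, but the shape of the operations is the same: attach metadata to a stored item, then discover items by their metadata.

```python
# In practice this would be a metadata store / catalog service;
# a dict keyed by storage path is a stand-in.
catalog = {}

def tag(path, **metadata):
    """Attach metadata to a raw data item identified by its storage path."""
    catalog[path] = metadata

def discover(**criteria):
    """Data discovery: find items whose metadata matches all criteria."""
    return [p for p, md in catalog.items()
            if all(md.get(k) == v for k, v in criteria.items())]

# Hypothetical items landing in the lake's raw storage.
tag("s3://lake/raw/clicks/2023-11-01.json",
    source="web", format="json", owner="analytics")
tag("s3://lake/raw/audio/call-123.wav",
    source="call-center", format="wav", owner="support")

print(discover(format="json"))
# → ['s3://lake/raw/clicks/2023-11-01.json']
```

Without this layer, the lake degrades into an opaque pile of files that nobody can navigate.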
Hadoop is the best-known platform for implementing data lakes, but existing cloud offerings provide powerful alternatives.
In a follow-up post, I'll discuss how we can leverage metadata to make data lake projects a success.