Introduction to the data lake concept

The term data lake is tossed around a lot, and in many conversations I see it used interchangeably with a Hadoop cluster. While in most cases a data lake is indeed implemented using the Hadoop stack, I'd like to describe the conceptual idea behind the term. What is it? A data lake is a data...

Deep dive: memory management in Apache Spark

Memory allocation in Spark has three key contention points. This post is a breakdown of the three and a description of the progress that was made on each one. The contention points are: contention between memory allocated for execution and for storage (cache); contention between tasks running in the same process; and contention between operators executing in the same...

Spark application logging

When coding a Spark application, we often want to write application logs to trace or track the application's progress. We want to benefit from Spark's log4j configuration (log collection and so on), so naturally we declare a logger instance at the class level and use it in our closure. Unfortunately we can't do...
