Introduction to the data lake concept

The term "data lake" gets tossed around a lot, and in many conversations I see it used interchangeably with "Hadoop cluster". While in most cases a data lake is indeed implemented using the Hadoop stack, I'd like to describe the conceptual idea behind the term. What is it? A data lake is a data... Continue Reading →

Deep dive: memory management in Apache Spark

Memory allocation in Spark has three key contention points. This post breaks down the three and describes the progress that was made on each one. The contention points are: (1) contention between memory allocated for execution and for storage (cache); (2) contention between tasks running in the same process; (3) contention between operators executing in the same... Continue Reading →

Spark application logging

When coding a Spark application, we often want to write some application logs to trace or track our application's progress. We would want to benefit from Spark's log4j configuration (log collection, etc.), so naturally we would declare a logger instance at the class level and use it in our closure. Unfortunately we can't do... Continue Reading →

Editing log4j configuration using Cloudera Manager

If you are using Cloudera's distribution, you should be aware that the locations of Hadoop's configuration files are not static: the files are regenerated every time they are changed and placed in a new location under /var/run/cloudera-scm-agent/process/. For that reason, our only option for editing log4j's configuration is Cloudera Manager's (CM) safety valve. From Cloudera's documentation:... Continue Reading →
