The term "data lake" is being tossed around a lot, and in many conversations I see it used interchangeably with "a Hadoop cluster". While in most cases a data lake is indeed implemented using the Hadoop stack, I'd like to describe the conceptual idea behind the term. What is it? A data lake is a data... Continue Reading →
Deep dive: memory management in Apache Spark
Memory allocation in Spark has three key contention points. This post is a breakdown of the three, and a description of the progress that was made on each one. The contention points are: contention between memory allocated for execution and for storage (cache); contention between tasks running in the same process; and contention between operators executing in the same... Continue Reading →
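The first contention point, execution versus storage memory, is governed in Spark's unified memory manager (Spark 1.6+) by two configuration knobs. A minimal sketch of setting them; the values shown are simply the Spark 1.6 defaults, not recommendations:

```scala
import org.apache.spark.SparkConf

// Unified memory manager knobs (Spark 1.6+):
// spark.memory.fraction        - share of the heap (minus reserved memory)
//                                used for execution + storage combined
// spark.memory.storageFraction - portion of that region protected from
//                                eviction by execution demands
val conf = new SparkConf()
  .set("spark.memory.fraction", "0.6")        // default in Spark 1.6
  .set("spark.memory.storageFraction", "0.5") // default in Spark 1.6
```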
Deploying applications on YARN using Apache Twill – introduction
With the introduction of YARN, Hadoop transformed from a pure MapReduce computation engine (and DFS) into a general-purpose cluster that supports different types of workloads and coordinates their resource consumption. This was done by extracting the resource-management functionality into a separate component, YARN. Since then, several data processing engines have made their way into... Continue Reading →
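As a taste of what the post introduces, here is a minimal sketch of deploying a runnable on YARN with Twill, based on Twill's documented API; the class name EchoRunnable and the ZooKeeper address are illustrative assumptions:

```scala
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.twill.api.AbstractTwillRunnable
import org.apache.twill.yarn.YarnTwillRunnerService

// The unit of work Twill will run inside a YARN container
class EchoRunnable extends AbstractTwillRunnable {
  override def run(): Unit =
    println("running inside a YARN container")
}

object Deploy extends App {
  // Twill coordinates through ZooKeeper; the connection string is an assumption
  val runner = new YarnTwillRunnerService(new YarnConfiguration(), "zkhost:2181")
  runner.start()

  // prepare() returns a preparer for tuning resources; start() launches on YARN
  val controller = runner.prepare(new EchoRunnable()).start()
}
```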
Spark application logging
When coding a Spark application, we often want to write application logs to trace or track our application's progress. We would also want to benefit from Spark's log4j configuration (i.e. log collection etc.), so naturally we would declare a logger instance at the class level and use it in our closures. Unfortunately, we can't do... Continue Reading →
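The underlying issue is that the closure, including any logger field it captures, must be serialized and shipped to the executors, and log4j loggers are not serializable. One common workaround, sketched here (not necessarily the exact solution the post lands on), is to keep the logger out of the serialized closure and resolve it lazily on each executor:

```scala
import org.apache.log4j.{LogManager, Logger}
import org.apache.spark.{SparkConf, SparkContext}

// @transient keeps the logger out of any serialized closure, and the
// lazy val re-creates it inside each executor JVM on first use.
object AppLog extends Serializable {
  @transient lazy val log: Logger = LogManager.getLogger("myApp")
}

object LoggingExample extends App {
  val sc = new SparkContext(new SparkConf().setAppName("logging-example"))
  sc.parallelize(1 to 10).foreach { record =>
    AppLog.log.info(s"processing record $record") // runs on the executor
  }
  sc.stop()
}
```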
Editing log4j configuration using Cloudera Manager
If you are using Cloudera's distribution, you should be aware that the locations of Hadoop's configuration files are not static: the files are regenerated every time they are changed, and placed in a new location under /var/run/cloudera-scm-agent/process/. For that reason, our only option for editing log4j's configuration is Cloudera Manager's (CM) safety valve. From Cloudera's documentation:... Continue Reading →
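For a sense of what goes into the safety valve: it accepts plain log4j.properties overrides that CM appends to the generated configuration. The logger name and level below are just an illustration:

```
# Raise the log level of a specific package on top of CM's generated config
log4j.logger.org.apache.hadoop.hdfs=DEBUG
```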
Setting up a central logging infrastructure for Hadoop and Spark
Logs are critical for troubleshooting, but when an application is distributed across multiple machines, things get complicated. Things get even more complicated when your application uses 3rd-party APIs, and the answer you are looking for is hiding in one of those other systems' logs (which are distributed as well). You end up going through lots... Continue Reading →
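One common building block for such an infrastructure, sketched here as an assumption rather than the post's specific design, is forwarding every node's log4j output to a central syslog collector; the host name and facility below are placeholders:

```
# log4j.properties fragment: ship each node's logs to a central syslog host
log4j.rootLogger=INFO, SYSLOG
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.SyslogHost=loghost.example.com
log4j.appender.SYSLOG.Facility=LOCAL1
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
```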
Getting more business value with a task-based UI
We control the user’s interaction style with our system through the UI we provide, but what style of interaction will provide our business with more value: CRUD or task-oriented? IT systems are all about providing business value by acting on the organization’s data. They enable employees to perform tasks more quickly, and they enable managers and analysts to understand... Continue Reading →