Setting up a central logging infrastructure for Hadoop and Spark

Logs are critical for troubleshooting, but when an application is distributed across multiple machines, things get complicated.
Things get even more complicated when your application uses third-party APIs and the answer you are looking for is hiding in one of those other systems' logs (which are distributed as well).
You end up going through lots of different logs, trying to track down the root cause and correlate events between them.

A common approach is setting up a centralized logging solution.

In this post I will share our experience with setting up this kind of solution for a Hadoop cluster running Spark jobs on top of YARN.

Tools selection

In our setup we chose to go with the ELK stack, which provides a complete log collection, storage and visualization solution.
This stack is comprised of:

Elasticsearch – a search-engine type of data store, ideal for full-text random search. We used it to store the logs collected and parsed by Logstash.
Logstash – a widely used log collection and parsing tool.
Kibana – a visualization tool on top of Elasticsearch. We used it to query and visualize our applications' behaviour.

Problem description

Collecting log files from known, static locations

A pretty easy and useful solution is to configure Logstash to listen for syslog messages, and to configure syslog on our cluster's machines to send the various logs to the Logstash listener.
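
As a rough sketch (the listen address and port here are assumptions, not taken from our setup; syslog's default port 514 usually requires root privileges, so a higher port is often used), the Logstash side of this could look like:

input {
  syslog {
    host => "0.0.0.0"
    port => 5514
  }
}
output {
  elasticsearch {
    host => "localhost"
  }
}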

Collecting logs from a dynamic location

Collecting logs is pretty straightforward when you know the files' location beforehand, but you need a different solution when the log files' location is unknown at the time you set up log collection.

This is exactly the case when executing Spark jobs on top of YARN.

Each Spark job execution is a new YARN application, and the log location for a YARN application is dynamic, determined by the YARN application and container IDs.

Hadoop components use log4j for logging, so this is what we did to solve this issue:

1. Set up Logstash to listen for log events over a TCP socket from log4j's SocketAppender.
The following snippet contains only the relevant input and output sections of the configuration:


input {
  log4j {
    host => "10.232.83.135"
    port => 4560
  }
}
output {
  elasticsearch {
    host => "localhost"
  }
}

2. Added a SocketAppender to YARN's log4j.properties file:

main.logger=RFA,SA

log4j.appender.SA=org.apache.log4j.net.SocketAppender
log4j.appender.SA.Port=4560
log4j.appender.SA.RemoteHost=h135
log4j.appender.SA.ReconnectionDelay=10000
log4j.appender.SA.Application=NM-${user.dir}

We use Cloudera's distribution, and used Cloudera Manager's UI to edit YARN's log4j.properties file; read this blog post for more information.

3. Added a SocketAppender to Spark's log4j.properties file and used the --files switch when submitting Spark jobs.
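
For illustration, a submit command along these lines ships a custom log4j.properties (containing the SocketAppender) to the driver and executors; the class name, jar and paths are placeholders, not our actual job:

# ship the custom log4j.properties to every container and point log4j at it
spark-submit \
  --master yarn \
  --files /path/to/custom/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --class com.example.MyJob my-job.jar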

Identifying application/container log records in the central log storage

One problem that came from collecting all logs in one place is that the information about the source of each log message is lost, i.e. we don't know which YARN application/container a log message belongs to.

Some facts:

  • log4j supports referring to system properties in its configuration
  • Spark sets the container's dynamic output log directory in a system property named ‘spark.yarn.app.container.log.dir’
  • log4j's SocketAppender supports a field named ‘Application’ that is used to identify the client application in the logs

So what we did is set the Application field's value to the ‘spark.yarn.app.container.log.dir’ system property's value; that way we can recognize each log message's origin YARN application/container.
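
A sketch of the corresponding Spark-side appender configuration, assuming the same Logstash host and port as above (the exact file in our setup may have differed slightly):

# add SA to whatever root logger definition Spark's log4j template already has
log4j.rootLogger=INFO, console, SA

log4j.appender.SA=org.apache.log4j.net.SocketAppender
log4j.appender.SA.Port=4560
log4j.appender.SA.RemoteHost=h135
log4j.appender.SA.ReconnectionDelay=10000
# embed the dynamic container log directory so each record can be traced back to its container
log4j.appender.SA.Application=spark-${spark.yarn.app.container.log.dir}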

One thought on "Setting up a central logging infrastructure for Hadoop and Spark"
  1. Hi. Thanks for your post, it helped a lot. I thought that I should share some things in the same area.

    We have started to use it for the Hadoop components in an HDP cluster, and we had to somehow identify the sub-components/instances from each other. E.g. in the Hive component, there are several types of instances that share the same log4j.properties, like metastore, cli and hiveserver2.

    By editing hive-env.sh and adding a "-D" property to HADOOP_OPTS, we could distinguish the different components:

    hive-env.sh:
    export HADOOP_OPTS="$HADOOP_OPTS -Dhive_service_name=${SERVICE}"

    hive-log4j:
    log4j.appender.SA.Application=hive-${hive_service_name}

    //Selle
