Spark application logging

When coding a Spark application, we often want to write application logs to trace or track the application's progress. Ideally, we would also benefit from Spark's log4j configuration (log collection, etc.), so naturally we would declare a logger instance at the class level and use it inside our closures.

import org.apache.log4j.Logger
import org.apache.spark.{SparkConf, SparkContext}

class LoggingInSpark extends App {
  // class-level logger, referenced from inside the closure below
  val logger = Logger.getLogger(getClass)

  val conf = new SparkConf().setAppName("spark-logging")
  val sc = new SparkContext(conf)

  sc
    .textFile("path/to.file")
    .map(x => x.toString)
    .filter { row =>
      if (row.contains("something")) {
        logger.trace("trace something")
        false
      } else {
        true
      }
    }
}

Unfortunately, we can't do that. When Spark distributes our job across the cluster, it serializes our closures along with all the objects they reference. The Logger object is not serializable, so the job fails with a Task not serializable exception.

To work around this issue, we have to create a Logger instance inside every closure that is distributed across the cluster, or implement a fancy singleton-like holder that locks the logger's reference when initializing, opts it out of serialization, etc.
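Here is a minimal sketch of both workarounds (rdd stands for the RDD from the example above; the LazyLogging holder name is mine, not anything from Spark):

import org.apache.log4j.Logger

// Workaround 1: create the logger inside the closure itself, so it is
// instantiated on the executor rather than serialized from the driver.
rdd.filter { row =>
  val logger = Logger.getLogger("LoggingInSpark")
  if (row.contains("something")) {
    logger.trace("trace something")
    false
  } else true
}

// Workaround 2: capture a serializable holder whose logger field is
// @transient (excluded from serialization) and lazy (initialized under a
// lock on first use, after the closure is deserialized on the executor).
class LazyLogging extends Serializable {
  @transient lazy val logger: Logger = Logger.getLogger(classOf[LazyLogging])
}

val logging = new LazyLogging
rdd.filter { row =>
  if (row.contains("something")) {
    logging.logger.trace("trace something")
    false
  } else true
}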

I had the pleasure of attending Strata a couple of months ago, and during a Q&A session with the Databricks guys, I asked them about this issue. They said a solution is in the works, but it missed the 1.2.0 release.


While browsing around Spark's code base, I came across the org.apache.spark.Logging trait, which is Spark's internal solution to this exact issue: it locks the logger's reference when initializing, opts it out of the closure's serialization, and even provides a set of convenience methods such as logTrace, logDebug, etc.
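Conceptually, the trait boils down to something like this (a simplified sketch, not Spark's actual source, which is built on slf4j):

import org.slf4j.{Logger, LoggerFactory}

trait Logging {
  // @transient: the logger is never serialized with the closure;
  // it is re-created lazily on whichever JVM ends up needing it.
  @transient private var log_ : Logger = null

  protected def log: Logger = {
    if (log_ == null) {
      log_ = LoggerFactory.getLogger(getClass.getName.stripSuffix("$"))
    }
    log_
  }

  // the message is passed by name, so it is only built if the level is enabled
  protected def logTrace(msg: => String): Unit =
    if (log.isTraceEnabled) log.trace(msg)

  protected def logDebug(msg: => String): Unit =
    if (log.isDebugEnabled) log.debug(msg)
}

Mixing it into our application, the earlier example becomes: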

import org.apache.spark.{Logging, SparkConf, SparkContext}

class LoggingInSpark
    extends App with Logging /* mixing in the Logging trait */ {
  val conf = new SparkConf().setAppName("spark-logging")
  val sc = new SparkContext(conf)

  sc
    .textFile("path/to.file")
    .map(x => x.toString)
    .filter { row =>
      if (row.contains("something")) {
        logTrace("trace something")
        false
      } else {
        true
      }
    }
}

FYI – the trait is marked with the @DeveloperApi annotation, which means the API might change in future versions.
