@vinodkc
Last active November 2, 2023 18:52
Spark Event Log Job Trimmer

Spark event logs can grow very large, especially for streaming jobs, and it is difficult to transfer such a big file to another, smaller cluster for offline analysis. The following shell script reduces the Spark event log size by excluding old jobs from the event log file, so that you can still analyze issues with recent jobs.
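The trimming logic keys on the per-line JSON events that Spark writes to its event log, one object per line. A minimal sketch of the counting step (the `demo_eventlog` file and its contents are made up here for illustration; real event logs carry many more fields):

```shell
#!/bin/sh
# Build a tiny synthetic event log: one JSON event per line,
# as in a real Spark event log (fields abbreviated for the demo).
printf '%s\n' \
  '{"Event":"SparkListenerApplicationStart","App Name":"demo"}' \
  '{"Event":"SparkListenerJobStart","Job ID":0}' \
  '{"Event":"SparkListenerJobEnd","Job ID":0}' \
  '{"Event":"SparkListenerJobStart","Job ID":1}' \
  '{"Event":"SparkListenerJobEnd","Job ID":1}' > demo_eventlog

# Every trimming decision hinges on the lines that begin a new job:
awk '/^{"Event":"SparkListenerJobStart"/ {n++} END {print n+0}' demo_eventlog   # prints 2
rm -f demo_eventlog
```

The `n+0` in the `END` block guards against an empty result when the log contains no jobs at all.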

After running this shell script in a Linux/macOS terminal, the trimmed output is saved in the input folder with the suffix _trimmed; use that file for further analysis.

Usage instructions:

  1. Copy and paste the code snippet below into a file named trimsparkeventlog.sh
#!/bin/bash

if [ "$#" -ne 2 ]; then
    echo "Usage: ./trimsparkeventlog.sh <path to Spark event log file> <percentage of jobs to keep>"
    echo "e.g.: ./trimsparkeventlog.sh ~/Downloads/application_1605334641754_0001 40"
    exit 1
fi

fileName=$1
percentageofRequiredJob=$2

: 'Find the total number of jobs in this event log'

totalJobCount=$(awk '/^{"Event":"SparkListenerJobStart"/ {++jobCount} END {print jobCount+0}' "$fileName")
requiredJobCount=$(( (totalJobCount * percentageofRequiredJob) / 100 ))

: 'Clamp the required job count to the range [1, totalJobCount]'
requiredJobCount=$(( requiredJobCount <= 0 ? 1 : (requiredJobCount > totalJobCount ? totalJobCount : requiredJobCount) ))

trimmedFileName="${fileName}_trimmed"

: 'Exclude old unwanted jobs from the event log'

awk -v totalJobCount="$totalJobCount" -v requiredJobCount="$requiredJobCount" '
BEGIN { currentJobCount = 0; ignoredJobCount = totalJobCount - requiredJobCount; stopPrint = 0 }
{
    if ($0 ~ /^{"Event":"SparkListenerJobStart"/) {
        ++currentJobCount
        if (currentJobCount > ignoredJobCount) {
            print
            stopPrint = 0
        } else {
            stopPrint = 1
        }
    } else if (stopPrint == 0) {
        print
    }
}' "$fileName" > "$trimmedFileName"
echo "Output saved at $trimmedFileName"
availableJobs=$(grep -c '^{"Event":"SparkListenerJobStart"' "$trimmedFileName")
echo "New event log $trimmedFileName has $availableJobs recent jobs"
  2. Make the script executable: chmod u+x trimsparkeventlog.sh

  3. Run the script trimsparkeventlog.sh

     Usage: ./trimsparkeventlog.sh <path to Spark event log file> <percentage of jobs to keep>

     e.g.: ./trimsparkeventlog.sh ~/Downloads/application_1605334641754_0001 40
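The filtering step can be sanity-checked without a real cluster. This self-contained sketch applies the same idea as the script's awk filter to a synthetic three-job log, skipping the first two jobs while keeping the leading application events (the file name and event contents are hypothetical):

```shell
#!/bin/sh
# Synthetic log: an ApplicationStart header followed by 3 jobs.
printf '%s\n' \
  '{"Event":"SparkListenerApplicationStart","App Name":"demo"}' \
  '{"Event":"SparkListenerJobStart","Job ID":0}' \
  '{"Event":"SparkListenerJobEnd","Job ID":0}' \
  '{"Event":"SparkListenerJobStart","Job ID":1}' \
  '{"Event":"SparkListenerJobEnd","Job ID":1}' \
  '{"Event":"SparkListenerJobStart","Job ID":2}' \
  '{"Event":"SparkListenerJobEnd","Job ID":2}' > demo_eventlog

# Same logic as the script, condensed: suppress lines belonging to the
# first `ignore` jobs; everything before the first job stays, since
# `skip` is initially 0.
awk -v ignore=2 '/^{"Event":"SparkListenerJobStart"/ { n++; skip = (n <= ignore) } !skip' \
    demo_eventlog > demo_eventlog_trimmed

# The trimmed log keeps the header plus only the most recent job:
grep -c '^{"Event":"SparkListenerJobStart"' demo_eventlog_trimmed   # prints 1
rm -f demo_eventlog demo_eventlog_trimmed
```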

Note: 📙

  • If percentageofRequiredJob=100, no jobs are excluded (the output file content is identical to the input file content).
  • If percentageofRequiredJob=40, the most recent 40% of jobs are preserved and the remaining 60% are discarded.
  • If percentageofRequiredJob <= 0, only the most recent job is preserved and the rest are discarded.
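The clamping behavior in the notes above can be sketched with the script's own integer arithmetic, using a made-up totalJobCount of 10:

```shell
#!/bin/sh
# Sample figure for illustration only; the real script derives this
# count from the event log.
totalJobCount=10

for pct in 100 40 0; do
  requiredJobCount=$(( (totalJobCount * pct) / 100 ))
  # Clamp to [1, totalJobCount], as the script does.
  requiredJobCount=$(( requiredJobCount <= 0 ? 1 : (requiredJobCount > totalJobCount ? totalJobCount : requiredJobCount) ))
  echo "pct=$pct -> keep $requiredJobCount of $totalJobCount jobs"
done
# prints: keep 10, then 4, then 1
```

Because the division is integer division, small percentages round down and are then clamped up to 1, so at least one job always survives.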

Please don't rename the output file with any file extension; the Spark History Server (SHS) doesn't accept extensions other than .inprogress 😁

If you find a bug in this script, please contact me.
