Steps to create an ETL Pipeline with Cronjob

  1. Build the fat JAR

    mvn clean package -DskipTests
  2. Connect to the Dev server

    ssh -o ServerAliveInterval=10 -i C:\Users\parth.b.desai\Desktop\byte_121.pem centos@172.31.96.121
  3. Move to recon_master

    cd /data/nirav/code/recon_master
  4. Create a directory named <jobname>

    mkdir <jobname>
    cd <jobname>
  5. Create the data, Test_SavetableAPI, and python_script directories

    mkdir data
    mkdir Test_SavetableAPI
    mkdir python_script
  6. Add the paths to data and Test_SavetableAPI in application.conf

    pathvariables{
      zomatopath = "/data/nirav/code/recon_master/zomato_settlement_report/data/"
      tempwriteoutpath = "/data/nirav/code/recon_master/zomato_settlement_report/Test_SavetableAPI/"
    }
  7. Add these paths to ApplicationConfig.scala (a sketch of how they are loaded follows this step)

    //Custom variables for pushing zomato data
    def ZOMATO_PATH = config.getString("monitor.sparketl.sample.job.pathvariables.zomatopath")
    def TEMP_WRITEOUT_PATH = config.getString("monitor.sparketl.sample.job.pathvariables.tempwriteoutpath")
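
    The config value these defs read from is not shown in the gist. As a rough sketch (not the project's actual code): the key prefix monitor.sparketl.sample.job suggests the pathvariables block from step 6 sits nested under that path in the full application.conf, and the config.getString calls imply Typesafe Config. An ApplicationConfig object along those lines might look like this; the object name and load strategy are assumptions:

    import java.io.File
    import com.typesafe.config.{Config, ConfigFactory}

    object ApplicationConfig {
      // Assumption: the config path is handed in at startup; step 12's spark-submit
      // line passes /data/nirav/code/recon_master/config/application.conf as an argument.
      private var config: Config = ConfigFactory.empty()

      def load(path: String): Unit =
        config = ConfigFactory.parseFile(new File(path)).resolve()

      // Custom variables for pushing zomato data (the defs from this step)
      def ZOMATO_PATH = config.getString("monitor.sparketl.sample.job.pathvariables.zomatopath")
      def TEMP_WRITEOUT_PATH = config.getString("monitor.sparketl.sample.job.pathvariables.tempwriteoutpath")
    }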
  8. Copy application.conf from the local system to Dev (run this from a second local terminal, not on the Dev server)

    scp -i C:\Users\parth.b.desai\Desktop\byte_121.pem C:\Users\parth.b.desai\Desktop\Github\SparkSQL_ETL\src\test\resources\application.conf centos@172.31.96.121:/data/nirav/code/recon_master/config/application.conf
  9. Copy the fat JAR from the local system to Dev

    scp -i C:\Users\parth.b.desai\Desktop\byte_121.pem C:\Users\parth.b.desai\Desktop\Github\SparkSQL_ETL\target\spark_etl-1.0-SNAPSHOT-jar-with-dependencies.jar centos@172.31.96.121:/data/nirav/code/recon_master/
  10. Add a <jobname>.sh script to shell_Script

    cd /data/nirav/code/recon_master/shell_Script/
    nano <jobname>.sh
  11. (Optional) Add a Python script if needed

    cd /data/nirav/code/recon_master/<jobname>/python_script
    nano ConvertToCSV.py
  12. Add the following to <jobname>.sh (a sketch of the job class it launches follows the Excel variant below)

    #!/bin/bash
    # Collect the input files matching the expected extension; the array is only
    # used to check whether there is anything to process on this cron cycle.
    myarray=(`find /data/nirav/code/recon_master/<jobname>/data/ -maxdepth 1 -name "*.<fileformat>"`)
    if [ ${#myarray[@]} -gt 0 ]; then
        # Run the ETL job; the fat JAR supplies the classes via --jars, and
        # application.conf is passed through to the job.
        /data/spark-3.1.1-bin-hadoop2.7/bin/spark-submit --class com.byteprophecy.monitor.recon.<classname> --master local[*] --deploy-mode client --jars /data/nirav/code/recon_master/spark_etl-1.0-SNAPSHOT-jar-with-dependencies.jar /data/nirav/code/recon_master/config/application.conf /data/nirav/code/recon_master/config/application.conf
        if [ $? -eq 0 ]; then
            # Delete the inputs only after the job has succeeded, so a failed
            # run can be retried on the next cycle.
            echo "Successfully executed"
            echo "Deleting files..."
            rm /data/nirav/code/recon_master/<jobname>/data/*.<fileformat>
            echo "Files deleted"
        else
            echo "Failed to execute. Please check logs for more information."
        fi
    else
        echo "File not found"
    fi

    For Excel files, use this variant, which adds explicit driver resources (--driver-cores 1 --driver-memory 4G):

    #!/bin/bash
    myarray=(`find /data/nirav/code/recon_master/<jobname>/data/ -maxdepth 1 -name "*.<fileformat>"`)
    if [ ${#myarray[@]} -gt 0 ]; then 
        /data/spark-3.1.1-bin-hadoop2.7/bin/spark-submit --class com.byteprophecy.monitor.recon.<classname> --master local[*] --deploy-mode client --driver-cores 1 --driver-memory 4G --jars /data/nirav/code/recon_master/spark_etl-1.0-SNAPSHOT-jar-with-dependencies.jar /data/nirav/code/recon_master/config/application.conf /data/nirav/code/recon_master/config/application.conf
        if [ $? -eq 0 ]; then
            echo "Successfully executed"
            echo "Deleting files..."
            rm  /data/nirav/code/recon_master/<jobname>/data/*.<fileformat>
            echo "Files deleted"
        else
            echo "Failed to execute. Please check logs for more information."
        fi
    else 
        echo "File not found"
    fi
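
    For orientation, here is a rough sketch of the kind of job class the spark-submit line above launches (com.byteprophecy.monitor.recon.<classname>). This is not the gist's actual class: the object name, input format (CSV), and output format (Parquet) are assumptions. Note that application.conf appears twice on the spark-submit line; the first copy appears to fill spark-submit's primary-resource slot (the classes themselves come from the fat JAR via --jars), while the second arrives as args(0):

    import org.apache.spark.sql.SparkSession

    object ZomatoSettlementReport {
      def main(args: Array[String]): Unit = {
        // args(0): the application.conf path passed by <jobname>.sh
        ApplicationConfig.load(args(0))

        // --master local[*] and --deploy-mode client are set on the spark-submit line
        val spark = SparkSession.builder().appName("zomato_settlement_report").getOrCreate()

        // Pick up whatever this cron cycle dropped into <jobname>/data/ ...
        val df = spark.read.option("header", "true").csv(ApplicationConfig.ZOMATO_PATH)

        // ... and stage the result under Test_SavetableAPI/ before the shell
        // script deletes the input files.
        df.write.mode("overwrite").parquet(ApplicationConfig.TEMP_WRITEOUT_PATH)

        spark.stop()
      }
    }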
  13. Create a folder for <jobname> under logs

    cd /data/nirav/code/recon_master/logs
    mkdir <jobname>
  14. (Optional) Copy test data if needed

    scp -i C:\Users\parth.b.desai\Desktop\byte_121.pem "C:\Users\parth.b.desai\Downloads\Cashless Summary Report-Domino's_Mar'21____This_has_SalesData.xlsb" centos@172.31.96.121:/data/nirav/code/recon_master/cash_summary_report_salesdata/data/
  15. Create the cronjob (open the crontab with crontab -e and add the line below; */10 * * * * runs the script every 10 minutes)

    */10 * * * * sh /data/nirav/code/recon_master/shell_Script/<jobname>.sh > /data/nirav/code/recon_master/logs/<jobname>/`date +\%Y\%m\%d\%H\%M\%S`-cron.log 2>&1

    OPTIONAL: To test the job before scheduling it, run sh /data/nirav/code/recon_master/shell_Script/<jobname>.sh manually. After the first scheduled run, check the newest timestamped log under /data/nirav/code/recon_master/logs/<jobname>/.
