How to run long-running scripts and jobs

Running long-running script jobs is tricky because:

  • Running them from your local machine, interruptions will result from loss of internet or VPN connectivity, or from your machine going to sleep, running out of power, etc.
  • Running them from a jumpbox: 1) interruptions will result from the session timeout; 2) even if you send the script job to run in the background, interruptions will still result from jumpbox churn (jumpbox machines are often recycled for security reasons); 3) jumpbox URLs are often just load balancers, so even after sending a script job to run in the background on a jumpbox, there's no guarantee that the next time you ssh in you'll land on the same machine, meaning you may never be able to find, let alone terminate, the process if there's a problem.
  • In general, long-running script jobs require commands we don't use all that often (like nohup), keeping track of the process so we can kill it if required, and logging so we can keep track of the script job's progress.

Summary

Before creating and running a long-running script job, ask yourself whether a long-running script job really is the correct solution, versus a service endpoint, an AWS Lambda, etc.

# using the prod environment as an example
$ ssh jumpbox.us-west-2.prod.com
# this is a stable utility box in the prod environment - find an appropriate box in your environment
$ ssh 1.2.3.4

$ touch long_running_script_job.sh # create the script file
$ chmod +x long_running_script_job.sh # make it executable
$ vim long_running_script_job.sh
# press i, <paste the script that you've prepared>, then :wq to save

$ nohup ./long_running_script_job.sh &
[1] 4603
nohup: ignoring input and appending output to ‘nohup.out’
# make a note of the process id 4603 (THIS IS FRIGGIN IMPORTANT)
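# tip: $! holds the PID of the most recent background process, if you'd rather capture it than copy it
$ echo $!
4603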
# sanity check the output of the script (because your script has built-in persistent log output, right? right?!)
$ less long_running_script_job_output.log
# do other sanity checks, eg that the script calls service endpoints as expected

# to kill the process in case something isn't right
$ kill 4603
# <undo and clean up anything that went wrong, start over>
# otherwise, if everything looks right, just leave it to run

# <utility box ssh session or jumpbox ssh session will eventually expire>
# that's ok, to get back to the same box
$ ssh jumpbox.us-west-2.prod.com
$ ssh 1.2.3.4

# now we're back on the same box and can eg sanity check the most recent output of the script again
$ less long_running_script_job_output.log
# or kill the process in case something isn't right
$ kill 4603
# otherwise, if everything looks right, just leave it to run
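# to follow the log live instead of reopening it with less
$ tail -f long_running_script_job_output.log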

# <script will eventually finish running>
# transfer the long_running_script_job_output.log to s3 or to your local machine
# and add it to the JIRA ticket for the task for record keeping

Script job vs AWS Lambda vs service endpoint

Script job

  • Good for things that are truly one-off, or that are run manually on an arbitrary, uneven schedule.
  • Relatively simple, "quick and dirty".
  • Because it's usually something you write and run ad hoc, there's no formal Git version control, no code review, no tests, no Splunk logs, and it isn't deployed anywhere.

Bash

  • Once written, it runs anywhere (as long as you stick to standard commands).
  • Bash just isn't a very clear language - scripts often end up a hodgepodge of commands duct-taped together with confusing syntax (even simple things like while loops quickly get impenetrable), and it's usually not as clear and structured as, say, Python, especially for more complex jobs.

Python

  • Because Python is a "real" language with proper libraries etc, it's usually clearer and more structured than Bash.
  • Once written, it doesn't necessarily run anywhere - Python tends to rely on dependencies that aren't present everywhere, and different environments may even run different versions of Python itself.

AWS Lambda

  • Can be triggered manually OR by events that are either scheduled OR organically generated.
  • Git version control, code review, tests, Splunk/AWS logs etc etc.
  • Usually more complicated than a simple, quick and dirty script job.

Service endpoint

  • It's just code like any other service code.
  • Can be triggered manually OR by events that are either scheduled OR organically generated.
  • Git version control, code review, tests, Splunk/AWS logs etc etc.
  • Requires special job-running code, scheduling, thread pools etc.
  • Even if the job-running code, scheduling, thread pools etc are perfectly written, service machines and their processes are interrupted on new deploys or even just changes in load, which means you need to build in mechanisms to restart jobs and track progress so they pick back up where they left off.
  • Usually more complicated than a simple, quick and dirty script job.

Example in detail

I recently came across one of those tasks that lent itself well to being solved by a long-running script, and figured it was a good example for documenting some thoughts on how to do it.

The task was "disable all foos that haven't bar'd in 18 months".

It sounds like something that can be done by just running a one-off MySQL statement, but inactivating foos consists not only of marking them as inactive in the db, but of other logic as well, like sending out events, so simply marking them as inactive in the db isn't sufficient. At the time of writing, the foo-service PUT v1/foos/<fooId>/status/INACTIVE endpoint is what performs the complete foo deactivation process, so we need something that calls that endpoint.

At first, I actually figured this is probably something we'd want to be doing on an ongoing basis. That means, rather than running a one-off long-running script job, it's probably something that would lend itself better to being solved by an AWS Lambda that listens to some event. Note that there is no event that is organically generated when a foo's last bar is more than 18 months ago (unlike, eg, events that are organically generated when a foo is created), but that doesn't matter - AWS Lambdas can be triggered not only by organically generated events but also by scheduled events that you can set up as appropriate. So I figured I'd make a Lambda, triggered by an event generated once every 24 hours, that gets all the foos who haven't bar'd for 18 months (from an endpoint, or straight from the db) and calls PUT v1/foos/<fooId>/status/INACTIVE for each of them.
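For reference, the "scheduled events" part is typically just an EventBridge rule pointed at the Lambda. A rough sketch with the AWS CLI (the rule name, function name, account id, and ARNs here are hypothetical):

$ aws events put-rule --name inactive-foo-sweep --schedule-expression "rate(24 hours)"
$ aws lambda add-permission --function-name inactiveFooSweep \
    --statement-id eventbridge-invoke --action lambda:InvokeFunction \
    --principal events.amazonaws.com \
    --source-arn arn:aws:events:us-west-2:123456789012:rule/inactive-foo-sweep
$ aws events put-targets --rule inactive-foo-sweep \
    --targets "Id"="1","Arn"="arn:aws:lambda:us-west-2:123456789012:function:inactiveFooSweep"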

However, it turns out that, while we do want something like that in the future, this particular ask was fine to do as a one-off. So I pivoted to a script job instead.

First, I needed to identify all foos that hadn't bar'd for 18 months, along with their last bar date. The foo-service PUT v1/foos/<fooId>/status/INACTIVE endpoint accepts a "reason" string that's kept for auditing, so it made sense to send in the foo's last bar date there, eg {"reason": "foo <fooId> has not bar'd for at least 18 months, last activity <date>"}.

I used the following query to obtain the data:

SELECT foo.foo_id, foo_activity.activity_timestamp
FROM foo_activity foo_activity
         JOIN foo foo ON foo.foo_id = foo_activity.foo_id
WHERE foo.status = 'ACTIVE'
  AND foo_activity.activity_timestamp < '2019-08-22 00:00:00'

To dump it to a file, I did:

# set up a connection to the db
$ ssh -f -N -L 3323:foo-db.rds.amazonaws.com:3306 myClientName@jumpbox.us-west-2.prod.com
# dump the query output to a csv
$ mysql -e "<the query above>" --host=127.0.0.1 --port=3323 --user=master -p foo_db > dumpfile.csv
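# note: mysql -e output is tab-separated by default; if you want commas as below,
# one way (assuming GNU sed) is
$ sed -i 's/\t/,/g' dumpfile.csv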

The file looked like this:

foo_id,activity_timestamp
123,2019-05-13 12:50:03.128
456,2019-05-14 12:50:03.128
...

I uploaded the dump to the JIRA ticket for the task for record keeping.

I deleted the top line that contained the column names, since it wasn't needed - eg with sed in place, as sketched below. Then I got to work on the bash script.
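(Assuming GNU sed; on macOS use sed -i '' instead:)

$ sed -i '1d' dumpfile.csv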

$ touch long_running_script_job.sh # create the script file
$ chmod +x long_running_script_job.sh # make it executable
$ vim long_running_script_job.sh
#!/usr/bin/env bash
# the shebang above is important: it makes the script run with bash specifically, as opposed
# to some other shell that may be the default on any given machine (don't put a comment on
# the shebang line itself - everything after the interpreter is treated as its arguments)

while IFS=, read -r fooId date; do # read line by line, assigning the first column value to $fooId
                                  # and the second column value to $date
  # print the variable values to sanity check it's working
  echo "$fooId"
  echo "$date"
  sleep 2 # sleep 2 seconds
done <"$1" # use first argument to script as input to a while loop

Running it to sanity check that it works:

$ ./long_running_script_job.sh dumpfile.csv
123
2019-05-13 12:50:03.128
456
2019-05-14 12:50:03.128

Cool, now let's make it call the endpoint with curl. The curl will look like this:

$ curl --request PUT \
    --url "https://foo-service-transit.test.prod.com/v1/foos/$fooId/status/INACTIVE" \
    --header "content-type: application/json" \
    --header "clientId: myClientName" \
    --user "admin:$pwd" \
    --data "{\"reason\": \"foo $fooId has not bar'd for at least 18 months, last used activity $date\"}"

But curl doesn't print anything on a 204 response, which is what the endpoint returns on a successful call. I can use -i to make curl print some basic info about the response:

HTTP/1.1 204

and then pipe that through

| head -n 1 | cut -d ' ' -f2

to cut out the status code, so I can print and log the outcome.
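(As an aside, not what I used here: curl can also print just the status code directly via --write-out, which avoids the head/cut pipeline entirely.)

$ curlResponseStatus=$(curl -s -o /dev/null -w '%{http_code}' <same curl options as above>)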

To run the curl in the script, cut out the status code, and assign it to a variable, I need to use curlResponseStatus=$(<the thing I want to do>). Putting it all together, I get the following. (NOTE: I'm using the foo-service test URL here for sanity checking from my local machine before doing anything to do with prod + change the clientId as appropriate.)

#!/usr/bin/env bash
while IFS=, read -r fooId date; do
  # make sure to set the environment variable $pwd before calling, as it's used by the script
  # to set it safely, use $ read -s -p "Password: " pwd
  curlResponseStatus=$(curl --request PUT \
    --url "https://foo-service-transit.test.prod.com/v1/foos/$fooId/status/INACTIVE" \
    --header "content-type: application/json" \
    --header "clientId: myClientName" \
    --user "admin:$pwd" \
    --data "{\"reason\": \"foo $fooId has not bar'd for at least 18 months, last activity $date\"}" \
    -i | head -n 1 | cut -d ' ' -f2) # -i makes curl output basics about the response, then we cut out the status code
  if [[ $curlResponseStatus == "204" ]]; then
    # print to console
    echo "Success: foo $fooId with last bar date $date successfully inactivated"
    # and to the log file so we can keep track of progress (IMPORTANT!)
    echo "Success: foo $fooId with last bar date $date successfully inactivated" >> long_running_script_job_output.log
  else
    # print to console
    echo "Error: attempt to inactivate foo $fooId with last bar date $date resulted in $curlResponseStatus"
    # and to the log file so we can keep track of progress (IMPORTANT!)
    echo "Error: attempt to inactivate foo $fooId with last bar date $date resulted in $curlResponseStatus" >> long_running_script_job_output.log
  fi
  sleep 2
done <"$1"
$ read -s -p "Password: " pwd
$ ./long_running_script_job.sh dumpfile.csv
# <outcome>
# <outcome>

# <ctrl + c to cancel the command>

# sanity check the log
$ less long_running_script_job_output.log
# <outcome>
# <outcome>

Looks good!
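(One optional hardening, sketched here as an assumption rather than something the original run used: since the log records every foo that succeeded, a small guard at the top of the while loop lets a restarted run skip foos that were already processed.)

# skip foos already inactivated by a previous run, based on the Success log lines
if grep -q "Success: foo $fooId with" long_running_script_job_output.log 2>/dev/null; then
  continue
fi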

Now, unlike my local machine, which can run long-running processes only for as long as it doesn't go to sleep, run out of power, or lose internet or VPN connectivity, running this in test or prod means first ssh-ing into a jumpbox and from there into a stable-IP utility box. And once on that box, I can't just run the command like I do locally, because it will be interrupted when the session gets terminated - I need to send it to run in the background and note the process id (THIS IS FRIGGIN IMPORTANT) so I can stop the script job if anything goes wrong.

To run the command in the background (on my local machine for now), I surround the command with nohup <command> &, ie

# clear the log
$ rm -f long_running_script_job_output.log

$ read -s -p "Password: " pwd
$ nohup ./long_running_script_job.sh dumpfile.csv &
[1] 4603
nohup: ignoring input and appending output to ‘nohup.out’
# make a note of the process id 4603

# sanity check the output of the script
$ less long_running_script_job_output.log
# <outcome>
# <outcome>

Cool, now I can run it in the background and monitor the progress! Kill the process with

$ kill 4603
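If you lose the PID, pgrep can usually recover it (assuming the script name is unique on the box):

$ pgrep -f long_running_script_job.sh
4603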

Then I tweaked the script to run in prod:

#!/usr/bin/env bash
while IFS=, read -r fooId date; do
  # make sure to set the environment variable $pwd before calling, as it's used by the script
  # to set it safely, use $ read -s -p "Password: " pwd
  curlResponseStatus=$(curl --request PUT \
    --url "https://foo-service-transit.prod.com/v1/foos/$fooId/status/INACTIVE" \
    --header "content-type: application/json" \
    --header "clientId: myClientName" \
    --user "admin:$pwd" \
    --data "{\"reason\": \"foo $fooId has not bar'd for at least 18 months, last activity $date\"}" \
    -i | head -n 1 | cut -d ' ' -f2) # -i makes curl output basics about the response, then we cut out the status code
  if [[ $curlResponseStatus == "204" ]]; then
    # print to console
    echo "Success: foo $fooId with last bar date $date successfully inactivated"
    # and to the log file so we can keep track of progress
    echo "Success: foo $fooId with last bar date $date successfully inactivated" >> long_running_script_job_output.log
  else
    # print to console
    echo "Error: attempt to inactivate foo $fooId with last bar date $date resulted in $curlResponseStatus"
    # and to the log file so we can keep track of progress
    echo "Error: attempt to inactivate foo $fooId with last bar date $date resulted in $curlResponseStatus" >> long_running_script_job_output.log
  fi
  sleep 0.1 # make one call every ~100 milliseconds
done <"$1"

Then I ran it in prod:

# using prod environment as an example
$ ssh jumpbox.us-west-2.prod.com
# this is a stable utility box in prod - find an appropriate box in your environment
$ ssh 1.2.3.4

# <transfer the dumpfile.csv to the utility box>
$ touch long_running_script_job.sh # create the script file
$ chmod +x long_running_script_job.sh # make it executable
$ vim long_running_script_job.sh
# press i, <paste the script that you've prepared>, then :wq to save

$ read -s -p "Password: " pwd
$ nohup ./long_running_script_job.sh dumpfile.csv &
[1] 4603
nohup: ignoring input and appending output to ‘nohup.out’
# make a note of the process id 4603 (THIS IS FRIGGIN IMPORTANT)

# sanity check the output of the script
$ less long_running_script_job_output.log
# do other sanity checks, eg if the script calls service endpoints, check Splunk that they behave as expected

# to kill the process in case something isn't right
$ kill 4603
# <undo and clean up anything that went wrong, start over>
# otherwise, if everything looks right, just leave it to run

# <utility box ssh session or jumpbox ssh session will eventually expire>
# that's ok, to get back to the same box
$ ssh jumpbox.us-west-2.prod.com
$ ssh 1.2.3.4

# now we're back on the same box and can eg sanity check the most recent output of the script again
$ less long_running_script_job_output.log
# or kill the process in case something isn't right
$ kill 4603
# otherwise, if everything looks right, just leave it to run

# <script will eventually finish running>
# transfer the long_running_script_job_output.log to s3 or to your local machine
# and add it to the JIRA ticket for the task for record keeping
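# for the s3 route, something like this works (the bucket name here is hypothetical):
$ aws s3 cp long_running_script_job_output.log s3://my-team-bucket/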