@thisrod
Last active November 8, 2019 00:26
How Julia threads might work on a cluster

Here's a sequence of events.

The memoisation part would be useful locally, not just on a cluster. E.g. look up the order parameter that I prepared earlier, then compute the sound wave spectrum in a new way.
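As a rough illustration of local memoisation, here is a minimal sketch that caches a function's result on disk, keyed by a checksum of the function name and its arguments. The name `memoised` and the `.jls` file layout are assumptions made up for this example, not part of any existing package. The key covers only the function's name, so editing the function body would not invalidate the cache.

```julia
using SHA             # standard library: sha256
using Serialization   # standard library: serialize / deserialize

# Hypothetical helper: return f(args...), reusing a result stored in `dir`
# if the same function name and arguments have been seen before.
function memoised(f, dir::AbstractString, args...)
    buf = IOBuffer()
    serialize(buf, (string(nameof(f)), args))    # key on name and arguments only
    path = joinpath(dir, bytes2hex(sha256(take!(buf))) * ".jls")
    isfile(path) && return deserialize(path)     # cache hit: skip the computation
    result = f(args...)
    serialize(path, result)                      # cache miss: compute and store
    return result
end
```

The second call with the same arguments reads the stored file instead of recomputing, which is the "prepared earlier" behaviour described above.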

It would be useful to have a @cluster_loop macro, which memoised the result of doing 100 iterations, and skipped to there when you asked for 200. Maybe save the state at Fibonacci-numbered iteration counts as well as at the final one.
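The checkpointing idea can be sketched as a plain function rather than a macro. Everything here is hypothetical: `fib_checkpoints`, `checkpointed_iterate` and the in-memory `store` stand in for whatever @cluster_loop would actually do, and a real version would presumably persist checkpoints to disk or to the cluster. The sketch assumes `step` returns a new state rather than mutating its argument.

```julia
# Fibonacci numbers (1, 2, 3, 5, 8, ...) not exceeding n.
function fib_checkpoints(n::Integer)
    fibs = Int[]
    a, b = 1, 2
    while a <= n
        push!(fibs, a)
        a, b = b, a + b
    end
    return fibs
end

# Apply `step` to `x0` n times, reusing and extending the checkpoints in `store`.
# `store` maps an iteration count to the state after that many iterations.
# Returns the final state and the number of steps actually computed.
function checkpointed_iterate(step, x0, n::Integer, store::Dict{Int,Any})
    # Resume from the largest stored iteration count not exceeding n.
    start = maximum(filter(k -> k <= n, collect(keys(store))); init = 0)
    x = start == 0 ? x0 : store[start]
    save_at = Set(fib_checkpoints(n))
    push!(save_at, n)                 # always keep the final result too
    for i in start+1:n
        x = step(x)
        i in save_at && (store[i] = x)
    end
    return x, n - start
end

store = Dict{Int,Any}()
x, computed = checkpointed_iterate(v -> v + 1, 0, 100, store)   # runs all 100 steps
y, resumed  = checkpointed_iterate(v -> v + 1, 0, 200, store)   # starts from step 100
```

Saving at Fibonacci counts keeps the number of checkpoints logarithmic in the iteration count while still leaving a recent one to resume from.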

  1. I run a test script, which goes something like this:
results = Vector{Any}(undef, 15)   # one promise per job

for n = 1:15
    results[n] = @cluster f(randn(2^n,2^n)) (wall_time => "8 hours", mem => 1e9)
end

for x = results
    plot(x)
end
  1. The first call to @cluster finds an ozstar.swin.edu.au line in my .ssh/known_hosts, figures that cluster will do, and looks up how it works in the cluster navigation database that comes with Cluster.jl.

  2. @cluster introspects the current environment, writes it to JLD and Project.toml files, transfers them to the cluster, and queues the n = 1 job.

  3. The next 14 calls to @cluster notice that there is already a JLD file on the cluster whose checksum matches the one they generated, and use that.

  4. 15 jobs are queued, and results is an array of 15 promises.

  5. The first call to plot forces the first promise. This is a small job, so it has already finished, and the result gets transferred back to my desktop and plotted.

  6. Further calls to plot block until the job is complete and the promise can be forced, then retrieve results and plot them.

  7. At some point, I go home for dinner, and shut down my desktop.

  8. The next morning, I come in to work, start up my desktop, and run the script again.

  9. The calls to @cluster detect that f has already been evaluated with an environment matching the checksum of their JLD file and so on, and return promises to the memoised results.

  10. The first 10 calls to plot draw graphs.

  11. The 11th call forces its promise, which throws an error and reports that the cluster ran out of memory.

  12. I change mem => 1e9 to mem => 1e12, and run the script again.

  13. The first 10 calls to @cluster notice that their jobs have run successfully, ignore the change in resource requests by default, and return promises to the memoised results.

  14. Calls 11 through 15 notice that their jobs failed last time, and that the resource requests have changed, so they try again. The JLD file with the environment is still on the cluster, so the inputs don't get transferred.

  15. The script plots 10 graphs, then the 11th promise blocks, and polls the cluster every few minutes to see if the job has finished.

  16. A few days later, all the jobs have finished. I rerun the script, and get 15 graphs.
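The decisions in the steps above can be sketched in a few lines of Julia. This is only an illustration under stated assumptions: `JobRecord`, `job_key`, `decide`, `ClusterPromise` and `force` are hypothetical names, the job database is a plain `Dict`, and the `poll` function stands in for whatever query the cluster's scheduler supports. Re-raising the old error when a failed job is requested again with unchanged resources is also an assumption; the steps above only cover the case where the request changed.

```julia
using SHA   # standard library: sha256

# Hypothetical record of a previously submitted job.
struct JobRecord
    status::Symbol                 # :queued, :running, :done or :failed
    resources::Dict{Symbol,Any}    # e.g. Dict(:wall_time => "8 hours", :mem => 1e9)
end

# Key a job by a checksum of the serialised environment plus the call text.
job_key(env_bytes::Vector{UInt8}, call::String) =
    bytes2hex(sha256(vcat(env_bytes, Vector{UInt8}(call))))

# Decide what a call to @cluster should do, given the job database.
function decide(db::Dict{String,JobRecord}, key::String, resources::Dict{Symbol,Any})
    rec = get(db, key, nothing)
    rec === nothing && return :submit        # never seen: upload inputs and queue
    rec.status == :done && return :reuse     # memoised; resource changes are ignored
    rec.status == :failed && rec.resources != resources && return :resubmit
    rec.status == :failed && return :rethrow # assumption: same request, same error
    return :wait                             # still queued or running
end

# Hypothetical promise. `poll` is a zero-argument function returning
# (:pending, nothing), (:done, value) or (:failed, message).
struct ClusterPromise
    poll::Function
    interval::Float64    # seconds between polls
end

# Block until the job finishes; return its value or throw its error.
function force(p::ClusterPromise)
    while true
        status, payload = p.poll()
        status == :done && return payload
        status == :failed && error("cluster job failed: $payload")
        sleep(p.interval)
    end
end
```

Under this sketch, changing mem => 1e9 to mem => 1e12 leaves the job key unchanged, because the key covers only the environment and the call. That is what lets the ten finished jobs be reused and the five failed ones be resubmitted without transferring their inputs again.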
