thisrod/cluster.md

## cluster.md

      
Here's a sequence of events.
The memoisation part would be useful locally, not just on a cluster.  E.g. look up the order parameter that I prepared earlier, then compute the sound wave spectrum in a new way.
It would be useful to have a @cluster_loop macro, which memoised the result of doing 100 iterations, and skipped to there when you asked for 200.  Maybe save the Fibonacci numbers of iterations as well as the final one.
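
As a rough illustration of the checkpointing idea, here is a minimal sketch written as a plain function rather than a macro. The names `fib_checkpoints` and `iterate_memoised!` are made up for this example, and the cache is an in-memory dictionary standing in for whatever storage a real @cluster_loop would use.

```julia
# Hypothetical sketch: none of these names come from Cluster.jl.

# Fibonacci numbers 1, 2, 3, 5, 8, ... that do not exceed n.
function fib_checkpoints(n::Integer)
    marks = Int[]
    a, b = 1, 2
    while a <= n
        push!(marks, a)
        a, b = b, a + b
    end
    return marks
end

# Apply `step` to `x0` n times. `cache` maps an iteration count to the state
# after that many iterations; it is consulted first and extended on the way.
function iterate_memoised!(cache::AbstractDict, step, x0, n::Integer)
    # Resume from the latest saved iteration count that is not past n.
    start = 0
    for k in keys(cache)
        if k <= n && k > start
            start = k
        end
    end
    x = start == 0 ? x0 : cache[start]
    marks = Set(fib_checkpoints(n))
    for k in (start + 1):n
        x = step(x)
        if k in marks || k == n
            cache[k] = x   # save Fibonacci counts and the final count
        end
    end
    return x
end

# Usage: the second call starts at iteration 100 instead of iteration 0.
calls = Ref(0)
step = x -> (calls[] += 1; x + 1)
cache = Dict{Int,Int}()
iterate_memoised!(cache, step, 0, 100)
iterate_memoised!(cache, step, 0, 200)
```

Saving the Fibonacci counts keeps the number of checkpoints logarithmic in the number of iterations, while still leaving a reasonably recent restart point for most requests.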

I run a test script, which goes something like this:

```julia
results = Vector{Any}(undef, 15)  # one promise per job

for n = 1:15
    results[n] = @cluster f(randn(2^n,2^n)) (wall_time => "8 hours", mem => 1e9)
end

for x in results
    plot(x)
end
```


The first call to @cluster finds an ozstar.swin.edu.au line in my .ssh/known_hosts,
figures that cluster will do, and looks up how it works in the cluster navigation database
that comes with Cluster.jl.
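
The lookup could work something like the sketch below. It assumes the known_hosts entries are stored unhashed (OpenSSH can hash host names, in which case they can't be read back), and the database is just a dictionary invented for this example; `known_hostnames`, `find_cluster` and `cluster_db` are hypothetical names.

```julia
# Hypothetical sketch: the database layout and all names are assumptions.

# Host names from the plain (unhashed) entries of a known_hosts file.
function known_hostnames(text::AbstractString)
    hosts = String[]
    for line in eachline(IOBuffer(text))
        fields = split(line)
        # Skip blank lines, comments, and hashed entries ("|1|...").
        (isempty(fields) || startswith(fields[1], "#") ||
            startswith(fields[1], "|")) && continue
        # Drop an optional marker such as @cert-authority or @revoked.
        startswith(fields[1], "@") && popfirst!(fields)
        isempty(fields) && continue
        for name in split(fields[1], ',')
            # Entries for non-default ports look like [host]:port.
            m = match(r"^\[(.+)\]:\d+$", name)
            push!(hosts, m === nothing ? String(name) : String(m[1]))
        end
    end
    return hosts
end

# Stand-in for the cluster navigation database: host name => settings.
cluster_db = Dict(
    "ozstar.swin.edu.au" => "settings for this cluster would go here",
)

# First known host that the database has an entry for, or nothing.
function find_cluster(text::AbstractString, db::AbstractDict)
    for host in known_hostnames(text)
        haskey(db, host) && return host
    end
    return nothing
end

# Usage with a made-up known_hosts file (keys truncated for the example).
sample = """
# a comment
github.com ssh-ed25519 AAAA
ozstar.swin.edu.au,192.0.2.10 ssh-ed25519 AAAA
"""
find_cluster(sample, cluster_db)
```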


@cluster introspects the current environment, writes it to JLD and Project.toml files,
transfers them to the cluster, and queues the n = 1 job.


The next 14 calls to @cluster notice that there is already a JLD file on the cluster
whose checksum matches the one they generated, and use that.
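
A minimal sketch of that check, assuming SHA-256 as the checksum (the text doesn't say which one is meant). A `Set` of hex digests stands in for the files already on the cluster, and `upload_if_missing!` is a hypothetical name.

```julia
using SHA  # standard library

# Hex SHA-256 digest of a file's contents.
checksum(bytes::Vector{UInt8}) = bytes2hex(sha256(bytes))

# `remote` stands in for the checksums of files already on the cluster.
# Returns the checksum and whether a transfer was needed.
function upload_if_missing!(remote::Set{String}, bytes::Vector{UInt8})
    key = checksum(bytes)
    key in remote && return (key, false)   # already there, nothing to send
    push!(remote, key)                     # a real version would copy the file here
    return (key, true)
end

# Usage: only the first of many identical environments is transferred.
remote = Set{String}()
env = Vector{UInt8}("contents of the environment file")
upload_if_missing!(remote, env)
upload_if_missing!(remote, env)
```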


15 jobs are queued, and results is an array of 15 promises.


The first call to plot forces the first promise.  This is a small job, so it has
already finished, and the result gets transferred back to my desktop and plotted.


Further calls to plot block until the job is complete and the promise can be forced,
then retrieve results and plot them.
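
Julia's built-in tasks already behave a lot like these promises, which gives a feel for the blocking behaviour on a single machine. In this sketch `slow_square` is a made-up stand-in for a cluster job; `@async` and `fetch` are the ordinary Base functions, not part of any cluster package.

```julia
# Stand-in for a cluster job: larger n takes longer to finish.
slow_square(n) = (sleep(0.01 * n); n^2)

# Queue every job without waiting; each Task acts as a promise.
promises = map(n -> @async(slow_square(n)), 1:5)

# Forcing a promise: fetch blocks until that task is done, then returns
# its value. Early (small) jobs are usually finished by the time we ask.
for p in promises
    println(fetch(p))   # prints 1, 4, 9, 16, 25 in that order
end
```

Unlike these tasks, the promises described here would have to survive the desktop being shut down, so they can't simply be in-process objects.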


At some point, I go home for dinner, and shut down my desktop.


The next morning, I come in to work, start up my desktop, and run the script again.


The calls to @cluster detect that f has already been evaluated in an environment
whose checksum matches that of their JLD file, and return
promises to the memoised results.
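
For results to outlive a session, they have to be stored somewhere keyed by that checksum. Here is a minimal local sketch, assuming one file per key in a directory and Julia's Serialization standard library for storage; `memoised` and the `.jls` file naming are choices made for this example.

```julia
using Serialization  # standard library

# Return the stored result for `key` if there is one in `dir`; otherwise
# run `f`, store its result under `key`, and return it.
function memoised(f, dir::AbstractString, key::AbstractString)
    path = joinpath(dir, key * ".jls")
    isfile(path) && return deserialize(path)
    result = f()
    serialize(path, result)
    return result
end

# Usage: the second call reads the file instead of recomputing.
dir = mktempdir()
key = "checksum-of-the-environment"   # placeholder for a real checksum
memoised(() -> sum(abs2, 1:1000), dir, key)
memoised(() -> sum(abs2, 1:1000), dir, key)
```

The serialised format is tied to the Julia version that wrote it, so a longer-lived store would want a more stable file format such as the JLD files mentioned above.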


The first 10 calls to plot draw graphs.


The 11th call forces its promise, which throws an error and reports that the cluster ran out of memory.


I change mem => 1e9 to mem => 1e12, and run the script again.


The first 10 calls to @cluster notice that their jobs have run successfully, ignore
the change in resource requests by default, and return promises to the memoised results.


Calls 11 through 15 notice that their jobs failed last time, and that the
resource requests have changed, so they try again.  The JLD file with the environment is still
on the cluster, so the inputs don't get transferred.
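
The rule in the last two steps can be written down as a small decision function. This is only a sketch of the behaviour described above: `JobRecord`, `next_action` and the returned symbols are hypothetical. The `:rethrow` case is inferred from the morning rerun, where the failed job's promise threw its error again while the resource request was unchanged.

```julia
# Hypothetical record of a previous run; the field names are assumptions.
struct JobRecord
    status::Symbol          # :succeeded, :failed or :pending
    resources::NamedTuple   # the resource request used last time
end

# Decide what a repeated @cluster call should do about its earlier job.
function next_action(previous::Union{JobRecord,Nothing}, requested::NamedTuple)
    previous === nothing && return :submit           # never run before
    previous.status == :succeeded && return :reuse   # resource changes ignored
    previous.status == :pending && return :wait
    # The job failed: try again only if the request differs from last time.
    return previous.resources == requested ? :rethrow : :submit
end

# Usage, mirroring the change from mem => 1e9 to mem => 1e12.
old = (wall_time = "8 hours", mem = 1e9)
new = (wall_time = "8 hours", mem = 1e12)
next_action(JobRecord(:succeeded, old), new)   # :reuse
next_action(JobRecord(:failed, old), new)      # :submit
```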


The script plots 10 graphs, then the 11th promise blocks, and polls the
cluster every few minutes to see if the job has finished.
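
The polling could be as simple as the loop below. It's a sketch: `status` is a hypothetical zero-argument function that would ask the cluster's queue about one job, and the five-minute default interval is just a guess at "every few minutes".

```julia
# Block until `status()` stops returning :pending, checking at a fixed
# interval (in seconds). Gives up and returns :pending after `max_polls`.
function wait_for_job(status; interval = 300.0, max_polls = typemax(Int))
    for _ in 1:max_polls
        s = status()
        s == :pending || return s
        sleep(interval)
    end
    return :pending
end

# Usage with a fake job that finishes on the third poll.
polls = Ref(0)
fake_status() = (polls[] += 1; polls[] < 3 ? :pending : :succeeded)
wait_for_job(fake_status; interval = 0.01)
```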


A few days later, all the jobs have finished.  I rerun the script, and get 15 graphs.