Here's a sequence of events.
The memoisation part would be useful locally, not just on a cluster. For example, look up the order parameter that I prepared earlier, then compute the sound wave spectrum in a new way.
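A minimal sketch of what the local version might look like, assuming results are cached on disk under a key derived from the function name and its arguments. The `memoised` helper and its `cache_dir` keyword are made up for illustration, and `Serialization` from the standard library stands in for JLD to keep the example self-contained.

```julia
using Serialization

# Hypothetical helper: return f(args...), reusing a result saved on disk
# if the same function name and argument values have been seen before.
function memoised(f, args...; cache_dir::AbstractString)
    mkpath(cache_dir)
    # Key the cache on the function's name and the argument values.
    # Assumption: `hash` is good enough for a sketch; it is not guaranteed
    # to be stable across Julia versions, so a real tool would want a
    # proper checksum instead.
    key = string(hash((string(nameof(f)), args)), base = 16)
    path = joinpath(cache_dir, key * ".jls")
    # Cache hit: read the saved result and skip the computation.
    isfile(path) && return deserialize(path)
    # Cache miss: compute, save, and return.
    result = f(args...)
    serialize(path, result)
    return result
end
```

With something like this, the order parameter would be computed on the first call and read back from disk on later ones, so only the new spectrum calculation costs anything.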
It would be useful to have a `@cluster_loop` macro, which memoised the result of doing 100 iterations, and skipped to there when you asked for 200. Maybe save the state at Fibonacci-numbered iteration counts as well as the final one.
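Here is a rough sketch of the checkpointing idea, written as a plain function rather than a macro. The name `checkpointed_loop` and the in-memory dictionary standing in for storage on the cluster are assumptions for illustration. It also assumes `step` returns a new state rather than mutating its argument, so that saved checkpoints are not overwritten.

```julia
# Hypothetical stand-in for @cluster_loop: apply `step` to `state` n times,
# resuming from the latest checkpoint at or below n.
function checkpointed_loop(step, state, n::Integer, checkpoints::AbstractDict)
    # Resume from the largest saved iteration count that does not overshoot n.
    saved = [k for k in keys(checkpoints) if k <= n]
    done = 0
    if !isempty(saved)
        done = maximum(saved)
        state = checkpoints[done]
    end
    # `a` walks the Fibonacci sequence 1, 2, 3, 5, 8, ...;
    # skip the values that have already been passed.
    a, b = 1, 2
    while a <= done
        a, b = b, a + b
    end
    for i in (done + 1):n
        state = step(state)
        # Save at Fibonacci iteration counts and at the final iteration.
        if i == a || i == n
            checkpoints[i] = state
        end
        if i == a
            a, b = b, a + b
        end
    end
    return state
end
```

Asking for 100 iterations and then for 200 with the same `checkpoints` runs only the second hundred steps, because the second call picks up the state saved at iteration 100.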
- I run a test script, which goes something like this:

      for n = 1:15
          results[n] = @cluster f(randn(2^n,2^n)) (wall_time => "8 hours", mem => 1e9)
      end
      for x = results
          plot(x)
      end

- The first call to `@cluster` finds an `ozstar.swin.edu.au` line in my `.ssh/known_hosts`, figures that cluster will do, and looks up how it works in the cluster navigation database that comes with `Cluster.jl`.
- `@cluster` introspects the current environment, writes it to JLD and Project.yaml files, transfers them to the cluster, and queues the `n = 1` job.
- The next 14 calls to `@cluster` notice that there is already a JLD file on the cluster whose checksum matches the one they generated, and use that.
- 15 jobs are queued, and `results` is an array of 15 promises.
- The first call to `plot` forces the first promise. This is a small job, so it has already finished, and the result gets transferred back to my desktop and plotted.
- Further calls to `plot` block until the job is complete and the promise can be forced, then retrieve results and plot them.
- At some point, I go home for dinner, and shut down my desktop.
- The next morning, I come in to work, start up my desktop, and run the script again.
- The calls to `@cluster` detect that `f` has already been evaluated with an environment matching the checksum of their JLD file and so on, and return promises to the memoised results.
- The first 10 calls to `plot` draw graphs.
- The 11th call forces its promise, which throws an error and reports that the cluster ran out of memory.
- I change `mem => 1e9` to `mem => 1e12`, and run the script again.
- The first 10 calls to `@cluster` notice that their jobs have run successfully, ignore the change in resource requests by default, and return promises to the memoised results.
- Calls 11 through 15 notice that their jobs failed last time, and that the resource requests have changed, so they try again. The JLD file with the environment is still on the cluster, so the inputs don't get transferred.
- The script plots 10 graphs, then the 11th promise blocks, and polls the cluster every few minutes to see if the job has finished.
- A few days later, all the jobs have finished. I rerun the script, and get 15 graphs.
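
The reuse-or-retry decisions in the story above can be summarised in a small sketch. `JobRecord`, `plan` and the status symbols are hypothetical names, not part of any existing package. The choice to rethrow a stored failure when the resource request is unchanged is an assumption, based on how the 11th promise behaves on the second morning.

```julia
# Hypothetical record of a previously submitted job, stored under the
# checksum of the call and its environment.
struct JobRecord
    status::Symbol          # :succeeded, :failed or :pending
    resources::NamedTuple   # resource request used for that submission
end

# Decide what a call to @cluster should do, given the stored record (or
# `nothing` if this checksum has never been seen) and the current request.
function plan(record::Union{JobRecord,Nothing}, resources::NamedTuple)
    # Never seen before: queue a new job.
    record === nothing && return :submit
    # Already succeeded: reuse the memoised result, ignoring resource changes.
    record.status == :succeeded && return :reuse
    # Still queued or running: hand back a promise that waits for it.
    record.status == :pending && return :wait
    # Failed last time: try again only if the resource request has changed
    # (assumption: otherwise the stored error is reported again).
    return record.resources == resources ? :rethrow : :submit
end
```

In the walkthrough, calls 1 to 10 would get `:reuse` after the change to `mem => 1e12`, while calls 11 to 15 would get `:submit`.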