# Debugging HTCondor Setup

I've been following guidance from this example and the basic cluster is working, but I'm running into trouble when trying to run a parallel universe job. The basic setup has USE_POOL_PASSWORD=yes and I'm able to submit basic jobs. This is great! However, I want to use the parallel universe ("universe = parallel"), so I need to move beyond that basic setup.

## Context

For more context, I've created the start of a Kubernetes operator to run HTCondor. I have been able to get basic hello world jobs running, and today I was working on MPI support to run LAMMPS (this is the basic workflow we are using to look at latency for different schedulers as operators).

## Steps

I’ve taken the following steps:

### Parallel Universe

I found this “parallel universe” and first tried submitting a job using it. I saw that (despite having USE_POOL_PASSWORD=yes) it was trying to use IDTOKEN auth. My submit.sh for a basic example looks like this:

```
universe = parallel
executable = /bin/sleep
arguments = 5
machine_count = 2
output = /tmp/sleep.out
error = /tmp/sleep.err
log = /tmp/sleep.log
request_cpus   = 1

queue
```
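For what it's worth, I'm submitting and poking at the queue roughly like this (nothing special, just the standard tools):

```
# Submit the parallel job and watch the queue
condor_submit submit.sh
condor_q -nobatch

# Ask the schedd why jobs aren't matching
condor_q -better-analyze
```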

My cluster has one central manager, one submit node, and two execute nodes. A basic "hello world" job like this works great:

```
# Example 1
# Simple HTCondor submit description file
# Everything with a leading # is a comment

executable   = /usr/bin/echo
arguments    = hello world

output       = /tmp/hello-world.out
error        = /tmp/hello-world.err
log          = /tmp/hello-world.log

request_cpus   = 1
queue
```

I read that I needed to add dedicated hosts, and I was able to do that!

```
$ condor_status -const '!isUndefined(DedicatedScheduler)' \
      -format "%s\t" Machine -format "%s\n" DedicatedScheduler
htcondor-sample-execute-0-0.htc-service.htcondor-operator.svc.cluster.local     DedicatedScheduler@htcondor-sample-manager-0-0.htc-service.htcondor-operator.svc.cluster.local
htcondor-sample-execute-0-1.htc-service.htcondor-operator.svc.cluster.local     DedicatedScheduler@htcondor-sample-manager-0-0.htc-service.htcondor-operator.svc.cluster.local
```
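In case it helps, this is roughly the dedicated-resource snippet I dropped into the execute node config to get them listed above (the RANK line is just the standard example from the manual; other policy knobs are omitted):

```
# Point the startd at the dedicated scheduler and advertise the attribute
DedicatedScheduler = "DedicatedScheduler@htcondor-sample-manager-0-0.htc-service.htcondor-operator.svc.cluster.local"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

# Prefer jobs from the dedicated scheduler over everything else
RANK = Scheduler =?= $(DedicatedScheduler)
```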

But when I saw them connecting, it was trying to use this IDTOKEN auth:

```
06/19/23 22:20:52 (pid:73) Adding submitter DedicatedScheduler@htcondor-sample-submit-0-0.htc-service.htcondor-operator.svc.cluster.local to the submitter map for default pool.
06/19/23 22:20:52 (pid:73) SECMAN: FAILED: Received "DENIED" from server for user condor_pool@ using method IDTOKENS.
06/19/23 22:20:52 (pid:73) Failed to send RESCHEDULE to negotiator htcondor-sample-manager-0-0.htc-service.htcondor-operator.svc.cluster.local: SECMAN:2010:Received "DENIED" from server for user condor_pool@ using method IDTOKENS.
06/19/23 22:20:57 (pid:73) Received a superuser command
06/19/23 22:20:57 (pid:73) Number of Active Workers 0
06/19/23 22:20:59 (pid:73) Received a superuser command
06/19/23 22:20:59 (pid:73) Number of Active Workers 0
```

And then, after the failure, it reports no active workers. To summarize, the current setup works for the basic hello-world job, but not for the parallel one.

### ID Token Auth

I said "OK, great, let's do that!" and went through the process to generate a token. Since we don't have any concept of a shared filesystem, I added a one-off pod to start the same manager node, generate the token, and then save to the operator to write to a read only config map. We actually save this to /htcondor_operator/token and then copy over to /root/secrets/token as I saw in the various HTCondor docker example.

This led to me seeing token requests like this in the logs:

```
06/19/23 21:56:38 (pid:66) Token requested not yet approved; please ask collector htcondor-sample-manager-0-0.htc-service.htcondor-operator.svc.cluster.local admin to approve request ID 0097030.
06/19/23 21:56:38 (pid:66) Number of Active Workers 0
06/19/23 21:56:43 (pid:66) Token requested not yet approved; please ask collector htcondor-sample-manager-0-0.htc-service.htcondor-operator.svc.cluster.local admin to approve request ID 0097030.
06/19/23 21:56:48 (pid:66) Token requested not yet approved; please ask collector htcondor-sample-manager-0-0.htc-service.htcondor-operator.svc.cluster.local admin to approve request ID 0097030.
```

I figured out that I could manually approve them; however, the only way to auto-approve was based on an IP address. That wouldn't be a good design for Kubernetes - we don't have stable addresses. So instead I wrote a dumb loop for the manager to run as a background process after starting (also not great, but hey, I just want to get this working!):

```
yum install -y jq

# Ideally we can provide this via a config or the condor_token_request_auto_approve that takes a hostname
while true
do
    for requestid in $(condor_token_request_list -json | jq -r .[].RequestId); do
        echo "yes" | condor_token_request_approve -reqid ${requestid}
    done
    sleep 15
done
```
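For completeness, the IP-based auto-approval I'm trying to avoid looks roughly like this (the netblock below is just a made-up pod CIDR):

```
# Auto-approve token requests coming from this network for the next hour
condor_token_request_auto_approve -netblock 10.244.0.0/16 -lifetime 3600
```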

That didn't actually work either - when I tried to run the job, the log showed it essentially failing to connect, and then I'd have zero hosts. I scrolled too far down to save the actual message (sorry!).

### Security Config

I spent some time today trying to figure out if I could tweak the security config to my liking. Specifically, I added these two lines at the end:

echo "SEC_DEFAULT_AUTHENTICATION_METHODS = FS, PASSWORD, TOKEN, IDTOKENS" >> /etc/condor/config.d/01-security.conf 
echo "ALLOW_WRITE = *" >> /etc/condor/config.d/01-security.conf 

Yes, absolutely not great practice, but at this point I really just wanted something to work! But then still:

```
06/19/23 22:25:42 (pid:71) Adding submitter DedicatedScheduler@htcondor-sample-submit-0-0.htc-service.htcondor-operator.svc.cluster.local to the submitter map for default pool.
06/19/23 22:25:42 (pid:71) SECMAN: FAILED: Received "DENIED" from server for user condor_pool@htcondor-sample-manager-0-0.htc-service.htcondor-operator.svc.cluster.local using method PASSWORD.
06/19/23 22:25:42 (pid:71) Failed to send RESCHEDULE to negotiator htcondor-sample-manager-0-0.htc-service.htcondor-operator.svc.cluster.local: SECMAN:2010:Received "DENIED" from server for user condor_pool@htcondor-sample-manager-0-0.htc-service.htcondor-operator.svc.cluster.local using method PASSWORD.
06/19/23 22:25:44 (pid:71) Received a superuser command
06/19/23 22:25:44 (pid:71) Number of Active Workers 0
06/19/23 22:25:45 (pid:71) Received a superuser command
06/19/23 22:25:45 (pid:71) Number of Active Workers 0
06/19/23 22:25:46 (pid:71) Received a superuser command
06/19/23 22:25:46 (pid:71) Number of Active Workers 0
06/19/23 22:25:57 (pid:71) Received a superuser command
06/19/23 22:25:57 (pid:71) Number of Active Workers 0
```

And then I thought: hmm, maybe condor_pool is some special user I need to add? The security config in the container gates these lines on the USE_POOL_PASSWORD setting:

```
if $(USE_POOL_PASSWORD:no)
    SEC_DEFAULT_AUTHENTICATION_METHODS = $(SEC_DEFAULT_AUTHENTICATION_METHODS), PASSWORD

    ALLOW_ADVERTISE_STARTD = condor_pool@*/* $(ALLOW_ADVERTISE_STARTD)
    ALLOW_ADVERTISE_SCHEDD = condor_pool@*/* $(ALLOW_ADVERTISE_SCHEDD)
    ALLOW_ADVERTISE_MASTER = condor_pool@*/* $(ALLOW_ADVERTISE_MASTER)
endif
```

Cue Austin Danger Powers! Let's turn it on anyway :)

```
# Austin DANGER POWERS!
echo 'ALLOW_ADVERTISE_STARTD = condor_pool@*/* $(ALLOW_ADVERTISE_STARTD)' >> /etc/condor/config.d/01-security.conf 
echo 'ALLOW_ADVERTISE_SCHEDD = condor_pool@*/* $(ALLOW_ADVERTISE_SCHEDD)' >> /etc/condor/config.d/01-security.conf 
echo 'ALLOW_ADVERTISE_MASTER = condor_pool@*/* $(ALLOW_ADVERTISE_MASTER)' >> /etc/condor/config.d/01-security.conf
```
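After appending config like this, I also ask the daemons to re-read it (assuming a plain reconfig is enough for these particular knobs, which I'm not sure of):

```
# Ask the running daemons to re-read their configuration
condor_reconfig
```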

That also didn't work.

```
06/19/23 22:38:04 (pid:49) Reloading job factories
06/19/23 22:38:04 (pid:49) Loaded 0 job factories, 0 were paused, 0 failed to load
06/19/23 22:38:04 (pid:49) TransferQueueManager stats: active up=0/100 down=0/100; waiting up=0 down=0; wait time up=0s down=0s
06/19/23 22:38:04 (pid:49) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
06/19/23 22:38:04 (pid:49) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
06/19/23 22:38:13 (pid:49) Received a superuser command
06/19/23 22:38:13 (pid:49) Number of Active Workers 0
06/19/23 22:38:31 (pid:49) Received a superuser command
06/19/23 22:38:31 (pid:49) Received a superuser command
06/19/23 22:38:31 (pid:49) SetAttribute modifying attribute Scheduler in nonexistent job 1.1
06/19/23 22:38:31 (pid:49) Found 0 potential dedicated resources in 0 seconds
06/19/23 22:38:31 (pid:49) Skipping job 1.0 because it requests more nodes (2) than exist in the pool (0)
06/19/23 22:38:31 (pid:49) Adding submitter DedicatedScheduler@htcondor-sample-submit-0-0.htc-service.htcondor-operator.svc.cluster.local to the submitter map for default pool.
06/19/23 22:38:31 (pid:49) SECMAN: FAILED: Received "DENIED" from server for user condor_pool@htcondor-sample-manager-0-0.htc-service.htcondor-operator.svc.cluster.local using method PASSWORD.
06/19/23 22:38:31 (pid:49) Failed to send RESCHEDULE to negotiator htcondor-sample-manager-0-0.htc-service.htcondor-operator.svc.cluster.local: SECMAN:2010:Received "DENIED" from server for user condor_pool@htcondor-sample-manager-0-0.htc-service.htcondor-operator.svc.cluster.local using method PASSWORD.
06/19/23 22:38:34 (pid:49) Received a superuser command
06/19/23 22:38:34 (pid:49) Number of Active Workers 0
```

I'd be happy to answer any questions or show more configs! Let me know what you'd like to see.

## What I'd Like

I'd like to be able to modify this setup to add support for the parallel universe. I suspect the fix is a wonky combination of configuration values that someone only a day into using this scheduler (let alone configuring it) wouldn't know to look for! Any advice would be greatly appreciated. For more context, I'd like to be able to provide HTCondor as an operator in Kubernetes. I think I'm close, but I won't consider it fully supported until the MPI use case is also working. Thank you!
