Skip to content

Instantly share code, notes, and snippets.

@maelvls
Last active February 22, 2022 18:32
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save maelvls/74c6584d95017a79e8ab107030b060e6 to your computer and use it in GitHub Desktop.
Save maelvls/74c6584d95017a79e8ab107030b060e6 to your computer and use it in GitHub Desktop.
cert-manager ACME solver uses the `jwk` field instead of `kid` in neworder call for non-letsencrypt calls

Investigation: cert-manager ACME solver uses the jwk field instead of kid in neworder call for non-letsencrypt calls

In the Stackover question 70897574, user1563721 suggests that cert-manager's ACME solver is not behaving as it should with non-let's encrypt servers. More specifically, that new-order is called using kid instead of jwk. In the remainder of this page, I detail how to reproduce this issue using Pebble (a smaller version of Boulder, which is the ACME server Let's Encrypt uses).

Related:

Install cert-manager but turn off the deployment:

kind create cluster
helm upgrade --install cert-manager jetstack/cert-manager --version v1.7.1 --namespace cert-manager --set installCRDs=true --create-namespace
kubectl scale deployment --replicas 0 -n cert-manager cert-manager

Make sure me resolves to 127.0.0.1. That's due to the fact that Go binaries do not honor the HTTPS_PROXY variable for the 127.0.0.1 and localhost domains:

sudo perl -ni -e 'print if !/^\d+\.\d+\.\d+\.\d+ me$/d' /etc/hosts
sudo tee -a /etc/hosts <<<"127.0.0.1 me"

Run mitmproxy:

curl -L https://raw.githubusercontent.com/maelvls/kubectl-incluster/main/josejson.py >/tmp/josejson.py
curl -L https://raw.githubusercontent.com/maelvls/kubectl-incluster/main/watch-stream.py >/tmp/watch-stream.py
mitmproxy --listen-host 0.0.0.0 -p 9090 --ssl-insecure -s /tmp/watch-stream.py -s /tmp/josejson.py \
  --set client_certs=$(kubectl incluster --print-client-cert >/tmp/me.pem && echo /tmp/me.pem)

Run pebble in a separate terminal session:

go install github.com/letsencrypt/pebble/cmd/pebble@latest
go install github.com/jsha/minica@latest
cd /tmp
minica --domains me -ip-addresses 127.0.0.1,0.0.0.0
pebble -config /dev/stdin <<EOF
{
  "pebble": {
    "listenAddress": "me:14000",
    "managementListenAddress": "0.0.0.0:15000",
    "certificate": "/tmp/me/cert.pem",
    "privateKey": "/tmp/me/key.pem",
    "httpPort": 5002,
    "tlsPort": 5001,
    "externalAccountBindingRequired": false
  }
}
EOF

Run the cert-manager controller in yet another terminal session:

HTTPS_PROXY=:9090 go run ./cmd/controller -v=2 --leader-elect=false --kubeconfig=<(kubectl incluster --replace-ca-cert ~/.mitmproxy/mitmproxy-ca-cert.pem | perl -pe "s/\d+.0.0.\d+/me/")

Finally, create the issuer:

kubectl apply -f- <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: minica
spec:
  dnsNames:
    - a.b.c
  issuerRef:
    kind: Issuer
    name: minica
  secretName: minica
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: minica
spec:
  acme:
    email: foo@bar.com
    server: https://me:14000/dir
    skipTLSVerify: true
    privateKeySecretRef:
      name: minica
    solvers:
      - http01:
          ingress:
            class: traefik
EOF

At this point, there is no issue. The new account is created with no problem, and subsequent new order calls work. Here is the trace using mitmproxy:

Screenshot from 2022-02-22 15-20-45

The call itself contains the kid:

Screenshot from 2022-02-22 15-23-27

🚧 I uploaded the mitmproxy trace in case you want to check it out. You can read it with the following command:

curl -L https://raw.githubusercontent.com/maelvls/kubectl-incluster/main/josejson.py >/tmp/josejson.py
mitmproxy -s /tmp/josejson.py -r <(curl -L https://gist.github.com/maelvls/74c6584d95017a79e8ab107030b060e6/raw/da6a0c1b39d8cd7623bfef7f4dae80b3db881459/pebble-mitmproxy-traces.tar.gz | tar xzO pebble-happy.mitmproxy)

After re-starting pebble, then renewing the certificate with the following command:

curl -L https://github.com/jetstack/cert-manager/releases/download/v1.7.1/cmctl-linux-amd64.tar.gz | tar xz cmctl
./cmctl renew minica

then things go sideways:

Screenshot from 2022-02-22 14-22-30

The call to /order-plz contains the jwk field instead of kid:

Screenshot from 2022-02-22 14-23-20

The response makes it clear, /order-plz wants jwk, not kid: Screenshot from 2022-02-22 14-24-28

🚧 To see the recorded mitmproxy trace:

curl -L https://raw.githubusercontent.com/maelvls/kubectl-incluster/main/josejson.py >/tmp/josejson.py
mitmproxy -s /tmp/josejson.py -r <(curl -L https://gist.github.com/maelvls/74c6584d95017a79e8ab107030b060e6/raw/da6a0c1b39d8cd7623bfef7f4dae80b3db881459/pebble-mitmproxy-traces.tar.gz | tar xzO pebble-unhappy.mitmproxy)

The problem

It seems like cert-manager is too optimistic regarding 400 errors. The error:

{
    "detail": "unable to find existing account for only-return-existing request",
    "status": 400,
    "type": "urn:ietf:params:acme:error:accountDoesNotExist"
}

should be understood by cert-manager as a non-retriable error and cert-manager should try to register the account again, without the {"returnExistingOnly": true} part.

I realized this issue was already been documented in 2020:

The acme library first makes an account lookup call using POST /new-acct with a payload of {"onlyReturnExisting": true}. If the account lookup fails, then that results in urn:ietf:params:acme:error:accountDoesNotExist, and we fail to obtain the kid (key id) which is needed for the /new-order payload. Nevertheless, the code persists and makes the call anyway, resulting in the error message we're seeing here.

After a debugging session with delve, I discovered that the 4 calls (with the red line above) originate from one single function in acmeorders/sync.go:

acmeOrder, err := cl.AuthorizeOrder(ctx, authzIDs, options...)
if acmeErr, ok := err.(*acmeapi.Error); ok {
  if acmeErr.StatusCode >= 400 && acmeErr.StatusCode < 500 {
    log.Error(err, "failed to create Order resource due to bad request, marking Order as failed")
    c.setOrderState(&o.Status, string(cmacme.Errored))
    o.Status.Reason = fmt.Sprintf("Failed to create Order: %v", err)
    return nil
  }
}

It seems like we decided not to re-check the registration in acme/setup.go, and that is causing this bug.

// If the Host components of the server URL and the account URL match,
// and the cached email matches the registered email, then
// we skip re-checking the account status to save excess calls to the
// ACME api.

Removing this check fixes things, but we get many more calls:

Screenshot from 2022-02-22 16-51-25

The culprit for all these newAccount calls (/sign-me-up in Pebble) with {"onlyReturnExisting": true} is a function, named accountKID, that gets called every time a POST request is made in the x/crypto/acme library:

// accountKID returns a key ID associated with c.Key, the account identity
// provided by the CA during RFC based registration.
// It assumes c.Discover has already been called.
//
// accountKID requires at most one network roundtrip.
// It caches only successful result.
//
// When in pre-RFC mode or when c.getRegRFC responds with an error, accountKID
// returns noKeyID.
func (c *Client) accountKID(ctx context.Context) keyID {
	c.cacheMu.Lock()
	defer c.cacheMu.Unlock()
	if !c.dir.rfcCompliant() {
		return noKeyID
	}
	if c.kid != noKeyID {
		return c.kid
	}
	a, err := c.getRegRFC(ctx) // 💣️ Returns urn:ietf:params:acme:error:accountDoesNotExist
	if err != nil {
		return noKeyID
	}
	c.kid = keyID(a.URI)
	return c.kid
}

The above error is ignored, and the "actual" POST request e.g. NewOrder will still happen. But since the Key ID isn't set, the POST request will be performed using the jwk field instead of kid.

I came to the conclusion that the x/crypto/acme library may not be RFC compliant. When the NewOrder function is called without having run Register first, you will end up with a POST request with jwk instead of kid.

This file has been truncated, but you can view the full file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment