On Twitter the other day, I was lamenting the state of OCSP stapling support on Linux servers, and got asked by several people to write up what I think the requirements are for OCSP stapling support.
Support for keeping a long-lived (disk) cache of OCSP responses.
This should be fairly simple. Any restarting of the service shouldn't blow away previous responses that were obtained. This doesn't need to be disk, just stable - and disk is an easy stable storage for most server operators.
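As a minimal sketch of what "stable" could look like with disk as the store: write each response to a temporary file and rename it into place, so a crash or restart never leaves a half-written response behind. The function names, the `.ocsp` suffix, and the idea of keying files by a `cert_id` string are assumptions made for illustration, not part of any existing server.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_response(cache_dir: Path, cert_id: str, der_response: bytes) -> Path:
    """Persist a DER-encoded OCSP response so it survives a restart."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{cert_id}.ocsp"  # hypothetical naming scheme
    # Write to a temporary file in the same directory, then rename it into
    # place, so readers only ever see a complete old or complete new file.
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(der_response)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the temporary file if the write or rename failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target


def load_response(cache_dir: Path, cert_id: str) -> Optional[bytes]:
    """Return the cached response, or None if nothing has been stored yet."""
    try:
        return (cache_dir / f"{cert_id}.ocsp").read_bytes()
    except FileNotFoundError:
        return None
```

A response loaded this way after a restart still needs to be checked before it is served, since it may have expired while the service was down.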
Validate the server responses to make sure it is something the client will accept.
There are a number of ways to botch this on the server and, sadly, a number of ways in which CAs can botch their response generators. The most immediate and obvious issues are situations where you have a 'revoked' response, or where you receive an OCSP 'tryLater' or 'internalError' response. However, there are also more subtle issues, like making sure the OCSP response is actually well-formed (sometimes uploads to CDNs are botched), is time-valid for the current time (sometimes the CDNs serve stale files), is for the certificate requested (yes, sadly, really), and is free of PKI-related errors (for example, the delegated OCSP signer's certificate being expired).
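The checks above can be sketched as a single function that returns the reasons a response should not be stapled. This assumes the response has already been parsed and its signature verified by a real OCSP library; `ParsedResponse` and `staple_problems` are hypothetical names, and the sketch treats a missing `nextUpdate` as a problem, which is an assumption rather than something the text above requires.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class ParsedResponse:
    """Hypothetical, already-parsed view of an OCSP response."""
    response_status: str          # "successful", "tryLater", "internalError", ...
    cert_status: Optional[str]    # "good", "revoked" or "unknown"
    serial_number: Optional[int]  # serial number the response is about
    this_update: Optional[datetime]
    next_update: Optional[datetime]
    signer_not_after: Optional[datetime] = None  # delegated signer expiry, if any


def staple_problems(resp: ParsedResponse, expected_serial: int,
                    now: datetime) -> List[str]:
    """Return every reason not to staple `resp`; an empty list means OK."""
    if resp.response_status != "successful":
        # Error responses carry no per-certificate data to inspect.
        return [f"responder returned {resp.response_status}"]
    problems = []
    if resp.cert_status != "good":
        problems.append(f"certificate status is {resp.cert_status}")
    if resp.serial_number != expected_serial:
        problems.append("response is for a different certificate")
    if resp.this_update is None or resp.next_update is None:
        problems.append("missing thisUpdate or nextUpdate")
    elif not (resp.this_update <= now < resp.next_update):
        problems.append("response is not time-valid")
    if resp.signer_not_after is not None and now >= resp.signer_not_after:
        problems.append("delegated signer certificate has expired")
    return problems
```

Returning a list rather than a boolean makes it easier to log or alert on exactly what the CA (or the CDN in front of it) got wrong.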
Refreshes the response, in the background, with sufficient time before expiration.
A rule of thumb would be to fetch at notBefore + (notAfter - notBefore) / 2 (for an OCSP response, these are the thisUpdate and nextUpdate fields), which is saying "start fetching halfway through the validity period". You want enough time to handle situations like the OCSP responder giving you junk, and also enough time to raise an alert if something has gone really wrong.
What you do NOT want to do is start OCSP fetching the first time you need it, or wait until the response is fully expired - that creates really terrible experiences all around, and makes your CA an even bigger point of failure.
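The halfway rule of thumb is simple enough to write down directly. This sketch takes the response's thisUpdate and nextUpdate as the start and end of the validity period; the function names are illustrative only.

```python
from datetime import datetime, timezone


def refresh_at(this_update: datetime, next_update: datetime) -> datetime:
    """Halfway through the validity period: start + (end - start) / 2."""
    return this_update + (next_update - this_update) / 2


def should_refresh(now: datetime, this_update: datetime,
                   next_update: datetime) -> bool:
    """True once the background fetcher should start asking for a new response."""
    return now >= refresh_at(this_update, next_update)


# A response valid for seven days gets refreshed after three and a half.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = datetime(2024, 1, 8, tzinfo=timezone.utc)
print(refresh_at(start, end))  # 2024-01-04 12:00:00+00:00
```

With a seven-day response that leaves three and a half days to retry, notice junk, and page someone before anything expires.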
That said, even with background refreshing, such a system should observe the Lightweight OCSP Profile of RFC 5019.
This more or less boils down to "Use GET requests whenever possible, and observe HTTP cache semantics." Given how complicated the cache semantics can be to get right in a client, this can be surprisingly hard to implement correctly.
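For the GET half of that, here is a rough sketch of building the request URL: the DER-encoded OCSPRequest is base64-encoded, then URL-encoded, and appended to the responder URL. The 255-byte threshold for preferring GET is my reading of RFC 5019 and worth verifying against the RFC itself; `ocsp_get_url` and the example responder hostname are made up for illustration. The cache-semantics half (honouring `Cache-Control`, `Expires`, and so on) is not shown.

```python
import base64
from typing import Optional
from urllib.parse import quote

MAX_GET_URL_BYTES = 255  # assumed limit; check against RFC 5019


def ocsp_get_url(responder_url: str, der_request: bytes) -> Optional[str]:
    """Build the GET URL for an OCSP request, or None if POST is needed."""
    b64 = base64.b64encode(der_request).decode("ascii")
    # Base64 output can contain '+', '/' and '=', all of which must be
    # percent-encoded to survive as a single URL path segment.
    url = responder_url.rstrip("/") + "/" + quote(b64, safe="")
    if len(url) > MAX_GET_URL_BYTES:
        return None  # too long for GET; the caller should fall back to POST
    return url


# Two arbitrary bytes standing in for a real DER-encoded OCSPRequest.
print(ocsp_get_url("http://ocsp.example.test/", b"\xfb\xff"))
# http://ocsp.example.test/%2B%2F8%3D
```

Because the URL is a pure function of the request, identical requests from many clients hit the same cacheable resource, which is the point of preferring GET.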
As with any system doing background requests on a remote server, don't be a jerk and hammer the server when things are bad.
The Internet is a strange and wonderful place, and sometimes servers and networks have issues. When a server supporting OCSP stapling has trouble getting a response, hopefully it does something smarter than just retry in a busy loop, hammering the OCSP server into further oblivion. This may seem implied by the previous two remarks, but it's worth spelling out.
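One common way to be "smarter than a busy loop" is exponential backoff with random jitter, so that a fleet of servers doesn't retry in lockstep. This is only one possible policy, and the base delay of a minute and the cap of an hour are arbitrary assumptions, not recommendations from any specification.

```python
import random


def retry_delay(attempt: int, base: float = 60.0, cap: float = 3600.0,
                rng=random) -> float:
    """Seconds to wait before retry number `attempt` (0 = first retry).

    The upper bound doubles with every failed attempt up to `cap`, and the
    actual delay is drawn uniformly below that bound ("full jitter").
    """
    # Clamp the exponent so very long outages cannot overflow the arithmetic.
    ceiling = min(cap, base * (2 ** min(attempt, 32)))
    return rng.uniform(0, ceiling)


# Upper bounds for the first few retries: 60s, 120s, 240s, ... capped at 3600s.
for attempt in range(8):
    print(attempt, round(retry_delay(attempt), 1))
```

Combined with refreshing halfway through the validity period, even a responder that is down for hours gets polled only occasionally and the stapled response stays valid throughout.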
Distributed or proxiable fetching
From talking with server operators, a variety of situations are brought up as challenges for OCSP stapling. One common bucket is the problem of front-end and back-end splits - there may be thousands of FE servers, all with the same certificate, all needing to staple an OCSP response. You don't want to have all of them hammering the OCSP server - ideally, you'd have one request, made in the backend, that updates them all.
A variation of this problem is FEs that aren't actually allowed to initiate outbound connections. Sometimes it's required that the FE talk to a proxy server, sometimes it's just outright blocked - so a system should be robust in handling that distribution.
This may not be a problem for the OCSP daemon to solve - it could be that the matter is just treated as a general configuration management/distribution problem - but at least it should be clear to those deploying the config what the tradeoffs are. For example, is it possible for the config distribution system to mangle responses? Should FEs still check the validity of incoming responses?
The ability to serve old responses while fetching new responses.
That is, it shouldn't be mutually exclusive - it's not that there is the 'ONE TRUE RESPONSE' - some flexibility for multiple responses is needed.
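One way to picture this: keep every response you hold for a certificate and choose among them at serving time, rather than overwriting a single slot. The sketch below serves the freshest response that is valid right now, so a newly fetched response whose thisUpdate is still slightly in the future doesn't displace the old one early. `CachedResponse` and `pick_response` are hypothetical names, and "freshest valid wins" is just one reasonable selection rule.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class CachedResponse:
    der: bytes             # the raw response to staple
    this_update: datetime  # start of validity
    next_update: datetime  # end of validity


def pick_response(candidates: Iterable[CachedResponse],
                  now: datetime) -> Optional[CachedResponse]:
    """Return the most recently issued response that is valid at `now`."""
    valid = [c for c in candidates if c.this_update <= now < c.next_update]
    if not valid:
        return None  # nothing usable; the "things go bad" policy applies
    return max(valid, key=lambda c: c.this_update)
```

The old response keeps being served until the new one has been fetched, validated, and become time-valid, at which point the switch happens on the next handshake.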
Some idea of what to do when "things go bad".
What happens when it's been 7 days, no new OCSP response can be obtained, and the current response is about to expire? Do you:
- Stop the (web/email/ftp/xmpp) service?
- Stop serving stapled OCSP responses?
Especially in a world where Must-Staple becomes more prevalent, what action should be taken when things go awful? If it's a Must-Staple cert, it might be more beneficial to fully stop the service (thus causing monitoring to really flip out) rather than serve bad responses or no response, both of which may result in even worse user experiences.
Configurable OCSP responder per-certificate-being-checked.
The CA/Browser Forum's Baseline Requirements allow CAs to omit the authorityInfoAccess extension for situations where the subscriber has agreed to staple. This agreement can be done via contractual means or technical means, which is to say that it's not predicated on the Must-Staple extension in the certificate. The reason for this omission is to allow for smaller certificates, which offsets (by a very small amount) the size increase of the OCSP response.
For these certificates, the server operator will need to configure what the OCSP responder URL is for that certificate.
Staple by default.
If you can get all the above worked out, with sane behaviours, there is very little reason that OCSP stapling shouldn't be on by default. Make it happen!
If this seems like an unfairly long list, the reality is that virtually all of this is supported by Microsoft IIS services today. The Microsoft documentation is a bit spread out, but this is good for starters, and this is good for further reading.
Given this long list of things, which do seem somewhat 'basic', it seems a shame to require every TLS server to reimplement this. This seems ideal to have as a common, stand-alone daemon/service, which can then interface with a variety of TLS servers (IMAP, SMTP, HTTP, FTP, XMPP, etc).
Perhaps the most basic interface for this is simply dropping the OCSP response to a well-known path pre-agreed with the server. The server can monitor for changes to this file. When changes are noticed, it can start serving the new response. While some logic (such as shutting down the service) may be more complicated, that at least starts with some basic functionality.
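On the server side, the "monitor a well-known path" interface could be as simple as polling the file's metadata and reloading when it changes. This sketch assumes the daemon replaces the file atomically (write to a temporary name, then rename), so the server never reads a partial response; `StapleFile` is a hypothetical helper, and a real server might use inotify or a similar mechanism instead of polling.

```python
import os
from typing import Optional, Tuple


class StapleFile:
    """Tracks an OCSP response file dropped at a pre-agreed path."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.response: Optional[bytes] = None
        self._signature: Optional[Tuple[int, int, int]] = None

    def poll(self) -> bool:
        """Reload the file if it changed; return True if a new response was loaded."""
        try:
            st = os.stat(self.path)
        except FileNotFoundError:
            return False  # keep serving whatever was loaded before
        # Modification time, size and inode together change whenever the
        # daemon renames a new file into place.
        signature = (st.st_mtime_ns, st.st_size, st.st_ino)
        if signature == self._signature:
            return False
        with open(self.path, "rb") as f:
            self.response = f.read()
        self._signature = signature
        return True
```

The server would call `poll()` on a timer and staple `response` whenever it is not `None`; whether the server should also re-validate what it loads is the same tradeoff raised above for config distribution systems.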