The Keylime integrity verification system currently operates on a pull, or server-initiated, basis: a verifier periodically directs each enrolled node to attest its state to the server. This model is not appropriate for enterprise environments, as each attested node must therefore act as an HTTP server. The requirement to open additional ports for each node, and the associated increase in attack surface, is unacceptable from a compliance and risk management perspective.
This document aims to outline the challenges that need to be overcome in order to support an alternate push model in which the nodes themselves are responsible for driving the attestation cycle. These include changes to the registration and enrolment mechanisms, attestation and verification processes, and data model. We hope to elicit feedback from the Keylime community on these topics to arrive at a robust, forward-thinking solution which considers the latest developments in verification.
Thore Sommer (@THS-on) has previously put together a draft proposal on how some aspects of this could work. We make reference to this where relevant.
To begin, we first provide an overview of the current state of Keylime as it relates to this topic before moving onto a discussion of the inherent challenges and our ideas for overcoming them.
Keylime's stated purpose, as given on the official website, is to "bootstrap and maintain trust". The paper which presents the original design explains that this entails (1) provisioning of identity for each node and (2) monitoring of the nodes to detect integrity deviations.
Since this paper was presented at ACSAC in 2016, new approaches to handling the identity component of the equation have come on the scene, such as SPIFFE/SPIRE, a fellow CNCF project.
Keylime consists of three main components: a registrar, which acts as a simple store of identities associated with each node, a verifier, which analyses node state and detects changes, and an agent, which is installed on each node and reports information to the registrar and verifier.
The ACSAC paper also leaves a number of responsibilities to the "tenant", that is, the customer of a cloud platform. These tasks have been extracted into a management CLI, referred to by the same name. In the rest of this document, when we mention the tenant, we are referring to the command-line tool.
The Keylime registrar, verifier and tenant are all fully trusted to perform their respective responsibilities faithfully. Nodes, including their installed agents and any running workloads, are not trusted until the registrar, verifier and tenant collaborate to obtain and validate a set of trusted measurements of node state. From that point, a node is deemed trusted, at least until that trust is revoked (either by an automatic mechanism or through manual intervention by an administrator).
The authenticity of measurements is assured by the TPM of each node (the TPMs are thereby also considered to be trusted).
In the default configuration, the registrar and verifier share a single TLS certificate and corresponding private key for server authentication and secure channel establishment. The verifier and tenant share a TLS certificate and private key for client authentication. Both server and client certificate are produced by a common Keylime CA, the certificate for which is preloaded and trusted by all components (verifier, registrar, tenant and agent).
The agent has its own TLS certificate, which this document calls the NKcert (`mtls_cert` in the REST APIs and source code). The agent generates this certificate on first startup using a set of transport keys (collectively, the NK, or individually, the NKpub and NKpriv) chosen at random. NKpub is sometimes referred to simply as the `pubkey` (e.g., in the API docs). The NKcert is registered with the registrar on agent startup and then verified as being linked to a trusted TPM when the agent is first enrolled for periodic verification.
The agent also uses the TPM to create an attestation key (AK), referred to as an "attestation identity key" (AIK) in older TCG specs (and in places in the Keylime source code), associated with the TPM's endorsement hierarchy. The public portion of the attestation key (AKpub) is reported to the registrar and verifier during registration/enrolment of the agent. Additionally, the registrar receives the TPM's public endorsement key (EKpub) and endorsement certificate (EKcert).
In the current architecture, the Keylime server components are entrusted with these specific responsibilities:
Registrar | Verifier | Tenant |
---|---|---|
Stores agent identities (EKpub, EKcert, AKpub, NKcert) and verifies that the AK is bound to the EK | Periodically obtains quotes from enrolled agents and verifies node state against policy | Enrols agents, verifies EKcerts against a manufacturer trust store and supplies verification policies |
Keylime performs its functions via a handful of protocols between the various Keylime components. The relevant ones are as follows:
- Registration protocol: Enables the agent on first start to register its EKpub, EKcert and AKpub with the registrar and prove that the AK and EK are linked.
- Enrolment protocol: Four-way protocol between the tenant, registrar, verifier and agent to enrol the agent for periodic verification by the verifier and provision nodes with credentials.
- Attestation protocol: The verifier uses this to request TPM quotes from the agent according to a configured interval.
A high-level overview of these protocols is given in the diagram below:
In the current pull model, an end user needs to use the tenant to enrol an agent with the verifier. As part of this process, the tenant contacts the agent to obtain an identity quote which cryptographically links the agent's NKpub to the AKpub.
Of course it is not possible to simply reverse the directionality of the tenant–agent interaction, as the tenant is not a long-running process which exposes a REST interface. As a result, this responsibility needs to be fulfilled by another component.
It seems most natural to move this functionality out of the enrolment process and into the registration protocol. Verification of the NKpub would be performed by the registrar, thereby tying all identities of the agent used in subsequent protocol runs together at time of registration.
The earlier draft proposal suggested replacing the identity quote with a signature produced by TPM2_Certify (TPM 2.0 Part 3, §18.2). We agree that this is a better approach than the current method of extending PCR 16 with the NKpub and generating a quote.
To implement this, one would simply need to augment the existing registration messages with additional fields for a nonce and signature (see diagrams below). Since no existing fields would be affected, this change would be backwards compatible with the existing pull model.
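As a concrete illustration, the extended request body might look as follows. This is a sketch: the existing field names are based on the current registrar REST API, while the two new field names (`certify_nonce` and `certify_signature`) are illustrative, not a finalised schema.

```python
# Sketch of the registration request body before and after the change.
# Existing fields (based on the current registrar API) are untouched, so
# the extension remains backwards compatible with the pull model.
current_request = {
    "ekcert": "<base64 EK certificate>",
    "aik_tpm": "<base64 TPM2B_PUBLIC of the AK>",
    "mtls_cert": "<PEM-encoded NKcert>",
}

# The push model adds a nonce and a TPM2_Certify signature binding the
# NK to the AK (illustrative field names).
future_request = {
    **current_request,
    "certify_nonce": "<hex qualifying data passed to TPM2_Certify>",
    "certify_signature": "<base64 signature over the NK public area>",
}
```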
*(Sequence diagrams: current registration protocol vs. future registration protocol.)*
The tenant also currently retrieves the EKcert from the registrar and verifies it against a trust store containing TPM manufacturer certificates. If no EKcert is available (for example, if the agent is running on a VM deployed in the cloud), then the tenant allows the user to specify a script to use to verify the EK using custom logic.
The original vision for the registrar from the ACSAC paper was to perform the task of an Attestation CA per section 9.5.3.1 (3) of the TPM 2.0 architecture specification (formerly called a Privacy CA). This means that other components in the system would trust the registrar to verify the authenticity of the AK all the way up the chain to the TPM manufacturer's CA certificate.
We suggest returning to this original design such that “the registrar [...] checks the validity of the TPM EK with the TPM manufacturer” (quoting from the paper) which is also consistent with the previous draft proposal. The registrar already checks part of the chain (from AK to EK) so it would seem natural to extend this the rest of the way.
This would require no changes to the protocols; EK verification logic simply needs to be added to the registration request handler.
There is the slight challenge of handling the case where no EKcert is provided by the agent. We would recommend against adopting the script-based approach used by the tenant in the current pull model, and propose instead that the user is given the option of configuring a webhook which the registrar can query for a decision on whether a given EK should be trusted. This has a number of benefits:
- Does not require the registrar to invoke a shell command, avoiding the associated performance impacts.
- More consistent with service-oriented architectures.
- Allows users to change the decision logic without making changes to the registrar.
- Keeps the attack surface of the registrar as small as possible.
- By confining the decision logic to a separate node, it can be verified by Keylime if the user manually checks that node's EK and marks it as trusted in the registrar's database.
This proposed webhook functionality can be added to the existing registration protocol while remaining backwards compatible with previous versions, as illustrated in the sequence diagram below:
Note that the outgoing request to the webhook URI is performed in a non-blocking way, so the registrar can reply to an agent's registration request without waiting for a response from the outside web service. If no well-formed response is received from the web service, the registrar should retry the request using an exponential backoff, similar to what the verifier currently does when a request for attestation data from an agent fails.
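A minimal sketch of the webhook query and its retry schedule is shown below. The request and response shapes (a JSON body with a `trusted` field), the backoff parameters and the function names are all assumptions of this sketch, not part of any existing Keylime API.

```python
import json
import urllib.request

def backoff_delays(base=2.0, cap=60.0, attempts=6):
    """Capped exponential backoff schedule (in seconds) for retries."""
    return [min(base ** i, cap) for i in range(attempts)]

def query_ek_webhook(uri, agent_uuid, ek_pem):
    """Ask the external web service whether the given EK should be trusted.

    Returns True or False on a well-formed response, or None to signal
    that the caller should retry according to backoff_delays().
    """
    body = json.dumps({"agent_uuid": agent_uuid, "ek": ek_pem}).encode()
    request = urllib.request.Request(
        uri, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=5) as response:
            return bool(json.load(response)["trusted"])  # assumed schema
    except Exception:
        return None  # no well-formed response; retry later
```

In the real registrar, the query would be issued from the event loop (or a worker) so registration replies are never blocked on the external service.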
As part of the current enrolment process, the user specifies a payload which is delivered to the agent and placed in a directory to be consumed by other software. The reason for this is to support the provisioning of identities to workloads running on the node (e.g., TLS certificates or long-lived shared secrets). The payload may optionally contain a script file, which is executed by the agent.
Considering the current landscape of the identity and access management space, a more modern approach to solving this problem would likely be to have Keylime report verification results to a SPIRE attestor plugin which could then handle provisioning of workload identities (see enhancement proposal 100). This offloads issues related to revocation and suitability for cloud-native workloads.
The arbitrary nature of the payloads mechanism also raises concerns as to the attack surface of the agent and the whole Keylime system. Not only can Keylime server components query a node to report on its state but they also have the power to modify a node's state and execute arbitrary code. Enterprise users would consider this unacceptable.
As a result, we recommend that the payload feature not be implemented in the push model. This gives users the choice to opt into a more secure design which considers a stronger threat model, without taking features away from existing users. Users who do require identity provisioning alongside push support have the option of using SPIFFE/SPIRE.
Assuming the above recommendations are implemented, most of the tenant's role in enrolment has effectively been eliminated from the push model. Beyond those benefits already discussed, this is worth pursuing as the Keylime project starts to consider future deployment scenarios which do not involve the tenant.
Since users will be able to rely on the registrar to perform verification of a node's entire identity chain from the NK and AK all the way to a trusted TPM manufacturer CA certificate, they can interact with Keylime exclusively through its REST APIs without having to implement these checks themselves. This makes it easy, as an example, to combine horizontal autoscaling with a serverless function (where invoking the tenant would be impractical) to automatically start verification of newly provisioned VMs.
Outside of the binding of a node's various cryptographic identities, the only enrolment functions which the tenant performs are:
- It retrieves certain information about the agent (such as its AKpub) from the registrar and provides that to the verifier.
- It provides the verifier with the verification policies it should use to verify the attestations received from the agent.
Continuing in the vein of making routine tasks automatic where possible, it would make sense to have the agent provide information about itself to the verifier on first run, just as it does to register itself at the registrar. The verifier would then begin receiving attestations from the agent right away but indicate in its responses that no policy is configured. Subsequent attestations would proceed according to an exponential backoff until a policy is added via the tenant (or an API request from some third-party component).
This automatic enrolment could optionally be backported to the pull model to bring the user experience of both into alignment. The difference would be that the verifier would not begin requesting attestations from the agent until a policy is configured.
Historically, the Keylime protocols were envisioned to work over unencrypted HTTP before a number of security issues were identified with this approach. Luckily, most communication is now protected by TLS. However, agent–registrar communication still happens over HTTP. There is no reason for this, as far as we can tell: the registrar has a TLS certificate preloaded as trusted by the agent. If we consider the registrar to be performing the function of a TPM Attestation CA (i.e., Privacy CA), this is a problem as anyone can intercept the traffic and associate a given AK with its EK (an eavesdropper can even determine the outcome of the challenge–response between the registrar and remote TPM), defeating the privacy-preserving properties of the AK.
The lack of TLS for registration also makes it easier for an attacker to interleave registration messages between two different runs of the protocol to cause the TPM of one node to be associated with the UUID of another, later resulting in the wrong verification policies being applied to the nodes.
Because of these threats, in the push model, TLS should be required during registration. We recommend that this is also required in future versions of the pull model protocols.
To obtain an integrity quote in the current pull architecture, the verifier issues a request to the agent, supplying the following details:
- A nonce for the TPM to include in the quote
- A mask indicating which PCRs should be included in the quote
- An offset value indicating which IMA log entries should be sent by the agent
The agent then replies with:
- The UEFI measured boot log (kept in `/sys/kernel/security/tpm0/binary_bios_measurements`)
- A list of IMA entries from the given offset
- A quote of the relevant PCRs generated and signed by the TPM using the nonce
In a push version of the protocol where the UEFI logs, IMA entries and quote are delivered to the verifier as an HTTP request issued by the agent, the agent needs a mechanism to first obtain the nonce, PCR mask and IMA offset from the verifier. We suggest simply adding a new HTTP endpoint to the verifier to make this information available to an agent correctly authenticated with the expected certificate via mTLS.
As such, the push attestation protocol would operate in this manner:
- When it is time to report the next scheduled attestation, the agent will request the attestation details from the verifier.
- If the request is well formed, the verifier will reply with a newly generated random nonce and the PCR mask and IMA offset obtained from its database. Additionally, the verifier will persist the nonce to the database.
- The agent will gather the information required by the verifier (UEFI log, IMA entries and quote) and report these in a new HTTP request along with other information relevant to the quote (such as algorithms used).
- The verifier will reply with the number of seconds the agent should wait before performing the next attestation and an indication of whether the request from the agent appeared well formed according to basic validation checks. Actual processing and verification of the measurements against policy can happen asynchronously after the response is returned.
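The verifier side of these four steps can be sketched with in-memory state as follows. The class and field names, the fixed interval and the dict-based bookkeeping are assumptions of this sketch, not a finalised API.

```python
import secrets

class PushVerifierSketch:
    """Minimal in-memory model of the verifier's role in the push protocol."""

    def __init__(self, interval_seconds=60):
        self.interval_seconds = interval_seconds
        self.pending_nonces = {}  # agent UUID -> nonce awaiting a quote

    def attestation_parameters(self, agent_uuid, pcr_mask, ima_offset):
        # Steps 1-2: hand the agent a fresh nonce plus the PCR mask and
        # IMA offset, persisting the nonce for later validation.
        nonce = secrets.token_hex(16)
        self.pending_nonces[agent_uuid] = nonce
        return {"nonce": nonce, "pcr_mask": pcr_mask, "ima_offset": ima_offset}

    def receive_attestation(self, agent_uuid, evidence):
        # Steps 3-4: perform basic well-formedness checks only; policy
        # evaluation happens asynchronously after this reply is sent.
        expected = self.pending_nonces.pop(agent_uuid, None)
        well_formed = (
            expected is not None
            and evidence.get("nonce") == expected
            and "quote" in evidence
        )
        return {
            "well_formed": well_formed,
            "next_attestation_in": self.interval_seconds,
        }
```

Note that the nonce is consumed on use, so replaying the same evidence fails the well-formedness check; this is what protects against stale quotes.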
This protocol is contrasted against the current pull protocol in the sequence diagrams which follow:
*(Sequence diagrams: pull attestation protocol vs. push attestation protocol.)*
One drawback of this approach is that the number of messages a verifier needs to process is doubled. However, this is unlikely to significantly impact performance as the most intensive operations performed by the verifier remain those related to verification of the received quotes. Any such impact should be offset by the increased opportunity for horizontal scaling presented by the push model (as it makes it easy to load balance multiple verifiers). Further optimisations of the protocol can be explored once work on the simple version presented above has been completed.
Currently, when an AK is successfully bound to the EK of the node by way of TPM2_ActivateCredential, the registrar sets the node's `active` field in the database to `true`. Prior to this event, performing an HTTP GET request for information about the node results in a 404 error, even if the node exists in the database.
To support the push model, the registrar will perform verification of all the keys associated with an agent. It thus becomes necessary to store and report more granular information about the trust status of agents.
Our suggestions for representing this information are as follows:
- Add an `ek_trust_status` field of type `Integer` containing one of these values: NOT_TRUSTED (0), TRUSTED_BY_CERT (1) or TRUSTED_BY_WEBHOOK (2)
- Add a `last_ek_trust_decision` field of type `Integer` containing the timestamp at which the trust decision was made
- Add an `mtls_cert_activated` field of type `Integer` treated as a boolean
- Rename the `active` field to `ak_activated` for consistency (a backwards-compatible change, as `active` is not exposed by the REST API)
Since the configuration of how the registrar trusts EKs can change over time (if the user adds or removes certificates from the registrar's trust store, or if the logic of the webhook changes), it is worth storing some extra information about how and when the decision was made.
These fields would be available through the REST API. It would be best if the GET endpoint returns the JSON representation of an agent regardless of whether all checks have passed or not, to give consumers of the API full access to the status of an agent and its keys.
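The proposed fields could be represented along these lines. This is a sketch with illustrative values; the real definitions would be SQLAlchemy columns in the registrar's data model.

```python
from enum import IntEnum

class EKTrustStatus(IntEnum):
    """Proposed values for the ek_trust_status field."""
    NOT_TRUSTED = 0
    TRUSTED_BY_CERT = 1
    TRUSTED_BY_WEBHOOK = 2

# What an agent record might look like through the REST API once the new
# fields are in place (values are illustrative; the UUID is Keylime's
# well-known default agent UUID).
agent_record = {
    "uuid": "d432fbb3-d2f1-4a97-9ef7-75bd81c00000",
    "ek_trust_status": int(EKTrustStatus.TRUSTED_BY_WEBHOOK),
    "last_ek_trust_decision": 1719878400,  # Unix timestamp of the decision
    "mtls_cert_activated": 1,              # Integer treated as a boolean
    "ak_activated": 1,                     # renamed from "active"
}
```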
Keylime defines a number of states (in keylime/common/states.py) for driving the verifier's event loop and reporting the status of an agent. These states, of course, have not been considered in the context of the push model. However, certain states can be mapped to their push model equivalents.
When a verifier is configured to operate in push mode, we suggest that the operational state integer stored in the database take on these meanings instead:
Integer | Name (pull) | Name (push) | Meaning (push) |
---|---|---|---|
1 | START | ENROLLED | Agent has provided AKpub and other details but has not sent its first attestation |
2 | SAVED | NO_POLICY | There was no policy configured for the agent when it last sent an attestation |
3 | GET_QUOTE | AWAITING_QUOTES | The last attestation received from the agent verified successfully |
7 | FAILED | MALFORMED_QUOTE | The last attestation received from the agent was invalid |
9 | INVALID_QUOTE | POLICY_VIOLATION | The last attestation received from the agent did not verify against policy |
The other values (4, 5, 6, 8 and 10) relate to payloads or arise from the verifier-driven nature of the pull protocols and will not be used.
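For implementation purposes, the remapping could be expressed as a simple lookup. In this sketch, the integers and pull-mode names come from keylime/common/states.py and the push-mode names from the table above; the helper function is illustrative.

```python
# Proposed push-mode reinterpretation of the stored operational state
# integers (pull-mode names shown in the comments).
PUSH_STATE_NAMES = {
    1: "ENROLLED",          # pull: START
    2: "NO_POLICY",         # pull: SAVED
    3: "AWAITING_QUOTES",   # pull: GET_QUOTE
    7: "MALFORMED_QUOTE",   # pull: FAILED
    9: "POLICY_VIOLATION",  # pull: INVALID_QUOTE
}

# States tied to payloads or the verifier-driven pull loop are unused.
UNUSED_IN_PUSH = frozenset({4, 5, 6, 8, 10})

def push_state_name(state):
    """Map a stored state integer to its push-mode name, if any."""
    if state in UNUSED_IN_PUSH:
        raise ValueError(f"state {state} is not used in push mode")
    return PUSH_STATE_NAMES[state]
```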
The following changes should be made to the agent's configuration options (usually set in `/etc/keylime/agent.conf`):
- Add an `operation_mode` option which accepts either `push` or `pull`.
- Add `verifier_ip` and `verifier_port` options to specify how the agent should contact the verifier when operating in push mode.
- Disallow setting `enable_agent_mtls` to false when operating in push mode.
- Update comments to indicate which values won't have an effect when push mode is turned on (e.g., those related to payloads).
It is suggested that `operation_mode` should not be configurable via environment variable and that the agent check the ownership of the config file on startup, outputting a warning if it is writable by any user other than root.
These changes should be made to the verifier's configuration options:
- Add an `operation_mode` option which accepts either `push` or `pull`.
Finally, these changes should be made to the registrar's configuration options:
- Add an `ek_trust_store` option to set the directory where the registrar should look for trusted certificates when verifying EKs.
- Add an `ek_webhook` option to set the URI the registrar should use to determine whether it should trust an EK when no EKcert is available.
Many thanks to Thore Sommer (@THS-on) for sharing his ideas for implementing push model support and to Marcus Heese (@mheese) for many helpful discussions around threat model and operational concerns. We also greatly appreciate the feedback and guidance received from the maintainers and community members in the June and July community meetings.
Once the agent is in an exponential backoff, it may take a while until the attestation actually starts. There should probably be a limit of a few seconds on how far to back off until the first attestation with the policy starts (maybe 10 s?).
If nothing happens before a policy is set, we could instead require that a policy be set first using the tenant tool. This ordering of requiring a policy first could be used to open access to the verifier for a particular agent, while preventing it from connecting to the verifier first and keeping it busy for no reason...
Should the UUID of the agent be written into the client-side certificate to make that certificate unique? This would at least make it a bit more difficult to register an agent that simply claims a UUID from a configuration file, and would prevent reuse of a possibly stolen certificate. On the other hand, it requires issuing a certificate per agent, which may be an operational pain.