@aarongable
Last active August 7, 2023 22:30
Draft proposal for overhauling the CCADB Incident Reporting Requirements (https://www.ccadb.org/cas/incident-report)

Incidents

Incidents happen. Things do not always go as planned, and that can be okay. However, when incidents occur, the underlying issue (i.e., root cause) should be identified and remediated to prevent the incident from recurring. Formally documenting the incident in a report encourages an understanding of all contributing root causes, and it presents the opportunity to clearly communicate a remediation plan that reduces the probability of recurrence.

Depending on the root programs in which a CA Owner participates, it may be required to:

These reports provide lessons learned and transparency about the steps the CA takes to address the immediate issue and prevent future issues. If the underlying problem goes unfixed, other issues that share the same root cause are likely to surface later. Additionally, incident reports help the Web PKI ecosystem as a whole because they promote continuous improvement and information sharing, and they highlight opportunities to define and adopt improved practices, policies, and controls.

Incident Reports

The purpose of incident reporting is to help us work together to build a more secure web. Therefore, the incident report should share lessons learned that could be helpful to all CA Owners in building better systems. The incident report should explain how systems or processes failed, how the mis-issuance or incident was made possible, and why the problem was not detected earlier. In addition to the timeline of responding to and resolving the incident, the incident report should explain how the CA Owner’s systems or processes will be made more robust, and how other CAs may learn from the incident.

Each incident should result in an incident report written as soon as the problem is fully diagnosed and (temporary or permanent) measures have been put in place to ensure it will not reoccur. If the permanent fix will take significant time to implement, you should not wait until this is done before issuing the report. Incident reports should be published as soon as possible, and certainly within two weeks of the initial issue being reported.

There should be a single incident report for each distinct matter, and CA Owners should submit an additional, separate incident report when:

  • Policy requires the revocation of one or more certificates by a certain deadline, such as those in BR section 4.9, but that deadline will not be or has not been met by the CA Owner.
  • In the process of researching one incident, another incident with a distinct root cause and/or remediation is discovered.
  • After an incident is marked resolved in Bugzilla, the incident reoccurs.

The incident report may well repeat things previously said in discussions or Bugzilla comments. This is entirely expected. The report should be a summary of previous findings. The existence of data in discussions or Bugzilla comments does not excuse a CA Owner from the task of compiling a proper incident report.

Open incident reports should be updated:

  • At least weekly unless a Root Store Operator has agreed to a different schedule by setting a "Next Update" date in the "Whiteboard" field of the bug; and
  • When Action Items are changed, completed, or delayed.

Creating an Incident Report

Create a new Bugzilla issue by filling out the Summary and Description fields of this form. If a Bugzilla issue has already been created for this incident (e.g. by an external security researcher) you may skip this step.

An initial report should be filed within 48 hours of being made aware of the incident. If a full incident report is not yet ready, you should provide a preliminary report containing an executive summary of the incident and a date by which the full report will be posted. The full incident report must be posted within two weeks of the incident.
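The two deadlines above can be computed mechanically. The following is a minimal sketch, not part of the proposal: the function name `report_deadlines` and its parameters are hypothetical, and it assumes the 48-hour clock starts when the CA becomes aware of the incident and the two-week clock starts at the incident itself, as this paragraph is worded.

```python
from datetime import datetime, timedelta, timezone

# Windows taken from the text above; the choice of starting points is an assumption.
INITIAL_REPORT_WINDOW = timedelta(hours=48)  # measured from awareness
FULL_REPORT_WINDOW = timedelta(weeks=2)      # measured from the incident


def report_deadlines(aware_at: datetime, incident_at: datetime) -> dict:
    """Return the UTC deadlines for the initial and full incident reports.

    Both arguments must be timezone-aware so the result is unambiguous.
    """
    if aware_at.tzinfo is None or incident_at.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    return {
        "initial_report": (aware_at + INITIAL_REPORT_WINDOW).astimezone(timezone.utc),
        "full_report": (incident_at + FULL_REPORT_WINDOW).astimezone(timezone.utc),
    }
```

If a root program measures the two-week period from a different event (for example, from the initial report), only the second argument needs to change.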

To create the full incident report, copy the markdown template below and fill out each section according to the following instructions and requirements:

  1. The Summary section should contain a short description of the nature of the issue. This provides just enough context for new readers to understand the details in the rest of the report.
  2. The Impact section should contain a short description of the size and nature of the incident. For example: how many certificates, OCSP responses, or CRLs were affected; whether the affected objects share features (such as issuance time, signature algorithm, or validation type); and whether the CA had to cease issuance during the incident.
  3. The Timeline section must include a detailed timeline of all actions leading up to and taken during the incident. All times should be in UTC and have at least minute-level granularity. In addition, it must indicate the following events:
    • Any policy, process, or software changes that contributed to the Root Cause
    • The time at which the incident began
    • The time at which the CA became aware of the incident
    • The time at which the incident ended
    • The times at which issuance ceased and resumed, if relevant
  4. The Root Cause Analysis section must contain a detailed analysis of the conditions which combined to give rise to the issue. It is unusual for an incident to have a single root cause; often there must be a confluence of several issues such as a software bug, insufficient checks, and a malformed request. Make sure that all contributing causes are identified and described, including noting when they first arose and how they avoided detection until now.
  5. The Lessons Learned section should contain the following subsections:
    • What went well: a list of things that caused the incident to have less impact than it otherwise could have, such as early detection, rapid response, or good safety mechanisms. This section provides an opportunity for other CAs to learn from the good practices of this CA.
    • What didn't go well: a list of things that caused the incident to have more impact than it otherwise would have, such as missing checks or unclear documentation. Each item here must have at least one corresponding Action Item below, and provides opportunities for other CAs to ensure they make similar improvements if they haven't already.
    • Where we got lucky: a list of things that went well, but which cannot be relied upon, such as early detection by an external security researcher or limited impact simply due to a small number of requests. Items here should generally also have corresponding Action Items, so that the CA doesn't have to rely on luck in the future.
  6. The Action Items section must contain a list of remediation items that will be undertaken to ensure that similar incidents do not reoccur in the future. Note that it is not sufficient for these action items to simply stop this incident; they must create additional protections to prevent future incidents. Each Action Item should state:
    • A short description of the action to be taken.
    • A classification of whether the action will help Prevent future incidents, Mitigate the impact of future incidents, or Detect future incidents. CAs are encouraged to propose action items in all three categories, with an emphasis on prevention and mitigation.
    • A date by which the action item will be complete.
  7. Finally, the Appendix is for all supporting data: log files, graphs and charts, etc. In particular, in the case of incidents which directly impacted certificates, the Appendix must include a listing of the complete certificate details of all affected certificates. The recommended format is to ensure that all affected certificates are logged to CT, then to attach a text file where each line is of the form https://crt.sh/?sha256=[sha256 fingerprint of the certificate]. When the incident being reported involves an S/MIME certificate, if disclosure of personally identifiable information in the certificate may be contrary to applicable law, please provide at least the certificate serial number and SHA-256 hash of the certificate.
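The Appendix attachment described in item 7 could be generated along these lines. This is only a sketch under stated assumptions: the helper names `crtsh_line` and `appendix_listing` are hypothetical, each certificate is assumed to be available as DER-encoded bytes, and the fingerprint is taken to be the SHA-256 digest of that DER encoding, written as lowercase hex.

```python
import hashlib


def crtsh_line(der_bytes: bytes) -> str:
    """Build one attachment line from a DER-encoded certificate."""
    # Assumption: the fingerprint is the SHA-256 digest of the DER bytes.
    fingerprint = hashlib.sha256(der_bytes).hexdigest()
    return f"https://crt.sh/?sha256={fingerprint}"


def appendix_listing(certs_der: list) -> str:
    """Return the text file contents: one crt.sh URL per affected certificate."""
    return "\n".join(crtsh_line(cert) for cert in certs_der) + "\n"
```

The sketch does not check that the certificates have actually been logged to CT; that step still has to happen before the listing is attached.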

Incident Report Template

## Incident Report

### Summary



### Impact



### Timeline

All times are UTC.

YYYY-MM-DD:
- HH:MM Example

### Root Cause Analysis



### Lessons Learned

#### What went well

* 

#### What didn't go well

* 

#### Where we got lucky

* 

### Action Items

| Action Item | Kind | Due Date |
| ----------- | ---- | -------- |
| Example | Prevent | 2038-01-19 |

### Appendix

#### Details of affected certificates
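
The Timeline layout used in the template above (UTC, grouped by date, minute-level granularity) could be produced from event records with something like the following. It is an illustrative sketch only: `render_timeline` and the sample events are hypothetical, and events are assumed to be supplied as timezone-aware timestamps with a short description.

```python
from datetime import datetime, timedelta, timezone


def render_timeline(events):
    """Render (timezone-aware datetime, description) pairs in the template's layout."""
    by_day = {}
    for when, what in sorted(events, key=lambda e: e[0]):
        if when.tzinfo is None:
            raise ValueError("event timestamps must be timezone-aware")
        utc = when.astimezone(timezone.utc)  # normalise every entry to UTC
        day = utc.strftime("%Y-%m-%d")
        by_day.setdefault(day, []).append(f"- {utc.strftime('%H:%M')} {what}")
    lines = ["All times are UTC.", ""]
    for day, entries in by_day.items():  # dicts keep insertion (chronological) order
        lines.append(f"{day}:")
        lines.extend(entries)
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical sample: the first event was recorded in a UTC+2 local time.
example = render_timeline([
    (datetime(2023, 8, 1, 0, 5, tzinfo=timezone.utc), "Incident detected"),
    (datetime(2023, 8, 1, 1, 30, tzinfo=timezone(timedelta(hours=2))), "Issuance began"),
])
```

Converting to UTC before grouping matters because an event logged shortly after midnight local time may belong to the previous UTC date, as the sample shows.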

Example Incident Reports

Here are some examples of good practice, where a CA did most or all of the things recommended above:

Note that these incidents conformed to an earlier version of the incident reporting template.

Audit Incident Reports

When audits are performed, an audit statement may document qualifications or non-conformities (i.e., findings) that were identified during the audit. Audit incident reports are created as a bug in Bugzilla under the CA Program:CA Certificate Compliance component and should include at least the following topics:

  1. Issue # (where each non-conformity, qualification, and/or modified opinion is represented as a separate issue):
  2. Issue Description:
  3. Root Cause of Issue:
  4. Remediation Plan for this Issue:

The remediation plan includes the action(s) for resolving the issue, the status of each action, and the date each action will be completed. The audit incident report summary in Bugzilla should include the CA Owner name and “Findings in 20XX Audit”, where XX is the year the audit period or point-in-time ended (e.g., CA ABC: Findings in 2022 Audit).

Audit incident reports should be updated when:

  • The presented issue remediation plan(s) change,
  • Issue remediation action(s) are completed, or
  • Issue remediation action(s) are delayed.