Headline
CVE-2022-3500: Fix proper exception handling and impedance match in `tornado_requests` by galmasi · Pull Request #1128 · keylime/keylime
A vulnerability was found in keylime. In some circumstances, improperly handled exceptions allow a rogue agent to trigger errors on the verifier that stop attestation attempts for that host, leaving it in an attested state that is no longer being verified.
Summary
This patch fixes an impedance mismatch between the verifier and the Python tornado package. TL;DR: exception handling is incorrectly implemented in the glue layer (tornado_requests.py), resulting in both uncaught exceptions and potential None return values that the verifier cannot handle.
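The failure mode described above can be illustrated with a minimal sketch. This is not the actual patch: `Response`, `request`, and `demo` are hypothetical names, and plain `asyncio` stands in for tornado. The status code 599 is borrowed from the value tornado uses when no HTTP response was received. The point is only that every code path hands the caller a well-formed response object instead of `None` or an unexpected exception.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class Response:
    """Hypothetical response type; not keylime's actual class."""
    status_code: int
    body: str = ""


async def request(
    send: Callable[[], Awaitable[Optional[Response]]], timeout: float = 1.0
) -> Response:
    """Call `send` and always return a Response: never None, never an exception."""
    try:
        result = await asyncio.wait_for(send(), timeout)
    except asyncio.TimeoutError:
        return Response(599, "timeout")
    except OSError as exc:  # covers ConnectionError and other socket-level failures
        return Response(599, f"connection error: {exc}")
    except Exception as exc:  # last resort for anything the transport did not anticipate
        return Response(500, f"unexpected error: {exc!r}")
    if result is None:  # defensive: the transport returned nothing at all
        return Response(500, "empty response")
    return result


async def demo() -> list:
    """Exercise the wrapper with a good, a failing, and an empty transport."""
    async def ok() -> Response:
        return Response(200, "quote")

    async def broken() -> Response:
        raise ConnectionResetError("simulated driver failure")

    async def empty() -> None:
        return None

    return [(await request(f)).status_code for f in (ok, broken, empty)]
```

With a wrapper of this shape, the caller only needs to branch on `status_code`, which is the kind of contract a retry-with-backoff loop can rely on.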
Technical details
- The uncaught exception addressed by this fix occurs in cloudlab_verifier_tornado in the function invoke_get_quote.
- The immediate cause for the failure is an operating system level failure that occurs in rare circumstances. We could only make it happen in virtual deployments, at large enough scale (1k+ agents) and with sufficient traffic over a virtual e1000 network adapter. We were unable to recreate this exception in controlled circumstances; as is, we are seeing approximately 3 network device driver crashes per verifier instance per day.
- With a relatively low probability (< 5% based on rough back-of-the-envelope calculations), the network device driver crash causes an uncaught exception in invoke_get_quote. We call this an Old Attestation Failure (OAF) event, because the net effect is that the verifier thread quits, leading to a situation where a particular agent no longer receives get_quote requests. The verifier's state machine remains in the verified state, and the VerifierMain database is no longer updated for this agent.
- The remaining roughly 95% of the time, the outcome is a caught exception in invoke_get_quote. This results in a standard communication failure exception followed by a keylime retry with exponential backoff, and the system recovers normally. The only record of such an event is the retry notification in the log.
- We believe that the uncaught communication exception is tied to the network device driver failing mid-communication. We believe neither tornado nor tornado_requests is handling this situation appropriately.
- Given the deployment numbers above, we saw approximately one OAF event per verifier every 6 days. With 10 verifiers handling 1500 nodes, this resulted in about one OAF event every half day.
- OAF events do not show up in any keylime logs. Apparently exceptions thrown by coroutines do not print stack traces in the log unless async debugging is on.
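The silent failure described in the last bullet can be reproduced in miniature with plain `asyncio`. This is a sketch under stated assumptions, not keylime code: `poll_agent`, `report_failure`, and the logger name are hypothetical, and the exact mechanism inside the verifier may differ. In asyncio, an exception raised inside a task is stored on the task object and the code that created the task is not interrupted; a done-callback is one way to surface it explicitly.

```python
import asyncio
import logging

logger = logging.getLogger("verifier_sketch")  # hypothetical logger name
failures = []  # collected here so the effect is observable without reading logs


async def poll_agent(agent_id: str) -> None:
    """Stand-in for a per-agent polling coroutine that hits an unexpected error."""
    await asyncio.sleep(0)
    raise RuntimeError(f"transport failed for {agent_id}")


def report_failure(task: asyncio.Task) -> None:
    """Done-callback that records an exception nobody would otherwise look at."""
    if task.cancelled():
        return
    exc = task.exception()  # retrieving the exception also marks it as handled
    if exc is not None:
        failures.append(f"{task.get_name()}: {exc}")
        logger.error("task %s died: %r", task.get_name(), exc)


async def main() -> bool:
    task = asyncio.create_task(poll_agent("agent-1"), name="poll-agent-1")
    task.add_done_callback(report_failure)
    # The task fails while we sleep, but nothing is raised in this coroutine.
    await asyncio.sleep(0.01)
    return task.done()
```

Without the callback, the only trace in this sketch would be asyncio's "Task exception was never retrieved" message, which is emitted when the finished task is garbage collected rather than at the moment of failure.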
We became aware of OAF events only after the patch by @maugustosilva (#1091 #1093) to dump attestation timers into the VerifierMain database.
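The rates quoted in the technical details can be cross-checked with a few lines of arithmetic. The inputs below are the approximate figures from the text; the variable names are illustrative only, and the results inherit the roughness of those estimates.

```python
# Approximate figures taken from the description above.
crashes_per_verifier_per_day = 3   # network device driver crashes
days_between_oaf_per_verifier = 6  # one OAF event per verifier every 6 days
verifiers = 10

# Fraction of driver crashes that end in an uncaught exception (an OAF event).
oaf_per_verifier_per_day = 1 / days_between_oaf_per_verifier
p_uncaught = oaf_per_verifier_per_day / crashes_per_verifier_per_day

# Expected spacing of OAF events across the whole deployment.
fleet_oaf_per_day = verifiers * oaf_per_verifier_per_day
days_between_fleet_oaf = 1 / fleet_oaf_per_day

print(f"{p_uncaught:.1%}")              # 5.6%
print(f"{days_between_fleet_oaf:.1f}")  # 0.6
```

The implied probability of about 5.6% is in the same ballpark as the rough 5% estimate, and 0.6 days between events matches "about one OAF event every half day".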
cc: @mdrocco @maugustosilva
Related news
An update for keylime is now available for Red Hat Enterprise Linux 9. Red Hat Product Security has rated this update as having a security impact of Moderate. A Common Vulnerability Scoring System (CVSS) base score, which gives a detailed severity rating, is available for each vulnerability from the CVE link(s) in the References section.
This content is licensed under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). If you distribute this content, or a modified version of it, you must provide attribution to Red Hat Inc. and provide a link to the original.
Related CVEs:
* CVE-2022-3500: keylime: exception handling and impedance match in tornado_requests
### Impact
This vulnerability creates a false sense of security for keylime users, i.e. a user could query keylime and conclude that a particular node/agent is correctly attested while attestations are not in fact taking place.
**Short explanation**: the keylime verifier creates periodic reports on the state of each attested agent. It runs a set of Python asynchronous processes to challenge attested nodes and create reports on the outcome. The vulnerability consists of these asynchronous processes failing silently, i.e. quitting without leaving behind a database entry, raising an error, or producing even a mention of an error in a log. The silent failure can be triggered by a small set of transient network failure conditions; recoverable device driver crashes are one such condition we saw in the wild.
### Patches
The problem is fixed in keylime starting with tag 6.5.1.
### Workarounds
This [patch](https://github.com/keylime/keylime/pull/112...