Headline
CVE-2018-16878: High: cumulative patchset to fix CVE-2019-3885, CVE-2018-16877, CVE-2018-16878 + additional unmasked null pointer deref by jnpkrn · Pull Request #1749 · ClusterLabs/pacemaker
A flaw was found in pacemaker up to and including version 2.0.1. An insufficient verification inflicted preference of uncontrolled processes can lead to DoS
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation 8 Commits 8 Checks 0 Files changed
Conversation
Copy link
Contributor
** jnpkrn commented Apr 17, 2019**
This could possibly lead to unsolicited information disclosure by the means of standard output of the immediately preceding agent/resource execution leaking into the log stream under some circumstances. It was hence assigned CVE-2019-3885.
The provoked pathological state of pacemaker-execd daemon progresses towards crashing it for hitting segmentation fault.
[0/4: make crm_pid_active more precise as to when detections fail]
It would be bad if the function claimed the process is not active when the only obstacle in the detection process was that none of the detection methods worked for a plain lack of permissions to apply them. Also, do some other minor cleanup of the function and add its documentation. As an additional measure, log spamming is kept at minimum for repeated queries about the same PID.
[1/4: new helpers to allow IPC client side to authenticate the server]
The title problem here could possibly lead to local privilege escalation up to the root’s level (and implicitly unguarded by some additional protection layers like SELinux unless the defaults constrained further).
Main problem is that the authenticity assumptions were built on two, seemingly mutually supporting legs leading to two CVEs assigned:
* procfs (mere process existence and right path to binary check) used to verify (this part was assigned CVE-2018-16878), and
* one-way only client-server authentication, putting the client here at the mercy of the server not necessarily cross-verified per the above point if at all (this part was assigned CVE-2018-16877)
whereas these two were in fact orthogonal, tearing security assumptions about the “passive influencers” in the pacemaker’s daemon resilience-friendly constellation (orchestrated by the main of them, pacemakerd) apart. Moreover, procfs-based approach is discouraged for other reasons.
The layout of the basic fix is as follows: * 1/4: new helpers to allow IPC client side to authenticate the server (this commit, along with unifying subsequent solution for both CVEs) * 2/4: pacemakerd to trust pre-existing processes via new checks instead (along with unifying solution for both CVEs) * 3/4: other daemons to authenticate IPC servers of fellow processes (along with addressing CVE-2018-16877 alone, for parts of pacemaker not covered earlier) * 4/4: CPG users to be careful about now-more-probable rival processes (this is merely to mitigate corner case fallout from the new approaches taken to face CVE-2018-16878 in particular; courtesy of Yan Gao of SUSE for the reporting this)
With "basic", it is meant that it constitutes a self-contained best effort solution with some compromises that can only be overcome with the assistance of IPC library, libqb, as is also elaborated in messages of remaining “fix” commits. Beside that, also conventional encapsulation of server-by-client authentication would be useful, but lack thereof is not an obstacle (more so should there by any security related neglectations on the IPC client side and its connection initiating arrangement within libqb that has a potential to strike as early as when the authenticity of the server side is yet to be examined).
One extra kludge that’s introduced for FreeBSD lacking Unix socket to remote peer PID mapping is masquerading such an unspecified PID with value of 1, since that shall always be around as “init” task and, deferring to proof by contradiction, cannot be pacemakerd-spawned child either even if PID 1 was pacemakerd (and running such a child alone is rather nonsensical). The code making decisions based on that value must acknowledge this craze and refrain from killing/signalling the underlying process on this platform (but shall in general follow the same elsewhere, keep in mind systemd socket-based activation for instance, which would end up in such a situation easily!).
[2/4: pacemakerd to trust pre-existing processes via new checks instead]
In pacemakerd in the context of entrusting pre-existing processes, we now resort to procfs-based solution only in boundary, fouled cases, and primarily examine the credentials of the processes already occupying known IPC end-points before adopting them.
The commit applies the new helpers from 1/1 so as to close the both related sensitive problems, CVE-2018-16877 and CVE-2018-16878, in a unified manner, this time limited to the main daemon of pacemaker (pacemakerd).
To be noted that it is clearly not 100% for this purpose for still allowing for TOCTTOU, but that’s what commit (3/3) is meant to solve for the most part, plus there may be optimizations solving this concern as a side effect, but that requires an active assistance on the libqb side (ClusterLabs/libqb#325) since any improvement on pacemaker side in isolation would be very cumbersome if generally possible at all, but either way means a new, soft compatibility encumberment.
As a follow-up to what was put in preceding 1/3 commit, PID of 1 tracked as child’s identification on FreeBSD (or when socket-based activation is used with systemd) is treated specially, incl. special precaution with child’s PID discovered as 1 elsewhere.
v2: courtesy of Yan Gao of SUSE for early discovery and report for what’s primarily solved with 4/4 commit, in extension, child daemons in the initialization phase coinciding with IPC-feasibility based process scan in pacemakerd in a way that those are missed (although they are to come up fully just moments later only to interfere with naturally spawned ones) are now considered so that if any native children later fail for said clash, the pre-existing counterpart may get adopted instead of ending up with repeated spawn-bury loop ad nauseam without real progress (note that PCMK_fail_fast=true could possibly help, but that’s rather a big hammer not suitable for all the use cases, not the ones we try to deal with gracefully here)
[3/4: other daemons to authenticate IPC servers of fellow processes]
Now that CVE-2018-16877 issue alone is still only partially covered based on the preceding commits in the set, put the server-by-client authentication (enabled and 1/3 and partially sported in 2/3) into practice widely amongst the communicating pacemaker child daemons and towards CPG API provided by 3rd party but principally using the same underlying IPC mechanism facilitated by libqb, and consequently close the remaining "big gap".
As a small justification to introducing yet another “return value” int variable, type-correctness is restored for those that shall be cs_error_t to begin with.
[4/4: CPG users to be careful about now-more-probable rival processes]
In essence, this comes down to pacemaker confusing at-node CPG members with effectively the only plausible to co-exist at particular node, which doesn’t hold and asks for a wider reconciliation of this reality-check.
However, in practical terms, since there are two factors lowering the priority of doing so:
1/ possibly the only non-self-inflicted scenario is either that some of the cluster stack processes fail – this the problem that shall rather be deferred to arranged node disarming/fencing to stay on the safe side with 100% certainty, at the cost of possibly long-lasting failover process at other nodes (for other possibility, someone running some of these by accident so they effectively become rival processes, it’s like getting hands cut when playing with a lawnmower in an unintended way)
2/ for state tracking of the peer nodes, it may possibly cause troubles in case the process observed as left wasn’t the last for the particular node, even if presumably just temporary, since the situation may eventually resolve with imposed serialization of the rival processes via API end-point singleton restriction (this is also the most likely cause of why such non-final leave gets observed in the first place), except in one case – the legitimate API end-point carrier won’t possibly acknowledged as returned by its peers, at least not immediately, unless it tries to join anew, which verges on undefined behaviour (at least per corosync documentation)
we make do just with a light code change so as to
* limit 1/ some more with in-daemon self-check for pre-existing end-point existence (this is to complement the checks already made in the parent daemon prior to spawning new instances, only some moments later; note that we don’t have any lock file etc. mechanisms to prevent parallel runs of the same daemons, and people could run these on their own deliberation), and to
* guard against the interferences from the rivals at the same node per 2/ with ignoring their non-final leave messages altogether.
Note that CPG at this point is already expected to be authenticity-safe.
Regarding now-more-probable part, we actually traded the inherently racy procfs scanning for something (exactly that singleton mentioned above) rather firm (and unfakeable), but we admittedly got lost track of processes that are after CPG membership (that is, another form of a shared state) prior to (or in non-deterministic order allowing for the same) carring about publishing the end-point.
Big thanks is owed to Yan Gao of SUSE, for early discovery and reporting this discrepancy arising from the earlier commits in the set.
This is now more likely triggerable once the problems related to CVE-2018-16878 are avoided.
Copy link
Contributor Author
****jnpkrn** commented Apr 17, 2019**
Oops, Travis is failing in part because now we require resolution of
hacluster/haclient, but arguably, the environment not having those
is rather hostile for the cluster stack (basically an error in the
deployment), so I suggest to deal with this afterwards.
For illustration:
(lrmd_ipc_connect) info: Connecting to executor
(crm_user_lookup) info: User hacluster lookup: Invalid argument
(crm_client_new) warning: Could not find user and group IDs for user hacluster
Copy link
Contributor Author
****jnpkrn** commented Apr 17, 2019**
Last patch is a tentative attempt to calm Travis down.
Copy link
Contributor Author
****jnpkrn** commented Apr 17, 2019**
No longer tentative, it apparently helped.
Copy link
Contributor Author
****jnpkrn** commented Apr 18, 2019**
Just to rectify Med: controld: fix possible NULL pointer dereference
commit a bit after-the-fact (there was no time for detailed analysis
yesterday given the disclosure rush), it was related to the situation
that for some reason neither of these daemons runs:
- corosync
- pacemakerd
- pacemaker-based
(or this one runs but is not authentic [or name service for user name
to UID translation et al. has just fallen apart], which amounts to the
same now, hence there’s still a bit of truth about "more likely
triggerable", though CVE-2018-1687 should have been mentioned instead,
mea culpa, again, not enough time for fact checking)
and pacemaker-controld just comes to life
(very contrieved scenario:
- corosync up, pacemakerd has just spawned pacemaker-controld
- at this point, corosync crashes
- the above somehow (possibly with the involvement of the supervisor
above pacemakerd since the prerequisite, corosync, ceased to run,
or it could be CPG API both of these use that could trigger something
unexpected on corosync’s death) causes both pacemakerd and
pacemaker-based disappear, while pacemaker-controld continues to
run (wasn’t connected to CPG yet) - pacemaker-controld would attempt to connect to corosync, and after
a while falls on its nose for the lack of guard the commit just added
[though it now seems – also? – as shenanigans in the FSM transitions
once 30 attempts to connect to CIB fall short – it’s all very messy
and undocumented in actual words what the controlled failure
modes are with intended behaviours and recovery, sadly sadly sadly])
More realistic scenario is that the admin decided for some reason to
run /usr/lib/exec/pacemaker-controld in isolation without corosync
(and pacemaker-based) running (perhaps to test pacemakerd’s ability
to adopt pre-existing processes but simply getting the timing wrong).
All in all, that commit is also a very desirable, even if it rather
fixes a long-standing flaw, good that my colleague Martin Juříček
conducting some tests discovered that (thanks).
Copy link
Contributor Author
****jnpkrn** commented May 28, 2019**
Related news
Gentoo Linux Security Advisory 202309-9 - Multiple vulnerabilities have been found in Pacemaker, the worst of which could result in root privilege escalation. Versions greater than or equal to 2.0.5_rc2 are affected.