JP Vossen via plug on 10 Jun 2024 18:14:34 -0700



Re: [PLUG] root "pkill: killing pid * failed: Operation not permitted"


On 6/7/24 04:07 PM, JP Vossen wrote:
What could cause "pkill: killing pid * failed: Operation not permitted" *when run by root*?

After patching and reboots the other day I started getting daily Anacron emails from Logrotate on most (but not all) of 50+ VMs saying:
```
/etc/cron.daily/logrotate:
pkill: killing pid NNN failed: Operation not permitted
```

The culprit is the (quite horrible, but mandatory) Crowdstrike `falcon-agent` service, running from the stock vendor RPM that has not changed since April, and we've had patching reboots since then.

The really confusing thing is that *most* of them are doing this, but *not* all, and I can't find any differences!  The 50+ VMs are a mix of (quite horrible, but mandatory) Oracle Linux 7.9 (EoL soon, thus migrating) and 8.10, but the problem doesn't follow the distro.  Also, a few of the ones that complained on Wed did not complain on Thu, so they "fixed" themselves?

When I manually run the relevant line from `/etc/logrotate.d/falcon-sensor` *as root*, it either silently works or fails with the error above, according to whether I get the Anacron emails or not.  So it's not that `/usr/bin/pkill -HUP falcon-sensor` is a problem, and it is running as root.  It just...sometimes works and sometimes doesn't.

The `falcon-sensor` process itself is running as root, as is the parent `falcond`.  Restarting via `systemctl restart falcon-sensor` doesn't help, neither does a stop then start.  Every VM has the same `falcon-sensor-7.01.0-15604.el7.x86_64` or `falcon-sensor-7.01.0-15604.el8.x86_64` and both vendor RPMs have identical *stock* `/etc/logrotate.d/falcon-sensor` and `/usr/lib/systemd/system/falcon-sensor.service` files.

`/usr/bin/pkill` is also the same on working and broken servers, and SELinux is disabled.  They all have the same kernel (current for either OEL-7 or 8) and it is *not* UEK (the Oracle Unusable Enterprise Kernel that always ends in tears).  They all have plenty of free disk space.

This is over-simplified (and skipping some related nodes), but I have 5 groups of 8 identical "agent" VMs, and for 4 groups 7 fail and 1 works.  The 5th group only had 3 bad out of 8, then that "fixed" itself.


First, thanks to all who replied!

Second, no matter how many times I read and re-read these kinds of emails before sending, I never get them as clear as I'd like, though replies help clarify where I went wrong.

To restate a bit differently: a vendor supplied logrotate script is now emitting a warning on 7 out of every 8 servers; the 8th one is still working fine.  That particular package didn't change, and I can find no differences between working and whining...

So...what could suddenly cause `pkill -HUP` "killing pid * failed: Operation not permitted" *when run by root* as part of an unchanged vendor logrotate script?


More details:

*I* am not the one trying to `pkill -HUP` the process; the *vendor-supplied* `/etc/logrotate.d/falcon-sensor` is doing that.  And until I patched and rebooted last week, all 50+ servers Just Worked Fine.

NOTE: this is -HUP ("signal hang up", AKA, wake up and re-read your config file), not kill!  The point, again, is *vendor supplied* Logrotate.

I did not choose or change the `/usr/bin/pkill -HUP falcon-sensor` line to put in their Logrotate script, Crowdstrike did, and up to last week, it worked.  That said, if I run that line manually as root I do get the same "expected" results.  That is, it works on working servers and fails on whining ones.
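
One way to narrow this down is to split the vendor one-liner into its two halves, the name match and the signal, so the failing PID is reported explicitly. This is only a minimal sketch: `hup_each` is a hypothetical helper name, not anything shipped by Crowdstrike, and it assumes `pgrep` from the same procps package, which matches names the same way `pkill` does.

```shell
# Signal every PID matching a process name one at a time and report
# which PIDs accept the signal and which refuse it.
# hup_each is a hypothetical helper; the default signal is HUP.
hup_each() {
    name=$1
    sig=${2:-HUP}
    rc=0
    for pid in $(pgrep "$name"); do
        # kill's own error text (e.g. "Operation not permitted") still
        # goes to stderr, so the reason for a failure stays visible.
        if kill -s "$sig" "$pid"; then
            echo "ok $pid"
        else
            echo "failed $pid"
            rc=1
        fi
    done
    return "$rc"
}
```

Run as root, `hup_each falcon-sensor` sends the same signal the logrotate script does, while `hup_each falcon-sensor 0` only tests whether signalling would be permitted without actually delivering anything.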

Roughly 1 out of every 8 servers *still* works fine, the other 7 are now whining.  I can't find a single difference between the ones that work and the ones that whine.  I've checked:
	* Same RPM for Crowdstrike (AKA falcon-sensor, as listed above)
	* `rpm -qa | sort -u | md5sum` is identical on working and whining
	* SELinux = disabled, and I checked a few different ways
	* `/usr/bin/pkill` has the same md5, perms, root:root, and no "capabilities"
	* Same OEL-7.9 & running kernel version
	* They are all VMware VMs and all rebooted at the same time (more-or-less)
	* `logrotate --debug ...` failed to shed any light
	* Probably other things I'm forgetting
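
The checks in the list above can be collected into one comparable "fingerprint" per host. This is a rough sketch, not a complete audit: `host_fingerprint` is a hypothetical function name, and it assumes `getcap` and `getenforce` are installed (missing tools just yield an empty or "unavailable" field).

```shell
# Print one key=value line per check so output from two hosts can be diffed.
# host_fingerprint is a hypothetical helper name.
host_fingerprint() {
    echo "kernel=$(uname -r)"
    echo "pkill_md5=$(md5sum /usr/bin/pkill 2>/dev/null | cut -d' ' -f1)"
    # Empty when the binary carries no file capabilities.
    echo "pkill_caps=$(getcap /usr/bin/pkill 2>/dev/null)"
    echo "selinux=$(getenforce 2>/dev/null || echo unavailable)"
    # Same package-list checksum as the manual check in the list.
    echo "rpm_md5=$(rpm -qa 2>/dev/null | sort -u | md5sum | cut -d' ' -f1)"
}
```

Saving the output to a file on one working and one whining host and running `diff` on the two files would show any field that differs; identical output matches what I found by hand.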

`systemctl restart falcon-sensor` has no effect except to change the PID.  A stop, check `ps` to make sure it's really gone, then start has no effect either.

Speaking of the process itself, both working and whining show ps STAT "Sl":
	S    interruptible sleep (waiting for an event to complete)
	l    is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
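
Since `ps` STAT is identical on both, one more place to compare is `/proc/<pid>/status`, which exposes the credentials and capability set the kernel actually sees for the process. A minimal sketch, assuming a Linux `/proc`; `proc_summary` is a hypothetical name, and fields an older kernel does not provide are simply not printed.

```shell
# Print selected identity and privilege fields for one PID.
# proc_summary is a hypothetical helper; $1 is the PID to inspect.
proc_summary() {
    grep -E '^(Name|State|Uid|Gid|Threads|CapEff|NoNewPrivs|Seccomp):' \
        "/proc/$1/status"
}
```

For example, `proc_summary "$(pgrep -o falcon-sensor)"` on a working and a whining server gives two short listings that can be compared line by line.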

Clearly, *something* changed 7/8ths of the time...but I can't find it, or even think of what it *could* be.  So...what could suddenly cause `pkill -HUP` to stop working?

We do have vendor support but due to internal processes that's a PITA, and last time we asked them anything, they were utterly useless, and that was a simpler question.  (Why does your PoS so-called "security" tool *fail to start* after patching reboots about 35% of the time?!?  So yeah, this tool sucks, and I only have it because I am forced to.)

Thanks again for thinking about this,
JP
--  -------------------------------------------------------------------
JP Vossen, CISSP | http://www.jpsdomain.org/ | http://bashcookbook.com/

___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug