Re: [PLUG] Collecting k8s events

I would look into Loki with Grafana, Prometheus, and OpenSearch; for tracing, I would consider Jaeger.

I should find some time to talk with you on IRC about how fluentd works and how to set up a backend for it. For the backend, if you want a time series database that isn't Loki, look at Timescale.

-Will C

On Tue, Jun 13, 2023, 09:42 Rich Freeman via plug <plug@lists.phillylinux.org> wrote:

I'm running k8s at home, and occasionally I'll find pods getting
restarted with little idea of why. The reason for a pod restart is
captured as a k8s event, but by default k8s only retains these for one
hour and has no built-in alerting, so odds are that by the time I
notice that something has restarted, these events are gone.
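
One low-tech way to keep events past the retention window, sketched
here in Python, is to poll `kubectl get events -o json` and append
anything new to a flat file. This is only a sketch under assumptions:
the function names and the dedupe key are made up for illustration,
and it assumes kubectl is on the PATH and already configured for the
cluster.

```python
import json
import subprocess


def format_event(event):
    """Render one Event object (as emitted by kubectl -o json) as a log line."""
    obj = event.get("involvedObject", {})
    when = event.get("lastTimestamp") or event.get("eventTime") or "unknown-time"
    return "{} {} {} {}/{} {}: {}".format(
        when,
        event.get("type", "?"),        # "Normal" or "Warning"
        obj.get("kind", "?"),          # e.g. Pod, Node
        obj.get("namespace", "-"),
        obj.get("name", "?"),
        event.get("reason", "?"),
        event.get("message", "").strip(),
    )


def parse_events(raw_json):
    """Return the list of Event objects from a kubectl JSON list document."""
    return json.loads(raw_json).get("items", [])


def new_lines(events, seen):
    """Format only events not seen before; `seen` is a set updated in place.

    The (uid, lastTimestamp) key is an illustrative choice so that a
    repeated event is logged again when its timestamp advances.
    """
    lines = []
    for event in events:
        key = (event.get("metadata", {}).get("uid"), event.get("lastTimestamp"))
        if key in seen:
            continue
        seen.add(key)
        lines.append(format_event(event))
    return lines


def fetch_events():
    """Ask kubectl for the events currently held by the cluster."""
    result = subprocess.run(
        ["kubectl", "get", "events", "--all-namespaces", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return parse_events(result.stdout)


def archive(path, seen):
    """Append any not-yet-seen events to the file at `path`."""
    with open(path, "a") as out:
        for line in new_lines(fetch_events(), seen):
            out.write(line + "\n")
```

Calling archive() in a loop at an interval shorter than the retention
window should catch everything. Because `seen` lives in memory, a
restart of the script would log the events still in the cluster a
second time.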

The usual answer seems to be some kind of log aggregator (though not
all of them support events - fluentd doesn't appear to, for example).

The typical approach is to use an agent (many exist) to collect this
data, and then dump it into some kind of search/visualization tool.
Elasticsearch and Grafana appear to be the leading options.

What would make the most sense? Keep in mind this is for the home, so
we're talking 10-20 containers, not thousands. I'd prefer that the
monitoring solution not be the most complex part of the entire
cluster. That said, if it is easy to deploy in k8s then that
mitigates the complexity (though I'd still prefer that it not eat up
gigabytes of RAM 24x7).

I am not concerned with pretty charts of CPU use and so on. This is
about logs, not metrics. Of course being able to be alerted when a
host is running out of disk is still useful, but I don't have so many
hosts that it is hard to be aware of things like that.

Oh, for those who aren't familiar with k8s, I'm talking about k8s
events, not application logs. Logs are created by applications and
are what you're already familiar with. Events are basically logs at
the cluster level. So if an application has an error and terminates,
it might helpfully write to the log before it dies. If the cluster
sends the application a SIGTERM for whatever reason, the application
log is just going to say that it died because it got a SIGTERM, but
the events would capture why the cluster sent that SIGTERM. So it is
important to capture both. Events would also tell you about hosts
crashing and so on, though as with most OS-level logs they might not
say much as to why.
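
To make the log/event distinction concrete, here is a small Python
sketch that builds the two kubectl commands for looking at both sides
of one restart: the log of the previous container instance, and the
events recorded against the pod. The helper names, the pod name and
the namespace are hypothetical; the kubectl flags are the standard
ones.

```python
import shlex


def previous_log_cmd(pod, namespace="default"):
    """Command for the application's own view: the log of the container
    instance that ran before the most recent restart."""
    return ["kubectl", "logs", pod, "-n", namespace, "--previous"]


def pod_events_cmd(pod, namespace="default"):
    """Command for the cluster's view: events whose involved object is this pod."""
    selector = "involvedObject.kind=Pod,involvedObject.name={}".format(pod)
    return [
        "kubectl", "get", "events", "-n", namespace,
        "--field-selector", selector,
        "--sort-by=.lastTimestamp",
    ]


def show_commands(pod, namespace="default"):
    """Return both commands as shell-quoted strings, ready to paste."""
    return [
        shlex.join(previous_log_cmd(pod, namespace)),
        shlex.join(pod_events_cmd(pod, namespace)),
    ]
```

The second command only helps while the events still exist, which is
why they need to be captured somewhere before they expire.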

--
Rich
___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug