Fix remote logging #138

Open
opened 2024-08-22 23:34:17 +02:00 by oysteikt · 5 comments
Owner

What's the current state of remote logging? We do have a loki instance on ildkule, but it's not doing anything interesting at the moment?

oysteikt added the
security
salt
big
logging
nixos
labels 2024-08-22 23:34:17 +02:00
oysteikt added this to the Kanban project 2024-08-22 23:34:17 +02:00
oysteikt changed title from Fix proper remote logging to Fix remote logging 2024-08-22 23:34:22 +02:00
Owner

It was too slow to be useful the times I tried using it, anyway.

Right now it's broken since we moved all our logging/graphs/uptime tracking to openstack.

Which doesn't work (yet?). Felix is working on it, I think.

Owner

Loki ingests logs just fine (when the server is up), but search and filtering in Grafana can be quite slow and heavy, as it is naturally a resource-intensive task. My main issue with Loki, however, is that configuring alerts and rules in Grafana is awful, both for Prometheus and Loki data.

I think rsyslog is a great system for us. Alternatively, we could use something like https://www.elastic.co/blog/get-system-logs-and-metrics-into-elasticsearch-with-beats-system-modules if we want to try something entirely different.
rsyslog is performant and cool, and has a mature and sane ecosystem of filtering and alerting (output modules like ommail and omhttp).

Whatever system we use will probably both require a performant server, and still be sluggish, as logs are heavy.
Filebeat, rsyslog and promtail are, however, all surprisingly lightweight on each of the clients.

Author
Owner

Earlier today I had a look into the state of journald-remote, and it was in surprisingly good shape.

It uses continuous HTTPS streams (which we can authenticate with client certs and a CA if we'd like, or just authenticate by IP) to stream all journald logs to another machine, with good systemd watchdog integration and everything. It allows us to run `journalctl --merge` on the receiving machine and see all logs for all hosts, tagged by hostname. We can also easily pump the logs into something like signoz (free elasticsearch alternative), loki or meilisearch (with a frontend) to get better search possibilities.
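
To make the "tagged by hostname" part concrete, here is a rough sketch of how the merged journal could be consumed on the receiving machine. It assumes `journalctl --merge --output=json` is run there and relies on the standard `_HOSTNAME` and `MESSAGE` journal fields; the function names are made up for illustration and are not part of the branch.

```python
import json
import subprocess
from collections import defaultdict


def group_by_host(json_lines):
    """Group journal entries (one JSON object per line) by their _HOSTNAME field."""
    by_host = defaultdict(list)
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        host = entry.get("_HOSTNAME", "unknown")
        message = entry.get("MESSAGE")
        # journalctl emits non-UTF-8 messages as a list of byte values.
        if isinstance(message, list):
            message = bytes(message).decode("utf-8", errors="replace")
        by_host[host].append(message)
    return dict(by_host)


def read_merged_journal(lines=100):
    """Read the last `lines` entries from all available journals, remote ones included."""
    out = subprocess.run(
        ["journalctl", "--merge", "--output", "json",
         "--lines", str(lines), "--no-pager"],
        check=True, capture_output=True, text=True,
    ).stdout
    return group_by_host(out.splitlines())
```

Something like `read_merged_journal(lines=500)` on the receiver would then give a dict mapping each hostname to its recent messages, which is roughly the shape a later ingestion step would need.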

There's also journalwatch (https://gitlab.com/distrosync/nixos/-/blob/master/modules/journalwatch.nix) if we'd like to add alerts directly, but there might be better options after ingestion?
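
As a strawman for what alerting after ingestion could look like, here is a minimal sketch that picks out entries at or above a given severity from JSON-formatted journal lines. This is not how journalwatch works; it only assumes the standard `PRIORITY` journal field (syslog levels, 0 = emerg through 7 = debug, stored as a string), and `find_alerts` is a hypothetical name.

```python
import json


def find_alerts(json_lines, max_priority=3):
    """Return (host, priority, message) for entries with PRIORITY <= max_priority.

    Lower numbers are more severe, so the default of 3 keeps err and worse.
    """
    alerts = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        try:
            priority = int(entry["PRIORITY"])
        except (KeyError, TypeError, ValueError):
            # Entries without a usable PRIORITY field are skipped.
            continue
        if priority <= max_priority:
            alerts.append(
                (entry.get("_HOSTNAME", "unknown"), priority, entry.get("MESSAGE"))
            )
    return alerts
```

For simple cases `journalctl --merge --priority=err` does the same filtering natively; a script like this would mostly make sense if we want to forward the matches somewhere (mail, chat, etc.).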

There's a branch with most of the required setup at https://git.pvv.ntnu.no/Drift/pvv-nixos-config/src/branch/systemd-journald-remote, but it's kinda blocked by #146

Owner

FWIW I've heard journald-remote breaks a lot, I would have suggested it earlier otherwise.

I think getting a faster machine for loki or ELK is what we want, and if we have that then the pushing of the logs is a non-issue. (unless the point was mostly to get a text-based greppable thing, which could be nice!)

Author
Owner

> FWIW I've heard journald-remote breaks a lot, I would have suggested it earlier otherwise.

I've heard you say it before, but I do think this information could be outdated. I skimmed through the issue list when I tested it out, and it seems like they have been actively working to improve it lately, fixing stuff like https://github.com/systemd/systemd/issues/9858, and with PRs like https://github.com/systemd/systemd/pull/31313 (although not yet merged). I have not tested it for long, but I'm currently using it in my homelab.

oysteikt added the
disputed
label 2024-09-14 22:56:20 +02:00