I use some batch scripts in my proxmox installation. They are in cron.hourly and daily checking for virus and ram/CPU load of my LXC containers. An email is send on condition.

What are your tipps or solution without unnecessary load on disc io or CPU time. Lets keep it simple.

Edit: a lot of great input about possible solutions. In addition TIL “that keep it simple” means a lot of different things to people.😉

I use netdata, it’s very good at digesting thousands of metrics to sharing actionable. The cloud portion is proprietary, but you can toggle off the data collection. I did turn on the cloud portion though, I get email notifications when something breaks. Might sound counter to the self hosted mantra, but a self hosted monitoring system isn’t very helpful when your own systems go down.

I set up custom bash scripts collecting information (df, docker json, smartCTL etc) Either parse existing json info or assemble json strings and push it to Homeassistant REST api (cron) In Homeassistant data is turned into sensors and displayed. HA sends messages of sensors fail.
Info served in HA:

  • HDD/SSD (size, smartCTL errors, spin up/down, temperature etc)
  • Availability/health of docker services
  • CPU usage/RAM/temperature
  • Network interface/throughput/speed/connections
  • fail2ban jails

Trying to keep my servers as barebones as possible. Additional services/apps put strain on CPU/RAM etc. Found out most of data necessary for monitoring is either available (docker json, smartCTL json) or can be easily caught, e.g.

df -Pht ext4 | tail -n +2 | awk '{ print $1}

It was fun learning and defining what must be monitored or not, and building a custom interface in HA.

Fermiverse
creator
link
fedilink
3
edit-2
1Y

Thats basically the way I do it.

pvesh get /cluster/resources --output-format json-pretty | jq --arg k "lxc/$container_id" -r 'map(select(.id == $k))[].name, map(select(.id == $k))[].mem, map(select(.id == $k))[].maxmem, map(select(.id == $k))[].cpu')

Example using pvesh in proxmox. The data is available, just have to use it. I also prefer barebone approach.

At last we keep it simple ;)

https://github.com/awesome-foss/awesome-sysadmin#monitoring

I use netdata (agent only, not the cloud/SaaS stuff)

I’ll keep it very simple: I don’t.

If I’m trying to do something and I notice an issue, then I’ll investigate it. But if it’s not affecting anything, is it really a problem?

I was kind of the same, but I still collected metrics, because I just love graphs.

Over time I ended up setting alerts for failures I wish I was aware of earlier. Some examples:

  • HDD monitoring - usually drive is showing signs of failure couple days before it fails, so I have time to shop around for replacement. If I had no alert set, I’d probably only notice when both sides of a mirror failed which would mean couple days of downtime, lot of work with backup restoration and very limited time to find drive for reasonable price
  • networking issues - especially VPN, it’s much better to know that it is broken before you leave house
  • some core services like DNS. With two Adguard instances it’s much better to be alerted when one is down, than to realize that you suddenly have no DNS when both fail and you can’t even google stuff without messing with your connection settings.
  • SSD writes - same as HDDs, but in this case the alert is around 90% declared TBW lifetime claimed by manufacturer and I tend to replace them proactively as they are usually used as system disk without mirror, which holds no valuable data, but would again lead to extended unplanned downtime
  • CPU usage being maxed out for long time - I had one service fail in a way where it consumed 100% of all cores. This had no impact on other services because process scheduler did its job, but I ended up burning kilowats of electricity as this continued unnoticed for weeks. This was before energy prices went up, but it was still noticeable power consumption. (Had double CPU server back then, that consumed a lot of juice when maxed out)

What do you use to collect these metrics?

I use Telegraf for most of the metrics.

Create a post

A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don’t control.

Rules:

  1. Be civil: we’re here to support and learn from one another. Insults won’t be tolerated. Flame wars are frowned upon.

  2. No spam posting.

  3. Posts have to be centered around self-hosting. There are other communities for discussing hardware or home computing. If it’s not obvious why your post topic revolves around selfhosting, please include details to make it clear.

  4. Don’t duplicate the full text of your blog or github here. Just post the link for folks to click.

  5. Submission headline should match the article title (don’t cherry-pick information from the title to fit your agenda).

  6. No trolling.

Resources:

Any issues on the community? Report it using the report flag.

Questions? DM the mods!

  • 1 user online
  • 30 users / day
  • 79 users / week
  • 215 users / month
  • 844 users / 6 months
  • 1 subscriber
  • 1.42K Posts
  • 8.13K Comments
  • Modlog