Setting up server monitoring, then throwing it away
If you spend any time around places where folks talk about self-hosting, you've probably seen at least two things in abundance: screenshots of server dashboards, and screenshots of groups of cool-looking charts. The latter are often taken from monitoring software of some sort. They look impressive and suggest serious business. Just look at this dashboard taken from Grafana's website!
![](https://zblesk.net/blog/content/images/2022/06/image-6.png)
However, I never felt like setting something like that up. At my scale and for my use cases, it seemed completely unnecessary. I also would have bet that setting it up like that is difficult.
That changed when my main server started having issues. Basically every other day something overloaded it, I couldn't even connect, and only a restart would help.
So I thought - right. Let's give this a try, preferably without wasting too much time.
NetData
I had heard about NetData before, so I went and tried it.
I have to say, the first impression was beyond good. It was easy to set up with Docker and immediately showed a dashboard of the system's stats over HTTP. Since I have the server on a Tailscale VPN, I could immediately access that dashboard securely.
![](https://zblesk.net/blog/content/images/2022/06/image-7.png)
That helps only if I happen to be looking at the stats at the exact moment things start going wrong. After all, if the monitoring endpoint goes down with the server, I can't see a thing.
Prometheus & Grafana
To solve that, I added two other popular components. First, Prometheus, to collect the data from NetData.
![](https://zblesk.net/blog/content/images/2022/06/image-8.png)
Then Grafana, to take the data from Prometheus and visualize it.
Both of those took a lot longer to get up and running. I had issues starting them up in Docker, and getting them to talk to each other.
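For a rough idea of what Prometheus pulls from NetData on every scrape, here is a minimal Python sketch that reads the same plain-text metrics and turns them into samples. It is an illustration under assumptions, not the setup from this post: the host in `NETDATA_URL` is a placeholder (19999 is NetData's default port), and `Sample`, `parse_metrics` and `fetch_metrics` are names made up for this example. The parser handles only the common `name{labels} value` lines and does not unescape label values.

```python
import re
from typing import Dict, List, NamedTuple
from urllib.request import urlopen

# Assumption: NetData runs locally on its default port and serves
# metrics in the Prometheus text format at this endpoint.
NETDATA_URL = "http://localhost:19999/api/v1/allmetrics?format=prometheus"


class Sample(NamedTuple):
    """One metric sample (hypothetical helper type for this sketch)."""
    name: str
    labels: Dict[str, str]
    value: float


# metric_name, optional {label="value",...} block, then the sample value.
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> List[Sample]:
    """Parse Prometheus text-format lines, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # "# HELP" / "# TYPE" lines carry no sample
        match = _LINE.match(line)
        if match is None:
            continue  # ignore anything this simple parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append(Sample(name, labels, float(value)))
    return samples


def fetch_metrics(url: str = NETDATA_URL) -> List[Sample]:
    """Download the metrics page once and parse it."""
    with urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling `fetch_metrics()` against a running NetData instance returns one `Sample` per metric line. Prometheus does essentially this on a schedule and stores the results, which is what makes the history survive a server crash.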
Success?
When the data started flowing as it was supposed to, I had to set up the dashboard in Grafana. That is - take all the NetData inputs and display them in a helpful manner. This is the part I thought would be time-consuming and/or difficult.
I just set up a few basic metrics and called it a day; I'd spent far too much time on this anyway.
![](https://zblesk.net/blog/content/images/2022/06/image-9.png)
This stack then helped confirm my suspicions that my config changes to Huginn (which I just love; wrote about it here and here) were the culprit. Sometimes, the worker threads would cause a massive spike in disk reads and CPU usage, which brought the server to a screeching halt.
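I found the spikes by eyeballing the Grafana charts, but the same check can be scripted against Prometheus directly. The sketch below is a hedged example, not what I actually ran: it assumes Prometheus listens on its default port 9090 on localhost, and `find_spikes` and `query_range` are hypothetical helper names. The metric name in the usage note is also an assumption, since the names NetData exports depend on its configuration.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Prometheus is reachable locally on its default port.
PROMETHEUS_URL = "http://localhost:9090"


def find_spikes(response: Dict, threshold: float) -> List[Tuple[float, float]]:
    """Return sorted (timestamp, value) pairs above threshold.

    `response` is the decoded JSON of a Prometheus range query,
    where each series carries a list of [timestamp, "value"] pairs.
    """
    spikes = []
    for series in response["data"]["result"]:
        for timestamp, raw_value in series["values"]:
            value = float(raw_value)  # sample values arrive as strings
            if value > threshold:
                spikes.append((float(timestamp), value))
    return sorted(spikes)


def query_range(query: str, start: float, end: float,
                step: str = "60s", base_url: str = PROMETHEUS_URL) -> Dict:
    """Run a range query through the Prometheus HTTP API."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    with urlopen(f"{base_url}/api/v1/query_range?{params}", timeout=10) as response:
        return json.load(response)
```

With a running Prometheus, something like `find_spikes(query_range("netdata_system_cpu_percentage_average", start, end), 90.0)` would list the moments CPU usage went above 90%, which can then be lined up against when the Huginn workers ran.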
Success, right?
Ditching the entire thing
The problem is that even though the outages became less frequent, they still happened.
Long story short: now the thing crashing my server was... NetData.
When I first set it up, I saw CPU usage spikes firmly in "monitoring should not do this" territory: up to 30% of CPU. So I was wary and set CPU limits in the docker-compose file.
But after yet another crash, I found out that NetData was now using up all the memory.
That's when I decided it was not worth the effort, and just removed it. Maybe I could have placed more limits on it or installed it outside Docker in the hopes of improving the performance, but fortunately, I don't need monitoring that badly. And if I needed it again - the compose file is still there, I could just spin it up again.
Here's to hoping I won't need to. 🤞🏻