Saturday, 24 September 2011

Why I'm moving from Nagios to Zabbix

I need a replacement for Nagios. It's been around for over ten years and is possibly the most widely deployed system monitoring software. But it doesn't tick every box for me. There are people who will harp on about the power of Nagios and its massive user base with billions of plugins, but this doesn't cut it, because there are inherent weaknesses that are difficult to overcome by hacking it into some kind of Frankenstein-ish beast. Here are my main gripes:

File-based configuration doesn't scale well - Managing Nagios config files by hand drives me mad. It's not that the files are particularly hard to read, it's just that as the number of checks grows, it becomes difficult and time-consuming to manage these files. As a result, some checks just don't get created. Yes, there are tools with web interfaces that help, but I've never found these to be as flexible as I would like.

Lack of integrated graphing - Decent graphs save so much time when I'm trying to investigate a problem. There are some good graphing tools like Cacti, and some monitoring systems try to integrate them, but the result is less than optimal. At some point I just know that, despite the interface, these are separate components working in different ways that aren't really aware of each other.

No integrated support for industry standards like SNMP or IPMI - You can get a lot of info via SNMP and it's especially useful on hardware (where you may not be able to install a Nagios agent). Using an array of scripts is not a flexible or efficient way of collecting data on thousands of items.

Need better checks with less effort - While you can script the collection of virtually any piece of data, I don't really want to spend that much of my life creating scripts. I would also like more complex checks that can more precisely define a condition that you want to be notified of. For example, I don't really care if a web server is receiving 50 requests a second. But I do care if it's receiving 50 requests a second and other web servers in the load-balanced cluster are only getting 10 hits per second.
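To make that last example concrete, here is a rough sketch of the kind of check I mean, written in plain Python rather than any monitoring tool's own syntax. The function name, the host names and the threshold of three times the median are all my own assumptions for illustration.

```python
from statistics import median


def overloaded_servers(rates, factor=3.0):
    """Return hosts whose request rate is more than `factor` times
    the median rate of the other hosts in the cluster.

    `rates` maps a host name to its requests per second. The factor
    of 3.0 is an arbitrary illustrative threshold.
    """
    flagged = []
    for host, rate in rates.items():
        # Compare each host only against its peers, not itself.
        peers = [r for h, r in rates.items() if h != host]
        if peers and rate > factor * median(peers):
            flagged.append(host)
    return flagged


# One server takes 50 req/s while its peers take 10: worth an alert.
uneven = {"web1": 50, "web2": 10, "web3": 10}
# Every server takes 50 req/s: busy, but evenly balanced.
even = {"web1": 50, "web2": 50, "web3": 50}

print(overloaded_servers(uneven))
print(overloaded_servers(even))
```

The point is that the absolute number (50 requests a second) never appears in the check, only the comparison with the rest of the cluster.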

So my key requirements for a new monitoring system were as follows:
  • Easier configuration
  • Integrated graphing
  • Built in support for SNMP (IPMI is a bonus)
  • The ability to make more complex checks
While commercial solutions are an option, they're not cheap - they can easily pass the £100k mark for a few hundred servers, so I started looking at open source solutions first. A friend of a friend recommended Zabbix. I had a look at the feature set: web based configuration, integrated graphing, SNMP and native agents, complex checks - looks like we're in business. A year on, I'm happy to recommend it overall. Here are my likes and dislikes:

Like
  • Easy configuration of checks - Monitoring new items is easy once you understand how Zabbix works, and making changes to many machines is a doddle.
  • Integrated graphing - You can create graphs easily and mix data from different hosts, which is great. Better yet is the ability to generate a graph of any numeric data collected on demand, so you don't even need to configure anything.
  • Complex checks - Zabbix has a range of mathematical functions for building an expression that decides whether to trigger an alert. So let's say I'm checking the load on a system. I can set up a trigger to alert if the load is five times larger than the average load of the last hour, and stays at this level for at least ten minutes. This is a very powerful capability.
  • Maps - I haven't really used these in anger yet, but you can create maps of selected hosts, host groups or individual checks. This is really useful for visually showing dependencies between systems, especially when you have a chain of connected processes and you're trying to diagnose a problem.
  • IT Services - Zabbix 'IT Services' are used to show a high level view of an IT service which may consist of multiple hosts and checks. When an alert is triggered, this view shows how an entire service may be affected. I deal with systems with many interconnected components and when one fails it's hard to remember all the other things that might be affected. IT services are good for problem diagnosis and showing a high level view of what client services are affected.
  • Zabbix has a business model to make money - Even though I haven't paid for support yet (my rollout is far from complete), I'm reassured to see a strategy that can work in the business world. The truth is while some developers will work for free, it's very difficult to compete with commercial competitors because most developers like to pay their rent as well. Zabbix has a fast development cycle, commercial training and a conference a few days away. There's also a book on Zabbix which I found really useful.
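
To show what the load trigger described under 'Complex checks' is actually evaluating, here is a minimal Python sketch of the logic. It is not Zabbix's trigger expression syntax. The function name is hypothetical, and I'm assuming one load sample per minute and a baseline taken from the hour before the ten-minute window, so that the spike doesn't inflate its own average.

```python
from statistics import mean


def should_alert(samples, factor=5.0, baseline_len=60, sustain_len=10):
    """Decide whether a sustained load spike warrants an alert.

    `samples` holds one load value per minute, oldest first (an
    assumption for this sketch). Alert only if every sample in the
    last `sustain_len` minutes exceeds `factor` times the average of
    the `baseline_len` minutes before that window.
    """
    if len(samples) < baseline_len + sustain_len:
        return False  # not enough history to judge yet

    recent = samples[-sustain_len:]
    baseline = mean(samples[-(baseline_len + sustain_len):-sustain_len])
    return all(value > factor * baseline for value in recent)


steady = [1.0] * 70                       # nothing unusual
brief_spike = [1.0] * 65 + [6.0] * 5      # high for only five minutes
sustained = [1.0] * 60 + [6.0] * 10       # high for the full ten minutes

print(should_alert(steady))
print(should_alert(brief_spike))
print(should_alert(sustained))
```

Only the last case should alert: the short spike is ignored because the load didn't stay high for the full ten minutes.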
Dislike
  • Logical, but unergonomic interface - It seems to take one click too many to get where you want. The interface needs to better reflect the workflow of an engineer investigating a problem. That said, each release has brought improvements.
If you're considering moving up from Nagios, Zabbix is well worth a look.

11 comments:

  1. Great post, very clarifying :)

    During your quest for a replacement for Nagios, did you find another system that's worth a look?

  2. I found Hyperic to be worth a look. The agent looked well featured, and compared with Zabbix it had a better interface, easier auto discovery (Zabbix 2.0 has autodiscovery now), a more structured way of displaying monitored items and some other nice-to-have stuff. However, Zabbix did most of what I really needed.

    As Zabbix is free to deploy, that certainly made it a lot easier to sell to management. Keep in mind that you'll still have to do a bit more setup work than with something like Hyperic because it's not as polished.

  3. Sorry for not responding earlier, email notifications were accidentally turned off!

  4. Have you had a look at Open Monitoring Distribution? (http://omdistro.org/) It's still Nagios, in theory, but a much better way of using it, all nicely integrated "out of the box".

  5. I've heard of omdistro but not tried it. While new modules do provide extra functionality, Nagios does not have the same modular ethos as something like Linux. So the modules don't work together easily and you're still limited by the Nagios design.

  6. Can you write another post about your experience with Zabbix over the past year?

    Are you still happy with the switch, and do you still recommend Zabbix? Did you find any tools to ease the migration from Nagios to Zabbix?

  7. I'm certainly happy I switched. Like any monitoring software there were niggles, but it's worth it for the enhanced functionality. I've changed company recently but my new company is also using Zabbix, and is in the process of moving to Zabbix 2.0. I've used 2.0 for a month or so and would like to base my next update on that. Haven't quite got the time right now but it's on my list.

    With regard to Nagios migration tools, I didn't use any and didn't look very hard. The agent was well featured, so once you've made a template it's quite easy to apply it to multiple hosts quickly. I didn't get the chance to convert some custom Nagios scripts before I left the company, but I don't think it would be difficult.

  8. Hi, did you try NetXMS? It's GPL and I believe it's pretty good at addressing typical problems of "older" open-source net monitoring systems.

    ps. Yes, I'm affiliated :)

  9. Hadn't heard of NetXMS until now. Looks very Windows in design - it will be a hard sell to my Unix-oriented team!

  10. Hi Andrew,

    NetXMS is not Windows-oriented, if this is what you mean by "looks very Windows". The NetXMS server runs natively both on Unix (Linux, Mac OS X etc) and on Windows. And by native I mean pure OS code and APIs, there are no special porting layers involved. So if you live in a pure Unix environment - just don't download the version for Windows :)

    As for the management console - it runs on virtually any platform since it is Eclipse-based. And there is a Web console if you prefer to use a web browser.
