A monitoring system needs to meet the requirements expected of it. The first thing you need to do, as the system or network administrator, is get management buy-in on deploying a supervisory and data acquisition system that meets corporate goals. The second is to define the scope of the monitoring system and its particularities.
Shinken can scale out horizontally on multiple servers or vertically with more powerful hardware. Shinken deals automatically with distributed status retention, so there is no need for external clustering or HA solutions. Shinken can be distributed across multiple IP networks, for example to a secondary datacenter or DR site.
Scalability can be described through a few key metrics
To a lesser extent, performance data volume also matters, though it is not expected to overload a Graphite instance (a single server can handle up to 80K updates per second) or even RRDTool+RRDcache on a hardware RAID 10 of 10K RPM disks.
Passive checks do not need to be scheduled by the monitoring server. Data acquisition and processing are distributed to the monitored hosts, permitting lower acquisition intervals and more data points to be collected.
Active checks benefit from Shinken's powerful availability algorithms for fault isolation and false positive elimination.
A typical installation should make use of both types of checks.
Careful thought is needed when choosing a protocol, and the number of data points that need to be collected will influence the acquisition method. There are many ways to slice an apple, but only a few scale beyond a few thousand services.
What is a big deployment? It depends on check frequency and number of services. 10K NSCA-based passive service results per minute is nothing for Shinken. 10K SSH checks per minute is unrealistic. 10K SNMP checks per minute can grind a server to a halt if an efficient poller is not used. Large deployments could easily ask for 20K, 50K or 80K service checks per minute per scheduler.
Large numbers of active checks need to use poller modules.
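To put these figures in perspective, the following rough sketch converts a number of checks and an interval into a sustained check rate. The function name and the assumption that checks are spread evenly over the interval are illustrative only and are not part of Shinken.

```python
def checks_per_second(checks, interval_seconds):
    """Average check rate, assuming checks are spread evenly over the interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return checks / interval_seconds


if __name__ == "__main__":
    # Figures quoted in the text: checks per minute on one scheduler.
    for checks in (10_000, 20_000, 50_000, 80_000):
        rate = checks_per_second(checks, 60)
        print(f"{checks} checks per minute is about {rate:.0f} checks per second")
```

A rate in the hundreds or thousands per second leaves only a few milliseconds per check, which is why the acquisition method matters more than the raw service count.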
Other integrated poller modules can be easily developed as required for ICMP(ping), SSH, TCP probes, etc.
Check_mk also uses a daemonized poller for its Windows and Unix agents, which makes it a good choice for scaling data acquisition from hosts. Note that WATO, the configuration frontend, is not compatible with Shinken. Check_mk is limited to RRD backends, but can send the performance data to Shinken along with the state, permitting a Shinken Broker to forward the data to Graphite.
The broker is a key component of the scalable architecture. Only a single broker can be active per scheduler. A broker can process broks (messages) from multiple schedulers. In most modern deployments, Livestatus is the broker module that provides status information to the web frontends (Nagvis, Multisite, Thruk, Splunk, etc.) or to Shinken's own WebUI module. The broker needs memory and processing power.
Avoid using any broker modules that write logs or performance data to disk without making use of a real time series database like RRDtool/RRDcache or Graphite.
Make use of SQLite3 or MongoDB to store Livestatus retention data. MongoDB integration with Livestatus is considered somewhat experimental, but can be very beneficial if performance and resiliency are desired, especially when using a spare broker. MongoDB will ensure historical retention data is available to the spare broker, whereas with SQLite it will not be, and manual syncing is required.
Shinken has a great dependency resolution model.
Automatic root cause isolation, at a host level, is one method that Shinken provides. This is based on explicitly defined parent/child relationships. On a service or host failure, Shinken marks the element as soft-down and automatically reschedules an immediate check of its parent(s). Once the root failure(s) are found, any children are marked with an unknown status instead of soft-down, and the real failure is promoted to hard-down, with notifications and actions firing.
This model is very useful in reducing false positives. What needs to be understood is that it depends on defining a dependency tree. A dependency tree is restricted to a single scheduler. Shinken provides a distributed architecture, which needs at least two trees to make sense.
Splitting trees by a logical grouping makes sense. This could be groups of services, geographic location, network hierarchy or other. Some common critical elements (core routers, datacenter routers, AD, DNS, DHCP, NTP, etc.) may need to be duplicated at a host level in each tree, for example with a ping check. A typical tree will involve clients, servers, network paths and dependent services. Make a plan, then see if it works. If you need help designing your architecture, a professional services offering is in the works by the Shinken principals and their consulting partners.
Typically, Pollers and Schedulers use up the most network, CPU and memory resources. Use the distributed architecture to scale horizontally on multiple commodity servers. Use at least a pair of Scheduler daemons on each server. Your dependency model should permit at least two trees, preferably four.
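The parent/child logic can be illustrated with a minimal sketch. This is not Shinken's implementation: the function name, the `parents` mapping and the `is_up` callback (standing in for an immediate re-check of a host) are all hypothetical, and soft/hard state handling is reduced to its end result.

```python
def isolate_root_causes(failed_host, parents, is_up):
    """Walk up the parent tree from a failed host.

    parents: dict mapping a host name to a list of its parent host names
    is_up:   callable(host) -> bool, a stand-in for an immediate re-check
    Returns (roots, states): the set of root failures and the state
    assigned to every failed host visited.
    """
    states = {}
    roots = set()

    def visit(host):
        if host in states:
            return  # already classified
        host_parents = parents.get(host, [])
        down_parents = [p for p in host_parents if not is_up(p)]
        if host_parents and len(down_parents) == len(host_parents):
            # Every parent is down: this host is a victim, not the cause.
            states[host] = "UNKNOWN"
            for parent in down_parents:
                visit(parent)
        else:
            # No parents, or at least one parent still up: real failure.
            states[host] = "HARD-DOWN"
            roots.add(host)

    visit(failed_host)
    return roots, states


if __name__ == "__main__":
    tree = {"web01": ["switch-a"], "switch-a": ["core-router"], "core-router": []}
    down = {"web01", "switch-a", "core-router"}
    roots, states = isolate_root_causes("web01", tree, lambda h: h not in down)
    print(sorted(roots), states)
```

In this sketch only the root failure would trigger notifications, while the hosts behind it are reported as unknown.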
Shinken 1.01 currently routes data from a Shinken Receiver through the Arbiter, as the Arbiter is the only daemon that knows which Scheduler is responsible for the data received. The Arbiter is also tasked with administrative blocking functions that can inhibit the responsiveness of acquisition from a Receiver. This is not an issue for small Shinken installations, as the Arbiter daemon will be blocked for very short periods of time. Only large Shinken installations with tens of thousands of services may experience delays related to passive event routing.
In all cases, should the Arbiter process be blocked (doing another task and not forwarding broks (messages) from a Receiver to a Scheduler), data will NOT be lost. It will be queued in the Receiver until the Arbiter can process it.
The only consideration here is to make sure to configure Shinken Receiver daemons. These will receive the NSCA messages and queue them to be sent to the Arbiter and on to the Scheduler for processing. The receiver will buffer data until an Arbiter is available to forward the broks (messages).
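The buffering behaviour described above can be pictured with a small sketch. It is only an illustration of the queue-until-available idea: the class, its methods and the two callbacks are hypothetical and do not reflect the actual Receiver code.

```python
from collections import deque


class BufferingReceiver:
    """Holds incoming broks until the next hop is able to accept them."""

    def __init__(self):
        self.pending = deque()

    def receive(self, brok):
        # Always accept and queue; nothing is dropped while the Arbiter is busy.
        self.pending.append(brok)

    def flush(self, arbiter_available, forward):
        """Forward queued broks in arrival order while the Arbiter is available.

        arbiter_available: callable() -> bool
        forward:           callable(brok), expected to raise on failure
        Returns the number of broks forwarded.
        """
        sent = 0
        while self.pending and arbiter_available():
            forward(self.pending[0])   # forward first ...
            self.pending.popleft()     # ... and dequeue only once it succeeded
            sent += 1
        return sent


if __name__ == "__main__":
    receiver = BufferingReceiver()
    receiver.receive("result for web01/HTTP")
    receiver.receive("result for web01/SSH")
    delivered = []
    print(receiver.flush(lambda: False, delivered.append), len(receiver.pending))
    print(receiver.flush(lambda: True, delivered.append), delivered)
```

Dequeuing only after a successful forward is the design choice that keeps a result in the buffer if forwarding fails part way through.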
NSCAweb is a much more evolved protocol for sending data than NSCA. Use curl from the command line to send your data, or submit check results from your own software using an HTTP POST.
The Python NSCAweb listener, https://github.com/smetj/nscaweb, can be adapted to act as a Shinken Receiver module. Similar implementations are in use.
Net-SNMP's snmptrapd and the SNMP trap translator (snmptt) are typically used to receive and process traps and to trigger alerts. Once an alert has been identified, send_nsca or another method is executed to send the result data to a Shinken Receiver daemon. There is no actual Shinken Receiver module to receive SNMP traps, but the point is to get the data sent to the Shinken Receiver daemon.
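As an illustration of submitting a result over HTTP from your own software, the sketch below builds a passive service result in the standard Nagios external command format and wraps it in a POST request. The URL and the form field names (`username`, `password`, `input`) are assumptions made for the example; check them against the listener you actually deploy.

```python
import time
from urllib.parse import urlencode
from urllib.request import Request


def service_result_command(host, service, return_code, output, timestamp=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{output}"


def build_post(url, username, password, command):
    """Build (but do not send) a form-encoded POST carrying the command.

    The field names used here are assumptions, not a documented API.
    """
    body = urlencode({"username": username, "password": password, "input": command})
    return Request(url, data=body.encode("utf-8"), method="POST")


if __name__ == "__main__":
    cmd = service_result_command("web01", "HTTP", 0, "OK - 200 in 0.12s")
    # Hypothetical listener address; replace with your own.
    req = build_post("http://localhost:5668/queue", "default", "changeme", cmd)
    print(req.get_method(), req.full_url)
    print(cmd)
    # urllib.request.urlopen(req) would actually send the request.
```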
The snmptt documentation has a good writeup on integrating with Nagios, which also applies to Shinken.
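A trap handler that hands its result to send_nsca could look roughly like the sketch below. send_nsca reads tab-separated lines (host, service, return code, output) on standard input; the helper names, the default configuration path and the idea of calling this from an snmptt handler are assumptions for illustration.

```python
import subprocess


def nsca_line(host, service, return_code, output):
    """Format one passive service result in send_nsca's tab-separated input format."""
    # Tabs and newlines inside the output would break the format, so flatten them.
    clean = str(output).replace("\t", " ").replace("\n", " ")
    return f"{host}\t{service}\t{return_code}\t{clean}\n"


def send_result(line, receiver_host, config_path="/etc/send_nsca.cfg", binary="send_nsca"):
    """Pipe one formatted line to send_nsca and return its exit code.

    config_path is an assumed location; adjust it to your installation.
    """
    completed = subprocess.run(
        [binary, "-H", receiver_host, "-c", config_path],
        input=line.encode("utf-8"),
        check=False,
    )
    return completed.returncode


if __name__ == "__main__":
    line = nsca_line("switch-a", "TRAP", 2, "linkDown on ifIndex 12")
    print(repr(line))
    # send_result(line, "shinken-receiver.example.org") would submit it.
```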
Various open source and commercial SDKs are available to implement a Shinken Receiver module for getting data from OPC-DA or OPC-UA servers. This is a common protocol for industrial applications. If planning to receive data from industrial controls and systems, it is mandatory to use Graphite to store the performance data. RRDtool is not suitable for storing exact time series without interpolation. There are no planned implementations of this module, but should someone be interested in implementing one, support will be provided.
Adding a Shinken Receiver module to act as a consumer of AMQP messages can be implemented without much fuss. There are no planned implementations of this module, but should someone be interested in implementing one, support will be provided.
Typically, for networking devices, SNMP v2c is the most efficient method of data acquisition. Security considerations should be taken into account on the device accepting SNMP v2c requests, so that requests are filtered to specific hosts and restricted to the required OIDs. How this is done is device specific. SNMP v2c does not encrypt or protect the data or the community strings (passwords).
There are a myriad of SNMP monitoring scripts, and most are a poor fit for scalable installations. Every time one is launched, a Perl or Python interpreter needs to be started, modules need to be imported, the script is executed, results are returned and then the script is cleared from memory. Rinse and repeat, which is very inefficient. Only two SNMP polling modules can meet high scalability requirements.
The first is Shinken's integrated SNMP poller, which can scale to thousands of SNMP checks per second. It is currently in testing.
Check_mk also has a good SNMP acquisition model.
Shinken provides an integrated NRPE check launcher. It is implemented in the poller as a module that bypasses the launch of the check_nrpe process: it reads the check command and opens the connection itself. This gives a big performance boost when launching check_nrpe calls.
The command definitions should be identical to the check_nrpe calls.
The module definition is very simple (you probably just have to uncomment it):
define module{
module_name NrpeBooster
module_type nrpe_poller
}
Then add it to your poller object:
define poller {
[...]
modules NrpeBooster
}
Then just tag all your check_nrpe commands with this module:
define command {
command_name check_nrpe
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$
module_type nrpe_poller
}
That's it. All checks that use this command will be picked up by the NRPE module and launched by it instead of by the check_nrpe process.