NOC tools

Defining the problem space

A NOC ("Network Operations Centre") has as its primary task to keep a network up and running, making sure what is working continues working, what stops working gets fixed and generally keeping tabs on the health of the network.

To assist the NOC in doing this, there are a number of tools: in the OSS sector, we find Webwatch, Nagios and others; in the proprietary sector, we find HP OpenView and its competitors. There's also the issue of monitoring traffic levels, done by Cacti and MRTG on the OSS side and by HP OpenView, among others, on the proprietary side.

Most of these systems share one aspect: they're designed to have a centralised monitoring platform with, possibly, multiple display hosts. In the modern world of excessively firewalled networks and the like, this means either placing your data collection hosts on the outside (especially if you use SNMP traps for status monitoring) or having static IP mappings through one or more firewalls (or, optionally, suitable tunnels through the firewall, but that's a nightmare from a security POV, so I shall ignore that for now).

From an operations point of view, it is good to have a single interface through which all of the relevant data can be watched and alerts can be raised, and where prior history can be checked. Ideally, the tool should also have a configuration file format that is sufficiently easy to machine-generate (and hand-edit), so we can populate the configuration from one (or more) asset databases.
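As an illustration of what "easy to machine-generate (and hand-edit)" could look like, here is a minimal sketch that renders asset records into a line-oriented configuration. The stanza format, the field names and the `ASSETS` list are all invented for this example; a real asset database and a real monitoring tool would dictate their own.

```python
# Hypothetical asset records; in practice these would be read from an asset database.
ASSETS = [
    {"name": "web-fe-01", "site": "dc1", "kind": "frontend", "monitors": ["ping", "http"]},
    {"name": "db-01", "site": "dc1", "kind": "database", "monitors": ["ping", "disk"]},
]


def render_config(assets):
    """Render one stanza per asset in an invented, line-oriented format.

    Stanzas are sorted by name so that regenerating the file gives stable,
    diff-friendly output that is still pleasant to edit by hand.
    """
    lines = []
    for asset in sorted(assets, key=lambda a: a["name"]):
        lines.append(f"object {asset['name']}")
        lines.append(f"    site {asset['site']}")
        lines.append(f"    kind {asset['kind']}")
        for monitor in asset["monitors"]:
            lines.append(f"    monitor {monitor}")
        lines.append("")  # blank line between stanzas
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_config(ASSETS))
```

A flat format like this is trivial to emit from a script and just as easy to fix by hand at three in the morning, which is the property that matters here.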

Furthermore, it is desirable that the system is broken up into components, so we can have multiple machines doing the monitoring, feeding data to/from each other, and also have multiple NOC "consoles" where data and alerts are displayed.

Proposed design

Design rationales

Monitored objects as containers
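A monitored object (a server, say) acts as a container for its individual monitors, and the display shows the object rather than each monitor. The reason for this is that it shrinks the number of alarms shown. This is good, because a server that develops one type of problem is quite likely to develop more problems, and those would otherwise "clutter up" the overall view.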

This doesn't mean that the underlying specific monitors are unavailable. We can, at least in principle, create a view for "disk" and populate it with disk-specific monitors.
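A minimal sketch of the container idea, under my own assumptions: the three alert levels, the class names and the "highest level wins" rule are illustrative choices, not a fixed part of the design.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Level(IntEnum):
    """Illustrative alert levels; a real system may well have more."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


@dataclass
class Monitor:
    name: str
    kind: str              # e.g. "disk", "load", "ping"
    level: Level = Level.OK


@dataclass
class MonitoredObject:
    """A container: the display shows one alarm per object, not per monitor."""
    name: str
    monitors: list = field(default_factory=list)

    def level(self):
        # The object reports the worst level among its monitors.
        return max((m.level for m in self.monitors), default=Level.OK)


def monitors_of_kind(objects, kind):
    """Build a monitor-specific view, e.g. every "disk" monitor across all objects."""
    return [(obj.name, m) for obj in objects for m in obj.monitors if m.kind == kind]


if __name__ == "__main__":
    server = MonitoredObject("db-01", [
        Monitor("root fs", "disk", Level.CRITICAL),
        Monitor("load average", "load", Level.WARNING),
    ])
    print(server.name, server.level().name)
    print(monitors_of_kind([server], "disk"))
```

The server above has two unhappy monitors but raises a single alarm, while the "disk" view still reaches the individual monitor.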

View-based displays
While "show us everything, all the time" is one valid method of display, it frequently becomes hard to see what's actually happening. As "see what happens" is the primary use of any system monitor, obscuring this is not ideal.

Thus, views. There are multiple possible ways to arrange views, but for illustration, I will describe one set of computers, with a few useful views of them.

The set-up is two identical data centres. Each data centre has ten front-end web servers, forty back-end web servers, two data-base servers, one SAN, two routers and four switches.

  1. One semi-useful view would be aggregating the two data centres into synthetic objects, each showing the highest alert level within its data centre. Not so easy to see what changes, though.
  2. Another would be to present all the machines in one data centre on one screen and all the machines in the other data centre on another screen. This gives you a good overview of the machines, while still keeping things visible.
  3. Yet another possible view would be to present the different classes of machines on different screens (or, possibly, merge multiple views on one physical display), so we'd have one display for front-end web servers (total: 20), one for back-end web servers (total: 80), one for data-base servers (total: 4), one for SANs (total: 2) and one for network equipment (total: 4 routers, 8 switches).

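The three views can be sketched as different groupings over one inventory. The machine naming scheme, the numeric alert level and the helper functions are assumptions made for the example; only the per-data-centre counts come from the set-up described above.

```python
from collections import defaultdict

# Machines per data centre, as in the set-up above.
PER_SITE = {"frontend": 10, "backend": 40, "database": 2, "san": 1, "router": 2, "switch": 4}
SITES = ["dc1", "dc2"]  # hypothetical site names


def build_inventory():
    """One record per machine; level 0 means "no alert" in this sketch."""
    return [
        {"name": f"{site}-{kind}-{i:02d}", "site": site, "kind": kind, "level": 0}
        for site in SITES
        for kind, count in PER_SITE.items()
        for i in range(1, count + 1)
    ]


def group_by(machines, key):
    """Split machines into named groups; each group is one view."""
    groups = defaultdict(list)
    for machine in machines:
        groups[key(machine)].append(machine)
    return dict(groups)


def display_class(machine):
    """Routers and switches share one "network" display."""
    return "network" if machine["kind"] in ("router", "switch") else machine["kind"]


inventory = build_inventory()

# View 2: all machines of one data centre per screen.
by_site = group_by(inventory, lambda m: m["site"])

# View 1: one synthetic object per data centre, showing its highest alert level.
site_summary = {site: max(m["level"] for m in ms) for site, ms in by_site.items()}

# View 3: one screen per class of machine.
by_class = group_by(inventory, display_class)

if __name__ == "__main__":
    for name, machines in by_class.items():
        print(name, len(machines))
```

Since every view is just a grouping function over the same records, adding a new view means writing one more key function rather than touching the monitoring side.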
No CGI-script-based web interface
The reason for this is that if the underlying monitoring software ever goes away, this should be instantly obvious. With a setup where you have a browser (on one or more machines) querying an Apache server (say), running CGI scripts in the background, you can end up in a position where the backend is no longer monitoring, but the only way you can find this out is by trying to drill down and look into something. The fault status display essentially becomes static.
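One way a display could make a dead backend instantly obvious is to treat silence itself as a fault. The sketch below assumes the backend pushes updates to the display; the class name, the 30-second threshold and the injectable clock are my assumptions for illustration, not a description of any existing tool.

```python
import time


class StatusDisplay:
    """A display that refuses to show old data as if it were current."""

    def __init__(self, max_silence=30.0, clock=time.monotonic):
        self.max_silence = max_silence  # seconds of silence tolerated
        self.clock = clock              # injectable so it can be tested
        self.last_heard = None          # time of the most recent update
        self.statuses = {}

    def on_update(self, name, status):
        """Called whenever the monitoring backend pushes a status."""
        self.statuses[name] = status
        self.last_heard = self.clock()

    def backend_alive(self):
        if self.last_heard is None:
            return False
        return self.clock() - self.last_heard <= self.max_silence

    def render(self):
        if not self.backend_alive():
            # Silence replaces the whole display, so it cannot be missed.
            return "MONITORING BACKEND SILENT: displayed status may be stale"
        return "\n".join(f"{name}: {status}" for name, status in sorted(self.statuses.items()))


if __name__ == "__main__":
    display = StatusDisplay(max_silence=30.0)
    display.on_update("web-01", "OK")
    print(display.render())
```

For this to work, the backend has to send something at regular intervals even when nothing has changed, so that "no news" can be told apart from "no monitoring".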

This is one of Ingvar's essays
