NOC tools
Defining the problem space
A NOC ("Network Operations Centre") has as its primary task keeping a
network up and running: making sure that what is working continues working,
that what stops working gets fixed, and generally keeping tabs on the health
of the network.
To assist the NOC in doing this, there are multiple different tools
(in the OSS sector, we find Webwatch, Nagios and other tools; in the
proprietary sector, we find HP OpenView and other
competitors). There's also the issue of monitoring traffic levels
(done by Cacti / MRTG on the OSS side and (among others) HP OpenView
on the proprietary side).
Most of these systems share one aspect: they're designed to have a
centralised monitoring platform with, possibly, multiple display
hosts. In the modern world of excessively firewalled networks and the
like, this means either placing your data collection hosts on the outside
(especially if you use SNMP traps for status monitoring) or having
static IP mappings through one or more firewalls (or, optionally,
suitable tunnels through the firewall, but that's a nightmare from a
security POV, so I shall ignore that for now).
From an operations point of view, it is good to have a single
interface through which all of the relevant data can be watched,
through which alerts can be raised and where prior history can be
checked. Ideally, the tool should also have a configuration file
format that is sufficiently easy to machine-generate (and hand-edit),
so we can populate the configuration from one (or more) asset
databases.
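As a rough illustration of "easy to machine-generate and hand-edit", here is a minimal sketch that renders a configuration from asset records and reads it back. The INI layout, the field names (`address`, `monitors`) and the sample records are all assumptions made for the example; the design above does not fix a file format.

```python
import configparser
import io

# Hypothetical asset-database rows; in practice these would come from a query.
ASSETS = [
    {"name": "web-fe-01", "address": "192.0.2.11", "monitors": ["ping", "cpu", "disk"]},
    {"name": "router-01", "address": "192.0.2.1", "monitors": ["ping", "traffic"]},
]


def render_config(assets):
    """Render one INI section per piece of equipment (assumed format)."""
    config = configparser.ConfigParser()
    for asset in sorted(assets, key=lambda a: a["name"]):
        config[asset["name"]] = {
            "address": asset["address"],
            "monitors": ", ".join(asset["monitors"]),
        }
    out = io.StringIO()
    config.write(out)
    return out.getvalue()


def parse_config(text):
    """Read the same format back, as the monitoring server would."""
    config = configparser.ConfigParser()
    config.read_string(text)
    return {
        name: {
            "address": config[name]["address"],
            "monitors": [m.strip() for m in config[name]["monitors"].split(",")],
        }
        for name in config.sections()
    }


text = render_config(ASSETS)
parsed = parse_config(text)
print(text)
```

A plain sectioned text format like this one stays readable in an editor, and regenerating it from the asset database is a matter of rerunning the script.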
Furthermore, it is desirable that the system is broken up into
components, so we can have multiple machines doing the monitoring,
feeding data to/from each other and also have multiple NOC "consoles",
where data and alerts are displayed.
Proposed design
- The monitoring servers are componentised, with all data collection for
any single target being done from the same server.
- Inter-server connections are TCP sessions, authenticated at the
start.
- The only information that is actively pushed across an
inter-server connection is changes in alert status.
- Relevant monitors are represented as "proxy objects" on servers
inquiring about them.
- There is a method of querying a server for all the objects and
proxy objects it holds.
- View data is kept in its own data structures, away from the list of what objects to monitor.
- A monitored object (let's call it "equipment") is a container that contains one or more specific monitors (ping monitor, traffic levels, CPU usage, disk, temperature, anything else that can be monitored automatically). The alert level of the "equipment" is the maximum of the alert levels of the specific monitors it contains.
- Configuration files are easily modified, both programmatically and by hand.
- Wide use of views to aggregate the displayed data.
- To the fullest extent possible, view clients are tied very tightly to an instance of the monitoring software (that is, a client either queries a web server running in the same image as the monitoring software or has a window displayed by an instance that participates in the inter-server network protocol).
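The container and proxy ideas in the list above can be sketched in a few lines. This is only an illustration under assumptions: the class names, the three alert levels and the callback mechanism are invented for the example, and the callback stands in for the authenticated TCP session between servers.

```python
from enum import IntEnum


class Alert(IntEnum):
    # Hypothetical alert levels; the design only requires that they are ordered.
    OK = 0
    WARNING = 1
    CRITICAL = 2


class Monitor:
    """One specific check (ping, disk, ...) on a piece of equipment."""

    def __init__(self, kind, level=Alert.OK):
        self.kind = kind
        self.level = level


class Equipment:
    """Container of monitors; its alert level is the maximum of theirs."""

    def __init__(self, name, monitors):
        self.name = name
        self.monitors = {m.kind: m for m in monitors}
        self.listeners = []  # called only when the overall level changes

    @property
    def level(self):
        return max((m.level for m in self.monitors.values()), default=Alert.OK)

    def update(self, kind, new_level):
        before = self.level
        self.monitors[kind].level = new_level
        after = self.level
        if after != before:  # push changes in alert status, nothing else
            for notify in self.listeners:
                notify(self.name, after)


class ProxyObject:
    """Stand-in held by a server that does not collect the data itself."""

    def __init__(self, name):
        self.name = name
        self.level = Alert.OK

    def on_change(self, name, level):
        # Stand-in for a status change arriving over an inter-server connection.
        self.level = level


server = Equipment("web-fe-01", [Monitor("ping"), Monitor("disk"), Monitor("cpu")])
proxy = ProxyObject("web-fe-01")
pushed = []
server.listeners.append(proxy.on_change)
server.listeners.append(lambda name, level: pushed.append((name, level)))

server.update("disk", Alert.WARNING)   # overall level changes: pushed
server.update("cpu", Alert.WARNING)    # overall level unchanged: nothing pushed
server.update("ping", Alert.CRITICAL)  # overall level changes: pushed
print(pushed, proxy.level)
```

The second update produces no traffic, because the equipment as a whole was already at the warning level. That is the property that keeps inter-server connections quiet.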
Design rationales
- Monitored objects as containers
- The reason for this is that it shrinks the number of alarms shown. This is good, because a server (say) that develops one type of problem is quite likely to develop more problems that will "clutter up" the overall view.
This doesn't mean that the underlying specific monitors are unavailable. We can, at least in principle, create a view for "disk" and populate it with disk-specific monitors.
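A "disk" view of this kind could be as simple as a filter over the specific monitors. The sketch below assumes a flat list of monitor records with made-up names and numeric alert levels (higher is worse); none of this is prescribed by the design.

```python
from collections import namedtuple

# Hypothetical record for one specific monitor on one piece of equipment.
Monitor = namedtuple("Monitor", ["equipment", "kind", "level"])

MONITORS = [
    Monitor("web-fe-01", "ping", 0),
    Monitor("web-fe-01", "disk", 2),
    Monitor("db-01", "disk", 1),
    Monitor("db-01", "cpu", 0),
]


def view_by_kind(monitors, kind):
    """Return the monitors of one kind, worst alert level first."""
    selected = [m for m in monitors if m.kind == kind]
    return sorted(selected, key=lambda m: m.level, reverse=True)


disk_view = view_by_kind(MONITORS, "disk")
for m in disk_view:
    print(m.equipment, m.kind, m.level)
```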
- View-based displays
- While "show us everything, all the time" is one valid method of display, it frequently becomes hard to see what's actually happening. As "see what happens" is the primary use of any system monitor, obscuring this is not ideal.
Thus, views. There are multiple possible ways to arrange views, but for illustration, I will describe one set of computers, with a few useful views of them.
The set-up is two identical data centres. Each data centre has ten front-end web servers, forty back-end web servers, two data-base servers, one SAN, two routers and four switches.
- One semi-useful view would be to aggregate each data centre into a synthetic object, showing the highest alert level within that data centre. Not so easy to see what changes, though.
- Another would be to present all the machines in one data centre on one screen and all the machines in the other data centre on another screen. This gives you a good overview of the machines, while still keeping things visible.
- Yet another possible view would be to present the different classes of machines on different screens (or, possibly, merge multiple views on one physical display), so we'd have one display for front-end web servers (total: 20), one for back-end web servers (total: 80), one for data-base servers (total: 4), one for SANs (total: 2) and one for network equipment (total: 4 routers, 8 switches).
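The three views described here are all groupings of the same inventory. The following sketch builds the two-data-centre set-up from the per-site counts given earlier and groups it by data centre or by machine class. The names (`dc1`, `dc2`, the class labels) and the numeric alert levels are assumptions for illustration.

```python
from collections import defaultdict

# Per-data-centre counts, taken from the set-up described in the text.
PER_DC = {
    "front-end web": 10,
    "back-end web": 40,
    "database": 2,
    "SAN": 1,
    "router": 2,
    "switch": 4,
}


def build_inventory():
    """Create one record per machine in two identical data centres."""
    inventory = []
    for dc in ("dc1", "dc2"):
        for cls, count in PER_DC.items():
            for i in range(1, count + 1):
                inventory.append(
                    {"name": f"{dc}-{cls}-{i}", "dc": dc, "class": cls, "level": 0}
                )
    return inventory


def group_by(inventory, key):
    """Group machine records into views keyed on 'dc' or 'class'."""
    groups = defaultdict(list)
    for item in inventory:
        groups[item[key]].append(item)
    return dict(groups)


def summary(groups):
    """Collapse each group to a synthetic object: its highest alert level."""
    return {name: max(i["level"] for i in items) for name, items in groups.items()}


inventory = build_inventory()
inventory[0]["level"] = 2  # pretend one front-end server in dc1 has a problem

by_dc = group_by(inventory, "dc")
by_class = group_by(inventory, "class")
class_sizes = {cls: len(items) for cls, items in by_class.items()}

print(summary(by_dc))
print(class_sizes)
```

The per-data-centre summary shows that something is wrong in `dc1` but not what, which is why the per-site and per-class views are the more useful ones for day-to-day watching.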
- No CGI-script-based web interface
- The reason for this is that if the underlying monitoring software ever goes away, this should be instantly obvious. With a setup where you have a browser (on one or more machines) querying an Apache server (say), running CGI scripts in the background, you can end up in a position where the backend is no longer monitoring, but the only way you can find this out is by trying to drill down and look into something. The fault status display essentially becomes static.
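One way a tightly coupled display can make a dead backend obvious is to track when it last heard from the monitoring software and say so loudly when that was too long ago. The sketch below is an assumption-laden illustration: the `Console` class, the 30-second timeout and the bare keep-alive message are all invented here, and the keep-alive in particular goes beyond the design above, which only pushes alert status changes.

```python
import time


class Console:
    """Display client that notices when the monitoring backend goes quiet."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout  # hypothetical limit, in seconds
        self.clock = clock
        self.last_seen = clock()
        self.levels = {}

    def on_message(self, name=None, level=None):
        """Handle a status change, or a bare keep-alive when name is None."""
        self.last_seen = self.clock()
        if name is not None:
            self.levels[name] = level

    def is_stale(self):
        return self.clock() - self.last_seen > self.timeout

    def render(self):
        if self.is_stale():
            return "MONITORING LOST: display is out of date"
        worst = max(self.levels.values(), default=0)
        return f"{len(self.levels)} objects, worst alert level {worst}"


# Drive the console with a fake clock so the behaviour is reproducible.
now = [0.0]
console = Console(timeout=30.0, clock=lambda: now[0])
console.on_message("web-fe-01", 1)

now[0] = 10.0
fresh = console.render()

now[0] = 45.0  # nothing heard for 45 seconds
stale = console.render()

print(fresh)
print(stale)
```

A static page produced by a CGI script has no equivalent of `is_stale`: it keeps showing the last known state until somebody reloads it or drills down.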
This is one of Ingvar's essays