NAME
mon - monitor services for availability, sending alarms
upon failures.
SYNOPSIS
mon [-dfhlSv] [-a dir] [-A authfile] [-b dir] [-B dir] [-c
config] [-D dir] [-i secs] [-k num] [-L dir] [-m num] [-p
num] [-P pidfile] [-r delay] [-s dir]
DESCRIPTION
mon is a general-purpose scheduler for monitoring service
availability and triggering alerts upon detecting fail-
ures. mon was designed to be open in the sense that it
supports arbitrary monitoring facilities and alert methods
via a common interface, which are easily implemented
through shell scripts, Perl scripts, C, or any other lan-
guage.
OPTIONS
-a dir Path to alert scripts. Default is
/usr/lib/mon/alert.d. Multiple alert paths may be
specified by separating them with a colon. All
paths must be absolute.
-b dir Base directory for mon. scriptdir, alertdir, and
statedir are all relative to this directory unless
specified from /. Default is /usr/lib/mon.
-B dir Configuration file base directory. All config files
are located here, including mon.cf, monusers.cf,
and auth.cf.
-A authfile
Authentication configuration file. By default this
is /etc/mon/auth.cf if the /etc/mon directory
exists, or /usr/lib/mon/auth.cf otherwise.
-c file
Read configuration from file. This defaults to
/etc/mon/mon.cf if the /etc/mon directory
exists, otherwise to /etc/mon.cf.
-d Enable debugging mode.
-D dir Path to state directory. Default is the first of
/var/state/mon, /var/lib/mon, and
/usr/lib/mon/state.d which exists.
-f Fork and run as a daemon process. This is the pre-
ferred way to run mon.
-h Print help information.
-i secs
Sleep interval, in seconds. Defaults to 1. This
shouldn't need to be adjusted for any reason.
-k num Set log history to a maximum of num entries.
Defaults to 100.
-l Load state from the last saved state file. Cur-
rently the only supported saved state is disabled
watches, services, and hosts.
-L dir Sets the log dir. See also logdir in the configura-
tion file.
-m num Set the throttle for the maximum number of pro-
cesses to num.
-p num Make server listen on port num. This defaults to
2583.
-S Start with the scheduler stopped.
-P pidfile
Store the server's pid in pidfile. The default is
the first of /var/run/mon/mon.pid,
/var/run/mon.pid, and /etc/mon.pid whose directory
exists. An empty value tells mon not to use a pid
file.
-r delay
Sets the number of seconds used to randomize the
startup delay before each service is scheduled.
Refer to the global randstart variable in the con-
figuration file.
-s dir Path to monitor scripts. Default is
/usr/lib/mon/mon.d. Multiple monitor paths may be
specified by separating them with a colon. All
paths must be absolute.
-v Print version information.
DEFINITIONS
monitor
A program which tests for a certain condition,
returns either true or false, and optionally pro-
duces output to be passed back to the scheduler.
Common monitors detect host reachability via ICMP
echo messages, or connection to TCP services.
period A period in time as interpreted by the Time::Period
module.
alert A program which sends a message when invoked by the
scheduler. The scheduler calls upon an alert when
it detects a failure from a monitor. An alert pro-
gram accepts a set of command-line arguments from
the scheduler, in addition to data via standard
input.
hostgroup
A single host or list of hosts, specified as names
or IP addresses.
service
A collection of parameters used to deal with moni-
toring a particular resource which is provided by a
group. Services are usually modeled after things
such as an SMTP server, ICMP echo capability,
server disk space availability, or SNMP events.
watch A collection of services which apply to a particu-
lar group.
OPERATION
When the mon scheduler starts, it reads a configuration
file to determine the services it needs to monitor. The
configuration file defaults to /etc/mon/mon.cf if the
/etc/mon directory exists, otherwise to /etc/mon.cf, and
can be specified using the -c parameter.
The scheduler enters a loop which handles client connec-
tions, monitor invocations, and failure alerts. Each ser-
vice has a timer, specified in the configuration file as
the interval variable, which tells the scheduler how
frequently to invoke a monitor process. The scheduler may be
temporarily stopped. While it is stopped, client access
still functions, but nothing is scheduled. This
is useful when resetting the server, because you can
save the hosts and services which
are disabled, reset the server with the scheduler stopped,
re-disable those hosts and services, and then start the
scheduler. It also allows making atomic changes across
several client connections. See the moncmd man page for
more information.
MONITOR PROGRAMS
Monitor processes are invoked with the arguments specified
in the configuration file, followed by the hosts from the
applicable host group. For example, if the watch group is
"servers", which contains the hostnames "smtp", "nntp", and
"ns", and the monitor line reads as follows,
monitor fping.monitor -t 4000 -r 2
then the executable "fping.monitor" will be executed with
these parameters:
MONITOR_DIR/fping.monitor -t 4000 -r 2 smtp nntp ns
MONITOR_DIR is /usr/lib/mon/mon.d, or the path specified
by the -s option. If all hosts in the hostgroup have been
disabled, then a warning is sent to syslog and the monitor
is not run. This behavior may be overridden with the
"allow_empty_group" option in the service definition. If
the final argument to the "monitor" line is ";;" (it must
be preceded by whitespace), then the host list will not be
appended to the parameter list.
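The way the argument list is assembled can be sketched in Python. This is
only an illustration of the rules described above, not mon's own code: the
function name build_monitor_argv and its signature are hypothetical, and
shlex is used as a stand-in for the shell-like quoting that mon applies.

```python
import shlex


def build_monitor_argv(monitor_dir, monitor_line, hosts):
    """Build the argument vector for one "monitor" line (hypothetical helper)."""
    # Split with shell-like quoting rules.
    words = shlex.split(monitor_line)
    if words and words[0] == "monitor":
        words = words[1:]  # drop the keyword itself

    append_hosts = True
    if words and words[-1] == ";;":
        # A trailing ";;" word means the host list is not appended.
        words = words[:-1]
        append_hosts = False
    if not words:
        raise ValueError("monitor line names no program")

    argv = [monitor_dir.rstrip("/") + "/" + words[0]] + words[1:]
    if append_hosts:
        argv.extend(hosts)
    return argv


print(build_monitor_argv("/usr/lib/mon/mon.d",
                         "monitor fping.monitor -t 4000 -r 2",
                         ["smtp", "nntp", "ns"]))
```

Passing the same line with a trailing " ;;" returns the program and its
options only, without the three host names.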
In addition to environment variables defined by the user
in the service definition, mon passes certain variables to
the monitor process.
MON_LAST_SUMMARY
The first line of the output from the last time the
monitor exited.
MON_LAST_OUTPUT
The entire output of the monitor from the last time
it exited.
MON_LAST_FAILURE
The time(2) of the last failure for this service.
MON_FIRST_FAILURE
The time(2) of the first time this service failed.
MON_LAST_SUCCESS
The time(2) of the last time this service passed.
MON_ALERTTYPE
Has one of the following values: "failure", "up",
"startup", "trap", or "traptimeout", and signifies
the type of alert which was triggered. This environment
variable is meant to supersede the "-u"
command-line parameter passed to alert scripts.
MON_DESCRIPTION
The description of this service, as defined in the
configuration file using the description tag.
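MON_RETVAL
The exit value of the failed monitor program, or the
return value as accepted from a trap.
A monitor program must exit with a status of zero if it
completed successfully (found no problems), or nonzero if
a problem was detected. The first line of output from the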
monitor script has a special meaning: it is used as a
brief summary of the exact failure which was detected, and
is passed to the alert program. All remaining output is
also passed to the alert program, but it has no required
interpretation.
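A monitor that follows these conventions might be structured as in the
Python sketch below. It is an assumption-laden illustration rather than a
monitor shipped with mon: run_checks and tcp_check are hypothetical names,
and putting the failed host names on the summary line is a choice made for
this example.

```python
import socket


def tcp_check(host, port=25, timeout=5.0):
    """Hypothetical probe: try a TCP connection, return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        return False, str(exc)


def run_checks(hosts, check):
    """Return (exit_status, output_lines) for a list of hosts.

    Exit status is 0 when every host passes and 1 otherwise. The first
    output line is the brief summary; the rest is free-form detail.
    """
    failed = []
    details = []
    for host in hosts:
        ok, detail = check(host)
        if not ok:
            failed.append(host)
            details.append(host + ": " + detail)
    if not failed:
        return 0, []
    return 1, [" ".join(failed)] + details


# A real script would call run_checks(sys.argv[1:], tcp_check), print
# each output line, and pass the status to sys.exit().
status, lines = run_checks(["smtp", "ns"], lambda h: (h != "ns", "no answer"))
print(status, lines)
```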
If a monitor for a particular service is still running,
and the time comes for mon to run another monitor for that
service, it will not start another monitor. For example,
if the interval is 10s, and the monitor does not finish
running within 10 seconds, then mon will wait until the
first monitor exits before running another one.
Upon a nonzero exit status, the associated alert program
is started, subject to the following criteria: If an alert
for a specific service is disabled, do not send an alert.
If the failure is not within the specified period, record the
failure via syslog(3) and do not send an alert. If the
failure falls within the defined period, and an alert was
already sent within the last alertevery interval, do not
send another alert, unless the output from the current
monitor run differs from the output of the previous run.
Otherwise, send an alert using each alert program listed
for that period.
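The criteria can be read as a short decision function. The Python below is
a sketch of that reading only. The name should_alert, its parameters, and
the use of None for "no alertevery set" or "no alert sent yet" are all
assumptions made for the example, and times are plain seconds.

```python
def should_alert(alert_disabled, in_period, now, last_alert_time,
                 alertevery, output, last_output):
    """Decide whether a failure should trigger the period's alerts."""
    if alert_disabled:
        return False
    if not in_period:
        # The caller would record the failure via syslog here.
        return False
    sent_recently = (
        alertevery is not None
        and last_alert_time is not None
        and now - last_alert_time < alertevery
    )
    if sent_recently and output == last_output:
        # Same failure output inside the alertevery interval: stay quiet.
        return False
    return True


# Same output, 50 seconds after the last alert, alertevery of one hour.
print(should_alert(False, True, 100, 50, 3600, "ns down", "ns down"))
```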
ALERT PROGRAMS
Alert programs are found in the path supplied with the -a
parameter, or in the /usr/lib/mon/alert.d directory if not
specified. They are invoked with the following command-
line parameters:
-s service
Service tag from the configuration file.
-g group
Host group name from the configuration file.
-h hosts
The expanded version of the host group, space
delimited, but contained in one shell "word".
-t secs
The number of seconds left before another alert
will be sent out.
-u This option is supplied to an alert only if it is
being called as an upalert.
The remaining arguments are those supplied as
parameters in the configuration file, after the "alert"
service parameter.
As with monitor programs, alert programs are invoked with
environment variables defined by the user in the service
definition, in addition to the following which are explic-
itly set by the server:
MON_LAST_SUMMARY
The first line of the output from the last time the
monitor exited.
MON_LAST_OUTPUT
The entire output of the monitor from the last time
it exited.
MON_LAST_FAILURE
The time(2) of the last failure for this service.
MON_FIRST_FAILURE
The time(2) of the first time this service failed.
MON_LAST_SUCCESS
The time(2) of the last time this service passed.
MON_DESCRIPTION
The description of this service, as defined in the
configuration file using the description tag.
The first line from standard input must be used as a brief
summary of the problem, normally supplied as the subject
line of an email, or text sent to an alphanumeric pager.
Interpretation of all subsequent lines read from stdin is
left up to the monitoring program. The usual parameters
are a list of recipients to deliver the notification to.
The interpretation of the recipients is not specified, and
is up to the alert program.
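As an illustration of the calling convention, the following Python sketch
separates the options set by the scheduler from the trailing parameters and
splits standard input into summary and detail. It is not one of the alert
scripts distributed with mon. The function name and the returned dictionary
keys are hypothetical, and treating the trailing parameters as recipients
is just the usual case mentioned above.

```python
import getopt


def parse_alert_invocation(argv, stdin_text):
    """Split an alert's arguments and input into named pieces."""
    # -s, -g, -h, -t and -l take a value; -u is a bare flag.
    opts, extra = getopt.getopt(argv, "s:g:h:t:l:u")
    values = dict(opts)
    lines = stdin_text.splitlines()
    return {
        "service": values.get("-s"),
        "group": values.get("-g"),
        # The hosts arrive space delimited in a single shell word.
        "hosts": values.get("-h", "").split(),
        "is_upalert": "-u" in values,
        "recipients": extra,
        "summary": lines[0] if lines else "",
        "detail": lines[1:],
    }


info = parse_alert_invocation(
    ["-s", "smtp", "-g", "servers", "-h", "smtp nntp ns", "mis"],
    "ns\nns: no answer\n",
)
print(info["summary"], info["recipients"])
```

A real alert script would pass sys.argv[1:] and sys.stdin.read() and then
deliver the summary to each recipient.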
CONFIGURATION FILE
The configuration file consists of zero or more hostgroup
definitions, and one or more watch definitions. Each watch
definition may have one or more service definitions. A
line beginning with optional leading whitespace and a
pound ("#") is regarded as a comment, and is ignored.
The following variables may be set to override compiled-in
defaults. Command-line options will have a higher prece-
dence than these definitions.
alertdir = dir
dir is the full path to the alert scripts. This is
the value set by the -a command-line parameter.
Multiple alert paths may be specified by separating
them with a colon. All paths must be absolute.
When the configuration file is read, all alerts
referenced from the configuration will be looked up
in each of these paths, and the full path to the
first instance of the alert found is stored in a
hash. This hash is only generated upon startup or
after a "reset" command, so newly added alert
scripts will not be recognized until a "reset" is
performed.
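The lookup described here amounts to a first-match search over the
colon-separated path. A minimal Python sketch follows, assuming nothing
about mon's internals: build_script_table is a hypothetical name, and the
exists parameter is there only so the search can be exercised without
touching the filesystem.

```python
import os


def build_script_table(names, search_path, exists=os.path.isfile):
    """Map each script name to the first directory entry that holds it."""
    table = {}
    for name in names:
        for directory in search_path.split(":"):
            candidate = os.path.join(directory, name)
            if exists(candidate):
                table[name] = candidate
                break  # the first instance found wins
    return table


# Pretend that only these two files exist.
present = {"/usr/local/alert.d/mail.alert", "/usr/lib/mon/alert.d/mail.alert"}
print(build_script_table(["mail.alert", "page.alert"],
                         "/usr/local/alert.d:/usr/lib/mon/alert.d",
                         exists=present.__contains__))
```

Because the table is built once, a script added later is not seen until
the table is rebuilt, which matches the need for a "reset".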
mondir = dir
dir is the full path to the monitor scripts. This
value may also be set by the -s command-line param-
eter.
Multiple monitor paths may be specified by separating
them with a colon. All paths must be absolute.
When the configuration file is read, all monitors
referenced from the configuration will be looked up
in each of these paths, and the full path to the
first instance of the monitor found is stored in a
hash. This hash is only generated upon startup or
after a "reset" command, so newly added monitor
scripts will not be recognized until a "reset" is
performed.
statedir = dir
dir is the full path to the state directory. mon
uses this directory to save various state informa-
tion.
logdir = dir
dir is the full path to the log directory. mon
uses this directory to save various logs, including
the downtime log.
basedir = dir
dir is the full path for the state, script, and
alert directory.
authfile = file
file is the full path to the authentication file.
authtype = type
type is the type of authentication to use. If type
is getpwnam, then the standard Unix passwd file
authentication method will be used (calls getpw-
nam(3) on the user and compares the crypt(3)ed ver-
sion of the password with what it gets from getpw-
nam). This will not work if shadow passwords are
enabled on the system.
If type is userfile, then usernames and hashed
passwords are read from userfile, which is defined
via the userfile configuration variable.
If type is shadow, then shadow passwords may be used
(NOT IMPLEMENTED).
userfile = file
This file is used when authtype is set to userfile.
It consists of a sequence of lines of the format
'username : password'. password is stored as the
hash returned by the standard Unix crypt(3) func-
tion.
Blank lines and lines beginning with # are ignored.
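Reading such a file is straightforward. The Python sketch below only parses
the 'username : password' lines and leaves the hash untouched. It does not
verify passwords, parse_userfile is a hypothetical name, and splitting at
the first colon is an assumption about how the line is divided.

```python
def parse_userfile(text):
    """Return a dict of username -> stored crypt(3) hash."""
    users = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        name, sep, hashed = line.partition(":")
        if not sep:
            continue  # not a 'username : password' line
        users[name.strip()] = hashed.strip()
    return users


sample = "# mon users\n\nroot : abJnggxhB/yWI\nadmin:xy1234567890a\n"
print(sorted(parse_userfile(sample)))
```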
snmpport = portnum
Set the SNMP port that the server binds to.
use SNMP
Turn on SNMP support.
dtlogfile = file
file is a file which will be used to record the
downtime log. Whenever a service fails for some
amount of time and then stops failing, this event is
written to the log. If this parameter is not set,
no logging is done. The format of the file is as
follows (# is a comment and may be ignored):
timenoticed group service firstfail downtime interval summary
timenoticed is the time(2) the service came back
up.
group service is the group and service which
failed.
firstfail is the time(2) when the service began to
fail.
downtime is the number of seconds the service
failed.
interval is the frequency (in seconds) that the
service is polled.
summary is the summary line from when the service
was failing.
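A reader for the downtime log could look like the Python sketch below. It
assumes the fields appear on one line, separated by whitespace, in the
order in which they are described above, with the summary taking the rest
of the line. That layout, the numeric types, and the names DowntimeRecord
and parse_dtlog_line are assumptions for illustration.

```python
from typing import NamedTuple


class DowntimeRecord(NamedTuple):
    timenoticed: int
    group: str
    service: str
    firstfail: int
    downtime: float
    interval: float
    summary: str


def parse_dtlog_line(line):
    """Parse one downtime log line; return None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Six leading fields, then the summary as the remainder of the line.
    fields = line.split(None, 6)
    if len(fields) < 6:
        raise ValueError("too few fields: " + line)
    summary = fields[6] if len(fields) == 7 else ""
    return DowntimeRecord(int(fields[0]), fields[1], fields[2],
                          int(fields[3]), float(fields[4]),
                          float(fields[5]), summary)


print(parse_dtlog_line("900000300 servers smtp 900000000 300 60 smtp unreachable"))
```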
dtlogging = yes/no
Turns downtime logging on or off. The default is
off.
histlength = num
num is the maximum number of events to be
retained in the history list. The default is 100. This
value may also be set by the -k command-line param-
eter.
serverport = port
port is the TCP port number that the server should
bind to. This value may also be set by the -p com-
mand-line parameter. Normally this port is looked
up via getservbyname(3), and it defaults to 2583.
trapport = port
port is the UDP port number that the trap server
should bind to. Normally this port is looked up
via getservbyname(3), and it defaults to 2583.
pidfile = path
path is the file the server will store its pid in.
This value may also be set by the -P command-line
parameter.
maxprocs = num
Throttles the number of concurrently forked processes
to num. The intent is to provide a safety
net for the unlikely situation when the server
tries to take on too many tasks at once. Note that
this situation has only been reported to happen
when trying to use a garbled configuration file!
You don't want to use a garbled configuration file
now, do you?
cltimeout = secs
Sets the client inactivity timeout to secs. This
is meant to help thwart denial of service attacks
or recover from crashed clients. secs is inter-
preted as a "1h/1m/1s" string, where "1m" = 60 sec-
onds.
randstart = secs
When the server starts, normally no service is
scheduled until the interval defined in the
respective service section has elapsed. This can cause long
delays before the first check of a service, and
possibly a high load on the server if multiple
things are scheduled at the same intervals. This
option is used to randomize the scheduling of the
first test for all services during the startup
period, and immediately after the reset command.
If randstart is defined, the scheduled run time of
all services of all watch groups will be a random
number between zero and randstart seconds.
Hostgroup Entries
Hostgroup entries begin with the keyword hostgroup, and
are followed by a hostgroup tag and one or more hostnames
or IP addresses, separated by whitespace. The hostgroup
tag must be composed of alphanumeric characters, a dash
("-"), a period ("."), or an underscore ("_"). Non-blank
lines following the first hostgroup line are interpreted
as more hostnames. The hostgroup definition ends with a
blank line. For example:
hostgroup servers nameserver smtpserver nntpserver
    nfsserver httpserver smbserver

hostgroup router_group cisco7000 agsplus
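The hostgroup rules can be expressed as a small parser. The Python below is
a sketch under stated assumptions, not the parser used by mon:
parse_hostgroups is a hypothetical name, comment lines are simply skipped,
and a blank line is what ends a definition.

```python
import re

# Tag characters allowed by the description above.
TAG_RE = re.compile(r"[A-Za-z0-9._-]+")


def parse_hostgroups(text):
    """Return a dict of hostgroup tag -> list of hosts."""
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue
        if not line:
            current = None  # a blank line ends the definition
            continue
        words = line.split()
        if words[0] == "hostgroup":
            if len(words) < 2 or not TAG_RE.fullmatch(words[1]):
                raise ValueError("bad hostgroup line: " + line)
            current = words[1]
            groups[current] = words[2:]
        elif current is not None:
            # Non-blank lines after the first are more hostnames.
            groups[current].extend(words)
    return groups


config = (
    "hostgroup servers nameserver smtpserver nntpserver\n"
    "    nfsserver httpserver smbserver\n"
    "\n"
    "hostgroup router_group cisco7000 agsplus\n"
)
print(parse_hostgroups(config))
```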
Watch Group Entries
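Watch entries begin with a line that starts with the keyword
watch, followed by whitespace and a single word which
normally refers to a pre-defined hostgroup. If that word
is not recognized as a hostgroup tag, a new hostgroup
is created whose tag is that word, and that word is its
only member.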
Watch entries consist of one or more service definitions.
Service Definitions
service servicename
A service definition begins with the keyword service
followed by a word which is the tag for this
service.
The components of a service are an interval, moni-
tor, and one or more time period definitions, as
defined below.
interval timeval
The keyword interval followed by a time value spec-
ifies the frequency that a monitor script will be
triggered. Time values are defined as "30s", "5m",
"1h", or "1d", meaning 30 seconds, 5 minutes, 1
hour, or 1 day. The numeric portion may be a fraction,
such as "1.5h" for an hour and a half. This
format of a time specification will be referred to
as timeval.
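Converting a timeval to seconds is a matter of a number and a unit suffix.
The Python sketch below covers only the forms shown above. The function
name parse_timeval is hypothetical, and rejecting a bare number without a
suffix is an assumption of this example rather than documented behavior.

```python
import re

SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_timeval(text):
    """Convert a timeval such as "30s" or "1.5h" to seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([smhd])", text.strip())
    if match is None:
        raise ValueError("not a timeval: " + repr(text))
    return float(match.group(1)) * SECONDS_PER_UNIT[match.group(2)]


for sample in ("30s", "5m", "1h", "1d", "1.5h"):
    print(sample, parse_timeval(sample))
```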
traptimeout timeval
This keyword takes the same time specification
argument as interval, and makes the service expect
a trap from an external source at least that often,
else a failure will be registered. This is used for
a heartbeat-style service.
trapduration timeval
If a trap is received, the status of the service
the trap was delivered to will normally remain con-
stant. If trapduration is specified, the status of
the service will remain in a failure state for the
duration specified by timeval, and then it will be
reset to "success".
randskew timeval
Rather than schedule the monitor script to run at
the start of each interval, randomly adjust the
interval specified by the interval parameter by
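plus-or-minus randskew. The skew value is specified
in the same way as the interval parameter: "30s", "5m", etc.
For example, if interval is 1m and randskew is "5s", then
mon will schedule the monitor script to run some time
between every 55 and 65 seconds.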
The intent is to help distribute the load on the
server when many services are scheduled at the same
intervals.
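The effect of randskew on a single scheduling step can be sketched in
Python as below. The uniform distribution, the function name next_interval,
and the sample values of 60 and 5 seconds are assumptions chosen for
illustration. The text above only states that the interval is adjusted by
plus-or-minus randskew.

```python
import random


def next_interval(interval, randskew, rng=random):
    """Return the interval adjusted by a random skew, in seconds."""
    # Pick a value in [interval - randskew, interval + randskew].
    return rng.uniform(interval - randskew, interval + randskew)


rng = random.Random(1)  # seeded only to make the demonstration repeatable
samples = [next_interval(60, 5, rng) for _ in range(5)]
print(all(55 <= value <= 65 for value in samples))
```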
monitor monitor-name [arg...]
The keyword monitor followed by a script name and
arguments specifies the monitor to run when the
timer expires. Shell-like quoting conventions are
followed when specifying the arguments to send to
the monitor script. The script is invoked from the
directory given with the -s argument, and all fol-
lowing words are supplied as arguments to the moni-
tor program, followed by the list of hosts in the
group referred to by the current watch group. If
the monitor line ends with ";;" as a separate word,
the host groups are not appended to the argument
list when the program is invoked.
allow_empty_group
The allow_empty_group option will allow a monitor
to be invoked even when the hostgroup for that
watch is empty because of disabled hosts. The
default behavior is not to invoke the monitor when
all hosts in a hostgroup have been disabled.
description descriptiontext
The text following description is queried by client
programs, passed to alerts and monitors via an
environment variable. It should contain a brief
description of the service, suitable for inclusion
in an email or on a web page.
depend dependexpression
The depend keyword is used to specify an expression
to be evaluated before this service is scheduled to
run. If the expression evaluates to zero (or unde-
fined), then this service is set to an undefined
operational status and is not run. This can be used
to control alerts for services which are dependent
on other services, e.g. an SMTP test which is
dependent upon the machine being ping-reachable.
Dependencies are actual Perl expressions, and must
obey all syntactical rules. If a syntax error is
found when evaluating the expression, it is logged
via syslog. Before evaluation, mon performs the
following substitutions on the expression: phrases
which look like "group:service" are substituted
with the value of the current operational status of
that service. These substitutions are computed
recursively, so if service A
depends upon service B, and service B depends upon
service C, then service A depends upon service C.
Period Definitions
Periods are used to define the conditions which should
allow alerts to be delivered.
period [label:] periodspec
A period groups one or more alarms and variables
which control how often an alert happens when there
is a failure. The period keyword has two forms.
The first takes an argument which is a period spec-
ification from Patrick Ryan's Time::Period Perl 5
module. Refer to "perldoc Time::Period" for more
information.
The second form requires a label followed by a
period specification, as defined above. The label
is a tag consisting of an alphabetic character or
underscore followed by zero or more alphanumerics
or underscores and ending with a colon. This form
allows multiple periods with the same period defi-
nition. One use is to have a period definition
which has no alertafter or alertevery parameters
for a particular time period, and another for the
same time period with a different set of alerts
that does contain those parameters.
alertevery timeval
The alertevery keyword (within a period definition)
takes the same type of argument as the interval
variable, and limits the number of times an alert
is sent when the service continues to fail. For
example, if alertevery is "1h", then the
alerts in the period section will be triggered only
once every hour. If the alertevery keyword is omitted
in a period entry, an alert will be sent out
every time a failure is detected. By default, if
the output of two successive failures changes, then
the alertevery interval is overridden. If the word
"summary" is the last argument, then only the sum-
mary output lines will be considered when comparing
the output of successive failures.
alertafter num timeval
The alertafter keyword (within a period section)
takes a positive integer followed by an interval, as
described by the interval variable
above. If this parameter is specified, then the
alerts for that period will only be called after
that many failures happen within that interval. For
example, if alertafter is given the arguments
"3 30m", then the alert will be called if 3 fail-
ures happen within 30 minutes.
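One way to picture the alertafter rule is a sliding window over failure
times. The Python sketch below is such a picture and nothing more:
FailureWindow is a hypothetical class, times are plain seconds, and a
failure exactly at the edge of the window is counted as inside it.

```python
from collections import deque


class FailureWindow:
    """Track failures and report when num fall within window seconds."""

    def __init__(self, num, window):
        self.num = num
        self.window = window
        self.times = deque()

    def record_failure(self, now):
        """Record a failure at time now; return True if alerts are due."""
        self.times.append(now)
        # Forget failures that are older than the window.
        while now - self.times[0] > self.window:
            self.times.popleft()
        return len(self.times) >= self.num


# "3 30m": three failures within 1800 seconds.
tracker = FailureWindow(3, 1800)
print([tracker.record_failure(t) for t in (0, 600, 1200)])
```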
alert alert [arg...]
A period may contain multiple alerts, which are
triggered upon failure of the service. An alert is
specified with the alert keyword, followed by an
optional exit parameter, and arguments which are
interpreted the same as the monitor definition, but
without the ";;" exception. The exit parameter
takes the form of exit=x or exit=x-y and has the
effect that the alert is only called if the exit
status of the monitor script falls within the range
of the exit parameter. If, for example, the alert
line is "alert exit=10-20 mail.alert mis", then mail.alert
will only be invoked with mis as its argument
if the monitor program's exit value is
between 10 and 20. This feature allows you to trig-
ger different alerts at different severity levels
(like when free disk space goes from 8% to 3%).
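The exit parameter can be matched against a monitor's exit status as in the
Python sketch below. The helper names are hypothetical, alert_words stands
for the words that follow the alert keyword, and treating the range as
inclusive at both ends is an assumption.

```python
def parse_exit_spec(word):
    """Return (low, high) for "exit=x" or "exit=x-y", else None."""
    prefix = "exit="
    if not word.startswith(prefix):
        return None
    low_text, sep, high_text = word[len(prefix):].partition("-")
    low = int(low_text)
    high = int(high_text) if sep else low
    return low, high


def alert_applies(alert_words, exit_status):
    """Decide whether an alert line applies to a monitor exit status."""
    spec = parse_exit_spec(alert_words[0]) if alert_words else None
    if spec is None:
        return True  # no exit parameter: the alert always applies
    low, high = spec
    return low <= exit_status <= high


words = "exit=10-20 mail.alert mis".split()
print(alert_applies(words, 15), alert_applies(words, 21))
```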
Alert programs are invoked with the following com-
mand-line parameters:
-s service name
The service tag for this failure.
-g hostgroup
The tag of the host group for this service.
-h hostgroup expansion
All of the members in this hostgroup.
-t time
The time (in time(2) format) of when this failure
condition was detected.
-l alertevery
The number of seconds until the next alarm will be
sent.
-u This option is supplied to an alert only if it is
being called as an upalert.
upalert alert [arg...]
An upalert is called when a service makes the state
transition from failure to success. The upalert
script is called supplying the same parameters as
the alert script, with the addition of the -u
parameter which is simply used to let an alert
script know that it is being called as an upalert.
Multiple upalerts may be specified for each period
definition.
startupalert alert [arg...]
A startupalert is only called when the mon server
starts execution.
upalertafter timeval
The upalertafter parameter is specified as a string
that follows the syntax of the interval parameter
("30s", "1m", etc.), and controls the triggering of
an upalert. If a service comes back up after being
down for a time greater than or equal to the value
of this option, an upalert will be called. Use this
option to prevent upalerts from being called because of
"blips" (brief outages).
AUTHENTICATION CONFIGURATION FILE
The file specified by the authfile variable in the config-
uration file (or passed via the -A parameter) will be
loaded upon startup. This file defines restrictions upon
which client commands may be executed by which users. It
is a text file which consists of comments and command def-
initions. A comment line begins with optional whitespace
followed by a pound sign. Blank lines are ignored. A command
definition consists of a command, followed by a
colon, followed by a comma-separated list of users who may
execute the command. The default is that no users may
execute any commands.
An example configuration file:
list: all
reset: root,admin
loadstate: root
savestate: root
This means that all clients are able to perform the list
command, "root" is able to perform "reset", "loadstate", and
"savestate", and "admin" is able to execute the "reset"
command.
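The permission check implied by this file can be sketched in Python. The
functions parse_authfile and may_execute are hypothetical, and reading
"all" as "every user" follows the explanation of the example above rather
than any inspection of mon's code.

```python
def parse_authfile(text):
    """Return a dict of command -> set of users allowed to run it."""
    permissions = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        command, sep, users = line.partition(":")
        if not sep:
            continue
        permissions[command.strip()] = {
            user.strip() for user in users.split(",") if user.strip()
        }
    return permissions


def may_execute(permissions, user, command):
    """Commands that are not listed may be executed by nobody."""
    allowed = permissions.get(command, set())
    return "all" in allowed or user in allowed


auth = parse_authfile(
    "list: all\nreset: root,admin\nloadstate: root\nsavestate: root\n"
)
print(may_execute(auth, "admin", "reset"), may_execute(auth, "admin", "loadstate"))
```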
CLIENT-SERVER INTERFACE
Commands are a single
line each, terminated by a newline. Currently the server
is iterative, accepting a single client at a time. This
will change in future releases.
CLIENT INTERFACE COMMANDS
See manual page for moncmd.
EXAMPLES
The mon distribution comes with an example configuration
called example.cf. Refer to that file for more informa-
tion.
SEE ALSO
moncmd(1), Time::Period(3pm)
HISTORY
mon was written because I couldn't find anything out there
that did just what I needed, and nothing was worth modify-
ing to add the features I wanted. It doesn't have a cool
name, and that bothers me because I couldn't think of one.
BUGS
Report bugs to the email address below.
AUTHOR
Jim Trocki <trockij@transmeta.com>