NAME

       mon - monitor services for availability, sending alarms
       upon failures.


SYNOPSIS

       mon [-dfhlSv] [-a dir] [-A authfile] [-b dir] [-B dir] [-c
       config]  [-D dir] [-i secs] [-k num] [-L dir] [-m num] [-p
       num] [-P pidfile] [-r delay] [-s dir]


DESCRIPTION

       mon is a general-purpose scheduler for monitoring  service
       availability  and  triggering  alerts upon detecting fail-
       ures.  mon was designed to be open in the  sense  that  it
       supports arbitrary monitoring facilities and alert methods
       via a  common  interface,  which  are  easily  implemented
       through  shell scripts, Perl scripts, C, or any other lan-
       guage.



OPTIONS

       -a dir Path    to    alert     scripts.     Default     is
              /usr/lib/mon/alert.d.   Multiple alert paths may be
              specified by separating  them  with  a  colon.  All
              paths must be absolute.

       -b dir Base  directory  for  mon. scriptdir, alertdir, and
              statedir are all relative to this directory  unless
              specified from /.  Default is /usr/lib/mon.

       -B dir Configuration file base directory. All config files
              are located here,  including  mon.cf,  monusers.cf,
              and auth.cf.

       -A authfile
              Authentication  configuration file. By default this
              is  /etc/mon/auth.cf  if  the  /etc/mon   directory
              exists, or /usr/lib/mon/auth.cf otherwise.

       -c file
              Read configuration from file.  This defaults to
              /etc/mon/mon.cf if the /etc/mon directory
              exists, otherwise to /etc/mon.cf.

       -d     Enable debugging mode.

       -D dir Path  to  state directory.  Default is the first of
              /var/state/mon,          /var/lib/mon,          and
              /usr/lib/mon/state.d which exists.

       -f     Fork  and run as a daemon process. This is the pre-
              ferred way to run mon.

       -h     Print help information.

       -i secs
              Sleep interval, in seconds.  Defaults  to  1.  This
              shouldn't need to be adjusted for any reason.

       -k num Set  log  history  to  a  maximum  of  num entries.
              Defaults to 100.

       -l     Load state from the last  saved  state  file.  Cur-
              rently  the  only supported saved state is disabled
              watches, services, and hosts.

       -L dir Sets the log dir. See also logdir in the configura-
              tion file.

       -m num Set  the  throttle  for  the maximum number of pro-
              cesses to num.

       -p num Make server listen on port num.  This  defaults  to
              32777.

       -S     Start with the scheduler stopped.

       -P pidfile
              Store the server's pid in pidfile. The default is
              the first of /var/run/mon/mon.pid,
              /var/run/mon.pid, and /etc/mon.pid whose directory
              exists.  An empty value tells mon not to use a pid
              file.

       -r delay
              Sets  the  number  of seconds used to randomize the
              startup delay before  each  service  is  scheduled.
              Refer  to the global randstart variable in the con-
              figuration file.

       -s dir Path to monitor scripts. Default is
              /usr/lib/mon/mon.d.  Multiple monitor paths may be
              specified by separating them with a colon. All
              paths must be absolute.

       -v     Print version information.



DEFINITIONS

       monitor
              A  program  which  tests  for  a certain condition,
              returns either true or false, and  optionally  pro-
              duces  output  to  be passed back to the scheduler.
              Common monitors detect host reachability  via  ICMP
              echo messages, or connection to TCP services.

       period A period in time as interpreted by the Time::Period
              module.

       alert  A program invoked by the scheduler to send a
              notification. The scheduler calls upon an alert
              when it detects a failure from a monitor.  An alert
              program accepts a set of command-line arguments
              from the scheduler, in addition to data via
              standard input.

       hostgroup
              A  single host or list of hosts, specified as names
              or IP addresses.

       service
              A collection of parameters used to deal with  moni-
              toring a particular resource which is provided by a
              group. Services are usually  modeled  after  things
              such  as  an  SMTP  server,  ICMP  echo capability,
              server disk space availability, or SNMP events.

       watch  A collection of services which apply to a  particu-
              lar group.


OPERATION

       When  the  mon  scheduler starts, it reads a configuration
       file to determine the services it needs  to  monitor.  The
       configuration  file  defaults  to  /etc/mon.cf, and can be
       specified using the -c parameter.

       The scheduler enters a loop which handles  client  connec-
       tions,  monitor invocations, and failure alerts. Each ser-
       vice has a timer, specified in the configuration  file  as
       the interval variable, which tells the scheduler how
       frequently to invoke a monitor process.  The scheduler may
       be temporarily stopped. While it is stopped, client access
       still functions, but no monitors are scheduled. This is
       useful when resetting the server, because you can save the
       hosts and services which are disabled, reset the server
       with the scheduler stopped, re-disable those hosts and
       services, then start the scheduler. It also allows making
       atomic changes across several client connections.  See the
       moncmd man page for more information.



MONITOR PROGRAMS

       Monitor processes are invoked with the arguments specified
       in the configuration file, followed by the hosts from the
       applicable host group. For example, if the watch group is
       "servers", which contains the hostnames "smtp", "nntp", and
       "ns", and the monitor line reads as follows,

       monitor fping.monitor -t 4000 -r 2

       then the executable "fping.monitor" will be executed from
       MONITOR_DIR with the following parameters:

       MONITOR_DIR/fping.monitor -t 4000 -r 2 smtp nntp ns

       MONITOR_DIR is /usr/lib/mon/mon.d, or the  path  specified
       by the -s option.  If all hosts in the hostgroup have been
       disabled, then a warning is sent to syslog and the monitor
       is  not  run.  This  behavior  may  be overridden with the
       "allow_empty_group" option in the service definition.   If
       the  final argument to the "monitor" line is ";;" (it must
       be preceded by whitespace), then the host list will not be
       appended to the parameter list.

       In addition to environment variables defined by the user
       in the service definition, mon passes certain variables to
       monitor processes:


       MON_LAST_SUMMARY
              The first line of the output from the last time the
              monitor exited.


       MON_LAST_OUTPUT
              The entire output of the monitor from the last time
              it exited.


       MON_LAST_FAILURE
              The time(2) of the last failure for this service.


       MON_FIRST_FAILURE
              The  time(2) of the first time this service failed.


       MON_LAST_SUCCESS
              The time(2) of the last time this service passed.


       MON_ALERTTYPE
              Has one of the following values:  "failure",  "up",
              "startup",  "trap", or "traptimeout", and signifies
              the type of alert which was triggered. This
              environment variable is meant to supersede the "-u"
              command-line parameter passed to alert scripts.


       MON_DESCRIPTION
              The description of this service, as defined in  the
              configuration file using the description tag.


       MON_RETVAL

       A monitor exits with a status of zero if it completed
       successfully (found no problems), or nonzero if a problem
       was detected. The first line of output from the monitor
       script has a special meaning: it is used as a brief
       summary of the exact failure which was detected, and is
       passed to the alert program. All remaining output is also
       passed to the alert program, but it has no required
       interpretation.
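
       As an illustration, the exit-status and output conventions
       described here can be sketched in Python. This is not a
       monitor shipped with mon; the names tcp_probe and
       run_monitor, the port number, and the wording of the detail
       lines are assumptions made for the example.

```python
import socket


def tcp_probe(host, port=25, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds.

    The port and timeout are illustrative defaults, not mon settings.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_monitor(hosts, probe=tcp_probe):
    """Check every host and return (exit_status, output_lines).

    exit_status is 0 when no problem was found, nonzero otherwise.
    The first output line is the brief summary; later lines are
    free-form detail with no required interpretation.
    """
    failed = [h for h in hosts if not probe(h)]
    if not failed:
        return 0, []
    lines = [" ".join(failed)]                      # summary line
    lines += ["%s: connection failed" % h for h in failed]
    return 1, lines
```

       A real monitor would take the host names from the end of
       its command line, print the returned lines to standard
       output, and exit with the returned status.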

       If a monitor for a particular service  is  still  running,
       and the time comes for mon to run another monitor for that
       service, it will not start another monitor.  For  example,
       if  the  interval  is 10s, and the monitor does not finish
       running within 10 seconds, then mon will  wait  until  the
       first monitor exits before running another one.

       Upon  a  nonzero exit status, the associated alert program
       is started, pending the following criteria:  If  an  alert
       for  a specific service is disabled, do not send an alert.
       If an alert is not within the specified period, record the
       failure  via  syslog(3)  and do not send an alert.  If the
       failure falls within the defined period, and an alert  was
       already  sent  within the last alertevery interval, do not
       send another alert, unless the  output  from  the  current
       monitor  program  differs  from  the last monitor process.
       Otherwise, send an alert using each alert  program  listed
       for that period.
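
       The criteria above can be summarized as a small decision
       function. The following Python sketch is only a reading of
       this paragraph, not code taken from mon; the function name
       and all parameter names are hypothetical, and alertafter is
       deliberately left out.

```python
def should_alert(now, alerts_disabled, in_period, last_alert_time,
                 alertevery, output, last_output):
    """Decide whether a failure should trigger the period's alerts.

    now, last_alert_time: times in seconds (last_alert_time may be None)
    alertevery: minimum seconds between alerts, or None if not set
    output, last_output: monitor output of this and the previous failure
    """
    if alerts_disabled:
        return False
    if not in_period:
        # The failure would only be recorded via syslog.
        return False
    if alertevery is not None and last_alert_time is not None:
        recently_alerted = (now - last_alert_time) < alertevery
        # A change in monitor output overrides the alertevery limit.
        if recently_alerted and output == last_output:
            return False
    return True
```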



ALERT PROGRAMS

       Alert  programs are found in the path supplied with the -a
       parameter, or in the /usr/lib/mon/alert.d directory if not
       specified.   They  are invoked with the following command-
       line parameters:


       -s service
              Service tag from the configuration file.

       -g group
              Host group name from the configuration file.

       -h hosts
              The expanded  version  of  the  host  group,  space
              delimited, but contained in one shell "word".

       -t secs
              The  number  of  seconds  left before another alert
              will be sent out.

       -u     This option is supplied to an alert only if  it  is
              being called as an upalert.

       Any remaining arguments are those supplied as trailing
       parameters in the configuration file, after the "alert"
       service parameter.

       As  with monitor programs, alert programs are invoked with
       environment variables defined by the user in  the  service
       definition, in addition to the following which are explic-
       itly set by the server:


       MON_LAST_SUMMARY
              The first line of the output from the last time the
              monitor exited.


       MON_LAST_OUTPUT
              The entire output of the monitor from the last time
              it exited.


       MON_LAST_FAILURE
              The time(2) of the last failure for this service.


       MON_FIRST_FAILURE
              The time(2) of the first time this service  failed.


       MON_LAST_SUCCESS
              The time(2) of the last time this service passed.


       MON_DESCRIPTION
              The  description of this service, as defined in the
              configuration file using the description tag.


       The first line from standard input must be used as a brief
       summary  of  the problem, normally supplied as the subject
       line of an email, or text sent to an  alphanumeric  pager.
       Interpretation  of all subsequent lines read from stdin is
       left up to the monitoring program.  The  usual  parameters
       are  a  list of recipients to deliver the notification to.
       The interpretation of the recipients is not specified, and
       is up to the alert program.
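
       To make the calling convention concrete, here is a minimal
       Python sketch of the argument and input handling an alert
       might perform. It is an assumption-laden example rather
       than one of the alerts distributed with mon: the function
       names and the message layout are invented, the -t and -l
       values are kept as uninterpreted strings, and -l is
       included because it appears in the parameter list given
       with the alert keyword in the service definitions.

```python
import argparse


def parse_alert_args(argv):
    """Parse the options passed to an alert; the rest are recipients."""
    # add_help=False because -h carries the host list here, not help.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-s", dest="service")
    parser.add_argument("-g", dest="group")
    parser.add_argument("-h", dest="hosts", default="")
    parser.add_argument("-t", dest="t_value")
    parser.add_argument("-l", dest="l_value")
    parser.add_argument("-u", dest="upalert", action="store_true")
    parser.add_argument("recipients", nargs="*")
    return parser.parse_args(argv)


def build_message(args, stdin_text, environ):
    """Return (subject, body) for a notification.

    The first line of standard input is the brief summary; the
    remaining lines are passed through unchanged.
    """
    lines = stdin_text.splitlines()
    summary = lines[0] if lines else ""
    kind = "UPALERT" if args.upalert else "ALERT"
    subject = "%s %s/%s: %s" % (kind, args.group, args.service, summary)
    body = [environ.get("MON_DESCRIPTION", "")] + lines[1:]
    return subject, "\n".join(body)
```

       A script built on this would call parse_alert_args with
       sys.argv[1:], read sys.stdin, pass os.environ as environ,
       and deliver the message to args.recipients by whatever
       means it implements.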



CONFIGURATION FILE

       The  configuration file consists of zero or more hostgroup
       definitions, and one or more watch definitions. Each watch
       definition  may  have  one  or more service definitions. A
       line beginning with  optional  leading  whitespace  and  a
       pound ("#") is regarded as a comment, and is ignored.

   Global Variables
       The following variables may be set to override compiled-in
       defaults. Command-line options will have a higher
       precedence than these definitions.


       alertdir = dir
              dir  is the full path to the alert scripts. This is
              the value set by the -a command-line parameter.

              Multiple alert paths may be specified by separating
              them with a colon. All paths must be absolute.

              When  the  configuration  file  is read, all alerts
              referenced from the configuration will be looked up
              in  each  of  these paths, and the full path to the
              first instance of the alert found is  stored  in  a
              hash.  This  hash is only generated upon startup or
              after a  "reset"  command,  so  newly  added  alert
              scripts  will  not be recognized until a "reset" is
              performed.


       mondir = dir
              dir is the full path to the monitor  scripts.  This
              value may also be set by the -s command-line param-
              eter.

              Multiple monitor paths may be specified by
              separating them with a colon. All paths must be
              absolute.

              When  the  configuration file is read, all monitors
              referenced from the configuration will be looked up
              in  each  of  these paths, and the full path to the
              first instance of the monitor found is stored in  a
              hash.  This  hash is only generated upon startup or
              after a "reset" command,  so  newly  added  monitor
              scripts  will  not be recognized until a "reset" is
              performed.



       statedir = dir
              dir is the full path to the state  directory.   mon
              uses  this directory to save various state informa-
              tion.


       logdir = dir
              dir is the full path to  the  log  directory.   mon
              uses this directory to save various logs, including
              the downtime log.


       basedir = dir
              dir is the full path for the state, script, and
              alert directory.


       authfile = file
              file is the full path to the authentication file.


       authtype = type
              type  is the type of authentication to use. If type
              is getpwnam, then the  standard  Unix  passwd  file
              authentication  method  will  be used (calls getpw-
              nam(3) on the user and compares the crypt(3)ed ver-
              sion  of the password with what it gets from getpw-
              nam). This will not work if  shadow  passwords  are
              enabled on the system.

              If  type  is  userfile,  then  usernames and hashed
              passwords are read from userfile, which is  defined
              via the userfile configuration variable.

              If type is shadow, then shadow passwords may be
              used (NOT IMPLEMENTED).



       userfile = file
              This file is used when authtype is set to userfile.
              It  consists  of  a sequence of lines of the format
              'username : password'.  password is stored  as  the
              hash  returned  by the standard Unix crypt(3) func-
              tion.

              Blank lines and lines beginning with # are ignored.
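
              As a rough sketch of the file format just
              described, the following Python function reads
              'username : password' lines into a dictionary. It
              only parses the file and does not verify
              passwords. The function name is hypothetical, and
              tolerating whitespace before a "#" is an
              assumption.

```python
def parse_userfile(text):
    """Return {username: crypt_hash} from userfile contents."""
    users = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # blank line or comment
        name, sep, hashed = line.partition(":")
        if not sep:
            continue                      # no colon: not a user entry
        users[name.strip()] = hashed.strip()
    return users
```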


       snmpport = portnum
              Set the SNMP port that the server binds to.


       use SNMP
              Turn on SNMP support.


       dtlogfile = file
              file is a file which will be used to record the
              downtime log. Whenever a service fails for some
              amount of time and then stops failing, this event
              is written to the log. If this parameter is not
              set, no logging is done. The format of the file is
              as follows (# is a comment and may be ignored):

              timenoticed group service firstfail downtime interval summary


              timenoticed  is  the  time(2) the service came back
              up.

              group  service  is  the  group  and  service  which
              failed.

              firstfail  is the time(2) when the service began to
              fail.

              downtime is  the  number  of  seconds  the  service
              failed.

              interval  is  the  frequency  (in seconds) that the
              service is polled.

              summary is the summary line from when  the  service
              was failing.
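
              A reader for this log might look like the Python
              sketch below. It assumes that each entry is one
              line, that the fields are separated by whitespace
              in the order they are described above, that the
              numeric fields are integers, and that the summary
              is everything after the sixth field. The names
              DowntimeEntry and parse_dtlog_line are invented
              for the example.

```python
from collections import namedtuple

DowntimeEntry = namedtuple(
    "DowntimeEntry",
    "timenoticed group service firstfail downtime interval summary",
)


def parse_dtlog_line(line):
    """Parse one downtime log line; return None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split(None, 6)          # summary may contain spaces
    if len(fields) < 6:
        raise ValueError("too few fields: %r" % line)
    summary = fields[6] if len(fields) > 6 else ""
    return DowntimeEntry(
        timenoticed=int(fields[0]),
        group=fields[1],
        service=fields[2],
        firstfail=int(fields[3]),
        downtime=int(fields[4]),
        interval=int(fields[5]),
        summary=summary,
    )
```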


       dtlogging = yes/no

              Turns  downtime  logging  on or off. The default is
              off.


       histlength = num
              num is the maximum number of events to be
              retained in the history list. The default is 100.
              This value may also be set by the -k command-line
              parameter.


       serverport = port
              port  is the TCP port number that the server should
              bind to. This value may also be set by the -p  com-
              mand-line  parameter.  Normally this port is looked
              up via getservbyname(3), and it defaults to 2583.


       trapport = port
              port is the UDP port number that  the  trap  server
              should  bind  to.   Normally this port is looked up
              via getservbyname(3), and it defaults to 2583.


       pidfile = path
              path is the file the server will store its pid in.
              This  value  may also be set by the -P command-line
              parameter.


       maxprocs = num
              Throttles the number of concurrently forked
              processes to num.  The intent is to provide a safety
              net for the  unlikely  situation  when  the  server
              tries to take on too many tasks at once.  Note that
              this situation has only  been  reported  to  happen
              when  trying  to  use a garbled configuration file!
              You don't want to use a garbled configuration  file
              now, do you?


       cltimeout = secs
              Sets  the  client inactivity timeout to secs.  This
              is meant to help thwart denial of  service  attacks
              or  recover  from  crashed clients.  secs is inter-
              preted as a "1h/1m/1s" string, where "1m" = 60 sec-
              onds.


       randstart = secs
              When  the server starts, normally all services will
              not be scheduled until the interval defined in  the
              respective  service  section.   This can cause long
              delays before the first check  of  a  service,  and
              possibly  a  high  load  on  the server if multiple
              things are scheduled at the same  intervals.   This
              option  is  used to randomize the scheduling of the
              first test for  all  services  during  the  startup
              period,  and  immediately  after the reset command.
              If randstart is defined, the scheduled run time  of
              all  services  of all watch groups will be a random
              number between zero and randstart seconds.
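
              The effect can be pictured with a few lines of
              Python. This is a sketch of the idea only; the
              function name is hypothetical and a uniform
              distribution is assumed.

```python
import random


def initial_delays(services, randstart, rng=random):
    """Map each service name to a random first-run delay in seconds.

    Every delay lies between zero and randstart, which spreads the
    first checks out instead of running them all at the same moment.
    """
    return {name: rng.uniform(0, randstart) for name in services}
```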


   Hostgroup Entries
       Hostgroup entries begin with the  keyword  hostgroup,  and
       are  followed by a hostgroup tag and one or more hostnames
       or IP addresses, separated by  whitespace.  The  hostgroup
       tag  must  be  composed of alphanumeric characters, a dash
       ("-"), a period ("."), or an underscore  ("_").  Non-blank
       lines  following  the first hostgroup line are interpreted
       as more hostnames.  The hostgroup definition ends  with  a
       blank line. For example:

              hostgroup servers nameserver smtpserver nntpserver
                   nfsserver httpserver smbserver

              hostgroup router_group cisco7000 agsplus
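
       The rules above are simple enough to express as a short
       parser. The Python sketch below is an illustration, not
       mon's own parser: the function name is invented, and
       skipping comment lines inside a hostgroup is an
       assumption.

```python
import re

# Alphanumerics, dash, period, and underscore, as described above.
TAG_RE = re.compile(r"^[A-Za-z0-9._-]+$")


def parse_hostgroups(text):
    """Return {hostgroup_tag: [hosts]} from configuration text."""
    groups = {}
    current = None            # host list of the group being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            current = None    # a blank line ends the definition
            continue
        if line.startswith("#"):
            continue
        words = line.split()
        if current is None:
            if words[0] != "hostgroup":
                continue      # some other kind of entry
            if len(words) < 2 or not TAG_RE.match(words[1]):
                raise ValueError("bad hostgroup line: %r" % line)
            current = groups.setdefault(words[1], [])
            current.extend(words[2:])
        else:
            current.extend(words)   # continuation line: more hosts
    return groups
```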


   Watch Group Entries
       Watch entries begin with a line that starts with the
       keyword watch, followed by whitespace and a single word
       which is normally the tag of a hostgroup. If no hostgroup
       with that tag exists, a new hostgroup is created whose tag
       is that word, and that word is its only member.

       Watch entries consist of one or more service  definitions.


   Service Definitions
       service servicename
              A service definition begins with the keyword
              service followed by a word which is the tag for
              this service.

              The  components of a service are an interval, moni-
              tor, and one or more time  period  definitions,  as
              defined below.


       interval timeval
              The keyword interval followed by a time value spec-
              ifies the frequency that a monitor script  will  be
              triggered.  Time values are defined as "30s", "5m",
              "1h", or "1d", meaning 30  seconds,  5  minutes,  1
              hour,  or 1 day. The numeric portion may be a frac-
              tion, such as "1.5h" or an hour and  a  half.  This
              format  of a time specification will be referred to
              as timeval.
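
              For illustration, a timeval can be converted to
              seconds as in the Python sketch below. The
              function name is hypothetical, and rejecting a
              bare number without a unit suffix is an assumption
              about the format, not documented behavior.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
_TIMEVAL_RE = re.compile(r"^(\d+(?:\.\d+)?)([smhd])$")


def timeval_to_seconds(text):
    """Convert a timeval such as "30s", "5m", or "1.5h" to seconds."""
    match = _TIMEVAL_RE.match(text.strip())
    if match is None:
        raise ValueError("not a timeval: %r" % text)
    number, unit = match.groups()
    return float(number) * _UNIT_SECONDS[unit]
```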


       traptimeout timeval
              This keyword  takes  the  same  time  specification
              argument  as interval, and makes the service expect
              a trap from an external source at least that often,
              else a failure will be registered. This is used for
              a heartbeat-style service.


       trapduration timeval
              If a trap is received, the status  of  the  service
              the trap was delivered to will normally remain con-
              stant. If trapduration is specified, the status  of
              the  service will remain in a failure state for the
              duration specified by timeval, and then it will  be
              reset to "success".


       randskew timeval
              Rather than schedule the monitor script to run at
              the start of each interval, randomly adjust the
              interval specified by the interval parameter by
              plus-or-minus randskew.  The skew value is
              specified in the same format as the interval
              parameter: "30s", "5m", etc.
              For example if interval  is  1m,  and  randskew  is
              The  intent  is  to help distribute the load on the
              server when many services are scheduled at the same
              intervals.
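
              In Python terms the adjustment amounts to the
              sketch below. The function name and the uniform
              distribution are assumptions, and both arguments
              are taken to be already converted to seconds.

```python
import random


def skewed_interval(interval, randskew, rng=random):
    """Return the interval adjusted by a random plus-or-minus skew."""
    return interval + rng.uniform(-randskew, randskew)
```

              With a hypothetical interval of 300 seconds and a
              randskew of 30 seconds, each run would be
              scheduled between 270 and 330 seconds after the
              previous one.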


       monitor monitor-name [arg...]
              The  keyword  monitor followed by a script name and
              arguments specifies the monitor  to  run  when  the
              timer  expires.  Shell-like quoting conventions are
              followed when specifying the arguments to  send  to
              the monitor script.  The script is invoked from the
              directory given with the -s argument, and all  fol-
              lowing words are supplied as arguments to the moni-
              tor program, followed by the list of hosts  in  the
              group  referred  to by the current watch group.  If
              the monitor line ends with ";;" as a separate word,
              the  host  groups  are not appended to the argument
              list when the program is invoked.
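
              The way the argument list is assembled can be
              sketched in Python as follows. This is an
              illustration under stated assumptions rather than
              mon's implementation: shlex stands in for the
              "shell-like" quoting, the function name is
              invented, and monitor_line is taken to be the text
              after the monitor keyword.

```python
import shlex


def monitor_argv(monitor_line, hosts, monitor_dir="/usr/lib/mon/mon.d"):
    """Build the argument vector for a monitor invocation.

    A trailing ";;" word suppresses appending the host list.
    """
    words = shlex.split(monitor_line)
    if not words:
        raise ValueError("empty monitor line")
    append_hosts = True
    if words[-1] == ";;":
        words = words[:-1]
        append_hosts = False
    argv = [monitor_dir + "/" + words[0]] + words[1:]
    if append_hosts:
        argv += list(hosts)
    return argv
```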


       allow_empty_group
              The allow_empty_group option will allow  a  monitor
              to  be  invoked  even  when  the hostgroup for that
              watch is  empty  because  of  disabled  hosts.  The
              default  behavior is not to invoke the monitor when
              all hosts in a hostgroup have been disabled.


       description descriptiontext
              The text following description is queried by client
              programs,  passed  to  alerts  and  monitors via an
              environment variable. It  should  contain  a  brief
              description  of the service, suitable for inclusion
              in an email or on a web page.


       depend dependexpression
              The depend keyword is used to specify an expression
              to be evaluated before this service is scheduled to
              run. If the expression evaluates to zero (or  unde-
              fined),  then  this  service is set to an undefined
              operational status and is not run. This can be used
              to  control alerts for services which are dependent
              on other services,  e.g.  an  SMTP  test  which  is
              dependent upon the machine being ping-reachable.

              Dependencies  are actual Perl expressions, and must
              obey all syntactical rules. If a  syntax  error  is
              found  when evaluating the expression, it is logged
              via syslog.  Before evaluation,  mon  performs  the
              following  substitutions on the expression: phrases
              which look like "group:service" are substituted
              with the current operational status of that
              service. These substitutions are computed
              recursively, so if service A
              depends upon service B, and service B depends  upon
              service C, then service A depends upon service C.



   Period Definitions
       Periods  are  used  to  define the conditions which should
       allow alerts to be delivered.


       period [label:] periodspec
              A period groups one or more  alarms  and  variables
              which control how often an alert happens when there
              is a failure.  The period keyword  has  two  forms.
              The first takes an argument which is a period spec-
              ification from Patrick Ryan's Time::Period  Perl  5
              module.  Refer  to  "perldoc Time::Period" for more
              information.

              The second form requires  a  label  followed  by  a
              period  specification,  as defined above. The label
              is a tag consisting of an alphabetic  character  or
              underscore  followed  by zero or more alphanumerics
              or underscores and ending with a colon.  This  form
              allows  multiple periods with the same period defi-
              nition. One use is  to  have  a  period  definition
              which  has  no  alertafter or alertevery parameters
              for a particular time period, and another  for  the
              same  time  period  with  a different set of alerts
              that does contain those parameters.


       alertevery timeval
              The alertevery keyword (within a period definition)
              takes  the  same  type  of argument as the interval
              variable, and limits the number of times  an  alert
              is sent when the service continues to fail.  For
              example, if alertevery is "1h", then the alerts in
              the period section will be triggered only once
              every hour. If the alertevery keyword is omitted
              in a period entry, an alert will be sent out
              every time a failure is detected.  By  default,  if
              the output of two successive failures changes, then
              the alertevery interval is overridden.  If the word
              "summary"  is the last argument, then only the sum-
              mary output lines will be considered when comparing
              the output of successive failures.


       alertafter num timeval
              The alertafter keyword (within a period section)
              takes two arguments: a number of failures, and an
              interval specified as a timeval, as described
              above.  If this parameter is specified, then the
              alerts  for  that  period will only be called after
              that many failures happen within that interval. For
              example,  if  alertafter  is  given  the  arguments
              "3 30m", then the alert will be called if  3  fail-
              ures happen within 30 minutes.
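
              One way to read this rule is as a sliding window
              over recent failure times, as in the Python sketch
              below. The function and parameter names are
              hypothetical, times are in seconds, and treating
              the window boundaries as inclusive is an
              assumption.

```python
def alertafter_reached(failure_times, now, num, window):
    """Return True if at least num failures fall within the window.

    failure_times: times of recorded failures, including the current one
    window: length of the alertafter interval in seconds
    """
    recent = [t for t in failure_times if now - window <= t <= now]
    return len(recent) >= num
```

              For the "3 30m" example, window would be 1800.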


       alert alert [arg...]
              A  period  may  contain  multiple alerts, which are
              triggered upon failure of the service. An alert  is
              specified with the alert keyword, followed by an
              optional exit parameter, and arguments which are
              interpreted the same as the monitor definition, but
              without the ";;" exception. The exit parameter
              takes the form of exit=x or exit=x-y and has the
              effect that the alert is only called if the exit
              status of the monitor script falls within the range
              of the exit parameter. If, for example, the alert
              line is "alert exit=10-20 mail.alert mis", then
              mail.alert will only be invoked with mis as its
              argument if the monitor program's exit value is
              between 10 and 20. This feature allows you to
              trigger different alerts at different severity
              levels (like when free disk space goes from 8% to
              3%).
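
              The exit parameter check can be sketched in Python
              as follows. The function names are invented for
              the example, and the range is assumed to include
              both end values.

```python
def parse_exit_param(word):
    """Return (low, high) for "exit=x" or "exit=x-y", else None."""
    prefix = "exit="
    if not word.startswith(prefix):
        return None
    low_text, sep, high_text = word[len(prefix):].partition("-")
    low = int(low_text)
    high = int(high_text) if sep else low
    return low, high


def alert_applies(alert_words, exit_status):
    """Decide whether an alert line applies to a monitor exit status.

    alert_words: the words following the alert keyword
    """
    exit_range = parse_exit_param(alert_words[0]) if alert_words else None
    if exit_range is None:
        return True           # no exit parameter: any failure qualifies
    low, high = exit_range
    return low <= exit_status <= high
```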

              Alert programs are invoked with the following  com-
              mand-line parameters:


       -s service name
              The service tag for this failure.

       -g hostgroup
              The tag of the host group for this service.

       -h hostgroup expansion
              All of the members in this hostgroup.

       -t time
              The  time  (in time(2) format) of when this failure
              condition was detected.

       -l alertevery
              The number of seconds until the next alarm will  be
              sent.

       -u     This  option  is supplied to an alert only if it is
              being called as an upalert.


       upalert alert [arg...]
              An upalert is called when a service makes the
              transition from failure to success.  The upalert
              script  is  called supplying the same parameters as
              the alert script,  with  the  addition  of  the  -u
              parameter  which  is  simply  used  to let an alert
              script know that it is being called as an  upalert.
              Multiple  upalerts may be specified for each period
              definition.


       startupalert alert [arg...]
              A startupalert is only called when the  mon  server
              starts execution.


       upalertafter timeval
              The upalertafter parameter is specified as a string
              that follows the syntax of the  interval  parameter
              ("30s", "1m", etc.), and controls the triggering of
              an upalert.  If a service comes back up after being
              down  for a time greater than or equal to the value
              of this option, an upalert will be called. Use this
              option to prevent upalerts from being called
              because of "blips" (brief outages).



AUTHENTICATION CONFIGURATION FILE

       The file specified by the authfile variable in the config-
       uration  file  (or  passed  via  the -A parameter) will be
       loaded upon startup.  This file defines restrictions  upon
       which  client  commands may be executed by which users. It
       is a text file which consists of comments and command
       definitions. A comment line begins with optional whitespace
       followed by a pound sign. Blank lines are ignored.  A
       command definition consists of a command, followed by a
       colon, followed by a comma-separated list of users who may
       execute  the  command.   The  default is that no users may
       execute any commands.

       An example configuration file:

              list:          all
              reset:         root,admin
              loadstate:          root
              savestate:          root

       This means that all clients are able to perform  the  list
       command,  "root"  is able to perform "reset", "loadstate",
       "savestate", and "admin" is able to  execute  the  "reset"
       command.
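
       The format and the default-deny rule can be illustrated
       with a short Python sketch. It is not the server's own
       code; the function names are hypothetical, and treating
       "all" as matching every user is inferred from the example
       above.

```python
def parse_authfile(text):
    """Return {command: set_of_users} from auth file contents."""
    permissions = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # blank line or comment
        command, sep, users = line.partition(":")
        if not sep:
            continue                      # not a command definition
        names = {u.strip() for u in users.split(",") if u.strip()}
        permissions[command.strip()] = names
    return permissions


def may_execute(permissions, user, command):
    """Commands not listed in the file are denied to everyone."""
    allowed = permissions.get(command, set())
    return "all" in allowed or user in allowed
```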



CLIENT-SERVER INTERFACE

       Commands sent to the server are a single line each,
       terminated by a newline.  Currently the server
       is iterative, accepting a single client at  a  time.  This
       will change in future releases.



CLIENT INTERFACE COMMANDS

       See manual page for moncmd.



EXAMPLES

       The  mon  distribution comes with an example configuration
       called example.cf.  Refer to that file for  more  informa-
       tion.



SEE ALSO

       moncmd(1), Time::Period(3pm)


HISTORY

       mon was written because I couldn't find anything out there
       that did just what I needed, and nothing was worth modify-
       ing  to  add the features I wanted. It doesn't have a cool
       name, and that bothers me because I couldn't think of one.


BUGS

       Report bugs to the email address below.


AUTHOR

       Jim Trocki <trockij@transmeta.com>























