HotSaNIC

From AdminWiki

Revision as of 15:13, 1 June 2006 by Robe

HotSaNIC monitoring

Overview

HotSaNIC is a monitoring system that collects data with a Perl daemon and uses RRDtool to store the data and draw the graphs. It is very similar to Munin, but it has no client-server mode. Like Munin it can be extended with plugins, but they are more difficult to write.

The advantage of HotSaNIC is its very high resolution: it reads the data every few seconds, which gives very accurate results. This is good for monitoring network traffic, processes and similar fast-changing values.

The disadvantage is the lack of a client-server setup: you have to do a full install of HotSaNIC on every server you want to monitor, which is more work than with Munin.

The current CVS version is recommended, as it contains all the patches needed to work with current kernels. The HotSaNIC daemon should run as root, so that it has no problems with grs or other hardened kernels.

Additional Modules / Changed Modules

NTP Patch to work with current ntp daemon

The ntp module shipped with HotSaNIC works, but misses some data because the output of the NTP daemon has changed. I wrote a small patch to fix this: it replaces the frequency value with jitter and adds broadcastdelay and authdelay.

Patches below

diagrams.pl

--- /usr/local/src/sys/HotSaNIC/modules/ntp/diagrams.pl 2004-06-03 08:25:10.000000000 +0900
+++ /usr/local/HotSaNIC/modules/ntp/diagrams.pl 2006-03-20 10:13:06.447836886 +0900
@@ -33,8 +33,8 @@

 # handle module-specific stuff
 #
-my  @GRAPHS=("stratum","distance","dispersion","frequency","stability");
-my  %LEGENDS=("stratum"=>" ","distance"=>"seconds","dispersion"=>"seconds","frequency"=>"ppm","stability"=>"ppm");
+my  @GRAPHS=("stratum","distance","dispersion","jitter","stability","broadcastdelay","authdelay");
+my  %LEGENDS=("stratum"=>" ","distance"=>"seconds","dispersion"=>"seconds","jitter"=>"seconds","stability"=>"ppm","broadcastdelay"=>"seconds","authdelay"=>"seconds");
 push @OPTIONS,(
   "--alt-autoscale-max");   # alternate scaling

makeindex.pl

--- /usr/local/src/sys/HotSaNIC/modules/ntp/makeindex.pl        2004-06-03 08:25:10.000000000 +0900
+++ /usr/local/HotSaNIC/modules/ntp/makeindex.pl        2006-03-20 10:09:42.065613912 +0900
@@ -29,7 +29,7 @@
 %MODCONFIG=HotSaNICmod::common::configure();
 @DIAGRAMS=("hour","6h","day","week","month","year");
 @TIMES=("6h","week");
-@GRAPHS=("stratum","distance","dispersion","frequency","stability");
+@GRAPHS=("stratum","distance","dispersion","jitter","stability","broadcastdelay","authdelay");

 # build time-based .html files
 #

makerrd

--- /usr/local/src/sys/HotSaNIC/modules/ntp/makerrd     2004-05-18 03:31:58.000000000 +0900
+++ /usr/local/HotSaNIC/modules/ntp/makerrd     2006-03-20 10:06:04.466164389 +0900
@@ -16,9 +16,11 @@
   $BINPATH/rrdtool create rrd/$PROC.rrd --step 10 \
     DS:dispersion:GAUGE:300:0:$MAX \
     DS:distance:GAUGE:300:0:$MAX \
-    DS:frequency:GAUGE:300:-10000:10000 \
+    DS:jitter:GAUGE:300:0:$MAX \
     DS:stability:GAUGE:300:0:$MAX \
     DS:stratum:GAUGE:300:0:20 \
+    DS:broadcastdelay:GAUGE:300:0:$MAX \
+    DS:authdelay:GAUGE:300:0:$MAX \
     RRA:AVERAGE:0:1:720 \
     RRA:AVERAGE:0.3:6:2880 \
     RRA:AVERAGE:0.3:60:2016 \
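
To make the resolution concrete, the time span of each archive can be worked out from the `--step` and `RRA` lines in this hunk: an entry of the form `RRA:CF:xff:steps:rows` stores one row per step × steps seconds and therefore covers step × steps × rows seconds. A small shell sketch of that arithmetic, using only the three RRA lines visible here (the file may define more); the helper name `rra_span` is made up for illustration:

```shell
#!/bin/sh
# Time span of an RRA = base step (s) * primary steps per row * rows.
# rra_span is an illustrative helper, not part of HotSaNIC or rrdtool.
rra_span() {
  step=$1; steps=$2; rows=$3
  echo $(( step * steps * rows ))
}

STEP=10   # from "--step 10" in makerrd

for rra in "1 720" "6 2880" "60 2016"; do
  set -- $rra                      # $1 = steps per row, $2 = rows
  secs=$(rra_span "$STEP" "$1" "$2")
  echo "steps=$1 rows=$2: one row per $(( STEP * $1 ))s, covers ${secs}s ($(( secs / 3600 ))h)"
done
```

So the finest archive keeps a 10-second resolution for two hours, the next one a 1-minute resolution for two days, and the third a 10-minute resolution for two weeks.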

platform/common.pm

--- /usr/local/src/sys/HotSaNIC/modules/ntp/platform/common.pm  2004-06-03 08:25:10.000000000 +0900
+++ /usr/local/HotSaNIC/modules/ntp/platform/common.pm  2006-03-20 09:59:15.837619337 +0900
@@ -13,6 +13,7 @@
     my @list=HotSaNICparser::locate_files("bin/ntpdc");
     if (! @list) { @list=HotSaNICparser::locate_files("bin/xntpdc"); }
     $MODCONF{NTPCOMMAND}=pop @list;
+    chomp $MODCONF{NTPCOMMAND};
     }

   return %MODCONF;

platform/default.pm

--- /usr/local/src/sys/HotSaNIC/modules/ntp/platform/default.pm 2004-07-06 05:40:31.000000000 +0900
+++ /usr/local/HotSaNIC/modules/ntp/platform/default.pm 2006-03-20 10:07:04.107497582 +0900
@@ -24,12 +24,14 @@
       $str=$value if $var eq "stratum";
       $dst=$value if $var eq "root distance";
       $dps=$value if $var eq "root dispersion";
-      $frq=$value if $var eq "frequency";
+      $jtr=$value if $var eq "jitter"; # changed that from frequency
       $stb=$value if $var eq "stability";
+      $bcdly=$value if $var eq "broadcastdelay";
+      $athdly=$value if $var eq "authdelay";
       }
     close FILE;
-
-    HotSaNICmod::do_rrd($dbname,"U",time,$dps,$dst,$frq,$stb,$str);
+    HotSaNICmod::do_rrd($dbname,"U",time,$dps,$dst,$jtr,$stb,$str,$bcdly,$athdly);
     }
   }

After patching the files, if an rrd file already exists you need to remove it, because the patch changes the data sources stored in it (yes, you lose all data). You also need to run makeindex.pl again to regenerate the HTML files.
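The clean-up could look roughly like this. It is a sketch under assumptions: the module's databases are taken to live in an `rrd/` directory below the module directory (makerrd creates `rrd/$PROC.rrd`), and the function name `retire_rrds` is made up. Instead of deleting the files it moves them aside, which has the same effect for HotSaNIC but keeps a copy of the old data around.

```shell
#!/bin/sh
# Move every .rrd file of one module out of the way so that the database
# can be recreated with the new layout.
# retire_rrds is an illustrative name, not a HotSaNIC tool.
retire_rrds() {
  moddir=$1                                   # e.g. /usr/local/HotSaNIC/modules/ntp
  backup="$moddir/rrd.old.$(date +%Y%m%d%H%M%S)"
  [ -d "$moddir/rrd" ] || return 0            # no rrd directory, nothing to do
  mkdir -p "$backup" || return 1
  for f in "$moddir"/rrd/*.rrd; do
    [ -e "$f" ] || continue                   # glob matched nothing
    mv "$f" "$backup/" || return 1
  done
  echo "$backup"                              # tell the caller where the copies are
}

# Usage (assumed path, adjust to your install):
#   retire_rrds /usr/local/HotSaNIC/modules/ntp
# Afterwards run makeindex.pl again to regenerate the HTML files.
```

It is probably safest to stop the HotSaNIC daemon while doing this, so that it does not write to a file that is being moved.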

Add more detailed error legend in Postfix queue log

The default Postfix queue graph already shows a nice list of errors, but some of them can be broken down in more detail, especially the 4xx errors.

Below are some patches that add this to the vanilla mailq module, without the multiple Postfix patch. The multiple Postfix patch already includes this change.

diagrams.pl

--- /usr/local/src/HotSaNIC/modules/mailq/diagrams.pl   2004-05-24 08:53:39.000000000 +0900
+++ /usr/local/HotSaNIC/modules/mailq/diagrams.pl       2006-03-23 12:53:27.843497000 +0900
@@ -113,8 +113,12 @@
       "DEF:ctimeout=$DB_FILE:ctimeout:AVERAGE",
       "DEF:rtimeout=$DB_FILE:rtimeout:AVERAGE",
       "DEF:nohost=$DB_FILE:nohost:AVERAGE",
+      "DEF:noroute=$DB_FILE:noroute:AVERAGE",
+      "DEF:err450=$DB_FILE:err450:AVERAGE",
+      "DEF:err421=$DB_FILE:err421:AVERAGE",
+      "DEF:err4=$DB_FILE:err4:AVERAGE",
       "DEF:other=$DB_FILE:other:AVERAGE",
-      "CDEF:active=req,crefused,-,msrefused,-,ctimeout,-,rtimeout,-,nohost,-,other,-,");
+      "CDEF:active=req,crefused,-,msrefused,-,ctimeout,-,rtimeout,-,nohost,-,noroute,-,err450,-,err421,-,err4,-,other,-,");

     if ($range ne "1h") {
       push @COMMANDS,("DEF:maxreq=$DB_FILE:req:MAX",
@@ -128,8 +132,12 @@
       HotSaNICdiagram::insert_data("STACK","crefused" ,$MODULECONFIG{COLOR_AREA_CREFUSED} ,"connection refused  ",$legends,$LEGEND,1),
       HotSaNICdiagram::insert_data("STACK","other"    ,$MODULECONFIG{COLOR_AREA_OTHER}    ,"other               ",$legends,$LEGEND,1),
       HotSaNICdiagram::insert_data("STACK","nohost"   ,$MODULECONFIG{COLOR_AREA_NOHOST}   ,"host not found      ",$legends,$LEGEND,1),
+      HotSaNICdiagram::insert_data("STACK","noroute"  ,$MODULECONFIG{COLOR_AREA_NOROUTE}  ,"route not found     ",$legends,$LEGEND,1),
       HotSaNICdiagram::insert_data("STACK","ctimeout" ,$MODULECONFIG{COLOR_AREA_CTIMEOUT} ,"connection timed out",$legends,$LEGEND,1),
       HotSaNICdiagram::insert_data("STACK","rtimeout" ,$MODULECONFIG{COLOR_AREA_RTIMEOUT} ,"read timed out      ",$legends,$LEGEND,1),
+      HotSaNICdiagram::insert_data("STACK","err450"   ,$MODULECONFIG{COLOR_AREA_ERR450}   ,"450 mbox not okay   ",$legends,$LEGEND,1),
+      HotSaNICdiagram::insert_data("STACK","err421"   ,$MODULECONFIG{COLOR_AREA_ERR421}   ,"421 service not okay",$legends,$LEGEND,1),
+      HotSaNICdiagram::insert_data("STACK","err4"     ,$MODULECONFIG{COLOR_AREA_ERR4}     ,"general 4xx error   ",$legends,$LEGEND,1),
       HotSaNICdiagram::insert_data("STACK","active"   ,$MODULECONFIG{COLOR_AREA}          ,"active              ",$legends,$LEGEND,1),
       "LINE1:req#".$MODULECONFIG{COLOR_LINE}.":",
       HotSaNICdiagram::insert_lines(%MODULECONFIG));
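
The CDEF in the first hunk is written in RPN: `req,crefused,-` means req minus crefused, and every following `name,-` pair subtracts one more counter. `active` is therefore what is left of `req` once all error classes, including the four new ones, have been taken away. The same arithmetic in shell, with made-up counts:

```shell
#!/bin/sh
# Made-up sample counts, only to illustrate the CDEF arithmetic.
req=40
crefused=3; msrefused=1; ctimeout=5; rtimeout=2; nohost=4
noroute=2; err450=6; err421=1; err4=3; other=2

# Same order as in the CDEF: start with req, subtract each error class.
active=$(( req - crefused - msrefused - ctimeout - rtimeout - nohost ))
active=$(( active - noroute - err450 - err421 - err4 - other ))
echo "active=$active"
```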

makerrd

--- /usr/local/src/HotSaNIC/modules/mailq/makerrd       2004-04-19 01:35:24.000000000 +0900
+++ /usr/local/HotSaNIC/modules/mailq/makerrd   2006-03-23 12:13:07.296222250 +0900
@@ -18,6 +18,10 @@
     DS:nohost:GAUGE:300:0:U \
     DS:other:GAUGE:300:0:U \
     DS:msrefused:GAUGE:300:0:U \
+    DS:noroute:GAUGE:300:0:U \
+    DS:err450:GAUGE:300:0:U \
+    DS:err421:GAUGE:300:0:U \
+    DS:err4:GAUGE:300:0:U \
     RRA:AVERAGE:0:1:720 \
     RRA:AVERAGE:0.3:6:2880 \
     RRA:AVERAGE:0.3:60:2016 \

platform/default.pm

--- /usr/local/src/HotSaNIC/modules/mailq/platform/default.pm   2004-07-01 19:28:50.000000000 +0900
+++ /usr/local/HotSaNIC/modules/mailq/platform/default.pm       2006-03-23 13:39:45.393083000 +0900
@@ -19,6 +19,10 @@
   $other=0;
   $msrefused=0;
   $active=0;
+  $noroute=0;
+  $err450=0;
+  $err421=0;
+  $err4=0;

   open FILE,"mailq|";
   while (<FILE>) {
@@ -29,15 +33,19 @@
       elsif (index($_,"Host not found") >=0 ) { $nohost++; }
       elsif (index($_,"read timeout") >=0 ) { $rtimeout++; }
       elsif (index($_,"server refused mail service") >=0 ) { $msrefused++; }
+      elsif (index($_,"No route to host") >=0 ) { $noroute++; }
+      elsif (index($_,"said: 450") >=0 ) { $err450++; }
+      elsif (index($_,"said: 421") >=0 ) { $err421++; }
+      elsif (index($_,"said: 4") >=0 ) { $err4++; } # all other 4xx errors are collected here
       else { $other++; }
       }
     elsif (/^--/o) { (undef,$kbytes)=split; }
     }
   close FILE;

-  my $req=$crefused+$ctimeout+$rtimeout+$nohost+$other+$msrefused+$active;
+  my $req=$crefused+$ctimeout+$rtimeout+$nohost+$other+$msrefused+$active+$noroute+$err450+$err421+$err4;

-  HotSaNICmod::do_rrd("queue","U",time,$kbytes,$req,$crefused,$ctimeout,$rtimeout,$nohost,$other,$msrefused);
+  HotSaNICmod::do_rrd("queue","U",time,$kbytes,$req,$crefused,$ctimeout,$rtimeout,$nohost,$other,$msrefused,$noroute,$err450,$err421,$err4);

   }

If you already have an rrd file, you need to remove it, as this patch changes the rrd file layout. Old data will be lost.
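One detail of the default.pm patch is worth spelling out: the order of the new `elsif` branches matters. The string `said: 4` is a prefix of both `said: 450` and `said: 421`, so the two specific checks have to come before the generic one, otherwise every 4xx reply would be counted as `err4`. A shell sketch of the same first-match-wins logic follows; the function name `classify_reason` and the sample lines are made up, and only the branches visible in the hunk are reproduced.

```shell
#!/bin/sh
# First matching pattern wins, like the if/elsif chain in default.pm.
# classify_reason is an illustrative name, not part of HotSaNIC.
classify_reason() {
  case $1 in
    *"Host not found"*)              echo nohost ;;
    *"read timeout"*)                echo rtimeout ;;
    *"server refused mail service"*) echo msrefused ;;
    *"No route to host"*)            echo noroute ;;
    *"said: 450"*)                   echo err450 ;;
    *"said: 421"*)                   echo err421 ;;
    *"said: 4"*)                     echo err4 ;;    # any other 4xx reply
    *)                               echo other ;;
  esac
}

# Made-up queue messages:
classify_reason "(host mx.example.com[192.0.2.1] said: 450 mailbox busy)"
classify_reason "(host mx.example.com[192.0.2.1] said: 452 too many recipients)"
classify_reason "(connect to mx.example.com[192.0.2.1]: No route to host)"
```

Moving the `said: 4` pattern above the two specific ones would turn the first result into `err4`, which is exactly the mistake the ordering in the patch avoids.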

Postfix Module patch to have it check 'n' postfix servers on one box

I have a special setup with ten Postfix servers running on one box. I wanted to monitor them with the postfix module from HotSaNIC to get the current process count and the number of mails in the queue for each of them.

This is quite a major change, but it works fine for me. Remove the rrd and config files, because the configuration has to be created from scratch. The rrd file layout also changes, so old rrd files and their data will be lost.

Patches for multiple postfixes module
