HotSaNIC
From AdminWiki
Latest revision as of 15:17, 1 June 2006
HotSaNIC monitoring
Overview
HotSaNIC is a monitoring system that collects data with a Perl daemon and uses rrd to store the data and create graphs for the system. It is very similar to munin, but it lacks a client-server architecture. Like munin, it can use plugins, but they are more difficult to write.
The advantage of HotSaNIC is its very high resolution: it reads the data every few seconds, which gives very accurate results. This makes it well suited to monitoring network traffic, processes and similar fast-changing events.
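To put numbers on the resolution: the ntp makerrd script shown further down this page creates its rrd with --step 10 and three visible RRA lines. The shell sketch below only does the arithmetic for those three lines. The excerpt ends with a continuation backslash, so the real script probably defines more archives, and other modules may use a different step. rra_span_seconds is a helper name made up for this example.

```shell
#!/bin/sh
# Sketch: how much wall-clock time one RRA covers.
# $1 = base step in seconds, $2 = steps consolidated per row, $3 = number of rows
rra_span_seconds() {
    echo $(( $1 * $2 * $3 ))
}

step=10   # from "--step 10" in the ntp makerrd excerpt
# steps-per-row:rows pairs taken from the three RRA lines visible in that excerpt
for rra in 1:720 6:2880 60:2016; do
    per_row=${rra%%:*}
    rows=${rra##*:}
    span=$(rra_span_seconds "$step" "$per_row" "$rows")
    echo "one point per $(( step * per_row ))s, $(( span / 3600 ))h retained"
done
```

For these three lines that works out to 2 hours at 10-second resolution, 48 hours at 60 seconds and 336 hours (14 days) at 600 seconds.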
The disadvantage is the lack of a client-server setup: you have to do a full install of HotSaNIC on every server you want to monitor, which is more complex than with munin.
The current CVS version is recommended, as it includes all the patches needed to work with current kernels.
Additional Modules / Changed Modules
NTP Patch to work with current ntp daemon
The ntp module from HotSaNIC works, but it misses some data because the output of the current ntp daemon has changed. I wrote a small patch to fix this.
The patches are below.
diagrams.pl
--- /usr/local/src/sys/HotSaNIC/modules/ntp/diagrams.pl 2004-06-03 08:25:10.000000000 +0900
+++ /usr/local/HotSaNIC/modules/ntp/diagrams.pl 2006-03-20 10:13:06.447836886 +0900
@@ -33,8 +33,8 @@
 # handle module-specific stuff
 #
-my @GRAPHS=("stratum","distance","dispersion","frequency","stability");
-my %LEGENDS=("stratum"=>" ","distance"=>"seconds","dispersion"=>"seconds","frequency"=>"ppm","stability"=>"ppm");
+my @GRAPHS=("stratum","distance","dispersion","jitter","stability","broadcastdelay","authdelay");
+my %LEGENDS=("stratum"=>" ","distance"=>"seconds","dispersion"=>"seconds","jitter"=>"seconds","stability"=>"ppm","broadcastdelay"=>"seconds","authdelay"=>"seconds");
 push @OPTIONS,( "--alt-autoscale-max"); # alternate scaling
makeindex.pl
--- /usr/local/src/sys/HotSaNIC/modules/ntp/makeindex.pl 2004-06-03 08:25:10.000000000 +0900
+++ /usr/local/HotSaNIC/modules/ntp/makeindex.pl 2006-03-20 10:09:42.065613912 +0900
@@ -29,7 +29,7 @@
 %MODCONFIG=HotSaNICmod::common::configure();
 @DIAGRAMS=("hour","6h","day","week","month","year");
 @TIMES=("6h","week");
-@GRAPHS=("stratum","distance","dispersion","frequency","stability");
+@GRAPHS=("stratum","distance","dispersion","jitter","stability","broadcastdelay","authdelay");
 # build time-based .html files
 #
makerrd
--- /usr/local/src/sys/HotSaNIC/modules/ntp/makerrd 2004-05-18 03:31:58.000000000 +0900
+++ /usr/local/HotSaNIC/modules/ntp/makerrd 2006-03-20 10:06:04.466164389 +0900
@@ -16,9 +16,11 @@
 $BINPATH/rrdtool create rrd/$PROC.rrd --step 10 \
   DS:dispersion:GAUGE:300:0:$MAX \
   DS:distance:GAUGE:300:0:$MAX \
-  DS:frequency:GAUGE:300:-10000:10000 \
+  DS:jitter:GAUGE:300:0:$MAX \
   DS:stability:GAUGE:300:0:$MAX \
   DS:stratum:GAUGE:300:0:20 \
+  DS:broadcastdelay:GAUGE:300:0:$MAX \
+  DS:authdelay:GAUGE:300:0:$MAX \
   RRA:AVERAGE:0:1:720 \
   RRA:AVERAGE:0.3:6:2880 \
   RRA:AVERAGE:0.3:60:2016 \
In the platform subdirectory: common.pm
--- /usr/local/src/sys/HotSaNIC/modules/ntp/platform/common.pm 2004-06-03 08:25:10.000000000 +0900
+++ /usr/local/HotSaNIC/modules/ntp/platform/common.pm 2006-03-20 09:59:15.837619337 +0900
@@ -13,6 +13,7 @@
 my @list=HotSaNICparser::locate_files("bin/ntpdc");
 if (! @list) { @list=HotSaNICparser::locate_files("bin/xntpdc"); }
 $MODCONF{NTPCOMMAND}=pop @list;
+ chomp $MODCONF{NTPCOMMAND};
 }
 return %MODCONF;
default.pm
--- /usr/local/src/sys/HotSaNIC/modules/ntp/platform/default.pm 2004-07-06 05:40:31.000000000 +0900
+++ /usr/local/HotSaNIC/modules/ntp/platform/default.pm 2006-03-20 10:07:04.107497582 +0900
@@ -24,12 +24,14 @@
 $str=$value if $var eq "stratum";
 $dst=$value if $var eq "root distance";
 $dps=$value if $var eq "root dispersion";
- $frq=$value if $var eq "frequency";
+ $jtr=$value if $var eq "jitter"; # changed that from frequency
 $stb=$value if $var eq "stability";
+ $bcdly=$value if $var eq "broadcastdelay";
+ $athdly=$value if $var eq "authdelay";
 }
 close FILE;
-
- HotSaNICmod::do_rrd($dbname,"U",time,$dps,$dst,$frq,$stb,$str);
+ HotSaNICmod::do_rrd($dbname,"U",time,$dps,$dst,$jtr,$stb,$str,$bcdly,$athdly);
 }
 }
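The field matching in default.pm above can be tried outside HotSaNIC with a small shell sketch. It is not the module's code: parse_ntp_sysinfo is a name made up for this example, the ntpdc arguments the module uses are not shown on this page, and taking the first word of each value (to drop a trailing unit such as "s" or "ppm") is my assumption about how the values should be cleaned. The function reads "name: value" lines on stdin and prints the seven values in the order of the patched do_rrd call, with U (rrdtool's marker for an unknown value) for any field that is missing.

```shell
#!/bin/sh
# Sketch: pick the fields the patched ntp module stores out of
# "name: value" lines and print them colon-separated in do_rrd order:
# dispersion, distance, jitter, stability, stratum, broadcastdelay, authdelay
parse_ntp_sysinfo() {
    awk -F':' '
        BEGIN {
            n = split("root dispersion,root distance,jitter,stability,stratum,broadcastdelay,authdelay", want, ",")
            for (i = 1; i <= n; i++) val[want[i]] = "U"   # U = unknown
        }
        NF >= 2 {
            key = $1
            gsub(/^[ \t]+|[ \t]+$/, "", key)              # trim the field name
            if (key in val) {
                rest = $0
                sub(/^[^:]*:[ \t]*/, "", rest)            # drop "name:" and padding
                # keep only the first word, i.e. the number without its unit
                if (split(rest, parts, /[ \t]+/) > 0 && parts[1] != "")
                    val[key] = parts[1]
            }
        }
        END {
            out = val[want[1]]
            for (i = 2; i <= n; i++) out = out ":" val[want[i]]
            print out
        }'
}

# Usage with illustrative input lines (made up for this example):
printf '%s\n' \
    'stratum:              3' \
    'root distance:        0.03371 s' \
    'jitter:               0.001266 s' | parse_ntp_sysinfo
```

Feeding it the real output of the ntp query command is a quick way to check whether your daemon still prints "frequency" or already prints "jitter", "broadcastdelay" and "authdelay".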
After patching the files, if an rrd file already exists, you need to remove it (yes, you lose all data). You also need to run makeindex.pl again to regenerate the HTML files.
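If you would rather not delete the old data outright, a shell sketch like the following moves the existing rrd files aside instead of removing them. This is a swapped-in, more cautious variant of the removal step, not part of HotSaNIC. set_aside_rrds is a made-up helper name, and I am assuming that the rrd files live in an rrd subdirectory of the module directory, as the relative path rrd/$PROC.rrd in makerrd suggests.

```shell
#!/bin/sh
# Sketch: move a module's existing .rrd files into a timestamped backup
# directory so they can be recreated with the new layout.
# $1 = module directory, e.g. /usr/local/HotSaNIC/modules/ntp (path from the patch headers)
set_aside_rrds() {
    moddir=$1
    backup="$moddir/rrd.old.$(date +%Y%m%d%H%M%S)"
    [ -d "$moddir/rrd" ] || { echo "no rrd directory in $moddir" >&2; return 1; }
    mkdir -p "$backup" || return 1
    for f in "$moddir"/rrd/*.rrd; do
        [ -e "$f" ] || continue          # glob matched nothing
        mv "$f" "$backup/" || return 1
    done
    echo "$backup"                       # tell the caller where the old files went
}

# Example call (commented out so that nothing is moved by accident):
# set_aside_rrds /usr/local/HotSaNIC/modules/ntp
```

The old files cannot be read back into the new layout by the module, so this only keeps them around for reference. After that, recreate the rrd and run makeindex.pl as described above.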
Add more detailed error legend in Postfix queue log
The default postfix queue graph already has a nice list of errors, but some of them can be shown in more detail, especially the 4xx errors.
Below are some patches to add this to the vanilla mailq module without the multiple postfix patch. The multiple postfix patch already includes this one.
diagrams.pl
--- /usr/local/src/HotSaNIC/modules/mailq/diagrams.pl 2004-05-24 08:53:39.000000000 +0900
+++ /usr/local/HotSaNIC/modules/mailq/diagrams.pl 2006-03-23 12:53:27.843497000 +0900
@@ -113,8 +113,12 @@
 "DEF:ctimeout=$DB_FILE:ctimeout:AVERAGE",
 "DEF:rtimeout=$DB_FILE:rtimeout:AVERAGE",
 "DEF:nohost=$DB_FILE:nohost:AVERAGE",
+ "DEF:noroute=$DB_FILE:noroute:AVERAGE",
+ "DEF:err450=$DB_FILE:err450:AVERAGE",
+ "DEF:err421=$DB_FILE:err421:AVERAGE",
+ "DEF:err4=$DB_FILE:err4:AVERAGE",
 "DEF:other=$DB_FILE:other:AVERAGE",
- "CDEF:active=req,crefused,-,msrefused,-,ctimeout,-,rtimeout,-,nohost,-,other,-,");
+ "CDEF:active=req,crefused,-,msrefused,-,ctimeout,-,rtimeout,-,nohost,-,noroute,-,err450,-,err421,-,err4,-,other,-,");
 if ($range ne "1h") {
 push @COMMANDS,("DEF:maxreq=$DB_FILE:req:MAX",
@@ -128,8 +132,12 @@
 HotSaNICdiagram::insert_data("STACK","crefused" ,$MODULECONFIG{COLOR_AREA_CREFUSED} ,"connection refused ",$legends,$LEGEND,1),
 HotSaNICdiagram::insert_data("STACK","other" ,$MODULECONFIG{COLOR_AREA_OTHER} ,"other ",$legends,$LEGEND,1),
 HotSaNICdiagram::insert_data("STACK","nohost" ,$MODULECONFIG{COLOR_AREA_NOHOST} ,"host not found ",$legends,$LEGEND,1),
+ HotSaNICdiagram::insert_data("STACK","noroute" ,$MODULECONFIG{COLOR_AREA_NOROUTE} ,"route not found ",$legends,$LEGEND,1),
 HotSaNICdiagram::insert_data("STACK","ctimeout" ,$MODULECONFIG{COLOR_AREA_CTIMEOUT} ,"connection timed out",$legends,$LEGEND,1),
 HotSaNICdiagram::insert_data("STACK","rtimeout" ,$MODULECONFIG{COLOR_AREA_RTIMEOUT} ,"read timed out ",$legends,$LEGEND,1),
+ HotSaNICdiagram::insert_data("STACK","err450" ,$MODULECONFIG{COLOR_AREA_ERR450} ,"450 mbox not okay ",$legends,$LEGEND,1),
+ HotSaNICdiagram::insert_data("STACK","err421" ,$MODULECONFIG{COLOR_AREA_ERR421} ,"421 service not okay",$legends,$LEGEND,1),
+ HotSaNICdiagram::insert_data("STACK","err4" ,$MODULECONFIG{COLOR_AREA_ERR4} ,"general 4xx error ",$legends,$LEGEND,1),
 HotSaNICdiagram::insert_data("STACK","active" ,$MODULECONFIG{COLOR_AREA} ,"active ",$legends,$LEGEND,1),
 "LINE1:req#".$MODULECONFIG{COLOR_LINE}.":",
 HotSaNICdiagram::insert_lines(%MODULECONFIG));
makerrd
--- /usr/local/src/HotSaNIC/modules/mailq/makerrd 2004-04-19 01:35:24.000000000 +0900
+++ /usr/local/HotSaNIC/modules/mailq/makerrd 2006-03-23 12:13:07.296222250 +0900
@@ -18,6 +18,10 @@
   DS:nohost:GAUGE:300:0:U \
   DS:other:GAUGE:300:0:U \
   DS:msrefused:GAUGE:300:0:U \
+  DS:noroute:GAUGE:300:0:U \
+  DS:err450:GAUGE:300:0:U \
+  DS:err421:GAUGE:300:0:U \
+  DS:err4:GAUGE:300:0:U \
   RRA:AVERAGE:0:1:720 \
   RRA:AVERAGE:0.3:6:2880 \
   RRA:AVERAGE:0.3:60:2016 \
In the platform folder: default.pm
--- /usr/local/src/HotSaNIC/modules/mailq/platform/default.pm 2004-07-01 19:28:50.000000000 +0900
+++ /usr/local/HotSaNIC/modules/mailq/platform/default.pm 2006-03-23 13:39:45.393083000 +0900
@@ -19,6 +19,10 @@
 $other=0;
 $msrefused=0;
 $active=0;
+ $noroute=0;
+ $err450=0;
+ $err421=0;
+ $err4=0;
 open FILE,"mailq|";
 while (<FILE>) {
@@ -29,15 +33,19 @@
 elsif (index($_,"Host not found") >=0 ) { $nohost++; }
 elsif (index($_,"read timeout") >=0 ) { $rtimeout++; }
 elsif (index($_,"server refused mail service") >=0 ) { $msrefused++; }
+ elsif (index($_,"No route to host") >=0 ) { $noroute++; }
+ elsif (index($_,"said: 450") >=0 ) { $err450++; }
+ elsif (index($_,"said: 421") >=0 ) { $err421++; }
+ elsif (index($_,"said: 4") >=0 ) { $err4++; } # all other 4xx errors are collected here
 else { $other++; }
 }
 elsif (/^--/o) { (undef,$kbytes)=split; }
 }
 close FILE;
- my $req=$crefused+$ctimeout+$rtimeout+$nohost+$other+$msrefused+$active;
+ my $req=$crefused+$ctimeout+$rtimeout+$nohost+$other+$msrefused+$active+$noroute+$err450+$err421+$err4;
- HotSaNICmod::do_rrd("queue","U",time,$kbytes,$req,$crefused,$ctimeout,$rtimeout,$nohost,$other,$msrefused);
+ HotSaNICmod::do_rrd("queue","U",time,$kbytes,$req,$crefused,$ctimeout,$rtimeout,$nohost,$other,$msrefused,$noroute,$err450,$err421,$err4);
 }
If you already have an rrd file, you need to remove it, as this patch changes the rrd file layout. Old data will be lost.
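The order of the new tests in the default.pm patch above matters: "said: 450" and "said: 421" have to be checked before the generic "said: 4", otherwise every 4xx reply would land in the general counter. The shell sketch below shows just that part of the chain. It is not the module's code: classify_deferral is a made-up name, the branches that come before "No route to host" in the real module are left out, and the sample lines in the usage are illustrative.

```shell
#!/bin/sh
# Sketch: classify one mailq reason line using the same substrings,
# in the same order, as the branches added by the patch.
classify_deferral() {
    case $1 in
        *"No route to host"*) echo noroute ;;
        *"said: 450"*)        echo err450 ;;
        *"said: 421"*)        echo err421 ;;
        *"said: 4"*)          echo err4 ;;    # any other 4xx reply
        *)                    echo other ;;
    esac
}

# Usage with illustrative reason lines:
classify_deferral '(host mx.example.com[192.0.2.1] said: 450 mailbox busy)'
classify_deferral '(host mx.example.com[192.0.2.1] said: 452 out of space)'
```

The first call prints err450 and the second prints err4, because 452 has no counter of its own and falls through to the generic 4xx branch.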
Postfix Module patch to have it check 'n' postfix servers on one box
I have a special setup where ten postfix servers run on one box. I wanted to monitor them with the postfix module from HotSaNIC to get the current process count and the number of mails in the queue.
This is quite a major change, but it works fine for me. Remove the rrd and config files, because the configuration has to be created anew. The rrd file layout also changes, so old rrd files and their data will be lost.