check_hpasm Nagios Plugin README
---------------------

This plugin checks the hardware health of HP Proliant servers with the 
hpasm software installed. It uses the hpasmcli command to acquire the 
condition of the system's critical components like cpus, power supplies,
temperatures, fans and memory modules. Newer versions also use SNMP.

* For instructions on installing this plugin for use with Nagios,
  see below. In addition, generic instructions for the GNU toolchain
  can be found in the INSTALL file.

* For major changes between releases, read the CHANGES file.

* For information on detailed changes that have been made,
  read the Changelog file.

* This plugin is self-documenting.  All plugins that comply with
  the basic guidelines for development will provide detailed help when
  invoked with the '-h' or '--help' options.

You can check for the latest plugin at:
  http://www.consol.de/opensource/nagios/check-hpasm

Send mail to mail_redacted_for_web for assistance.  
Please include the OS type and version that you are using.
Also, run the plugin with the '-v' option and provide the resulting 
version information.  Of course, there may be additional diagnostic information
required as well.  Use good judgment.


How to "compile" the check_hpasm script.
--------------------------------------------------------

1) Run the configure script to initialize variables and create a Makefile, etc.

	./configure --prefix=BASEDIRECTORY --with-nagios-user=SOMEUSER --with-nagios-group=SOMEGROUP --with-perl=PATH_TO_PERL --with-noinst-level=LEVEL --with-degrees=UNIT --enable-perfdata --enable-hpacucli

   a) Replace BASEDIRECTORY with the path of the directory under which Nagios
      is installed (default is '/usr/local/nagios')
   b) Replace SOMEUSER with the name of a user on your system that will be
      assigned permissions to the installed plugins (default is 'nagios')
   c) Replace SOMEGROUP with the name of a group on your system that will be
      assigned permissions to the installed plugins (default is 'nagios')
   d) Replace PATH_TO_PERL with the path where a perl binary can be found.
      Besides the system wide perl you might have installed a private perl
      just for the nagios plugins (default is the perl in your path).
   e) Replace LEVEL with one of ok, warning, critical or unknown.
      If the required hpasm-rpm is not installed, the check_hpasm plugin
      will exit with the level specified. If you chose ok, the message
      will say "ok - .... hpasm is not installed". This is different from
      the "ok - hardware working fine" if hpasm was found.
      The default is to treat a missing hpasm package as ok.
   f) Replace UNIT with one of celsius or fahrenheit. The hpasmcli "show temp"
      prints temperatures both in units of celsius and fahrenheit. With the
      --with-degrees option you can decide which units will be shown in an
      alarm message.
      The default is "celsius".
   g) You can tell check_hpasm to output performance data by default if
      you call configure with the --enable-perfdata option.
   h) You can tell check_hpasm to check the raid status with the hpacucli command
      if you call configure with the --enable-hpacucli option.
      You need the hpacucli rpm.
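
The LEVEL names in e) correspond to the conventional Nagios plugin exit
codes. The following is only a sketch of that mapping; the function name
level_to_exit_code is hypothetical and is not part of check_hpasm.

```shell
# Hypothetical helper: translate a --with-noinst-level value into the
# conventional Nagios exit code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).
level_to_exit_code() {
  case "$1" in
    ok)       echo 0 ;;
    warning)  echo 1 ;;
    critical) echo 2 ;;
    unknown)  echo 3 ;;
    *)        echo "invalid level: $1" >&2; return 1 ;;
  esac
}

level_to_exit_code warning   # prints 1
```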

2) "Compile" the plugin with the following command:

	make

    This will produce a "check_hpasm" script. You will also find
    a "check_hpasm.pl", which you should ignore. It is the template for
    the build, filled with placeholders that are replaced during
    the make process.


3) Install the compiled plugin script with the following command:

	make install

   The installation procedure will attempt to place the plugin in a 
   'libexec/' subdirectory in the base directory you specified with
   the --prefix argument to the configure script.


4) Verify that your configuration files for Nagios contain
   the correct paths to the new plugin.


5) Add this line to /etc/sudoers:
   nagios      ALL=NOPASSWD: /sbin/hpasmcli
   or this, if you also installed the hpacucli package
   nagios      ALL=NOPASSWD: /sbin/hpasmcli, /usr/sbin/hpacucli
  


Command line parameters
-----------------------

-v, --verbose
   Increased verbosity will print how check_hpasm communicates with the
   hpasm daemon and which values were acquired.

-t, --timeout
   The number of seconds after which the plugin will abort.

-b, --blacklist
   If some components of your system are missing (most commonly an empty
   secondary power supply bay) and you tolerate this, then blacklist the
   missing/failed component to avoid false alarms.
   The value for this option is a slash-separated list of components to
   ignore.
   Example: -b p:1,2/f:2/t:3,4/c:1/d:0-1,0-2
   means: ignore power supplies #1 and #2, fan #2, temperature #3 and #4,
   cpu #1 and dimms #1 and #2 in cartridge #0.
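
To make the blacklist format concrete, here is a minimal sketch that splits
such a value into one component per line. It only illustrates the syntax
described above; it is not the plugin's own parser, and the function name
expand_blacklist is hypothetical.

```shell
# Hypothetical helper: expand "p:1,2/f:2/..." into "<component> <item>" lines.
# The letter meanings (p, f, t, c, d) are taken from the example above.
expand_blacklist() (
  # The function body is a subshell, so the IFS changes stay local.
  IFS=/
  set -f                      # no globbing while word-splitting
  for group in $1; do
    type=${group%%:*}         # letter before the colon
    items=${group#*:}         # comma-separated list after the colon
    case "$type" in
      p) name="power supply" ;;
      f) name="fan" ;;
      t) name="temperature" ;;
      c) name="cpu" ;;
      d) name="dimm" ;;
      *) name="$type" ;;      # unknown letters are passed through
    esac
    IFS=,
    for item in $items; do
      printf '%s %s\n' "$name" "$item"
    done
    IFS=/
  done
)

expand_blacklist "p:1,2/f:2/t:3,4/c:1/d:0-1,0-2"
```

For dimms the item keeps its cartridge-module form, for example "dimm 0-1".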

-c, --customthresh
   Override the machine-default temperature thresholds.
   Example: -c 1:60/4:80/5:50
   Sets limit for temperature 1 to 60 degrees, temperature 4 to 80 degrees
   and temperature 5 to 50 degrees. You get the consecutive numbers by
   calling check_hpasm -v
   ...
      checking temperatures
       1 processor_zone temperature is 46 (62 max)
       2 cpu#1 temperature is 43 (73 max)
       3 i/o_zone temperature is 54 (68 max)
       4 cpu#2 temperature is 46 (73 max)
       5 power_supply_bay temperature is 38 (55 max)
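
The lookup implied by the -c value can be sketched as follows: an override
wins if the temperature number is listed, otherwise the machine default
stays in effect. This is an illustration under that assumption, not the
plugin's code; threshold_for is a hypothetical name.

```shell
# Hypothetical helper: threshold_for SPEC NUMBER DEFAULT
# SPEC is a value like "1:60/4:80/5:50"; prints the limit for temperature
# NUMBER, or DEFAULT if SPEC contains no override for it.
threshold_for() (
  IFS=/
  set -f
  for pair in $1; do
    if [ "${pair%%:*}" = "$2" ]; then
      echo "${pair#*:}"
      exit 0                  # leaves only the function's subshell
    fi
  done
  echo "$3"
)

threshold_for "1:60/4:80/5:50" 4 73   # prints 80 (override)
threshold_for "1:60/4:80/5:50" 2 73   # prints 73 (machine default kept)
```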

-p, --perfdata
   Add performance data to the output even if you did not compile check_hpasm
   with --enable-perfdata in step 1.



SNMP and Memory Modules
-----------------------
Older hardware does not always show valuable information when queried for
the health of memory modules. Maybe it's because older modules do not support
error checking at all.


1. no cpqHeResMemModule
---------------------------------------------------------------------------

2. collapsed cpqHeResMemModule
---------------------------------------------------------------------------

Some (older) systems do not support the cpqHeResMemModuleEntry table.
Either there is no oid with 1.3.6.1.4.1.232.6.2.14.11.1 at all
or there is a single oid like

Example:
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.1 = INTEGER: 524288
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.2 = INTEGER: 262144
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.3 = INTEGER: 0
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.4 = INTEGER: 524288
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.5 = INTEGER: 262144
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.6 = INTEGER: 0

                                ^-- module number
                              ^-- cartridge number (0 = system board)
                            ^-- size

iso.3.6.1.4.1.232.6.2.14.11.1.1.0.6 = INTEGER: 0
 
I compared 300 systems and found out that with
1.3.6.1.4.1.232.6.2.14.11.1.<no1>.<no2>.<no3> = <no4>
no1 is always 1
no2 is always 0
no3 is the number of memory slots (including the empty ones).
no4 is always 0. It is probably the health status of the 
overall memory subsystem. I don't know.
I will implement 0 = ok, not 0 = ask compaq

cpqSiMemECCStatus provides no usable information. All my test systems
showed 0 which is an undocumented value.

function get_size(cpqHeResMemModuleEntry) will return 1.
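
The index layout of the cpqSiMemBoardSize lines shown above (size column,
then cartridge number, then module number) can be decoded from snmpwalk
output with plain parameter expansion. A minimal sketch, assuming lines
formatted exactly like the example; decode_mem_size_line is a hypothetical
name.

```shell
# Hypothetical helper: decode one snmpwalk line such as
#   iso.3.6.1.4.1.232.2.2.4.5.1.3.0.1 = INTEGER: 524288
decode_mem_size_line() (
  oid=${1%% = *}          # everything left of " = "
  size=${1##*: }          # the value after "INTEGER: "
  module=${oid##*.}       # last index component
  rest=${oid%.*}
  cartridge=${rest##*.}   # second-to-last index component
  printf 'cartridge=%s module=%s size=%s\n' "$cartridge" "$module" "$size"
)

decode_mem_size_line 'iso.3.6.1.4.1.232.2.2.4.5.1.3.0.1 = INTEGER: 524288'
# prints: cartridge=0 module=1 size=524288
```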

3. cpqHeResMemModule containing crap
---------------------------------------------------------------------------

grepping for cpqSiMemBoardSize shows 4 modules
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.1 = INTEGER: 262144
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.2 = INTEGER: 262144
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.3 = INTEGER: 0
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.4 = INTEGER: 262144
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.5 = INTEGER: 262144
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.6 = INTEGER: 0

grepping for cpqHeResMemEntry shows one module with zero values
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.0 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.2.0.0 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.3.0.0 = ""
iso.3.6.1.4.1.232.6.2.14.11.1.4.0.0 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.5.0.0 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.6.0.0 = Hex-STRING: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 


4. cpqHeResMemModuleEntry and cpqSiMemModuleEntry use different table indexes
---------------------------------------------------------------------------

cpqSiMemBoardIndex      1.3.6.1.4.1.232.2.2.4.5.1.1 
cpqSiMemModuleIndex     1.3.6.1.4.1.232.2.2.4.5.1.2 

cpqHeResMemBoardIndex   1.3.6.1.4.1.232.6.2.14.11.1.1 
cpqHeResMemModuleIndex  1.3.6.1.4.1.232.6.2.14.11.1.2 


cpqSiMemBoardIndex
SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.1 = INTEGER: 0
SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.2 = INTEGER: 0
SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.3 = INTEGER: 0
SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.4 = INTEGER: 0
SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.5 = INTEGER: 0
SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.6 = INTEGER: 0

cpqHeResMemBoardIndex
SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.1 = INTEGER: 0
SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.2 = INTEGER: 0
SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.3 = INTEGER: 0
SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.4 = INTEGER: 0
SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.5 = INTEGER: 0
SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.6 = INTEGER: 0

It is not possible to use the SNMP-table-indices to identify the 
corresponding he-entry. Matching is done with nested loops.
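
One way to picture the nested-loop matching is sketched below. It is an
assumption, not the plugin's actual code: rows are compared by their
board and module column values instead of by table index. The module
values in the sample rows are hypothetical, since only the board values
appear in the walks above.

```shell
# Each row: "<table index> <board value> <module value>".
# Module values are made up for illustration.
si_rows='0.1 0 1
0.2 0 2
0.3 0 3'
he_rows='1.1 0 1
1.2 0 2
1.3 0 3'

# Hypothetical helper: pair every si row with the he row that has the
# same board and module values, regardless of the table index.
match_rows() (
  printf '%s\n' "$1" | while read -r si_idx si_board si_module; do
    printf '%s\n' "$2" | while read -r he_idx he_board he_module; do
      if [ "$si_board" = "$he_board" ] && [ "$si_module" = "$he_module" ]; then
        printf 'si %s <-> he %s\n' "$si_idx" "$he_idx"
      fi
    done
  done
)

match_rows "$si_rows" "$he_rows"
```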

5. even worse: cpqHeResMemBoardIndex and cpqSiMemBoardIndex don't match
---------------------------------------------------------------------------

cpqSiMemBoardIndex
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.1 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.2 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.3 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.4 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.5 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.6 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.7 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.8 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.1 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.2 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.3 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.4 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.5 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.6 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.7 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.8 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.3.1 = INTEGER: 3

cpqHeResMemBoardIndex
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.1 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.2 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.3 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.4 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.5 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.6 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.7 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.8 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.1 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.2 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.3 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.4 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.5 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.6 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.7 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.8 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.2.1 = INTEGER: 2


Redundant fans
-----------------------
I saw one old server which had only half of the possible fans installed.

Fan#                               1    2      3    4      5    6

cpqHeFltTolFanPresent              yes  no     yes  no     yes  no
cpqHeFltTolFanRedundant            no   no     no   no     no   no
cpqHeFltTolFanRedundantPartner     2    1      4    3      6    5
cpqHeFltTolFanCondition            ok   other  ok   other  ok   other
cpqHeFltTolFanLocation             cpu  cpu    cpu  cpu    io   io

Normally this would result in
...
fan #1 (cpu) is not redundant
fan #2 (cpu) is not redundant
fan #3 (cpu) is not redundant
fan #4 (cpu) is not redundant
fan #5 (ioboard) is not redundant
fan #6 (ioboard) is not redundant
WARNING - fan #1 (cpu) is not redundant, fan #2 (cpu) is not redundant, fan #3 (cpu) is not redundant, fan #4 (cpu) is not redundant, fan #5 (ioboard) is not redundant, fan #6 (ioboard) is not redundant

However, it was the server owner's decision not to install fan pairs but only one fan per location, so for him this is a false alert.

With --ignore-fan-redundancy, check_hpasm only looks at cpqHeFltTolFanCondition and ignores dependencies between two fans, so the result is:

fan 1 speed is normal, pctmax is 50%, location is cpu, redundance is no, partner is 2
fan 3 speed is normal, pctmax is 50%, location is cpu, redundance is no, partner is 4
fan 5 speed is normal, pctmax is 50%, location is ioboard, redundance is no, partner is 6
OK - System: 'proliant ml370 g3', ...
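
The difference between the two modes can be sketched with the table above
as input. This is a simplified illustration, not the plugin's logic. That
absent fans are skipped when redundancy is ignored is an assumption
inferred from the sample output, which lists only fans 1, 3 and 5. The
name check_fans is hypothetical.

```shell
# Each row: "<fan#> <present> <redundant> <condition> <location>",
# copied from the table above.
fans='1 yes no ok cpu
2 no no other cpu
3 yes no ok cpu
4 no no other cpu
5 yes no ok io
6 no no other io'

# Hypothetical helper: check_fans yes|no
# Reads fan rows on stdin; the argument says whether redundancy is ignored.
check_fans() (
  ignore_redundancy=$1
  status=OK
  while read -r num present redundant condition location; do
    if [ "$ignore_redundancy" = yes ]; then
      [ "$present" = yes ] || continue          # assumption: skip empty slots
      [ "$condition" = ok ] || status=WARNING
    else
      [ "$redundant" = yes ] || status=WARNING  # any non-redundant fan warns
    fi
  done
  echo "$status"
)

printf '%s\n' "$fans" | check_fans no    # prints WARNING
printf '%s\n' "$fans" | check_fans yes   # prints OK
```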


An SNMP forwarding trick
-----------------------
local - where check_hpasm runs
remote - where a proliant can be reached
proliant - where the snmp agent runs

remote:
ssh -R6667:localhost:6667 local
socat tcp4-listen:6667,reuseaddr,fork UDP:proliant:161

local:
socat udp4-listen:161,reuseaddr,fork tcp:localhost:6667
check_hpasm --hostname 127.0.0.1


Sample data from real machines
------------------------------

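# Locate the HP command line tools (hpacucli is optional).
hpasmcli=$(which hpasmcli)
hpacucli=$(which hpacucli)
# Dump each hpasmcli section, prefixing every line with the section name.
for i in server powersupply fans temp dimm
do
  "$hpasmcli" -s "show $i" | while IFS= read -r line
  do
    printf "%s %s\n" "$i" "$line"
  done
done
# Do the same for the RAID controller if hpacucli is installed.
if [ -x "$hpacucli" ]; then
  for i in config status
  do
    "$hpacucli" ctrl all show "$i" | while IFS= read -r line
    do
      printf "%s %s\n" "$i" "$line"
    done
  done
fi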

If you think check_hpasm is not working correctly, please run the above script
and send me the output. The output of snmpwalk is also helpful:
snmpwalk .... 1.3.6.1.4.1.232


--
Gerhard Lausser <mail_redacted_for_web>