cpqhealth(4)                                                cpqhealth(4)

NAME
       cpqhealth - Compaq Advanced System Management Driver

SYNOPSIS
       nmdcpqasm.o nmdcpqevt.o

DESCRIPTION
       The Compaq Advanced Server Management Driver collects and
       monitors important operational data on your server to ensure
       that the system is "healthy". Any abnormal conditions are logged
       in the Integrated Management Log (IML), which is held in
       non-volatile RAM (NVRAM).

       Compaq servers are equipped with hardware and firmware that
       monitor abnormal conditions such as abnormal temperature
       readings, fan failures, and ECC memory errors. The cpqhealth
       driver monitors these conditions and reports them to the
       administrator by printing a message on the console and by
       logging the condition in the IML.

       The following features are supported by the Compaq Health &
       Wellness Driver:

       Monitoring abnormal temperature conditions
              If the normal operating temperature is exceeded, or a
              cooling fan fails, the Compaq Advanced Server Management
              driver does the following:

              * Displays a message on the console stating the problem.

              * Makes an entry in the Integrated Management Log (IML).

              * Shuts the system down (optionally) to avoid hardware
                damage. Use the Compaq System Configuration Utility to
                control this option.

       Monitoring fan failures
              If a cooling fan fails, the Compaq Advanced Server
              Management driver does the following:

              * Displays a message on the console stating the problem.

              * Makes an entry in the Integrated Management Log (IML).

              * Shuts the system down (optionally) to avoid hardware
                damage. Use the Compaq System Configuration Utility to
                control this option.

       Monitoring the system Fault Tolerant Power Supply
              If the primary power supply fails, the system
              automatically switches over to a backup power supply. The
              Compaq Advanced System Management driver does the
              following:

              * Displays a message on the console stating the problem.

              * Makes an entry in the Integrated Management Log (IML).

       Monitoring ECC memory errors
              If an ECC memory error occurs, the Compaq Advanced System
              Management driver logs the error in the health log,
              including the address that caused the error. If too many
              errors occur at the same memory location, the driver
              disables the ECC error interrupts to prevent flooding the
              console with warnings (the hardware automatically
              corrects the ECC error).

       Automatic Server Recovery (ASR)
              Automatic Server Recovery is implemented using a
              "heartbeat" timer that continually counts down. The
              driver frequently reloads the counter to prevent it from
              reaching zero. If the ASR timer counts down to zero, the
              operating system is assumed to be locked up and the
              system automatically attempts to reboot. Before
              rebooting, the driver does the following:

              * Displays a message on the console stating the problem.

              * Makes an entry in the Integrated Management Log (IML).

   Installing on patched Linux kernels and remote deployment
       The cpqhealth driver has been designed to work with patched
       Linux kernels. A single source file is provided that can be
       compiled and linked against the patched Linux kernel sources.
       Additionally, a shell script is provided to aid in packaging the
       driver into a new RPM for remote deployment. This allows
       customers to build once and deploy many times to servers which
       may not have build tools available.

       If the server has the build tools and the source files for the
       patched Linux kernel, the boot time scripts will automatically
       attempt to rebuild the driver and install it. If the driver
       cannot be built on a patched Linux kernel, errors are displayed
       on the screen and logged to
       /opt/compaq/cpqhealth/cpqhealth_boot.log.

       The requirements to build and deploy are:

       The sources for the "Patched" Linux kernel must be loaded
              The sources for the "Patched" Linux kernel must be loaded
              on the system. Ideally, the patched Linux kernel will
              already have been built from them, which serves as a
              sanity check of the build environment.

       The build environment must be properly created
              The build scripts provided expect a standard Linux 2.4
              kernel build environment. The sources should be linked to
              the "/lib/modules/`uname -r`/build" directory. The
              command "ls -ld /lib/modules/`uname -r`/build" should
              show a link that points to where the patched Linux kernel
              sources were loaded. Additionally, the standard build
              tools such as the gcc compiler and make must also be
              loaded.

       To create a custom cpqhealth RPM package, perform the following
       steps:

       * Load the patched Linux kernel sources and test that the kernel
         can be built.

       * Make sure that a directory which corresponds to the output of
         "uname -r" exists in the "/lib/modules" directory.

       * Make sure that a link named "build" in the
         "/lib/modules/`uname -r`/" directory points to the correct
         kernel source directory. You can validate this by making sure
         that the file
         "/lib/modules/`uname -r`/build/include/linux/version.h" has a
         version which matches the output of the "uname -r" command.

       * If all of the above conditions are met, you are ready to
         build. Run the shell script "sh custom_cpqhealth.sh". This
         script will create a new RPM SPEC file and attempt to build
         the driver. If the build of the driver is successful, a new
         RPM package will be created and copied to the
         /opt/compaq/cpqhealth directory. This package can then be
         deployed in the usual way.

BUGS
       /proc file system entries
              This release of the cpqhealth driver does NOT have
              "/proc" file system entries. This will be addressed in a
              future release.

       Limited Hardware Platforms
              This driver will only work on Compaq ProLiant servers
              which have the Compaq Advanced Server Management (ASM)
              ASIC (PCI ID 0x0E11A0F0) or the Compaq iLO Advanced
              Server Management ASIC (PCI ID 0x0E11B203).

       "VFS: Disk change detected" messages
              When the health driver detects abnormal conditions, you
              might observe lines on the console starting with
              "VFS: Disk change detected...".
              This is a kernel message triggered by a CD automount
              daemon called magicdev. This behavior can be observed
              even without the health driver (many newsgroup postings
              bear testimony to that fact). Shut magicdev down
              (killall magicdev) to stop those messages from appearing
              on the console.

       Initialization time
              After it is loaded, the driver needs about one minute to
              initialize fully. Specifically, faulty hardware that
              returns to normal might not be recognized as "working"
              within the first minute.

FILES
       /opt/compaq/cpqhealth
              Default directory for the scripts and binaries.

       /lib/modules/Compaq/drivers
              Default location for the cpqasm and cpqevt modules.
              Binary drivers are provided with the package for "Boxed"
              Linux kernels.

       /lib/modules/Compaq/drivers/up/cpqasm.o
       /lib/modules/Compaq/drivers/up/cpqevt.o
              The supplied binary drivers for single processor Linux
              kernels.

       /lib/modules/Compaq/drivers/smp/cpqasm.o
       /lib/modules/Compaq/drivers/smp/cpqevt.o
              The supplied binary drivers for multiple processor Linux
              kernels.

       /lib/modules/Compaq/drivers/ent/cpqasm.o
       /lib/modules/Compaq/drivers/ent/cpqevt.o
              The supplied binary drivers for Enterprise Linux kernels.

       /opt/compaq/cpqhealth/custom_cpqhealth.sh
              The shell script which will rebuild and repackage the
              cpqhealth driver.

       /opt/compaq/cpqhealth/cpqhealth_boot.log
              A log file containing the results of the last boot of the
              system. The RPM errors are also logged here.

       /etc/init.d/cpqasm
              This file is linked to the multiuser initstate
              directories and controls the loading of the cpqasm and
              cpqevt drivers. This script determines whether the
              drivers need to be rebuilt.

SEE ALSO
       insmod(1), kerneld(8), modprobe(1), cpqimlview(8)

AUTHOR
       Compaq Computer Corporation

                              31 May 2002                   cpqhealth(4)
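
The build prerequisites listed for creating a custom cpqhealth RPM package can be checked before custom_cpqhealth.sh is run. The following is a minimal sketch of such a check and is not part of the cpqhealth package. The function names check_kernel_tree and check_build_tools are hypothetical, and the sketch assumes that version.h contains the release string in double quotes (as the UTS_RELEASE define does on 2.4 kernels).

```shell
#!/bin/sh
# Illustrative pre-build check; NOT shipped with the cpqhealth package.

# check_kernel_tree KVER [MODROOT]
# Prints a one-line result and returns non-zero on the first failed check.
# MODROOT defaults to /lib/modules; it is a parameter only so that the
# check can be tried against a scratch directory.
check_kernel_tree() {
    kver=$1
    modroot=${2:-/lib/modules}
    build="$modroot/$kver/build"

    # A directory named after the "uname -r" output must exist.
    if [ ! -d "$modroot/$kver" ]; then
        echo "missing directory: $modroot/$kver"
        return 1
    fi

    # The "build" link must resolve to the kernel source directory.
    if [ ! -d "$build" ]; then
        echo "build link missing or dangling: $build"
        return 1
    fi

    # Assumption: version.h holds the release string in double quotes.
    if ! grep -F -q "\"$kver\"" "$build/include/linux/version.h" 2>/dev/null; then
        echo "version.h does not mention $kver"
        return 1
    fi

    echo "kernel tree OK for $kver"
}

# check_build_tools: gcc and make must both be on the PATH.
check_build_tools() {
    for tool in gcc make; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing tool: $tool"
            return 1
        fi
    done
}
```

One possible use, run from the /opt/compaq/cpqhealth directory, is: check_kernel_tree "`uname -r`" && check_build_tools && sh custom_cpqhealth.sh. The rebuild script is then started only if every prerequisite check passes.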