cpqhealth(4)                                                cpqhealth(4)

NAME
cpqhealth - Compaq Advanced System Management Driver

SYNOPSIS
/etc/init.d/cpqasm [start | stop | status]

DESCRIPTION
The Compaq Advanced Server Management Driver collects and monitors
important operational data on your server to ensure that the system is
operating nominally. Any abnormal conditions are logged into the Non
Volatile RAM (NVRAM) Integrated Management Log (IML).

Compaq ProLiant Servers are equipped with hardware and firmware to
monitor certain abnormal conditions such as abnormal temperature
readings, fan failures, and ECC memory errors. The cpqhealth driver
monitors these conditions and reports them to the administrator by
printing a message on the console and logging the condition into the
IML. The Compaq Insight Management (CIM) agents can also be used to
notify the administrator of abnormal conditions.

The following is a list of features supported by the Compaq Health &
Wellness Driver:

Monitoring abnormal temperature conditions
    If the normal operating temperature is exceeded, or a cooling fan
    fails, the Compaq Advanced Server Management driver does the
    following:

    * Displays a message to the console stating the problem.
    * Makes an entry in the Integrated Management Log (IML).
    * Shuts the system down (optionally) to avoid hardware damage. Use
      the Compaq System Configuration Utility to control the option.

Monitoring fan failures
    If a cooling fan fails, the Compaq Advanced Server Management
    driver does the following:

    * Displays a message to the console stating the problem.
    * Makes an entry in the Integrated Management Log (IML).
    * Shuts the system down (optionally) to avoid hardware damage. Use
      the Compaq System Configuration Utility to control the option.

Monitoring the system Fault Tolerant Power Supply
    If the primary power supply fails, the system automatically
    switches over to a backup power supply.
    The Compaq Advanced System Management driver does the following:

    * Displays a message to the console stating the problem.
    * Makes an entry in the Integrated Management Log (IML).

Monitoring ECC memory errors
    If an ECC memory error occurs, the Compaq Advanced System
    Management driver logs the error in the health log, including the
    address that caused the error. If too many errors occur at the
    same memory location, the driver disables the ECC error interrupts
    to prevent flooding the console with warnings (the hardware
    automatically corrects the ECC error).

Automatic Server Recovery (ASR)
    Automatic Server Recovery is implemented using a "heartbeat" timer
    that continually counts down. The driver frequently reloads the
    counter to prevent it from counting down to zero. If the ASR timer
    counts down to 0, it is assumed that the operating system is
    locked up and the system automatically attempts to reboot. Before
    rebooting, the driver does the following:

    * Displays a message on the console stating the problem.
    * Makes an entry in the Integrated Management Log (IML).

Getting the status of the ProLiant Server

There are multiple ways to get the operational status of the ProLiant
server. The ideal way is to load the Compaq Insight Management agents
and use a tool such as HP OpenView or CIM-XE to monitor the status of
all the ProLiant servers. For those customers who do not have automatic
monitoring tools, the servers can be checked using a standard Web
browser as long as the CIM agents have been installed. The CIM Web
Agent responds on port 2301 and on port 2381 (if the browser supports
SSL encryption). For example, the browser could be pointed to
http://192.1.1.20:2301 or http://localhost:2301. Note that the
"http://" is required. Until the agents are customized, the user name
and password are both "administrator". The CIM Web Agent allows the
administrator to remotely view the IML log and the status of individual
features (e.g. temperature).
Other ProLiant Server specific information is also available.

The Compaq Insight Management Agents may currently be obtained at
www.compaq.com/support/files/server/us/locOsCat/70.html. This link may
change in the future as a result of the HP - Compaq consolidation.

There are also "/proc" file entries available to allow quick checks to
be made.

* "/proc/cpqtemp" shows the current temperature and the threshold
  levels of all temperature sensors.
* "/proc/cpqfan" shows the current status of all fans.
* "/proc/cpqpwr" shows the current status of all power supplies.

There is a graphical maintenance utility named cpqimlview(8). Today the
cpqimlview utility must be run in the graphical (X11) interface for
full functionality, but a limited text based version is released with
the cpqhealth RPM for use on "Blade" servers or Telnet sessions. See
the man page on the cpqimlview(8) utility for more information. The
cpqimlview(8) utility can automatically determine which version (i.e.
graphical (X11) or text) of the viewer to launch.

Most errors which are logged to the NVRAM based Integrated Management
Log (IML) are also logged to the standard "messages" file (i.e.
/var/log/messages).

Installing on patched Linux Kernels and remote deployment

The cpqhealth driver has been designed to work with patched Linux
kernels. There is a single source file which can compile and link
against the patched Linux kernel sources. Additionally, a shell script
has been provided to aid in the packaging of the driver into a new RPM
for remote deployment. This was done to allow customers to build once
and deploy many times to servers which may not have build tools
available.

If the server has the build tools and the source files for the patched
Linux kernel, the boot time scripts will automatically attempt to
rebuild the driver and install it.
Errors will be displayed on the screen and logged to
/opt/compaq/cpqhealth/cpqhealth_boot.log if the driver cannot be built
on a patched Linux kernel.

The requirements to build and deploy are:

The sources for the "Patched" Linux kernel must be loaded
    The sources for the "Patched" Linux kernel must be loaded on the
    system.

The build environment must be properly created
    The build scripts provided expect a standard Linux 2.4 kernel
    build environment. The sources should be linked to the
    "/lib/modules/`uname -r`/build" directory. The output of the
    command "ls -ld /lib/modules/`uname -r`/build" should point to
    where the patched Linux kernel sources were loaded. Additionally,
    the standard build tools such as the gcc compiler and make must
    also be loaded.

To create a custom cpqhealth RPM package, perform the following steps:

* Load the patched Linux kernel sources and development tools.
* Make sure that a directory which corresponds to the output of
  "uname -r" exists in the "/lib/modules" directory.
* Make sure that a link named "build" in the
  "/lib/modules/`uname -r`/" directory points to the correct kernel
  source directory. You can validate this by making sure that the file
  "/lib/modules/`uname -r`/build/include/linux/version.h" has a
  version which matches the output of the "uname -r" command.
* If all of the above conditions are met, you are ready to build. Run
  the shell script "sh custom_cpqhealth.sh". This script will create a
  new RPM SPEC file and attempt to build the driver. If the build of
  the driver is successful, a new RPM package will be created and
  copied to the /opt/compaq/cpqhealth directory. This package can then
  be deployed in the usual way.
* Typical errors: Usually there will be compiler or linker warnings
  which indicate that kernel drivers should not use regular header
  files. This is an indication that the kernel sources are not loaded
  or not installed correctly. You need to make sure the "version.h"
  file listed above matches the output of "uname -r".
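
The pre-build checks listed above can be sketched as a small shell
function. This is only an illustration: check_build_env is a
hypothetical helper, not something shipped with the cpqhealth package,
and the match against version.h assumes the file carries the release as
a quoted string (as the UTS_RELEASE definition normally does on 2.4
kernels).

```shell
#!/bin/sh
# Sketch only: check_build_env is a hypothetical helper, not part of cpqhealth.
# Arguments: kernel release (default: `uname -r`), modules root (default: /lib/modules).
check_build_env() {
    release="${1:-$(uname -r)}"
    modroot="${2:-/lib/modules}"
    build="$modroot/$release/build"
    version_h="$build/include/linux/version.h"

    # A directory named after the running kernel must exist.
    if [ ! -d "$modroot/$release" ]; then
        echo "missing directory: $modroot/$release"
        return 1
    fi
    # -d follows symbolic links, so a dangling "build" link is caught here too.
    if [ ! -d "$build" ]; then
        echo "missing or dangling build link: $build"
        return 1
    fi
    if [ ! -f "$version_h" ]; then
        echo "missing file: $version_h"
        return 1
    fi
    # ASSUMPTION: version.h contains the release as a quoted string.
    if ! grep -F -q "\"$release\"" "$version_h"; then
        echo "version mismatch: $version_h does not mention $release"
        return 1
    fi
    echo "build environment looks usable for $release"
}
```

Calling check_build_env with no arguments checks the running kernel.
The optional arguments exist only so the function can be pointed at
another release or at a test directory tree.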
If all the above conditions are met, you might want to try building the
kernel. If the kernel cannot be built successfully, this is an
indication that the particular kernel release may have some issues.

cpqhealth installation messages

The following messages will be logged in
/opt/compaq/cpqhealth/cpqhealth_boot.log or
/opt/compaq/compaq/cpqhealth/cpqhealth_boot.log.old. These messages are
logged for the RPM installation of the cpqhealth module as well as when
booting the Linux operating system.

Message:     "WARNING: cpqasm: casmd already running!"
             " You must stop the process first."
             " usage: /etc/init.d/cpqasm stop"
Description: This is an indication that the /etc/init.d/cpqasm script
             was run multiple times with the "start" parameter.
Action:      None.
============================
Message:     "The Compaq Health Event Logging module is not available"
             "for this Linux kernel: ${THIS_KERNEL}"
Description: This is an indication that the package has either been
             installed on a patched Linux kernel (i.e. an errata Linux
             kernel) or the wrong binary package has been installed.
Action:      Make sure the correct distribution package has been
             installed. If there is not a cpqhealth package for the
             installed Linux distribution, make sure the kernel source
             files have been installed as previously described in this
             document.
============================
Message:     "The Compaq Health Event Logging module failed to load!"
             "Linux Kernel Symbol Conflict - Attemping rebuild to resolve."
Description: This is an indication that the package has either been
             installed on a patched Linux kernel (i.e. an errata Linux
             kernel) or the wrong binary package has been installed.
Action:      Make sure the correct distribution package has been
             installed. If there is not a cpqhealth package for the
             installed Linux distribution, make sure the kernel source
             files have been installed as previously described in this
             document.
             If the correct Linux kernel source is present, the boot
             scripts will attempt to automatically rebuild the
             cpqhealth module and reload the drivers.
============================
Message:     "WARNING! Not able to rebuild the cpqevt.o module on this kernel!"
             " Remove and install again the cpqhealth RPM to correct."
             " See /opt/compaq/cpqhealth/cpqhealth_boot.log"
Description: This is an indication that the package has either been
             installed on a patched Linux kernel (i.e. an errata Linux
             kernel) or the wrong binary package has been installed.
             In either case, there was no source available or there
             were compilation / linker errors.
Action:      Make sure the correct distribution package has been
             installed. If there is not a cpqhealth package for the
             installed Linux distribution, make sure the kernel source
             files have been installed as previously described in this
             document. The errors located in
             /opt/compaq/cpqhealth/cpqhealth_boot.log will need to be
             reviewed and corrected. This may require "wrapper" file
             changes if the Linux kernel header files have been
             drastically modified in the installed distribution.
============================
Message:     "The Compaq Advanced Server Management module is not available"
             "for this Linux kernel: ${THIS_KERNEL}"
Description: This is an indication that the package has either been
             installed on a patched Linux kernel (i.e. an errata Linux
             kernel) or the wrong binary package has been installed.
Action:      Make sure the correct distribution package has been
             installed. If there is not a cpqhealth package for the
             installed Linux distribution, make sure the kernel source
             files have been installed as previously described in this
             document.
============================
Message:     "The Compaq Advanced Server Management module failed to load!"
             "Linux Kernel Symbol Conflict - Attemping rebuild to resolve."
Description: This is an indication that the package has either been
             installed on a patched Linux kernel (i.e.
             an errata Linux kernel) or the wrong binary package has
             been installed.
Action:      Make sure the correct distribution package has been
             installed. If there is not a cpqhealth package for the
             installed Linux distribution, make sure the kernel source
             files have been installed as previously described in this
             document. If the correct Linux kernel source is present,
             the boot scripts will attempt to automatically rebuild
             the cpqhealth module and reload the drivers.
============================
Message:     "WARNING! Not able to rebuild the cpqasm.o module on this kernel!"
             " Remove and install again the cpqhealth RPM to correct."
             " See /opt/compaq/cpqhealth/cpqhealth_boot.log"
Description: This is an indication that the package has either been
             installed on a patched Linux kernel (i.e. an errata Linux
             kernel) or the wrong binary package has been installed.
Action:      Make sure the correct distribution package has been
             installed. If there is not a cpqhealth package for the
             installed Linux distribution, make sure the kernel source
             files have been installed as previously described in this
             document. The errors located in
             /opt/compaq/cpqhealth/cpqhealth_boot.log will need to be
             reviewed and corrected. This may require "wrapper" file
             changes if the Linux kernel header files have been
             drastically modified in the installed distribution.
============================
Message:     "/lib/modules/${THIS_KERNEL}/build does not exist"
             "This is an indication that the sources for this kernel (${THIS_KERNEL}) are not loaded."
             "Please load the appropriate sources to rebuild module".
Description: The cpqhealth driver package follows a standard Linux 2.4
             kernel distribution. The Linux kernel source files must
             be loaded as indicated in the message. The "build"
             directory is actually a symbolic link and must exist.
Action:      Make sure the correct distribution package has been
             installed.
             If there is not a cpqhealth package for the installed
             Linux distribution, make sure the kernel source files
             have been installed as previously described in this
             document. If the correct Linux kernel source is present,
             the boot scripts will attempt to automatically rebuild
             the cpqhealth module and reload the drivers.
============================
Message:     "/lib/modules/${THIS_KERNEL}/build/include/linux/version.h does not exist"
             "Please load the appropriate sources to rebuild module".
Description: This message usually only occurs on SuSE Linux
             distributions because the file specified does not exist.
Action:      Make sure the correct distribution package has been
             installed. If there is not a cpqhealth package for the
             installed Linux distribution, make sure the kernel source
             files have been installed as previously described in this
             document. If the correct Linux kernel source is present,
             the boot scripts will attempt to automatically rebuild
             the cpqhealth module and reload the drivers. For SuSE
             distributions, there should be a file
             "/boot/vmlinuz.version.h" which needs to be moved to the
             directory listed in the message.
============================
Message:     "/lib/modules/${THIS_KERNEL}/build/include/linux/autoconf.h does not exist"
             "Please load the appropriate sources to rebuild module".
Description: This message usually only occurs on SuSE Linux
             distributions because the file specified does not exist.
Action:      Make sure the correct distribution package has been
             installed. If there is not a cpqhealth package for the
             installed Linux distribution, make sure the kernel source
             files have been installed as previously described in this
             document. If the correct Linux kernel source is present,
             the boot scripts will attempt to automatically rebuild
             the cpqhealth module and reload the drivers. For SuSE
             distributions, there should be a file
             "/boot/vmlinuz.autoconf.h" which needs to be moved to the
             directory listed in the message.
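
For the two SuSE cases above, the file placement might be scripted
along the following lines. This is a sketch under stated assumptions:
install_suse_headers is a hypothetical helper, the files are copied
rather than moved so the originals in /boot stay in place, and the
destination names version.h and autoconf.h are inferred from the paths
in the messages.

```shell
#!/bin/sh
# Sketch only: install_suse_headers is a hypothetical helper.
# Arguments: kernel release (default: `uname -r`), boot directory
# (default: /boot), modules root (default: /lib/modules).
install_suse_headers() {
    release="${1:-$(uname -r)}"
    bootdir="${2:-/boot}"
    modroot="${3:-/lib/modules}"
    dest="$modroot/$release/build/include/linux"

    if [ ! -d "$dest" ]; then
        echo "destination missing: $dest" >&2
        return 1
    fi
    for name in version.h autoconf.h; do
        src="$bootdir/vmlinuz.$name"
        if [ -f "$dest/$name" ]; then
            # Nothing to do, the header is already in place.
            echo "$name already present"
        elif [ -f "$src" ]; then
            # ASSUMPTION: copy instead of move, keeping the /boot original.
            cp -p "$src" "$dest/$name" || return 1
            echo "$name installed from $src"
        else
            echo "$name missing and $src not found" >&2
            return 1
        fi
    done
}
```

Run as root with no arguments to act on the running kernel. The
function leaves existing headers untouched, so it is safe to repeat.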
============================
Message:     "There does not appear to be kernel sources which match
             the current booting Linux kernel. There must be a
             directory named "/lib/modules/${THIS_KERNEL}" and there
             must be a valid directory linked to
             "/lib/modules/${THIS_KERNEL}/build"."
             "Please load the appropriate Linux sources to rebuild module".
Description: This is an indication that the package has either been
             installed on a patched Linux kernel (i.e. an errata Linux
             kernel) or the wrong binary package has been installed.
Action:      Make sure the correct distribution package has been
             installed. If there is not a cpqhealth package for the
             installed Linux distribution, make sure the kernel source
             files have been installed as previously described in this
             document. If the correct Linux kernel source is present,
             the boot scripts will attempt to automatically rebuild
             the cpqhealth module and reload the drivers. The
             cpqhealth RPM installation has failed and the RPM should
             be immediately removed.
============================
Message:     "cpqasm: You should also stop the driver otherwise the Automatic"
             " Server Recovery (ASR) Feature may reboot the server."
             " usage: rmmod cpqasm"
Description: When the driver is stopped using the /etc/init.d/cpqasm
             script, the ASR timer will continue to run. The cpqasm
             driver also needs to be unloaded to keep the server from
             automatically rebooting when the ASR timer expires.
Action:      Use the "rmmod cpqasm" command to unload the driver, or
             restart the daemon using the "/etc/init.d/cpqasm start"
             command.

Driver Messages

Most of the following messages will be seen prepended with "casm: " to
indicate that they are from the casm driver.

Message:     "There is no SHAFT Record in this ROM!"
Description: This is an indication of a ROM problem or that the
             cpqhealth driver has been loaded on an unsupported
             server.
Action:      Remove the cpqhealth package.
============================
Message:     "Health Envinronment Parsing failed!"
Description: This is an indication of a ROM internal table problem.
Action:      Please report this to Customer Service for follow up.
============================
Message:     "SHAFT Parsing failed!"
Description: This is an indication of a ROM internal table problem.
Action:      Please report this to Customer Service for follow up.
============================
Message:     "Neither SMBIOS or SIT is present!"
Description: This is an indication of a ROM internal table problem.
Action:      Please report this to Customer Service for follow up.
============================
Message:     "Unknown casmc_crom_ioctl Cmd: 0x%x"
Description: This message is displayed when an application such as the
             Compaq Insight Management (CIM) Agent makes a request of
             the cpqhealth driver which the cpqhealth driver does not
             understand.
Action:      The message usually indicates slightly reduced
             functionality for the application making the request.
             Check to see that the application and the cpqhealth
             driver are both at the latest release.
============================
Message:     "Unknown casmc_ecc_ioctl Cmd: 0x%x"
Description: This message is displayed when an application such as the
             Compaq Insight Management (CIM) Agent makes a request of
             the cpqhealth driver which the cpqhealth driver does not
             understand.
Action:      The message usually indicates slightly reduced
             functionality for the application making the request.
             Check to see that the application and the cpqhealth
             driver are both at the latest release.
============================
Message:     "Unknown casmc_asr_ioctl Cmd: 0x%x"
Description: This message is displayed when an application such as the
             Compaq Insight Management (CIM) Agent makes a request of
             the cpqhealth driver which the cpqhealth driver does not
             understand.
Action:      The message usually indicates slightly reduced
             functionality for the application making the request.
             Check to see that the application and the cpqhealth
             driver are both at the latest release.
============================
Message:     "Unknown casmc_event_ioctl Cmd: 0x%x"
Description: This message is displayed when an application such as the
             Compaq Insight Management (CIM) Agent makes a request of
             the cpqhealth driver which the cpqhealth driver does not
             understand.
Action:      The message usually indicates slightly reduced
             functionality for the application making the request.
             Check to see that the application and the cpqhealth
             driver are both at the latest release.
============================
Message:     "Monitoring of fan #%d has been disabled."
Description: Monitoring of the indicated fan has been disabled because
             the interrupt threshold was exceeded. This is an
             indication that the fan or the fan controller is
             generating spurious interrupts.
Action:      The fan specified in the message may need to be replaced.
============================
Message:     "Automatic Operating System Shutdown Initiated Due to
             Overheat Condition"
Description: This message is generated when either the internal
             temperature sensors or the storage controllers detect a
             critical thermal event.
Action:      The operating environment is too warm and requires better
             cooling.
============================
Message:     "Power supply %d revision is %d.%d, %d.%d is recommended."
Description: The power supply is of a version other than the
             recommended version.
Action:      Contact Compaq support to determine what needs to be
             done.
============================
Message:     "Monitoring of Health has been disabled."
Description: System Health is no longer being monitored.
Action:      Could be due to a hardware failure. Usually caused by a
             device interrupting the driver at a very fast rate.
============================
Message:     "Compaq Advanced Server Management driver will not be loaded."
Description: The casm driver cannot be initialized at this time due to
             a conflict in ROM internal tables.
Action:      Upgrade to the latest ROM version for this server, if
             available.
============================
Message:     "SHAFT and Patch signature strings do not match at byte #%d"
Description: Two tables internal to the ROM do not have matching
             signature strings. This is not necessarily an indication
             of a problem.
Action:      If the casm driver is loaded, no further action is
             required. Optionally, upgrade to the latest ROM version
             for this server, if available.
============================
Message:     "Temperature sensor #%d has been disabled."
Description: The indicated temperature sensor has been disabled.
Action:      Call Compaq Support for further assistance.
============================
Message:     "Monitoring of VRM #%d has been disabled."
Description: Monitoring of the indicated VRM has been disabled because
             the interrupt threshold was exceeded. This is an
             indication that the VRM is generating spurious
             interrupts.
Action:      The indicated VRM may need to be replaced.
============================
Message:     "Monitoring of power supply #%d has been disabled."
Description: Monitoring of the indicated power supply has been
             disabled because the interrupt threshold was exceeded.
             This is an indication that the power supply is generating
             spurious interrupts.
Action:      The indicated power supply may need to be replaced.
============================
Message:     "Approaching Dangerous Temperature. The %s Thermal
             Sensor(#%d) is reporting overheating conditions."
Description: A thermal sensor is reporting high temperatures. Thermal
             shutdown may be triggered if the temperature increases
             beyond the threshold.
Action:      The ambient temperature in the environment must be below
             35C. If this condition is met, there may be something
             blocking the air flow to the server. Check the front of
             the server for a blockage. If the Thermal Sensor
             indicates that this is a CPU, this may be an indication
             of an improperly mounted CPU Heat Sink. A failed fan
             could also lead to this condition in a warm environment.
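
When thermal warnings like the one above appear, the "/proc" entries
described earlier offer a quick look at sensors, fans and power
supplies. The following is a minimal sketch: show_health_proc is a
hypothetical helper, and because the layout of these files is not
documented here, it only prints them without parsing.

```shell
#!/bin/sh
# Sketch only: show_health_proc is a hypothetical helper.
# Argument: proc root (default: /proc), overridable for testing.
show_health_proc() {
    procroot="${1:-/proc}"
    found=0
    for entry in cpqtemp cpqfan cpqpwr; do
        if [ -r "$procroot/$entry" ]; then
            echo "== $procroot/$entry =="
            cat "$procroot/$entry"
            found=1
        else
            echo "== $procroot/$entry not available (is the driver loaded?) =="
        fi
    done
    # Succeed only if at least one entry could be read.
    [ "$found" -eq 1 ]
}
```

A non-zero return means none of the three entries was readable, which
usually suggests the cpqhealth driver is not loaded.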
============================
Message:     "A dangerous temperature condition has been detected by a
             %s Thermal Sensor(#%d)."
Description: The temperature has exceeded the threshold. Shutdown will
             occur.
Action:      None, shutdown will automatically occur.
============================
Message:     "Normal conditions have returned to a Thermal sensor
             (#%d) in the %s group. "
Description: The temperature has returned to a normal, non-dangerous
             level.
Action:      None
============================
Message:     "A redundant fan (fan #%d) in the %s group has failed "
Description: The indicated fan has failed. Its redundant backup will
             continue to run to provide adequate cooling for the
             server in a normal ambient environment.
Action:      Replace the failed fan. In many cases, the fan can be
             replaced without taking the server down.
============================
Message:     "A fan (fan #%d) in the %s group has failed."
Description: There are no more fans active in the indicated group. The
             system will be shut down.
Action:      None, system will automatically shut down.
============================
Message:     "A redundant fan (fan #%d) in the %s group has returned
             to normal."
Description: The indicated fan is functioning again.
Action:      If this message intermittently happens, this may be a
             sign of a fan about to fail.
============================
Message:     "A required system fan (fan #%d) in the %s group has
             failed "
Description: The indicated required fan has failed. The system will
             shut down.
Action:      None, shutdown will automatically occur.
============================
Message:     "The system is NOT configured to shutdown on non-critical
             thermal failures - (configurable via RBSU Utility)."
Description: This message accompanies the previous message if the
             system will not shut down for non-critical thermal
             failures.
Action:      If desired, configure the system via RBSU to shut down on
             non-critical failures. The RBSU Utility is usually
             executed by pressing the "F9" function key during POST
             when indicated.
============================
Message:     "A required system fan (fan #%d) in the %s group has
             returned to normal operation "
Description: The indicated fan is functioning again.
Action:      If this message intermittently happens, this may be a
             sign of a fan about to fail.
============================
Message:     "Non-critical Thermal Failure - System fan %d has failed."
Description: The indicated fan has failed, but the system is not in
             danger of overheating.
Action:      Replace the failed fan.
============================
Message:     "Non-critical Thermal Failure System fan %d has returned
             to normal."
Description: The indicated fan is now functioning.
Action:      If this message intermittently happens, this may be a
             sign of a fan about to fail.
============================
Message:     "A Critical fan (fan #%d) located in the %s has failed."
Description: The indicated fan has failed, and the system will be shut
             down.
Action:      None, the system will automatically shut down.
============================
Message:     "Fan %d located in %s has returned to normal operation."
Description: The indicated fan is now functioning.
Action:      If this message intermittently happens, this may be a
             sign of a fan about to fail.
============================
Message:     "The system of fans, located in the %s area, is no longer
             redundant. "
Description: The indicated fan system is not redundant. If another fan
             in this group fails, the system will shut down.
Action:      The faulty fan(s) should be replaced.
============================
Message:     "The system of fans, located in the %s area, is now
             redundant. "
Description: The indicated system of fans is now a redundant system.
Action:      If this message intermittently happens, this may be a
             sign of a fan about to fail.
============================
Message:     "Fan %d located in %s has been inserted."
Description: The indicated fan has been inserted.
Action:      None
============================
Message:     "Fan %d located in %s has been removed."
Description: The indicated fan is no longer present.
Action:      None required. Optionally, replace fan.
============================
Message:     "A Power Supply (Power Supply #%d) in the %s group is not
             providing power. Please confirm the power cord is
             correctly attached."
Description: Power has been lost to one or more power supplies.
Action:      Check power cord and power source.
============================
Message:     "A Power Supply (Power Supply #%d) in the %s group is not
             providing power. Due to an EPROM reading failure"
Description: The indicated power supply is not functioning.
Action:      Replace the power supply.
============================
Message:     "A Power Supply (Power Supply #%d) in the %s group is not
             providing power. Due to a failed internal power supply
             fan."
Description: The indicated power supply is not functioning because an
             internal power supply fan has failed.
Action:      Replace power supply.
============================
Message:     "A redundant Power Supply (Power Supply #%d) in the %s
             group has returned normal. "
Description: The indicated power supply is functioning.
Action:      If this message intermittently happens, this may be a
             sign of a power supply about to fail.
============================
Message:     "Power Supply system located in %s is no longer
             redundant. "
Description: Due to the failure or removal of a power supply, the
             indicated system is no longer redundant.
Action:      Replace, or add redundant power supply.
============================
Message:     "Power Supply system located in %s is now redundant. "
Description: The indicated power supply is now redundant.
Action:      None
============================
Message:     "Power Supply %d located in %s has been inserted. "
Description: The indicated power supply has been inserted.
Action:      None
============================
Message:     "Power Supply %d located in %s has been removed. "
Description: The indicated power supply is no longer present.
Action:      Replace power supply.
============================
Message:     "A Processor Power Module(#%d) has failed (slot %d,
             socket %d). The system will continue to operate."
Description: The indicated power module has failed, but the system
             will continue operation.
Action:      Replace the failed Power Module as soon as possible. This
             will require down time for the server.
============================
Message:     "A Processor Power Module(#%d) located in (slot %d,
             socket %d) has returned to normal operation. "
Description: The indicated power module is now functioning.
Action:      If this message is frequently displayed, this is an
             indication that the Processor Power Module is faulty.
============================
Message:     "Processor Power Module sub-system located in (slot %d,
             socket %d) is no longer redundant. "
Description: The indicated power module sub-system is not redundant,
             due to the removal or failure of a Processor Power
             Module.
Action:      Replace the faulty or missing module.
============================
Message:     "Processor Power Module sub-system located in (slot %d,
             socket %d) is now redundant."
Description: The indicated sub-system is now redundant.
Action:      None
============================
Message:     "A memory module has exceeded its threshold of
             correctable errors. Monitoring of ECC errors has been
             turned off. "
Description: ECC errors will no longer be monitored, due to an
             excessive number of memory errors.
Action:      The ECC memory may be faulty and may need to be replaced.
============================
Message:     "Excessive ECC memory errors detected and automatically
             corrected. Online Spare Memory engaged."
Description: A DIMM has exceeded the number of correctable errors
             allowed and the Advanced Memory Protection mechanism has
             engaged the dedicated spare DIMM.
Action:      The failed DIMM needs to be replaced. Review the
             Integrated Management Log (IML) for more detailed
             information.
============================
Message:     "A multi-bit memory error occurred on Memory Board %d.
             The memory board mirror has been engaged."
Description: A non-correctable error has occurred on a system
             configured with Advanced Memory Protection and
             additionally configured in a mirrored state. The memory
             subsystem is no longer redundant.
Action:      The faulty DIMMs on the failing board need to be
             replaced. See the Integrated Management Log (IML) for
             more detailed information.
============================
Message:     "Memory board %d has a configuration error."
Description: The memory board indicated is not configured correctly.
Action:      Make sure that all the memory on the memory board meets
             the specified requirements for the specific server. In
             most cases, the memory must be identical for multiple
             board configurations.
============================
Message:     "Excessive ECC memory errors detected and automatically
             corrected. Subsequent ECC memory errors will be corrected"
             "but not reported."
Description: Advanced Memory Protection is active on this system and
             the threshold for Single Bit Correctable Errors (SBCE)
             has been reached. If other ECC memory errors occur, they
             will be automatically corrected but no further logging
             will take place.
Action:      Replace the DIMM which was previously logged in the
             Integrated Management Log (IML).
============================
Message:     "(MCA) Processor BINIT in progress!"
Description: An Intel Processor Machine Check Architecture event has
             occurred.
Action:      The server will be forced down hard. The processor should
             be replaced.
============================
Message:     "ASR Shutdown has completed normally."
Description: The Automatic Server Recovery feature was able to
             gracefully terminate the operating system. This may not
             always be successful, depending on what initially
             triggered the ASR mechanism.
Action:      This message combined with other messages in the Compaq
             Integrated Management Log (IML) may assist in debugging
             the initial trigger of the ASR mechanism.
============================
Message: "Spurious interrupt: Feature %d has been previously disabled!"
Description: A feature (identified by a number) had already been disabled. There may have been one more event in the queue to be processed.
Action: This is usually the result of some other event (such as a fan failure). Once the previous event has been corrected, no other action is required.
============================
Message: "Feature %d has been disabled"
Description: This usually happens because a feature (or device) has exceeded its interrupt threshold limit.
Action: A previous message will have been displayed indicating that a device has exceeded its set threshold limit. The failing device should be replaced.

The following section lists the more common Non-Maskable Interrupt (NMI) errors. Other NMI errors may occur. In general, NMI errors are related to hardware, and customer support will need to be engaged to provide a solution.

============================
Message: "casm: NMI Handler has been called on processor %d!"
Description: This message is logged for all NMIs. If no other messages are logged or displayed, this may indicate an Uncorrectable Memory Error. These errors are difficult to log because the casm device driver code may itself be physically located on a failed DIMM.
Action: If no other messages are displayed, try moving the DIMMs to different slots and see whether the error recurs. Otherwise, check for subsequent messages, which will indicate the source of the problem.
============================
Message: "casm: Spinning for 2 seconds!"
Description: All NMIs are processed by the bootstrap processor. If an NMI is received on a processor other than the bootstrap processor, the casm driver will spin to allow the NMI to be processed.
Action: This message, along with other NMI messages, can help identify the problem that generated the NMI.
============================
Message: "NMI - Uncorrectable memory error - Hour %d - %d/%d/%d Bank %d DIMMs"
Description: The DIMMs in the indicated bank have generated an Uncorrectable Memory Error.
Action: Replace the failed DIMMs.
============================
Message: "NMI - Uncorrectable memory error - Hour %d - %d/%d/%d Slot: %d Module %d"
Description: The specific DIMM indicated in the message has generated an Uncorrectable Memory Error.
Action: Replace the failed DIMM.
============================
Message: "NMI - Automatic Server Recovery timer expiration - Hour %d - %d/%d/%d"
Description: The Advanced Server Management (ASM) watchdog timer has expired. This indicates either that a software application consumed all of the processor resources, so that the operating system was not able to schedule, or that a major event (such as a Non-Maskable Interrupt (NMI)) occurred and halted the operating system.
Action: Use the messages in the Integrated Management Log (IML) and the operating system event logs to determine what caused the operating system to cease functioning or to "lock up".
============================
Message: "NMI - Unexpected Slot Power Loss (Bus %d, dev %d, func %d) Hour %d - %d/%d/%d"
Description: This is the result of opening a PCI Hot Plug slot while the slot is powered on.
Action: If no PCI Hot Plug slot was opened, this could indicate a slot failure. Check the slot LEDs for proper operation.
============================
Message: "NMI - PCI Bus parity error (Bus %d, dev %d, func %d) Hour %d - %d/%d/%d"
Description: A PCI device has indicated that a parity error has occurred.
Action: The specified PCI device may be failing. If no other errors occurred before this error, the specified PCI device may have failed or be about to fail.
If other errors have occurred, this error needs to be analyzed in the context of the previous errors.
============================
Message: "NMI - Dump Switch has been pressed - Hour %d - %d/%d/%d"
Description: Some ProLiant servers have a "debug" switch that generates a Non-Maskable Interrupt (NMI). This message indicates that the switch was pressed.
Action: None.
============================
Message: "Unrecoverable Non-Maskable Interrupt (NMI) error"
Description: This is an NMI that the ProLiant server ROM was not able to "source". The cause is either a problem with the ROM code or a hardware failure in a product not shipped as part of the server (e.g. a third-party hardware device).
Action: Contact customer support for assistance.
============================
Message: "Unknown Non-Maskable Interrupt (NMI) error (0x%x) Hour %d - %d/%d/%d"
Description: This message indicates that an unknown NMI was generated. The hexadecimal value returned is an internal code from the server ROM that customer support can interpret.
Action: Contact customer support for assistance.

BUGS
    Limited Hardware Platforms
        This driver only works on Compaq ProLiant servers that have the Compaq Advanced Server Management (ASM) ASIC (PCI ID 0x0E11A0F0) or the Compaq iLO Advanced Server Management ASIC (PCI ID 0x0E11B203).
    Initialization time
        After it is inserted, the driver needs about one minute to get fully "situated". Specifically, faulty hardware that returns to normal might not be recognized as "working" within the first minute of operation.

FILES
    /opt/compaq/cpqhealth
        Default directory for the scripts and binaries. There are sub-directories for the cpqasm and cpqevt drivers, and further sub-directories for each supported Linux kernel.
    /opt/compaq/cpqhealth/custom_cpqhealth.sh
        The shell script that rebuilds and repackages the cpqhealth driver.
    /opt/compaq/cpqhealth/cpqhealth_boot.log
        A log file containing the results of the last boot of the system.
        The RPM errors are also logged here.
    /etc/init.d/cpqasm
        This file is linked to the multiuser init-state directories and controls the loading of the cpqasm and cpqevt drivers. This script determines whether the drivers need to be rebuilt.

SEE ALSO
    cpqimlview(8)
    www.compaq.com/support/files/server/us/locOsCat/70.html
    www.compaq.com/products/software/linux/index.html

AUTHOR
    Compaq Computer Corporation.

23 July 2002 cpqhealth(4)
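
The Message entries in this page are printf-style templates, so an administrator who wants to pick driver messages out of a saved console or system log can convert a template into a regular expression. The following Python sketch is only an illustration and is not part of the cpqhealth package: the template selection, the short names such as pci_parity, the function names, and the sample log lines are all assumptions made for the example. Only the template strings themselves are taken from this page.

```python
import re

# Templates copied from the Message entries of this page. The short names
# on the left are hypothetical labels chosen for this example.
TEMPLATES = {
    "pci_parity": "NMI - PCI Bus parity error (Bus %d, dev %d, func %d) Hour %d - %d/%d/%d",
    "asr_timeout": "NMI - Automatic Server Recovery timer expiration - Hour %d - %d/%d/%d",
    "feature_disabled": "Feature %d has been disabled",
    "unknown_nmi": "Unknown Non-Maskable Interrupt (NMI) error (0x%x) Hour %d - %d/%d/%d",
}

# Regex fragment and value converter for each placeholder type used above.
PLACEHOLDERS = {
    "%d": (r"(-?\d+)", int),
    "%x": (r"([0-9a-fA-F]+)", lambda s: int(s, 16)),
}


def compile_template(template):
    """Turn a printf-style template into (compiled regex, converters)."""
    pattern = ""
    converters = []
    # The capturing group keeps the placeholders in the split result.
    for part in re.split(r"(%d|%x)", template):
        if part in PLACEHOLDERS:
            fragment, converter = PLACEHOLDERS[part]
            pattern += fragment
            converters.append(converter)
        else:
            # Literal text: escape it so "(" and "-" are matched verbatim.
            pattern += re.escape(part)
    return re.compile(pattern), converters


COMPILED = {name: compile_template(t) for name, t in TEMPLATES.items()}


def classify(line):
    """Return (name, values) for the first template found in line, else None."""
    for name, (regex, converters) in COMPILED.items():
        match = regex.search(line)
        if match:
            values = [conv(text) for conv, text in zip(converters, match.groups())]
            return name, values
    return None


if __name__ == "__main__":
    # Made-up log line; the "kernel: " prefix is an assumption.
    print(classify("kernel: NMI - PCI Bus parity error (Bus 2, dev 4, func 0) Hour 13 - 7/23/2002"))
```

The search is unanchored so that a syslog prefix in front of the message does not prevent a match. The values come back in the order in which the placeholders appear in the template, and a line that matches none of the listed templates yields None.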