cpqhealth(4)                                                      cpqhealth(4)

NAME
       cpqhealth - hp ProLiant Advanced System Management Driver

SYNOPSIS
       /etc/init.d/cpqasm [start | stop | status]

DESCRIPTION
       The hp ProLiant Advanced Server Management Driver collects and monitors
       important operational data on your server to ensure that the system is
       operating nominally. Any abnormal conditions are logged into the Non
       Volatile RAM (NVRAM) Integrated Management Log (IML).

       ProLiant Servers are equipped with hardware and firmware to monitor
       certain abnormal conditions such as abnormal temperature readings, fan
       failures, ECC memory errors, etc. The cpqhealth driver monitors these
       conditions and reports them to the administrator by printing a message
       on the console and by logging the condition into the ProLiant
       Integrated Management Log (IML). The Insight Manager 7 agents can also
       be used to notify the administrator of abnormal conditions.

       The following is a list of features supported by the hp ProLiant
       Advanced System Management Driver:

       Monitoring abnormal temperature conditions
              If the normal operating temperature is exceeded, or a cooling
              fan fails, the hp ProLiant Advanced Server Management driver
              does the following:

              * Displays a message to the console stating the problem.
              * Makes an entry in the Integrated Management Log (IML).
              * Shuts the system down (optionally) to avoid hardware damage.
                Use hp ProLiant ROM Based Setup (System Configuration)
                Utility (RBSU) to control the option.

       Monitoring fan failures
              If a cooling fan fails, the hp ProLiant Advanced Server
              Management driver does the following:

              * Displays a message to the console stating the problem.
              * Makes an entry in the Integrated Management Log (IML).
              * Shuts the system down (optionally) to avoid hardware damage.
                Use hp ProLiant ROM Based Setup (System Configuration)
                Utility (RBSU) to control the option.

       Monitoring the system Fault Tolerant Power Supply
              If the primary power supply fails, the system automatically
              switches over to a backup power supply.
              The hp ProLiant Advanced System Management driver does the
              following:

              * Displays a message to the console stating the problem.
              * Makes an entry in the Integrated Management Log (IML).

       Monitoring ECC memory errors
              If an ECC memory error occurs, the hp ProLiant Advanced System
              Management driver logs the error in the health log, including
              the address that caused the error. If too many errors occur at
              the same memory location, the driver disables the ECC error
              interrupts to prevent flooding the console with warnings (the
              hardware automatically corrects the ECC error).

       Automatic Server Recovery (ASR)
              The Automatic Server Recovery is implemented using a
              "heartbeat" timer that continually counts down. The driver
              frequently reloads the counter to prevent it from counting
              down to zero. If the ASR timer counts down to 0, it is assumed
              that the operating system is locked up and the system
              automatically attempts to reboot.

              Events which may contribute to the operating system locking up
              include:

              * A peripheral device (such as a PCI adapter) failing in such
                a way that numerous spurious interrupts are generated.
              * A high priority software application consuming all the
                available CPU cycles and not allowing the operating system
                scheduler to run the ASR timer reset process.
              * A software or kernel application consuming all available
                memory including the virtual memory space (i.e. swap). This
                may cause the operating system scheduler to cease
                functioning.
              * A critical operating system component such as a file system
                failing and causing the operating system scheduler to cease
                functioning.
              * Any other event besides an ASR timeout which causes a
                Non-Maskable Interrupt (NMI) to be generated.

              The ProLiant ASR feature is a hardware based timer. If a true
              hardware failure occurs, the ProLiant Advanced Server
              Management driver might not be called but the server will be
              reset as if the power switch was pressed.
              The ProLiant ROM code may log an event to the ProLiant
              Integrated Management Log (IML) when the server reboots.

              The ProLiant Advanced Server Management driver is notified via
              a Non-Maskable Interrupt (NMI). If possible, the driver will
              attempt to perform the following actions:

              * Display a message on the console stating the problem.
              * Make an entry in the ProLiant Integrated Management Log
                (IML).
              * Attempt to gracefully shut down the operating system to
                close the file systems.

              There is no guarantee that the operating system will shut down
              gracefully. This depends on the type (software or hardware)
              and severity of the error condition. There is more information
              about the ProLiant Automatic Server Recovery (ASR) feature
              later on in this document.

   Getting the status of the ProLiant Server
       There are multiple ways to get the operational status of the ProLiant
       server. The ideal way is to load the Insight Manager 7 agents and use
       a tool such as HP OpenView or Insight Manager 7 to monitor the status
       of all the ProLiant servers. For those customers who do not have
       automatic monitoring tools, the servers can be checked using a
       standard Web browser as long as the Insight Manager 7 agents have been
       installed. The Insight Manager 7 Web Agent responds on ports 2301 and
       2381 (the latter if the browser supports SSL encryption). For example,
       the browser could be pointed to: http://192.1.1.20:2301 or
       http://localhost:2301. Note that the "http://" is required. Until the
       agents are customized, the user name and password are both
       "administrator". The Insight Manager 7 Web Agent allows the
       administrator to remotely view the IML log and individual feature
       (i.e. temperature) status. Other ProLiant Server specific information
       is also available.

       The Insight Manager 7 Agents may currently be obtained at:

              www.compaq.com/support/files/server/us/

       This link may change in the future as a result of the HP - Compaq
       consolidation.
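       As a minimal sketch of the browser check described above, the
       following shell function prints the two addresses at which the Web
       Agent would be expected to answer for a given host. The function name
       agent_urls is hypothetical, and the use of the "https" scheme on port
       2381 is an assumption based on the SSL note above, not something this
       document states.

```shell
# Hypothetical helper: print the Web Agent addresses for a host.
# Port 2301 is plain HTTP (per the text above); port 2381 is ASSUMED
# to use the https scheme because it is the SSL port.
agent_urls() {
    host="${1:-localhost}"
    printf 'http://%s:2301\n' "$host"
    printf 'https://%s:2381\n' "$host"
}
```

       For example, "agent_urls 192.1.1.20" prints the address used in the
       example above followed by its SSL counterpart; either can then be
       pasted into a browser.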
   The UID (blue) Light utility (/sbin/hpuid)
       There is a utility, /sbin/hpuid, which allows a user to:

       * Turn on the UID (blue) light
       * Turn off the UID (blue) light
       * Get the status of the UID (blue) light

       You must be logged on as the "root" user. Enter "hpuid" at the command
       line prompt to display the parameter definitions. An example script
       showing how to use the hpuid utility is located in
       /opt/compaq/cpqhealth/hpuid_example.sh. Note that the UID light is not
       available on all ProLiant servers.

   The "/proc" file system entries
       There are also "/proc" file entries available to allow quick checks to
       be made.

       * "/proc/cpqtemp" shows the current temperature and the threshold
         levels of all temperature sensors.
       * "/proc/cpqfan" shows the current status of all fans.
       * "/proc/cpqpwr" shows the current status of all power supplies.

       There is a graphical maintenance utility named cpqimlview(8). The
       cpqimlview utility can be run in the graphical (X11) interface for
       full functionality; a limited text based (ncurses) version is
       available for use on "Blade" servers or Telnet sessions. The
       cpqimlview utility will automatically start the correct IML viewer
       based on the terminal type. See the man page on the cpqimlview(8)
       utility for more information.

       Most errors which are logged to the NVRAM based Integrated Management
       Log (IML) are also logged to the standard "messages" file (i.e.
       /var/log/messages).

   Installing on patched Linux Kernels and remote deployment
       The cpqhealth driver has been designed to work with patched Linux
       kernels. There is a single source file which can compile and link
       against the patched Linux kernel sources. Additionally, a shell script
       has been provided to aid in the packaging of the driver into a new RPM
       for remote deployment. This was done to allow customers to build once
       and deploy many times to servers which may not have build tools
       available.
       If the server has the build tools and the source files for the patched
       Linux kernel, the boot time scripts will automatically attempt to
       rebuild the driver and install it. Errors will be displayed on the
       screen and logged to /opt/compaq/cpqhealth/cpqhealth_boot.log if the
       driver can not be built on a patched Linux kernel.

       The requirements to build and deploy are:

       The sources for the "Patched" Linux kernel must be loaded
              The sources for the "Patched" Linux kernel must be loaded on
              the system.

       The build environment must be properly created
              The build scripts provided expect a standard Linux 2.4 kernel
              build environment. The sources should be linked to the
              "/lib/modules/`uname -r`/build" directory. The command
              "ls -ld /lib/modules/`uname -r`/build" should point to where
              the patched Linux kernel sources were loaded. Additionally,
              the standard build tools such as the gcc compiler and make
              must also be loaded.

       To create a custom cpqhealth RPM package, perform the following steps:

       * Load the patched Linux kernel sources and development tools.
       * Make sure that a directory which corresponds to the output of
         "uname -r" exists in the "/lib/modules" directory.
       * Make sure that a link named "build" in the
         "/lib/modules/`uname -r`/" directory points to the correct kernel
         source directory. You can validate this by making sure that the
         file "/lib/modules/`uname -r`/build/include/linux/version.h" has a
         version which matches the output of the "uname -r" command.
       * If all of the above conditions are met, you are ready to build.
         Run the shell script "sh custom_cpqhealth.sh". This script will
         create a new RPM SPEC file and attempt to build the driver. If the
         build of the driver is successful, a new RPM package will be
         created and copied to the /opt/compaq/cpqhealth directory. This
         package can then be deployed in the usual way.
       * Typical errors

         Usually there will be compiler or linker warnings which indicate
         that kernel drivers should not use regular header files.
         This is an indication that the kernel sources are not loaded or not
         installed correctly. You need to make sure the "version.h" file
         listed above matches the output of "uname -r". The initialization
         script "/etc/init.d/cpqasm" does check to see if the
         "/lib/modules/`uname -r`/build/include/linux/version.h" file exists
         and matches the output of "uname -r". If you are building a custom
         kernel, you are responsible for making sure the correct "version.h"
         file is created and located in the correct kernel header file
         directory (i.e.
         "/lib/modules/`uname -r`/build/include/linux/version.h").

       If all the above conditions are met, you might want to try building
       the kernel. If the kernel can not be built successfully, this is an
       indication that the particular kernel release may have some issues.

   cpqhealth error messages
       The next few sections of this document are dedicated to error
       messages. The best way to locate a particular message is to search for
       it using the "/" key. The searches are case sensitive and require an
       exact match. The errors are categorized into installation,
       compatibility, general, temperature, fan, power supply, memory,
       automatic server recovery and critical server type errors.

   cpqhealth installation messages
       The following messages will be logged in the
       /opt/compaq/cpqhealth/cpqhealth_boot.log or the
       /opt/compaq/cpqhealth/cpqhealth_boot.log.old. These messages are
       logged for the RPM installation of the cpqhealth module as well as
       when booting the Linux operating system.

       The "/var/log/messages", the output from "dmesg", the
       "/opt/compaq/cpqhealth/cpqhealth_boot.log" and the
       "/opt/compaq/cpqhealth/cpqhealth_boot.log.old" files should always be
       sent with any queries concerning installation or rebuilding issues.

       Message:     "WARNING: cpqasm: casmd already running!"
                    " You must stop the process first."
                    " usage: /etc/init.d/cpqasm stop"
       Description: This is an indication that the /etc/init.d/cpqasm script
                    was run multiple times with the "start" parameter.
       Action:      None.
       ============================
       Message:     "The hp ProLiant Event Logging module is not available"
                    "for this Linux kernel: ${THIS_KERNEL}"
       Description: This is an indication that the package has either been
                    installed on a patched Linux kernel (i.e. an errata
                    Linux kernel) or the wrong binary package has been
                    installed.
       Action:      Make sure the correct distribution package has been
                    installed. If there is not a cpqhealth package for the
                    installed Linux distribution, make sure the kernel
                    source files have been installed as previously described
                    in this document.
       ============================
       Message:     "The hp ProLiant Event Logging module failed to load!"
                    "Linux Kernel Symbol Conflict - Attempting rebuild to
                    resolve"
       Description: This is an indication that the package has either been
                    installed on a patched Linux kernel (i.e. an errata
                    Linux kernel) or the wrong binary package has been
                    installed.
       Action:      Make sure the correct distribution package has been
                    installed. If there is not a cpqhealth package for the
                    installed Linux distribution, make sure the kernel
                    source files have been installed as previously described
                    in this document. If the correct Linux kernel source is
                    present, the boot scripts will attempt to automatically
                    rebuild the cpqhealth module and reload the drivers.
                    There will be a message at the end of the rebuild
                    process if the modules load successfully.
       ============================
       Message:     "WARNING! Not able to rebuild the cpqevt.o module on
                    this kernel!"
                    " See /opt/compaq/cpqhealth/cpqhealth_boot.log for
                    details."
       Description: This is an indication that the package has either been
                    installed on a patched Linux kernel (i.e. an errata
                    Linux kernel) or the wrong binary package has been
                    installed. In either case, there was no source
                    available or there were compilation / linker errors.
       Action:      Make sure the correct distribution package has been
                    installed.
                    If there is not a cpqhealth package for the installed
                    Linux distribution, make sure the kernel source files
                    have been installed as previously described in this
                    document. The errors located in
                    /opt/compaq/cpqhealth/cpqhealth_boot.log will need to be
                    reviewed and corrected. This may require "wrapper" file
                    changes if the Linux kernel header files have been
                    drastically modified in the installed distribution.
       ============================
       Message:     "The hp ProLiant Advanced Server Management module is
                    not available"
                    "for this Linux kernel: ${THIS_KERNEL}"
       Description: This is an indication that the package has either been
                    installed on a patched Linux kernel (i.e. an errata
                    Linux kernel) or the wrong binary package has been
                    installed.
       Action:      Make sure the correct distribution package has been
                    installed. If there is not a cpqhealth package for the
                    installed Linux distribution, make sure the kernel
                    source files have been installed as previously described
                    in this document.
       ============================
       Message:     "The hp ProLiant Advanced Server Management module
                    failed to load!"
                    "Linux Kernel Symbol Conflict - Attempting rebuild to
                    resolve."
       Description: This is an indication that the package has either been
                    installed on a patched Linux kernel (i.e. an errata
                    Linux kernel) or the wrong binary package has been
                    installed.
       Action:      Make sure the correct distribution package has been
                    installed. If there is not a cpqhealth package for the
                    installed Linux distribution, make sure the kernel
                    source files have been installed as previously described
                    in this document. If the correct Linux kernel source is
                    present, the boot scripts will attempt to automatically
                    rebuild the cpqhealth module and reload the drivers. A
                    message will be displayed at the end of the rebuild
                    process if the module is successfully loaded.
       ============================
       Message:     "WARNING! Not able to rebuild the cpqasm.o module on
                    this kernel!"
                    " See /opt/compaq/cpqhealth/cpqhealth_boot.log for
                    details."
       Description: This is an indication that the package has either been
                    installed on a patched Linux kernel (i.e. an errata
                    Linux kernel) or the wrong binary package has been
                    installed.
       Action:      Make sure the correct distribution package has been
                    installed. If there is not a cpqhealth package for the
                    installed Linux distribution, make sure the kernel
                    source files have been installed as previously described
                    in this document. The errors located in
                    /opt/compaq/cpqhealth/cpqhealth_boot.log will need to be
                    reviewed and corrected. This may require "wrapper" file
                    changes if the Linux kernel header files have been
                    drastically modified in the installed distribution.
       ============================
       Message:     "/lib/modules/${THIS_KERNEL}/build does not exist"
                    "This is an indication that the sources for this kernel
                    (${THIS_KERNEL}) are not loaded."
                    "Please load the appropriate sources to rebuild module".
       Description: The cpqhealth driver package follows a standard Linux
                    2.4 kernel distribution. The Linux kernel source files
                    must be loaded as indicated in the message. The "build"
                    directory is actually a symbolic link and must exist.
       Action:      Make sure the correct distribution package has been
                    installed. If there is not a cpqhealth package for the
                    installed Linux distribution, make sure the kernel
                    source files have been installed as previously described
                    in this document. If the correct Linux kernel source is
                    present, the boot scripts will attempt to automatically
                    rebuild the cpqhealth module and reload the drivers.
       ============================
       Message:     "/lib/modules/${THIS_KERNEL}/build/include/linux/version.h
                    does not exist"
                    "Please load the appropriate sources to rebuild module".
       Description: This message usually only occurs on SuSE Linux
                    distributions because the file specified does not exist.
       Action:      Make sure the correct distribution package has been
                    installed.
                    If there is not a cpqhealth package for the installed
                    Linux distribution, make sure the kernel source files
                    have been installed as previously described in this
                    document. If the correct Linux kernel source is present,
                    the boot scripts will attempt to automatically rebuild
                    the cpqhealth module and reload the drivers. For SuSE
                    distributions, there should be a file
                    "/boot/vmlinuz.version.h" which needs to be moved to the
                    directory listed in the message.
       ============================
       Message:     "/lib/modules/${THIS_KERNEL}/build/include/linux/autoconf.h
                    does not exist"
                    "Please load the appropriate sources to rebuild module".
       Description: This message usually only occurs on SuSE Linux
                    distributions because the file specified does not exist.
       Action:      Make sure the correct distribution package has been
                    installed. If there is not a cpqhealth package for the
                    installed Linux distribution, make sure the kernel
                    source files have been installed as previously described
                    in this document. If the correct Linux kernel source is
                    present, the boot scripts will attempt to automatically
                    rebuild the cpqhealth module and reload the drivers. For
                    SuSE distributions, there should be a file
                    "/boot/vmlinuz.autoconf.h" which needs to be moved to
                    the directory listed in the message.
       ============================
       Message:     "There does not appear to be kernel sources which match
                    the current booting Linux kernel. There must be a
                    directory named "/lib/modules/${THIS_KERNEL}" and there
                    must be a valid directory linked to
                    "/lib/modules/${THIS_KERNEL}/build"."
                    "Please load the appropriate Linux sources to rebuild
                    module".
       Description: This is an indication that the package has either been
                    installed on a patched Linux kernel (i.e. an errata
                    Linux kernel) or the wrong binary package has been
                    installed.
       Action:      Make sure the correct distribution package has been
                    installed.
                    If there is not a cpqhealth package for the installed
                    Linux distribution, make sure the kernel source files
                    have been installed as previously described in this
                    document. If the correct Linux kernel source is present,
                    the boot scripts will attempt to automatically rebuild
                    the cpqhealth module and reload the drivers. The
                    cpqhealth RPM installation has failed and the RPM should
                    be immediately removed.
       ============================
       Message:     "cpqasm: You should also stop the driver otherwise the
                    Automatic"
                    " Server Recovery (ASR) Feature may reboot the server."
                    " usage: rmmod cpqasm"
       Description: When the driver is stopped using the /etc/init.d/cpqasm
                    script, the ASR timer will continue to run. The cpqasm
                    driver needs to also be terminated to keep the server
                    from automatically shutting down when the ASR timer
                    expires.
       Action:      Use the "rmmod cpqasm" command to unload the driver or
                    restart the daemon using the /etc/init.d/cpqasm start
                    command.
       ============================
       Message:     "hp ProLiant Advanced Server Management driver will not
                    be loaded."
       Description: The casm driver cannot be initialized at this time due
                    to a conflict in ROM internal tables or the server is
                    not supported. This driver is only supported on servers
                    which have the ProLiant Advanced Server Management ASIC
                    (PCI identifier 0x0e11a0f0) or the ProLiant Integrated
                    Lights Out Management ASIC (PCI identifier 0x0e11b203).
                    No other ProLiant servers are supported.
       Action:      Check to see that the appropriate ProLiant Advanced
                    Server Management ASIC is present. This can be done by
                    using the following commands:

                    cat /proc/bus/pci/devices | grep -i 0e11a0f0
                    cat /proc/bus/pci/devices | grep -i 0e11b203

                    One of these commands must succeed and return
                    information. You might also check to see if a later ROM
                    version is available for this server.

   Driver Messages For Configuration Or Compatibility Issues
       Most of the following messages will be seen prepended with "casm: " to
       indicate that they are from the casm driver.
       This section deals with driver initialization issues as the driver is
       loaded.

       Message:     "Detected %d Physical/Logical processors installed"
                    "but only %d recognized by operating system!"
       Description: The casm driver is able to take an inventory of the
                    processors physically present to compare against the
                    number of processors the Linux operating system detects
                    (or recognizes). If there are more processors available
                    than what is recognized, this message is displayed. The
                    casm driver will continue to operate normally.
       Action:      There are multiple reasons for this message to be
                    generated. The most common reason is that a single
                    processor Linux kernel is installed on a multiprocessor
                    server. Other reasons include the APIC table setting. On
                    multiprocessor servers the APIC setting should be "Full
                    Table - Mapped" or "Full Table". This setting can be
                    checked via the ROM Based Setup Utility (RBSU) available
                    during POST when the server is booted (usually the "F9"
                    key prompt) or by reviewing the "/proc/casmdbug" file.
                    Please note that the "/proc/casmdbug" file is primarily
                    designed for developer debug and is subject to change.
                    The features in this file are not very useful without
                    full hardware system specifications for the ProLiant
                    server. The "top(1)" utility can be used to review the
                    number of processors the operating system recognizes.
                    The "/proc/cpuinfo" file can also be used to determine
                    how many processors the Linux operating system
                    recognizes (or has enabled). A review of the operating
                    system "boot" messages (such as boot.log, which is
                    usually logged in the "/var/log" directory) may also
                    provide some insight as to why the Linux operating
                    system fails to recognize all the processors present in
                    the server.
       ============================
       Message:     "There is no SHAFT Record in this ROM!"
       Description: This is an indication of a ROM problem or that the
                    cpqhealth driver has been loaded on an unsupported
                    server.
       Action:      Remove the cpqhealth package.
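       For the processor-count message earlier in this section, the number of
       processors the operating system recognizes can be counted from the
       "/proc/cpuinfo" file. The following is a minimal sketch only; the
       function name count_cpus is hypothetical, and it assumes that each
       recognized processor appears as one line beginning with "processor",
       as is the case on x86 Linux kernels.

```shell
# Hypothetical helper: count the processors listed in a cpuinfo file.
# ASSUMPTION: one line starting with "processor" per recognized CPU.
# An optional argument allows a saved copy of the file to be examined.
count_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}
```

       Running "count_cpus" with no argument reads the live "/proc/cpuinfo"
       file; the result can be compared by hand with the second number in
       the driver message.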
       ============================
       Message:     "Health Environment Parsing failed!"
       Description: This is an indication of a ROM internal table problem.
       Action:      Please report this to Customer Service for follow up.
       ============================
       Message:     "SHAFT Parsing failed!"
       Description: This is an indication of a ROM internal table problem.
       Action:      Please report this to Customer Service for follow up.
       ============================
       Message:     "SHAFT and Patch signature strings do not match at byte
                    #%d"
       Description: Two tables internal to the ROM do not have matching
                    signature strings. This is not necessarily an indication
                    of a problem.
       Action:      If the casm driver is loaded no further action is
                    required. Optionally, one could upgrade to the latest
                    ROM version for this server, if available.
       ============================
       Message:     "Neither SMBIOS or SIT is present!"
       Description: This is an indication of a ROM internal table problem.
       Action:      Please report this to Customer Service for follow up.
       ============================
       Message:     "Unknown casmc_crom_ioctl Cmd: 0x%x"
       Description: This message is displayed when an application such as
                    the Insight Manager 7 Agent makes a request of the
                    cpqhealth driver which the cpqhealth driver does not
                    understand.
       Action:      The message usually indicates slightly reduced
                    functionality for the application making the request.
                    Check to see that the application and the cpqhealth
                    driver are both at the latest release.
       ============================
       Message:     "Unknown casmc_ecc_ioctl Cmd: 0x%x"
       Description: This message is displayed when an application such as
                    the Insight Manager 7 Agent makes a request of the
                    cpqhealth driver which the cpqhealth driver does not
                    understand.
       Action:      The message usually indicates slightly reduced
                    functionality for the application making the request.
                    Check to see that the application and the cpqhealth
                    driver are both at the latest release.
       ============================
       Message:     "Unknown casmc_asr_ioctl Cmd: 0x%x"
       Description: This message is displayed when an application such as
                    the Insight Manager 7 Agent makes a request of the
                    cpqhealth driver which the cpqhealth driver does not
                    understand.
       Action:      The message usually indicates slightly reduced
                    functionality for the application making the request.
                    Check to see that the application and the cpqhealth
                    driver are both at the latest release.
       ============================
       Message:     "Unknown casmc_event_ioctl Cmd: 0x%x"
       Description: This message is displayed when an application such as
                    the Insight Manager 7 Agent makes a request of the
                    cpqhealth driver which the cpqhealth driver does not
                    understand.
       Action:      The message usually indicates slightly reduced
                    functionality for the application making the request.
                    Check to see that the application and the cpqhealth
                    driver are both at the latest release.

   Driver Messages For General Environmental Issues
       Most of the following messages will be seen prepended with "casm: " to
       indicate that they are from the casm driver. This section deals with
       driver general environment monitoring events. Specific Environment
       Issues follow this section.
       ============================
       Message:     "Monitoring of fan #%d has been disabled."
       Description: Monitoring of the indicated fan has been disabled
                    because the interrupt threshold was exceeded. This is an
                    indication that the fan or the fan controller is
                    generating spurious interrupts.
       Action:      The fan specified in the message may need to be
                    replaced.
       ============================
       Message:     "Power supply %d revision is %d.%d, %d.%d is
                    recommended."
       Description: The power supply is of a version other than the
                    recommended version.
       Action:      Contact Hewlett-Packard ProLiant support to determine
                    what needs to be done.
       ============================
       Message:     "Monitoring of Health has been disabled."
       Description: System Health is no longer being monitored.
       Action:      This could be due to a hardware failure. It is usually
                    caused by a device interrupting the driver at a very
                    fast rate.
       ============================
       Message:     "Temperature sensor #%d has been disabled."
       Description: The indicated temperature sensor has been disabled.
       Action:      Call Hewlett-Packard ProLiant Support for further
                    assistance.
       ============================
       Message:     "Monitoring of VRM #%d has been disabled."
       Description: Monitoring of the indicated VRM has been disabled
                    because the interrupt threshold was exceeded. This is an
                    indication that the VRM is generating spurious
                    interrupts.
       Action:      The indicated VRM may need to be replaced.
       ============================
       Message:     "Monitoring of power supply #%d has been disabled."
       Description: Monitoring of the indicated power supply has been
                    disabled because the interrupt threshold was exceeded.
                    This is an indication that the power supply is
                    generating spurious interrupts.
       Action:      The indicated power supply may need to be replaced.
       ============================
       Message:     "Spurious interrupt: Feature %d has been previously
                    disabled!"
       Description: This is an indication that a feature (identified by a
                    number) had been disabled. There may have been one more
                    event in the queue to be processed.
       Action:      This is usually the result of some other event (such as
                    a fan failure). Once the previous event has been
                    corrected, no other action will be required.
       ============================
       Message:     "Feature %d has been disabled"
       Description: This is usually because a feature (or device) has
                    exceeded its interrupt threshold limit.
       Action:      A previous message will have been displayed indicating
                    that a device has exceeded its set threshold limit. The
                    failing device should be replaced.
       ============================
       Message:     "The system is NOT configured to shutdown on
                    non-critical thermal failures - (configurable via RBSU
                    Utility)."
       Description: This message accompanies the previous message if the
                    system will not shut down for non-critical thermal
                    failures.
       Action:      If desired, configure the system via RBSU to shut down
                    on non-critical failures. The RBSU Utility is usually
                    executed by pressing the "F9" function key during POST
                    when indicated.

   Driver Messages For Temperature Violations:
       Most of the following messages will be seen prepended with "casm: " to
       indicate that they are from the casm driver. This section deals with
       detected temperature violations. Note that there may be multiple
       messages each giving slightly different details (such as location) but
       all having similar causes. Events which are corrected will use the
       same message with the phrase "has been repaired" appended to the end
       of the message. This simplifies matching failures with corrections in
       the system message logs.
       ============================
       Message:     "Approaching Dangerous Temperature. The %s Thermal
                    Sensor(#%d) is reporting overheating conditions."
       Description: A thermal sensor is reporting high temperatures. Thermal
                    shutdown may be triggered if the temperature increases
                    beyond the threshold.
       Action:      The ambient temperature in the environment must be below
                    35C. If this condition is met, there may be something
                    blocking the air flow to the server. If the Thermal
                    Sensor indicates that this is a CPU, this may be an
                    indication of an improperly mounted CPU Heat Sink. Check
                    the front of the server for a blockage. A failed fan
                    could also lead to this condition in a warm environment.
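       Because corrected events reuse the original message with "has been
       repaired" appended, failures and their corrections can be paired up
       from the system message log. The following is a minimal sketch under
       stated assumptions: the function name casm_events and the REPORTED and
       REPAIRED labels are hypothetical, and it assumes the driver messages
       appear in the log with the "casm: " prefix described above.

```shell
# Hypothetical helper: list casm driver messages from a log file and
# label each one as a reported event or a repair notice.
# ASSUMPTION: driver messages carry the "casm: " prefix in the log.
casm_events() {
    logfile="${1:-/var/log/messages}"
    grep 'casm: ' "$logfile" | while IFS= read -r line; do
        case "$line" in
            *"has been repaired"*) printf 'REPAIRED  %s\n' "$line" ;;
            *)                     printf 'REPORTED  %s\n' "$line" ;;
        esac
    done
}
```

       Running "casm_events" with no argument reads /var/log/messages, which
       normally requires root privileges; a REPORTED line with no later
       matching REPAIRED line points to a condition that is still
       outstanding.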
       ============================
       Message:     "System Overheating (Zone %s, Location %s, Temperature
                    %s)"
                    "External Chassis Overheating (Chassis %s, Zone %s,
                    Location %s, Temperature%s)"
                    "Internal Storage System Overheating (%sSlot %s, Zone
                    %s, Location %s, Temperature %s)"
                    "Server Blade Enclosure Overheating (Zone %s, Location
                    %s, Temperature %s, %s)"
                    "Power Enclosure Overheating (Zone %s, Location %s,
                    Temperature %s, %s)"
       Description: This message indicates that the indicated location in
                    the system is overheating. Another message will be
                    displayed if a system shutdown will occur.
       Action:      On some servers the fans will increase to full speed in
                    an attempt to cool the server. If the server does not
                    cool down within 60 seconds, the operating system will
                    most likely be shut down to close the file systems.
                    Check for blocked air flow to the indicated location.
                    Check the air conditioning system in the environment.

   Driver Messages For Fan Related Events:
       Most of the following messages will be seen prepended with "casm: " to
       indicate that they are from the casm driver. This section deals with
       detected fan related events. Note that there may be multiple messages
       each giving slightly different details (such as location) but all
       having similar causes. Events which are corrected will use the same
       message with the phrase "has been repaired" appended to the end of the
       message. This simplifies matching failures with corrections in the
       system message logs.
       ============================
       Message:     "Fan Failure (Fan %s, Location %s)"
                    "External Chassis Fan Failure (Chassis %s, Fan %s,
                    Location %s)"
                    "External Storage System Fan Failure (%sSlot %s, Fan %s,
                    Location %s)"
                    "Internal Storage System Fan Failure (%sSlot %s, Fan %s,
                    Location %s)"
       Description: This message indicates that a fan in the specified
                    location has failed. Another message will be displayed
                    if a system shutdown will occur.
Action: On some servers such as the ProLiant Dense Line (DL), a fan failure will trigger a shutdown even if Thermal Shutdown has been disabled in RBSU. There is a 60-second grace period to allow hot plug fans to be replaced in the case of a redundant fan failure. Another message will be displayed if a system shutdown will occur. The RBSU setup utility can be used to override "Thermal Shutdown" in the event of a bad signal from the fan. Any fan which shows a failure should be replaced as soon as possible, even if the fan continues to operate.

============================
Message: "System Fan Inserted (Fan %s, Location %s)"
         "External Chassis Fan Inserted (Chassis %s, Fan %s, Location %s)"
         "External Storage System Fan Inserted (%sSlot %s, Fan %s, Location %s)"
Description: This message indicates that the specified fan has been inserted.
Action: This is just an information message. No action required.

============================
Message: "System Fan Removed (Fan %s, Location %s)"
         "External Chassis Fan Removed (Chassis %s, Fan %s, Location %s)"
         "External Storage System Fan Removed (%sSlot %s, Fan %s, Location %s)"
Description: This message indicates that the specified fan has been removed.
Action: This is just an information message. No action required.

============================
Message: "System Fans Not Redundant (Location %s)"
Description: This message indicates that the fans are no longer redundant. This message usually follows a Fan Failure message.
Action: Correct the previous fan error (failure or removal) to restore redundancy.

Driver Messages For Power Supply Related Events:
Most of the following messages will be seen prepended with "casm: " to indicate that they are from the casm driver. This section deals with detected power supply related events. Note that there may be multiple messages, each giving slightly different details (such as location) but all having similar causes.
Events which are corrected will use the same message with the phrase "has been repaired" appended to the end of the message. This simplifies matching failures with corrections in the system message logs.

============================
Message: "System Power Supply: %s (Power Supply %s)"
         "External Chassis Power Supply: %s (Chassis %s, Power Supply %s)"
         "External Storage System Power Supply: %s (%sSlot %s, Power Supply %s)"
Description: This message indicates that the specified power supply has failed or electric power to the supply has been discontinued.
Action: Check to see that the power source (i.e. the plug) to the power supply is still providing electricity. If power is available, the power supply may have failed and needs to be replaced.

============================
Message: "System Power Supply Removed (Power Supply %s)"
         "External Chassis Power Supply Removed (Chassis %s, Power Supply %s)"
         "External Storage System Power Supply Removed (%sSlot %s, Power Supply %s)"
Description: This message indicates that the specified power supply has been removed from the system.
Action: No action required as this is just an information message.

============================
Message: "System Power Supply Inserted (Power Supply %s)"
         "External Chassis Power Supply Inserted (Chassis %s, Power Supply %s)"
         "External Storage System Power Supply Inserted (%sSlot %s, Power Supply %s)"
Description: This message indicates that the specified power supply has been inserted into the system.
Action: No action required as this is just an information message.

============================
Message: "System Power Supplies Not Redundant"
         "External Chassis Power Supplies Not Redundant (Chassis %s)"
         "External Storage System Power Supplies Not Redundant (%sSlot %s)"
Description: This message indicates that the specified power supplies are no longer redundant. This message usually follows a Power Supply Failure message.
Action: Correct the previous power supply error (failure or removal) to restore redundancy.

Driver Messages For Memory Subsystem Related Events:
Most of the following messages will be seen prepended with "casm: " to indicate that they are from the casm driver. This section deals with detected Memory Subsystem related events. Note that there may be multiple messages, each giving slightly different details (such as location) but all having similar causes. Events which are corrected will use the same message with the phrase "has been repaired" appended to the end of the message. This simplifies matching failures with corrections in the system message logs.

============================
Message: "Corrected Memory Error threshold exceeded (Slot %s, Memory Module %s)"
         "Corrected Memory Error threshold exceeded (System Memory)"
         "Corrected Memory Error threshold exceeded (Slot %s, Bank %s)"
         "Corrected Memory Error threshold exceeded (System Memory, Bank %s)"
Description: This message indicates that a memory module has exceeded the prefailure threshold for correctable memory errors.
Action: The memory module should be replaced as soon as possible.

============================
Message: "Uncorrectable Memory Error (Slot %s, Memory Module %s)"
         "Uncorrectable Memory Error (System Memory)"
Description: This message indicates that a memory module has failed. Because of the way memory fails, this problem can be intermittent, so the system may reboot even though a memory module indicated a failure.
Action: The memory module should be replaced as soon as possible. There is the possibility that the server may get a Non-Maskable Interrupt and be halted if this error occurs.

============================
Message: "Memory Cartridge Removed (Slot %s)"
         "Memory Board Removed (Slot %s)"
Description: This message indicates that a memory cartridge / board has been removed.
Action: No action required as this is just an information message.
============================
Message: "Memory Cartridge Inserted (Slot %s)"
         "Memory Board Inserted (Slot %s)"
Description: This message indicates that a memory cartridge / board has been inserted.
Action: No action required as this is just an information message.

============================
Message: "Memory Cartridge Unlocked (Slot %s)"
Description: This message indicates that a memory cartridge / board has been manually unlocked.
Action: No action required as this is just an information message.

============================
Message: "Memory Cartridge locked (Slot %s)"
Description: This message indicates that a memory cartridge / board has been manually locked.
Action: No action required as this is just an information message.

============================
Message: "Memory Cartridge Bus Fault (Slot %s)"
         "Memory Cartridge Power Fault (Slot %s)"
Description: This message indicates that a memory cartridge / board has a fault.
Action: Contact hp ProLiant support for further assistance.

============================
Message: "Memory Cartridge Configuration Error (Slot %s, Memory Module %s)"
         "Memory Board Configuration Error (Slot %s, Memory Module %s)"
Description: This message indicates that a memory cartridge / board has a configuration error. The usual cause of this error is using DIMMs which do not match in size and speed.
Action: Use identical DIMMs in all memory cartridge / board configurations if using multiple memory cartridges / boards.
============================
Message: "Online Spare Memory Engaged for Faulty Module (Slot %s, Memory Module %s)"
         "Online Spare Memory Engaged for Faulty Module (Slot %s, Bank %s)"
         "Online Spare Memory Engaged for Faulty Module (System Memory, Memory Module %s)"
         "Online Spare Memory Engaged for Faulty Module (System Memory, Bank %s)"
Description: This message indicates that the server was configured to use the "Online Spare Memory" option of the ProLiant Advanced Memory Protection feature and was forced to fail over because a memory module exceeded the prefailure correctable error threshold limit. There should be a preceding message in the ProLiant Integrated Management Log indicating which memory module exceeded the threshold.
Action: Shut down the server and replace the DIMM indicated in the message.

============================
Message: "Mirrored Memory Engaged for Faulty Module (Slot %s, Memory Module %s)"
         "Mirrored Memory Engaged for Faulty Module (Slot %s, Bank %s)"
         "Mirrored Memory Engaged for Faulty Module (System Memory, Memory Module %s)"
         "Mirrored Memory Engaged for Faulty Module (System Memory, Bank %s)"
Description: This message indicates that the server was configured to use the "Mirrored Memory" option of the ProLiant Advanced Memory Protection feature and was forced to fail over due to a failed memory module.
Action: Most ProLiant servers which have this feature allow the failed memory board to be "hot plugged" out of a live system. The boards have indicator lights to let the user know which board should be removed to replace the failed DIMM. When the DIMM has been replaced, the system will automatically return to the redundant (mirrored) state.

============================
Message: "Memory Subsystem Not Mirrored"
Description: This message indicates that the memory mirror has been broken.
Action: No action required as this is just an information message.
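The temperature, fan, power supply and memory sections above all note that a corrected event reuses the original message with "has been repaired" appended. A minimal sketch of how that convention could be used to list failures that are still open is shown below. It assumes the phrase is separated from the message by whitespace and that each log line contains the "casm: " prefix. The function name and the sample log lines are hypothetical. The sketch treats every casm line as a failure, so purely informational messages (such as fan insertions) would need to be filtered out by the caller.

```python
REPAIRED_SUFFIX = "has been repaired"
PREFIX = "casm: "


def unresolved_failures(lines):
    """Return casm messages with no later matching 'has been repaired' line."""
    open_failures = []
    for line in lines:
        text = line.strip()
        start = text.find(PREFIX)
        if start == -1:
            continue  # not a casm driver message
        message = text[start + len(PREFIX):]
        if message.endswith(REPAIRED_SUFFIX):
            # Strip the suffix to recover the original failure text.
            original = message[:-len(REPAIRED_SUFFIX)].rstrip()
            if original in open_failures:
                open_failures.remove(original)
        else:
            open_failures.append(message)
    return open_failures


if __name__ == "__main__":
    # Hypothetical log excerpt; the timestamp and host fields are illustrative.
    log = [
        "Sep 30 10:00:00 host kernel: casm: Fan Failure (Fan 2, Location CPU)",
        "Sep 30 10:01:00 host kernel: casm: System Power Supplies Not Redundant",
        "Sep 30 10:05:00 host kernel: casm: Fan Failure (Fan 2, Location CPU) "
        "has been repaired",
    ]
    for failure in unresolved_failures(log):
        print(failure)
```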
Driver Messages For Automatic Server Recovery (ASR) Events
The following messages are displayed when an Automatic Server Recovery (ASR) timeout has occurred. The order of the messages is very important. When the ProLiant Advanced Server Management driver detects an ASR timeout, the driver will attempt to gracefully shut down the operating system. If the graceful shutdown attempt is successful, a message will be logged indicating this; otherwise the server will hard reboot as if the power switch was momentarily pressed. Most of the following messages will be seen prepended with "casm: " to indicate that they are from the casm driver.

============================
Message: "NMI - Automatic Server Recovery timer expiration - Hour %d - %d/%d/%d"
Description: This message indicates that the ProLiant Advanced Server Management driver detected an ASR timeout and is attempting to gracefully shut down the operating system. If this message is not present, this may be an indication of a critical hardware failure (such as a non-correctable ECC error on a memory DIMM) or some other severe event. This is the first of a series of messages displayed to the console. This message will NOT be logged to the ProLiant Integrated Management Log and will most likely not be in any system logs.
Action: Review all the messages logged to the ProLiant Integrated Management Log to see if any previous errors have been logged. Diagnosing these types of errors takes some detective work.

============================
Message: "ASR Lockup Detected: %s"
Description: This message indicates that the ProLiant Advanced Server Management driver detected an ASR timeout and is attempting to gracefully shut down the operating system. If this message is not present, this may be an indication of a critical hardware failure (such as a non-correctable ECC error on a memory DIMM) or some other severe event.
This will be the first message logged to the ProLiant Integrated Management Log (if logging is possible).
Action: Review all the messages logged to the ProLiant Integrated Management Log to see if any previous errors have been logged. Diagnosing these types of errors takes some detective work.

============================
Message: "casm: ASR performed a successful OS shutdown"
Description: This message indicates that the ProLiant Advanced Server Management driver detected an ASR timeout and was able to successfully perform a graceful shutdown of the operating system. If this message is not present, this may be an indication of a hardware failure (such as a non-correctable ECC error on a memory DIMM), a high priority process consuming all the available CPU cycles (software failure) or possibly a device such as a storage or network controller flooding the system with interrupts. This will be the second message logged to the ProLiant Integrated Management Log if logging is possible. If this message is present, this usually indicates a software type error such as a high priority process consuming all the available CPU cycles. Tools such as SAR can be used in conjunction with the ASR facility to locate the errant process at the time of failure.
Action: Review all the messages logged to the ProLiant Integrated Management Log to see if any previous errors have been logged. Diagnosing these types of errors takes some detective work.

============================
Message: "ASR Detected by System ROM"
Description: This message indicates that the ProLiant Server ROM detected an ASR timeout. This message is almost always present in the ProLiant Integrated Management Log when an ASR timeout occurs. If this is the ONLY "ASR" message logged to the ProLiant Integrated Management Log, this may be indicative of a hardware failure (such as a non-correctable ECC error on a memory DIMM).
The ASR feature on a ProLiant server will hard reset the server when the timeout expires with no software intervention required.
Action: Review all the messages logged to the ProLiant Integrated Management Log to see if any previous errors have been logged. Diagnosing these types of errors takes some detective work.

============================
Message: "Automatic Operating System Shutdown Initiated Due to Fan Failure"
         "Automatic Operating System Shutdown Initiated Due to Overheat Condition"
         "Automatic Operating System Shutdown Initiated Due to VRM Failure"
         "Automatic Operating System Shutdown Initiated by a Soft Power Down"
         "Automatic Operating System Shutdown Initiated by a software"
         "Server Blade Enclosure Blade Shutdown Via Power Management Software (Slot %s)"
Description: This message indicates that a graceful operating system shutdown will take place unless the failing condition is immediately corrected. For most events, there is a one minute delay period to allow the failing condition to be corrected. For example, the user may need to remove two fans (as part of a Field Replaceable Unit) to correct a failed fan. This gives the user one minute to put the working pair of fans back into the system (assuming there was a redundant fan solution available for the ProLiant server).
Action: If replacing a failed fan (which is permitted to be hot replaced), there is a one minute grace period to insert the working fan into the system.

============================
Message: "Automatic Operating System Shutdown Aborted"
         "Automatic Operating System Shutdown Due to Fan Failure Aborted"
         "Automatic Operating System Shutdown Due to Overheat Aborted"
Description: This message indicates that the scheduled graceful shutdown of the operating system was aborted. Execution will continue.
Action: Information message. No action required.
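The ASR descriptions above suggest that which ASR messages appear in the Integrated Management Log hints at the kind of failure. The sketch below encodes only those stated rules of thumb. It assumes the IML entries are already available as plain strings. The function name and return strings are hypothetical, and the result is a starting point for investigation, not a diagnosis.

```python
# Message fragments taken from the ASR entries on this page.
ASR_LOCKUP = "ASR Lockup Detected"
ASR_SHUTDOWN_OK = "ASR performed a successful OS shutdown"
ASR_ROM = "ASR Detected by System ROM"


def asr_hint(iml_entries):
    """Suggest where to look first, based on which ASR entries are present."""
    found = {name for name in (ASR_LOCKUP, ASR_SHUTDOWN_OK, ASR_ROM)
             if any(name in entry for entry in iml_entries)}
    if not found:
        return "no ASR entries found"
    if found == {ASR_ROM}:
        # Only the ROM saw the timeout: the page says this may be hardware.
        return "possible hardware failure"
    if ASR_SHUTDOWN_OK in found:
        # A graceful shutdown usually points at a software type error.
        return "likely software cause (graceful shutdown succeeded)"
    return ("graceful shutdown not logged: hardware failure, CPU starvation "
            "or interrupt flood possible")


if __name__ == "__main__":
    print(asr_hint(["ASR Detected by System ROM"]))
```

In every case the Action text still applies: review the earlier IML entries for errors logged before the timeout.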
Driver Messages For Critical Hardware Events (NMI)
Most of the following messages will be seen prepended with "casm: " to indicate that they are from the casm driver. This section deals with the more common Non-Maskable Interrupt (NMI) errors. There are other NMI type errors which may occur. In general, NMI type errors are usually related to hardware, and customer support will need to be engaged to provide a solution.

============================
Message: "(MCA) Processor BINIT in progress!"
Description: An Intel Processor Machine Check Architecture event has occurred.
Action: The server will be forced down hard. The processor should be replaced.

============================
Message: "casm: NMI Handler has been called on processor %d!"
Description: This message is logged for all NMIs. If no other messages are logged or displayed, this may be an indication of an Uncorrectable Memory Error. These types of errors are difficult to log because the casm device driver code may actually be physically located on a failed DIMM. This will be the first message, with other details following if the source of the NMI can be detected. The ProLiant Automatic Server Recovery (ASR) feature uses the NMI facility to alert the ProLiant Advanced Server Management driver that the ASR timer is about to expire.
Action: If no other messages are displayed, try moving the DIMMs to different slots and see if the error recurs. Otherwise, check for subsequent messages which will give an indication of the source of the problem.

============================
Message: "casm: Spinning for 2 seconds!"
Description: All NMIs are processed by the bootstrap processor. If an NMI is received on a processor other than the bootstrap processor, the casm driver will spin to allow the NMI to be processed.
Action: This message along with other NMI messages can be used to assist in sourcing the problem that generated the NMI.

============================
Message: "NMI - Uncorrectable memory error - "Hour %d - %d/%d/%d" "Bank %d DIMMs"
Description: The DIMMs in the indicated bank have generated an Uncorrectable Memory Error.
Action: The failed DIMMs need to be replaced.

============================
Message: "NMI - Uncorrectable memory error - "Hour %d - %d/%d/%d Slot: %d Module %d"
Description: The specific DIMM indicated in the message has generated an Uncorrectable Memory Error.
Action: The failed DIMM needs to be replaced.

============================
Message: "NMI - Automatic Server Recovery timer expiration - Hour %d - %d/%d/%d"
Description: The Advanced Server Management (ASM) watchdog timer has expired. This is an indication that either a software application consumed all of the processor resources such that the operating system was not able to schedule, or a major event occurred (such as a Non-Maskable Interrupt (NMI)) and halted the operating system. See the previous section concerning ProLiant Automatic Server Recovery (ASR).
Action: Use the messages in the Integrated Management Log (IML) and the operating system event logs to determine what caused the operating system to cease functioning or to "lock up".

============================
Message: "NMI - Unexpected Slot Power Loss (Bus %d, dev %d, func %d) Hour %d - %d/%d/%d"
Description: This is a result of opening a PCI Hot Plug slot while the slot is powered on.
Action: If no PCI Hot Plug slot was opened, this could be an indication of a slot failure. Check the slot LEDs for proper operation.

============================
Message: "NMI - PCI Bus parity error (Bus %d, dev %d, func %d) Hour %d - %d/%d/%d"
Description: A PCI device has indicated that a parity error has occurred.
Action: This is an indication that the PCI device specified may be failing.
If no other errors have occurred before this error, this might be an indication that the specified PCI device has failed or is about to fail. If other errors have occurred, this error needs to be analyzed in context with previous errors.

============================
Message: "NMI - Dump Switch has been pressed - "Hour %d - %d/%d/%d"
Description: Some ProLiant servers have a "debug" switch which will generate a Non-Maskable Interrupt (NMI). This message indicates that this switch was pressed.
Action: None.

============================
Message: "Unrecoverable Non-Maskable Interrupt (NMI) error"
Description: This is an NMI which the ProLiant server ROM was not able to "source". This is either a problem with the ROM code or a hardware failure of a product not shipped as part of the server (i.e. a third party hardware device).
Action: Contact customer support for assistance.

============================
Message: "Unknown Non-Maskable Interrupt (NMI) error (0x%x) Hour %d - %d/%d/%d"
Description: This message indicates that an unknown NMI was generated. The hexadecimal value returned is an internal code from the Server ROM which customer support can interpret.
Action: Contact customer support for assistance.

BUGS
Limited Hardware Platforms
    This driver will only work on ProLiant servers which have the ProLiant Advanced Server Management (ASM) ASIC (PCI ID 0x0E11A0F0) or the ProLiant iLO Advanced Server Management ASIC (PCI ID 0x0E11B203).
Initialization time
    After inserting, the driver needs about one minute to get fully "situated". Specifically, faulty hardware that reports back to normal might not be recognized as "working" within the first minute of operation.

FILES
/opt/compaq/cpqhealth
    Default directory for the scripts and binaries. There are sub-directories for the cpqasm and cpqevt drivers and then further sub-directories for each supported Linux kernel.
/opt/compaq/cpqhealth/custom_cpqhealth.sh
    The shell script which will rebuild and repackage the cpqhealth driver.
/opt/compaq/cpqhealth/cpqhealth_boot.log
    A log file containing the results of the last boot of the system. The RPM errors are also logged here. This file and the previous version ("/opt/compaq/cpqhealth/cpqhealth_boot.log.old") should always be sent with any queries on the health driver installation or removal.
/etc/init.d/cpqasm
    This file is linked to the multiuser initstate directories and controls the loading of the cpqasm and cpqevt drivers. This script determines whether the drivers need to be rebuilt.

SEE ALSO
cpqimlview(8)
www.compaq.com/support/files/server/us/
www.compaq.com/products/software/linux/index.html

AUTHOR
Hewlett-Packard Company.

Copyright Notice
Copyright 2002 Compaq Information Technologies Group, L.P.

30 September 2002 cpqhealth(4)