		Linux Ethernet Bonding Driver mini-howto

Initial release : Thomas Davis
Corrections, HA extensions : 2000/10/03-15 :
  - Willy Tarreau
  - Constantine Gavrilov
  - Chad N. Tindel
  - Janice Girouard

Table of Contents
=================

Supported Distributions
Packaging
Kernel Source Code Setup
Installing the Source RPM Package
Bond Configuration
Module Parameters
Configuring Multiple Bonds
Switch Configuration
Verifying Bond Configuration
Frequently Asked Questions
High Availability
Limitations
Uninstalling the RPM

Supported Distributions
=======================

The following distributions are currently supported:

  Red Hat Linux 7.3 Professional with errata kernel 2.4.18-10
  Red Hat Linux 8.0 Professional
  Red Hat Linux Advanced Server 2.1 with errata kernel 2.4.9-e.8
  SuSE Linux Enterprise Server 7 (SLES-7) with SMP errata kernel 2.4.18-224
    or UP errata kernel 2.4.18-243

Packaging
=========

The driver is released in a source RPM format. The file name for the package
is bonding-.src.rpm and is dependent on the kernel source code.

Kernel Source Code Setup
========================

Building the bonding driver requires the kernel source code to be present
and configured beforehand.

The following steps need to be done once for each kernel that is booted.
For example, if the current kernel is UP (uni-processor) and an SMP
(symmetric multi-processor) kernel is booted, these steps must be performed
again to configure the kernel source for SMP before building the bonding
driver for the SMP kernel.

Red Hat installations

If the /usr/src/linux- directory does not exist, install the kernel source
code per Red Hat instructions. Once installed, follow the commands listed
below to configure the kernel source to match the running kernel.

For Red Hat Linux 8.0 Professional:

	# cd /usr/src/linux-
	# make mrproper
	# make -e KERNELRELEASE=`uname -r` oldconfig
	# make -e KERNELRELEASE=`uname -r` dep

For all other Red Hat Linux distributions:

	# cd /usr/src/linux-
	# make mrproper
	# make oldconfig
	# make dep

SLES 7 installation

If the /usr/src/linux- directory does not exist, install the kernel source
code per SuSE instructions. Once installed, follow the commands listed below
to configure the kernel source to match the running kernel.

	# cd /usr/src/linux-.SuSE
	# cp /boot/vmlinuz.config .config
	# cp /boot/vmlinuz.version.h include/linux/version.h
	# cp /boot/vmlinuz.autoconf.h include/linux/autoconf.h
	# make oldconfig
	# make dep

Installing the Source RPM Package
=================================

1. Install the RPM source package.

	# rpm -ivh bonding-.src.rpm

2. Change to the following directory and build the binary RPM for the
   bonding driver.

   Red Hat installations

	# cd /usr/src/redhat
	# rpmbuild -bb SPECS/bonding.spec

   SLES 7 installation

	# cd /usr/src/packages
	# rpm -bb SPECS/bonding.spec

   Note: If an error occurs while building the driver or this directory
   doesn't exist, refer to the Kernel Source Code Setup section of this
   document.

3. Install (upgrade) the binary RPM package created above using the
   following command.

	# rpm -Uvh --force RPMS/i386/bonding-.i386.rpm

   The "force" rpm option is required since the bonding driver is part of
   the kernel rpm.

Bond Configuration
==================

You will need to add at least the following line to /etc/modules.conf
so the bonding driver will automatically load when the bond0 interface is
configured. Refer to the modules.conf manual page for specific modules.conf
syntax details. The Module Parameters section of this document describes each
bonding driver parameter.

	alias bond0 bonding

Use standard distribution techniques to define the bond0 network interface.
For example, on modern Red Hat distributions, create an ifcfg-bond0 file in
the /etc/sysconfig/network-scripts directory that resembles the following:

	DEVICE=bond0
	IPADDR=192.168.1.1
	NETMASK=255.255.255.0
	NETWORK=192.168.1.0
	BROADCAST=192.168.1.255
	ONBOOT=yes
	BOOTPROTO=none
	USERCTL=no

(use appropriate values for your network above)

The above file can be created on Red Hat systems using the following command:

	netconfig -d bond0

All interfaces that are part of a bond should have SLAVE and MASTER
definitions. For example, in the case of Red Hat, if you wish to make eth0
and eth1 a part of the bonding interface bond0, their config files
(ifcfg-eth0 and ifcfg-eth1) should resemble the following:

	DEVICE=eth0
	USERCTL=no
	ONBOOT=yes
	MASTER=bond0
	SLAVE=yes
	BOOTPROTO=none

Use DEVICE=eth1 in the ifcfg-eth1 config file. If you configure a second
bonding interface (bond1), use MASTER=bond1 in the config file to make the
network interface a slave of bond1.

Restart the networking subsystem by issuing the following command:

	/etc/init.d/network restart

If the administration tools of your distribution do not support
master/slave notation in configuring network interfaces (as is the case
with SuSE), you will need to manually configure the bonding device with the
following commands:

	# /sbin/ifconfig bond0 192.168.1.1 netmask 255.255.255.0 \
	  broadcast 192.168.1.255 up

	# /sbin/ifenslave bond0 eth0
	# /sbin/ifenslave bond0 eth1

(use appropriate values for your network above)

You can then create a script containing these commands and place it in the
appropriate rc directory.

If you specifically need all network drivers loaded before the bonding
driver, adding the following line to modules.conf will cause the network
drivers for eth0 and eth1 to be loaded before the bonding driver.

	probeall bond0 eth0 eth1 bonding

Be careful not to reference bond0 itself at the end of the line, or modprobe
will die in an endless recursive loop.
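
The manual ifconfig and ifenslave sequence shown earlier in this section can
be collected into a small script for the rc directory. The following is a
minimal sketch, not a tested init script: the function name bond_up and the
RUN variable are hypothetical, and the arguments are the example values used
above. Setting RUN=echo prints the commands instead of executing them, which
is a convenient way to check the arguments before touching the network.

```shell
#!/bin/sh
# Sketch: bring up a bonding interface and enslave interfaces to it.
# RUN is a hypothetical dry-run switch: with RUN=echo the commands are
# printed instead of executed.
RUN=${RUN:-}

bond_up() {
    # Need the bond name, address, netmask, broadcast and at least one slave.
    if [ $# -lt 5 ]; then
        echo "usage: bond_up BOND ADDR NETMASK BROADCAST SLAVE..." >&2
        return 2
    fi
    bond=$1 addr=$2 mask=$3 bcast=$4
    shift 4

    # Configure the bonding device first, then attach each slave in order.
    $RUN /sbin/ifconfig "$bond" "$addr" netmask "$mask" broadcast "$bcast" up || return 1
    for slave in "$@"; do
        $RUN /sbin/ifenslave "$bond" "$slave" || return 1
    done
}
```

For example, after setting RUN=echo, calling

	bond_up bond0 192.168.1.1 255.255.255.0 192.168.1.255 eth0 eth1

prints the three commands from the manual configuration above.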

If running SNMP agents, the bonding driver should be loaded before any network
drivers participating in a bond. This requirement is due to the interface
index (ipAdEntIfIndex) being associated with the first interface found with a
given IP address. That is, there is only one ipAdEntIfIndex for each IP
address. For example, if eth0 and eth1 are slaves of bond0 and the driver for
eth0 is loaded before the bonding driver, the interface for the IP address
will be associated with the eth0 interface. This configuration is shown
below: the IP address 192.168.1.1 has an interface index of 2, which indexes
to eth0 in the ifDescr table (ifDescr.2).

	interfaces.ifTable.ifEntry.ifDescr.1 = lo
	interfaces.ifTable.ifEntry.ifDescr.2 = eth0
	interfaces.ifTable.ifEntry.ifDescr.3 = eth1
	interfaces.ifTable.ifEntry.ifDescr.4 = eth2
	interfaces.ifTable.ifEntry.ifDescr.5 = eth3
	interfaces.ifTable.ifEntry.ifDescr.6 = bond0
	ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 5
	ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2
	ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 4
	ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1

This problem is avoided by loading the bonding driver before any network
drivers participating in a bond. Below is an example of loading the bonding
driver first; the IP address 192.168.1.1 is correctly associated with
ifDescr.2 (bond0).

	interfaces.ifTable.ifEntry.ifDescr.1 = lo
	interfaces.ifTable.ifEntry.ifDescr.2 = bond0
	interfaces.ifTable.ifEntry.ifDescr.3 = eth0
	interfaces.ifTable.ifEntry.ifDescr.4 = eth1
	interfaces.ifTable.ifEntry.ifDescr.5 = eth2
	interfaces.ifTable.ifEntry.ifDescr.6 = eth3
	ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 6
	ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2
	ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 5
	ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1

Module Parameters
=================

Optional parameters for the bonding driver can be supplied as command line
arguments to the insmod command. Typically, these parameters are specified in
the file /etc/modules.conf (see the manual page for modules.conf). The
available bonding driver parameters are listed below. If a parameter is not
specified, the default value is used. When initially configuring a bond, it
is recommended that "tail -f /var/log/messages" be run in a separate window
to watch for bonding driver error messages.

It is critical that either the miimon parameter or both the arp_interval and
arp_ip_target parameters be specified; otherwise serious network degradation
will occur during link failures.

mode

	Specifies one of three bonding policies. The default is round-robin.

	0	Round-robin policy: Transmit in a sequential order from the
		first available slave through the last.
	1	Active-backup policy: Only one slave in the bond is active. A
		different slave becomes active if, and only if, the active
		slave fails. The bond's MAC address is externally visible on
		only one port (network adapter) to avoid confusing the switch.
	2	XOR policy: Transmit based on [(source MAC address XOR'd with
		destination MAC address) modulo slave count]. This selects the
		same slave for each destination MAC address.

miimon

	Specifies the frequency in milli-seconds that MII link monitoring
	will occur. A value of zero disables MII link monitoring. A value of
	100 is a good starting point.
	See the High Availability section for additional information. The
	default value is 0.

downdelay

	Specifies the delay time in milli-seconds to disable a link after a
	link failure has been detected. This should be a multiple of the
	miimon value, otherwise the value will be rounded. The default value
	is 0.

updelay

	Specifies the delay time in milli-seconds to enable a link after a
	link up status has been detected. This should be a multiple of the
	miimon value, otherwise the value will be rounded. The default value
	is 0.

arp_interval

	Specifies the ARP monitoring frequency in milli-seconds. If ARP
	monitoring is used in a load-balancing mode (mode 0 or 2), the switch
	should be configured in a mode that evenly distributes packets across
	all links, such as round-robin. If the switch is configured to
	distribute the packets in an XOR fashion, all replies from the ARP
	target will be received on the same link, which could cause the other
	team members to fail. A value of 0 disables ARP monitoring. The
	default value is 0.

arp_ip_target

	Specifies the IP address to use when arp_interval is > 0. This is the
	target of the ARP request sent to determine the health of the link to
	the target. Specify this value in ddd.ddd.ddd.ddd format.

Configuring Multiple Bonds
==========================

If several bonding interfaces are required, the driver must be loaded
multiple times. For example, to configure two bonding interfaces with link
monitoring performed every 100 milli-seconds, /etc/modules.conf should
resemble the following:

	alias bond0 bonding
	alias bond1 bonding

	options bond0 miimon=100
	options bond1 -o bonding1 miimon=100

Switch Configuration
====================

While the switch does not need to be configured when the Active-backup
policy is used (mode=1), it does need to be configured for the Round-robin
and XOR policies (mode=0 or mode=2).
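
For reference, the slave selection of the XOR policy (mode=2), as described
in the Module Parameters section, can be sketched in shell arithmetic. This
is only an illustration of the formula as written there, applied to the full
48-bit addresses; it is not taken from the driver source, and the function
name xor_slave_index is hypothetical. It assumes a shell whose arithmetic
accepts hexadecimal constants and is at least 48 bits wide, such as bash.

```shell
# Sketch: (source MAC XOR destination MAC) modulo slave count.
# Prints the zero-based index of the slave that would be selected.
xor_slave_index() {
    # Strip the colons so each address becomes one hexadecimal number.
    src=$(printf '%s' "$1" | tr -d ':')
    dst=$(printf '%s' "$2" | tr -d ':')
    count=$3
    echo $(( (0x$src ^ 0x$dst) % count ))
}
```

The same pair of addresses always yields the same index, which is why this
policy keeps all traffic for one destination MAC address on one slave. With
two slaves only the lowest bit of the XOR result matters, so destinations
whose MAC addresses differ in that bit are sent over different slaves.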

The following commands are issued to create a team on ports 5 and 6 of an
Extreme Networks Summit 1i switch:

	enable sharing 5 grouping 5-6 algorithm round-robin

Note: The switch does not need to use the round-robin algorithm; it can use
a different load balancing algorithm than that used by the bond.

Use the following commands to view the switch configuration:

	show port info
	show port config

Verifying Bond Configuration
============================

1) Bonding information files
----------------------------
The bonding driver information files reside in the /proc/net/bond*
directories.

Sample contents of /proc/net/bond0/info after the driver is loaded with
parameters of mode=0 and miimon=1000 are shown below.

	Bonding Mode: load balancing (round-robin)
	Currently Active Slave: eth0
	MII Status: up
	MII Polling Interval (ms): 1000
	Up Delay (ms): 0
	Down Delay (ms): 0

	Slave Interface: eth1
	MII Status: up
	Link Failure Count: 1

	Slave Interface: eth0
	MII Status: up
	Link Failure Count: 1

2) Network verification
-----------------------
The network configuration can be verified using the ifconfig command. In
the example below, the bond0 interface is the master (MASTER) while eth0 and
eth1 are slaves (SLAVE). Notice all slaves of bond0 have the same MAC address
(HWaddr) as bond0.

[root]# /sbin/ifconfig
bond0     Link encap:Ethernet  HWaddr 00:C0:F0:1F:37:B4
          inet addr:XXX.XXX.XXX.YYY  Bcast:XXX.XXX.XXX.255  Mask:255.255.252.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0
          collisions:0 txqueuelen:0

eth0      Link encap:Ethernet  HWaddr 00:C0:F0:1F:37:B4
          inet addr:XXX.XXX.XXX.YYY  Bcast:XXX.XXX.XXX.255  Mask:255.255.252.0
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0
          collisions:0 txqueuelen:100
          Interrupt:10 Base address:0x1080

eth1      Link encap:Ethernet  HWaddr 00:C0:F0:1F:37:B4
          inet addr:XXX.XXX.XXX.YYY  Bcast:XXX.XXX.XXX.255  Mask:255.255.252.0
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          Interrupt:9 Base address:0x1400

Frequently Asked Questions
==========================

1.  Is it SMP safe?

	Yes. The old 2.0.xx channel bonding patch was not SMP safe.
	The new driver was designed to be SMP safe from the start.

2.  How many bonding devices can I have?

	One for each module you load. See the section on Module Parameters
	for how to accomplish this.

3.  How many slaves can a bonding device have?

	Limited by the number of network interfaces Linux supports and/or the
	number of network cards you can place in your system.

4.  What happens when a slave link dies?

	If your ethernet cards support MII status monitoring and the MII
	monitoring has been enabled in the driver (see description of module
	parameters), there will be no adverse consequences. This release
	of the bonding driver knows how to get the MII information and
	enables or disables its slaves according to their link status.
	See the section on High Availability for additional information.

	For ethernet cards not supporting MII status, the arp_interval and
	arp_ip_target parameters must be specified for bonding to work
	correctly. If packets have not been sent or received during the
	specified arp_interval duration, an ARP request is sent to the
	target to generate send and receive traffic. If, after this
	interval, the successful send or receive count has not incremented,
	the next slave in the sequence will become the active slave.

	If neither MII monitoring (miimon) nor arp_interval is configured,
	the bonding driver will not handle this situation very well. The
	driver will continue to send packets, but some packets will be lost.
	Retransmits will cause serious degradation of performance (when one
	of two slave links fails, 50% of packets will be lost, which is a
	serious problem for both TCP and UDP).

5.  Can bonding be used for High Availability?

	Yes, if you use MII monitoring and ALL your cards support MII link
	status reporting. See the section on High Availability for more
	information.

6.  Which switches/systems does it work with?

	In round-robin and XOR mode, it works with switches that support
	trunking and with other bonded Linux systems. In Active-backup mode,
	it should work with any Layer-II switch.

7.  Where does a bonding device get its MAC address from?

	If not explicitly configured with ifconfig, the MAC address of the
	bonding device is taken from its first slave device. This MAC address
	is then passed to all following slaves and remains persistent (even if
	the first slave is removed) until the bonding device is brought
	down or reconfigured.

	If you wish to change the MAC address, you can set it with ifconfig:

	  # ifconfig bond0 hw ether 00:11:22:33:44:55

	The MAC address can also be changed by bringing down/up the device
	and then changing its slaves (or their order):

	  # ifconfig bond0 down ; modprobe -r bonding
	  # ifconfig bond0 .... up
	  # ifenslave bond0 eth...

	This method will automatically take the address from the next slave
	that will be added.

	To restore your slaves' MAC addresses, you need to detach them
	from the bond (`ifenslave -d bond0 eth0'), set them down
	(`ifconfig eth0 down'), unload the drivers (`rmmod 3c59x', for
	example) and reload them to get the MAC addresses from their
	eeproms. If the driver is shared by several devices, you need
	to turn them all down. Another solution is to look for the MAC
	address at boot time (dmesg or tail /var/log/messages) and to
	reset it by hand with ifconfig:

	  # ifconfig eth0 down
	  # ifconfig eth0 hw ether 00:20:40:60:80:A0

8.  Which transmit policies can be used?

	Round-robin: based on the order of enslaving, the output device is
	selected based on the next available slave, regardless of the source
	and/or destination of the packet.

	XOR: based on (src hw addr XOR dst hw addr) % slave cnt. This
	selects the same slave for each destination hw address.

	Active-backup: a policy that ensures that one and only one device
	will transmit at any given moment. The Active-backup policy is useful
	for implementing high availability solutions using two hubs (see the
	section on High Availability).

High Availability
=================

To implement high availability using the bonding driver, the driver needs to
be compiled as a module, because currently that is the only way to pass
parameters to the driver. This may change in the future.

High availability is achieved by using MII/ETHTOOL status reporting. You need
to verify that all your interfaces support MII/ETHTOOL link status reporting.
On Linux kernel 2.2.17, all the 100 Mbps capable drivers and the yellowfin
gigabit driver support MII. To determine if ETHTOOL link reporting is
available for interface eth0, type "ethtool eth0" and the "Link detected:"
line should contain the correct link status. If your system has an interface
that does not support MII or ETHTOOL status reporting, a failure of its link
will not be detected!
A message indicating that MII and ETHTOOL are not supported by a network
driver is logged when the bonding driver is loaded with a non-zero miimon
value.

The bonding driver can regularly check all its slaves' links using the
ETHTOOL IOCTL (ETHTOOL_GLINK command) or by checking the MII status
registers. The check interval is specified by the module argument "miimon"
(MII monitoring). It takes an integer that represents the checking time in
milliseconds. It should not come too close to (1000/HZ) (10 milli-seconds on
i386) because it may then reduce the system interactivity. A value of 100
seems to be a good starting point. It means that a dead link will be detected
at most 100 milli-seconds after it goes down.

Example:

   # modprobe bonding miimon=100

Or, put the following lines in /etc/modules.conf:

   alias bond0 bonding
   options bond0 miimon=100

There are currently two policies for high availability. They are dependent
on whether:

   a) hosts are connected to a single host or switch that supports trunking

   b) hosts are connected to several different switches or a single switch
      that does not support trunking

1) High Availability on a single switch or host - load balancing
----------------------------------------------------------------
This is the easiest policy to set up and to understand. Simply configure the
remote equipment (host or switch) to aggregate traffic over several
ports (Trunk, EtherChannel, etc.) and configure the bonding interfaces.
If the module has been loaded with the proper MII option, it will work
automatically. You can then try to remove and restore different links
and see in your logs what the driver detects. When testing, you may
encounter problems on some buggy switches that disable the trunk for a
long time if all ports in a trunk go down. This is a fault of the switch,
not of Linux (reboot the switch to confirm).
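
When removing and restoring links during such tests, the per-slave state in
the /proc/net/bond0/info file described under Verifying Bond Configuration
can be condensed with a small filter. The following is a minimal sketch; it
assumes the file has one field per line, laid out as in the sample shown in
that section, and the function name bond_slave_summary is hypothetical.

```shell
# Sketch: print "<slave> <MII status> <link failure count>" for each slave,
# reading the contents of a /proc/net/bond*/info file from standard input.
bond_slave_summary() {
    awk '
        # A new slave block starts here; remember the interface name.
        /^Slave Interface:/    { slave = $3 }
        # The bond-wide "MII Status" line comes before any slave; skip it.
        /^MII Status:/         { if (slave != "") status = $3 }
        # Last field of a slave block: emit one summary line.
        /^Link Failure Count:/ { if (slave != "") print slave, status, $4 }
    '
}
```

Run it as

	bond_slave_summary < /proc/net/bond0/info

before and after pulling a cable, and compare the status and failure count
reported for each slave.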

Example 1 : host to host at twice the speed

          +----------+                          +----------+
          |          |eth0                  eth0|          |
          | Host A   +--------------------------+  Host B  |
          |          +--------------------------+          |
          |          |eth1                  eth1|          |
          +----------+                          +----------+

  On each host :
     # modprobe bonding miimon=100
     # ifconfig bond0 addr
     # ifenslave bond0 eth0 eth1

Example 2 : host to switch at twice the speed

          +----------+                          +----------+
          |          |eth0                 port1|          |
          | Host A   +--------------------------+  switch  |
          |          +--------------------------+          |
          |          |eth1                 port2|          |
          +----------+                          +----------+

  On host A :                             On the switch :
     # modprobe bonding miimon=100           # set up a trunk on port1
     # ifconfig bond0 addr                     and port2
     # ifenslave bond0 eth0 eth1

2) High Availability on two or more switches (or a single switch without
   trunking support)
---------------------------------------------------------------------------
This mode is more problematic because it relies on the fact that there
are multiple ports and the host's MAC address should be visible on one
port only to avoid confusing the switches.

If you need to know which interface is the active one, and which ones are
backup, use ifconfig. All backup interfaces have the NOARP flag set.

To use this mode, pass "mode=1" to the module at load time :

    # modprobe bonding miimon=100 mode=1

Or, put in your /etc/modules.conf :

    alias bond0 bonding
    options bond0 miimon=100 mode=1

Example 1: Using multiple hosts and multiple switches to build a "no single
point of failure" solution.

                |                                     |
                |port3                           port3|
          +-----+----+                          +-----+----+
          |          |port7       ISL      port7|          |
          | switch A +--------------------------+ switch B |
          |          +--------------------------+          |
          |          |port8                port8|          |
          +----++----+                          +-----++---+
           port2||port1                     port1||port2
                ||             +-------+          ||
                |+-------------+ host1 +---------------+|
                |         eth0 +-------+ eth1           |
                |                                       |
                |              +-------+                |
                +--------------+ host2 +----------------+
                          eth0 +-------+ eth1

In this configuration, there is an ISL - Inter Switch Link (could be a trunk),
several servers (host1, host2 ...)
attached to both switches each, and one or more ports to the outside world
(port3...). One and only one slave on each host is active at a time, while
all links are still monitored (the system can detect a failure of active and
backup links).

Each time a host changes its active interface, it sticks to the new one until
it goes down. In this example, the hosts are negligibly affected by the
expiration time of the switches' forwarding tables.

If host1 and host2 have the same functionality and are used in load balancing
by another external mechanism, it is good to have host1's active interface
connected to one switch and host2's to the other. Such a system will survive
a failure of a single host, cable, or switch. The worst thing that may happen
in the case of a switch failure is that half of the hosts will be temporarily
unreachable until the other switch expires its tables.

Example 2: Using multiple ethernet cards connected to a switch to configure
           NIC failover (switch is not required to support trunking).

          +----------+                          +----------+
          |          |eth0                 port1|          |
          | Host A   +--------------------------+  switch  |
          |          +--------------------------+          |
          |          |eth1                 port2|          |
          +----------+                          +----------+

  On host A :                                 On the switch :
     # modprobe bonding miimon=100 mode=1        # (optional) minimize the time
     # ifconfig bond0 addr                       # for table expiration
     # ifenslave bond0 eth0 eth1

Each time the host changes its active interface, it sticks to the new one until
it goes down. In this example, the host is strongly affected by the expiration
time of the switch forwarding table.

3) Adapting to your switches' timing
------------------------------------
If your switches take a long time to go into backup mode, it may be
desirable not to activate a backup interface immediately after a link goes
down. It is possible to delay the moment at which a link will be
completely disabled by passing the module parameter "downdelay" (in
milliseconds, must be a multiple of miimon).
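
Because downdelay and updelay are both expected to be multiples of miimon, it
can be worth checking a candidate value before loading the module. The
following is a minimal sketch with a hypothetical function name,
delay_multiples. It deliberately does not guess in which direction the
driver rounds a value that is not a multiple; it only reports the two
nearest multiples so that one of them can be chosen explicitly.

```shell
# Sketch: check that a delay (ms) is a whole multiple of miimon (ms).
# Prints "ok", or the nearest lower and upper multiples to choose from.
delay_multiples() {
    miimon=$1 delay=$2
    if [ "$miimon" -le 0 ]; then
        echo "miimon must be greater than 0" >&2
        return 1
    fi
    if [ $(( delay % miimon )) -eq 0 ]; then
        echo "ok"
    else
        # Integer division truncates, giving the multiple just below delay.
        lower=$(( delay / miimon * miimon ))
        echo "not a multiple: use $lower or $(( lower + miimon ))"
    fi
}
```

For instance, with miimon=100 a delay of 2000 is accepted as is, while a
delay of 250 is reported as lying between the multiples 200 and 300.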

When a switch reboots, it is possible that its ports report "link up" status
before they become usable. This could fool a bond device by causing it to
use some ports that are not ready yet. It is possible to delay the moment at
which an active link will be reused by passing the module parameter "updelay"
(in milliseconds, must be a multiple of miimon).

A similar situation can occur when a host re-negotiates a lost link with the
switch (for example, after a cable replacement).

A special case is when a bonding interface has lost all slave links. Then the
driver will immediately reuse the first link that goes up, even if the
updelay parameter was specified. (If there are slave interfaces in the
"updelay" state, the interface that first went into that state will be
immediately reused.) This reduces down-time if the value of updelay has been
overestimated.

Examples :

    # modprobe bonding miimon=100 mode=1 downdelay=2000 updelay=5000
    # modprobe bonding miimon=100 mode=0 downdelay=0 updelay=5000

Limitations
===========
The main limitations are :
  - Only the link status is monitored. If the switch on the other side is
    partially down (e.g. doesn't forward anymore, but the link is OK), the
    link won't be disabled. Another way to check for a dead link could be to
    count incoming frames on a heavily loaded host. This is not applicable to
    small servers, but may be useful when the front switches send multicast
    information on their links (e.g. VRRP), or even health-check the servers.
    Use the arp_interval/arp_ip_target parameters to count incoming/outgoing
    frames.

  - A "fail back" mechanism is not available when using the Active-backup
    policy. This would be useful if one slave was preferred over another,
    e.g. when one slave is 1000Mbps and another is 100Mbps. If the 1000Mbps
    slave fails and is later restored, it may be preferable for the faster
    slave to gracefully become the active slave again, without deliberately
    failing the 100Mbps slave.
    A fail back mechanism would allow a previously failed slave to become
    active after certain conditions are met.

  - A Transmit Load Balancing policy is not available. Such a mode would
    allow every slave in the bond to transmit while only one receives. If
    the "receiving" slave fails, another slave takes over the MAC address of
    the failed receiving slave.

Uninstalling the RPM
====================
The following command will uninstall the bonding RPM.

	# rpm -e bonding