- How to troubleshoot a slow drain device in a Brocade SAN environment
- What is a slow drain device
- How to identify a slow drain device from supportshow or supportsave output
A slow drain device is an end device (a host, storage, or an Inter-Switch Link (ISL)) that does not return buffer credits to the switch quickly enough, so frames back up through the fabric and cause fabric congestion.
Performance degradation is a common issue in a SAN environment. Device processing slows down or many frame-drop warnings appear, and eventually business applications are affected. Usually one or several devices cause the problem; we call them slow drain devices. A slow drain device can be a host, a storage device, or a connected switch. For some reason it receives more frames than it can handle, so it cannot return buffer credits to the upstream device fast enough, which causes delay, congestion, or even frame loss. All of these lead to performance issues. The bottleneck can lie at the physical layer, such as an SFP, a fiber cable, or the endpoint device itself, or in a SAN design defect, for example when the actual data volume exceeds the maximum processing capability.
In this article we discuss how to identify and troubleshoot slow drain devices in a Brocade SAN environment.
The cause of a slow drain device
To understand the cause of the bottleneck, we need to understand how switches implement flow control. Buffer credits play the key role. Every switch port has a number of buffer credits, determined when the port and the connected device negotiate the link. A port can send a frame only when it has an available buffer credit, and each frame sent occupies one credit. Once the remote device has received the frame, it sends back an acknowledgment and the number of available credits increases by one. Because buffer credits are limited, a port that has run out of credits must wait, which causes delay. If a frame is held for more than 500 ms, it is dropped and its credit is released.
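The credit mechanism described above can be illustrated with a small simulation. This is only a simplified sketch, not how a switch implements it: time advances in 1 ms steps, one frame arrives per step, and the function name `simulate` and its parameters are hypothetical. The 500 ms hold time is the value given in the text.

```python
from collections import deque

HOLD_TIME_MS = 500  # hold time from the text; frames waiting longer are dropped


def simulate(credits, frames, credit_return_ms):
    """Simulate one transmitting port with a fixed pool of buffer credits.

    One frame arrives every millisecond. Sending a frame consumes a credit,
    and the receiver returns that credit credit_return_ms later.
    Returns (sent, dropped, ms_spent_waiting_at_zero_credit).
    """
    queue = deque()      # arrival times of frames waiting for a credit
    in_flight = deque()  # times at which outstanding credits come back
    sent = dropped = zero_credit_ms = arrived = now = 0
    while arrived < frames or queue or in_flight:
        # Credits returned by the receiver become available again.
        while in_flight and in_flight[0] <= now:
            in_flight.popleft()
            credits += 1
        # A new frame arrives at the port.
        if arrived < frames:
            queue.append(now)
            arrived += 1
        # Frames held longer than the hold time are discarded.
        while queue and now - queue[0] > HOLD_TIME_MS:
            queue.popleft()
            dropped += 1
        # Send queued frames while credits are available.
        while queue and credits > 0:
            queue.popleft()
            credits -= 1
            in_flight.append(now + credit_return_ms)
            sent += 1
        # Frames are waiting but no credit is left: time at zero credit.
        if queue and credits == 0:
            zero_credit_ms += 1
        now += 1
    return sent, dropped, zero_credit_ms


if __name__ == "__main__":
    print("responsive device:", simulate(credits=8, frames=100, credit_return_ms=2))
    print("slow drain device:", simulate(credits=2, frames=2000, credit_return_ms=400))
```

With a responsive receiver the port never runs out of credits. With a receiver that returns credits slowly, the port sits at zero credit most of the time and frames start to exceed the hold time, which is the behavior the counters in the rest of this article make visible.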
Because flow control is credit based, a bottleneck leads to congestion along the entire data path. If the path includes a cascaded link between switches, all data transmitted over that link is affected. A single bottleneck device can therefore congest the entire network, so identifying it is an important part of troubleshooting a performance issue. On endpoint devices, such as hosts or storage, the system reports the bottleneck issue. Brocade switches log messages like the following:
2015/01/15-18:55:34, [AN-1004], 335118, SLOT 6 | FID 128, WARNING, CHD_1B_TLI_SAN1, Congestion bottleneck on port 10/32. 91.33 pct. of 300 secs. affected.
2015/01/15-19:00:37, [AN-1004], 335119, SLOT 6 | FID 128, WARNING, CHD_1B_TLI_SAN1, Congestion bottleneck on port 10/32. 88.67 pct. of 300 secs. affected.
2015/01/15-19:05:40, [AN-1004], 335120, SLOT 6 | FID 128, WARNING, CHD_1B_TLI_SAN1, Congestion bottleneck on port 10/32. 83.33 pct. of 300 secs. affected.
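When many of these messages have to be reviewed, they can be parsed with a short script. The following is a sketch only: the regular expression is written against the three sample lines above, and other FOS releases may format the message differently. The function name `parse_bottleneck` is hypothetical.

```python
import re

# Pattern derived from the sample AN-1004 lines in this article (assumption:
# the message layout is the same on the switch being analyzed).
AN_1004 = re.compile(
    r"^(?P<time>[^,\s]+), \[AN-1004\], .*?"
    r"Congestion bottleneck on port (?P<port>[\d/]+)\. "
    r"(?P<pct>\d+(?:\.\d+)?) pct\. of (?P<window>\d+) secs\. affected\."
)


def parse_bottleneck(line):
    """Return time, port, percentage and window of an AN-1004 line, or None."""
    m = AN_1004.search(line.strip())
    if m is None:
        return None
    pct = float(m["pct"])
    window = int(m["window"])
    return {
        "time": m["time"],
        "port": m["port"],
        "pct": pct,
        "window_s": window,
        # Seconds of the reporting window during which the port was affected.
        "affected_s": pct / 100 * window,
    }


if __name__ == "__main__":
    sample = (
        "2015/01/15-18:55:34, [AN-1004], 335118, SLOT 6 | FID 128, WARNING, "
        "CHD_1B_TLI_SAN1, Congestion bottleneck on port 10/32. "
        "91.33 pct. of 300 secs. affected."
    )
    print(parse_bottleneck(sample))
```

For the first sample line, 91.33 percent of a 300-second window means the port was affected for about 274 seconds.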
Clear the counters on the Brocade switch
To troubleshoot a performance issue, the first step is to clear the switch counters with the following commands:
#>statsclear
#>slotstatsclear
If you have already cleared the counters, you can collect supportshow or supportsave logs for analysis right away. If you have not, first collect a copy of the current supportshow or supportsave output, then clear the counters. That first copy can be used to quickly see which ports already have errors, so that those ports can be checked first. The sfpshow command can be used to check the TX and RX power levels on a particular port.
Identify SAN topology
In a single-switch network, all connected devices are hosts or storage. In a multi-switch network there are also ISLs and E-Ports. Identifying the network topology helps administrators understand the data transmission path.
For example, the following islshow output shows the connectivity between the local Brocade switch and the remote switches. No. 1: local switch port 57 connects to port 55 on remote switch CHD_1C_TLI_SAN1. No. 2: local switch port 129 connects to port 135 on remote switch CHD_1D_NGN_SAN1.
islshow:
1: 57-> 55 10:00:00:05:1e:d2:c4:00 7 CHD_1C_TLI_SAN1 sp: 8.000G bw: 8.000G TRUNK QOS
2:129->135 10:00:00:05:33:83:e3:00 5 CHD_1D_NGN_SAN1 sp: 8.000G bw: 8.000G TRUNK QOS
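To turn islshow output into a list of links that can be followed from switch to switch, the lines can be parsed as in the sketch below. It is written against the two lines above only. Reading the number between the WWN and the switch name as the remote domain ID is an assumption, and the names `ISL_LINE` and `parse_islshow` are hypothetical.

```python
import re

# Layout assumed from the two sample lines:
# index: local port -> remote port, remote WWN, domain ID, switch name, speed, bandwidth
ISL_LINE = re.compile(
    r"^\s*(?P<index>\d+):\s*(?P<local_port>\d+)->\s*(?P<remote_port>\d+)\s+"
    r"(?P<remote_wwn>(?:[0-9a-fA-F]{2}:){7}[0-9a-fA-F]{2})\s+"
    r"(?P<remote_domain>\d+)\s+(?P<remote_name>\S+)\s+"
    r"sp:\s*(?P<speed>\S+)\s+bw:\s*(?P<bandwidth>\S+)"
)


def parse_islshow(text):
    """Return one dict per ISL line; lines that do not match are skipped."""
    links = []
    for line in text.splitlines():
        m = ISL_LINE.match(line)
        if m:
            link = m.groupdict()
            for key in ("index", "local_port", "remote_port", "remote_domain"):
                link[key] = int(link[key])
            links.append(link)
    return links


if __name__ == "__main__":
    sample = (
        "1: 57-> 55 10:00:00:05:1e:d2:c4:00 7 CHD_1C_TLI_SAN1 sp: 8.000G bw: 8.000G TRUNK QOS\n"
        "2:129->135 10:00:00:05:33:83:e3:00 5 CHD_1D_NGN_SAN1 sp: 8.000G bw: 8.000G TRUNK QOS\n"
    )
    for link in parse_islshow(sample):
        print(link["local_port"], "->", link["remote_name"], "port", link["remote_port"])
```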
Analyze port errors
As the following diagram shows, there are two Brocade switches in the SAN network.
Based on the islshow output above, we check the status of port 57 with the portstatsshow command:
portstatsshow 57
tim_txcrd_z 1381820 Time TX Credit Zero (2.5Us ticks)
tim_txcrd_z_vc 0- 3: 0 0 228512 231010
tim_txcrd_z_vc 4- 7: 521007 401291 0 0
tim_txcrd_z_vc 8-11: 0 0 0 0
tim_txcrd_z_vc 12-15: 0 0 0 0
er_rx_c3_timeout 0 Class 3 receive frames discarded due to timeout
er_tx_c3_timeout 23 Class 3 transmit frames discarded due to timeout
The Time TX Credit Zero counter shows how long the port spent with zero transmit buffer credits. Zero buffer credit does not by itself mean there is a performance issue, but a very high value suggests congestion somewhere in the network. As a rule of thumb, a value below 30% of the number of transmitted frames is normal.
The c3_timeout counter is used to verify whether frames are being lost. Prior to FOS 6.3.1 the counter had no direction; from FOS 6.3.1 on it is replaced by the er_rx_c3_timeout and er_tx_c3_timeout counters. When the port sends or receives a frame, the frame occupies a buffer credit. If the port does not receive a response within 500 ms, the transmission fails, the frame is dropped, and the counter is incremented by one. A non-zero value indicates a performance issue. In this case, er_tx_c3_timeout is not zero.
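The counters in the portstatsshow 57 excerpt can be read programmatically. The sketch below parses only the lines shown in this article, converts tim_txcrd_z to seconds using the 2.5 microsecond tick from the counter label, and checks that the per-VC values add up to the port total. The function name `parse_portstats` is hypothetical. The 30% rule of thumb also needs the transmitted-frame counter, which is not part of the excerpt, so it is left out here.

```python
TICK_US = 2.5  # tick length taken from the counter label "(2.5Us ticks)"

SAMPLE = """\
tim_txcrd_z 1381820 Time TX Credit Zero (2.5Us ticks)
tim_txcrd_z_vc 0- 3: 0 0 228512 231010
tim_txcrd_z_vc 4- 7: 521007 401291 0 0
tim_txcrd_z_vc 8-11: 0 0 0 0
tim_txcrd_z_vc 12-15: 0 0 0 0
er_rx_c3_timeout 0 Class 3 receive frames discarded due to timeout
er_tx_c3_timeout 23 Class 3 transmit frames discarded due to timeout
"""


def parse_portstats(text):
    """Return (counters, per_vc) from portstatsshow-style lines."""
    counters = {}
    per_vc = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "tim_txcrd_z_vc":
            # Everything after the "N- M:" range label is a per-VC value.
            per_vc.extend(int(v) for v in line.split(":", 1)[1].split())
        elif len(parts) > 1 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters, per_vc


if __name__ == "__main__":
    counters, per_vc = parse_portstats(SAMPLE)
    zero_credit_s = counters["tim_txcrd_z"] * TICK_US / 1_000_000
    print("time at zero TX credit (s):", zero_credit_s)
    print("per-VC sum matches total:", sum(per_vc) == counters["tim_txcrd_z"])
    print("transmit timeouts:", counters["er_tx_c3_timeout"])
```

For port 57 this gives roughly 3.45 seconds at zero transmit credit since the counters were last cleared, all of it on virtual channels 2 to 5.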
Let's take a look at the downstream port:
portstatsshow 55
tim_txcrd_z 1259255 Time TX Credit Zero (2.5Us ticks)
tim_txcrd_z_vc 0- 3: 0 0 239711 218720
tim_txcrd_z_vc 4- 7: 403321 397503 0 0
tim_txcrd_z_vc 8-11: 0 0 0 0
tim_txcrd_z_vc 12-15: 0 0 0 0
er_rx_c3_timeout 31 Class 3 receive frames discarded due to timeout
er_tx_c3_timeout 0 Class 3 transmit frames discarded due to timeout
The er_rx_c3_timeout counter is not zero, which means frames were also held for more than 500 ms on this port and dropped. Note that the upstream er_tx_c3_timeout does not always equal the downstream er_rx_c3_timeout; the values depend on when the counters were cleared and when the logs were collected.
We have checked the ISL between the two switches; now let's find the congested device. Since we saw er_tx_c3_timeout on the upstream port and er_rx_c3_timeout on the downstream port, there should be an affected F-Port on the upstream switch and another on the downstream switch.
How do we find these abnormal ports? We check the porterrshow output of both switches, and find that port 21 and port 27 have problems:
portstatsshow 21
er_rx_c3_timeout 22 Class 3 receive frames discarded due to timeout
er_tx_c3_timeout 0 Class 3 transmit frames discarded due to timeout
portstatsshow 27
er_rx_c3_timeout 0 Class 3 receive frames discarded due to timeout
er_tx_c3_timeout 31 Class 3 transmit frames discarded due to timeout
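The difference between the two timeout counters can be used to narrow down the suspect port. The sketch below applies one heuristic that follows from the credit mechanism: a transmit timeout on an F-Port suggests that the attached device did not return credits in time. The counter values are the ones shown in this article. The port types are assumptions (57 and 55 are the ISL ports from islshow, 21 and 27 are taken to be F-Ports), and the function name `suspect_f_ports` is hypothetical.

```python
# Counter values from the portstatsshow excerpts in this article.
# Port types are assumptions made for this illustration.
PORTS = [
    {"port": 57, "type": "E-Port", "er_rx_c3_timeout": 0, "er_tx_c3_timeout": 23},
    {"port": 55, "type": "E-Port", "er_rx_c3_timeout": 31, "er_tx_c3_timeout": 0},
    {"port": 21, "type": "F-Port", "er_rx_c3_timeout": 22, "er_tx_c3_timeout": 0},
    {"port": 27, "type": "F-Port", "er_rx_c3_timeout": 0, "er_tx_c3_timeout": 31},
]


def suspect_f_ports(ports):
    """Return F-Ports whose transmit frames timed out.

    Frames sent toward the attached device were held past the hold time,
    so that device is a candidate slow drain device.
    """
    return [
        p["port"]
        for p in ports
        if p["type"] == "F-Port" and p["er_tx_c3_timeout"] > 0
    ]


if __name__ == "__main__":
    print("F-Ports to investigate first:", suspect_f_ports(PORTS))
```

With these values only port 27 is flagged. The receive timeouts on ports 21 and 55 are consistent with frames backing up behind that port rather than with a second slow device, but that reading should be confirmed against the actual topology.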
Let's take a look at the diagram again. Are any other ports affected? Because there is only one ISL between the two switches, all ports on the data transmission path are affected as well. Note that port 26 on the downstream switch is not affected, since its data is congested on the upstream switch.
In a multi-switch SAN environment, the same steps can be used to find the abnormal device from the portstatsshow output. In a single-switch environment, only the F-Ports need to be checked.
Troubleshoot bottleneck devices
Next we need to find the cause of the bottleneck. The usual steps are:
1. Use porterrshow or portstatsshow to check for errors at the physical layer
2. Use sfpshow to check the power levels of the SFP modules
3. Use switchshow to check the port status
4. Use fabriclog --show to check for port resets. Look for ports issuing link resets, which can indicate that the link is going through credit recovery
5. If the steps above find nothing, check the connected device
Back to this case: we find a few errors at the physical layer, and the RX power level is below -7 dBm, so we need to check the fiber cable between the switch and the device.
portstatsshow 27
er_enc_out 34181 Encoding error outside of frames
er_bad_os 23541 Invalid ordered set
sfpshow 27
RX Power: -23.0 dBm (0.5 uW) 10.0 uW 1258.9 uW 15.8 uW 1000.0 uW
TX Power: -3.2 dBm (477.3 uW) 125.9 uW 631.0 uW 158.5 uW 562.3 uW
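The power readings above can be compared against the thresholds printed on the same line. The sketch below assumes that the four values after the measured power are, in order, the low alarm, high alarm, low warning, and high warning thresholds; that column order is an assumption and should be checked against the header of the real sfpshow output. The function name `check_power` is hypothetical.

```python
import re


def check_power(line):
    """Classify an sfpshow RX/TX Power line against its own thresholds.

    Assumed column order after the measured value:
    low alarm, high alarm, low warning, high warning (all in uW).
    """
    label = line.split(":", 1)[0].strip()
    values = [float(v) for v in re.findall(r"(\d+(?:\.\d+)?)\s*uW", line)]
    if len(values) != 5:
        raise ValueError("unexpected sfpshow power line: " + line)
    measured, alarm_low, alarm_high, warn_low, warn_high = values
    if measured < alarm_low:
        status = "low alarm"
    elif measured > alarm_high:
        status = "high alarm"
    elif measured < warn_low:
        status = "low warning"
    elif measured > warn_high:
        status = "high warning"
    else:
        status = "ok"
    return label, measured, status


if __name__ == "__main__":
    for line in (
        "RX Power: -23.0 dBm (0.5 uW) 10.0 uW 1258.9 uW 15.8 uW 1000.0 uW",
        "TX Power: -3.2 dBm (477.3 uW) 125.9 uW 631.0 uW 158.5 uW 562.3 uW",
    ):
        print(check_power(line))
```

Under that assumption the RX reading on port 27 falls below the low alarm threshold while the TX reading is within range, which points to the receive path rather than the local transmitter.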
After the fiber cable was replaced the problem was solved, which indicates that the bad cable was the cause. If no problem can be found on the switches, check the connected device (e.g. the HBA card) for problems.
You can use the
Summary
The key to troubleshooting a Brocade SAN performance issue is to look for the bottleneck device along the congested data path. Understanding the difference between er_rx_c3_timeout and er_tx_c3_timeout is very important.
We suggest clearing the counters while the devices are working normally. When a performance issue occurs, the logs collected during that period are then the most meaningful for troubleshooting.