Troubleshooting and Data collection for an Infiniband switch that becomes inaccessible over the network (Doc ID 2511028.1)

Applies to:

Sun Datacenter InfiniBand Switch 36 - Version All Versions to All Versions [Release All Releases]
Sun Network QDR InfiniBand Gateway Switch - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Purpose

To identify why an InfiniBand switch becomes inaccessible over the network, and to collect the data needed to provide feedback to engineering.

Scope

 

Details

IMPORTANT: All output from the steps below must be collected by the Field Engineer (FE) and provided in the Field Task / Service Request (SR) notes.

Follow all of these steps exactly as written whenever a switch becomes inaccessible over the network.

1. While logged in to another switch in the same fabric, collect and document the output of the following commands:
# ibswitches
# ibnetdiscover
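
The Step 1 output is easier to attach to the SR if it is captured in a single file. The following is a minimal sketch of one way to do that, not an Oracle-supplied tool. The `LOG` path and the `log_cmd` helper are illustrative names, and the sketch assumes a POSIX shell and a writable /tmp on the switch where it is run.

```shell
#!/bin/sh
# Sketch only: capture the Step 1 fabric output in one timestamped file.
# LOG and log_cmd are illustrative names, not part of the switch firmware.
# Assumption: /tmp is writable on the switch you are logged in to.
LOG="${LOG:-/tmp/ib_fabric_$(date +%Y%m%d_%H%M%S).log}"

log_cmd() {
    # Record the command line, its combined stdout/stderr, and its exit status.
    printf '===== # %s =====\n' "$*" >> "$LOG"
    "$@" >> "$LOG" 2>&1
    printf '===== exit status: %s =====\n\n' "$?" >> "$LOG"
}

log_cmd date
log_cmd ibswitches
log_cmd ibnetdiscover
echo "Output saved to $LOG"
```

A command that fails or is missing does not stop the run. Its error text and non-zero exit status are written to the file, so the gap is visible to whoever reviews the data.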

2. FE to check and document the status of all the LEDs on the Infiniband Switch.

A) Network Link (NET0 MGMT Port LED on the Left): ON:_____  Color: ____________ BLINK:_____ OFF:_____

B) Network Activity (NET0 MGMT Port LED on the Right): ON:_____  Color: ____________ BLINK:_____ OFF:_____

C) InfiniBand Ports Link Lights: ON:_____  OFF:_____

D) Chassis Status LEDs

Top Locator White LED - ON:_____ OFF:_____ Flashing:_____
Middle Attention Amber LED - ON:_____ OFF:_____ Flashing:_____
Bottom OK Green LED - ON:_____ OFF:_____ Flashing:_____

3. FE to connect to the affected InfiniBand switch via the serial console, then capture and provide the following data.

A) If the InfiniBand switch comes online, collect the output of the commands below:
# version
# getmaster
# env_test
# uptime
# date
# service --status-all
# ifconfig -a
# ethtool eth0
# ethtool -S eth0
# ping -b {broadcast address found in ifconfig -a}
(Press Ctrl+C after a few seconds to stop the ping, then rerun the interface commands to capture their state after the ping.)
# ifconfig -a
# ethtool eth0
# ethtool -S eth0
# fwverify

# ls -l /etc/sysconfig/network-scripts/ifcfg-eth0
# ls -l /config/etc/sysconfig/network-scripts/ifcfg-eth0
# cat /etc/sysconfig/network-scripts/ifcfg-eth0
# cat /config/etc/sysconfig/network-scripts/ifcfg-eth0
# showdisk
# tail -150 /var/log/messages
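
Step 3 A runs `ethtool -S eth0` both before and after the broadcast ping, which suggests the interface counters are meant to be compared. The sketch below shows one possible way to list the counters that changed between two saved captures. The `counter_delta` helper and the file names are hypothetical, and the sketch assumes `awk` is available, either on the switch or on a workstation where the console log has been saved.

```shell
#!/bin/sh
# Sketch only: list the "name: value" counters that differ between two saved
# `ethtool -S eth0` captures. counter_delta is an illustrative helper name.
counter_delta() {
    # $1 = capture taken before the ping, $2 = capture taken after it
    awk -F': *' '
        NR == FNR { before[$1] = $2; next }     # first file: remember values
        ($1 in before) && before[$1] != $2 {    # second file: report changes
            name = $1
            sub(/^[ \t]+/, "", name)            # drop the leading indent
            printf "%s: %s -> %s\n", name, before[$1], $2
        }
    ' "$1" "$2"
}
```

For example, save the first `ethtool -S eth0` output to /tmp/ethtool_before.txt and the second to /tmp/ethtool_after.txt (hypothetical paths), then run `counter_delta /tmp/ethtool_before.txt /tmp/ethtool_after.txt`. The raw command output should still be provided in the SR. The comparison is only a reading aid.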

B) FE to also collect an ILOM snapshot of the InfiniBand switch and provide it in the SR/Field Task.
Refer to: How to Generate iLOM Snapshot on infiniband Switches (Doc ID 1594992.1)

4. If the switch is not accessible via the console, FE to power-cycle the affected InfiniBand switch and check whether it comes online.

A) If the switch is now accessible, collect and provide all the data as per Step 3 A and B.

B) If the switch is still not accessible, proceed to Step 5.

5. If the InfiniBand switch is still not online, FE to follow the MOS Notes below:

A) Infiniband Switch Stays In Pre-boot Environment During Upgrade/Reboot (Doc ID 2202721.1). After following this note, validate whether the InfiniBand switch boots up.

B) During upgrading or re-booting of an Infiniband switch from version 2.1.x the switch will not boot, NO Link Lights, Fans spinning full speed (Doc ID 2280595.1)

C) Preventing an Infiniband Switch from becoming un-bootable (during upgrade for Sun Datacenter Infiniband 36 or during reboot or upgrade for Sun Network QDR InfiniBand Gateway Switch) due to Real Time Clock corruption (Doc ID 2302714.1)

6. If the InfiniBand switch is still not online, FE to re-image the switch to the factory image.

FE to follow: Procedure to restore NM2-36p and NM2-GW switches to factory image (Doc ID 1467182.1)

 

