We recently upgraded the firmware on our Power frame, which required shutting down some of our AIX LPARs. The firmware upgrade went well, as did starting up all the AIX LPARs, except for one. This particular LPAR booted to HMC LED code 2700 and hung there. I restarted the partition to the Open Firmware (OF) prompt and tried booting again in verbose mode to see where the boot process was hanging.
----------------
Attempting to configure device 'fscsi6'
cfgmgr Time: 0 LEDS: 0x2700
Invoking /usr/lib/methods/cfgefscsi -1 -l fscsi6
Number of running methods: 4
----------------
Completed method for: fcs7, Elapsed time = 0
Return code = 0
***** stdout *****
fscsi7
*** no stderr ****
----------------
Time: 0 LEDS: 0x2700 for fscsi6
Number of running methods: 3
cfgmgr exec(/bin/sh,-c,/usr/lib/methods/cfgefscsi -1 -l fscsi6){135238,102466}
----------------
Attempting to configure device 'fscsi7'
cfgmgr Time: 0 LEDS: 0x2700
Invoking /usr/lib/methods/cfgefscsi -1 -l fscsi7
exec(/usr/lib/methods/cfgefscsi,-1,-l,fscsi6){135238,102466}
Number of running methods: 4
exec(/bin/sh,-c,/usr/lib/methods/cfgefscsi -1 -l fscsi7){122944,102466}
exec(/usr/lib/methods/cfgefscsi,-1,-l,fscsi7){122944,102466}
----------------
Completed method for: fscsi7, Elapsed time = 0
Return code = 0
*** no stdout ****
*** no stderr ****
----------------
Time: 0 LEDS: 0x2700 for fscsi6
Number of running methods: 3
cfgmgr
----------------
Completed method for: fscsi5, Elapsed time = 0
Return code = 0
*** no stdout ****
*** no stderr ****
----------------
Time: 0 LEDS: 0x2700 for fscsi6
Number of running methods: 2
cfgmgr
To perform a verbose boot from OF, you can do the following (substitute your own managed system, LPAR and profile names for the placeholders):

$ chsysstate -r lpar -m <managed system> -o on -n <lpar name> -b of -f <profile>
$ mkvterm -m <managed system> -p <lpar name>

0> boot -s verbose
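If you want to keep an eye on the partition's reference (LED) code from the HMC while it boots, the lsrefcode command can report it. A minimal sketch, again using placeholder names:

$ lsrefcode -r lpar -m <managed system> --filter "lpar_names=<lpar name>" -F refcode

In our case this sat at 2700 for the stuck partition.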
From the output, we can see that the virtual Fibre Channel adapters scan for the visible LUNs presented through them and that "cfgmgr" runs its configuration methods against them. In this instance, the "cfgmgr" process was hanging, which is what causes the HMC LED 2700 code. This particular LPAR has LUNs presented to it from EMC storage through VIOS via NPIV, some of which had a state of NR (Not Ready).
Note that I've deliberately masked the Symmetrix and LUN IDs.
Name: aix_lpar1

Symmetrix ID     : XXXXXXXXXX
Last updated at  : Mon Mar 18 16:20:54 2013
Masking Views    : Yes
FAST Policy      : Yes

Devices (62):
  {
  ---------------------------------------------------------
  Sym                   Device                        Cap
  Dev   Pdev Name       Config           Sts         (MB)
  ---------------------------------------------------------
  XXXX  N/A             RDF1+TDEV        NR         16386
  XXXX  N/A             RDF1+TDEV        RW         20481
  XXXX  N/A             RDF1+TDEV        RW         51203
  XXXX  N/A             RDF1+TDEV        RW          8633
  XXXX  N/A             RDF1+TDEV        NR         65537
  XXXX  N/A             RDF1+TDEV        RW         79873
  XXXX  N/A             RDF1+TDEV        RW         79873
  XXXX  N/A             RDF1+TDEV        NR         71681
  XXXX  N/A             RDF1+TDEV        NR         61440
  XXXX  N/A             RDF1+TDEV        NR        131074
  XXXX  N/A             RDF1+TDEV        NR        153600
  XXXX  N/A             RDF1+TDEV        NR        212993
  XXXX  N/A             RDF1+TDEV        NR        311303
  XXXX  N/A             RDF1+TDEV        NR         16385
  XXXX  N/A             RDF1+TDEV        NR         16385
  XXXX  N/A             RDF1+TDEV        NR         16385
  XXXX  N/A             RDF1+TDEV        NR         16385
  XXXX  N/A             RDF1+TDEV        NR         16385
  ...
  ...
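For what it's worth, a quick way to spot Not Ready devices from the Solutions Enabler CLI is something along these lines. This is only a sketch; the Symmetrix ID is masked here as above, and your storage team may have a preferred report:

$ symdev list -sid XXXXXXXXXX | grep -w NR

Any device listed with an NR status is a candidate for the kind of boot hang described here.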
The LUNs masked to this host are the targets of a clone configured from another AIX LPAR, which at some point resulted in the target LUNs (those on our HMC LED 2700 host) being placed in an NR state. Re-syncing the target LUNs from the source put them back into the correct state and allowed the AIX LPAR to boot successfully. During the investigation, a PMR was raised with IBM, which eventually came to the same conclusion: EMC LUNs in an NR state will cause an AIX LPAR to hang during boot. The "cfgmgr" operation will eventually fail the LUNs and the LPAR will boot, but in our case, with so many LUNs in an NR state, timing out every one of them would have taken many hours.
To prevent this from occurring in the future, either verify from the storage end that the LUNs are in a ready state prior to boot, or place those LUNs in a 'Defined' state in AIX before shutting down. There wasn't much information available when I searched for HMC LED 2700 hangs, so hopefully this helps someone with an AIX environment and EMC storage seeing the same symptoms.
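As a rough illustration of the second option, a loop like the one below could be run on the LPAR before shutdown to put every non-rootvg disk into a Defined state. Treat it as a sketch only: it assumes the volume groups on those disks have already been varied off, and rmdev without -d keeps the device definitions in the ODM, so the disks come back with the next cfgmgr or reboot.

# Put all non-rootvg disks into a Defined state before shutdown
# (assumes their volume groups have already been varied off)
for DISK in $(lsdev -Cc disk -F name); do
    if ! lspv | grep -w "$DISK" | grep -qw rootvg; then
        rmdev -l "$DISK"
    fi
done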
This information was useful to us. Can you give us some more information: did you engage the storage team to find out which LUNs were in an NR state?
Hi Rammaghenthar,
Not sure if there was a question there, but the storage team was crucial in resolving this problem, as I didn't have access to any of the storage consoles to view the LUN status… I haven't found a way to verify this from the AIX end from either SMS or OF.
-Kristian
Hi Kristian,
Thank you for your reply. What I am trying to figure out is why the LUNs masked to this host have a clone configured from another AIX LPAR. Also, without engaging the storage team, how was the re-sync of the target LUNs from the source performed to put them back into the correct state?
Regards
Rammaghenthar
The clone was initially configured with the assistance of the storage team, and scripts were then written that use EMC's Solutions Enabler to provide CLI access to the Symmetrix array.
The way it works is that the target LUNs are exported and then re-synced from the source server. The source server was functioning correctly, and as I couldn't boot the target servers (i.e. the target LUNs were not mounted), I simply ran the script that performs the re-sync and booted.
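For anyone wondering what that script boils down to, the general shape of a TimeFinder/Clone re-sync with symclone looks something like the commands below. This is only a sketch: the device group name is made up, and the exact actions and options (for example, whether explicit -tgt pairing is required) depend on how the clone sessions were originally created, so check with your storage team before running anything similar.

# Hypothetical device group containing the source/target pairs
$ symclone -g aix_lpar1_dg query

# Incrementally re-copy the targets from the source, then activate them
$ symclone -g aix_lpar1_dg recreate
$ symclone -g aix_lpar1_dg activate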
How long does this hang for?
I left it for about 45 minutes before I stopped it and started looking into the issue itself. IBM said the LUNs would eventually time out if left long enough, but it needs to time out every LUN it comes across in an "NR" state… and that could take quite some time.