It was a Tuesday morning and things were beginning to ramp up in the Sys Admin office. We were doing our routine AM checks when we noticed a production VM had the word “inaccessible” next to it in the console. The VM was completely unresponsive; we couldn’t open a console in vSphere and it wasn’t responding to pings. None of the typical vSphere operations did anything. Thirty minutes later we started to get calls that one of the spam filters (a VM on a different host) had gone down. Same issue: the VM was labeled “inaccessible” and was dead to the world. Finally, a third VM on yet another host showed the same symptoms a while later.
We hadn’t seen this sort of thing in VMware before. One thought was to remove the VMs from inventory and import them back into vCenter. But what if there was something more to it? The fact that it happened on three different hosts at three different times led us to believe this wasn’t just an isolated environmental issue. We decided to place the hosts in maintenance mode and reboot them, after which the VMs returned to normal operation.
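As an aside, this is the sort of thing that’s easy to watch for programmatically. Here’s a minimal pyVmomi sketch (not something we were running at the time; the vCenter name and credentials are placeholders) that walks the inventory and flags any VM whose connection state is no longer “connected”:

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def find_troubled_vms(vcenter, user, password):
    # Lab-style connection that skips certificate verification.
    ctx = ssl._create_unverified_context()
    si = SmartConnect(host=vcenter, user=user, pwd=password, sslContext=ctx)
    try:
        content = si.RetrieveContent()
        # Gather every VM in the inventory.
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        for vm in view.view:
            state = vm.runtime.connectionState
            if state != "connected":  # e.g. "inaccessible", "orphaned", "disconnected"
                host = vm.runtime.host.name if vm.runtime.host else "unknown host"
                print("%s on %s: %s" % (vm.name, host, state))
        view.DestroyView()
    finally:
        Disconnect(si)

find_troubled_vms("vcenter.example.com", "administrator@vsphere.local", "changeme")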
VMware support found identical errors on all three hosts:
2014-07-22T13:07:17.211Z cpu16:33706)HBX: 2692: Waiting for timed out [HB state abcdef02 offset 4087808 gen 201 stampUS 330309104827 uuid 53c95722-895ca497-ff59-00215a9b0500 jrnl drv 14.60] on vol 'SAN_Datastore1'
2014-07-22T13:07:21.682Z cpu4:1158818)WARNING: lpfc: lpfc_abort_handler:2989: 1:(0):0748 abort handler timed out waiting for aborting I/O xri x4c5 to complete: ret xbad0001, cmd x88, tgt_id x1, lun_id x0
2014-07-22T13:07:21.683Z cpu24:1271247)WARNING: lpfc: lpfc_abort_handler:2989: 1:(0):0748 abort handler timed out waiting for aborting I/O xri x513 to complete: ret xbad0001, cmd x88, tgt_id x1, lun_id x0
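If you want to check your own hosts for the same pattern, a quick way is to scan a copy of vmkernel.log for the two message types above. Here’s a rough Python sketch of that (the regexes are just my own take on the messages we saw, nothing official):

import re
import sys

PATTERNS = [
    re.compile(r"HBX: \d+: Waiting for timed out"),              # datastore heartbeat timeouts
    re.compile(r"lpfc_abort_handler.*abort handler timed out"),  # HBA abort handler timeouts
]

def scan(path):
    hits = 0
    with open(path) as log:
        for line in log:
            if any(p.search(line) for p in PATTERNS):
                hits += 1
                print(line.rstrip())
    print("%d matching entries in %s" % (hits, path))

scan(sys.argv[1] if len(sys.argv) > 1 else "vmkernel.log")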
Support recommended we downgrade the LPFC (Emulex OneConnect OCe11100 HBA) driver from 10.2.216.7 to 10.0.727.44. If you’ve followed our saga, you might recall VMware recommended version 10.2.216.7 in PSOD Blues Part 2. Apparently we had been misinformed. Nice!
Meanwhile, one of our vendor’s engineers mentioned he had seen similar anomalies with other customers. There was just something about running the HP ESXi build on Gen 8 hardware that caused frequent PSODs and even inaccessible VMs. His recommendation was to run the “generic” VMware ESXi build, as he had seen no issues across dozens of implementations on Gen 8 servers. All of this seemed to make perfect sense. My co-worker and I had never seen any instability at our previous jobs, and neither of us had run vendor builds. Within a week and a half we had rebuilt every one of our hosts without any downtime (isn’t virtualization great?). The only change I made was to downgrade the bundled HPSA driver from 5.5.0.58 to 5.5.0.50, as described in PSOD Blues Part 1.
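If you want to double-check driver versions across a cluster after this kind of rebuild, something like the following rough Python sketch works: it SSHes to each host and lists the storage driver VIBs. It assumes SSH is enabled on the hosts and that paramiko is installed; the host names, credentials, and grep pattern for the VIB names are placeholders that may differ in your environment.

import paramiko

HOSTS = ["esx01.example.com", "esx02.example.com"]
# Emulex FC/FCoE and HP Smart Array driver VIBs; adjust the pattern to your VIB names.
CMD = "esxcli software vib list | grep -iE 'lpfc|hpsa'"

def audit(hosts, user, password):
    for host in hosts:
        ssh = paramiko.SSHClient()
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        ssh.connect(host, username=user, password=password)
        try:
            _, stdout, _ = ssh.exec_command(CMD)
            print("== %s ==" % host)
            print(stdout.read().decode().strip())
        finally:
            ssh.close()

audit(HOSTS, "root", "changeme")

The output is the same thing you’d see running esxcli software vib list by hand in an SSH session on each host.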
It’s been just over a week since rebuilding our hosts and so far things have been stable. I’ll give an update here in a few weeks on how things are going.
Hey Matt, which firmware version do you have loaded on your Emulex cards? I’ve seen similar issues crop up recently on my G8 blades. Support is suggesting I bump mine up to 4.9.416.4 (from 4.9.416.0) to potentially correct the issue, but it seems like it’s primarily to fix the Emulex VMQ issue under Windows 2012 R2, so I’m not convinced at the moment.
Hey Joshua,
Who told you to upgrade to 4.9.416.4? VMware or HP? I’m curious because 4.9.416.0 is the latest firmware available on HP’s site. VMware told us we should be on 10.0.769.0, which I believe is the Emulex version number and not the HP version. Besides, Emulex has removed that firmware version from their site because of known issues. HP told us to upgrade to 4.9.416.0 (we were on 4.6.247.5), and we just finished upgrading our hosts this week. It’s not reassuring to hear that you’re on 4.9.416.0 and still seeing these issues. What model blades are you using? We have BL460c blades with 554FLB adapters.
Thanks for the input!
Matt
The suggestion came from HP support. VMware confirmed that they knew about the issue but redirected me to HP and wouldn’t make any recommendations. I’ve been running 4.9.416.0 since around May on my BL660c blades, which use the same 554FLB CNAs as yours. I’m fairly certain we saw the same issues on 4.9.311.20/25 previously as well, but it wasn’t fully investigated at the time since it was a fairly rare occurrence.
I’ve only really seen the issue on one of my clusters, but it’s the most I/O heavy of them.
Hey Matt, send me an email so we can discuss more details of this issue.
Thanks
-Joshua
Interesting to see someone else having an issue with lpfc_abort_handlers and the Emulex / HP combination. We have been chasing an issue for months with VMware / HP / IBM involving inaccessible datastores that appears to revolve around these lpfc_abort_handler messages. To this point no one has suggested a solution to us.
We are running within the HP recipe for our Virtual Connect / OA versions (4.2 / 4.21 respectively), which leaves our 554FLB adapters on firmware 4.9.416.0 and driver 10.0.725.203 (all HP branded).
For us the issue appears most often during any sort of SAN maintenance or storage performance problem. We encounter the error randomly on our FCoE-connected hosts, and if we are not quick to act we will eventually lose hostd on the host, which forces us to power off the VMs and reboot the host.
I don’t suppose you have any vendor details you could share, as this issue is causing some serious grief!
Hi Jason!
Joshua, who posted above, sent me this Emulex KB, which basically states that in some situations two consecutive I/Os can be sent to the same WQ (work queue) index, resulting in one of them being dropped. I was still seeing these errors in Log Insight on one of my hosts, and as soon as I upgraded to the 10.2.340.18 driver the issue went away; I haven’t seen it since.
Hope this helps!
Hey Jason, you can find this driver update on the HP site (as of a few days ago); it’s been on the VMware site since about mid-August.
I’ve been running it for about three weeks on all of my blades with no issues as well.
FYI, the HP recipe for September has VC 4.30 with Emulex firmware 10.2.340.19 (but use the updated driver and not what’s listed in the recipe).
Cheers
-Joshua