
Upgrading to ESXi 5.5: Update Manager vs ISO


I recently upgraded a cluster of ESXi 5.1 hosts to 5.5 and learned a lesson about using NICs that require a third-party driver.  All of my hosts use QLogic 10Gb CNA cards, which need a custom VIB to be loaded.  Unfortunately, if you try to upgrade to 5.5 using Update Manager, it disables the third-party drivers before actually performing the update.  Since Update Manager does the update over the network interface, as soon as that driver is disabled the host loses contact with vCenter and Update Manager and the upgrade hangs.  Luckily nothing has actually been changed at that point, and the host can simply be rebooted and will come back up as though nothing happened.

The solution is to do the upgrade from the ISO (either a physical DVD, a USB flash drive, or an image mounted over the network via a management card like an HP iLO) so that no networking is required during the upgrade.  Once the upgrade was complete I was able to install the newer driver for ESXi 5.5, and a reboot later my host was back in business.
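
For reference, the post-upgrade driver install from the ESXi shell looked roughly like the commands below; the datastore path and bundle name are placeholders for whatever package your card vendor supplies:

# confirm which QLogic driver VIB is currently installed
esxcli software vib list | grep -i ql

# install the ESXi 5.5 version of the driver from an offline bundle copied to a datastore
# (path and file name are placeholders -- use the bundle supplied by the vendor)
esxcli software vib install -d /vmfs/volumes/datastore1/qlogic-cna-driver-offline-bundle.zip

# reboot so the new driver takes effect
reboot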

So the lesson learned: be very careful when using third-party drivers that are not part of the base ESXi image.  It always seems safer to interact with the host directly rather than going through vCenter for upgrades whenever third-party drivers are involved (I haven’t had an issue with patches yet, though).

The next time I have to upgrade I may try creating a custom ISO with my third-party drivers already integrated, just to avoid the extra steps.  If I do, I’ll try to post about the process and how it went.
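
If I do, it will probably look something like the PowerCLI Image Builder sketch below.  This is only a rough outline with a few assumptions: the depot file names, the image profile name (the real ones include a build number and can be listed with Get-EsxImageProfile), and the driver package name are all placeholders.

# add the VMware ESXi 5.5 offline bundle and the vendor driver bundle as software depots
Add-EsxSoftwareDepot C:\ISO\VMware-ESXi-5.5.0-depot.zip
Add-EsxSoftwareDepot C:\ISO\qlogic-cna-driver-offline-bundle.zip

# clone the standard 5.5 image profile so it can be modified
New-EsxImageProfile -CloneProfile "ESXi-5.5.0-standard" -Name "ESXi-5.5.0-qlogic" -Vendor "custom"

# add the third-party driver package to the cloned profile (package name is a placeholder)
Add-EsxSoftwarePackage -ImageProfile "ESXi-5.5.0-qlogic" -SoftwarePackage "net-qlcnic"

# export the customized profile as a bootable ISO
Export-EsxImageProfile -ImageProfile "ESXi-5.5.0-qlogic" -ExportToIso -FilePath C:\ISO\ESXi-5.5.0-qlogic.iso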


Microsoft Cluster Fun


I had an interesting experience recovering a single-node Windows Server 2008 R2 cluster running multiple SQL Server 2008 instances.  We suffered a power failure that caused the server to reboot, and after it came back up the cluster service would crash on start.

Initially the only thing to go on was a single entry in the System Event Log for Event ID 1573:

Node ‘Servername’ failed to form a cluster.  This was because the witness was not accessible.  Please ensure that the witness resource is online.

I checked on the quorum disk and it was there and marked as reserved, as expected.  Head scratching commenced for a bit.  I tried a reboot just to make sure and had the same issue.  I tried to manually start the service and had the same issue.  Some googling on the error turned up a few leads that ended up being nothing.

I tried starting the service with the fixquorum switch with no result, and also tried the resetquorumlog switch with no luck.
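
For anyone following along, those startup switches are passed to the cluster service from an elevated command prompt, something like the following (the resetquorumlog switch is a holdover from older cluster versions and may not do anything on 2008 R2):

rem start the cluster service forcing it up without quorum
net start clussvc /fixquorum

rem the resetquorumlog switch is passed the same way
net start clussvc /resetquorumlog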

I then discovered the cluster log command, which generates a text file log of the cluster service, and that’s when I finally started to make some progress:

Open a command prompt and run cluster log /g

This will output a file Cluster.log in C:\Windows\Cluster\Reports

On initial review of the log I found:

00000990.00000cf8::2012/11/14-13:18:25.643 ERR   mscs::QuorumAgent::FormLeaderWorker::operator (): ERROR_FILE_NOT_FOUND(2)’ because of ‘OpenSubKey failed.’

This told me there was something wrong in the registry hive for the cluster.  The hive for the cluster is located in C:\Windows\Cluster and is a file called CLUSDB.  This file is automatically expanded and loaded under HKLM in the registry when the cluster service starts.  It was during this process that the cluster service was crashing, so something in the file was corrupted or wrong.

My first attempt at a fix was to recover the CLUSDB file from a midnight snapshot taken about three hours prior to the power issue that caused the reboot.  Unfortunately this did not solve the problem, which made me realize the file had been changed or corrupted before the reboot and the problem simply hadn’t shown itself until then.  I went back to the Cluster.log file to see if I could find any more information.  I was regenerating it (cluster log /g) after each attempt to start the service to see if anything was changing, and I noticed something common to each startup:

000014b0.00000cf8::2012/11/14-13:25:39.708 DBG   [RCM] Resource ‘SQL Server (INSTANCENAME)’ is hosted in a separate monitor.
000014b0.00000cf8::2012/11/14-13:25:39.708 DBG   [RCM] rcm::RcmAgent::Unload()
000014b0.00000cf8::2012/11/14-13:25:39.708 INFO  Shutdown lock acquired, proceeding with shutdown

On each startup it would fail after the same INSTANCENAME and start to shut down the service, but I knew there should have been more resources listed, which meant the problem might be with the resource right after the last INSTANCENAME noted in the log.

With the cluster service stopped (so it wouldn’t try to restart and the hive wouldn’t be loaded) I launched regedit.  I navigated to HKLM, did a File->Load Hive, selected the CLUSDB file in C:\Windows\Cluster, and gave it the name “Cluster” when prompted.  I then expanded the new Cluster folder and its Resources folder and started going through the list.  I quickly realized the order of the resources in the folder matched the order they were being noted in the Cluster.log file.  The resource immediately after the last INSTANCENAME noted in Cluster.log was the Available Storage resource.

Looking at the keys for that resource, I saw it had other resource IDs listed in its “contains” key.  These should have been storage resources sitting in the Available Storage group, except I knew there shouldn’t be any.  I made note of the two resource IDs in the contains key and went through the rest of the resources to confirm they didn’t actually exist, and they didn’t.  I went back to the contains key for the Available Storage resource, edited it, and removed the two entries.  I then highlighted the Cluster folder under HKLM, unloaded the hive (File->Unload Hive), and closed regedit.  I started the cluster service manually and this time everything came up correctly.
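
If you prefer to stay at the command line, the same hive can be loaded and unloaded with reg.exe instead of the regedit GUI (again with the cluster service stopped); a minimal sketch:

rem load the cluster hive under HKLM\Cluster (the cluster service must be stopped)
reg load HKLM\Cluster C:\Windows\Cluster\CLUSDB

rem browse the loaded hive, e.g. list the resource keys
reg query HKLM\Cluster\Resources

rem after making any edits, unload the hive so the cluster service can use it again
reg unload HKLM\Cluster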

So what happened?

Roughly two weeks prior to this outage, an instance had been removed from the cluster.  It had four storage devices associated with it, which were initially moved to the Available Storage group after being removed from the instance’s group and then deleted as disks from the cluster.  Apparently this process (done via the Failover Cluster Manager GUI) failed to fully remove two of the four objects from the registry.  I’ve found a few other people suggesting to always use the command-line cluster program to remove resources to be extra safe, which I plan to do from now on (see the sketch below).  The problem did not show up until the next time the cluster service restarted.
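
For what it’s worth, removing a disk resource with the command-line tool is a one-liner; the resource name below is just a placeholder:

rem list all cluster resources and their current state
cluster res

rem delete a specific resource from the cluster configuration
cluster res "Cluster Disk 2" /delete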


When power saving is not your friend


I’ve been investigating a performance problem in a VM on one of our ESXi 5 clusters that led to an interesting discovery about power saving settings on the ESXi host.  Basically, under certain scenarios (and perhaps with specific CPUs) the physical CPUs will be clocked down even though a VM is trying to use 100% of its CPU.

The physical host servers are HP DL385 G7s with two AMD Opteron 6174 12-core processors @ 2.2GHz and 128 GB of RAM.  They boot from an integrated SD flash card, and all other storage is provided by our Compellent SAN.

In the bios there are 3 key settings under the Power Management Options:

HP Power Profile – This defaults to “Balanced Power and Performance” but I’ve changed it to “Maximum Performance”

HP Power Regulator – This defaults to “HP Dynamic Power Savings Mode” but changes automatically to “HP Static High Performance Mode” after changing the power profile setting

Advanced Power Management -> Minimum Processor Idle Power State – This defaults to “No C-states” and that is what we want it set to

The VM I’m testing with has 4 vCPUs and 8 GB of RAM assigned to it.  This VM hosts a Lotus Domino server with some custom applications.  When the application is used it can drive the CPU to 100% utilization within the VM.

From testing the same processes over and over, we observed that each process would take 50-150% longer to run with the BIOS set to Balanced versus Maximum Performance.

What I believe is happening is that while the VM is running at 100% CPU, it is only using 4 of the 12 cores on a single physical socket (and 4 of 24 total in the host), and the other VMs on this host all have light CPU load, so the physical host perceives itself as lightly loaded and clocks down the CPUs.  Our VM running at 100% CPU is therefore not getting 2.2GHz of clock speed but some lesser amount, depending on how far the host has clocked down.  Since that downclocking is dynamic, it would also account for the performance variance we are seeing.
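
One way to sanity-check what the host thinks it is doing is to look at the CPU power management info the vSphere API exposes.  A quick PowerCLI sketch follows (the vCenter and host names are placeholders); note that with the BIOS in one of the HP-managed modes ESXi may report only limited hardware support, but it at least confirms which policy the host believes is active:

# connect to vCenter and pull the CPU power management info for one host
Connect-VIServer vcenter.example.local
$esx = Get-VMHost esxhost01.example.local | Get-View

# CurrentPolicy is the active ESXi power policy; HardwareSupport shows what the BIOS exposes to ESXi
$esx.Hardware.CpuPowerManagementInfo | Format-List CurrentPolicy, HardwareSupport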

In googling around I’ve found other people using the AMD Opteron 61xx series processors with VMware having a similar issue.  It’s possible this is just an issue with that line, as I don’t believe a CPU should slow its clock speed dynamically while a single core is fully used (rather than relying on the average load across all cores to decide whether to save power by clocking down).

We have another cluster that uses AMD Opteron 6282 SE processors that I plan to do some additional testing on to see if the problem exists there as well.  I’ll update this post once I’ve had a chance to do that.

For now all of our hosts using the 6174 processors have been set to force max performance (more power and heat unfortunately).
