Wednesday, December 28, 2011

No Performance Data after Upgrading vCenter from 4.1 Update 1 to Update 2

I've been struggling with this for a couple of weeks now.

Something had a lock on the database and it could not be updated. We fixed that, which stopped the SQL errors and a stats insertion error, but there was still no historical performance data. This stopped working last Saturday at 2:30 am.

We followed the following KB article to delete and re-create the views in the SQL database. This resolved the issue:
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=1030819
Interesting articles related to this problem:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1029824

http://communities.vmware.com/docs/DOC-14356

Tuesday, December 27, 2011

How to Configure Jumbo Frames for NFS vmkernel Ports on a Distributed Switch

OK, so you want to use jumbo frames because they are faster and more efficient, but there is no way to configure this in the vSphere Client. You are correct. It can, however, be configured from the CLI.

To do this, first set the MTU size to 9000 on the distributed switch:
   Select "Manage this vNetwork distributed switch"
   Click on "Advanced"
   Enter 9000 for "Maximum MTU:"

Next, create a vmkernel port the way you normally would. This will provide you with a dvport ID, which you will need for the steps below.

At this point, your vSwitch is configured for jumbo frames, but your vmkernel port is still at an MTU of 1500. To change this, go to the command line on your server and do these steps:
  1) run esxcfg-vswitch -l to document your vmk port, dvport ID, IP, mask, and dv switch name.
  2) delete the vmkernel port you just created with this command:
       esxcfg-vmknic -d -s "dvSwitch name" -v 'dvport ID'
       (substitute the name of your distributed switch and the dvport ID you just documented)
  3) re-create your vmkernel port with this command:
       esxcfg-vmknic -a -i 'ip address' -n 'mask' -m 9000 -s "dvSwitch name" -v 'dvport ID'
       (where ip address = the IP address of the vmkernel port, mask = its mask)
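The three steps above can be sketched as a single console session. Note the switch name "dvSwitch01", dvport ID 256, and address 192.168.10.21/24 are placeholder values, not from any real environment; substitute what you documented in step 1.

```shell
# Sketch of the steps above, run from the ESX console.
# "dvSwitch01", dvport ID 256, and 192.168.10.21/24 are placeholders.

# 1) List switches and vmkernel ports; note vmk name, dvport ID, IP, and mask.
esxcfg-vswitch -l

# 2) Delete the 1500-MTU vmkernel port the vSphere Client created.
esxcfg-vmknic -d -s "dvSwitch01" -v 256

# 3) Re-create it on the same dvport with a 9000-byte MTU.
esxcfg-vmknic -a -i 192.168.10.21 -n 255.255.255.0 -m 9000 -s "dvSwitch01" -v 256
```

Because the port is deleted and re-created on the same dvport ID, it keeps its place on the distributed switch; only the MTU changes.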

You now have a jumbo frame vmkernel port.

Make sure your switches all along the way are configured for jumbo and that your storage is likewise configured for jumbo.

You can test this with the vmkping command. Note that -s sets the ICMP payload size, so use 8972 (9000 minus 28 bytes of IP and ICMP headers) together with -d to prevent fragmentation:
    # vmkping -d -s 8972 'destination ip'
(you should see successful replies at that payload size.)
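The payload arithmetic is worth spelling out, since -s sets the ICMP payload rather than the on-wire frame size. A tiny sketch (the 20-byte IP and 8-byte ICMP header sizes are standard):

```shell
# vmkping's -s flag sets the ICMP payload, not the frame size. To fill a
# 9000-byte MTU exactly, subtract the 20-byte IP header and the 8-byte
# ICMP header. The -d flag forbids fragmentation, so an undersized path
# MTU fails loudly instead of silently fragmenting.
MTU=9000
PAYLOAD=$((MTU - 20 - 8))
echo "vmkping -d -s ${PAYLOAD} <destination ip>"
```

If the reply comes back at that payload size with -d set, every hop on the path is passing full jumbo frames.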

Intel 10 GB NICs, distributed switches and ESX 4

If your ESX hosts are still at ESX 4.0 Update 2 and you have Intel 10 Gb NICs, you are no doubt aware that Update 2 caused problems if you are using a distributed switch for your Intel adapters.

I was fortunate to have discovered this in my lab first. When I tested it, I found that I lost my network connections upon rebooting after applying Update 2.

ESX 4.0 Update 3 and ESX 4.1 Update 1 both fix this problem.

Advanced Settings for NFS

If you are using NetApp for storage and have NFS datastores, these are NetApp's recommended settings:

NFS.MaxVolumes             64
NFS.HeartbeatFrequency     12
NFS.HeartbeatMaxFailures   10
NFS.HeartbeatTimeout        5

Net.TcpipHeapSize          30
Net.TcpipHeapMax          120

We have standardized on this and it has worked very well for us.
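If you prefer the console to the vSphere Client, the same values can be applied with esxcfg-advcfg. This is a sketch; the option paths mirror the Advanced Settings names shown above, but verify them on your own host before scripting anything.

```shell
# Apply the NetApp-recommended NFS and TCP/IP heap values from the
# ESX console. Each option path matches an Advanced Settings entry.
esxcfg-advcfg -s 64  /NFS/MaxVolumes
esxcfg-advcfg -s 12  /NFS/HeartbeatFrequency
esxcfg-advcfg -s 10  /NFS/HeartbeatMaxFailures
esxcfg-advcfg -s 5   /NFS/HeartbeatTimeout
esxcfg-advcfg -s 30  /Net/TcpipHeapSize
esxcfg-advcfg -s 120 /Net/TcpipHeapMax
```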

Hint: If you install NetApp's Virtual Storage Console (VSC), it will recommend and set these for you.

A couple of these settings require a server reboot.

VMware Tools Failure with 4.1 Update 1

We started upgrading VMware Tools after our ESX hosts were upgraded to 4.1 Update 1. Our vCenter is at 4.1 Update 2.

I have not figured out what causes this, but an automatic VMware Tools upgrade will occasionally fail, and it will not let me end the VMware Tools upgrade. I get a pop-up message saying this is not allowed in the current state.

To resolve this, I was able to log into the ESX host the VM was running on using the vSphere Client and end the VMware Tools upgrade. I was then able to do the VMware Tools upgrade from there.

Monday, December 19, 2011

ESXi 4.1 Update 1 Upgrades

I'm finally close to finishing up my ESXi 4.1 upgrades. This has been the smoothest and easiest production upgrade I've done in a while. Testing it in the lab was a different story, though.

I wanted to start this much earlier this year, and I tested it in my lab environment soon after Update 1 was released. All the problems apparently were due to using Oracle for my vCenter database. The primary issue was that a VM would lose communication with the host when doing a vMotion. There were other vCenter communication problems and a major problem starting the vCenter service. All of these problems appear to have been traced back to Oracle. There was an extremely knowledgeable and nice lady in Ireland who provided Oracle support for me. She resolved all the problems with the exception of vMotion; VMware did not seem to have a solution for this. Finally, after working on this for a couple of months, I switched to a SQL database and all my problems went away.

Another thing that made my lab upgrade more complex was View 4.5. I had this running in my lab environment and wanted to preserve it; upgrading to 4.1 alongside View Manager and Composer added extra steps.

But once I got this working in the lab, the production upgrade sailed through. I had absolutely no problems upgrading vCenter to 4.1 Update 2 and my ESXi hosts to 4.1 Update 1. I only have 5 left to do out of 27.

4.1 has some very nice features I have come to appreciate. The obvious improvement is that vMotion is faster and easier; now you can just drag and drop a VM onto a host. Because some of my ESX hosts use 10 Gb, I moved the vMotion port off a 1 Gb interface and onto 10 Gb, and configured a Traffic Shaping policy to limit it to 3,000,000 Kbits/sec for the average and peak bandwidth with a burst size of 3,145,728 Kbytes. Those settings have worked very well for me.

Another feature I really like is tying some VMs to a host or hosts. Because we only have one host licensed for SQL, I tied all the SQL VMs to it. One caveat: select "Should run on this host" rather than "Must run," unless the VMs truly cannot run on any other host. This way, if the host is not available, the VM will run on another host. I'm also using this for my Exchange VMs and my Linux VMs. Linux and SQL are for licensing, and Exchange is for performance.

I also like being able to enable and disable SSH from the vSphere Client.

And finally, the performance stats are improved. Now I can see my datastore stats for an individual VM and can see the stats for my NFS storage.

In summary, 4.1 has added some very useful features in addition to increasing performance, making it more scalable, more secure, and more reliable. Overall I am very happy with 4.1. And with 4.1, my 10 Gb connections work great (the issue with 4.0 Update 2 has been fixed).

Friday, December 16, 2011

The ESXi 4.x vmk0 Feature

The ESXi 4.x vmk0 issue bit me once again. I should have known better; I had almost forgotten about this. In this case, I wanted to change my vMotion VLAN and IP address. I failed to notice vMotion was using vmk0. I changed the VLAN and IP, and voila... everything went grey. I lost access to my ESX server and could only get to it via a remote console.

I then got hold of VMware tech support, and with them on the line managed to also blow away all my NFS storage... Nice. They quickly got things back to normal... more or less. The DCUI still showed my mgmt network as being one of my NAS storage IP addresses. I discovered a little-known command that fixed this, but it unfixes itself at the next reboot. Alas, after consulting again with VMware, my only hope was to rebuild the server.

Fortunately this esxi host was in maintenance mode, so no VMs were impacted by any of this. Just a couple of long days for me.

So I asked myself, how did doing something as simple as changing a vMotion IP address cause all these problems?

The answer is that part of the ESXi 4 hypervisor believes vmk0 is the mgmt network. So even if you tell it otherwise, it does not believe you. If vmk0 gets assigned to anything other than the mgmt network, you are in trouble.

But it is even worse than that. You might think you can fix this by deleting vmk0 and the mgmt network and re-creating the mgmt network as vmk0. That sounds like a good plan, but you would be wrong. As soon as you delete vmk0, VMware looks for the next vmk and is convinced that is the mgmt network, even though something else is using it. In my case, it happened upon one of my NAS IP addresses.

How might this happen? You may need to re-create your mgmt network. If you do, you might as well resign yourself to re-creating the server. Or you might be trying to get a new server to comply with a host profile. If you were unlucky enough to use an ESX server as a host profile, guess what? vmk0 is not used by the service console and is assigned to something else, like vMotion or NAS storage or... You will not be able to get your ESXi host to comply with the profile. In desperation, you apply the profile, which fixes the compliance problem. Unfortunately, it also assigns vmk0 to something other than your mgmt network. Your only recourse at this point is, you guessed it... to rebuild your ESXi host.

The moral of the story: if you have an ESXi host, do not change an IP address that is using vmk0 unless vmk0 is your mgmt network. Double-check your vSphere Client against your DCUI to make sure they both agree, and if you can, check your hosts file to make sure it is in agreement as well. If they all agree, you are golden! Otherwise, think long and hard about changing that IP address. And never, ever apply a host profile to an ESXi server; it is too easy to mess this one up. DO use a host profile to check your compliance, but do not apply it.
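A quick console cross-check goes well with the DCUI comparison above. This is a sketch using two standard ESX/ESXi 4.x commands:

```shell
# List every vmkernel NIC with its port group, IP, netmask, and MTU.
# Confirm the vmk0 row really is your management network before
# changing any vmkernel IPs, then compare against what the DCUI shows.
esxcfg-vmknic -l

# The default gateway should also point out of the management network.
esxcfg-route -l
```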