RFVDB

vCenter Memory error

Hi,

I have a client with a couple of hosts on which I recently upgraded some hardware. I added a 2nd CPU and maxed out the RAM on both hosts, from about 8GB to 32GB.

We're using vCenter 5.1 and ESXi 5.1 Build 941893.

One of the hosts came up with a memory alert a few days ago. It said to go to the hardware tab. The hardware tab shows the eight 4GB sticks, and above them the "Memory" header just says "Alert", with nothing in the details tab. Rebooting the host and watching for memory alerts or errors during boot shows nothing. I don't know where to go to get details on this memory error (see attached image).

The motherboard is an older Intel Motherboard: S5000PAL. It is one revision behind on the BIOS so I tried updating the BIOS using a USB DOS Bootable stick. Every time it tries to boot to the USB Drive it freezes (I tried front and back panel USB ports). I tried the USB drive on another system and it boots into it just fine.

I didn't have time to use the Intel Deployment ISO to update the BIOS but will do that on the next visit. I will also do an extensive memory test when I get there.

I chatted with an Intel rep and they didn't know where to go other than update BIOS and do a memory test.

Anyone know where to get more details on this from VMware?

I also removed all of the RAM and re-seated it but it still has the alert when booting back up.

I need to add more servers to this host but don't want to do that until I've resolved this issue.
memory-error.PNG
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

Is this server on the HCL?

Check the VMware Hardware Compatibility List (HCL) here:

The VMware Hardware Compatibility List is the detailed list of vendor devices that are either physically tested or are similar to devices tested by VMware or VMware partners. Items on the list are tested with VMware products and are known to operate correctly. Devices that are not on the list may function, but will not be supported by VMware.

http://www.vmware.com/go/hcl

Whitebox HCL


The Whitebox Hardware Compatibility List is a list put together by community members who have had success with whitebox servers (e.g. unbranded, homebrew or DIY servers) that have been found to work with VMware products.

http://www.vm-help.com//esx40i/esx40_whitebox_HCL.php

VMware Communities

This list is maintained by members of the VMware community forum who have had success in building whitebox servers.

http://communities.vmware.com/cshwsw.jspa

If this server is not on the HCL, you run the risk of it not being compatible with ESXi, or of a false alert.

I would check the hardware and memory using

Memtest86+
http://www.memtest.org/

1. Re-seat the memory
2. Obtain new Memory.
3. Replace the motherboard and any backplanes.
4. Escalate this to Intel Support, and see if they have seen this issue with ESXi.

Using un-qualified, un-certified hardware with VMware vSphere Hypervisor/ESXi is a risk.

It's unlikely VMware Support will entertain any requests involving unsupported hardware, and your best course of action is to discuss it with Intel support.
SOLUTION
Zephyr ICT
This solution is only available to members.
Sorry for this answer, but I would recommend the same.

1.) Keep in mind that board layouts vary, but the general rule is that every processor has its own RAM bank. So your RAM modules may have to go into specific slots and be distributed equally between both processors. You can probably find instructions in the board manual or on Intel's site. Intel even offers a RAM configuration tool for some boards that shows which slots to populate.

2.) You should use the same modules in all slots, or at least the same technology (buffered or unbuffered, single, double or quad modules, voltage, speed, etc.). The recommendation is to use the same type of RAM throughout, and ideally modules that are sold together as a set of 2, 3 or 4.
Mixing different modules can cause problems.

3.) You may try a smaller configuration first, i.e. one module per processor in the same slot position, then two modules per processor, and so on.

4.) Updating the BIOS can help. Older BIOS versions sometimes have problems with newer RAM modules or layouts.

5.) The USB freeze may or may not be related to the RAM, so you could try burning the update onto a bootable CD instead. A BIOS update may also solve the USB issue, or you could try an older stick.
You show the vCenter client version but no ESXi version.
Does not matter

Is your system configured according to Intel document D31979-011?

i.e., are both processors exactly the same (both from either the old batch or the new batch), and is at least one memory module installed in each of the four channels?
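
For anyone who wants to sanity-check a population plan before opening the chassis, here is a minimal Python sketch of the two memory rules discussed in this thread: at least one module in every channel, and identical modules throughout. The slot names (A1, A2, ... D2) and the `check_population` helper are hypothetical, made up for illustration. Take the real slot labels and rules from the board documentation.

```python
def check_population(slots, channels=("A", "B", "C", "D")):
    """Return a list of problems found in a DIMM population plan.

    slots maps a slot label (assumed to start with its channel letter,
    e.g. "A1") to a module description, or to None for an empty slot.
    """
    problems = []
    populated = {name: part for name, part in slots.items() if part}

    # Rule 1: every channel needs at least one module.
    for channel in channels:
        if not any(name.startswith(channel) for name in populated):
            problems.append(f"channel {channel} has no module")

    # Rule 2: all installed modules should be the same type.
    kinds = sorted(set(populated.values()))
    if len(kinds) > 1:
        problems.append("mixed module types: " + ", ".join(kinds))

    return problems
```

An empty list means the plan passes both checks. It says nothing about whether the modules themselves are healthy.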
compdigit44

It is possible that one of the memory modules added to the host was faulty. I have run into issues in the past where a memory module may pass a memory test but still have problems.

On the host, remove all of the added memory, then add the modules back one at a time (or in pairs) until the error appears again.

Besides running a memory diagnostic, you may also want to run a full hardware diagnostic on the server to make sure the memory slots the new modules are going into are not faulty.
RFVDB

ASKER

There are two Intel Xeon E5320 1.86GHz CPUs. All 8 RAM slots hold identical 4GB RDIMMs. The motherboard and RAID card are on the VMware compatibility list.

So I guess as I had conjectured, my next steps are firmware updates and mem test.

I was hoping someone would know if there was a more detailed log that VMware had that might tell me more about the memory error rather than just "alert!".
SOLUTION
This solution is only available to members.
If you connect to the ESXi server directly, not through vCenter Server, do you get the same error?

I'm afraid that components on the HCL do not add up to make a certified server!

With uncertified hardware it will be difficult to fault-find. I would suggest escalating to Intel Support for further information if the memory test does not reveal any further fault.

Try a new motherboard, processors, and memory.

Without additional CIM providers for the server, most faults are phantom alerts. These often arise from miscommunication between the motherboard and ESXi.

Memory, fan controller, temperature and storage controller faults often appear because of incorrect drivers and firmware. It's not uncommon.
It appears this motherboard supports out-of-band management. Is it configured on the server?

http://www.synnex.com/intel/servers/platformrecipes.html


Have you checked the logs on the host?
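
If you can copy a log off the host (on ESXi 5.x the usual places to look are /var/log/vmkernel.log and /var/log/hostd.log), a rough Python filter like the one below can narrow it down to memory-related lines. The keyword list is only a guess at what might be worth looking at, not an official VMware message format, and `memory_related_lines` is just an illustrative helper name.

```python
import re

# Assumed keywords; adjust them to whatever your host actually logs.
MEMORY_PATTERN = re.compile(
    r"\b(ecc|dimm|mce|machine check|memory error)\b", re.IGNORECASE
)


def memory_related_lines(log_text):
    """Return the lines of log_text that mention a memory-related keyword."""
    return [line for line in log_text.splitlines() if MEMORY_PATTERN.search(line)]
```

Read the copied log into a string and pass it to `memory_related_lines`. An empty result only means none of the assumed keywords appeared, not that the hardware is healthy.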
Andrew - you are wrong. Intel and Tyan motherboards are fully supported on the HCL as long as the assembly follows the manufacturer's guides and does not need any extra drivers. (I do have some doubt that the asker has identical CPUs, though, and that would take him off the HCL.)

My suggestion would be to clear the management log (where you see sensors in the dropdown)
and certify the new RAM using Memtest86+ for 72 hours, as VMware recommends.
Well, if his server rig is fully HCL certified, I suggest a support call to Intel and VMware, and I know what VMware Support will say: contact Intel!
The only pain is that he added a different stepping of the same CPU as the second CPU...
It is fairly easy to verify by booting a Linux live CD and reading /proc/cpuinfo.
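
A minimal sketch of that check in Python, assuming the live CD has a Python 3 interpreter and an x86 /proc/cpuinfo with the usual "physical id", "cpu family", "model" and "stepping" fields. The function names are made up for illustration.

```python
from collections import defaultdict


def steppings_by_socket(cpuinfo_text):
    """Map each socket ("physical id") to the set of
    (cpu family, model, stepping) triples reported by its cores."""
    sockets = defaultdict(set)
    # /proc/cpuinfo separates logical processors with blank lines.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if "stepping" in fields:
            socket = fields.get("physical id", "unknown")
            sockets[socket].add(
                (fields.get("cpu family", "?"), fields.get("model", "?"), fields["stepping"])
            )
    return dict(sockets)


def cpus_match(cpuinfo_text):
    """True when every core on every socket reports the same triple."""
    seen = set()
    for triples in steppings_by_socket(cpuinfo_text).values():
        seen |= triples
    return len(seen) <= 1
```

On the booted system, read /proc/cpuinfo into a string and pass it to `steppings_by_socket` to see what each socket reports. If `cpus_match` returns False, the two CPUs are not the same stepping.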
RFVDB

ASKER

So I finally got back to the client, used the Intel Deployment ISO, and was able to upgrade the BIOS, BMC and SRC. After the reboot, vCenter still showed the memory error. I clicked on "update" and also "reset sensors" and it still showed the error.

I then ran the Intel Platform Confidence Test (PCT) Utility for DOS which is an ISO that you boot from. I ran the default extensive test which took about 30 mins. I checked all components of the server and did a test of the RAM and everything passed.

Since everything passed, I didn't bother checking the hardware status in vCenter for at least an hour afterwards. But when I did, the memory error was no longer there! Weird! I wonder if some motherboard/BIOS/BMC log was reset or something. The motherboard has the IPMI port but not the remote access module that allows remote access.

Anyhow, during this process, I removed the heatsinks of the CPUs and wrote down all of the information on the CPUs. I don't remember which is the original CPU.

FIRST CPU
(what is written on the top of the die)
1.86GHZ/8M/1066
INTEL XEON
SLAC8 COSTA RICA
I,M,C ’05 E5320
3715A907
(# - below is written on the side of the CPU)
3570603
1A0172

SECOND CPU
(what is written on the top of the die)
1.86GHZ/8M/1066
INTEL XEON
SL9MV COSTA RICA
I,M,C ’05 E5320
3643A789
(# - below is written on the side of the CPU)
2L63846
2A0344

This is the link to the CPU that we purchased: http://www.amazon.com/Intel-1066MHz-LGA771-Quad-Core-Processor/dp/B000K1MW82

In a chat session with Intel, I learned that the "Active" CPUs are meant for heatsinks with fans and the "passive" ones are for plain heatsinks. The above link is for an "active" one, yet the heatsinks we have in our Intel chassis are fanless. I'm thinking the SL8MV one above is the newer one.

Another link re the CPUs from the Intel rep.
http://ark.intel.com/products/28031/Intel-Xeon-Processor-E5320-8M-Cache-1_86-GHz-1066-MHz-FSB#@ordering

Well, I usually deal with Dell and HP servers, never Intel servers, and have never had to add a 2nd CPU, so this is new info to me. I didn't know there were "versions within versions".

Would I be safe to get another "active" one since the CPU wouldn't know the difference or should I go for getting the same one as the original?

Also, on the above Intel link, the ordering code is the same for all of the "active" ones and for the whole group of "passive" ones. How would you find the correct CPU online if you were trying to pick the specific 1 out of the 3 passives? Definitely not clear enough from Intel.
ASKER CERTIFIED SOLUTION
This solution is only available to members.
So maybe the firmware upgrade and sensor reset fixed the issue.
RFVDB

ASKER

Yeah, possibly.

However, I'm still going to replace the CPUs. Now I'm all freaked out about what stepping and all to get... Since the earlier LGA771 CPUs are quite cheap, I might as well buy a couple of better ones.

Looking at getting two BX80574E5420A. Even though the "A" on the end means active (for a CPU with a fan) and mine is a passive case with just the heatsink, it shouldn't matter, right?

http://www.memory4less.com/m4l_itemdetail.aspx?itemid=1438664461&partno=BX80574E5420A&rid=89&gclid=CIWtvOyx_L0CFWZo7Aod0gYAww
If you have well-designed airflow, and the server is in an air-conditioned server room with no hot spots, passive is fine.

All Dell, HP and IBM servers use passive heat sinks, but with very high air flow.

E.g. if you put a piece of paper at the front of your server, does it stick? That indicates high air flow.

If it does not, go with fans on the CPUs!
RFVDB

ASKER

OK thanks, good to know.

But just when it comes to CPU functionality/compatibility with the motherboard/case/heatsink, etc.: whether I get two of the BX80574E5420A or the BX80574E5420B (A or B at the end, for passive or active), it shouldn't be a problem, right?
You will need to refer to the Intel compatibility guides for the motherboard.
Please refer to Intel's configuration guide for your motherboard here:
http:#a39993095 where it says in plain English that both CPUs must be the same stepping. What you are doing now is shooting yourself in the foot by keeping an incompatible CPU configuration for more than a week.