Solved

vCenter Memory error

Posted on 2014-04-10
Last Modified: 2014-07-07
Hi,

I have a client with a couple of hosts on which I recently upgraded some hardware. I added a 2nd CPU and maxed out the RAM on both hosts, from about 8 GB to 32 GB.

We're using vCenter 5.1 and ESXi 5.1 Build 941893.

One of the hosts came up with a memory alert a few days ago. It said to go to the hardware tab. The hardware tab shows the eight 4 GB sticks, and above them the "Memory" header just says "Alert", with nothing in the details tab. Rebooting the host and watching for memory alerts or errors during boot shows nothing. I don't know where to go to get details on this memory error (see attached image).

The motherboard is an older Intel Motherboard: S5000PAL. It is one revision behind on the BIOS so I tried updating the BIOS using a USB DOS Bootable stick. Every time it tries to boot to the USB Drive it freezes (I tried front and back panel USB ports). I tried the USB drive on another system and it boots into it just fine.

I didn't have time to use the Intel Deployment ISO to update the BIOS but will do that on the next visit. I will also do an extensive memory test when I get there.

I chatted with an Intel rep and they didn't know where to go other than update BIOS and do a memory test.

Anyone know where to get more details on this from VMware?

I also removed all of the RAM and re-seated it but it still has the alert when booting back up.

I need to add more servers to this host but don't want to do that until I've resolved this issue.
memory-error.PNG
Question by:RFVDB
20 Comments
 
LVL 117

Expert Comment

by:Andrew Hancock (VMware vExpert / EE MVE)
Is this server on the HCL?

Check the VMware Hardware Compatibility List (HCL) here

The VMware Hardware Compatibility List details the vendor devices that have either been physically tested by VMware or VMware partners, or are similar to devices that were. Items on the list are tested with VMware products and are known to operate correctly. Devices that are not on the list may function, but will not be supported by VMware.

http://www.vmware.com/go/hcl

Whitebox HCL


The Whitebox Hardware Compatibility List is put together by community members who have had success with whitebox servers (unbranded, homebrew, or DIY servers) that have been found to work with VMware products.

http://www.vm-help.com//esx40i/esx40_whitebox_HCL.php

VMware Communities

This list is maintained and put together by members of the VMware community forum, that have had success in building whitebox servers.

http://communities.vmware.com/cshwsw.jspa

If this server is not on the HCL, you run the risk of it not being compatible with ESXi, or of getting a false alert.

I would check the hardware and memory using

Memtest86+
http://www.memtest.org/

1. Re-seat the memory.
2. Obtain new memory.
3. Replace the motherboard and any backplanes.
4. Escalate this to Intel support, and see if they have seen this issue with ESXi.

Using unqualified, uncertified hardware with VMware vSphere Hypervisor/ESXi is a risk.

It's unlikely VMware Support will entertain any requests involving unsupported hardware, so your best course of action is to discuss this with Intel support.
 
LVL 25

Assisted Solution

by:Zephyr ICT
Zephyr ICT earned 167 total points
Besides the BIOS and Memory tests you could check for firmware upgrades for the board, if you didn't do that already ...

But yeah, best bet, check memory, upgrade BIOS, upgrade firmware if possible ... Not much else to do I'm afraid.
 
LVL 35

Expert Comment

by:Bembi
Sorry for this answer, but I would recommend the same.

1.) Keep in mind that the layout of different boards can vary, but the general rule is that every processor has its own RAM bank. So your RAM modules may have to go into specific slots and be distributed equally across both processors. You can probably find instructions in the board manual or at Intel. Intel also offers a RAM configuration tool for some boards that shows which slots to populate.

2.) You should use the same modules in all slots, or at least the same technology (buffered or non-buffered; single, double or quad modules; voltage; speed; etc.). The recommendation is to use the same type of RAM throughout, ideally modules sold together as a set of 2, 3 or 4.
Mixing different modules can cause problems.

3.) You may try a smaller configuration first, i.e. 1 module per processor in the same slot, then two modules per processor, and so on.

4.) Updating the BIOS can help. Older BIOS versions sometimes have problems with newer RAM modules or layouts.

5.) The USB freeze may or may not be related to the RAM, so you may try burning the update onto a bootable CD instead. A BIOS update can help to solve the USB issue here too, or try an older stick.
 
LVL 61

Expert Comment

by:gheist
You show vcenter client version and no ESXi version
Does not matter

Is your system configured according to Intel document D31979-011?

I.e. both processors are exactly the same (both from either the old batch or the new batch) and at least 1 memory module is installed in each of the four channels?
 
LVL 19

Expert Comment

by:compdigit44
It is possible that one of the memory modules added to the host was faulty. I have run into issues in the past where a memory module may pass a memory test but still have problems.

On the host, remove all of the added memory, then add the modules back one module / pair at a time until the error appears again.

Also, besides running a memory diagnostic, you may want to run a full hardware diagnostic on the server to make sure the memory slots the new modules are going into are not having problems.
 

Author Comment

by:RFVDB
There are two Intel Xeon E5320 1.86 GHz CPUs. All 8 RAM slots have the exact same 4 GB RDIMM modules. The motherboard and RAID card are on the VMware compatibility list.

So I guess as I had conjectured, my next steps are firmware updates and mem test.

I was hoping someone would know if there was a more detailed log that VMware had that might tell me more about the memory error rather than just "alert!".
 
LVL 61

Assisted Solution

by:gheist
gheist earned 333 total points
The exactly same processor was sold in 3 different revisions...
Are you still so sure processors are exactly same?
If you cannot update the firmware, I'd guess the system wants to reject either one CPU or one misplaced RAM stick.
If your motherboard has IPMI support, VMware can read more than "Alert" if it has the password for that.
 
LVL 117

Expert Comment

by:Andrew Hancock (VMware vExpert / EE MVE)
If you connect to the ESXi server directly, not through vCenter Server, do you get the same error?

I'm afraid that components on the HCL, do not add up to make a certified server!

With uncertified hardware it will be difficult to fault-find. If the memory test does not reveal any further fault, I would suggest escalating to Intel support for further information.

Try a new motherboard, processors, and memory.

Most faults reported without additional CIM providers for the server are phantom alerts. These are often caused by miscommunication between the motherboard and ESXi.

Memory, fan controller, temperature, and storage controller faults often appear because of incorrect drivers and firmware; it's not uncommon.
 
LVL 19

Expert Comment

by:compdigit44
It appears this motherboard supports out-of-band management. Is it configured on the server?

http://www.synnex.com/intel/servers/platformrecipes.html


Have you checked the logs on the host?
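
As a rough sketch of what "checking the logs" could look like: copy a log file off the host and filter it for memory-related lines. The keyword list below is a guess for illustration, not an official list of VMware or Intel messages, and the file name in the usage comment (a copy of the host's /var/log/vmkernel.log) is only an example of where to start looking.

```python
import re

# Hypothetical keyword list; adjust it to whatever your hardware actually logs.
KEYWORDS = ("memory", "dimm", "ecc", "machine check", "mce")

# Match a keyword at the start of a word, case-insensitively.
# This is deliberately loose, so expect some noise in the results.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")",
    re.IGNORECASE,
)


def memory_related_lines(lines):
    """Return (line_number, text) pairs for lines that mention any keyword."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


# Example usage against a local copy of a host log (file name is an assumption):
#
#     with open("vmkernel.log", errors="replace") as handle:
#         for number, text in memory_related_lines(handle):
#             print(number, text)
```

If nothing relevant shows up, that would be consistent with the alert coming from the board's own sensor/event log rather than from ESXi itself.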
 
LVL 61

Expert Comment

by:gheist
Andrew - you are wrong. Intel and Tyan motherboards are fully supported on the HCL as long as your assembly follows the manufacturer's guides and does not need any drivers. (Though I have some doubt that the asker has identical CPUs, and that would take him off the HCL.)

My suggestion would be to clear the management log (where you see the sensors in the dropdown)
and certify the new RAM using memtest86+ for 72 hours like VMware recommends.
 
LVL 117

Expert Comment

by:Andrew Hancock (VMware vExpert / EE MVE)
Well, if his server rig is fully HCL certified, I suggest a support call to Intel and VMware. And I know what VMware Support will say: contact Intel!
 
LVL 61

Expert Comment

by:gheist
The only pain is that he added a different stepping of the same CPU as the second CPU...
It is fairly easy to verify by booting some Linux live CD and reading /proc/cpuinfo
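
A minimal sketch of that check, assuming an x86 Linux live CD where /proc/cpuinfo has the usual "model name", "stepping" and "physical id" fields. The function names are made up for this example. It groups the logical processors by socket and reports whether every socket shows the same model and stepping.

```python
from collections import defaultdict


def parse_cpuinfo(text):
    """Split /proc/cpuinfo text into one dict per logical processor."""
    cpus = []
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if fields:
            cpus.append(fields)
    return cpus


def steppings_by_socket(cpus):
    """Map each socket ("physical id") to the set of (model name, stepping)."""
    sockets = defaultdict(set)
    for cpu in cpus:
        socket_id = cpu.get("physical id", "?")
        sockets[socket_id].add(
            (cpu.get("model name", "?"), cpu.get("stepping", "?"))
        )
    return dict(sockets)


def sockets_match(sockets):
    """True when all sockets report one and the same (model name, stepping)."""
    distinct = set()
    for values in sockets.values():
        distinct |= values
    return len(distinct) <= 1


# Example usage on the live CD:
#
#     with open("/proc/cpuinfo") as handle:
#         sockets = steppings_by_socket(parse_cpuinfo(handle.read()))
#     for socket_id, values in sorted(sockets.items()):
#         print("socket", socket_id, sorted(values))
#     print("match" if sockets_match(sockets) else "MISMATCH")
```

A mismatch here only says the two sockets report different stepping numbers; whether that combination is allowed is still a question for Intel's documentation for the board.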
 

Author Comment

by:RFVDB
So I finally got back to the client, used the Intel Deployment ISO, and was able to upgrade the BIOS, BMC and SRC. After the reboot, vCenter still showed the memory error. I clicked on "update" and also "reset sensors" and it still showed the error.

I then ran the Intel Platform Confidence Test (PCT) Utility for DOS which is an ISO that you boot from. I ran the default extensive test which took about 30 mins. I checked all components of the server and did a test of the RAM and everything passed.

Since everything passed, I didn't bother checking the hardware status in vCenter for at least an hour afterwards. But when I did, the memory error was no longer there! Weird! I wonder if some motherboard/BIOS/BMC log was reset or something. The motherboard has the IPMI port but not the remote access module that allows remote access.

Anyhow, during this process, I removed the heatsinks of the CPUs and wrote down all of the information on the CPUs. I don't remember which is the original CPU.

FIRST CPU
(what is written on the top of the die)
1.86GHZ/8M/1066
INTEL XEON
SLAC8 COSTA RICA
I,M,C ’05 E5320
3715A907
(# - below is written on the side of the CPU)
3570603
1A0172

SECOND CPU
(what is written on the top of the die)
1.86GHZ/8M/1066
INTEL XEON
SL9MV COSTA RICA
I,M,C ’05 E5320
3643A789
(# - below is written on the side of the CPU)
2L63846
2A0344

This is the link to the CPU that we purchased: http://www.amazon.com/Intel-1066MHz-LGA771-Quad-Core-Processor/dp/B000K1MW82

In a chat session with Intel I learned that the "active" CPUs are meant for heatsinks with fans and the "passive" ones are for plain heatsinks. The above link is for an "active" one, yet the heatsinks we have in our Intel chassis are fanless. I'm thinking the SL9MV one above is the newer one.

Another link re the CPUs from the Intel rep.
http://ark.intel.com/products/28031/Intel-Xeon-Processor-E5320-8M-Cache-1_86-GHz-1066-MHz-FSB#@ordering

Well, I usually deal with Dell and HP servers, never Intel servers, and have never had to add a 2nd CPU, so this is new info to me. I didn't know there were "versions within versions".

Would I be safe to get another "active" one since the CPU wouldn't know the difference or should I go for getting the same one as the original?

Also, on the above Intel link, the ordering code is the same for all of the "active" ones, and likewise for the group of "passive" ones. How would you find the correct CPU online if you were trying to pick a specific 1 out of the 3 passives? Definitely not clear enough from Intel.
 
LVL 61

Accepted Solution

by:gheist
gheist earned 333 total points
You need to put CPUs of the same batch in the same server. That will fix the errors you are experiencing and bring you back onto the VMware HCL.
 
LVL 117

Expert Comment

by:Andrew Hancock (VMware vExpert / EE MVE)
So maybe the firmware upgrade and sensor reset have fixed the issue.
 

Author Comment

by:RFVDB
Yeah, possibly.

However, I'm still going to replace the CPUs. Now I'm all freaked out about what stepping and all to get... Since the earlier LGA771 CPUs are quite cheap, I might as well buy a couple of better ones.

I'm looking at getting two BX80574E5420A. Even though the "A" on the end means active (for a CPU with a fan) and mine is a passive case with just the heatsink, it shouldn't matter, right?

http://www.memory4less.com/m4l_itemdetail.aspx?itemid=1438664461&partno=BX80574E5420A&rid=89&gclid=CIWtvOyx_L0CFWZo7Aod0gYAww
 
LVL 117

Expert Comment

by:Andrew Hancock (VMware vExpert / EE MVE)
If you have well-designed airflow, and the server is in an air-conditioned server room with no hot spots, passive is fine.

All Dell, HP, IBM servers use passive heat sinks, but with very high air flow.

E.g. if you put a piece of paper at the front of your server, does it stick? That indicates high air flow.

If it does not, go with fans on the CPUs!
 

Author Comment

by:RFVDB
OK thanks, good to know.

But just when it comes to CPU functionality/compatibility with the motherboard/case/heatsink, etc. Whether I get two of the BX80574E5420A or BX80574E5420B (A or B at the end) for passive or active, shouldn't be a problem right?
 
LVL 117

Expert Comment

by:Andrew Hancock (VMware vExpert / EE MVE)
You will need to refer to the Intel compatibility guides for the motherboard.
 
LVL 61

Expert Comment

by:gheist
Please refer to Intel's configuration guide for your motherboard here:
http:#a39993095 where it says in plain English that both CPUs must be the same stepping. What you are doing now is shooting yourself in the foot by keeping an incompatible CPU configuration for more than a week.
