VMware ESXi 5.5 cluster hosts keep crashing

We have a two-host ESXi 5.5 cluster with 2 IBM x3530 M4s (2x Xeon E5-2450 and 96 GB memory) connected to an IBM v3700 DAS. Both hosts are running ESXi 5.5.0 build 2143827. They are in a failover cluster, with one server running all 13 VMs and the other on standby. They run fine for about a month, then start failing with the errors "Memory, Group 4 CPUs: Bus Uncorrectable error, Group 1 One of the DIMMS 0: Uncorrectable ECC". IBM has been out to replace the system board, both CPUs, all the memory DIMMs, and the backplane. Both servers also run ESXi from an IBM commercial-grade USB drive connected to the hypervisor port on the system board, which has also been replaced. Since the same error messages occur on both servers, I am starting to think it is a VMware issue. Has anyone run into this kind of issue, and if so, what was your fix?
jvillareal78 Asked:
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, Commented:
Are your servers on the HCL?

I assume all the servers are on the latest firmware?

This looks like a hardware issue, and we've seen this before with Dell, IBM and HP.
jvillareal78 (Author) Commented:
This setup was built by the IBM people at CDW. This is what they recommended for our VMware environment. When IBM came out multiple times, they did upgrade the firmware on both machines.
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, Commented:
I would bounce it back to IBM, with STILL FAULTY!
jvillareal78 (Author) Commented:
I just cannot see the exact same issues on both servers being a hardware issue. The only thing they have in common is VMware ESXi 5.5 on both.
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, Commented:
Do the hosts crash with a PSOD?

Has the memory been tested using memtest86+?

Both are on the HCL for 5.5 U2.

I would escalate to VMware and IBM! VMware are likely to throw it back to IBM!

Have you tried 5.1?
andyalder Commented:
Reboot and press F2 for diagnostics, then look at the BMC log. You could also slow the RAM down in the BIOS, which is a bit of a bodge but may work if it's a timing issue.
jvillareal78 (Author) Commented:
Went into the BIOS and did not see any timing settings that I could change. I did change to non-NUMA, and so far it hasn't gone down.
Andrew Hancock (VMware vExpert / EE MVE^2), VMware and Virtualization Consultant, Commented:
Are you using the correct memory for NUMA and two processors, and are your CPUs balanced correctly with the correct memory?
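
One way to read "balanced" here is that each CPU socket has the same amount of memory installed, so that neither NUMA node is short of local RAM. A minimal Python sketch of that check follows, assuming a hypothetical inventory that maps each socket to the sizes (in GB) of its DIMMs. The layout shown is invented for illustration and is not taken from the servers in this thread.

```python
def memory_per_socket(dimms_by_socket):
    """Return total installed memory (GB) for each CPU socket."""
    return {socket: sum(sizes) for socket, sizes in dimms_by_socket.items()}


def is_balanced(dimms_by_socket):
    """True when every socket has the same total memory installed."""
    totals = set(memory_per_socket(dimms_by_socket).values())
    return len(totals) <= 1


# Hypothetical layout (an assumption): 96 GB split evenly, 6 x 8 GB per socket.
example = {"CPU1": [8] * 6, "CPU2": [8] * 6}
print(memory_per_socket(example), is_balanced(example))
```

This only compares totals per socket. It says nothing about whether the DIMM type or slot population order matches what the server vendor requires, which would need checking against the vendor's documentation.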
jvillareal78 (Author) Commented:
With NUMA turned off, we have not had an issue with either server. The issue seems to have been a BIOS setting.
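
To confirm what the hypervisor sees after a BIOS change like this, the ESXi shell command `esxcli hardware memory get` reports a NUMA node count. Below is a minimal Python sketch that parses that kind of output. The sample text is an assumed illustration of the format, not output captured from these hosts, and `parse_memory_info` and `numa_node_count` are hypothetical helper names.

```python
def parse_memory_info(text):
    """Parse 'Key: Value' lines into a dict of stripped strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def numa_node_count(text):
    """Return the NUMA node count as an int, or None if not reported."""
    value = parse_memory_info(text).get("NUMA Node Count")
    return int(value) if value is not None else None


# Assumed sample output for a 96 GB, two-socket host (illustrative only).
sample = """\
   Physical Memory: 103079215104 Bytes
   Reliable Memory: 0 Bytes
   NUMA Node Count: 2
"""
print(numa_node_count(sample))
```

On a two-socket host, a count of 2 would suggest NUMA is still exposed to ESXi, while a count of 1 would be consistent with NUMA having been turned off in the BIOS.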