wfcrr

How to test RAM on Dell T430

I bought a used Dell T430 on eBay to practice bare-metal restores and to learn. It has 96 GB of RAM, but occasionally it shows errors at boot:


UEFI0058 Uncorrectable Memory Error

 UEFI0107 memory error slot A2 


I tried reseating the DIMMs in the A block and I also tried swapping slots.  


I have rebooted about 20 times now, and maybe 1 out of 4 times it errors with the UEFI0058 and UEFI0107 errors and then shows only 80 GB of RAM. The rest of the time it boots without error and shows 96 GB of RAM.

kenfcamp

How many CPUs are in the server?

What size memory is in the server?
Are the DIMMs all the same size?
Member_2_231077

Does it stick to A2 slot or does it follow that DIMM about when you move it to another slot?
Trade the stick with another one and see if the error follows it. (@andyalder beat me to it!) :)
It seems each slot holds a 16 GB DIMM and the one in A2 is not working. Did you swap it with another DIMM to check whether the issue is with the stick or the slot?

all the best
wfcrr

ASKER

This server has 2 CPUs.
I moved the stick by swapping sticks from A2 to A1 and had the same problem. Then I swapped it back to A2, same problem. Then I tried swapping it into A3, and since doing that there has been no error.
I have restarted the machine probably 6 times now without an error, and every time it boots it shows 96 GB of RAM.
I also have run Windows Memory Diagnostics and it says there is no error.
Are there any other tests I can run, or what is the best way to see if this is going to fail again?  Is rebooting repeatedly a good way to make it error, or is there a specific RAM test that will make it error?  Or is this just a weird thing that happened in shipping and now it's fixed?
>This server has 2 CPUs

What size memory does the server have in it?
Are the DIMMs all the same size and type?

Which banks have memory in them?
Are all of the other sticks in B3/C3/etc. or B1/C1/etc.?

The third slot is usually used when the DIMMs are single- or dual-rank (DIMMs can have 1, 2, or 4 ranks), because a channel with three slots can hold a total of 8 ranks. So 3x 1-rank is okay and 3x 2-rank is okay, but one cannot install 3x 4-rank since that would be 12 ranks total.
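
The rank arithmetic above can be sketched in a few lines of Python. The limit of 8 ranks per channel is the figure quoted in this post and has not been checked against Dell's T430 documentation; the function name `ranks_fit` is made up for illustration.

```python
# Limit quoted in the post above; treat it as an assumption for this server.
MAX_RANKS_PER_CHANNEL = 8

def ranks_fit(ranks_per_dimm, dimms_in_channel, max_ranks=MAX_RANKS_PER_CHANNEL):
    """Return (total_ranks, fits) for identical DIMMs sharing one memory channel."""
    total = ranks_per_dimm * dimms_in_channel
    return total, total <= max_ranks

# Three DIMMs per channel, as in the example in the post.
for ranks in (1, 2, 4):
    total, ok = ranks_fit(ranks, 3)
    print(f"3x {ranks}-rank DIMMs -> {total} ranks: {'okay' if ok else 'too many'}")
```

Only the 4-rank case exceeds the limit, which matches the 12-rank example in the post.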

Does a stick from B1 installed into A1 or A2 cause the error to come back? If yes, then it's either the board or the CPU that has a bad pin-out or chipset.
>I moved the stick by swapping sticks from A2 to A1 and had the same problem.

That's not very informative since "same problem" could mean it always complains about the A2 slot or it always complains about the DIMM. For all we know you've shoved an HPE 3-rank DIMM in it.
wfcrr

ASKER

I included a couple of pics. There is 96 GB total: six sticks of 16 GB each, in A1, A2, A3, B1, B2, and B3.
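
With six equal 16 GB sticks, the 80 GB seen on the bad boots is what you would expect if exactly one stick were left out of the total. A minimal Python sketch of that arithmetic follows; it assumes the firmware simply excludes a DIMM that fails at boot, and the function name `missing_dimms` is hypothetical.

```python
DIMM_SIZE_GB = 16     # from the post: six sticks of 16 GB each
INSTALLED_DIMMS = 6

def missing_dimms(reported_gb, dimm_size_gb=DIMM_SIZE_GB, installed=INSTALLED_DIMMS):
    """Number of equal-sized DIMMs unaccounted for in the reported total."""
    expected_gb = dimm_size_gb * installed
    shortfall_gb = expected_gb - reported_gb
    if shortfall_gb < 0 or shortfall_gb % dimm_size_gb:
        raise ValueError("reported total is not a whole number of DIMMs short")
    return shortfall_gb // dimm_size_gb

print(missing_dimms(96))  # clean boot
print(missing_dimms(80))  # boot after the memory error
```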

@Philip Elder: I tried swapping the B1 stick into A2 and booted and got no errors.  Prior to boot it said something about a part being replaced in DIMM socket B1 and in DIMM socket A2, then it booted.  It shows 96 GB of RAM.  Does this mean that problem was just from being shipped and the RAM that was originally in A2 was just unseated in shipping, or, is there a way to test it?  I have run Windows Memory Diagnostics twice now with no errors.  Does that mean it's all good, or is there a more robust test to run to test the RAM?

One more question. This server is just for me to practice on and to be here in case our live server goes down and I need to restore onto this one. I was thinking I would leave it unplugged in storage, but I wonder about the battery going bad in the PERC H730. Is it OK to leave it unplugged, or will things go bad without power? I would leave it plugged in and turned off, but the fan comes on for about 2 seconds every 10 minutes or so. I think it's the power supply fan; it just randomly comes on for 2 seconds and then turns off. This happens while the machine is plugged in but shut down. Is there a way to keep it plugged in but stop the fan from coming on every 10 minutes?
User generated image
User generated image

Plugged in but turned off still uses power, iDRAC, WoL etc are active so the PSU would overheat providing standby power if it didn't turn the fan on occasionally.  So long as you shut it down properly the OS will tell the PERC to flush its cache before it powers off so you can unplug it without data loss.
>Does this mean that problem was just from being shipped and the RAM that was originally in A2 was just unseated in shipping

Very possible, I've seen unseated memory more than once.
Reseating memory is generally one of the first things I do when a PC or Server is delivered

>is there a way to test it?  I have run Windows Memory Diagnostics twice now with no errors.  Does that mean it's all good, or is there a more robust test to run to test the RAM?

You can run hardware diagnostics through the Lifecycle Controller
https://www.dell.com/support/kbdoc/en-us/000132726/how-to-run-hardware-diagnostics-on-your-poweredge-server

Select #2 - PowerEdge Servers 12G and later for more information and a video tutorial
wfcrr

ASKER

It is doing it again. How do I figure out which stick or which slot is bad? It booted and shows only 80 GB of RAM. I restarted, went into Lifecycle Controller, and ran Hardware Diagnostics. It stops during the memory testing and says there is a problem, but it does not tell me which stick or which slot is failing.
Is iDRAC active?

Memory errors should be displayed in its logs.
Run "omreport chassis memory" from command line under Windows. You will have to install OMSA managed node to have omreport available but OMSA should be installed anyway.
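
If you want to capture that report so it can be pasted into the thread, a small Python wrapper is one option. This is only a sketch: it assumes Python is available on the server and that `omreport` is on the PATH once OMSA is installed. The function name `run_omreport_memory` is made up, and the output is returned as raw text without any parsing.

```python
import shutil
import subprocess

def run_omreport_memory(executable="omreport"):
    """Run `<executable> chassis memory` and return its raw text output.

    Returns None when the executable cannot be found on PATH, which may
    mean OMSA is not installed or its directory is not on PATH.
    """
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "chassis", "memory"],
        capture_output=True,
        text=True,
        check=False,  # keep whatever was printed even on a non-zero exit
    )
    return result.stdout + result.stderr

if __name__ == "__main__":
    report = run_omreport_memory()
    if report is None:
        print("omreport not found on PATH; is the OMSA managed node installed?")
    else:
        print(report)
```

This works offline, so the server does not need to be on the network.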
wfcrr

ASKER

I don't have it on the network or internet.  Since it is a restore of our live server I can't really connect it to network/internet.

How do I get it to show me which RAM slot has the error?  Yesterday it gave UEFI errors and said A2 had the DIMM error.  Today it is not showing any of those errors, just that it has 80GB ram.
OMSA will tell you.
Leave it to Dell to butcher a spec in order to cut costs in my not so humble opinion.
User generated image
The white tabbed slots are the primary slots while the black tabbed slots are secondary.

The 16GB sticks should only be populating the white tabbed slots to keep things balanced.

  • A1-A4 and B1-B4


That's why I want them to post result of "omreport chassis memory" , can check what is in what slot and see errors but I think they're ignoring me.
wfcrr

ASKER

Sorry to be so dense, I am learning. I can't seem to run OMSA.  I find an installer and when I try to run it, it says there is a newer version already installed, but I can't seem to find it. I find an "OpenManage" folder on C drive and it has folders and files in it, but only installers.  

I ran Lifecycle Controller and then ran the tests, and it errored on memory. Then I finally realized there are reports and events in Lifecycle Controller, and once I reviewed those I found the error reports. The DIMM stick seems to be the problem: it produced the exact same error in DIMM slot A2 and in DIMM slot B2. I am saying it has to be the stick because I moved it from A2 to B2 and the error followed. The error today is: Critical. Mem ECC Warning: Memory sensor, transition to critical from less severe B2 was asserted. That was a few minutes ago. Earlier I had that stick in A2, and when I ran the test in Lifecycle Controller it gave the exact same error.

That means it is just a bad stick, right?
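
The reasoning behind the swap test can be written down as a short Python sketch. The stick label `stick-X` and the function name `locate_fault` are hypothetical; the two observations are the A2 and B2 errors described above. Clean runs are ignored here, since an intermittent fault makes a clean boot weak evidence.

```python
from collections import defaultdict

def locate_fault(observations):
    """Infer whether errors track a DIMM or a slot.

    observations: iterable of (dimm_label, slot, had_error) tuples,
    one per boot or diagnostic run.
    Returns "dimm:<label>", "slot:<name>", or "inconclusive".
    """
    slots_by_dimm = defaultdict(set)  # failing DIMM -> slots it failed in
    dimms_by_slot = defaultdict(set)  # failing slot -> DIMMs that failed there
    for dimm, slot, had_error in observations:
        if had_error:
            slots_by_dimm[dimm].add(slot)
            dimms_by_slot[slot].add(dimm)

    # One DIMM failing in more than one slot: the error follows the stick.
    if len(slots_by_dimm) == 1:
        dimm, slots = next(iter(slots_by_dimm.items()))
        if len(slots) > 1:
            return f"dimm:{dimm}"
    # One slot failing with more than one DIMM: the error stays with the slot.
    if len(dimms_by_slot) == 1:
        slot, dimms = next(iter(dimms_by_slot.items()))
        if len(dimms) > 1:
            return f"slot:{slot}"
    return "inconclusive"

print(locate_fault([("stick-X", "A2", True), ("stick-X", "B2", True)]))
```

A single failing run, or errors spread over several sticks and slots, comes back as inconclusive, which is when further swaps are worth doing.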
ASKER CERTIFIED SOLUTION
Philip Elder
This solution is only available to members.
wfcrr

ASKER

I found the exact same DIMM on eBay and will probably buy it and use that. However, if I decide to just run it with 5 sticks, how should they be installed? I don't understand why A4 was not used. With 6 sticks they used A1-3 and B1-3. If I use just 5 sticks, would I install them in A1-3 and B1-2?

Our live server only uses 64 GB ram, so I don't think we need 80 or 96 in this alternate/substitute server.  I mostly just wanted to make sure this problem wasn't an issue with the motherboard or a cpu.
The eBay vendor will probably send you another stick if they are selling off a pile of machines; it's different if that was the only server they sold.
A is one CPU, B is the other. You should have the same amount of RAM on each CPU to minimise use of the QuickPath bus between CPUs.
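
To make the balance point concrete, here is a small Python sketch that adds up the memory behind each CPU from the slot names. It assumes equal 16 GB sticks, as in this thread. The four-stick layout is only an illustration of a balanced split, not Dell's documented population order, and the function names are made up.

```python
def gb_per_cpu(populated_slots, dimm_size_gb=16):
    """Sum installed memory per CPU; the slot name's first letter is the bank (A or B)."""
    totals = {"A": 0, "B": 0}
    for slot in populated_slots:
        bank = slot[0].upper()
        totals[bank] = totals.get(bank, 0) + dimm_size_gb
    return totals

def is_balanced(populated_slots, dimm_size_gb=16):
    """True when both CPUs have the same amount of memory attached."""
    totals = gb_per_cpu(populated_slots, dimm_size_gb)
    return totals["A"] == totals["B"]

five_sticks = ["A1", "A2", "A3", "B1", "B2"]  # layout proposed in the question
four_sticks = ["A1", "A2", "B1", "B2"]        # illustrative balanced layout

print(gb_per_cpu(five_sticks), is_balanced(five_sticks))
print(gb_per_cpu(four_sticks), is_balanced(four_sticks))
```

Five sticks leave 48 GB on one CPU and 32 GB on the other, while four sticks give 32 GB each, for 64 GB total, the same amount the live server uses.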