Randy A


Would love some help with ideas to troubleshoot my Dell PowerEdge r900 - having some major RAM issues.

I'm having some issues with the performance of my Dell PowerEdge R900.
I'm using this machine at my house as a test/lab unit (if you know what the fans sound like, you are probably thinking this guy is crazy; it basically sounds like an airport in our house), but that is for another day!

Equipment Details:
Dell PowerEdge R900 (super old)
Intel X7460 – 24 Cores Total (4 x Intel Xeon Hex Core 2.66 GHz, 16 MB cache, 1066 MHz FSB CPUs)
128GB RAM (32 x 4GB DDR2 ECC Memory)
PERC 6/i - RAID 1 - 2 x 500GB SSD
Windows Server 2012 R2 - Datacenter
DRAC 5 Remote Access Card
4x Gigabit Ethernet
2x Power Supplies
8x Fans - Boeing 727 engine fans (joking ... it is super LOUD)

Note: I found one of the RAM chips to be bad, so I removed it and the chip in the same slot on the other 3 risers. Right now I actually have 112GB RAM. Before I did this, I couldn't even get an OS installed. I was originally getting alerts in my log files, but they are all clean now, including the logs from the DRAC.

I've been using VirtualBox and Hyper-V for a few years now and I know what to expect, so the results I'm getting are alarming (actually, more frustrating).

Rather than start testing from within a VirtualBox/Hyper-V virtual machine, I ran my test on the actual server running 2012 R2 with nothing else running.

The results look like this:
CPU Mark: 9683 - 81st percentile
2D Graphics Mark: 190 - 4th percentile
3D Graphics Mark: N/A
Memory Mark: 6167 - 7th percentile <-----------
Disk Mark: 2356 - 61st percentile

Note: I'm going to come back and add the results from running inside one of my Windows 10 LTSB installs in VirtualBox.
Update: the Memory Mark was 223.5 on the VirtualBox Windows 10 LTSB guest with 4096 MB of RAM allocated to the virtual machine.
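
To put the two Memory Mark figures side by side, here is a minimal Python sketch of the arithmetic. The two scores are the ones quoted above; the function name `slowdown` is only illustrative, and the ratio by itself says nothing about the cause.

```python
# Scores quoted in the post; the helper below is purely illustrative.
host_memory_mark = 6167.0   # bare-metal Windows Server 2012 R2
vm_memory_mark = 223.5      # Windows 10 LTSB guest in VirtualBox


def slowdown(host_score: float, guest_score: float) -> float:
    """Return how many times lower the guest score is than the host score."""
    if guest_score <= 0:
        raise ValueError("guest score must be positive")
    return host_score / guest_score


ratio = slowdown(host_memory_mark, vm_memory_mark)
print(f"Guest Memory Mark is about {ratio:.1f}x lower than the host's")
```

On these numbers the guest scores roughly 27.6 times lower than the host, which is a much bigger gap than the host's own 7th-percentile result would suggest.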

I also ran a Windows Memory Diagnostic test and it came back with no issues (I can't remember the exact wording from Event Viewer).

Does anyone have any suggestions? I'm open to anything at this point although I'd love to start with the equipment I have rather than replacing 128GB RAM :-)

Thanks in advance for ANY and ALL suggestions!!
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

What's the actual issue? Can you take 4 memory modules (1 per CPU) and test them, or do you only have 4?

We have found that the only way with lots of memory and CPUs is to test one slot at a time, per CPU, and build up slowly until the faulty memory module or faulty slot is found...
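
As a rough illustration of the "build up slowly" approach, the Python sketch below generates the sequence of configurations to test and reports which sockets were added in the first step that fails. The riser labels `A` to `D`, the 8 slots per riser (32 DIMMs over 4 risers, per the spec list above), and the `passes` callback are all assumptions made for the example; in practice "passes" means whatever manual test you run after reseating the DIMMs.

```python
from typing import Callable, List, Optional

RISERS = ["A", "B", "C", "D"]   # hypothetical riser labels
SLOTS_PER_RISER = 8             # 32 DIMMs / 4 risers, per the post


def build_up_plan(risers: List[str] = RISERS,
                  slots: int = SLOTS_PER_RISER) -> List[List[str]]:
    """Return cumulative configurations: step n populates slots 1..n on every riser."""
    plan = []
    for n in range(1, slots + 1):
        plan.append([f"{r}{s}" for r in risers for s in range(1, n + 1)])
    return plan


def first_failing_step(plan: List[List[str]],
                       passes: Callable[[List[str]], bool]) -> Optional[List[str]]:
    """Return the sockets added in the first failing step, or None if every step passes."""
    previous: List[str] = []
    for config in plan:
        if not passes(config):
            return [s for s in config if s not in previous]
        previous = config
    return None


# Simulated run: pretend socket C3 (or the DIMM in it) is faulty.
suspects = first_failing_step(build_up_plan(), lambda cfg: "C3" not in cfg)
print(suspects)  # ['A3', 'B3', 'C3', 'D3']
```

The failing step narrows things down to one row of sockets; swapping known-good DIMMs across that row then separates a bad module from a bad slot.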
It's been a LONG time since I used an R900, but one key detail I remember is the Dell Systems Management Tools DVD. It provided many of the tools needed to determine whether there are underlying issues with the hardware.

https://www.dell.com/support/home/us/en/04/Drivers/DriversDetails?driverId=4HHMH&fileId=3485849165&osCode=WNET&productCode=poweredge-r900&languageCode=EN&categoryId=SZ

Have you tried using that for setting up the environment first? Also, I recommend checking whether the BIOS needs to be updated as well.
https://www.dell.com/support/home/us/en/04/product-support/product/poweredge-r900/drivers
Randy A

ASKER

Andrew,
I'll be honest, I'm not a server hardware expert, but I'll tell you what I have researched.

I found a good article about the R900 RAM population rules. This server has 4 CPUs and 4 RAM riser boards, and the rules are as follows:
  • All systems will include 4 Memory Risers
  • Supported configuration is a minimum of one DIMM in each riser
  • Memory must be populated beginning with DIMM Slot 1
  • Additional memory must be added sequentially beginning with Slot 2
  • DIMMs in the same number sockets in risers 1 & 2 or risers 3 & 4 must be the same
  • Identically numbered FBDIMM sockets for both memory boards must be populated with FBDIMMs identical in terms of timing, technology, and size. Example: DIMM A1 and B1 must be identical.
  • FBDIMMs installed in different socket positions (numbers) on a memory board can be different for dual-channel operation. Example: DIMMs A1 and B1 can be different from DIMMs A2 and B2.
  • Additional memory can be added by installing identical pairs of DIMMs in the lowest numbered available slots.
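
For anyone who wants to sanity-check a planned layout against these rules before pulling DIMMs, here is a minimal Python sketch. It is an assumption-laden simplification: the `Config` layout and `check_population` name are made up for the example, risers are numbered 1 to 4, and DIMM size in GB stands in for "identical in terms of timing, technology, and size".

```python
from typing import Dict, List, Optional

# Riser number -> DIMM size in GB for each slot (index 0 is slot 1), None = empty.
Config = Dict[int, List[Optional[int]]]


def check_population(config: Config, risers=(1, 2, 3, 4)) -> List[str]:
    """Return a list of rule violations; an empty list means these checks pass."""
    problems = []
    for r in risers:
        filled = [d is not None for d in config.get(r, [])]
        if not any(filled):
            problems.append(f"riser {r}: needs at least one DIMM")
            continue
        # Sequential fill from slot 1: no populated slot may follow an empty one.
        if False in filled and any(filled[filled.index(False):]):
            problems.append(f"riser {r}: slots must be filled in order from slot 1")
    # Same-numbered sockets on risers 1 & 2 and on risers 3 & 4 must match.
    for a, b in ((1, 2), (3, 4)):
        sa, sb = config.get(a, []), config.get(b, [])
        for i in range(max(len(sa), len(sb))):
            da = sa[i] if i < len(sa) else None
            db = sb[i] if i < len(sb) else None
            if da != db:
                problems.append(f"risers {a}/{b}: slot {i + 1} DIMMs differ")
    return problems


# The one-DIMM-per-riser layout: a 4GB DIMM in slot 1 of every riser.
one_per_riser = {r: [4] + [None] * 7 for r in (1, 2, 3, 4)}
print(check_population(one_per_riser))  # []
```

A layout with a gap (say slot 1 and slot 3 filled but slot 2 empty) or with different DIMMs in slot 1 of risers 1 and 2 would come back with a message for each broken rule.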

I'm not with the server right now, but I will do the one-DIMM-per-riser test following the rules and report back.
I think I was just being lazy and trying to find software or a diagnostic tool that could explain it.

I will try and knock that out tonight and get back to you.

Again - thanks for the suggestion!!
We've sat with Dell servers for five days trying to work out memory issues, tracking down slot or memory faults.

It really helps if you know you have good memory or a good server; otherwise, you have to test slowly...
Randy A

ASKER

Michael,
It's been a long time for you because this server is old as dirt :-)

Based on everything I've done so far, I feel confident that all firmware/BIOS versions are up to date.
It is running "Dell BIOS 10G, 1.2.0"

I have NOT run the "Dell Systems Management Tools DVD", but I'm downloading it now and will test it when I get home along with Andrew's suggestions.

Thanks and stay tuned!
Randy A

ASKER

Andrew - I've been very tempted (and am getting really close) to pull the power and take my losses. I've already wasted a TON of time and, as I've mentioned, even if I get it working, I'm going to have to find a solution for the Boeing 727 fans they have in there. I posted here as a last-ditch effort to get some advice from professionals before I give up. If we can't get it figured out here, I'm going to cut my losses. Thanks again.
We use smaller servers in our labs now, due to high noise, and electrical costs...
Randy A

ASKER

Andrew - amen to that.
As for the fan noise, I found these articles / threads...
https://www.reddit.com/r/homelab/comments/4oed3o/how_to_quiet_down_a_dell_r900/
http://www.ratzblog.com/2014/08/reducing-dell-poweredge-pe-295029002800.html
http://s.co.tt/2013/06/08/reducing-noise-from-a-dell-poweredge-r905/

After reading the articles, it seems you can either resistor-mod the fans themselves (if you like playing with solder) or take a look on eBay (or elsewhere) to see if someone else has already done that and is selling them as more of a plug-and-play option.
Randy A

ASKER

Michael,
Apparently my Google searching isn't as good as yours!! Thank you so much.
You may get thank you cards from everyone in my house and our neighbors ;-)
I used to run a Beowulf cluster in my garage back in the late 90's / early 2000's using Dell equipment, so I know all about fan noise and power consumption (trying to explain an $800+ power bill to the wife was not fun). After building enough data centers from the ground up and knowing what all goes into it, let's just say I am glad other hosting companies have taken over so I no longer have to do that work.
Randy A

ASKER

Michael/Andrew,
First of all, thanks again for the suggestions. Here is what I did last night.

The following 2 tests were done with PassMark to be consistent. I also confirmed the chips I'm testing with are EXACTLY the same.
64GB - 4x4 chip test - removed 4 of the 8 RAM chips from each riser and ran a test using PassMark - no change
16GB - 4x1 chip test - removed 7 of the 8 RAM chips from each riser and ran a test using PassMark - no change
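
The capacities for each layout follow directly from 4 risers times DIMMs per riser times 4GB per DIMM. A quick Python sketch of that arithmetic is below; the constant and function names are illustrative, and the labels simply describe the layouts mentioned in this thread.

```python
DIMM_GB = 4   # every DIMM in this server is 4GB
RISERS = 4    # the R900 has 4 memory risers


def total_gb(dimms_per_riser: int, dimm_gb: int = DIMM_GB, risers: int = RISERS) -> int:
    """Total installed RAM for an evenly populated server."""
    return risers * dimms_per_riser * dimm_gb


layouts = [
    ("fully populated", 8),
    ("one DIMM pulled per riser", 7),
    ("4x4 test", 4),
    ("4x1 test", 1),
]
for label, per_riser in layouts:
    print(f"{label}: {total_gb(per_riser)} GB")
```

This gives 128, 112, 64, and 16 GB respectively, matching the figures quoted in the thread.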

Downloaded and tried to use the "Dell Systems Management Tools DVD". Honestly, I'm not really sure what to do with it (I think I need to do some additional reading on it).

I went back to the 64GB 4x4 RAM chip setup and decided to run a test using memtest86. BUT, now I'm away from my house and having issues connecting to the iDRAC to see the screen. When I saw it this morning, it had been running for 12 hours.

I'll update this when I get home.
I would not bother with any Memory Tester application, the best test is to install an OS.

If a memory fault exists, the OS will crash (BSOD).
Randy A

ASKER

Andrew -
Question - I haven't had the OS fail at all (no BSOD). If I'm on the OS (Windows Server 2012 R2), it is pretty quick and runs nicely.
If I run the PassMark performance RAM test, the results are terrible (unless I'm not understanding it).
And obviously, the BIGGEST issue is when I attempt to use a virtual machine (VirtualBox or Hyper-V) - that is when everything goes downhill.

Do you think I'm wrong to think that my PassMark results are concerning? If I'm wrong, then I probably need to move on to testing something else, as I just don't have any RAM errors.

Thoughts?
Thanks again!
ASKER CERTIFIED SOLUTION
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
This solution is only available to members.
Randy A

ASKER

Andrew -
I will absolutely stop using those tools - that is just what I started with and it "appeared" to point to an issue that made sense.

I think I'm going to close this question and open up a new one regarding VirtualBox (regardless of how much RAM I give one machine, it runs TERRIBLY).

Thanks everyone for their help and feedback!
Randy A

ASKER

Thank you Andrew and Michael!
If all else fails, sell it on eBay and use the funds to get something a little more up to date.