Solved

Hardware Error on Sun Blade 100

Posted on 2001-07-17
Last Modified: 2010-04-29
Help!
I got some errors like this one:
warning: uncorrectable error from pci0(upa mid0) during DVMA write transaction byte mask is ff.
ASFR=240000ff.00000000 AFAR=00000000.6c6db180
double word offset=0,memeory module Dimm4 id 31
secondary error from DVMA write transactiom

panic[cpu0]/thread=2a10001fd40: Fatal PCI UE ERROR

Computer reboots
The problem is hard to track down: the system runs fine for 2 or 3 days, then I get a burst of these errors one after another (2 to 5 times), and after that it is fine again for the next few days.


My configuration
SB100 with
1024 MB RAM
            MICRON
            MT18LSDT3272AG-133B1 PC133U-333-542-A
            SG CBNAKKA005 200114
            256MB, SYNCH, 133Mhz, CL3, ECC
 
2 x HD: no. 1 is the standard 15 GB drive (shipped with the system)
no. 2 is a 40 GB MAXTOR

SOFTWARE: Solaris 04/01 with the recommended patches for Solaris 8, dated 11.07.2001

Oracle 9i RDBMS with 2 databases

output of /usr/platform/sun4u/sbin/prtdiag follows:

# /usr/platform/sun4u/sbin/prtdiag
System Configuration: Sun Microsystems sun4u Sun Blade 100 (UltraSPARC-IIe)
System clock frequency: 84 MHZ
Memory size: 1GB

==================================== CPUs ====================================
               E$          CPU     CPU      Temperature
CPU  Freq      Size        Impl.   Mask   Die       Ambient
---  --------  ----------  ------  ----   --------  --------
 0   502 MHz   256KB       US-IIe  1.4    79 C      35 C

================================= IO Devices =================================
     Bus  Freq
Brd  Type MHz   Slot  Name                              Model
---  ---- ----  ----  --------------------------------  ----------------------
 0   pci   33     7   isa/dma-isadma (dma)
 0   pci   33     7   isa/serial-su16550 (serial)
 0   pci   33     7   isa/serial-su16550 (serial)
 0   pci   33     8   sound-pci10b9,5451.10b9.5451.1 (+
 0   pci   33    12   network-pci108e,1101.1 (network)  SUNW,pci-eri
 0   pci   33    12   firewire-pci108e,1102.1001 (fire+
 0   pci   33    13   ide-pci10b9,5229.c3 (ide)
 0   pci   33    19   SUNW,m64B (display)               ATY,RageXL

============================ Memory Configuration ============================
Segment Table:
-----------------------------------------------------------------------
Base Address       Size       Interleave Factor  Contains
-----------------------------------------------------------------------
0x0                256MB             1           Label DIMM0
0x20000000         256MB             1           Label DIMM1
0x40000000         256MB             1           Label DIMM2
0x60000000         256MB             1           Label DIMM3

=============================== usb Devices ===============================

Name Port#
------------ -----
mouse 2
keyboard 4
#

If it helps I can send crash core files.
But they are 87 and 88 MB large!



Any suggestions?
thanx.
 


Iouri
 
Question by:bespalov
 
LVL 13

Expert Comment

by:magarity
ID: 6290300
"double word offset=0,memeory module Dimm4 id 31"

This line is the kicker; it indicates an ECC fault with DIMM #4.  Replace this stick and the problem should go away.  If you have logs of these errors, double-check that it is always DIMM #4 that is the troublemaker.

regards,
magarity
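
The "double-check that it is always DIMM #4" step can be sketched as a small script. This is only a minimal sketch: it assumes the logged messages have the same shape as the line quoted above, and the third sample line is a made-up variant added purely to show the counting. On Solaris the text would typically come from /var/adm/messages, which is also an assumption about this system.

```python
import re
from collections import Counter

# Matches the module name in lines such as
# "double word offset=0,memeory module Dimm4 id 31".
# "mem\w*" tolerates the misspelling that appears in the post.
MODULE_RE = re.compile(r"mem\w*\s+module\s+(\w+)", re.IGNORECASE)

def count_modules(log_text):
    """Count how often each memory module label is named in the log text."""
    return Counter(m.group(1).upper() for m in MODULE_RE.finditer(log_text))

# First two lines repeat the message from the post; the last line is a
# hypothetical example, not taken from any real log.
sample = """\
double word offset=0,memeory module Dimm4 id 31
double word offset=0,memeory module Dimm4 id 31
double word offset=0,memory module Dimm2 id 12
"""

counts = count_modules(sample)
for label, n in counts.most_common():
    print(label, n)
```

If one label dominates the counts, that points at a single stick or slot; errors spread evenly across labels would point away from one bad DIMM.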
LVL 13

Expert Comment

by:magarity
ID: 6290324
PS - Yes, ECC is supposed to correct errors in memory, but it only corrects single bit errors.  This error message indicates that ECC is failing because more than one bit is incorrect.
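
To illustrate the single-bit versus multi-bit point, here is a toy single-error-correct, double-error-detect (SECDED) code over 4 data bits. It is only an illustration of the principle: real ECC memory typically protects 64 data bits with 8 check bits, and the function names below are made up for this sketch.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (list of 0/1).

    Index 0 holds the overall parity bit; indices 1..7 are the
    classic Hamming(7,4) positions (parity at 1, 2 and 4).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) % 2  # makes the total number of 1 bits even
    return c

def classify(codeword):
    """Return 'ok', 'corrected' or 'uncorrectable' for a received word."""
    syndrome = 0
    for pos in range(1, 8):
        if codeword[pos]:
            syndrome ^= pos
    parity_broken = sum(codeword) % 2 == 1
    if syndrome == 0 and not parity_broken:
        return "ok"
    if parity_broken:
        # Odd number of flips: treated as a single-bit error,
        # the syndrome gives its position (0 = the parity bit itself).
        return "corrected"
    # Non-zero syndrome with even parity: two bits flipped.
    return "uncorrectable"

def flip(codeword, pos):
    """Return a copy of the codeword with one bit inverted."""
    out = list(codeword)
    out[pos] ^= 1
    return out

word = encode(0b1011)
print(classify(word))                    # clean word
print(classify(flip(word, 5)))           # one bit flipped
print(classify(flip(flip(word, 5), 6)))  # two bits flipped
```

The third case is the analogue of the "uncorrectable error" in the panic message: the code can tell that something is wrong but not which bits to repair.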

Oh, and I don't know if Sun starts numbering the DIMMs at 0 or 1 so read the PCB closely as it should be labeled somewhere.
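
One way to cross-check the numbering is to compare the address in the error (AFAR=00000000.6c6db180) with the segment table in the prtdiag output from the question. The sketch below assumes that the low word of AFAR is a physical address and that each segment maps to exactly one DIMM (interleave factor 1, as prtdiag reports); both are assumptions, not something the error message states.

```python
# Segment table copied from the prtdiag output in the question.
SEGMENTS = [
    (0x00000000, "DIMM0"),
    (0x20000000, "DIMM1"),
    (0x40000000, "DIMM2"),
    (0x60000000, "DIMM3"),
]
SEGMENT_SIZE = 256 * 1024 * 1024  # each segment is reported as 256MB

def label_for(addr):
    """Return the prtdiag label whose segment contains addr, or None."""
    for base, label in SEGMENTS:
        if base <= addr < base + SEGMENT_SIZE:
            return label
    return None

afar = 0x6c6db180  # low word of AFAR from the error message
print(label_for(afar))  # prints DIMM3
```

Under those assumptions the faulting address lies in the last segment, which prtdiag labels DIMM3. That would be consistent with the error message counting from 1 ("Dimm4") while prtdiag counts from 0, but the labels printed on the board remain the authority.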
 

Author Comment

by:bespalov
ID: 6294960
Hi magarity,
it doesn't help. If I replace this DIMM, I get the same error on another one, but it is always the last one.
Something else: if I don't start the databases, I don't get any errors.
With all databases started I have only 30-50 MB of RAM left.
 
LVL 13

Accepted Solution

by:magarity
ID: 6295415
Well, a couple of observations:
1. That error message means an uncorrectable (multi-bit) ECC error in the RAM.
2. Starting the databases makes heavy use of all the RAM.

Conclusion:
There is a problem somewhere in either the RAM or the RAM controller logic.  Do you by any chance have an identical machine whose memory you could swap, to see whether the problem follows the RAM?

My primary source has been the Sun Managers mailing list archives.  This person has the exact same error message:
http://www.sunmanagers.org/pipermail/sunmanagers/2001-January/000832.html
and here is the final message in the thread where the solution was revealed:
http://www.sunmanagers.org/pipermail/sunmanagers/2001-January/000872.html

A couple of other posters on that list had similar problems, all of which were solved by replacing the one offending stick explicitly named in the text of the error message, thus my original diagnosis.  Only the one quoted above had to replace all of his memory.

I realize that memory for Sun machines isn't exactly cheap to replace.  Is it still under warranty?

Good luck!
magarity