Hardware Error on Sun Blade 100

I got some errors like this one:
warning: uncorrectable error from pci0(upa mid0) during DVMA write transaction byte mask is ff.
ASFR=240000ff.00000000 AFAR=00000000.6c6db180
double word offset=0,memeory module Dimm4 id 31
secondary error from DVMA write transactiom

panic[cpu0]/thread=2a10001fd40: Fatal PCI UE ERROR

The computer then reboots.
The problem is difficult to track down: the system runs fine for 2 or 3 days, then I get several of these errors in a row (2 to 5 times), and after that it is fine again for a few days.

My configuration
SB100 with
1024 MB RAM
            MT18LSDT3272AG-133B1 PC133U-333-542-A
            SG CBNAKKA005 200114
            256MB, SYNCH, 133Mhz, CL3, ECC
2 x HD: no. 1 is the standard 15 GB disk (shipped with the machine),
no. 2 is a 40 GB Maxtor

SOFTWARE: Solaris 04/01 with the recommended patches for Solaris 8, dated 11.07.2001

Oracle 9i RDBMS with 2 databases

output of /usr/platform/sun4u/sbin/prtdiag follows:

# /usr/platform/sun4u/sbin/prtdiag
System Configuration: Sun Microsystems sun4u Sun Blade 100 (UltraSPARC-IIe)
System clock frequency: 84 MHZ
Memory size: 1GB

==================================== CPUs ====================================
               E$          CPU     CPU       Temperature
CPU  Freq      Size        Impl.   Mask    Die      Ambient
---  --------  ----------  ------  ----  --------  --------
 0   502 MHz   256KB       US-IIe  1.4     79 C      35 C

================================= IO Devices =================================
     Bus   Freq
Brd  Type  MHz   Slot  Name                              Model
---  ----  ----  ----  --------------------------------  ----------------------
 0   pci    33     7   isa/dma-isadma (dma)
 0   pci    33     7   isa/serial-su16550 (serial)
 0   pci    33     7   isa/serial-su16550 (serial)
 0   pci    33     8   sound-pci10b9,5451.10b9.5451.1 (+
 0   pci    33    12   network-pci108e,1101.1 (network)  SUNW,pci-eri
 0   pci    33    12   firewire-pci108e,1102.1001 (fire+
 0   pci    33    13   ide-pci10b9,5229.c3 (ide)
 0   pci    33    19   SUNW,m64B (display)               ATY,RageXL

============================ Memory Configuration ============================
Segment Table:
Base Address  Size   Interleave Factor  Contains
0x0           256MB  1                  Label DIMM0
0x20000000    256MB  1                  Label DIMM1
0x40000000    256MB  1                  Label DIMM2
0x60000000    256MB  1                  Label DIMM3

=============================== usb Devices ===============================

Name          Port#
------------  -----
mouse           2
keyboard        4

If it helps I can send crash core files.
But they are 87 and 88 MB large!

Any suggestions?

1 Solution
"double word offset=0,memeory module Dimm4 id 31"

This line is the kicker; it indicates an ECC fault with DIMM #4.  Replace this stick and the problem should go away.  If you have logs of these errors, double-check that it is always DIMM #4 that is the troublemaker.
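One way to do that double-check is to count how often each module is named across the saved messages. The following is only a minimal sketch: it assumes the kernel messages have been saved as plain text (on Solaris they normally end up in /var/adm/messages) and that every error names the module the same way as the line quoted above. The function name `count_modules` and the sample text are made up for illustration.

```python
import re
from collections import Counter

# Matches the module name in lines such as
# "double word offset=0,memeory module Dimm4 id 31" (spelling as quoted in the question).
MODULE_RE = re.compile(r"module\s+(dimm\s*\d+)", re.IGNORECASE)


def count_modules(log_text):
    """Count how often each memory module is named in the given log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = MODULE_RE.search(line)
        if match:
            # Normalise "Dimm4", "DIMM 4" and so on to "DIMM4".
            name = re.sub(r"\s+", "", match.group(1)).upper()
            counts[name] += 1
    return counts


# Sample text copied from the error in the question; a real run would read the log file.
sample = """\
warning: uncorrectable error from pci0(upa mid0) during DVMA write transaction byte mask is ff.
double word offset=0,memeory module Dimm4 id 31
secondary error from DVMA write transactiom
"""

for name, hits in count_modules(sample).most_common():
    print(name, hits)
```

If more than one module name shows up in the counts, a single bad stick becomes a less likely explanation.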

PS - Yes, ECC is supposed to correct errors in memory, but it only corrects single bit errors.  This error message indicates that ECC is failing because more than one bit is incorrect.
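To illustrate the single-bit versus multi-bit distinction, here is a toy example using an 8-bit extended Hamming code (4 data bits, 3 check bits, 1 overall parity bit). This is not the code the Sun Blade 100 memory controller actually uses; it is only a small sketch of the same idea, and the names `encode` and `classify` are made up for illustration. It also assumes that at most two bits are flipped.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # positions 1, 2 and 4 hold the check bits
CHECK_POSITIONS = (1, 2, 4)


def encode(data_bits):
    """Encode 4 data bits as an 8-bit word: index 0 is overall parity, 1..7 is Hamming(7,4)."""
    word = [0] * 8
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        word[pos] = bit
    for p in CHECK_POSITIONS:
        # Check bit p makes the parity even over every position whose index has bit p set.
        word[p] = sum(word[i] for i in range(1, 8) if i & p) % 2
    word[0] = sum(word) % 2     # overall parity over the other seven bits
    return word


def classify(word):
    """Report what the decoder can conclude, assuming at most two flipped bits."""
    syndrome = 0
    for i in range(1, 8):
        if word[i]:
            syndrome ^= i       # XOR of the positions that hold a 1
    parity_ok = sum(word) % 2 == 0
    if syndrome == 0 and parity_ok:
        return "no error"
    if not parity_ok:
        # One flipped bit: the syndrome gives its position (0 means the parity bit itself).
        return "single-bit error, correctable"
    # Parity looks fine but the syndrome is non-zero: two bits flipped, position unknown.
    return "multi-bit error, uncorrectable"


word = encode([1, 0, 1, 1])

one_flip = word.copy()
one_flip[5] ^= 1

two_flips = word.copy()
two_flips[3] ^= 1
two_flips[6] ^= 1

print(classify(word))
print(classify(one_flip))
print(classify(two_flips))
```

With one flipped bit the decoder knows which bit to repair; with two it can only tell that something is wrong, which is the situation an "uncorrectable error" message reports.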

Oh, and I don't know if Sun starts numbering the DIMMs at 0 or 1 so read the PCB closely as it should be labeled somewhere.
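The numbering question can be cross-checked against the prtdiag output in the question. The sketch below assumes that the AFAR value in the error (00000000.6c6db180) is the physical address of the fault written as two 32-bit halves, and it looks that address up in the segment table from the question. The helper name `label_for` is made up for illustration.

```python
MB = 1024 * 1024

# Segment table copied from the prtdiag output: (base address, size, label).
SEGMENTS = [
    (0x00000000, 256 * MB, "DIMM0"),
    (0x20000000, 256 * MB, "DIMM1"),
    (0x40000000, 256 * MB, "DIMM2"),
    (0x60000000, 256 * MB, "DIMM3"),
]


def label_for(address):
    """Return the label of the segment containing the address, or None if unmapped."""
    for base, size, label in SEGMENTS:
        if base <= address < base + size:
            return label
    return None


# AFAR as printed in the error message; assumed here to be a physical address.
afar = int("00000000.6c6db180".replace(".", ""), 16)

print(hex(afar), label_for(afar))
```

Under that assumption the address falls in the segment labelled DIMM3, the last one in the table. If so, the "Dimm4" in the message and the label DIMM3 may be the same stick counted from 1 instead of 0, but that is worth confirming against the markings on the board.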
bespalov (Author) commented:
Hi magarity,
That doesn't help. If I replace this DIMM, I get the same error on another one, but it is always the last one.
Something else: if I don't start the databases, I do not get any errors.
If I start all the databases, I have only 30-50 MB of RAM left.
Well, a couple of observations:
1. That error message means an uncorrectable (multi-bit) ECC error in the RAM.
2. Starting the databases makes heavy use of all the RAM.

There is a problem somewhere in either the RAM or the RAM controller logic.  I suppose it would be too easy if you had an identical machine to swap the memory with, to see whether the problem follows the RAM?

My primary source has been Sun Manager's mailing list archives.  This person has the exact same error message:
and here is the final message in the thread where the solution was revealed:

A couple of other posters in that forum had similar problems, all of which were solved by replacing the one offending stick explicitly named in the text of the error message, hence my original diagnosis.  Only the one quoted above had to replace all of his memory.

I realize that memory for Sun servers isn't exactly cheap to replace.  Is it under warranty?

Good luck!
Question has a verified solution.
