Solved

Hardware Error on Sun Blade 100

Posted on 2001-07-17
Last Modified: 2010-04-29
Help!
I got some errors like this one:
warning: uncorrectable error from pci0(upa mid0) during DVMA write transaction byte mask is ff.
ASFR=240000ff.00000000 AFAR=00000000.6c6db180
double word offset=0,memeory module Dimm4 id 31
secondary error from DVMA write transactiom

panic[cpu0]/thread=2a10001fd40: Fatal PCI UE ERROR

The computer then reboots.
The problem is difficult to track down because the system runs fine for 2 or 3 days, then I get a burst of these errors one after another (2 to 5 times), and then it is fine again for the next few days.


My configuration
SB100 with
1024 MB RAM
            MICRON
            MT18LSDT3272AG-133B1 PC133U-333-542-A
            SG CBNAKKA005 200114
            256MB, SYNCH, 133Mhz, CL3, ECC
 
2 x HD: no. 1 is the standard 15 GB drive (shipped with the system),
no. 2 is a 40 GB MAXTOR

SOFTWARE: Solaris 04/01 with the recommended patches for Solaris 8, dated 11.07.2001

Oracle 9i RDBMS with 2 databases

output of /usr/platform/sun4u/sbin/prtdiag follows:

# /usr/platform/sun4u/sbin/prtdiag
System Configuration: Sun Microsystems sun4u Sun Blade 100 (UltraSPARC-IIe)
System clock frequency: 84 MHZ
Memory size: 1GB

==================================== CPUs ====================================
                E$          CPU     CPU       Temperature
CPU  Freq       Size        Impl.   Mask     Die    Ambient
---  --------  ----------  ------  ----  --------  --------
 0   502 MHz   256KB       US-IIe  1.4     79 C      35 C

================================= IO Devices =================================
     Bus   Freq
Brd  Type  MHz   Slot  Name                              Model
---  ----  ----  ----  --------------------------------  ----------------------
 0   pci    33     7   isa/dma-isadma (dma)
 0   pci    33     7   isa/serial-su16550 (serial)
 0   pci    33     7   isa/serial-su16550 (serial)
 0   pci    33     8   sound-pci10b9,5451.10b9.5451.1 (+
 0   pci    33    12   network-pci108e,1101.1 (network)  SUNW,pci-eri
 0   pci    33    12   firewire-pci108e,1102.1001 (fire+
 0   pci    33    13   ide-pci10b9,5229.c3 (ide)
 0   pci    33    19   SUNW,m64B (display)               ATY,RageXL

============================ Memory Configuration ============================
Segment Table:
-----------------------------------------------------------------------
Base Address       Size       Interleave Factor  Contains
-----------------------------------------------------------------------
0x0                256MB             1           Label DIMM0
0x20000000         256MB             1           Label DIMM1
0x40000000         256MB             1           Label DIMM2
0x60000000         256MB             1           Label DIMM3

=============================== usb Devices ===============================

Name          Port#
------------  -----
mouse           2
keyboard        4
#

If it helps, I can send the crash core files, but they are 87 and 88 MB in size!

Any suggestions?
Thanks.
 


Iouri
 
Question by:bespalov

Expert Comment

by:magarity
ID: 6290300
"double word offset=0,memeory module Dimm4 id 31"

This line is the kicker; it indicates an ECC fault with DIMM #4.  Replace this stick and the problem should go away.  If you have logs of these errors, double-check that it is always DIMM #4 that is the troublemaker.
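If the errors are being logged, a small script can do that double-check. The following is only a sketch: the pattern is modeled on the single line quoted above (including its "memeory" spelling), and the exact wording of the message on other systems is an assumption. The sample text is just that quoted line; in practice you would feed in the contents of your system log instead.

```python
import re
from collections import Counter

# Pattern modeled on the one line quoted in this thread. It tolerates the
# "memeory" spelling seen there as well as "memory". The message format on
# other systems is an assumption.
DIMM_RE = re.compile(r"mem\w*ry module\s+(\w+)", re.IGNORECASE)


def tally_dimms(log_text):
    """Count how often each memory module name appears in the log text."""
    return Counter(m.group(1).upper() for m in DIMM_RE.finditer(log_text))


# Sample input: the line quoted from the original post.
sample = "double word offset=0,memeory module Dimm4 id 31\n"

print(dict(tally_dimms(sample)))  # {'DIMM4': 1}
```

If more than one module name shows up in the tally, a single bad stick becomes a less likely explanation.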

regards,
magarity

Expert Comment

by:magarity
ID: 6290324
PS - Yes, ECC is supposed to correct errors in memory, but it only corrects single-bit errors.  This error message indicates that ECC is failing because more than one bit is incorrect.
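To illustrate the single-bit versus multi-bit distinction, here is a toy sketch of a "single error correct, double error detect" code. It uses a 4-bit Hamming code plus one overall parity bit, which is far smaller than what ECC memory actually uses, and the function names are made up for this example. It only classifies the error; it does not repair the word.

```python
def encode(data_bits):
    """Encode 4 data bits as an 8-bit word: Hamming(7,4) plus overall parity.

    Index 0 holds the overall parity bit; indices 1..7 are the usual Hamming
    positions, with check bits at 1, 2 and 4 and data at 3, 5, 6 and 7.
    """
    word = [0] * 8
    word[3], word[5], word[6], word[7] = data_bits
    for p in (1, 2, 4):
        # Each check bit makes the parity of its group of positions even.
        word[p] = sum(word[i] for i in range(1, 8) if i & p) % 2
    word[0] = sum(word[1:]) % 2  # makes the parity of the whole word even
    return word


def classify(word):
    """Return 'ok', 'corrected' (one bit flipped) or 'uncorrectable' (two)."""
    syndrome = 0
    for p in (1, 2, 4):
        if sum(word[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    overall_odd = sum(word) % 2 == 1
    if syndrome == 0 and not overall_odd:
        return "ok"
    if overall_odd:
        # Odd overall parity means exactly one flipped bit; the syndrome
        # gives its position (0 means the overall parity bit itself).
        return "corrected"
    # Non-zero syndrome with even overall parity: two bits flipped.
    return "uncorrectable"


word = encode([1, 0, 1, 1])
one_flip = word.copy()
one_flip[5] ^= 1
two_flips = word.copy()
two_flips[5] ^= 1
two_flips[6] ^= 1

print(classify(word), classify(one_flip), classify(two_flips))
# ok corrected uncorrectable
```

The third case is the analogue of the "uncorrectable error" in the panic message: the code can tell the data is wrong but cannot tell which bits to fix.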

Oh, and I don't know if Sun starts numbering the DIMMs at 0 or 1 so read the PCB closely as it should be labeled somewhere.
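One way to cross-check the numbering without opening the case is to compare the address in the error message against the segment table in the prtdiag output. The sketch below assumes that the AFAR value is the physical address of the fault and that it is printed as two 32-bit hex words (high.low); both are assumptions, not something the error text itself states. The base addresses, sizes and labels are copied from the prtdiag output in the question.

```python
# Segment table copied from the prtdiag output in the question:
# (base address, label). Every segment is listed there as 256MB.
SEGMENTS = [
    (0x00000000, "DIMM0"),
    (0x20000000, "DIMM1"),
    (0x40000000, "DIMM2"),
    (0x60000000, "DIMM3"),
]
SEGMENT_SIZE = 256 * 1024 * 1024


def parse_afar(text):
    """Parse 'hhhhhhhh.llllllll' as a 64-bit address (assumed format)."""
    high, low = text.split(".")
    return (int(high, 16) << 32) | int(low, 16)


def label_for(addr):
    """Return the label of the segment containing addr, or None."""
    for base, label in SEGMENTS:
        if base <= addr < base + SEGMENT_SIZE:
            return label
    return None


print(label_for(parse_afar("00000000.6c6db180")))  # DIMM3
```

Under those assumptions the reported address falls in the segment labelled DIMM3, the last of the four. Whether the "Dimm4" in the message is the same stick counted from 1 is not something this calculation can confirm.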

Author Comment

by:bespalov
ID: 6294960
Hi magarity,
That doesn't help. If I replace this DIMM, I get the same error on another one, but it is always the last one.
Something else: if I don't start the databases, I don't get any errors.
If I start all the databases, I have only 30-50 MB of RAM left.

Accepted Solution

by:magarity (earned 200 total points)
ID: 6295415
Well, a couple of observations:
1. That error message means an uncorrectable memory error, i.e. one that ECC could not fix.
2. Starting the databases makes heavy use of all the RAM.

Conclusion:
There is a problem somewhere in either the RAM or the RAM controller logic.  I don't suppose you have an identical machine whose memory you could swap in, to see whether the problem follows the RAM?

My primary source has been the Sun Managers mailing list archives.  This person has the exact same error message:
http://www.sunmanagers.org/pipermail/sunmanagers/2001-January/000832.html
and here is the final message in the thread where the solution was revealed:
http://www.sunmanagers.org/pipermail/sunmanagers/2001-January/000872.html

A couple of other posters on that list had similar problems, and all of them were solved by replacing the one offending stick explicitly named in the text of the error message, hence my original diagnosis.  Only the one linked above had to replace all of his memory.

I realize that memory for Sun machines isn't exactly cheap to replace.  Is it under warranty?

Good luck!
magarity