Solved

Hardware Error on Sun Blade 100

Posted on 2001-07-17
Last Modified: 2010-04-29
Help!
I got some errors like this one:
warning: uncorrectable error from pci0(upa mid0) during DVMA write transaction byte mask is ff.
AFSR=240000ff.00000000 AFAR=00000000.6c6db180
double word offset=0,memory module Dimm4 id 31
secondary error from DVMA write transaction

panic[cpu0]/thread=2a10001fd40: Fatal PCI UE ERROR

Computer reboots
It's difficult to track down the problem, because my system runs fine for 2 or 3 days, then I get several of these errors in a row (2 to 5 times), and after that it's OK again for the next few days.


My configuration
SB100 with
1024 MB RAM
            MICRON
            MT18LSDT3272AG-133B1 PC133U-333-542-A
            SG CBNAKKA005 200114
            256MB, SYNCH, 133Mhz, CL3, ECC
 
2 x HD: no. 1 is the standard 15 GB disk (shipped with the system)
        no. 2 is a 40 GB MAXTOR

SOFTWARE: Solaris 04/01 with the recommended patches for Solaris 8, dated 11.07.2001

Oracle 9i RDBMS with 2 databases

output of /usr/platform/sun4u/sbin/prtdiag follows:

# /usr/platform/sun4u/sbin/prtdiag
System Configuration: Sun Microsystems sun4u Sun Blade 100 (UltraSPARC-IIe)
System clock frequency: 84 MHZ
Memory size: 1GB

==================================== CPUs ====================================
                E$          CPU     CPU      Temperature
CPU  Freq       Size        Impl.   Mask   Die       Ambient
---  --------   ----------  ------  ----   --------  --------
 0   502 MHz    256KB       US-IIe  1.4    79 C      35 C

================================= IO Devices =================================
     Bus   Freq
Brd  Type  MHz   Slot  Name                              Model
---  ----  ----  ----  --------------------------------  ----------------------
 0   pci    33     7   isa/dma-isadma (dma)
 0   pci    33     7   isa/serial-su16550 (serial)
 0   pci    33     7   isa/serial-su16550 (serial)
 0   pci    33     8   sound-pci10b9,5451.10b9.5451.1 (+
 0   pci    33    12   network-pci108e,1101.1 (network)  SUNW,pci-eri
 0   pci    33    12   firewire-pci108e,1102.1001 (fire+
 0   pci    33    13   ide-pci10b9,5229.c3 (ide)
 0   pci    33    19   SUNW,m64B (display)               ATY,RageXL

============================ Memory Configuration ============================
Segment Table:
-----------------------------------------------------------------------
Base Address       Size       Interleave Factor  Contains
-----------------------------------------------------------------------
0x0                256MB             1           Label DIMM0
0x20000000         256MB             1           Label DIMM1
0x40000000         256MB             1           Label DIMM2
0x60000000         256MB             1           Label DIMM3

=============================== usb Devices ===============================

Name Port#
------------ -----
mouse 2
keyboard 4
#

If it helps I can send crash core files.
But they are 87 and 88 MB large!



Any suggestions?
thanx.
 


Iouri
 
Question by:bespalov
4 Comments
 
LVL 13

Expert Comment

by:magarity
ID: 6290300
"double word offset=0,memory module Dimm4 id 31"

This line is the kicker; it indicates an ECC fault with DIMM #4.  Replace this stick and the problem should go away.  If you have logs of these errors, double-check that it is always DIMM #4 that is the troublemaker.
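If you keep the logs, a quick tally makes that check easy. Here is a rough Python sketch, assuming the messages look like the line quoted above; the function name and the sample text are made up for illustration, and on a real system you would read the text from wherever your errors are logged (on Solaris that is normally /var/adm/messages).

```python
import re
from collections import Counter

# Pattern modeled on the quoted message. It is case-insensitive and
# tolerates minor misspellings of "memory" in hand-typed copies.
DIMM_PATTERN = re.compile(r"mem\w*\s+module\s+(dimm\s*\d+)", re.IGNORECASE)


def count_dimm_errors(log_text):
    """Return a Counter mapping each reported DIMM name to its error count."""
    counts = Counter()
    for match in DIMM_PATTERN.finditer(log_text):
        # Normalize "Dimm4", "DIMM 4", etc. to a single spelling.
        name = re.sub(r"\s+", "", match.group(1)).upper()
        counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample text; replace with the contents of your own log.
    sample = (
        "double word offset=0,memory module Dimm4 id 31\n"
        "double word offset=0,memory module DIMM4 id 31\n"
    )
    for dimm, hits in count_dimm_errors(sample).most_common():
        print(dimm, hits)
```

If more than one DIMM name shows up in the tally, a single bad stick becomes a less likely explanation.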

regards,
magarity
 
LVL 13

Expert Comment

by:magarity
ID: 6290324
PS - Yes, ECC is supposed to correct errors in memory, but it only corrects single bit errors.  This error message indicates that ECC is failing because more than one bit is incorrect.
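To show what "corrects one bit, only detects two" means, here is a toy Python sketch of an extended Hamming (8,4) code. It is an illustration only: it is far smaller than a real ECC memory word, and I am not claiming it is the scheme the Sun Blade 100 memory controller uses. The function names are my own.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    # Data bits go to Hamming positions 3, 5, 6, 7; parity bits to 1, 2, 4.
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2         # makes the total number of 1s even
    return code


def classify(code):
    """Return 'ok', 'corrected' (one bit flipped) or 'uncorrectable' (two)."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        return "ok"
    if odd_parity:
        # Exactly one bit flipped; the syndrome is its position
        # (0 means the overall parity bit itself), so it could be repaired.
        return "corrected"
    # Parity still even but syndrome nonzero: two bits flipped, location unknown.
    return "uncorrectable"


word = encode(0b1011)
one_flip = word[:]
one_flip[5] ^= 1
two_flips = word[:]
two_flips[5] ^= 1
two_flips[6] ^= 1
print(classify(word), classify(one_flip), classify(two_flips))
```

The third case is the situation the panic message describes: the hardware can tell the data is wrong but cannot tell which bits to fix.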

Oh, and I don't know if Sun starts numbering the DIMMs at 0 or 1 so read the PCB closely as it should be labeled somewhere.
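One way to cross-check the numbering is to compare the AFAR value from the error with the segment table in your prtdiag output. The Python sketch below does that arithmetic. It assumes the AFAR is a physical address printed as two 32-bit halves and that each segment maps to exactly one DIMM (the table shows an interleave factor of 1); both are assumptions on my part, and the function names are made up.

```python
MB = 1024 * 1024

# Segment table copied from the prtdiag output in the question:
# (base address, size in bytes, label)
SEGMENTS = [
    (0x00000000, 256 * MB, "DIMM0"),
    (0x20000000, 256 * MB, "DIMM1"),
    (0x40000000, 256 * MB, "DIMM2"),
    (0x60000000, 256 * MB, "DIMM3"),
]


def parse_afar(text):
    """Turn an 'hhhhhhhh.llllllll' register dump into one integer address."""
    high, low = text.split(".")
    return (int(high, 16) << 32) | int(low, 16)


def label_for(address):
    """Return the label of the segment containing address, or None."""
    for base, size, label in SEGMENTS:
        if base <= address < base + size:
            return label
    return None


# AFAR value taken from the error message in the question.
print(label_for(parse_afar("00000000.6c6db180")))   # prints DIMM3
```

Under those assumptions the faulting address 0x6c6db180 lies in the segment starting at 0x60000000, which prtdiag labels DIMM3. That would suggest the "Dimm4" in the error message counts from 1, but check the labels on the board before pulling anything.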
 

Author Comment

by:bespalov
ID: 6294960
Hi magarity,
It doesn't help. If I replace this DIMM, I get the same error on another one, but it is always the one in the last slot.
Something else: if I don't start the databases, I don't get the errors.
If I start all the databases, I have only 30-50 MB of RAM left.
 
LVL 13

Accepted Solution

by:
magarity earned 200 total points
ID: 6295415
Well, a couple of observations:
1. That error message means an uncorrectable ECC error in the RAM.
2. Starting the databases makes heavy use of all the RAM.

Conclusion:
There is a problem somewhere in either the RAM or the RAM controller logic.  I suppose it would be too easy if you had an identical machine to swap memory with, so you could see whether the problem follows the RAM?

My primary source has been the Sun Managers mailing list archives.  This person has the exact same error message:
http://www.sunmanagers.org/pipermail/sunmanagers/2001-January/000832.html
and here is the final message in the thread where the solution was revealed:
http://www.sunmanagers.org/pipermail/sunmanagers/2001-January/000872.html

A couple of other posters on that list had similar problems, all of which were solved by replacing the one offending stick explicitly named in the text of the error message, hence my original diagnosis.  Only the one quoted above had to replace all of his memory.

I realize that memory for Sun machines isn't exactly cheap to replace.  Is it still under warranty?

Good luck!
magarity
