Enterprise 450: How to Find the Failed Disk?

huffmana used Ask the Experts™
Hi Everyone, I have a failed disk in an Enterprise 450 (it has 20 SCSI HDD slots).  I presently have 12 disks set up as mirrors of concatenated disks:

c0t0d0s7      System Disk i       Mirror *
c0t1d0s0      Concatenated 3      Mirror @
c0t2d0s0      Concatenated 2      Mirror =
c0t3d0s0      Concatenated 1      Mirror =
c2t0d0s0      Concatenated 1      Mirror =
c2t1d0s0      Concatenated 4      Mirror @
c2t2d0s0      Concatenated 3      Mirror @
c2t3d0s0      Concatenated 2      Mirror =
c3t0d0s7      System Disk ii      Mirror *
c3t1d0s0      Concatenated 2      Mirror =
c3t2d0s0      Concatenated 4      Mirror @
c3t3d0s0      Concatenated 1      Mirror =

The disk chassis has a labeled number for each disk slot.  The disks are using slots 0 to 11.  12 to 19 are empty.

The disk at c0t2d0 is reported with "Invoke: metareplace d9 c0t2d0s7 <new device>".  This is a production server, so I cannot do much experimentation to find the disk (besides, all offline work has to be done on the weekend :-( ).

My problem is determining which disk in the HDD chassis is the failed one.  I can't find a slot numbering diagram for the Enterprise 450 (Part Number 600-6363-01) that correlates the SCSI bus and SCSI drop to the HDD chassis slot number.

Anyone have any ideas?

Here is the applicable output of metastat -i:

d9: Mirror
    Submirror 0: d19
      State: Okay
    Submirror 1: d29
      State: Needs maintenance
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 214496640 blocks (102 GB)

d19: Submirror of d9
    State: Okay
    Size: 214496640 blocks (102 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c2t0d0s7          0     No            Okay   Yes
    Stripe 1:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c3t3d0s7          0     No            Okay   Yes
    Stripe 2:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t3d0s7          0     No            Okay   Yes

d29: Submirror of d9
    State: Needs maintenance
    Invoke: metareplace d9 c0t2d0s7 <new device>
    Size: 214496640 blocks (102 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c2t3d0s7          0     No            Okay   Yes
    Stripe 1:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c3t1d0s7          0     No            Okay   Yes
    Stripe 2:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t2d0s7          0     No     Maintenance   Yes
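
For anyone who wants to script this check, here is a minimal sketch that pulls the component in the Maintenance state out of metastat output like the listing above. The function name failed_components is made up for illustration, and the pattern assumes the column layout shown in this thread; on older Solaris releases nawk may be needed in place of awk.

```shell
# Hypothetical helper: print every cXtYdZsN component whose line
# contains the word "Maintenance" (assumes the metastat layout above).
failed_components() {
    awk '$1 ~ /^c[0-9]+t[0-9]+d[0-9]+s[0-9]+$/ && /Maintenance/ { print $1 }'
}

# Sample input copied from the d29 submirror listing above.
failed_components <<'EOF'
d29: Submirror of d9
    State: Needs maintenance
    Invoke: metareplace d9 c0t2d0s7 <new device>
        c3t1d0s7          0     No            Okay   Yes
        c0t2d0s7          0     No     Maintenance   Yes
EOF
# prints: c0t2d0s7
```

On the live system the same function would read from a pipe, for example metastat d9 | failed_components.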

should give you the mapping from slot no to pci location and you can marry that with an
ls -l /dev/rdsk/c0t0d0s7
Have a look at the following doc, pay attention to the diagram, to see if you can
locate the HD:

Found a method here:
ls -l /dev/rdsk/c0t2d0s7
then eg
lrwxrwxrwx   1 root     root          45 Aug 28  2002 /dev/rdsk/c0t2d0s7 -> ../.

then type
prtconf -vp |grep pci@1f,4000/scsi@3/disk@0
Note we change sd@0 to disk@0
and look for the slot number
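
The sd@ to disk@ rewrite described above can be sketched as a small helper. The function name devpath_to_prtconf_pattern is hypothetical, and it only handles sd nodes of the form seen in this thread; whether prtconf -vp actually reports a slot for the match depends on the machine.

```shell
# Hypothetical helper: turn the symlink target shown by
# "ls -l /dev/rdsk/cXtYdZsN" into a pattern for grepping prtconf -vp.
# Steps: drop any leading ../../devices/ prefix, drop the leading slash,
# drop the ":a,raw" minor-name suffix, and rewrite sd@T,L as disk@T.
devpath_to_prtconf_pattern() {
    printf '%s\n' "$1" | sed \
        -e 's|^.*/devices/||' \
        -e 's|^/||' \
        -e 's|:.*$||' \
        -e 's|/sd@\([0-9a-f]*\),[0-9a-f]*$|/disk@\1|'
}

# Example with a path taken from this thread:
devpath_to_prtconf_pattern '/pci@1f,4000/scsi@3/sd@2,0:a,raw'
# prints: pci@1f,4000/scsi@3/disk@2

# On the server itself the result would be fed to grep, e.g.
#   prtconf -vp | grep "$(devpath_to_prtconf_pattern "$target")"
# where $target is the link target reported by ls -l.
```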

As a double check, run something disk-intensive like "find" on each file system and note which disks are being used. You can often work out the pattern of disk allocation from that.
huffmana (System Admin and Network Engineer)


Hi Everyone,

The link you sent gave an "Internal Server Error: The server encountered an internal error or misconfiguration and was unable to complete your request."  But I used ls -l to get the following:

Question.  I thought that the c in c0t0d0s7 was the SCSI bus number...  That is why I planned the mirrors to always be across different SCSI buses.  But the listing seems to say that I've used scsi@3 and scsi@2.  What is scsi@2,1?

# ls -l /dev/rdsk/c0t0d0s7


Then I tried the Sun reference that yuzh sent me and Sun says to look at the output of prtconf -vp.  But, unfortunately, prtconf -vp gives only the following for slot assignments (nothing more - I did a vi on the output):
 pci-slot-skip-list:  'none'

Question: So is this a dead end unless there is a way to fill in the pci-slot-skip-list....?

Question: Why is there a reference to PCI ?  Are the disk scsi busses connected to the PCI bus?

I will be trying the find command on the respective partitions but the failed disk is in a group of 6 disks (3 concatenated disks per submirror).  And the partition is only 49% full.  It may not reach the 3rd disk.

Question: will I still see activity on a concatenated disk if there is no data stored on it?

Thanks everyone for your help.  I really appreciate it!  Allan :-)

huffmana (System Admin and Network Engineer)


Find Command Results:
Slots 6 and 10 are the system disks (they are always blinking).  Slot 4 blinks like crazy when I execute a find command on the vol1 partition.  So the answer is: no, not all the disks are being accessed during the find command.

What can I do next without taking the system offline?

prtconf -vp |grep pci@1f,4000/scsi@3/disk@0 |grep slot
will give u the slot number of c0t0d0s7
huffmana (System Admin and Network Engineer)


Slot 6 or 10   /dev/rdsk/c0t0d0s7->/pci@1f,4000/scsi@3/sd@0,0:h,raw
vol1           /dev/rdsk/c0t2d0s0->/pci@1f,4000/scsi@3/sd@2,0:a,raw
vol1           /dev/rdsk/c0t3d0s0->/pci@1f,4000/scsi@3/sd@3,0:a,raw
Slot 4 or      /dev/rdsk/c2t0d0s0->/pci@6,4000/scsi@2/sd@0,0:a,raw
Slot 4         /dev/rdsk/c2t3d0s0->/pci@6,4000/scsi@2/sd@3,0:a,raw
Slot 6 or 10   /dev/rdsk/c3t0d0s7->/pci@6,4000/scsi@2,1/sd@0,0:h,raw
vol1           /dev/rdsk/c3t1d0s0->/pci@6,4000/scsi@2,1/sd@1,0:a,raw
vol1           /dev/rdsk/c3t3d0s0->/pci@6,4000/scsi@2,1/sd@3,0:a,raw

I think that this is a summary up to now.  Based on the metastat data:

d19: Submirror of d9
    Stripe 0:         c2t0d0s7          0     No            Okay   Yes
    Stripe 1:        c3t3d0s7          0     No            Okay   Yes
    Stripe 2:        c0t3d0s7          0     No            Okay   Yes
d29: Submirror of d9
    State: Needs maintenance
    Invoke: metareplace d9 c0t2d0s7 <new device>
    Stripe 0:        c2t3d0s7          0     No            Okay   Yes
    Stripe 1:        c3t1d0s7          0     No            Okay   Yes
    Stripe 2:        c0t2d0s7          0     No     Maintenance   Yes
Using the dd command you have some options:

1) cmd> dd if=/dev/rdsk/c0t2d0s2 of=/dev/null
        break out of the command at any time

        If the disk is semi-functional, the HDD LED will light up.

2) cmd> dd if=/dev/rdsk/c0t1d0s2 of=/dev/null
        break out of the command at any time
    cmd> dd if=/dev/rdsk/c0t3d0s2 of=/dev/null
        break out of the command at any time

        If the failed HDD is "TOO BROKEN" to light its LED, the commands above will light the LEDs of the HDDs around it. You can of course do this on all drives for the ultimate process of elimination.
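
A rough sketch of the same idea as a reusable function, in case there are many drives to walk through. The name blink_disk, the DRY_RUN switch, and the bs/count values are my own assumptions and not from the post above; a fixed count simply bounds the read so there is no need to break out by hand. With DRY_RUN left at its default of 1 the function only prints the dd command it would run.

```shell
# Hypothetical helper: read a bounded amount from slice 2 of a disk so
# its activity LED lights up. Reads only; output goes to /dev/null.
#   $1 = disk name without slice, e.g. c0t2d0
#   $2 = number of 64k blocks to read (assumed default: 20000)
blink_disk() {
    disk=$1
    count=${2:-20000}
    cmd="dd if=/dev/rdsk/${disk}s2 of=/dev/null bs=64k count=${count}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"        # dry run: show the command only
    else
        $cmd               # real run: start the read
    fi
}

# Dry-run example for the suspect disk and its two neighbours:
for d in c0t1d0 c0t2d0 c0t3d0; do
    blink_disk "$d" 100
done
```

Setting DRY_RUN=0 before calling it would perform the reads for real, one disk at a time, so each LED can be watched in turn.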
Q: What is scsi@2,1?

If you have a dual-controller PCI SCSI card in one of the slots, you will see both scsi@2 and scsi@2,1, each representing one of the controllers.
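
To see which controller a given device path goes through, the scsi@ node can be split into its two numbers. This is only a sketch: the function name scsi_unit is hypothetical, and labelling the two numbers as PCI device and function follows the usual OpenBoot PCI naming, which this thread does not spell out, so treat that labelling as an assumption.

```shell
# Hypothetical helper: extract the unit address of the scsi@ node from
# a physical device path. A missing second number is reported as 0.
scsi_unit() {
    node=${1#*/scsi@}      # drop everything up to and including "/scsi@"
    node=${node%%/*}       # keep only this path component, e.g. "2,1"
    dev=${node%%,*}        # number before the comma
    case $node in
        *,*) fn=${node#*,} ;;   # number after the comma, if present
        *)   fn=0 ;;
    esac
    echo "device=$dev function=$fn"
}

# Paths taken from the ls -l listings in this thread:
scsi_unit '/pci@6,4000/scsi@2/sd@0,0:a,raw'
scsi_unit '/pci@6,4000/scsi@2,1/sd@0,0:h,raw'
```

The two sample paths differ only in the second number, which matches the description above of two controllers on one card.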

Question: Why is there a reference to PCI ?  Are the disk scsi busses connected to the PCI bus?

PCI in these device names refers to the PCI-type I/O boards, where all these slots are placed. This is how the full board is connected to the system; your scsi/isp/hme/qfe/glm cards sit within this board, and those cards are then connected to your disks.

Now if u go one by one.

slot 10 refers to the card placed within the board, not the actual slot number for the disk. so in slot 10 u will see a wire connected to some disks.

The device path /pci@1f,4000/scsi@3 may be reported. This device path references the disk controller built onto the system board that controls the first four internal disk slots in a Sun Enterprise 450 - the bottom four slots.

so these are internal disk slots.
vol1          /dev/rdsk/c0t2d0s0->/pci@1f,4000/scsi@3/sd@2,0:a,raw
vol1          /dev/rdsk/c0t3d0s0->/pci@1f,4000/scsi@3/sd@3,0:a,raw

these next 4 disks are connected to card in slot 3.

sorry for the unfinished comment above, network problem.
anyway, so u will see some disk storage connected to the slot 3 card.

sd@0,0 will tell u the target; check your disk storage and u can get the target numbers and map all the disks to individual devices.
same thing for the first 4 internal disks also.
hope this help.
Here's a better doc for identifying the Faulty Disk Drive:

Identifying the Faulty Disk Drive
Disk errors may be reported in a number of different ways. Often you can find messages about failing or failed disks in your system console. This information is also logged in the /usr/adm/messages file(s). These error messages typically refer to a failed disk drive by its UNIX physical device name (such as /devices/pci@6,4000/scsi@4,1/sd@3,0) and its UNIX device instance name (such as sd14). In some cases, a faulty disk may be identified by its UNIX logical device name, such as c2t3d0. In addition, some applications may report a disk slot number (0 through 19) or activate an LED located next to the disk drive itself (see Figure 3–3).




Mapping Between Logical and Physical Device Names :

What is the output of:
prtconf -vp |grep pci@1f,4000/scsi@3/disk@2 |grep slot
huffmana (System Admin and Network Engineer)


dd command did it!! :-)  The mapping looks like this:

Slot 0    c0t0d0s7->/pci@1f,4000/scsi@3/sd@0,0:h,raw
Slot 1    c0t1d0s0->/pci@1f,4000/scsi@3/sd@1,0:a,raw
Slot 2    c0t2d0s0->/pci@1f,4000/scsi@3/sd@2,0:a,raw
Slot 3    c0t3d0s0->/pci@1f,4000/scsi@3/sd@3,0:a,raw
Slot 4    c2t0d0s0->/pci@6,4000/scsi@2/sd@0,0:a,raw
Slot 5    c2t1d0s0->/pci@6,4000/scsi@2/sd@1,0:a,raw
Slot 6    c2t2d0s0->/pci@6,4000/scsi@2/sd@2,0:a,raw
Slot 7    c2t3d0s0->/pci@6,4000/scsi@2/sd@3,0:a,raw
Slot 8    c3t0d0s7->/pci@6,4000/scsi@2,1/sd@0,0:h,raw
Slot 9    c3t1d0s0->/pci@6,4000/scsi@2,1/sd@1,0:a,raw
Slot 10   c3t2d0s0->/pci@6,4000/scsi@2,1/sd@2,0:a,raw
Slot 11   c3t3d0s0->/pci@6,4000/scsi@2,1/sd@3,0:a,raw
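
The mapping above follows a simple pattern (c0 covers slots 0 to 3, c2 covers 4 to 7, c3 covers 8 to 11), so it can be written down as a small lookup for next time. This is a sketch for this particular server only: the controller numbers come from the table above and may differ on another Enterprise 450, and the function name disk_to_slot is made up for illustration.

```shell
# Hypothetical helper: map a cXtYdZ[sN] name to the chassis slot number,
# using the controller-to-slot bases observed on this one server.
disk_to_slot() {
    ctrl=${1#c}; ctrl=${ctrl%%t*}    # controller number between c and t
    tgt=${1#*t};  tgt=${tgt%%d*}     # target number between t and d
    case $ctrl in
        0) base=0 ;;   # c0 -> slots 0-3
        2) base=4 ;;   # c2 -> slots 4-7
        3) base=8 ;;   # c3 -> slots 8-11
        *) echo "unknown controller c$ctrl" >&2; return 1 ;;
    esac
    echo $((base + tgt))
}

disk_to_slot c0t2d0s7   # prints: 2
disk_to_slot c3t1d0s0   # prints: 9
```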

Note 1: When the dd command runs the light turns OFF.  Also, slot 0 turned off for a short time and then turned right back on.  The terminal shows that the process stopped as follows:
# dd if=/dev/rdsk/c0t0d0s7 of=/dev/null
40320+0 records in
40320+0 records out

Since it is not the failed disk (c0t2d0) I assume that it is the active system disk and did not like being scanned....

Note 2: Since the "failed disk" acted the same as all the other disks I'm wondering if it is really failed or just needs to be reseated or something.  Perhaps tomorrow evening I can stay late and remove it and put it back in....
sweeeeeet...  one of my 450's did this about 4 months ago... kinda a pain

because of issues like this, SUN added some handiness to openboot

you can use "disk-led-assoc" with setenv at the openboot prompt to map disks and controllers to "led flags" in the prtdiag display.  It is something you have to take the time to setup BEFORE a problem arises, but will save you time when this happens again.

for info on this: search for "Enterprise 450 diag" on docs.sun.com
huffmana (System Admin and Network Engineer)


Another way that I found is #format>analyze>test
The light blinks very quickly on the selected disk :-)

The dd command was the one that solved my problem but everyone was very helpful!  Is it OK if I split the points with
40% stewbeast
20% yuzh
20% shivsa
10% lidder
10% glassd

I don't know this experts-exchange GUI very well and may make a mistake allocating points.  I also don't know how to close a question.  I'll have to do some reading... it would be good to accept multiple answers (several of the methods could have worked).

I should like to give everyone who helped a million points....
