Indyrb asked:

Sun Solaris v210 Sunfire question

I am completely new to Sun (Solaris).

The person who supported these servers left, and I have issues I need to fix.
The problem is that I have Windows experience but not much *nix, and on top of that it is SPARC hardware.

So I put in the Solaris 10 CD/DVD and powered it up.
It booted into this thing that said (ok).

I'm not quite sure what to do or how to configure it.
I typed boot cdrom and it started giving funky gibberish text.

Can someone please assist with setting up this Sun Fire V210 with Solaris? I would greatly appreciate it.
Avatar of Pepe2323
Pepe2323

What are you looking to do with that server?

If you want to reinstall it, then yes: at the ok prompt (OpenBoot), type

ok boot cdrom

and the install process will begin.

If you are trying to troubleshoot the server, boot into single-user mode instead:

ok boot cdrom -s

When you get the prompt, try to mount the root file system.

This link can give you an idea how to install solaris
http://forums.halcyoninc.com/archive/index.php/t-264.html
Avatar of Indyrb

ASKER

What exactly is this OpenBoot?

I found a server that was stuck at the OpenBoot command prompt (ok), and it wasn't pingable or anything, so I typed boot, and I think the OS came up. Was this the right thing to do? Why does it go into this OpenBoot thing?


Okay, back to the question above: I typed boot cdrom, like you suggested.
But it prints all these weird characters. Is this normal?
It also says something about SC Alert: PSU @ PS0 has FAULTED


Rebooting with command: boot cdrom                                    
Boot device: /pci@1e,600000/ide@d/cdrom@0,0:f  File and args:
[several screens of garbled, non-printable console output]
SC Alert: PSU @ PS0 has FAULTED.
[garbled output continues]
Are you connected through a remote console, or are you in front of the server?

If you are used to PC hardware, you are used to interacting with its BIOS. SPARC computers have OpenBoot instead. It may look like your BIOS, but it is actually far more powerful. OpenBoot performs the following tasks:

    Testing and initializing the hardware.
    Starting the operating system.
    Giving you access to a set of tools to program and debug it.

SC Alert: PSU @ PS0 has FAULTED. --> means you have a power supply issue.

Those weird characters suggest to me that the serial connection to the server is not properly configured.
Avatar of Indyrb

ASKER

I am using a KVM like management system that is connected to the Serial MGMT port of the v210 Sunfire...

When connected, I see all the text just fine

{1} ok boot cdrom

SC Alert: Host System has Reset
^QProbing system devices
Probing memory
Probing I/O buses
Probing system devices
Probing memory
Probing I/O buses


Sun Fire V210, No Keyboard
Copyright 2010 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.30.4.a, 4096 MB memory installed, Serial #123456789.
Ethernet address 0:1:ab:1a:1a:11, Host ID: 123a1b11.

Rebooting with command: boot cdrom                                    
Boot device: /pci@1e,600000/ide@d/cdrom@0,0:f  File and args:


Then all that gibberish starts, with a few SC Alert: PSU @ PS0 has FAULTED messages.

What should I do next? How do you break out of it, and what is it supposed to look like?
Checking the Sun Fire V210 specs, I found that this server has only one power supply. That means you need to replace it before you start thinking about reinstalling or troubleshooting.

You might need to open a case with Oracle to get the replacement, or look for one from another vendor.
Avatar of Indyrb

ASKER

Thank you. Is there a way to pull the service tag, make/model, and serial number from the (ok) prompt, without going to the physical server?

Also, the power supply is different from what I am used to. On other servers you can hot swap it or pull it out; this one appears to be mounted inside the V210. Does it require Oracle support, or can I swap it with one from another V210? If so, what are the steps?

How can I run diags to make sure everything else in the system is good? test-all?
I want to make sure the hard drives are okay.

And I am a little curious: if the power supply was bad, wouldn't the server just not come on?
It's kind of weird that it would make the installer output gibberish.
Avatar of arnold
I do not believe the V210 is still supported. Yes, you can replace the PS with one from any other V210.

If the system boots, you can get its info. I think 123a1b11 is your system's host ID.
Check whether your system is still under a support contract.
If it is, they may ship you a replacement PS.
As far as replacement goes, it is the same process as for any other non-hot-swap part: open the case, etc.

Depending on what you need, if you have a functional V210, an option could be to move the hot-swap drives into it.
http://docs.oracle.com/cd/E19088-01/v210.srvr/819-4207-10/819-4207-10.pdf
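
On pulling the system info without walking to the rack: the OpenBoot banner already prints the model, firmware version, memory, MAC address, and host ID. Here is a minimal Python sketch, meant to be run on a workstation against saved console text and not on the server itself. The function name parse_banner and the dictionary keys are made up for illustration, and the sample text is the banner posted in this thread. Keep in mind that the Serial # shown in the banner is not necessarily the chassis serial number printed on the sticker.

```python
import re

# Sample banner text, copied from the console output posted in this thread.
BANNER = """\
Sun Fire V210, No Keyboard
OpenBoot 4.30.4.a, 4096 MB memory installed, Serial #123456789.
Ethernet address 0:1:ab:1a:1a:11, Host ID: 123a1b11.
"""


def parse_banner(text):
    """Extract model, OpenBoot version, memory, MAC and host ID from a banner.

    parse_banner and its keys are hypothetical names for this sketch.
    Fields that cannot be found are simply left out of the result.
    """
    info = {}
    lines = text.splitlines()
    if lines:
        # The first line looks like "Sun Fire V210, No Keyboard".
        info["model"] = lines[0].split(",")[0].strip()

    m = re.search(r"OpenBoot (\S+), (\d+) MB memory installed, Serial #(\d+)", text)
    if m:
        info["openboot"] = m.group(1)
        info["memory_mb"] = int(m.group(2))
        info["banner_serial"] = m.group(3)

    m = re.search(r"Ethernet address ([0-9a-fA-F:]+), Host ID: ([0-9a-zA-Z]+)", text)
    if m:
        info["mac"] = m.group(1)
        info["host_id"] = m.group(2)
    return info


if __name__ == "__main__":
    for key, value in parse_banner(BANNER).items():
        print(key, "=", value)
```

Feeding it the saved console log of each server gives one row per box for an inventory sheet.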
Avatar of Indyrb

ASKER

Like I mentioned, the previous employee no longer works here and my Solaris experience is limited, but I did find out what they need.

First, the system needs to be brought back online and its old database backed up.

This is all running solaris

Somehow during this process I thought a reinstall was needed, which took me down the path above...

But all I need to do is get the system up, do a backup, and move on.

However, this original server was off the network for security reasons, and the last backup was in 2010.

So now I am trying to figure out how to bring this server online, log on, and either install the backup client or do a backup.

I took out the disc and booted to the system.
It is at the logon prompt, but no one knows the root password.

It was also indicated that this server has a corrupt OS: bad magic number.

Is there a way to do a repair, or an install, without removing applications and configuration?

Also, how do I get it back on the network?

How do I install the client and back up the database? NetBackup?

And where do I put the backup? How do I get it to back up to a Windows file share?


I was also trying to get the firmware, ALOM, and OBP updates but can't. My Oracle Support asks for a CSI, and no one here knows the number.


But all these systems have various errors: PSU @ PS0
A bad magic number usually points to a failed disk.
At the ok prompt, run probe-scsi, but make sure you reach the ok prompt during startup. If you drop a running system to the ok prompt by sending Control-Break, probe-scsi will hang.

You can boot the system from the OS DVD. Go through the steps up to the point where it asks for the go-ahead to install; at that point, you can drop into a terminal. The difficulty is that what you need is not so simple.

The root password can be reset. You would need to mount the partition that holds /etc/shadow and /etc/passwd.
Edit /etc/shadow and remove the encrypted password in the root:password: entry, or replace it with a new encrypted password. After the next boot you will have access.
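
To make the /etc/shadow edit concrete, here is a small Python sketch of what "removing the password" means at the field level. It is only an illustration: on the real box you would make the same change with a text editor from the shell on the install media, and the sample line and its hash are invented. The function name clear_root_password is hypothetical.

```python
# Each /etc/shadow line is a colon-separated record:
#   username:encrypted_password:lastchg:min:max:warn:inactive:expire:flag
# Emptying the second field of the root record leaves root with no password.

SAMPLE_SHADOW = (
    "root:abcDEF123xyz:6445::::::\n"   # invented hash, for illustration only
    "daemon:NP:6445::::::\n"
)


def clear_root_password(shadow_text):
    """Return the shadow text with root's password field emptied.

    Hypothetical helper for this sketch; all other lines pass through
    unchanged.
    """
    out = []
    for line in shadow_text.splitlines():
        fields = line.split(":")
        if fields[0] == "root" and len(fields) > 1:
            fields[1] = ""  # blank the encrypted password
        out.append(":".join(fields))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(clear_root_password(SAMPLE_SHADOW), end="")
```

After the edit the root line reads root::6445::::::, and a new password should be set with passwd right after the first login.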

The PSU alert might be because the PS overheats or its cooling fan is not reporting.

Once the password is changed and the system is booted, you can log in.

You have to do it one step at a time. The first step is to gain access to the system once the OS boots.
Avatar of Indyrb

ASKER

Thanks for the reply.

I will boot the system again, try probe-scsi, and give you the results.

But booting from the DVD (boot cd-rom) gives all the gibberish. I tried a few discs and the same thing happens.

On resetting the root password: how do you mount the partition with /etc/shadow and /etc/passwd without knowing a user and being able to get into the machine? I use a Windows box with a serial terminal via the SERIAL MGMT port on a Cyclades. The server doesn't appear to be on the network. I get to the OpenBoot ok prompt, and it boots to the OS and asks for a logon ID, but that's it. I can't seem to boot from the DVD drive. Is there anything else I can try?
You have to boot the system from the OS CD/DVD and drop into a shell.
It is similar to the repair console in Windows.

Which KVM are you using? It should mimic the graphical interface. See whether you can change the parameters/settings for your display.
Avatar of Indyrb

ASKER

The KVM is a Cyclades ACS16

(Avocent / Emerson Network Power).



Remember this is connected via the Serial MGMT port, as no network connections are installed at the moment.


Port Settings on Cyclades for MGMT


Connection Protocol Console SSH
Baud Rate (Kbps) 9600
Flow Control  None
Data 8
Parity None
Stop Bits 1
DCD State  Disregard

No Data Buffering
No power Management
No IPMI

TCP Port 7007
TCP Keep-alive 1000
idle Timeout (min) 0
Break Interval 500
Break Sequence ~Break
Terminal Type vt100


Cyclade Version:

Linux version 2.6.22
(gcc version 3.3.1 (MontaVista 3.3.1-3.0.10.0300532 2003-12-24)) #1
Tue Aug 30 00:21:36 PDT 2011
Cyclades-ACS16-Linux V_3.3.0-10 (Aug/30/11)
Any chance, you can post an image of what the screen looks like?
The terminal emulator needs to be vt100, try changing the baud rate after connecting to a higher rate.
Avatar of Indyrb

ASKER

Yeah, I had it set to vt100 and connected to the console. It was sitting at the login prompt, and I don't know the password. No one does. But I clicked Send Break and it dumped me to the {1} ok prompt.

I typed in probe-scsi as requested.
It said: please type reset-all to reset the system before executing this command.
Do you wish to continue (y/n)

I was uncertain what any of these commands did, so I typed help diag, which showed probe-scsi and probe-scsi-all along with test-all and test (component), but I am not sure what reset-all does. It doesn't wipe or delete anything, right?

Also, do you know what the latest firmware (OBP, PROM, ALOM, etc.) is for the SPARC Sun Fire V210?

Also, this OpenBoot shows a serial number, but it's not the same as the serial number on the back of the server (FM...).

I typed printenv.
and it shows
ttyb-mode and ttya-mode   as 9600,8,n,1,-
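
Since garbled output often comes from mismatched serial settings, it may help to compare that mode string field by field with the Cyclades port settings listed earlier. The sketch below assumes the usual OpenBoot layout of baud,data bits,parity,stop bits,handshake, where "-" means no handshake. The names parse_tty_mode and mismatches are hypothetical, and I have not confirmed that ttya-mode is the variable that governs the SERIAL MGMT port on this box.

```python
# Lookup tables for the single-letter codes used in the mode string.
PARITY = {"n": "none", "e": "even", "o": "odd", "m": "mark", "s": "space"}
HANDSHAKE = {"-": "none", "h": "hardware", "s": "software"}

# Port settings as posted for the Cyclades port (9600 baud, 8 data bits,
# no parity, 1 stop bit, no flow control).
CYCLADES_PORT = {
    "baud": 9600,
    "data_bits": 8,
    "parity": "none",
    "stop_bits": "1",
    "handshake": "none",
}


def parse_tty_mode(mode):
    """Split an OpenBoot mode string such as '9600,8,n,1,-' into named fields."""
    baud, data_bits, parity, stop_bits, handshake = mode.strip().split(",")
    return {
        "baud": int(baud),
        "data_bits": int(data_bits),
        "parity": PARITY[parity],
        "stop_bits": stop_bits,  # kept as text because 1.5 is a legal value
        "handshake": HANDSHAKE[handshake],
    }


def mismatches(mode, expected):
    """Return the names of the settings that differ from the expected ones."""
    parsed = parse_tty_mode(mode)
    return [key for key in expected if parsed.get(key) != expected[key]]


if __name__ == "__main__":
    print(mismatches("9600,8,n,1,-", CYCLADES_PORT))
```

With the values posted here the list of mismatches comes back empty, which would point away from a simple baud or parity mismatch and toward the disc, the drive, or the terminal emulation.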

How do you reset the root password from here?

I am also trying to get you the PROM, OBP, etc. versions, but I'm not seeing them.


Back to probe-scsi: I went ahead and pressed Y to continue.

It keeps saying Bus Fault, line after line after line.

Send Break, Control-Space... nothing stops this continuous Bus Fault error.

See the screenshot. I guess I can go downstairs and reboot,
but I still need some guidance, please. Thank you : )
busfault.jpg
You cannot run probe-scsi after dropping from a running OS to the ok prompt.
reset-all clears the system registers so that probe-scsi can run.

http://docs.oracle.com/cd/E19683-01/817-3814/6mjcp0qhe/index.html

Bus fault points to a failed device.

In order to reset the root password, you must boot the system from CDROM
http://docs.oracle.com/cd/E19253-01/817-0403/tsgeneral-18/index.html
Avatar of Indyrb

ASKER

I ran setenv auto-boot? false per doc.

Then I typed reset-all

Gave output:
Sun Fire V210, No Keyboard
Copyright 1998-2003 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.11.4, 4096 MB memory installed, Serial #123456789.
Ethernet address 0:1:ab:23:c4:56, Host ID: 09876a543.

Then it went back to the {1} ok prompt.
I typed probe-scsi
and probe-scsi-all

{1} ok probe-scsi-all

Output below:

/pci@1c,600000/scsi@2,1
/pci@1c,600000/scsi@2

Target 0
Unit 0   Disk     SEAGATE ST3300007LC     0003

Target 1
Unit 0   Disk     SEAGATE ST3300007LC     0003

Then ran test-all

{1} ok test-all
Testing /pci@1d,700000/network@2,1
Testing /pci@1d,700000/network@2
Testing /pci@1c,600000/scsi@2,1
Testing /pci@1c,600000/scsi@2
Testing /pci@1e,600000/ide@d
Testing /pci@1e,600000/isa@7/rmc-comm@0,3e8
Testing /pci@1e,600000/isa@7/serial@0,2e8
Testing /pci@1e,600000/isa@7/serial@0,3f8
Testing /pci@1e,600000/isa@7/i2c@0,320
Testing /pci@1e,600000/isa@7/rtc@0,70
Testing /pci@1e,600000/isa@7/flashprom@2,0
Testing /pci@1f,700000/network@2,1
Testing /pci@1f,700000/network@2

Trying to boot from the CD-ROM,
I typed boot cd-rom

It gave SC Alert: PSU @ PS0 has FAULTED. How important is this error, and what issues can it cause? The server has only one power supply, so I thought that if it failed the server just wouldn't work.



Rebooting with command: boot cd-rom                                  
Boot device: /pci@1c,600000/scsi@2/disk@0,0:a  File and args: cd-rom
boot: cannot open cd-rom
Enter filename [cd-rom]:

I retyped it as boot cdrom, with cdrom as one word.

Rebooting with command: boot cdrom                                    
Boot device: /pci@1e,600000/ide@d/cdrom@0,0:f  File and args:

A spinning | cursor thingy.
It doesn't appear to be doing anything (screenshot attached).
boot-cdrom.jpg
Avatar of Indyrb

ASKER

another error:

SC Alert: CPU_FAN @ MB.P0.F0.RS has FAILED.


Can I rob these parts from a working chassis? Which part is MB.P0.F0.RS?
If you have a working chassis, it is  simpler to move the drives.
Boot from cd, update/change password. Let the system boot, login.
Avatar of Indyrb

ASKER

I had one v210 that had a green light on power supply, and one that had an amber light on the power supply on power up....

I'm not sure whether both reported PS0 errors, but I put the green (good) power supply into the other chassis, and now it is amber too.
Put it back, and move the drives instead, matching bay for bay.

Is it reporting PS0 issues?
Avatar of Indyrb

ASKER

I did the same thing. Perhaps both chassis have an issue. Is there a way to clear all events and retest? I assume it's reset-all, but I wanted to make sure. The OS boots up, but I don't have the root password, and booting from the CD-ROM does its spinny thing. Maybe it's the Cyclades terminal that is not presenting the output right. I don't know. Thoughts?
The below link discusses and offers the different diagnostic settings for post.

http://docs.oracle.com/cd/E19088-01/v210.srvr/819-4208-10/diags.html
Avatar of Indyrb

ASKER

thanks for all the feed back

Does the Sun Fire V210 have compatibility issues with certain DVD/CD types?

I tried a DVD+R and wasn't sure whether that works with the SPARC system.


ran
{1} ok
setenv diag-switch? true
setenv diag-level max
setenv diag-script all


setenv diag-trigger user-reset
setenv diag-trigger all-resets
setenv auto-boot? false -- so it doesn't boot to OS

reset-all

Also typed obdiag
setenv test-args bist, debug, iopath, loopback, media, restore, silent, subtests, verbose, callers=0, errors=0

test-all

The summary said errors: 544 of 544 test, failed 6. It didn't give specifics, unless I am blind, and when I hit Enter it went back to the menu. It still gave the PS0 error,
even after I moved the power supply back.

Maybe it's a board or something. I don't know.
Since you disabled boot, it has nowhere else to go.
The issue might be with the drive.

Did you burn the ISO onto the DVD?
Depending on how you transferred the ISO onto the disk, it might be that the DVD is not seen as bootable.
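
One thing that can be ruled out cheaply before burning another disc is a damaged download. If a checksum is published alongside the ISO (an assumption on my part; use whichever algorithm the download page lists), a few lines of Python on the Windows workstation will compute the digest for comparison. The function name file_digest is made up for this sketch, and the file path in the example is a placeholder.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in chunks so large ISOs fit in memory.

    `algorithm` is any name accepted by hashlib.new(), e.g. "md5" or "sha256".
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys

    # Usage (placeholder path): python digest.py C:\path\to\image.iso sha256
    iso_path = sys.argv[1]
    algo = sys.argv[2] if len(sys.argv) > 2 else "sha256"
    print(file_digest(iso_path, algo))
```

A matching digest only proves the download is intact. It says nothing about whether the burn itself succeeded or whether the drive can read the media.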
Avatar of Indyrb

ASKER

To answer your question,
I downloaded the SPARC ISO from Oracle, since I have SPARC V210 servers.

I right-clicked the ISO and selected Burn disc image (or Burn image to disc,
something along those lines).
Note this was from my Windows 7 Pro computer.
I have done this several times with bootable Windows ISOs, but this was the first time with SPARC Solaris images.

Like I said, I am a complete newbie with Solaris, so I thank you a great deal for your assistance and timely responses.
Experts like you are why I invest in and subscribe to EE every year.

Anyway, it looks like it burned okay, but I am not 100% certain.
I also tried other CD-Rs and DVDs with Sun SPARC images that I found floating around;
those were burnt by other people.
All the previous Solaris admins and users no longer work here, so there are no diags and no one to support these systems other than me. Oh, and no documentation either. Lucky me.
So again, thank you for your assistance.

The scope:
I have about 6-7 servers with various issues.
some hardware and some software, and all new to me...
The more I dig, the more I find, and then some are uncovered in the depths of hell located in some hidden closet, and I am learning as I go
Usually user complaints and loss of functionality send me down the rabbit hole, and from what I have seen (amber lights), it's a mess.

The info you provided has definitely got me going in the right direction, which I can use for all the various systems... Thank you : )

I'll try to limit my questions to the proper commands and to understanding how to diagnose system health, performance, and integrity.

I will try not to send you all the logs and have you read them and say, yep, it's XYZ hardware. Instead I'm after just a bit more insight, though I may use a couple of examples.

I assume some of these servers may be under some sort of support agreement, and I know others are not. I am on the hunt for who supports them, either the vendor or a third party. So far it's a huge feat, as no one knows anything.

Either way, I will have to do some diags and testing before calling support, so they can identify which parts need replacing based on the commands and troubleshooting steps I ran to rule out other potential problems.

Like I said, the commands you provided gave me a good launching point, and I'm underway.

Just a few final questions please if you don't mind...

Obviously, if it boots directly to OpenBoot because of the auto-boot? setting or whatever,

you can run the tests mentioned in the previous post directly from OpenBoot.

I'm curious which of the OpenBoot ones is the most verbose and covers everything.
There were a few methods in the doc, a few types, and different ways.
I am looking for the most complete one, and best practice, when fighting the unknown.

When I first went down this path, I typed help diag at the OpenBoot {1} ok prompt
and saw the test-all command.
I'm not sure how it compares to the others listed below and in the document you provided, but it didn't seem to show the CPU and fan errors in the logs or in the summary after the test was complete.
It only showed them in the terminal window, as SC Alert blah blah blah,
which gave me a little insight into the issue, along with the amber light.

Weird. Okay, moving on: running the methods you provided did, I believe, show some errors.
But it appears there are a few ways you can do this.

Some said Method 1:
{1} ok
setenv diag-switch? true
setenv diag-level max
setenv diag-script all
Power Cycle

{1} ok  another, Method 2:
setenv diag-switch? false
setenv diag-level max
setenv diag-trigger user-reset
setenv diag-trigger all-resets
reset-all

Then I typed obdiag,
then typed
test-args bist,media,debug,iopath,loopback,restore,silent,subtests,verbose,callers=0,errors=0
and then test-all

I am guessing it's all the same test-all command within OpenBoot, just extending what it tests. Just curious.

Like I said, my initial test-all didn't show errors in the summary/log output, unless I was blind.

But as mentioned, I constantly saw SC Alerts about power or fan issues in the terminal,
so I was curious why these weren't in the diag summary.

Between the OpenBoot diags and prtdiag -v in the OS, which is more complete and accurate, and which should take priority or be considered best practice? I am sure each has its use case (for example, maybe there is no OS), but what if both are available?

In one example, I have a server that boots, but nobody knows the password, and this is the one server that won't boot from the CD-R/DVD. It also keeps yelling at me about the fan and power. We must have got a bad batch of Sun Fire V210s with power and fan issues that have either always been there, with no one doing anything about them, or that all triggered at once, like the New York Times Square New Year's Eve ball drop.

I'm trying to swap more parts around, and I will also try swapping DVD-ROM drives if needed.
I will remake the bootable disc. Are there any special considerations for the SPARC ISO? Was my process okay?

But once I do boot from the DVD (fingers crossed), I assume your previous post will get me started on how to reset the root password. Let's say I am successful in this endeavor and now have access to the OS and a command prompt.

Should I run prtdiag -v in the OS, or do the OpenBoot SC alerts and diags report the same thing?

Are there other diag commands for an overall system health check, disk integrity, and OS functionality? I thought I overheard other commands, foo something, I don't know. I also want to be proactive instead of reactive and do an overall health check. I'd like to check the OS for corruption too.

I saw these in the link you provided; they should be run from the OS command line.

vi /var/adm/messages
prtdiag -v
prtconf
prtfru -l or -c
psrinfo -v
showrev

Are there others that should be run?

Next, a little understanding of the results.

/var/adm/messages on one of the servers I do have root access to
has this error:
[ID 431074 kern.error] CPU_FAN @ MB.P1.F0.RS has FAILED.

Okay, looks like another fan, but I wasn't seeing it anywhere in prtdiag -v.
Which is right? The amber light is on, so I am guessing /var/adm/messages.

I misspoke. I am a dummy:
I reran it. Apparently I was typing prtdiag /v (told you I'm a Windows guy). Sorry. Moving on.
I found this line in prtdiag -v:
MB/P1/F0             RS              failed   0 rpm  
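
With several servers throwing the same kind of alert, it might be worth copying each /var/adm/messages (or a saved console log) to a workstation and pulling out the failed components in one pass. Below is a minimal Python sketch. The regular expression is based only on the three alert lines quoted in this thread, so other alert wordings would need their own patterns, and the names find_alerts and unique_alerts are hypothetical.

```python
import re

# Matches lines such as "SC Alert: PSU @ PS0 has FAULTED." or
# "[ID 431074 kern.error] CPU_FAN @ MB.P1.F0.RS has FAILED."
ALERT_RE = re.compile(
    r"(?P<component>[A-Z_]+) @ (?P<location>\S+) has (?P<state>FAILED|FAULTED)"
)

SAMPLE_LOG = [
    "SC Alert: Host System has Reset",
    "SC Alert: PSU @ PS0 has FAULTED.",
    "[ID 431074 kern.error] CPU_FAN @ MB.P1.F0.RS has FAILED.",
    "SC Alert: CPU_FAN @ MB.P0.F0.RS has FAILED.",
    "SC Alert: PSU @ PS0 has FAULTED.",
]


def find_alerts(lines):
    """Return (component, location, state) for every matching line."""
    alerts = []
    for line in lines:
        match = ALERT_RE.search(line)
        if match:
            alerts.append(
                (match.group("component"), match.group("location"), match.group("state"))
            )
    return alerts


def unique_alerts(lines):
    """Same as find_alerts, with repeats removed and first-seen order kept."""
    seen = []
    for alert in find_alerts(lines):
        if alert not in seen:
            seen.append(alert)
    return seen


if __name__ == "__main__":
    for component, location, state in unique_alerts(SAMPLE_LOG):
        print(component, location, state)
```

The location strings can then be matched against the prtdiag -v rows, where the same fan appears with slashes instead of dots (MB/P1/F0 RS).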

Okay, this system has a CPU_FAN error. Sweet. I will call support or swap the fan with one from another server.

But I'm still wondering why OpenBoot didn't show this error, apart from the constant SC alerts on the terminal screen.

Like I said, I have several servers with amber lights, and my diagnostics are underway.

But some of them are mission critical too, and uptime is of the utmost importance. They can't be shut off or rebooted without special launch codes from the president, the stars aligning just right, and a full moon sitting in the east, and only on Tuesdays that land specifically on November 45th, whenever that comes around.

Okay, a slight exaggeration, but you get the idea: there are mission-critical machines that can't be powered down without an act of God.
However, these mission-critical machines have amber lights and errors on them.
Alright, I am sure there is a little risk management here, but for the sake of argument,
let's say I want to identify all the errors without disrupting the OS or the services it provides to clients. That lets me identify hardware issues, and then on that spectacular Tuesday, November 45th, we perform scheduled maintenance to replace the part.
I was just uncertain whether running any of the diags disrupts functionality or performance in any way.

I have root access.
I logon.
I run the commands from the attached link you provided before (thank you)

first:
vi /var/adm/messages
It shows some errors.
Next, prtdiag -v.
You get the gist.

Are there also commands to test OS integrity, data corruption, or process errors?
One person told me the system had bad magic number errors.
Where would you see this, and which diag reports it?
Is there a repair command to fix such issues?

Can you check the OS and hard drives without causing downtime on the server or any loss of functionality while the diags run?

I also noticed that once, when I was in the terminal window, logged in as root on OS 5.9, I accidentally clicked the Control-Break button.

This sent me to the OpenBoot {1} ok prompt.
How is this different from the one you get when the system boots directly to OpenBoot?

I am guessing you could run tests from here too, but are there any commands you don't want to run, since this server is supposed to be up 24x7 and available, and it is somehow still running the OS in the background even though you are at the {1} ok prompt?
I don't want to run some test that makes it reboot and go offline.

Thanks in advance, and again I appreciate all your input.
I will close this request and open a new one if more help is needed.
I will award all the points; I wish I had more to give.
But if you could kindly copy and paste my post into your reply and write your answers underneath, I would greatly appreciate it.

I vote Arnold. Hope you're in the polls November 45th.
Starting with last thing first, if you hit ctrl-break or stop-A and end up at the ok prompt, to return the system to its former state, type go.

Commands executed within the OS, e.g. prtdiag or showrev (which deals with the OS and patch/update level), do not disrupt the OS.

A bad magic number usually points to a failed or failing drive.
format
is the command to run. It lists the disks in the system using names of the form cXtYdZsN, where
c indicates the controller
   X: 0 means internal, 1 additional, 2 additional
t indicates the SCSI ID (target); these days it is set automatically by the position in the backplane
   Y: 0-6 and 8-15; 7 is reserved for the controller. With SAS the range is higher.
d indicates the disk (LUN) on that target, usually 0
s indicates the slice (partition)
   N: 0-7; slice 2 is the "backup" slice, which spans the entire range of cylinders.
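
The naming scheme above can be sketched as a small parser, which is handy when sorting through format or log output from several machines. This is a Python illustration only: the function name parse_device and the result keys are made up, and it handles just the plain cXtYdZsN form described here.

```python
import re

# cXtYdZsN: controller, target (SCSI ID), disk (LUN), slice.
DEVICE_RE = re.compile(r"^c(\d+)t(\d+)d(\d+)s(\d+)$")


def parse_device(name):
    """Split a Solaris-style disk slice name into its numeric parts.

    Raises ValueError for names that do not follow the cXtYdZsN form.
    """
    match = DEVICE_RE.match(name)
    if match is None:
        raise ValueError("not a cXtYdZsN device name: %r" % name)
    controller, target, disk, slice_no = (int(part) for part in match.groups())
    return {
        "controller": controller,
        "target": target,
        "disk": disk,
        "slice": slice_no,
        # Slice 2 conventionally covers the whole disk ("backup").
        "whole_disk": slice_no == 2,
    }


if __name__ == "__main__":
    for example in ("c0t0d0s0", "c0t1d0s2"):
        print(example, parse_device(example))
```

For example, c0t1d0s2 is controller 0, target 1, disk 0, slice 2, i.e. the whole-disk slice of the second drive.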

Since you have logins for some of the systems: are the passwords similar? You can try them on the system that boots.

Burn image to disc should work. What happens when you break out of the OS boot, run reset-all, and then
boot cdrom

What does the system do?


As far as support goes, contact Oracle with all your serial numbers in hand; that way you can check whether, and which, systems are still under support.

There are systems on which certain failing or failed components can be turned off, but I do not believe your system has any redundant components other than the drives, if they were set up that way (software RAID using metadevices).

There were versions of Red Hat (2, 3, and possibly 4) that could boot a SPARC system.
The difficulty when a RAID is set up is that it has to be reassembled (without initialization) in order to gain access to / and then the /etc/shadow file.

While not an optimal situation, this way you are forced to learn debugging and troubleshooting quickly,
as you have begun to do.

Presumably, given the lack of documentation, there is no way to determine what these systems do, did, or were used for.
Avatar of Indyrb

ASKER

A bad magic number indicates a failed drive
Format
Is the command to run which will list the disks in the system in the format cxtyd0sz where
c indicates controller
   x 0 means internal, 1 additional, 2 additional
t indicates scsi ID these days they are auto set by the position in the backplane
   y 0-6 8-15 7 is reserved for the controller with sas the range is higher.
s indicates slice partition 

   z from 0-7 excluding 2 which is the "backup" partition which spans the entire range of cylinders.



Is format the command where you identify whether there are bad magic number errors on the drive?
Where would it show or log the error so I can validate it? Is it reported in prtdiag -v too?

Is there any OS scan, repair, defrag, health check, or integrity check for the hard drives?
I know you mentioned probe-scsi, but I am talking about OS integrity, like chkdsk, sfc /scannow, DISM, or third-party apps.

There are systems that certain failing/failed components can be turned off, but I do not believe your system has any redundant components other than drives if so setup (RAID using metadata. Software)



You bring up an interesting point. I have worked with RAID before on the usual x86 servers (Dell, HP, IBM), always choosing hardware RAID over software RAID.

The hardware vendor usually has quick-start CDs that help you configure all the hardware and prep for OS installation with drivers, agents, and the necessary software.

And for those who like to set up manually, the RAID card usually has an ALT-I or similar key command to configure the card, the drives, the RAID, etc.

I saw nothing like this with the Sun. How and where do you configure RAID?

Another quick question on the drives:
As I mentioned, I replaced CPU fans, power supplies, etc., then looked at the summary, and I swear it said hard drives 1 and 2 were offline or not connected, something amiss.
It didn't boot to the OS either. Nothing.
Is there a way to see whether these are failed drives? Do you configure a setting to bring them online? Am I missing a step?

The burn image to disk should work, what happens, when you break the boot of the OS, hit reset-all
boot CDROM

What does the system do?




Yeah, I did the boot cdrom thing and that's what gave all the gibberish. Truth be known, my first attempt at the command was boot cd-rom (note the dash), which yelled, screamed, and kicked at me. SunOS servers can be so hateful; I swear it called me names and laughed hysterically.

Later I figured out it was one word: cdrom. Damn you, PBS Sesame Street, for sounding out words as if they were two words until they snuggle together into one, but only after repeating the word 15 times, saying it faster and closer together each time.
I still say CD ROM, but that's okay. I am now educated.

Nonetheless, when I type boot cdrom, I get either the funky gibberish or the spinning cursor in the remote terminal window hooked up to the serial mgmt port via the Cyclades. That's how I access the server:
through the Cyclades device's web page, with a Java applet.
That has its own woes, which I'll get to in a minute.

Now, earlier today I replaced the power supply and CPU fan on a couple of servers, and so far so good, I think. Crossing fingers. I didn't have time to retest, but some of the amber lights went away. I also swapped the DVD-ROM drive. I ran out of time and didn't get a chance to try booting from the CD-ROM again. I also found an earlier version, 9 (I was using 10), and will test that as well. There is still a lot to do and a lot to learn, but thanks to you I am making progress.
You are awesome!! If you have kids, you should have a #1 dad coffee cup too.


My thoughts: if boot cdrom still gives gibberish or doesn't work, I am guessing either my CDs or DVDs are not burned right, or viewing the install/boot-from-cdrom output is a problem through the Cyclades ACS16 or the AlterPath terminal window, and I should actually validate booting and CD/DVD functionality by going to the server room and plugging my laptop directly into the SERIAL MGMT port.

Just a thought...

As far as support, contact oracle with all your serial numbers in hand and this way you can check whether and which systems are still under support.



I captured the serial numbers, along with the make and model, of the failed Sun servers:
FM... blah blah.
If I boot to OpenBoot, or even the OS, how can I check the serial number without physically looking at the sticker? I did see something about a system ID, but the numbers didn't match. Not a big deal for local servers, but wouldn't you know it, I have remote servers I need to capture as well.

Actually, I need to go all the way and do an inventory of all the Sun servers. It would be good to check support even on the working ones, as they could go bad too, then investigate whether we have support and, if not, decide who to use, how much it would take to get support agreements underway, and which boxes need to be covered.
Having the serial numbers will help a great deal.
If only there was a way to see what each box actually does, who accesses it, when the last access was, and any connected IP addresses. Like I said, no one has any documentation. Yes, a few end users pointed out some issues here and there, but they barely know what they connect to, just that there's some issue.

I still need to hunt down the ones in deep hiding. Is there a tool that scans a subnet for all Solaris Sun servers? Come to me, and update my mass inventory list...

I will reach out to support, perhaps tomorrow, for validation, but someone told me they didn't think these were under a support or warranty agreement with Oracle, but perhaps with some third-party vendor that does hardware repairs. I'm looking into this as well.

Due to your help and guidance, you identified and validated legit hardware issues, and by moving parts around, you enabled me to resolve some issues.. YeaH!

You said something weird with Cyclades?



I almost forgot, so thanks for reminding me... So there is this Cyclades appliance. First time using this box, but it's like a KVM of some sort, without monitor or keyboard/USB dongles. It looks like a Cat5 connection runs from the SERIAL MGMT port of each Sun server into some port on the back of the Cyclades. Well, for some reason I only get communication through one port. This thing has at least 16 ports, all plugged into the Sun servers' serial mgmt ports, but only lucky port 7 allows access to the terminal window. I tried changing to RAW, SSH, console, and Terminal, but wasn't seeing much difference. I don't know if you have experience with this device or not, but I'm jotting it down anyway. Moreover, it appears that by default each port would just be called Port 1, Port 2, and so on, so you wouldn't know which port goes to which server. So someone in the past reconfigured the port names to match the server names, so that selecting a port selected the right server...
Okay, this is all messed up: the names don't match their ports. I went and looked. Server Summerbeehive is connected to port ChaChaChiapet, so it's all crazy. At this point it's kind of irrelevant, since only port 7 is working somehow. But hopefully there is some magic command that either diagnoses the device and reports errors, or re-engages the ports. IDK..

Then I was hoping that this appliance could magically detect the hostname and replace the port name with its server name. Automagically, no user or alien intervention needed. Damn, I could use an alien right now. Handy. Also, if someone manually put in the wrong name, hopefully this magic process says: hey, this port shouldn't be named hannetanne, its true hostname is rockonsuckas, so I will update the port name to rockonsuckas.


You mentioned that running diags from the OS command line does not cause loss of functionality to the services. Sweet.
And it is known that in OpenBoot, certain commands will reboot and start POST and diags. Some diags don't require a reboot, but nonetheless, if booted directly to OpenBoot, OS activity is nonexistent.

Now the thing I am trying to piece together is the Control-Break voodoo.
This drops you back to the {1} ok prompt...
This looks just like OpenBoot, it even smells like OpenBoot. But is it the same? Meaning, since the OS is running, will certain diags reboot the system to start their stuff, or will any diags run from this secondary OpenBoot clone cause disruption?

Yep you were right the go resume thing put me back....

So I thought I saw a site say there are three modes: OpenBoot, which we discussed; the OS, which is apparent if installed; and I thought one site said something about a service console or sc, but I can't seem to find it again. How do you get to this place? Follow the yellow brick road? Any certain command? What does it allow you to do? Why would it be used?

It's kind of funny, yet efficient, looking at the back of the server. No keyboard, no mouse, no video card, damn near nothing. A NET MGMT port, a SERIAL MGMT port, a network port, and I'm guessing a serial port. I haven't used a serial port since setting up COM2 to connect to my blazing 56K external modem, where I looked at alt.binaries all night and played on BBSes old school style. I can still hear the crackling and screeching as it connected to the ISP now.. CCCCH TRSH ding DOng - hoooo haaa. bep.

So is ALOM and Serial MGMT the same thing? What is NET MGMT, and when and why would you use it, if you didn't have a Cyclades with only one working port? I am familiar with ssh and working with PuTTY; I have had to use it in the past for VMware. So I am guessing that if the Solaris box is on the network with a valid IP address, gateway, etc., you can connect with ssh (providing it's allowed through the firewall). And since I only know the root user name and password on these systems (minus one server that no one knows; will try booting to DVD tomorrow), I would have to enable ssh root privileges.
Where is this done in Solaris? I can't sudo or use any other account; I don't know what they are or what their passwords are. And you would probably say, well, to minimize security vulnerabilities with ssh and root, just create your own account. This is where the Sun OS is calling me names and laughing hysterically at me, all because I don't even know how to perform this probably basic feat.
This will be a short response.
When you have a bad magic number, there is no test.
format has multiple tests to check the disk for bad blocks, etc. The order is:
format
select disk (1-x) (use d to get the disks on the system listed)
analyze
help will list your options

The different tests list read which are safe to run with data on them, other tests that can be run are destructive i.e. read/write.

another thing to try is boot cdrom -s

the display should be vt100 terminal emulation.

Sun/Solaris has long had software RAID built in (Solstice DiskSuite from Solaris 6 or 7)
metastat
Software RAID in a heavy OS complicates matters further.


I'll have to reread your post to answer what I missed here.
Avatar of Indyrb

ASKER

I finally was able to boot to the cdrom
{1} ok    boot cdrom -s
Came to # prompt

I see the /etc/shadow file
but it says root:NP:6445:::::::

I couldn't edit this file, matter of fact it said something like
I don't know what kind of terminal you are on -- all I have is 'unknown'
[Using open mode]

so I typed
TERM=vt100

Cool, slightly better. but I found that I think this shadow file is the one on the DVD rom.

So apparently I must mount the OS disk... In /dev/dsk/
there are probably 20-30 entries, c0t0d0s0 through c1t1d0s7

Per a quick google search, I found that I needed to run
mount /dev/dsk/c0t0d0s0 /a

This gave me an error
nfs mount: nfs file system; use [host:]path

(1) not sure which of the dsk is the one with the OS
(2) not quite sure how to identify then mount.
(3) if it is nfs, not quite sure what to do...
Avatar of Indyrb

ASKER

tried boot cdrom  

Saw weird error:

WARNING: /usr/sbin/zfs mount -a failed: one or more file systems failed to mount

and at prompt for language 0 for English

Then there is a bunch of options.

tried a few options, but I get to the point where it says name host and its blank

I don't want to reinstall or over install and cause issues, just want to reset root pwd.
Avatar of Indyrb

ASKER

Based on a google search typed:

boot cdrom -s
TERM=vt100
zpool import -R /a rpool  (gave error no such pool available)



went back and redid the boot cdrom -s
typed TERM=vt100
export TERM
mount /dev/dsk/c0t0d0s0 /a
changed to /a
There is nothing inside here.

I am confused.....
Avatar of Indyrb

ASKER

this is a Sun SPARC v210...

tried the mount /dev/dsk/c1t1d0s0 /a
The state of /dev/dsk/c1t1d0s0 is not okay and it was attempted to be mounted read/write
mount: please run fsck and try again.

retried
mount /dev/dsk/c0t0d0s0 /a
says write-protected

But again nothing in /a
Instead of trying to guess which drives exists, run format which will list the available disks.

You can then select each drive and use partition to see the partition table.

Use the existing functional v210 to see how they are setup in terms of partitions as a reference.


On solaris, the only slice that has to be there is s2 which is the whole disk map which can not be mounted nor written to.

You have to see the partition table
http://docs.oracle.com/cd/E23824_01/html/821-1459/disksprep-24.html


To access the shadow on the hard drive, you must find the partition that has this information.

Note that if the system uses metadevices you have to assemble them prior to making any attempts to modify files.
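
For reference, once partition and then print have produced a table in format, a minimal sketch like the one below can pick out the slices that actually have space allocated. It assumes the output was saved to a text file in the column layout format prints; `list_used_slices` is a made-up helper name, and slice 2 (backup) is skipped because it only maps the whole disk.

```shell
# Hypothetical helper: list slices with a non-zero block count from saved
# "format -> partition -> print" output. Slice 2 is skipped on purpose,
# since on Solaris it conventionally maps the whole disk.
list_used_slices() {
    awk '$1 ~ /^[0-7]$/ && $1 != 2 && $NF + 0 > 0 { print "s" $1, $2 }' "$1"
}
```

A slice tagged root is the obvious candidate for /etc/shadow; a small unassigned slice with space allocated may be holding something else (for example volume manager metadata), so it is worth not touching it.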
Avatar of Indyrb

ASKER

Ut uh --- so
I did type format.

I saw two listed

0. c1t0d0 <Seagate> blah blah
1. c1t1d0 <Seagate> blah blah

it said specify disk, so I chose 0
it then said disk formatted?

Did I lose my data?

 my heart is racing....

I typed print
and it said
Part
0               root                    wm  
1               swap                   wu
2               backup               wm
3-7 unassigned.


Sorry, my background is Windows, where typing format usually means destroying data.
Avatar of Indyrb

ASKER

I did same thing for c1t1d0

part
0   unassigned   wm
1 swap    wu
2 backup  wu
3-7 unassigned
Avatar of Indyrb

ASKER

I don't know what a metadevice is -- sorry.

AVAILABLE DISK SELECTIONS:
       0. c1t0d0 <SEAGATE-ST3300007LC-0003 cyl 45265 alt 2 hd 16 sec 809>
          /pci@1c,600000/scsi@2/sd@0,0
       1. c1t1d0 <SEAGATE-ST3300007LC-0003 cyl 45265 alt 2 hd 16 sec 809>
          /pci@1c,600000/scsi@2/sd@1,0
Specify disk (enter its number):

Disk 0

Current partition table (original):
Total disk cylinders available: 45265 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm     653 - 45264      275.35GB    (44612/0/0) 577457728
  1       swap    wu       0 -   647        4.00GB    (648/0/0)     8387712
  2     backup    wm       0 - 45264      279.38GB    (45265/0/0) 585910160
  3 unassigned    wm       0                0         (0/0/0)             0
  4 unassigned    wm       0                0         (0/0/0)             0
  5 unassigned    wm       0                0         (0/0/0)             0
  6 unassigned    wm       0                0         (0/0/0)             0
  7 unassigned    wm     648 -   652       31.60MB    (5/0/0)         64720

partition>


Disk 1

Current partition table (original):
Total disk cylinders available: 45265 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0 unassigned    wm     653 - 45264      275.35GB    (44612/0/0) 577457728
  1       swap    wu       0 -   647        4.00GB    (648/0/0)     8387712
  2     backup    wu       0 - 45264      279.38GB    (45265/0/0) 585910160
  3 unassigned    wm       0                0         (0/0/0)             0
  4 unassigned    wm       0                0         (0/0/0)             0
  5 unassigned    wm       0                0         (0/0/0)             0
  6 unassigned    wm       0                0         (0/0/0)             0
  7 unassigned    wm     648 -   652       31.60MB    (5/0/0)         64720

partition>
A metadevice is a software RAID configuration.
The two drives are partitioned identically.
Partition/slice 7 appears to be the location where the metadb data is stored which is the information on the RAID volume.

fsck /dev/dsk/c1t0d0s0
Then repeat the same for c1t1d0s0

Then you should be able to mount it as /a and then look in /a/etc/vfstab

To avoid shifting the boot from software RAID to an individual drive, your best approach is to look at how the other systems are set up: are they booting off a c1t type drive or do they have a /dev/md/dsk type of device?


Look
metattach d10 /dev/dsk/c1t0d0s0
Repeat the step
You can then try mounting /dev/md/d10 /a

http://docs.oracle.com/cd/E19683-01/817-0660/6mgep4cj9/index.html

The difficulty I have is that the uncertainty how your system is setup makes me cautious in telling you do x, y,  z which if the assumption is wrong could cause you more troubles than you now have.
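
One low-risk way to check how a working system is set up is to look at which device its vfstab mounts as /. This is a minimal sketch, assuming the standard vfstab column order (device to mount, device to fsck, mount point, and so on); `root_device` is a hypothetical helper name. A result under /dev/md/dsk means root is on a metadevice, while /dev/dsk/cXtXdXsX means a plain slice.

```shell
# Hypothetical helper: print the block device that a vfstab-format file
# mounts as "/". Comment lines are skipped.
root_device() {
    awk '$1 !~ /^#/ && $3 == "/" { print $1 }' "$1"
}

# Example use on a healthy box, or against a root slice mounted at /a:
#   root_device /etc/vfstab
#   root_device /a/etc/vfstab
```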
Avatar of Indyrb

ASKER

I didn't notice the similarities before....
They are the same except that slice 0 is root on one and unassigned on the other. Why is this?

I ran fsck /dev/dsk/c1t0d0s0   and all came back good
I ran fsck /dev/dsk/c1t1d0s0 and the last phase said it found something wrong, FIX? so I pushed Y

went back to #
tried to mount the c1t0d0s0 but it didn't work
mounted c1t1d0s0 and it worked.
TERM=vt100
export TERM

browsed /a and could see directories /a/etc/shadow

vi /a/etc/shadow
but it was being a pain to edit, had to lookup all the move right insert blah blah as the normal way wasn't working...
but finally somehow was able to put root:NP:2435235
the normal :wq! wont work, but :w!  then :q!  worked.

rebooting system  

control break
{1}ok  boot

ERROR: The following devices are disabled:
    net2&3
    net0&1

Rebooting with command: boot                                          
Boot device: disk0:a  File and args:
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54.
FCode UFS Reader 1.12 00/07/17 15:48:16.
Loading: /platform/SUNW,Sun-Fire-V210/ufsboot
Loading: /platform/sun4u/ufsboot
SunOS Release 5.9 Version Generic_122300-51 64-bit
Copyright 1983-2003 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
WARNING: forceload of misc/md_trans failed
WARNING: forceload of misc/md_raid failed
WARNING: forceload of misc/md_hotspares failed
WARNING: forceload of misc/md_sp failed
Hardware watchdog enabled
ifconfig: plumb: bge0: no such interface
moving addresses from failed IPv4 interfaces: bge0 (couldn't move, no alternative interface).
Hostname: serverA
The / file system (/dev/md/rdsk/d0) is being checked.



now it's doing some stuff... You kind of lost me on the other steps... I wore a helmet as a kid.. sorry, can you elaborate?

looks like its stuck
Avatar of Indyrb

ASKER

wow -- it booted... did some voodoo and some errors (corrected)
but it was up to the os

pushed enter

came to logon
typed root
password (blank)

Nothing, login incorrect
tried NP
.... confused....
What should I have typed in the /a/etc/shadow file?
Your system was configured with software RAID.
Your edit requires that you change the system to boot from an individual disk.

The likely issue is that a resync will wipe the changes you made.
The data on the software RAID is in /etc/lvm/md.tab if I'm not mistaken for Solaris 9 (5.9)

You could look at using metattach d0 /dev/dsk/c1t0d0s0
metattach d1 /dev/dsk/c1t1d0s0
metattach d10 d0 d1

Then mounting the /dev/md/dsk/d10 /a
metastat will report the status of it.

Search for "solaris recover root password"
There are several options, including guides to modify the OpenBoot boot-device to use disk1 (c1t1d0s0) as the boot device, as well as altering vfstab to remove references to the metadevices that will not be reassembled.

These attempts could result in data loss.
The direction is to leave the password field blank, or paste in an encrypted password.

Your issue is that you modified the file on c1t1d0s0, and I suspect that during boot c1t0d0s0 is accessed and synced to the other disk, reversing what you have done.

on another system run
perl -e 'print scalar crypt("mypassword","MW") ."\n";'

You should have a string MW........... where each (.) represents a character.
MW is used as the salt.
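
To show where that string goes, here is a minimal sketch that swaps the password field (the second colon-separated field) for one user in a shadow-format file. `set_shadow_field` is a hypothetical helper, not a Solaris tool, and it assumes an awk that accepts -v; on Solaris that most likely means nawk or /usr/xpg4/bin/awk rather than the legacy /usr/bin/awk. It prints to stdout so the original file is left alone until you have reviewed the result.

```shell
# Hypothetical helper: print a shadow-format file with the password field
# of one user replaced. Passing an empty string as the hash gives a blank
# password field. Nothing is written back to the input file.
set_shadow_field() {
    user=$1
    hash=$2
    file=$3
    awk -F: -v OFS=: -v u="$user" -v h="$hash" \
        '$1 == u { $2 = h } { print }' "$file"
}

# Example, with the root slice mounted at /a and MWxxxxxxxxxxx standing in
# for the real crypt output:
#   set_shadow_field root 'MWxxxxxxxxxxx' /a/etc/shadow > /tmp/shadow.new
```

Keep a copy of the original shadow file before replacing it, and check that only the root line differs.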
Avatar of Indyrb

ASKER

I didn't get to this today, and will revisit it tomorrow.
But the whole software RAID thing confused me...

I will try to follow your process, as outlined and respond accordingly. hopefully this is all I need


Avatar of Indyrb

ASKER

When I ran the metattach d0 /dev/dsk/c1t0d0s0
and metattach d1 /dev/dsk/c1t1d0s0

it failed, I assumed because d0 and d1 directory didnt exist.
So I tried to mkdir d0 and failed due to read-only
I tried chmod and it failed too
So I mounted one to /a and the other to /mnt, but didnt help much.

Started over
boot cdrom -s

# eeprom boot-device
boot-device=disk:a /pci@1c,600000/scsi@2/disk@0,0:a disk1:a
# format

AVAILABLE DISK SELECTIONS:
       
0. c1t0d0 <SEAGATE-ST3300007LC-0003 cyl 45265 alt 2 hd 16 sec 809>
   
/pci@1c,600000/scsi@2/sd@0,0
       

1. c1t1d0 <SEAGATE-ST3300007LC-0003 cyl 45265 alt 2 hd 16 sec 809>
         
/pci@1c,600000/scsi@2/sd@1,0

partition
print
showed 0 root on c1t0d0

fsck -y /dev/dsk/c1t0d0s0
fsck -y /dev/dsk/c1t1d0s0

found and fixed issues

cd /
mount /dev/dsk/c1t0d0s0 /a

verified mounted to /a
TERM=vt100
export TERM

vi /a/etc/shadow
root::6445::::::
:wq!

vi /a/etc/system
removed rootdev:/pseudo/md@0:0,0,blk
:wq!

vi /a/etc/vfstab
changed only for ROOT /

/dev/md/dsk/d0  /dev/md/rdsk/d0 / ufs 1 no -
to
/dev/dsk/c1t0d0s0 /dev/rdsk/c1t0d0s0 / ufs 1 no -


cd /
umount /a
fsck -y /dev/rdsk/c1t0d0s0
it said to do a stop-a, but it didn't work
Control Break
{1}ok
boot -sw

at the root password prompt, pressed Enter.

got to command line
#
typed metaclear -f -r d0

next hit control-D or type exit
reboots

login as root no password

typed passwd  "enter in new password"

So how do I recreate the mirror and RAID?

Can I do it via terminal or do I need to reboot to cdrom?

Also, can you upgrade the OS without causing application issues or needing a reinstall?
Not sure how Solaris works.
Avatar of Indyrb

ASKER

This post is for my notes:
But I do need help with recreating mirror and raid please.


this v210 has a quad nic card / ports available.

I plugged up to port 0 and nothing... no lights or anything....

I noticed on boot up – even before we did any repairs several days ago. It said

ERROR: The following devices are disabled:  net2&3  net0&1

ran ifconfig -a

only showed:

lo0: flags=1000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000

I ran prtdiag -v

pci     33    MB          network (network)                                
              failed      /pci@1d,700000/network@2

pci     33    MB          network (network)                                
              failed      /pci@1d,700000/network@2,1


pci     66    MB          network (network)                                
              failed      /pci@1f,700000/network@2

pci     66    MB          network (network)                                
              failed      /pci@1f,700000/network@2,1


hit control-break

at OBP

{0} ok .asr
net2&3    Disabled by FWDIAGS
                 OBDIAG failure
net0&1    Disabled by FWDIAGS
                OBDIAG failure

{0} ok asr-enable net0&1
No action taken because device was not disabled by USER
{0} ok asr-clear
{0} ok  reset-all

its rebooting
The / file system (/dev/rdsk/c1t0d0s0) is being checked.
taking forever...

Hopefully I did the right thing... not even sure what asr is....

Noticed these errors on bootup too
WARNING: forceload of misc/md_trans failed
WARNING: forceload of misc/md_raid failed
WARNING: forceload of misc/md_hotspares failed
WARNING: forceload of misc/md_sp failed
WARNING: forceload of misc/md_stripe failed
WARNING: forceload of misc/md_mirror failed
Hardware watchdog enabled
configuring IPv4 interfaces: bge0.

What do these mean--
the configuring IPv4 interfaces: bge0 I think is new since I did the asr thing.

booted to command prompt
#
typed ifconfig -a
now get new line:

bge0: flags=1000803<UP,BROADCAST,MULTICAST,IPv4> mtu 1500 index 2
        inet 10.130.1.156 netmask ffffff00 broadcast 10.130.1.255
        ether 0:3:ba:7e:8f:99


After finding a live line I can ping and etc.  Yea!

Need to know how to set up the mirror again, etc.
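
For the mirror question, the usual Solaris Volume Manager sequence for putting / back on a two-way mirror looks roughly like the sketch below. Treat it as an outline under assumptions, not a tested procedure for this box: it assumes the state database replicas still exist (metadb with no arguments lists them), that the md drivers load at boot (the forceload warnings above would need looking at first), and the metadevice names d10, d11 and d12 are placeholders rather than the names this server originally used. The function only prints the commands so they can be reviewed before anything is run, and it does not cover swap.

```shell
# Prints (does not run) a typical SVM root-mirror sequence.
# d10/d11/d12 are placeholder metadevice names; the default disk names are
# the two disks shown by format earlier in this thread.
# The metattach and metastat lines are meant to be run after the reboot.
print_root_mirror_plan() {
    primary=${1:-c1t0d0}
    secondary=${2:-c1t1d0}
    cat <<EOF
metadb
metainit -f d11 1 1 ${primary}s0
metainit d12 1 1 ${secondary}s0
metainit d10 -m d11
metaroot d10
lockfs -fa
init 6
metattach d10 d12
metastat d10
EOF
}
```

Calling `print_root_mirror_plan` with no arguments uses c1t0d0 as the disk holding the current root and c1t1d0 as the disk that will be overwritten by the resync, so getting the order right matters.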
ASKER CERTIFIED SOLUTION
Avatar of arnold
arnold
Avatar of Indyrb

ASKER

I should accept multiple answers "ALL"
As each response, you provided guidance and continued assistance, but for ease, I choose to accept your last response. You definitely went above and beyond, and I am truly appreciative of your support and help.

I did not complete the last step yet, but I assume it will work without issue.
If I run into issues or if needed I will post a new request.

Again thanks for your help