herbherb

asked on

Using a RAID 5 storage device to store email for 2 Solaris 8 SPARC servers running CommuniGate Pro software

This is the situation. I am presently running the email program CommuniGate Pro on a Solaris 8 SPARC box. What I want to do is set up another server running exactly the same thing, so that if one server goes down the other one will automatically take over. Also, by having only one running at a time, I will need only one license of CommuniGate Pro. Basically, I want to store the data on a central RAID 5 storage device. I was wondering if anyone had any suggestions on how to accomplish this. Is there a RAID 5 box that would do this job, and how would I set up a server to automatically switch on and resume the duties of a failed email server? Does this sound like a good idea, or does anyone have any better ones? Any info would be appreciated. I currently host email for about 10,000 users and I need a good backup like this to minimize downtime.

Nisus091197

Hi,

There are a number of good RAID 5 boxes out there. What you get will depend on your budget. One ISP I know uses A5200 boxes (up to 22 Fibre Channel disks). I suppose you will want to allow for a couple of hundred GB of storage, depending on your clients.

Hardware RAID is better but more expensive.
FC-AL disks can give better performance, with speeds up to 15k RPM.
Spread the load over many disks, though this can limit your potential capacity.

Think about a backup solution as well; paying customers don't like their email being lost.

Let us know your estimated budget and requirements.

Regards, Nisus
http://www.omnimodo.com
herbherb

ASKER

My customer base is about 10,000 accounts, growing by about 2,500 a year, so I would like to be able to plan for at least a five-year solution. Speed is not the biggest issue; it's more about reliability and reduced downtime. Our current structure consists of tape backup, one very unreliable SPARC Solaris 8 server running sendmail, and one SPARC server running CommuniGate Pro, and this is becoming quite a hassle. Our server also experienced quite a bit of downtime earlier this year for still unknown reasons, so what we would like to do is run our existing SPARC box with CommuniGate Pro and have an identical box running the same. CommuniGate recommended we run a RAID 5 box for our data, so I guess even though hardware RAID is more expensive, that is the way we would go. Our budget is pretty flexible at this point because we are actually under budget for the year, and this is definitely a pressing issue. Basically we want an idea of the devices involved and the best way to set this up. Storage would have to be a couple hundred GB. I guess a couple of different options would be helpful, a less expensive one and a more expensive one with the tradeoffs. Thank you for your time.
Regards
John
www.arvig.com
Hanno P.S.
Solaris already has all you need:
a) Install a RAID box (no matter if hardware RAID or JBOD) between
    the two servers.
b) Cable SCSI from the first server to the RAID box and on to the second server:
    server1 ====== disk-box ====== server2
c) Make sure you change the SCSI initiator ID on one of the two servers
    to a different value.
d) Configure SDS (Solstice DiskSuite, aka Sun Logical Volume Manager or
    LVM) and put the disk box's disks into a disk set (to allow failover).
e) You may also want to mirror the system disks on the server itself
    using SDS.

Cheers
Here is how to change the SCSI initiator ID on Sun machines:

There are basically 2 ways to change the scsi-initiator-id.

1) Simply go to the OpenBoot prompt (ok prompt) of one of the machines and
change the scsi-initiator-id accordingly.
 
#### Steps to take:
ok setenv scsi-initiator-id 6
ok reset-all

NOTE:
Since the CD-ROM drive sits at SCSI target ID 6, it will be disabled on that
machine. The setenv variable will change ALL adapters on the system, thus
making the CD-ROM drive on the built-in SCSI adapter useless.
This may be an acceptable solution if one does not require the use of that
CD-ROM. The other machine, with its initiator ID remaining at 7, can still make use
of its CD-ROM.

2) This is a better solution because it only changes the initiator-id on the
adapters that need to be changed. The CD-ROM remains functional after the
change.

Go to the OpenBoot prompt and follow the steps accordingly:

### These steps to find out the relevant adapters' paths:
ok setenv auto-boot? false
ok reset-all
ok probe-scsi-all

### These steps to edit the nvramrc script, to run
### at bootup, to change the scsi-initiator-id:
ok nvedit
   0 probe-all
   1 cd /sbus@1f,0/SUNW,fas@2,8800000
   2 6 encode-int " scsi-initiator-id" property
   3 device-end
   4 cd /sbus@1f,0/SUNW,fas@0,8800000
   5 6 encode-int " scsi-initiator-id" property
   6 device-end
   7 install-console
   8 banner
   <CTRL-C>
ok nvstore
ok setenv use-nvramrc? true
ok reset-all

NOTE:
At lines 1 and 4, the device paths are the ones reported by "probe-scsi-all";
yours will differ. At lines 2 and 5, note the white space between the first (") and scsi.
Remember also to save the nvramrc script by executing "nvstore"
and to enable it at boot by executing "setenv use-nvramrc? true".
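To verify that the change took effect after the next boot, you can check the property from Solaris or from the ok prompt. This is a quick sketch; the device path below is just the example from above, so use the one probe-scsi-all reports on your machine:

 # from Solaris: the edited adapters should now report the new value
 prtconf -pv | grep -i scsi-initiator-id

 ok cd /sbus@1f,0/SUNW,fas@2,8800000
 ok .properties
 ok device-end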
You might want to investigate clustering for your automatic failover needs.
Sun and Veritas both offer very good cluster solutions, but I prefer Veritas.
As far as I know, Veritas does not support those external disks (sitting in
between the two servers) with SDS, but only with VxVM (a matter of cost = $$$).
That is correct, Veritas doesn't support SDS disk sets.
However, simply connecting a storage device to two machines and creating an SDS disk set doesn't provide for automatic failover.
If budget allows, the best solution is to cluster.
Well, what I would require, if possible, would be automatic failover so as to limit downtime. Now, clustering I'm not too familiar with. Wouldn't that involve writing parts across several disks over the two servers? If I did it that way I would probably need two licenses for CommuniGate, but it would provide redundancy if one of the servers crashed. If I used clustering, I would only need the two servers, right? What would I need for storage, how many disks and so forth? As for this Veritas, how can I find out more about its capabilities and cost? How would I implement it on the two existing SPARC boxes I have, or would it require something else? Thank you.
To use the Veritas clustering, you would use the same storage box, preferably something on the order of a Sun A5200 or a SAN.
The one storage box is connected to both systems (you have to change the scsi-initiator-id on one of the systems).

The clustering software will probably cost about $25K.
Basically, you would have your communigate running on only one server at all times. I don't know if you would have to purchase two licenses for that or not.
That is a question for your communigate vendor.

The only trick to all of this is setting up the cluster configuration and writing scripts to start/stop/monitor the CommuniGate application.
The Veritas Cluster Software comes with lots of documentation and isn't horribly difficult to set up or maintain.
Training from Veritas is recommended.
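To give you an idea of what such scripts look like, here is a bare-bones sketch of a start/stop/monitor wrapper of the kind the VCS Application agent calls. The init script path, the SMTP check via mconnect and the 110/100 exit codes are assumptions on my part -- verify all of them against the CommuniGate and Veritas documentation before relying on anything like this.

  #!/bin/sh
  # Sketch of a start/stop/monitor wrapper for CommuniGate under VCS.
  # ASSUMPTIONS: an init script exists at /etc/init.d/CommuniGate, the server
  # answers SMTP on port 25 locally, and the VCS Application agent treats
  # monitor exit code 110 as "online" and 100 as "offline".
  case "$1" in
    start)
      /etc/init.d/CommuniGate start
      ;;
    stop)
      /etc/init.d/CommuniGate stop
      ;;
    monitor)
      # crude liveness test: does the SMTP banner (220 ...) come back?
      if mconnect -p 25 localhost </dev/null 2>/dev/null | grep "^220" >/dev/null; then
        exit 110    # online
      else
        exit 100    # offline
      fi
      ;;
  esac
  exit 0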

Go to http://www.veritas.com/products/category/ProductDetail.jhtml?productId=vcsquickstart
Hey guys! I think we're getting a little bit off topic here discussing clustering.
I understood that what herbherb wanted was the chance of setting up some sort of
"standby" server for his application. But if you (and herbherb) are interested,
here are the basics:
a) Clustering involves at least two systems (Sun allows up to four and
    Veritas up to 16 hosts)
b) Each of the participating systems will have to be configured with as few
    single points of failure (SPOFs) as possible (dual SCSI controllers, mirrored
    OS disks, dual network etc.). This is often not considered by many admins
    planning to set up clusters!
c) Data disks (anything besides the OS and the clustering software, which can be
    viewed as part of the OS) will have to be accessible to any of the
    participating servers. With two hosts this can be done with SCSI (as I
    already mentioned) or SAN; for more than two servers you'll need a SAN.
d) The application(s) will have to be monitored to see if they are "alive", and in the event
    that they seem to be "dead" they either will have to get restarted on
    one of the servers or "failed over" from one server to another if
    the problem is the server itself (not just the app).

The last point usually is the most difficult to solve, as you will have to get
the right monitoring into place. For standard applications (email, web
servers, databases etc.) the cluster vendors (here: Sun or Veritas) already
deliver several matching agents to provide this functionality.
In case you don't have one of the applications they support "out of the
box" you will have to implement your own agent. Depending on your app
this can be better suited to Sun Cluster or Veritas Cluster.

If you don't want to spend the extra money involved for
- redundant hardware (minimizing SPOF)
- cluster software
- cluster setup (includes vendor approval)
- cluster training and administration
- support contract(s)
you could set up a "simple standby" with some sort of "manual failover"
using the "disk set" approach with Sun's SDS (aka LVM) or Veritas'
disk groups. The problem with this is that an admin has to react in a
timely manner whenever the app stops working (restarting it on the
server, initiating a failover, or recovering from a crash).

Cheers
Disk clustering sounds like a great way to provide redundancy and I may consider it. It does sound quite in-depth, though. As for setting up a simple manual failover using the disk set approach, tell me more about this. Where can I go to find more information about it? This may work too; although it doesn't provide automatic failover, it would provide failover at a moderate price. Anything has to be better than my simple setup now, and that's what I'm looking for, something better.
Okay, here you go:
a) For better reliability of your host you should have two disks in it to set up mirroring
    using SDS.
    - Make a copy of /etc/system and /etc/vfstab (just in case ...)
    - Install the packages from the second Solaris CD (you need SUNWmdr, SUNWmdu
      and SUNWmdx, the other packages are optional). The mount point is usually
      /cdrom/sol_8_202_sparc_2/Solaris_8/EA/products/DiskSuite_4.2.1
    - After loading the packages you will have to re-layout your system disks slightly,
      freeing up at least one cylinder for the SDS metastate database (mddb). I'd
     suggest you "steal" one cylinder from your swap space and create a new slice
     (number 4 as an example)
     BEFORE:
      Part      Tag    Flag     Cylinders        Size            Blocks
       0       root    wm       0 - 2607        5.86GB    (2608/0/0) 12288896
       1       swap    wu    2608 - 4897        5.15GB    (2290/0/0) 10790480
       2     backup    wm       0 - 7505       16.86GB    (7506/0/0) 35368272
       3        var    wm    4898 - 7505        5.86GB    (2608/0/0) 12288896
       4 unassigned    wm       0               0         (0/0/0)           0
     AFTER:
      Part      Tag    Flag     Cylinders        Size            Blocks
       0       root    wm       0 - 2607        5.86GB    (2608/0/0) 12288896
       1       swap    wu    2609 - 4897        5.14GB    (2289/0/0) 10785768
       2     backup    wm       0 - 7505       16.86GB    (7506/0/0) 35368272
       3        var    wm    4898 - 7505        5.86GB    (2608/0/0) 12288896
       4 unassigned    wu    2608 - 2608        2.30MB    (1/0/0)        4712
b) Suppose you have the following slices on your boot disk (c0t0d0)
    c0t0d0s0  (root slice), c0t0d0s1  (/var slice), c0t0d0s3  (swap slice),
    c0t0d0s6  (/usr slice), c0t0d0s7  (some other slice)
    and your second disk (c0t1d0) has the same physical characteristics. You can
    now repartition the second disk ("copy" the VTOC) using:
    prtvtoc -h /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2
c) Create the required mddb on both disks (on slice 4) now with
     metadb -a -f -c 4 c0t0d0s4 c0t1d0s4
    If you get an error that the device size is too small, use -c 3 instead.
d) Now you're ready to create the metadevices:
    for i in 0 1 3 6 7 ; do
      metainit -f d1$i 1 1 c0t0d0s$i    # first half (submirror) on the boot disk
      metainit    d2$i 1 1 c0t1d0s$i    # second half (submirror) on the new disk
      metainit d$i -m d1$i              # mirror device (one-way mirror for now)
    done
    Then set the metadevice d0 as the root device:
    metaroot d0
    Change the entries for your other slices accordingly in /etc/vfstab (edit the
    file by changing /dev/dsk/c0t0d0s to /dev/md/dsk/d and /dev/rdsk/c0t0d0s
    to /dev/md/rdsk/d).
e) If everything is changed OK you will have to reboot now, to switch to the
    new metadevices.
f) After the system comes back up again, you can verify the metadevice setup
   with
     metastat
   or
     metastat -p
g) Attach the second halves of the mirrors to have them synchronized now:
    for i in 0 1 3 6 7 ; do
      metattach d$i d2$i     # attach second half of mirror
    done
    This will resync all mirrors simultaneously in the background (if you're in a hurry
    you may want to have them synced one after the other ...).

h) Now it's time to define your disk set for the external storage (I hope to get
    it right just working from the top of my head ...). I will put this in the next post.

i) Define a new disk set for SDS for your two hosts using (host1 and host2 being the
   names of your two servers and maildisks the name you want to give to the disk set):
    metaset -s maildisks -a -h host1 host2
j) Put some disks into the disk set (notice that no slice is mentioned):
    metaset -s maildisks -a c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0
    Warning: The disks will get repartitioned automatically unless special conditions are
    met! I've copied these from SunSolve:
    "slice 2 is zeroed out and slice 7 has cylinders 0-4 or 0-5 allocated to it for the diskset metadb"
k) Take over control of this diskset now:
    metaset -s maildisks -t
    This command will fail if the other host still has control over the diskset. But if we
    take over with option -f ("force") the other host will panic and we get the diskset.

Some other options you may want to know:
metaset -s maildisks -r               # release a disk set (so the other host can take over)
metaset -s maildisks -d c2t3d0        # remove a disk from the disk set
metaset -s maildisks -d -h host2      # remove host2 from the disk set
Adding something (maybe not as obvious as I thought it to be):

1. After putting disks in the disk set you will have to set them up and create a filesystem
    on them (define stripes, mirrors, RAID 5 ...).
2. After taking over control of a disk set you will first have to mount the filesystems. If the other
    host did not cleanly unmount a filesystem (or you took over with the -f option) you
    will have to fsck it first.
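To make points 1 and 2 concrete, here is a rough sketch, written from memory like the rest, of what this could look like with a RAID 5 metadevice. The metadevice name d100, the /mail mount point and the CommuniGate start command are placeholders only, so check every line against the SDS documentation before using it:

    # On the host that owns the set: build a RAID 5 metadevice from the six
    # disks (their data lives on slice 0 after the automatic repartitioning)
    # and put a filesystem on it.
    metainit -s maildisks d100 -r c2t0d0s0 c2t1d0s0 c2t2d0s0 c2t3d0s0 c2t4d0s0 c2t5d0s0
    newfs /dev/md/maildisks/rdsk/d100
    mkdir -p /mail
    mount /dev/md/maildisks/dsk/d100 /mail

    # Manual failover on the standby host (only after the other host released
    # the set or is down; metaset -t -f would panic a live owner):
    metaset -s maildisks -t
    fsck -y /dev/md/maildisks/rdsk/d100     # needed if the filesystem was not cleanly unmounted
    mount /dev/md/maildisks/dsk/d100 /mail
    /etc/init.d/CommuniGate start           # or however you start CommuniGate Pro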

Cheers and good luck!
Well, we have been discussing it and we have come to the conclusion that we will need automatic failover in the event of a crash, so we are looking at RAID boxes such as the Dell PowerVault 220S and the Sun StorEdge 3310 SCSI array. Now, as for clustering software, what would you recommend as the best for my situation, and can you point me in a direction so that I can do some research into the costs and measures that would be needed to implement the clustered servers attached to the RAID array?


      
Setting up automatic failover does NOT require shiny RAID boxes. As I've already pointed
out, you can use any kind of disks (RAID or JBOD ...). It really depends on how much
data you have, how a disk change has to be performed, and how the resync works after a
disk has been replaced.
Therefore, buying RAID boxes is not a question of automatic failover but of serviceability
and manageability.

Now to automatic failover:
a) The main goal of a cluster is to monitor the application (!) and restart it in the unlikely
    event of a halt of that piece of software. Therefore you will have to do some research
    as to which cluster software already has support for your application (sometimes the
    cluster vendors call the monitoring software an "agent"). If there is no such agent
    readily available, find out how you can write your own using the cluster software.
    This is the point where you really have to look deeper into the cluster software's
    inner workings. From my experience it's sometimes easier to add new software
    monitoring to Veritas Cluster than it is for Sun Cluster -- but it depends!
    Failing over the application from one host to another (and taking over the disk set
    from one host to the other) only happens if the first system is rendered unusable.
b) Sun Cluster supports Veritas Volume Manager as well as Solstice DiskSuite (see my
    former posts), whereas Veritas Cluster only supports Veritas Volume Manager.
c) Also make sure you check whether all the applications you are running on your system are
   "cluster ready" (DNS server, mail server, file server, web server ...).
d) Make sure you and/or your admins have developed the right skills to set up and maintain
    the cluster 7x24 -- this is key! You will want to make sure you get the appropriate
    training classes (either Sun or Veritas).

Everything else has been said in my former posts already. If I've missed something
really important, have a look at my profile and send me an email ;-)
Thank you, justunix, for all the very useful information. I understand that automatic failover doesn't have to do with RAID. We have decided that we will first just start with the RAID array with manual failover in a disk set, and after a while set up the clustering for automatic failover. That part will be the tough part, since none of us has experience with that as of yet. Now to the point of transferring our mail clients over to the RAID box (Sun StorEdge A1000): does anyone know a command that will show the last time a user accessed their mail? I know finger will show each account. Is there a command that will show every account on the system and their last accessed date? Also, is there a good way to transfer the mail accounts over to the RAID box with the least amount of interruption?
Have a look at their mailbox files. On a standard UNIX box these will be the files under /var/mail or
/usr/mail (use "man ls" to learn more):
 # ls -lc /var/mail         (time of last change)
 # ls -lu /var/mail         (time of last access)
Unfortunately, I don't know how (and where) CommuniGate stores its mail data; I hope this helps
anyway.
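If you want one list of every mailbox sorted by last access, something like the following should do. This is a sketch that assumes one classic mbox file per user under /var/mail; adjust the path to wherever your mail store actually lives:

 # list mailboxes by last access time, most recently read first
 ls -ltu /var/mail | awk 'NR > 1 { print $9, $6, $7, $8 }'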
Hi,

looks like you still have this one open ?!?
Yeah, I did. I have been busy implementing an A1000 Netra storage device using RAID level 5 on my Sun Fire V120 and Sun Netra T1 servers. It seems to be running great. Now I guess I just have one more question and then I will close this. I need to find a good way to monitor the RAID array in terms of failures and usage. With my servers I don't have a setup like a Sun workstation; I either just telnet into the servers or use the LOM port to administer them from a Windows 2000 machine. I'm looking for a good way to view the Solaris GUI and administer the servers and RAID array using the 6.22 RAID software. I have looked into VNC, but when I try to load it on my Netra T1 it errors out. I have no problem installing the package; I have trouble running the vncserver. When I issue that command it gives me "couldn't find Xauth on your path." What is that and where is it located? Or do you have any other suggestions on how I could accomplish my task? I also would like to set up some type of notification, such as SNMP, to notify me of failures through email. Any suggestions would be appreciated.
ASKER CERTIFIED SOLUTION
Hanno P.S. (Germany)

(Solution text available to Experts Exchange members only.)
Is there any good X11 display server software out there, possibly low-cost or free?
Most commonly "Reflection X11" or Hummingbird's "Exceed" is used -- unfortunately these are not free.
Exceed is more flexible in configuration and has a lot of additional tools, but Reflection is easier to set up and use for novice users.
Free X11 = Unix ;-)  (use Solaris x86 or Linux to run on PC hardware)
I seem to have run into a problem. My Netra box attached to my RAID array has been working great, but now as I try to hook up the other server, the Fire box, I get the following errors:
     WARNING: glm1: SCSI bus reset recursion
WARNING: pcisch1: ino 0x1 has been blocked
WARNING: glm1: interrupt #0 has been blocked
WARNING: /pci@8,600000/scsi@1 (glm0):
unexpected SCSI interrupt while idle
WARNING: /pci@8,600000/scsi@1 (glm0):
SCSI bus reset recursion
WARNING: glm0: fault detected in device; service unavailable
WARNING: glm0: SCSI bus reset recursion
WARNING: pcisch1: ino 0x0 has been blocked
WARNING: glm0: interrupt #0 has been blocked
Then if I reboot the Fire box it gives the same error and then says no RAID devices found. Now, I understand the A1000 only has one controller, but shouldn't I be able to hook both servers up to the RAID box so they both can see it, yet have just one mount the filesystem? Any suggestions?
Did you check that the two Netra do not use the same scsi target ID for the host bus adapter (HBA)? I mentioned this already in one of my earlier posts.
How about spending some more points for this Q ?

Cheers
Do you mean the Netra server and the Netra RAID box? I switched the SCSI ID, as you specified above, on the Fire server to ID 6, but I still got that same error. Would I have to switch it somewhere else, like on the RAID box? Yeah, I will definitely spend more points; I'm just not sure how to do that.
As I understand, you have two Netra Servers and a Netra disk box.
Try this:
On both servers set
  auto-boot? = false
in NVRAM and reset them. If everything is connected properly, do a
  probe-scsi
on both systems. You should see the disks in the netra disk box from both sides now. If not, check your SCSI initiator ID on both hosts.

To spend more points, you may accept an answer and post a new Q with some points just saying "point for XYZ" and wait for the user XYZ to answer to that one.
I've seen Qs that had points increased but I don't know how to do that. Maybe EE's help feature could help?
You may also want to do a
  probe-scsi-all
on your Netra Server(s)

Cheers
When I have both systems at the ok prompt and I issue a probe-scsi-all command on the Fire box, the command finds the 6 disks on the RAID array and then suddenly gives a "Fast instruction access MMU miss" error before it lists the server hard drive. Yet if I unhook the array, it finds both controllers (onboard and expansion) and finds the server hard disk on the onboard controller without giving any errors. Could this be caused by some type of SCSI ID conflict on the Fire server itself? Do the expansion controller and the onboard one have some ID that needs to be different? After it gives me the miss error I boot the server and it doesn't find the RAID array, even though it saw those disks. I can have the Netra unplugged completely from the array, yet the Fire still does the same thing. It must be some type of configuration error on the Fire V120 or a hardware problem on the Fire, because the Netra server works fine with the array. Thoughts?
Did you make sure the cabling is OK? You don't have SE-SCSI mixed with DIFF-SCSI?

Can you draw a small picture of your setup including the adapters being used?
Nope, all cabling is differential, and besides, I have tried using just one server with the other A1000 port terminated.

                  outside network (PIX)
         --------------------------------------------
             |                            |
       Netra server                  Fire server   (backup server in case the
             |                            |         Netra goes down; one turned
             |                            |         on, one turned off)
             |   (X6541A PCI UltraSCSI host adapter cards in both servers)
   diff SCSI |                            |  diff SCSI
             |                            |
          SCSI 1                       SCSI 2
         --------------------------------------------
                     Sun Netra A1000
I just wanted to let you know I figured out the problem myself. I switched the SCSI ID on the PCI card to 5 instead of 6, and that worked. Thanks for all your help.
Good! But there is nothing on ID 6, is there?
No, there isn't, so that's what is strange. I figured maybe somehow it was conflicting with a SCSI device on the array itself. One other thing, though: the funny thing now is that every time I issue a reboot from a telnet session into the Fire server, it doesn't come back up until I console into the server and type in the root login and password. I must have switched something somewhere that causes this to happen, and I'm not sure what it is. Oh, and by the way, I posted another question with some points for you. And do you know how to check what programs are listening on what ports? I thought it was something like netstat -a | grep LISTEN, but that doesn't show me the PIDs.
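On the netstat question: Solaris netstat has no option to map listening sockets to PIDs. One common workaround, sketched here on the assumption that lsof (which would make this a one-liner) is not installed and that port 25 is the one you care about, is to grep the pfiles output of every process:

 # find the PID(s) holding port 25 -- run as root; the plain grep will also
 # match ports such as 250 or 2525, so eyeball the matches
 for pid in `ls /proc`; do
   if pfiles $pid 2>/dev/null | grep "port: 25" >/dev/null; then
     echo "port 25 appears to be held by PID $pid: `ps -o args= -p $pid`"
   fi
 done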