Solaris Cluster primary node boot disk full

I have a Solaris 10 two-node cluster whose primary node has filled up its boot disk. I don't know Solaris well and I certainly don't know clustering, but I've been tasked with figuring out where the disk space on this node has gone.

I have had people who do know Solaris spend an hour or more trying to find what is using up the space, without any luck.

This cluster was built by a group in our engineering department, and I had no experience with it until about two days ago.

According to the Solaris people in my group, the OS seems to be a typical install with typical space usage. They have basically gone through each directory looking for something out of the ordinary and have found nothing.

The secondary node in this cluster shows what looks like normal file system usage. I have included the partition information (df output) for the primary node below in case it helps.

bash-3.00# df -h
Filesystem             size   used  avail capacity  Mounted on
/dev/dsk/c1t0d0s0      111G   104G   6.1G    95%    /
/devices                 0K     0K     0K     0%    /devices
ctfs                     0K     0K     0K     0%    /system/contract
proc                     0K     0K     0K     0%    /proc
mnttab                   0K     0K     0K     0%    /etc/mnttab
swap                    19G   2.0M    19G     1%    /etc/svc/volatile
objfs                    0K     0K     0K     0%    /system/object
/platform/SUNW,T5140/lib/libc_psr/libc_psr_hwcap2.so.1
                       111G   104G   6.1G    95%    /platform/sun4v/lib/libc_psr.so.1
/platform/SUNW,T5140/lib/sparcv9/libc_psr/libc_psr_hwcap2.so.1
                       111G   104G   6.1G    95%    /platform/sun4v/lib/sparcv9/libc_psr.so.1
fd                       0K     0K     0K     0%    /dev/fd
swap                    19G    64K    19G     1%    /tmp
swap                    19G    56K    19G     1%    /var/run
swap                    19G     0K    19G     0%    /dev/vx/dmp
swap                    19G     0K    19G     0%    /dev/vx/rdmp
/dev/did/dsk/d39s3     3.9G   8.5M   3.9G     1%    /global/.devices/node@2
/dev/did/dsk/d2s3      3.9G   8.5M   3.9G     1%    /global/.devices/node@1
/dev/vx/dsk/otl/vol4    20G    14G   5.8G    71%    /global/otl/vol4
/dev/vx/dsk/otl/vol1    20G    14G   5.8G    71%    /global/otl/vol1
/dev/vx/dsk/otl/vol2    20G    14G   5.8G    71%    /global/otl/vol2
/dev/vx/dsk/otl/vol3    20G    14G   5.8G    71%    /global/otl/vol3
/dev/vx/dsk/otl/vol5    20G    14G   5.8G    71%    /global/otl/vol5
/dev/vx/dsk/otl/vol7    20G    14G   5.8G    71%    /global/otl/vol7
/dev/vx/dsk/otl/vol6    20G    14G   5.8G    71%    /global/otl/vol6
/dev/vx/dsk/otl/vol8    20G    14G   5.8G    71%    /global/otl/vol8
/dev/vx/dsk/otl/vol10
                        20G    20M    19G     1%    /global/otl/vol10
/dev/vx/dsk/otl/vol9    20G    14G   5.8G    71%    /global/otl/vol9
popimap:/var/spool/mail
                        51G    40G   5.7G    88%    /var/mail
Asked by kevinmcse1
 
robocat commented:

>I can't really run the same command on / because unless I'm in single user it will try and read some very large shares that take an hour or more to finish with.

As you've been looking for a few days now, I would suggest you run du -hs /* (even if it takes all night).
It really is the only way to proceed.
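If it has to run overnight, one way to keep it going even if your terminal session drops (just a sketch; the output file name is arbitrary):

nohup du -hs /* > /var/tmp/du-root.out 2>&1 &
tail -f /var/tmp/du-root.out     # check on progress whenever convenient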

Amol commented:
First go to the root directory (/) and run du -sh * to see which directory is using the most space, then go into that directory and run the same command again. Repeating this will show you which directory or file is consuming the space.
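A minimal sketch of that drill-down (the /usr step is purely hypothetical — descend into whatever turns out to be largest):

cd /
du -sk * | sort -n       # sizes in KB, largest entries at the bottom
cd usr                   # hypothetical: go into the biggest directory from the previous step
du -sk * | sort -n       # repeat until the offending file or directory shows up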

Rowley commented:
Check /var/core first. If you find nothing there:

cd /var
du -sh *

...and follow the results. It is also worth having a look in /opt and /usr/local (if you have one), as you may find an application logging there.
You should consider moving these directories onto their own file systems.
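For reference, splitting a directory such as /var onto its own slice would just mean a new /etc/vfstab entry once the slice has been newfs'ed and the data copied over; a rough sketch, with c1t0d0s5 as a purely hypothetical spare slice:

# device to mount     device to fsck       mount point  FS type  fsck pass  mount at boot  options
/dev/dsk/c1t0d0s5     /dev/rdsk/c1t0d0s5   /var         ufs      1          no             -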


 
kevinmcse1 (Author) commented:
/var, /opt and /usr don't show anything out of the ordinary. We drilled down inside them as well.

I'm running the du -sh * on / and I'll relay the findings...

robocat commented:

Another magic command to find large files is:

find / -size +100000000c -xdev -ls

(This will find all files larger than 100 MB; note there are eight zeroes in that number, and the size is given in bytes.)
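If the list turns out to be long, you could capture it and sort by the size column (field 7 in the usual find -ls output); a rough sketch:

find / -size +100000000c -xdev -ls > /var/tmp/bigfiles.out 2>/dev/null
sort -nr -k7,7 /var/tmp/bigfiles.out | head -10      # ten largest files first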

kevinmcse1 (Author) commented:
We have some mounted resources that are very large and it looks like the "du -sh *" command is stuck churning away on those. Is there a way to make du ignore the remotely mounted resources?

Does the "find / -size +100000000c -xdev -ls" drill down into sub-directories?

Amol commented:
Try

du -dk / | sort -n

The -d option to du restricts it to the file system you start from, so it won't descend into separately mounted shares. It also gives you totals for each directory.
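Since sort -n puts the biggest numbers last, looking at just the tail keeps the output manageable, e.g.:

du -dk / | sort -n | tail -20     # the 20 largest directory totals on the root file system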

robocat commented:

>Does the "find / -size +100000000c -xdev -ls" drill down into sub-directories?

The -xdev option keeps the command on the root file system (it won't cross into other mounts), but it does still descend into all subdirectories on that file system.

kevinmcse1 (Author) commented:
Using these commands I found only a few large files, nothing over 130 MB, and there were only three or four of those. No smoking gun.

I ran this command on the node:

du -dh / | sort -n | more

and this is a snip of some of the output:

8.0M   /usr/appserver/docs/api/com
8.0M   /usr/appserver/docs/api/com/sun
8.0M   /usr/appserver/docs/api/com/sun/appserv
8.0M   /usr/perl5/5.8.4/lib/sun4-solaris-64int/auto
8.1G
8.1M   /usr/openwin/server/modules
8.1M   /usr/share/webconsole/private/container
8.2M   /opt/emc/SYMCLI/V7.1.0/PERL/lib/perl5/5.8.8/sun4/auto
8.2M   /usr/apache/tomcat/webapps/tomcat-docs/catalina/docs

What is the 8.1G hole in the output?

Amol commented:
Can you check whether there is any hidden directory under /usr?

kevinmcse1 (Author) commented:
ls -a shows nothing hidden under /usr.

Amol commented:
To find all files bigger than 100 MB:

find / -type f -size +100M -ls


To find the ten largest files:

find / -type f -size +100M -printf "%s:%p\n" | sort -nr | head -10
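Note that -size +100M and -printf are GNU find extensions; the stock Solaris 10 /usr/bin/find may not accept them (GNU find, if present at all, is typically a separately installed package). A rough portable equivalent, using 512-byte blocks (+204800 is roughly 100 MB) and the size column of ls -l (field 5):

find / -xdev -type f -size +204800 -exec ls -l {} \; | sort -nr -k5,5 | head -10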

Amol commented:
Try the above commands in /usr to find the large files, or on the other directories that are local to the system.

Joseph Gan (System Admin) commented:
Have you done a reboot in the first place? If not, a simple reboot could solve your problem in this case, because it looks like you may have files that were deleted but are still held open under /usr, and a reboot would release that space.

kevinmcse1 (Author) commented:
I have rebooted into non-cluster mode and ran fsck to be sure there weren't any free-space accounting issues or anything like that. Nothing changed after the reboot, and fsck did not report any problems or fixes it needed to make.

ganjos:
From the info here, what led you to the conclusion that I had open files under /usr?

arthurjb commented:
Normally, the /var partition is what fills up on a Sun box.

It looks like you have /var/run and /var/mail as separate file systems, but /var itself is part of /.

It is for problems like these that I preach that the file system should be broken up, rather than left as one big partition on a disk.

It is not likely that /usr is causing the problem, since the stuff there normally does not change much.

I would suggest running

du -dhs *

in the root directory (/); this will show you the size of each top-level directory without crossing into other mounts.

If you have other systems with a similar layout, you can run the command on one of those and compare the output; it should be fairly obvious which directory tree is overgrown.
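A minimal sketch of that comparison, assuming the healthy node is reachable as node2 (a hypothetical hostname) and ssh between the nodes works:

du -dk / | sort -n > /var/tmp/du.primary                    # on the full node
ssh node2 "du -dk / | sort -n" > /var/tmp/du.secondary      # or run it there and copy the file over
# then compare the two files side by side and look for entries that differ by gigabytes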

You may also want to spend some time on those other systems, since what is 95% full here could be an approaching problem there as well.

My educated guess is that you have a log file in /var that has grown too large.

The other likely cause is that a user has somehow used up a lot of space, since I don't see a separate /home partition.

Good Luck!

Joseph Gan (System Admin) commented:
From this:

> du -dh / | sort -n | more
> ...
> 8.0M   /usr/perl5/5.8.4/lib/sun4-solaris-64int/auto
> 8.1G
> 8.1M   /usr/openwin/server/modules
> ...
> What is the 8.1G hole in the output?

You didn't show in detail what was around the 8.1G hole, but you have the root file system (/), /usr and /var all in one partition, so it is most likely /var, which keeps all the log files; it could also be somewhere under / or /usr though.

To separate those issues, you could run:

cd /
du -sk *

cd /usr
du -sk *

cd /var
du -sk *

kevinmcse1 (Author) commented:
I was moved to other issues in the last day or so but I will revisit this in the morning. I'll update the thread after I look into the recent suggestions.

Thanks all for the help so far.

kevinmcse1 (Author) commented:
Sorry for the long post here but this is the output from du -sh * on /usr and /var.

Nothing is jumping out at me here. Unless there is a way to keep du from looking at mounted shares (the way -xdev does for the "find / -size +100000000c -xdev -ls" that robocat previously suggested), I can't really run the same command on /, because unless I'm in single-user mode it will try to read some very large shares and take an hour or more to finish.
 
/usr
bash-3.00# du -sh *
6.6M   4lib
   1K   5bin
   1K   TDesigner
   1K   X
  33M   X11
   1K   X11R6
  22M   adios
   1K   adm
  52M   apache
  16M   apache2
 106M   appserver
 172K   aset
  53M   bin
 2.1M   ccs
 106M   cluster
 8.2M   copa
 6.2M   demo
   1K   dict
  82M   dt
  12M   emc
   1K   games
   6K   gnome
  46M   include
 107M   j2se
   1K   java
 221M   jdk
 2.5M   kernel
   4K   kvm
 1.1G   lib
 571M   local
   1K   mail
   1K   man
   5K   net
   1K   news
  32K   oasys
   1K   old
 407M   openwin
  58M   perl5
  30M   pgadmin3
 9.6M   platform
  41M   postgres
   1K   preserve
  15K   proc
   1K   pub
  52M   sadm
  55M   sbin
 502M   sfw
 313M   share
 3.5M   snadm
   1K   spool
   1K   src
   1K   storapi
  55M   sunvts
   1K   symapi
   1K   symapi64
   1K   symapi64mt
   1K   symapimt
   1K   symcli
   1K   symcli64
   1K   symcli64mt
   1K   symclimt
   3K   temp
   1K   tmp
 664K   ucb
 159K   ucbinclude
 626K   ucblib
 415K   vmsys
 2.9M   xpg4
 526K   xpg6
bash-3.00#

/var
bash-3.00# du -sh *
  35K   VRTSat
  31K   VRTSweb
   2K   adios
  23M   adm
 3.0M   apache
 819K   apache2
   1K   audit
 2.6M   cacao
 162K   cache
   6K   cc-ccr
 4.1M   cluster
   2K   crash
 217K   cron
  45K   dmi
  53K   dt
   7K   fm
   2K   imq
   1K   inet
   3K   krb5
   4K   ld
   1K   ldap
 963K   lib
 4.3M   log
 552K   lp
   2K   mail
   1K   mysql
   1K   news
   3K   nfs
  38K   nis
   3K   ntp
  47K   opt
   4K   postgres
   1K   preserve
 160K   run
 1.1G   sadm
 135K   saf
   3K   samba
   2K   scn
  16K   scqsd
   3K   sma_snmp
 131K   snmp
  36K   spool
   4K   statmon
 4.2M   svc
   1K   symapi
 216M   tmp
   2K   tsol
  10K   uucp
 111K   vx
  18K   vxvm
 4.6M   webconsole
  42K   yp
bash-3.00#

kevinmcse1 (Author) commented:
Also, something that we have not touched on yet:

This is the primary node of a two-node cluster; could that have something to do with this issue? Does a cluster node hold information in some kind of "pseudo partition" that would show up as used space but not under any normal local partition or drive slice?

kevinmcse1 (Author) commented:
Ok, found it...

It was in the directory /snfs/swnfs0301/swkusr.

There may be other subdirectories below that, but I ran du -sh * on /snfs/swnfs0301 and I've been waiting about 30 minutes; the command is still chugging away reading the directory contents (I assume).

I have no idea what that is. Anyone else know?

Amol commented:
It's a StorNext file system; more info here:

http://en.wikipedia.org/wiki/StorNext_File_System

kevinmcse1 (Author) commented:
Thanks amolq, you gave me the answer, but everyone else helped with the search, so I don't know exactly how to award the points.

I never mentioned that particular directory and I never looked very deep into it as it was filled with 0k subdirs (or so I thought).

Amol commented:
You can distribute the points between us: accept one answer and mark the other answers as assisted answers.

kevinmcse1 (Author) commented:
Damn, I spoke too soon. I apologize, as this is showing my lack of knowledge of Solaris...

The directory /snfs is also a mounted share. Not what we are looking for.

Sorry for the false alarm. Apparently the df -h command did not show it as mounted until I accessed it?

bash-3.00# df -h
Filesystem             size   used  avail capacity  Mounted on
/dev/dsk/c1t0d0s0      111G   104G   6.1G    95%    /
/devices                 0K     0K     0K     0%    /devices
ctfs                     0K     0K     0K     0%    /system/contract
proc                     0K     0K     0K     0%    /proc
mnttab                   0K     0K     0K     0%    /etc/mnttab
swap                    19G   2.0M    19G     1%    /etc/svc/volatile
objfs                    0K     0K     0K     0%    /system/object
/platform/SUNW,T5140/lib/libc_psr/libc_psr_hwcap2.so.1
                       111G   104G   6.1G    95%    /platform/sun4v/lib/libc_psr.so.1
/platform/SUNW,T5140/lib/sparcv9/libc_psr/libc_psr_hwcap2.so.1
                       111G   104G   6.1G    95%    /platform/sun4v/lib/sparcv9/libc_psr.so.1
fd                       0K     0K     0K     0%    /dev/fd
swap                    19G    64K    19G     1%    /tmp
swap                    19G    64K    19G     1%    /var/run
swap                    19G     0K    19G     0%    /dev/vx/dmp
swap                    19G     0K    19G     0%    /dev/vx/rdmp
/dev/did/dsk/d2s3      3.9G   8.5M   3.9G     1%    /global/.devices/node@1
/dev/vx/dsk/otl/vol4    20G    14G   5.8G    71%    /global/otl/vol4
/dev/vx/dsk/otl/vol1    20G    14G   5.8G    71%    /global/otl/vol1
/dev/vx/dsk/otl/vol2    20G    14G   5.8G    71%    /global/otl/vol2
/dev/vx/dsk/otl/vol3    20G    14G   5.8G    71%    /global/otl/vol3
/dev/vx/dsk/otl/vol5    20G    14G   5.8G    71%    /global/otl/vol5
/dev/vx/dsk/otl/vol7    20G    14G   5.8G    71%    /global/otl/vol7
/dev/vx/dsk/otl/vol6    20G    14G   5.8G    71%    /global/otl/vol6
/dev/vx/dsk/otl/vol8    20G    14G   5.8G    71%    /global/otl/vol8
/dev/vx/dsk/otl/vol10
                        20G    20M    19G     1%    /global/otl/vol10
/dev/vx/dsk/otl/vol9    20G    14G   5.8G    71%    /global/otl/vol9
/dev/did/dsk/d39s3     3.9G   8.5M   3.9G     1%    /global/.devices/node@2
swnfs03:/mnt_0301       67G    65G   2.1G    97%    /snfs/swnfs0301

Joseph Gan (System Admin) commented:
Totally agree with robocat.

arthurjb commented:
Me too...

kevinmcse1 (Author) commented:
Sounds good, I'll start the command and let it run for as long as it takes.

I'll update when it's finished.

kevinmcse1 (Author) commented:
OK, sorry for my absence. I tried to run the du -hs /* command over the weekend, but it froze on one of our bloated shares and never completed.

The only way to run that command is to get a window from the users and run it in non-cluster/single-user mode.

Not sure when that will happen, if ever.

I'm just going to close the question. Thanks to everyone who contributed. I'm not sure about the points now, or what the procedure is?

kevinmcse1 (Author) commented:
Could an admin please advise what I should do with the points? The posters were very helpful, but I can't implement a fix at this time.

robocat commented:

Usually points are split amongst the posters that you found most useful.

kevinmcse1 (Author) commented:
In the time this question has been idle, they have wiped the cluster and reinstalled it fresh for another round of testing, so no answer will be forthcoming.

Thanks for all the help anyway.