Solved

/tmp directory issues

Posted on 2005-03-14
Medium Priority
Last Modified: 2013-12-27
/tmp (swap) has hit 100% on one of our servers; however, the files listed under the directory add up to well under the disk space available in the swap partition. We had this issue on one other server in the past, and restarting the server cleared it up; however, we would like to track down the root cause. Is there a way in Solaris 8 to determine what process is tying up all the available space on this swap partition?
Question by:RevelationCS
27 Comments
 
LVL 10

Expert Comment

by:Nukfror
ID: 13536827
First, if you don't do this already, I suggest you limit how much space /tmp is allowed to consume of swap space.

In /etc/vfstab, change this:

swap    -       /tmp    tmpfs   -       yes     -

to this:

swap    -       /tmp    tmpfs   -       yes     size=###m

### is whatever size in MBytes you need.  If you leave off the "m", the ### is interpreted as bytes and is rounded up to the next multiple of the page size in Solaris.

If swap is getting eaten up and /tmp is small as you indicated, then you probably have a memory leak in some application.  You should run something like this:

ps -o uid,pid,ppid,c,stime,tty,pmem,vsz,rss,args

And watch the pmem, vsz, and rss columns for processes that stay around but grow slowly over time, and focus your attention on them.

See the ps man page for details on what the columns mean.
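To make the largest processes stand out, the ps output can be sorted on the vsz column. The helper below is only a sketch: the name top_by_vsz is made up for illustration, and it assumes rows shaped like "PID VSZ RSS ARGS" with one header line, such as those produced by ps -e -o pid,vsz,rss,args.

```shell
# Hypothetical helper: sort ps-style rows by the 2nd column (VSZ),
# largest first, and keep the top N rows (default 10).
# Expects a header line followed by "PID VSZ RSS ARGS..." rows on stdin.
top_by_vsz() {
    n=${1:-10}
    # drop the header, sort numerically (descending) on field 2 only
    sed 1d | sort -k2,2nr | head -n "$n"
}

# Example (not run here):
#   ps -e -o pid,vsz,rss,args | top_by_vsz 5
```

Running it a few times, some minutes apart, and comparing the lists is one way to spot a process whose vsz keeps climbing.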
 
LVL 1

Expert Comment

by:sisiro
ID: 13536843
Normally, each filesystem keeps some space for its internal use, but that is not a significant amount. In your case, it seems that a significant amount of space is unaccounted for. One thing that comes to mind is that there might be some data under the /tmp directory, with a new filesystem mounted on top of it. If this is not the issue, please let us know when you first noticed this and whether any modifications to the system software/hardware were made around that time.

 
LVL 1

Expert Comment

by:sisiro
ID: 13536942
Relevant commands:

swap -l <- to see the swap space or file details
fuser    <- to identify a process using the file or file system

 
LVL 8

Author Comment

by:RevelationCS
ID: 13537379
doing a comparison between the server that is having problems and a second server which is identical (both in applications/etc), but is not having a problem; I am noticing the following:

swap -l : identical usage between the two servers

Server with problems:
# fuser /tmp
/tmp:    22720c   22719c   22718c   22717c   22716c   22715c   22714c    1532c

Server without problems:
# fuser /tmp
/tmp:    12483c   28246c    1509c    1508c    1507c    1506c    1505c    1504c    1503c    1502c


/etc/vfstab is configured to 2048M (which, coincidentally is what the unix administrator had set as the same size of the partition itself).

using a variation of the ps -o uid,pid,ppid,c,stime,tty,pmem,vsz,rss,args command, I did not see anything that I think could be part of the problem.

I did find the following article: http://www.dbforums.com/t1119517.html 
However, it does not really explain how to determine whether you are affected by a situation like the one in the last post on that forum (post by Klaus-Dieter). That sounds like what is going on here with our machine, but how do I determine that?
 
LVL 10

Expert Comment

by:Nukfror
ID: 13540810
Can you send the following outputs when the 100% issue is happening:

swap -l
swap -s
df -k /tmp
grep "/tmp" /etc/vfstab

OK -- so your swap partition is 2GB ... right ?  And /etc/vfstab is set so that /tmp can consume 2GB of available swap - which is the whole thing - not in general a good practice ;)

From your description it still sounds like you have a memory leak.

fuser is only going to show you open or in use files/directories.   It won't tell you what process created some temp/trash file and then just left it there eating up anonymous memory.

Send the above command outputs, which should hopefully make it clearer what your problem is.
 
LVL 10

Expert Comment

by:Nukfror
ID: 13540828
Oh almost forgot, can you also send the following command output:

cd /var/adm
egrep -i "fork|swap" messages
 
LVL 38

Expert Comment

by:yuzh
ID: 13541010
Swap space is the extension of your computer's physical memory, not on a chip but on a disk.  The OS uses the swap space when it runs out of physical memory.

If your system is running some large process, it can use up all the RAM + swap, e.g., I have seen a lot of large engineering simulations (single process > GBs).

You can use:
/usr/ucb/ps -uax | head -15
to find out what process is eating up your system resources, or use the "prstat" command to find out the process statistics.

Then use "pmap -a PID" to find out how swap is used by the process.

If it happens very often, you might need to consider adding more RAM to your system. In most cases your swap space should be >= RAM size.

You can add more swap space if you wish; have a look at the following page to learn how:
http:Q_20513973.html

also have a look at "Memory and swapping" to learn more details:
http://www.princeton.edu/~psg/unix/Solaris/troubleshoot/ram.html

Solaris Performance FAQ:

http://www.sun.com/sun-on-net/itworld/UIR010329cockcroftletters.html
http://sunsite.uakom.sk/sunworldonline/common/cockcroft.letters.html
http://www.itworld.com/Comp/3380/UIR010329cockcroftletters/




 
LVL 10

Expert Comment

by:neteducation
ID: 13542375
Very important: /tmp IS NOT the swap space. /tmp is a RAM disk (I know it says swap, but that is only to inform the administrator that if physical RAM is too small, the system will also swap out memory pages that are in this RAM disk).

You can verify this pretty easily by removing all swapspace (using swap -d) and then you will still be able to put something under /tmp

So for your problem, it seems that on that server you are low on physical RAM, and therefore the system is using the swap partition. Try to find out who is using how much RAM, e.g. using the top utility (available at sunfreeware.com)
 
LVL 8

Author Comment

by:RevelationCS
ID: 13545400
The server itself has 8GB of RAM on it... Memory usage is nowhere near capacity (only using about 4GB of the memory), so I highly doubt that it is an issue with physical memory being low...
 
LVL 10

Expert Comment

by:neteducation
ID: 13546831
Well, if your /tmp is full then you DO have a problem with physical RAM.

You don't believe me ?

Fill up your /tmp (you can use mkfile 4000m /tmp/whatever to create big files). When /tmp is full or almost full, try to start some big application. You will not be able to.
 
LVL 10

Expert Comment

by:Nukfror
ID: 13551993
RevelationCS,

I'll ask again.  Please provide the following command outputs:

swap -l
swap -s
df -k /tmp
grep "/tmp" /etc/vfstab
egrep -i "fork|swap" /var/adm/messages
 
LVL 8

Author Comment

by:RevelationCS
ID: 13552305
several of those have already been posted. the egrep I can't post here, but it basically had nothing but the errors in it saying that /tmp was full.... swap -s did not return anything.... we had already done a df -k . (while in the /tmp directory) and that listed the drive as having no available capacity (as stated before: 2GB partition, no space avail)
 
LVL 10

Expert Comment

by:Nukfror
ID: 13555998
I understand that you can't post things directly from the box - I've worked at places like that before myself.  So you'll need to be rather explicit in your descriptions.  So I'm going to ask some questions that should clear up the confusion I have in understanding exactly how your environment is setup - not trying to be annoying - trying to help you figure out your problem.

Does your swap -l output point to a physical disk partition or does it show nothing ?

<user@chivas:151>$ swap -l
swapfile             dev  swaplo blocks   free
/dev/md/dsk/d6      85,6       8 263080 263080

You mentioned swap -s shows nothing ... that isn't right ... swap -s should *always* show something.  It should look something like this.

<user@chivas:152>$ swap -s
total: 31336k bytes allocated + 12832k reserved = 44168k used, 366324k available

swap -l is the one that doesn't have to show anything.  swap -l shows the physical disk partition and the physical utilization of that partition.  If swap -l shows nothing, then you don't have swap backed by physical storage and only anonymous memory is being used - which is very dynamic and on most systems is much too small to do much of anything.  This could be caused by the choice not to have a swap partition - 99.999% of folks always have physical swap backing anonymous memory - or by a typo in the /etc/vfstab file.
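For anyone who cannot paste raw output, a single number can be derived from swap -s instead. The function below is a rough sketch with a made-up name (swap_s_pct_used); it assumes the one-line "total: ... used, ... available" layout shown in the sample above and reports used / (used + available) as a whole percentage.

```shell
# Hypothetical helper: read `swap -s` output on stdin and print the
# percentage of virtual swap in use, truncated to an integer.
# Assumes the "NNNk used, NNNk available" layout of the sample above.
swap_s_pct_used() {
    awk '/^total:/ {
        for (i = 2; i <= NF; i++) {
            # the number sits in the field just before each label;
            # adding 0 drops the trailing "k"
            if ($i ~ /^used/)      used  = $(i - 1) + 0
            if ($i ~ /^available/) avail = $(i - 1) + 0
        }
        if (used + avail > 0) printf "%d\n", used * 100 / (used + avail)
    }'
}

# Example (not run here):
#   swap -s | swap_s_pct_used
```

Fed the sample line above (44168k used, 366324k available), this prints 10.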

Does the /tmp output from df -k look like this:

swap                  364656     384  364272     1%    /tmp

or like this:

/dev/dsk/c#t#d#s#                  364656     384  364272     1%    /tmp
 
LVL 8

Author Comment

by:RevelationCS
ID: 13559318
the server was rebooted last night and once the server was rebooted, the /tmp directory went from 100% (2 GB) to 7% (no files were deleted)....  I am still at a loss as to how we can determine what process is tying up this space without having to physically reboot the box...
 
LVL 10

Expert Comment

by:Nukfror
ID: 13560410
If /tmp is mounted on swap, i.e. anonymous memory, then when the server reboots all files are deleted.  In reality the files just disappear.

In your first post when you said "the files listed underneath the directory add up to well less than the disk space available in the swap partition", how did you determine this ?  Did you use du or df ?
 
LVL 10

Expert Comment

by:Nukfror
ID: 13560425
I was just thinking ... it is perfectly possible for a process to open a file, delete it (so it no longer appears in an ls listing), and continue writing to it consuming disk space - well anonymous memory in this case.  Wonder if that's what's going on.
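The open-then-delete behaviour is easy to reproduce safely. The snippet below is an illustrative sketch only (the function name and file name are made up): it works in a scratch directory created with mktemp -d, which is an assumption about the local system and may not exist on an older Solaris install, so it is meant for any spare POSIX-like box rather than the affected server.

```shell
# Hypothetical demo: a file that has been unlinked stays writable (and
# keeps its space) for as long as some process holds it open.
demo_deleted_open_file() {
    dir=$(mktemp -d) || return 1
    exec 9>"$dir/scratch.log"     # open file descriptor 9 for writing
    rm "$dir/scratch.log"         # unlink it: the name is gone from ls
    # the write still succeeds because the descriptor is still open
    if echo "still writable" >&9; then wrote=yes; else wrote=no; fi
    if [ -e "$dir/scratch.log" ]; then listed=yes; else listed=no; fi
    exec 9>&-                     # closing the descriptor frees the space
    rmdir "$dir"
    echo "listed=$listed wrote=$wrote"
}

demo_deleted_open_file
```

The space used by such a file shows up in df but not in du or ls, which matches the symptom described at the top of the thread.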
 
LVL 8

Author Comment

by:RevelationCS
ID: 13560649
this is what I believe was what was happening as the symptoms in the "scenario" posed in the following article (http://www.dbforums.com/t1119517.html) were similar to what we were seeing... I did use both DF and DU to determine the disk space being used. Theoretically, the processes should be the same between the server that was having the issues and an identical box that is used for load balancing... We have only seen this once in a great while and is not a common thing...
 
LVL 38

Expert Comment

by:yuzh
ID: 13561443
As we mentioned above, you need to check what process is eating up your system resources.

>>the server was rebooted last night and once the server was rebooted, the /tmp directory went from the 100% (2 GB) to 7% after the reboot (not files were deleted)....  

When the system shuts down, all processes are killed, which frees up RAM and swap and gets rid of all process temp files; that's why you see it go from 100% -> 7%.

you should use "ps" command to check it out, have a look at my answer in:
http:Q_20864673.html#10214756

to learn more details
 
LVL 8

Author Comment

by:RevelationCS
ID: 13571566
see the problem is, nothing was showing in either ps or top or prstat that looked out of the ordinary. Everything appeared to be as it should, and the memory consumption by running processes was the same as on our other server at that time... whatever it was, there was a lock placed on the space in the file system, which caused the drive to fill up... none of those three commands will show you what is holding the lock on the file system. What we are essentially looking for is a way to find what process is locking this space so we can clear it WITHOUT having to reboot the server...
 
LVL 1

Accepted Solution

by:
jephilc earned 2000 total points
ID: 13634204
Hi
It is quite possible for a process to continue to write to a deleted file, I've had this same problem before too.
When the problem next occurs, try looking at some of the running processes, or at least start with any process that you "think" might be causing the problem. Use the proc tools to assist you with this.
ptree <pid> will show you the process tree including all child processes, giving you a number of PIDs.
pfiles <pid> will give you more detailed file status information, such as the files that are open.
In your case you would be looking for /tmp file entries to be present where there isn't actually a corresponding entry when you do an ls of /tmp.
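As a rough aid for reading pfiles output, the filter below keeps only regular files on one device and prints their descriptor, inode and size. It is a sketch under assumptions: the function name is made up, it expects the "N: S_IFREG mode:... dev:maj,min ino:N uid:N gid:N size:N" line layout, and the "major,minor" pair for /tmp has to be supplied by you (for instance by comparing against a process known to have a /tmp file open). On Solaris, nawk may be needed in place of awk for the -v option.

```shell
# Hypothetical filter for `pfiles <pid>` output read from stdin.
# $1 = "major,minor" of the filesystem of interest (assumed known).
# Prints one line per open regular file on that device.
pfiles_regular_on_dev() {
    awk -v want="dev:$1" '
        $2 == "S_IFREG" {
            dev = ""; ino = ""; size = ""
            fd = $1; sub(/:$/, "", fd)        # "3:" -> "3"
            for (i = 3; i <= NF; i++) {
                if ($i ~ /^dev:/)  dev  = $i
                if ($i ~ /^ino:/)  ino  = substr($i, 5)
                if ($i ~ /^size:/) size = substr($i, 6)
            }
            if (dev == want) print "fd=" fd " ino=" ino " size=" size
        }'
}

# Example (not run here; 22720 is a PID from the fuser output earlier
# in the thread, and 0,2 is a placeholder device pair):
#   pfiles 22720 | pfiles_regular_on_dev 0,2
```

Any inode it reports that does not appear in ls -i /tmp is a candidate for a deleted file that is still held open, and a size that grows between runs points at the process doing the writing.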

A good example of this can be seen with cron. Check this out on a spare machine.
Get the pid of cron (pgrep cron)
Run pfiles <pid of cron> and you'll see a couple of entries relating to the log file: /var/cron/log with the filename, size and inode number etc.
Now, without restarting cron yet do the following:
mv /var/cron/log /var/cron/olog as if you were recycling the log files.
touch /var/cron/log to create a new empty logfile
Put in some test cron jobs to run every minute (ls or something) just to generate some lines in the logfile. Notice if you run the pfiles command again, that cron is writing to /var/cron/olog now and not the newly created /var/cron/log. Remember the inode number of olog.
Now delete /var/cron/olog.
Carry on running your test cron jobs...
Monitor the df -k of /var (if it's a separate filesystem it's easier to see this) to see the blocks increasing still. If you run pfiles of cron pid again, you'll see that no filename is shown, but the inode number is still present, i.e. it's still writing to the "file" even though it's not there. It's because the file descriptor is still open.
Because this is a test and we know it's cron, a restart of the cron daemon will stop this and start writing to the newly created /var/cron/log file again (as expected). As a result of this, the process will stop writing to the non-existent log file which fills up your filesystem without you knowing about it !!! and reclaim the disk space - there's no need for a reboot if you know the process that caused it.
It's a great test to prove (and understand) what's happening, hope you find it useful.
When it next happens to you, you might be able to troubleshoot some of the processes.
Good luck

John
 
LVL 10

Expert Comment

by:Nukfror
ID: 13634791
While my thoughts on the deleted-file-still-being-written are a possibility, I still don't think this is/was your problem.  I still think you have a memory leak or something similar - maybe you had some process swapouts you don't know about.

I went back and checked - you never posted anything related to:

swap -l
swap -s

Here is a good example on my Solaris 10 system here at home:

bash-3.00# du -sk /tmp /var/run /etc/svc/volatile/
80      /tmp
104     /var/run
928     /etc/svc/volatile

See all my tmpfs file systems are using barely anything.

bash-3.00# swap -l
swapfile             dev  swaplo blocks   free
/dev/dsk/c0t0d0s3   136,11     16 3096560 2829856
bash-3.00# swap -s
total: 214768k bytes allocated + 27384k reserved = 242152k used, 1438064k available

But *physical* swap usage is showing about 134MBytes of used disk space.  But how can this be ?  My *virtual* swap space (e.g. anonymous memory) is showing about 213MBytes of used space as well.  Why the difference ?
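The physical figure comes from the swap -l columns: blocks minus free, in 512-byte blocks, so dividing by 2 gives KB. A small sketch of that arithmetic (the function name is made up, and it assumes the five-column layout with a header line shown above):

```shell
# Hypothetical helper: read `swap -l` output on stdin and print the
# physical swap in use, in KB. blocks and free are in 512-byte units.
swap_l_used_kb() {
    awk 'NR > 1 { used += $4 - $5 }      # skip header; sum over devices
         END    { print (used + 0) / 2 }'
}

# Example (not run here):
#   swap -l | swap_l_used_kb
```

For the output above, (3096560 - 2829856) / 2 = 133352 KB, which is the roughly 134MBytes quoted.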

bash-3.00# vmstat 3
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr dd f0 s3 --   in   sy   cs us sy id
 0 0 8 1452056 43968  0   1  1  1  1  0  0  0  0  0  0  402  134  183  1  1 98
 0 0 11 1437776 60776 2  10  0  0  0  0  0  0  0  0  0  405  177  181  1  1 98

My "w" column is non-zero - I had processes swapped out at some point in the past (without question from the couple of times I've run SMC on this server setting up RBAC roles and stuff).  The delta between the 134MB and the 213MB is how some pages of "swap" were pinned in memory while others were pushed down to the physical swap partition - this is the "feature" of anonymous memory that allows you to boot a Solaris server without physical swap space backing anonymous memory - only really useful on systems with LOTS of memory and when you don't care/want to have panic core dumps - actually I think you can still have panic dumps going to non-swap partitions ... this is a way to speed up reboots after a panic ... I need to re-educate myself on this.

I understand you can't show physical output from your site - but you could make it up in such a way that it illustrates the command outputs.

Right now - everyone is guessing at what your problem may be, so everything is currently nothing more than swags.
 
LVL 8

Author Comment

by:RevelationCS
ID: 13635239
I do not believe it was a memory leak as we just had capacity reports generated for us right after this happened and there were no issues reported on that with memory usage...
 
LVL 8

Author Comment

by:RevelationCS
ID: 13635246
jephilc,

This is what we were looking for....

thanks to all of the experts for their valued input here....
 
LVL 8

Author Comment

by:RevelationCS
ID: 13707815
jephilc,

we had a recurrence of the issue and I am looking at the pfiles output and having some difficulty interpreting the data... do you know of any good sites that describe what the fields in that output mean?

thanks....
 
LVL 1

Expert Comment

by:jephilc
ID: 13712611
Hi RevelationCS.
I'm afraid I don't know of any good sites that go into detail with the columns of output, but the man pages for fstat and fcntl do give some useful information. When I've had to diagnose these sorts of problems, I've used utilities such as top to identify likely processes to investigate, i.e. those that are clocking time regularly - sometimes it's been due to runaway processes, which are much easier to spot.
From the pfiles output though, when you do find a potential culprit, the file type is given, such as S_IFREG for a regular file, along with the permissions, owning UID and GID and the inode number. There may or may not be a filename present, but it also gives the major and minor numbers. The major/minor number info can be cross-referenced with a listing of /dev/dsk/* to get the link info into the /devices directory and hence deduce the correct major/minor combination for /tmp (being the slice used for swap). The pfiles info also includes file size information, which can be useful if the file is being written to as it will show, if you have the correct process, the size increasing.
Unfortunately, it's an investigation process, with some trial and error to try and identify the process that is causing the damage, that's the hard part. Once you identify the process, then you should be able to see why it's happening because you know what to analyze, or if it's a third party application, you can log a call with the supplier etc etc and get them to explain/fix the problem.
You could also do a ps -ef and pull out the second column (the PID), loop through all the PIDs executing a pwdx for each one. This will tell you the current working directory of each process, yours might be using /tmp as the working directory - this could also give you a good list of potential processes to start looking at.
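The ps -ef / pwdx loop could look something like the sketch below. Both function names are made up for illustration, and it assumes pwdx prints one "PID: /path" line per process and that the paths of interest contain no spaces.

```shell
# Hypothetical helper: pull the PID (2nd column) out of `ps -ef` output,
# skipping the header line.
pids_from_ps_ef() {
    awk 'NR > 1 { print $2 }'
}

# Hypothetical helper: read PIDs on stdin and print the pwdx lines of
# processes whose working directory is $1 or somewhere beneath it.
cwd_under() {
    while read -r pid; do
        # errors for processes that exited or that we may not inspect
        # are discarded
        pwdx "$pid" 2>/dev/null
    done | awk -v d="$1" '$2 == d || index($2, d "/") == 1'
}

# Example (not run here):
#   ps -ef | pids_from_ps_ef | cwd_under /tmp
```

The resulting list is only a starting point; each PID on it still needs a look with pfiles to see what it actually has open.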
Can you remember if this started happening at a certain time? i.e. was there a time when this problem didn't occur? if so, what's happened to the system since then? what's been installed or changed? All these questions can make your diagnosis easier. If it's only happened since you upgraded the O/S, then a legacy application might be causing it.
Sorry I can't be of more help at this point, but it can be quite a system-specific type of problem.
If you're still stuck after this, then we can take it offline if that helps to solve it.
Good luck
John
 
LVL 8

Author Comment

by:RevelationCS
ID: 13712678
I did find an article relating to some of what you are talking about above prior to seeing your recommendation, and slowly but surely everything is coming together... I appreciate the help you gave with this as it has been a very intriguing issue... thanks again for the help... I did open up another question related to this looking for some more possible explanations (as you gave more info above than what I was expecting)... if any of the above relates to this post - http://www.experts-exchange.com/Operating_Systems/Solaris/Q_21377370.html - please carry it over to there so I can award the points as needed...

thanks again...
 
LVL 1

Expert Comment

by:jephilc
ID: 13720903
Hi, glad you found it useful. The one thing that the pfiles output is missing.... and where the real work is - is tying the files to the PIDs. I'm registered in the Solaris express program, so I'll raise it as a suggestion for future release to be able to include information on the PID that has the file open being included in the pfiles output. That would be really useful in these situations.
Good luck

John