Solved

Directory /var/tmp in filesystem /var keeps growing until full on AIX

Posted on 2010-11-20
Last Modified: 2012-05-10
Hello,

Our /var filesystem is growing continuously, every minute. On further investigation I found that it is the tmp directory within /var that keeps growing.

Can anyone tell me the significance of /var/tmp, why it could be growing constantly, and how to deal with this? Is /var corrupted or something?
The owner of /var/tmp is bin,
and the owner of the files within /var/tmp is root.
In the past 4 hours I have added 2.5GB of space to /var because /var/tmp kept pushing it close to 100% full.
Question by:assistunix

26 Comments

LVL 68

Expert Comment

by:woolmilkporc
Is it actually only /var/tmp?

By default the only files which could grow significantly in /var/tmp are the snmp-related logs snmpdv3.log, snmpmibd.log and aixmibd.log.

Which other files do you find in /var/tmp? If in doubt, please post an ls -l sample!

Considerably more growth can happen in /var/adm and /var/spool! /var/adm holds the wtmp file, which can grow very big over time because it records logins and logoffs, and sometimes remote machines try to log in at short intervals in an automated way, maybe even with malicious intent. /var/spool holds the sendmail logs and all the print queues with their logs, which can grow along with printing activity and print job size.
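
To see which part of /var is actually growing, a per-directory size ranking helps. The following is a minimal sketch; the function name largest_subdirs is made up for illustration, and it assumes a POSIX shell with du, sort and head available.

```shell
# Hypothetical helper: rank the entries directly under a directory by size (KB).
largest_subdirs() {
    # $1 = directory to inspect, $2 = number of entries to show (default 5)
    du -sk "$1"/* 2>/dev/null | sort -rn | head -n "${2:-5}"
}
```

For example, "largest_subdirs /var 5" lists the five largest entries under /var. Running it twice a few minutes apart shows which entry is growing.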

wmp





LVL 68

Expert Comment

by:woolmilkporc
Find all processes accessing /var with:

for pid in $(fuser /dev/hd9var 2>/dev/null); do ps -o pid=,ppid=,user=,args= -p $pid; done

Maybe you deleted a file whose handle is still held open by some process writing lots of data to it?
In this case you won't see any growing file, but freespace will vanish nonetheless.
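
The effect described above can be reproduced safely in a scratch directory. This is a minimal sketch, assuming a POSIX shell and a mktemp command that supports -d; the function name unlinked_write_demo and the file name held.tmp are made up for the demonstration.

```shell
# Hypothetical demo: a file deleted while a descriptor is still open
# keeps accepting writes, although no directory entry points to it anymore.
unlinked_write_demo() {
    dir=$(mktemp -d) || return 1
    exec 3>"$dir/held.tmp"      # open descriptor 3 on a new file
    rm "$dir/held.tmp"          # remove the directory entry; the data blocks stay allocated
    if echo "still writing" >&3; then
        status="write-ok"
    else
        status="write-failed"
    fi
    remaining=$(ls -A "$dir" | wc -l | tr -d ' ')   # number of visible entries left
    exec 3>&-                   # closing the descriptor finally releases the space
    rmdir "$dir"
    echo "$status entries=$remaining"
}
```

The write succeeds even though ls shows nothing in the directory, which is why free space can shrink without any visibly growing file.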


Author Comment

by:assistunix
Yes, it is /var/tmp only.
It is files of the type below that are causing /var/tmp to grow.
/var/tmp is filled with these files and currently uses 5GB.

-rw-------    1 root     system      3848810 Nov 20 07:32 stm782474aaaad
-rw-------    1 root     system      3838813 Nov 20 07:28 stm1130600aaaae
-rw-------    1 root     system      3862524 Nov 20 07:27 stm335976aaaaa
-rw-------    1 root     system      3838832 Nov 20 07:19 stm782474aaaac
-rw-------    1 root     system      3848810 Nov 20 07:15 stm1130600aaaad
-rw-------    1 root     system      3838836 Nov 20 07:08 stm782474aaaab
-rw-------    1 root     system      3838832 Nov 20 07:05 stm1130600aaaac
-rw-------    1 root     system      3862523 Nov 20 06:56 stm782474aaaaa


Author Comment

by:assistunix
From that command it seems as if no process is really accessing /var; correct me if I am wrong.

/var/tmp # for pid in $(fuser /dev/hd9var 2>/dev/null); do ps -o pid=,ppid=,user=,args= -p $pid; done
  90330       1     root /usr/lib/errdemon
 213234  282826     root /usr/sbin/rsct/bin/rmcd -a IBM.LPCommands -r
 221330  282826     root /usr/sbin/syslogd
 303278  282826     root /usr/sbin/muxatmd
 307212       1     root /usr/sbin/cron
 311460  282826     root /usr/sbin/aixmibd
 352286  475194 pconsole /usr/java5/bin/java -Xmx512m -Xms20m -Xscmx10m -Xshareclasses -Dfile.encoding=UTF-8 -Xbootclasspath/a:/pconsole/lwi/runtime/core/
 372934  282826     root /usr/sbin/nimsh -s
 463090  282826     root /usr/sbin/rsct/bin/vac8/IBM.CSMAgentRMd
 475194  417854 pconsole /bin/ksh /pconsole/lwi/bin/lwistart_src.sh
 487446  282826     root /usr/sbin/rsct/bin/IBM.ServiceRMd
 585800  282826     root /usr/sbin/rsct/bin/IBM.DRMd
 913506  860330     root /usr/lpp/OV/lbin/eaagt/opcle -std
1687552 1679516     root -ksh

LVL 68

Accepted Solution

by:
woolmilkporc earned 500 total points
All those processes are accessing /var!

All of them are pretty standard except for the ...OV.. thing. Are you using NetView?

Anyway, stm... files are temporary work files of "sort"!
Please check with

fuser -f /var/tmp/stm782474aaaad

(choose the newest, still-growing file). Which PID do you see? What does "ps -ef | grep (PID from fuser)" return?


Author Comment

by:assistunix
This is what I get from running that command:

/var/tmp # fuser -f /var/tmp/stm782474aaaad
/var/tmp/stm782474aaaad:

Author Comment

by:assistunix
I do not get any PID with that command. And OV, I believe, is for HP OpenView, the monitoring tool.

Author Comment

by:assistunix
These are just some of the sort processes that are running.
Could these sort processes be causing the increase in space used in /var/tmp?

ps -ef | grep sort
    root  286816       1   0 18:30:05      -  0:03 sort -rn
    root  295130       1   0 12:15:04      -  0:08 sort -rn
    root  335976       1   0 07:15:04      -  0:13 sort -rn
    root  348410       1   0 11:45:06      -  0:08 sort -rn
    root  360476       1   0 08:00:05      -  0:13 sort -rn
    root  409648       1   0 08:30:05      -  0:12 sort -rn
    root  450710       1   0 22:45:04      -  0:00 sort -rn
    root  458798       1   0 13:45:06      -  0:06 sort -rn
    root  483558       1   0 10:30:05      -  0:09 sort -rn
    root  520394       1   0 18:45:05      -  0:03 sort -rn
    root  573536       1   0 10:15:04      -  0:10 sort -rn
    root  679994       1   0 14:30:03      -  0:06 sort -rn
    root  692332       1   0 17:45:04      -  0:03 sort -rn
    root  704668       1   0 11:30:06      -  0:08 sort -rn
    root  749628       1   0 18:15:05      -  0:03 sort -rn
    root  762054       1   0 11:15:06      -  0:08 sort -rn
    root  778430       1   0 20:45:05      -  0:01 sort -rn
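
The digits in the stm file names appear to be the PID of the sort process that created them: stm335976aaaaa from the earlier ls listing matches PID 335976 above. Assuming that naming pattern holds (it is inferred from this thread's output, not from documentation), a small helper can map a temp file back to its process. The name stm_pid is made up for this sketch.

```shell
# Hypothetical helper: extract the numeric part of an stm* temp file name,
# e.g. /var/tmp/stm335976aaaaa -> 335976.
stm_pid() {
    name=${1##*/}               # drop the directory part
    name=${name#stm}            # drop the "stm" prefix
    echo "${name%%[a-z]*}"      # drop everything from the first letter onward
}
```

Usage, under the same assumption: ps -o pid=,ppid=,args= -p "$(stm_pid /var/tmp/stm335976aaaaa)"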

LVL 68

Expert Comment

by:woolmilkporc
So this file is no longer in use. Was it actually the newest one or did you simply copy my example? Remember, I told you to use a still growing file?

Anyway, try this

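A=""
while [[ -z $A ]]
do
   # fuser prints PIDs on stdout; check the most recently modified stm* file
   A=$(fuser -f "$(ls -rt /var/tmp/stm* 2>/dev/null | tail -1)" 2>/dev/null)
   [[ -z $A ]] && sleep 2
done
# show each owning process by PID instead of grepping the whole process table
for pid in $A; do ps -f -p $pid; done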

As soon as an open stm* file is found its associated process will be displayed.

LVL 68

Expert Comment

by:woolmilkporc
Your comments are arriving too fast, it seems!

These sort processes are responsible for the many stm files, I'm sure.
Where do they come from? Can you kill them?

By the way, NetView is the IBM version of HP OpenView, so I was nearly right with my guess.

Author Comment

by:assistunix
Not sure where they are coming from.
Is there a way to find that out?

ps -ef shows the owner of sort -rn to be root, and the PPID is 1.

I will try killing the sort -rn processes by PID.

LVL 68

Expert Comment

by:woolmilkporc
Since we have no valid PPID (1 is init, probably only the "adoptive father") it will be very hard to find the true origin of these processes.

Isn't there any sort process having a PPID other than 1? If so, what's this PPID's process?

Killing the sorts with PPID 1 is the right measure and will probably do no harm.
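
To pick out only the sort processes that have been re-parented to init, the ps output can be filtered on the PPID column. A minimal sketch, assuming ps accepts the -o format already used earlier in this thread; orphaned_sorts is a hypothetical name, and the filter only matches a command shown exactly as "sort", as in the listings here.

```shell
# Hypothetical filter: read "pid ppid command..." lines on stdin and print
# the PIDs of sort processes whose parent is init (PPID 1).
orphaned_sorts() {
    awk '$2 == 1 && $3 == "sort" { print $1 }'
}
```

Usage: ps -e -o pid=,ppid=,args= | orphaned_sorts
Review the resulting list before passing any of the PIDs to kill.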

Author Comment

by:assistunix
I killed all the sort -rn processes; they all had 1 as the PPID.
After killing those processes, all the stmxxxxxx files in /var/tmp were removed automatically.

Thank you for your help :)

LVL 68

Expert Comment

by:woolmilkporc
Well,

but it's somewhat unsatisfying not to have found out where those processes
might have come from, don't you think?

Is there perhaps a faulty cronjob (running every 15 minutes or so) containing a sort?


Author Comment

by:assistunix
Yes, I agree, it would be good to find the source.
I figured that some app or someone ran those "sort -rn" processes, which then went into a loop or hung or something like that.

And it seems the issue is back.

sort -rn processes are running again with 1 as PPID, and files such as stm806999aaaaa are being generated again in /var/tmp.

How can we find out whether there is a faulty cron job containing a sort?

LVL 68

Expert Comment

by:woolmilkporc
crontab -l
as root, then check the commands and any scripts/programs they call.
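
A first pass over the crontab can be automated. This is a minimal sketch, assuming a grep that supports -w; the function name cron_lines_with is made up. It only finds entries that mention the word directly, so scripts called from cron still need to be read by hand.

```shell
# Hypothetical helper: print the non-comment crontab lines on stdin
# that contain the given word.
cron_lines_with() {
    # $1 = word to look for, e.g. sort
    grep -v '^[[:space:]]*#' | grep -w -- "$1"
}
```

Usage: crontab -l | cron_lines_with sort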


LVL 68

Expert Comment

by:woolmilkporc
I think we should continue tomorrow.

It's late at night here in my part of the world and my day should have been over a couple of hours ago.

 See you soon!

wmp

Author Comment

by:assistunix
OK, sure thing. Have a good night, and thank you for your help.

Author Comment

by:assistunix
I was able to find the parent of one of the sort -rn processes; the other sort -rn processes have 1 as PPID.
The sort -rn processes keep starting again after I kill them.
Each time a sort -rn process is restarted, it gets a new PID and a new PPID.


# ps -ef | grep 405536
    root  405536 1196152   0 18:30:05      -  0:00 sort -rn    <<<
    root 1319156 1646674   0 18:31:40  pts/0  0:00 grep 405536
 # ps -ef | grep 1196152
    root  405536 1196152   0 18:30:05      -  0:00 sort -rn
    root  843894 1196152 120 18:30:05      -  0:16 du -xak /mnt/sapmnt
    root 1196152 1171574   0 18:30:05      -  0:00 head -20    <<<
    root 1319158 1646674   0 18:31:45  pts/0  0:00 grep 1196152
 # ps -ef | grep 1171574
    root 1024030 1646674   0 18:31:58  pts/0  0:00 grep 1171574
    root 1196152 1171574   0 18:30:05      -  0:00 head -20
    root 1171574 1675510   0 05:53:57      -  0:04 /usr/lpp/OV/lbin/eaagt/opcacta   << this is the constant PPID every time a new sort -rn process is created.
This PID 1171574 starts a new head -20 process, which in turn starts a new sort -rn process.

 # ps -ef | grep 1675510
    root  753718 1646674   0 18:32:21  pts/0  0:00 grep 1675510
    root  778348 1675510   0 05:54:07      -  0:00 /usr/lpp/OV/lbin/eaagt/opcle -std
    root  786576 1675510   0 05:53:59      -  0:00 /usr/lpp/OV/lbin/conf/ovconfd
    root  847950 1675510   0 05:53:56      -  0:00 /usr/lpp/OV/bin/ovbbccb -nodaemon
    root  884870 1675510   0 05:53:57      -  0:08 /usr/lpp/OV/lbin/perf/coda
    root  921716 1675510   0 05:54:07      -  0:00 /usr/lpp/OV/lbin/eaagt/opcmsgi
    root 1028286 1675510   0 05:54:07      -  0:05 /usr/lpp/OV/lbin/eaagt/opcmona
    root 1171574 1675510   0 05:53:57      -  0:04 /usr/lpp/OV/lbin/eaagt/opcacta
    root 1392852 1675510   0 05:53:59      -  0:02 /usr/lpp/OV/lbin/eaagt/opcmsga
    root 1675510       1   0 05:53:56      -  0:08 /usr/lpp/OV/bin/ovcd   <<<
 

Author Comment

by:assistunix
To clarify what I meant by ( root 1171574 1675510   0 05:53:57      -  0:04 /usr/lpp/OV/lbin/eaagt/opcacta ) being the constant PPID, see the example below.
sort -rn and head -20 get different PIDs each time, but they all lead back to the same constant PPID 1171574.

# ps -ef | grep 1273900
        root 1273900 1347826   0 18:15:04      -  0:00 sort -rn
# ps -ef | grep 1347826
    root 1347826 1171574   0 18:15:04      -  0:00 head -20
# ps -ef | grep 1171574
       root 1171574 1675510   0 05:53:57      -  0:04 /usr/lpp/OV/lbin/eaagt/opcacta
# ps -ef | grep 1675510
    root 1675510       1   0 05:53:56      -  0:08 /usr/lpp/OV/bin/ovcd
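
The parent chain traced by hand above can also be walked in one step from a single ps snapshot. A minimal sketch, assuming awk is available and ps accepts the -o format used earlier in this thread; the function name ancestry is made up.

```shell
# Hypothetical helper: given "pid ppid command..." lines on stdin,
# print the line for the starting PID and then for each of its ancestors.
ancestry() {
    # $1 = PID to start from
    awk -v start="$1" '
        { ppid[$1] = $2; line[$1] = $0 }        # remember parent and full line per PID
        END {
            pid = start
            while (pid in line) {
                print line[pid]
                if (ppid[pid] == pid) break     # guard against a process that is its own parent
                pid = ppid[pid]
            }
        }'
}
```

Usage: ps -e -o pid=,ppid=,args= | ancestry 405536
Because the chain comes from one snapshot, short-lived processes such as head -20 cannot disappear between lookups.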


Please advise on the next step to manage this issue.

Author Comment

by:assistunix
Another thing worth mentioning is that I only killed the sort -rn processes that had 1 as PPID.
The sort -rn processes that have an actual PPID (other than 1) start and stop automatically and are assigned a new PID every time they start again.

LVL 68

Expert Comment

by:woolmilkporc
Comment Utility
OK,

all this comes from OpenView.

Let's see:

EaAgt is the Event Action Agent Application, and opcacta is the Action Agent itself.
The parent of this all, ovcd, is the OpenView Control Daemon.

Now that you've identified opcacta as the culprit, you can try to restart it.

Use the following with caution, because I'm only familiar with NetView, and OpenView seems rather different!

Issue ovc -stop opcacta and ovc -start opcacta
Check the new status with opcagt -status

Is opcacta running? Are new "sort" hooligans coming up?

You could as well stop and start the whole Agent Subsystem and clean up its temp files in between.

1. opcagt -kill
2. Kill all remaining "opc..." processes, if any.
3. Remove all files under "/var/opt/OV/tmp/OpC"  
Note: Not sure if this directory exists with HPOV; if it doesn't, search for something like "/usr/lpp/OV/tmp/OpC" or "/usr/opt/OV/tmp/OpC"
4. opcagt -start

Hope this helps. If any of the above commands does not exist or would complain about bad syntax - sorry for that, but it's not NetView!
In such a case you will have to consult your HPOV docs - or try to restart the whole OpenView application, this should be something like
ovc -stop ovcd
ovc -start ovcd
Attention! All HPOV application windows will close!

If the latter doesn't exist or work either - sorry again, please check the docs or see your HPOV admin.

wmp





LVL 68

Expert Comment

by:woolmilkporc
What I forgot: It could well be that HPOV is manageable via smit!

Open smit (or smitty) and search for HPOV, either under "Communications Applications and Services" or "Applications".

If it's there, see what you can do. At least restarting the whole application should be possible!

Good luck!

Author Comment

by:assistunix
I stopped the ovcd process and restarted it, but the sort -rn issue is still there. I will work further with the HPOV team.

Author Closing Comment

by:assistunix
The HPOV team made changes to their application template on their end. Thanks.
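
The thread does not show what the HPOV team changed in their template. As background on why the pipeline seen in the process listings (du -xak /mnt/sapmnt feeding sort -rn and head -20) needs temp space at all: sort must read its entire input before it can print anything, so on a large tree it spills work files such as the stm* files to disk. One alternative, shown only as an illustrative sketch and not as the fix that was applied, is to keep just the N largest lines in memory with awk. The name top_n is made up.

```shell
# Hypothetical replacement for "sort -rn | head -N": keep only the N lines
# with the largest leading number, without writing any temp files.
top_n() {
    # $1 = how many lines to keep; stdin = "size path" lines such as du -k output
    awk -v n="$1" '
        {
            if (count < n)                 { count++; pos = count }
            else if (($1 + 0) > size[n])   { pos = n }      # beats the current smallest kept line
            else                           next
            # shift smaller entries down so the array stays in descending order
            while (pos > 1 && size[pos - 1] < ($1 + 0)) {
                size[pos] = size[pos - 1]; line[pos] = line[pos - 1]; pos--
            }
            size[pos] = $1 + 0; line[pos] = $0
        }
        END { for (i = 1; i <= count; i++) print line[i] }'
}
```

Usage: du -xak /mnt/sapmnt | top_n 20
Memory use stays bounded by N lines regardless of how large the directory tree is.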