assistunix asked:

directory /var/tmp in filesystem /var keeps increasing to full level in AIX

Hello

Our /var filesystem is growing continuously, every minute. On further investigation I found that it is the tmp directory within /var that keeps growing.

Can anyone tell me the significance of /var/tmp, why it could be growing constantly, and how to deal with this? Is /var corrupted or something?
The owner of /var/tmp is bin, and the owner of the files within /var/tmp is root.
In the past 4 hours I have added 2.5GB of space to /var because /var/tmp kept pushing it close to 100% full.
woolmilkporc

Is it actually only /var/tmp?

By default the only files which could grow significantly in /var/tmp are the snmp-related logs snmpdv3.log, snmpmibd.log and aixmibd.log.

Which other files do you find in /var/tmp? If in doubt, please post an ls -l sample!

Considerably more growth can happen in /var/adm and /var/spool! /var/adm holds the wtmp file, which can grow very large over time because it records logins and logoffs, and sometimes remote machines try to log in at short intervals in an automated way, maybe even with malicious intent. /var/spool holds the sendmail logs and all the print queues and their logs, which grow along with printing activity and print job size.
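
To check whether it really is only /var/tmp, it can help to rank the directories of the /var filesystem by size. A minimal sketch, assuming a POSIX shell; the function name biggest_dirs is made up for this example, and -x keeps du on the one filesystem:

```shell
# biggest_dirs DIR [N]: list the N largest directories (in KB) below DIR,
# staying on DIR's filesystem (-x). Hypothetical helper name.
biggest_dirs() {
  du -xk "$1" 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Example: biggest_dirs /var 10
```

Running it twice a few minutes apart and comparing the two lists shows which directory is actually growing.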

wmp




find all processes accessing /var with:

for pid in $(fuser /dev/hd9var 2>/dev/null); do ps -o pid=,ppid=,user=,args= -p $pid; done

Maybe you deleted a file whose handle is still held open by some process writing lots of data to it?
In this case you won't see any growing file, but freespace will vanish nonetheless.
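
The "vanishing free space" case can be checked by comparing what the filesystem reports as used with what the visible files add up to. A rough sketch only: the function name hidden_kb is hypothetical, it assumes df accepts the POSIX -k and -P flags, and it should be pointed at the mount point itself (here /var) so that both numbers cover the same filesystem:

```shell
# hidden_kb MOUNTPOINT: KB the filesystem counts as used, minus the KB of
# files that du can still see. Hypothetical helper name.
hidden_kb() {
  used=$(df -kP "$1" | awk 'NR==2 {print $3}')
  seen=$(du -xsk "$1" 2>/dev/null | awk '{print $1}')
  echo $((used - seen))
}

# Example: hidden_kb /var
```

A small difference is normal because of filesystem metadata. A large and growing positive number would point to space held by deleted files that some process still has open.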

assistunix (ASKER)

Yes, it is /var/tmp only.
Files like the ones below are what is making /var/tmp grow.
/var/tmp is full of them and is currently using 5GB.

-rw-------    1 root     system      3848810 Nov 20 07:32 stm782474aaaad
-rw-------    1 root     system      3838813 Nov 20 07:28 stm1130600aaaae
-rw-------    1 root     system      3862524 Nov 20 07:27 stm335976aaaaa
-rw-------    1 root     system      3838832 Nov 20 07:19 stm782474aaaac
-rw-------    1 root     system      3848810 Nov 20 07:15 stm1130600aaaad
-rw-------    1 root     system      3838836 Nov 20 07:08 stm782474aaaab
-rw-------    1 root     system      3838832 Nov 20 07:05 stm1130600aaaac
-rw-------    1 root     system      3862523 Nov 20 06:56 stm782474aaaaa

From the output of that command it looks as if no process is really accessing /var. Correct me if I am wrong.

/var/tmp # for pid in $(fuser /dev/hd9var 2>/dev/null); do ps -o pid=,ppid=,user=,args= -p $pid; done
  90330       1     root /usr/lib/errdemon
 213234  282826     root /usr/sbin/rsct/bin/rmcd -a IBM.LPCommands -r
 221330  282826     root /usr/sbin/syslogd
 303278  282826     root /usr/sbin/muxatmd
 307212       1     root /usr/sbin/cron
 311460  282826     root /usr/sbin/aixmibd
 352286  475194 pconsole /usr/java5/bin/java -Xmx512m -Xms20m -Xscmx10m -Xshareclasses -Dfile.encoding=UTF-8 -Xbootclasspath/a:/pconsole/lwi/runtime/core/
 372934  282826     root /usr/sbin/nimsh -s
 463090  282826     root /usr/sbin/rsct/bin/vac8/IBM.CSMAgentRMd
 475194  417854 pconsole /bin/ksh /pconsole/lwi/bin/lwistart_src.sh
 487446  282826     root /usr/sbin/rsct/bin/IBM.ServiceRMd
 585800  282826     root /usr/sbin/rsct/bin/IBM.DRMd
 913506  860330     root /usr/lpp/OV/lbin/eaagt/opcle -std
1687552 1679516     root -ksh
ASKER CERTIFIED SOLUTION
woolmilkporc

(The text of this solution is not included in this copy of the thread.)

This is what I get from running the following command:

/var/tmp # fuser -f /var/tmp/stm782474aaaad
/var/tmp/stm782474aaaad:
I do not get any PID from that command. And OV, I believe, stands for HP OpenView, the monitoring tool.

These are just some of the sort processes that are running.
Could these sort processes be causing the increase in space in /var/tmp?

ps -ef | grep sort
    root  286816       1   0 18:30:05      -  0:03 sort -rn
    root  295130       1   0 12:15:04      -  0:08 sort -rn
    root  335976       1   0 07:15:04      -  0:13 sort -rn
    root  348410       1   0 11:45:06      -  0:08 sort -rn
    root  360476       1   0 08:00:05      -  0:13 sort -rn
    root  409648       1   0 08:30:05      -  0:12 sort -rn
    root  450710       1   0 22:45:04      -  0:00 sort -rn
    root  458798       1   0 13:45:06      -  0:06 sort -rn
    root  483558       1   0 10:30:05      -  0:09 sort -rn
    root  520394       1   0 18:45:05      -  0:03 sort -rn
    root  573536       1   0 10:15:04      -  0:10 sort -rn
    root  679994       1   0 14:30:03      -  0:06 sort -rn
    root  692332       1   0 17:45:04      -  0:03 sort -rn
    root  704668       1   0 11:30:06      -  0:08 sort -rn
    root  749628       1   0 18:15:05      -  0:03 sort -rn
    root  762054       1   0 11:15:06      -  0:08 sort -rn
    root  778430       1   0 20:45:05      -  0:01 sort -rn
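
One thing the listings above suggest: the file stm335976aaaaa carries the same number as the sort -rn process with PID 335976. Assuming the digits after the stm prefix are the PID of the process that created the file (that match is the only evidence in this thread; the naming scheme is not confirmed), a small helper can map each temp file back to its process. The names stm_pid and stm_owners are made up for this sketch:

```shell
# stm_pid NAME: print the digits that follow the "stm" prefix of a file name.
# Assumption: those digits are the PID of the process that created the file.
stm_pid() {
  basename "$1" | sed -n 's/^stm\([0-9][0-9]*\).*/\1/p'
}

# stm_owners [DIR]: for each stm* file, show the process (if it is still
# alive) whose PID is embedded in the file name. DIR defaults to /var/tmp.
stm_owners() {
  for f in "${1:-/var/tmp}"/stm*; do
    [ -e "$f" ] || continue
    pid=$(stm_pid "$f")
    echo "$f -> $(ps -o pid=,ppid=,args= -p "$pid" 2>/dev/null || echo 'no such process')"
  done
}

# Example: stm_owners /var/tmp
```

If the assumption holds, every stm file should line up with one of the sort -rn processes, which would answer the question without needing fuser.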
So this file is no longer in use. Was it actually the newest one or did you simply copy my example? Remember, I told you to use a still growing file?

Anyway, try this

A=""
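while [[ -z $A ]]
 do # take the newest stm* file and ask fuser which PID(s) hold it open
   A=$(fuser -f "$(ls -rt /var/tmp/stm* | tail -1)" 2>/dev/null)
   sleep 2
  done
# fuser may report several PIDs, so look each one up separately
for pid in $A; do ps -f -p "$pid"; done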

As soon as an open stm* file is found its associated process will be displayed.
Your comments are arriving too fast, it seems!

These sort processes are responsible for all those stm files, I'm sure.
Where do they come from? Can you kill them?

By the way, NetView is the IBM version of HP OpenView, so I was nearly right with my guess.
Not sure where they are coming from.
Is there a way to find that out?

ps -ef shows the owner of sort -rn to be root and the PPID to be 1.

I will try killing the sort -rn processes by PID.
Since we have no valid PPID (1 is init, probably only the "adoptive father") it will be very hard to find the true origin of these processes.

Isn't there any sort process having a PPID other than 1? If so, what's this PPID's process?

Killing the sorts with PPID 1 is the right measure and will probably do no harm.
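
Picking out only the orphaned sorts can be done with a small filter instead of by eye. This is a sketch: orphan_sorts is a made-up name, and it expects "pid ppid args" lines on standard input, for example from ps -e -o pid=,ppid=,args= (the same -o syntax used earlier in this thread):

```shell
# orphan_sorts: read "pid ppid args..." lines on stdin and print the PIDs of
# processes whose command line is exactly "sort -rn" and whose parent is
# init (PPID 1). Hypothetical helper name.
orphan_sorts() {
  awk '$2 == 1 && $3 == "sort" && $4 == "-rn" && NF == 4 { print $1 }'
}

# Review the list first:
#   ps -e -o pid=,ppid=,args= | orphan_sorts
# Then, only if the list looks right:
#   ps -e -o pid=,ppid=,args= | orphan_sorts | xargs kill
```

Sorts that still have a real parent are deliberately left alone, because that parent is the best lead to where they come from.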
I killed all the sort -rn processes; they all had 1 as the PPID.
After killing those processes, all the stmxxxxxx files in /var/tmp were removed automatically.

Thank you for your help :)
Well,

but it's somewhat unsatisfying not to have found out where those processes
might have come from, don't you think?

Is there perhaps a faulty cronjob (running every 15 minutes or so) containing a sort?

Yes, I agree, it would be good to find the source.
I figured that some app or someone ran those "sort -rn" processes and they went into a loop or hung or something like that.

And it seems the issue is back.

sort -rn processes are running again with 1 as the PPID, and files like stm806999aaaaa are being generated again in /var/tmp.

How can we find out whether there is a faulty cronjob containing a sort?
crontab -l
as root, then check the commands and any scripts/programs they call.
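
The start times in the earlier ps listing (xx:15, xx:30, xx:45, on the hour) fit a job that fires every 15 minutes, so the crontabs are worth searching. A sketch for checking all users' crontabs at once: cron_grep is a made-up name, and the default directory /var/spool/cron/crontabs is an assumption about where this system keeps per-user crontabs:

```shell
# cron_grep PATTERN [DIR]: print file:line:text for every line in the files
# of DIR that matches PATTERN. /dev/null is added so grep always prints the
# file name, even if DIR holds a single crontab. Hypothetical helper name.
cron_grep() {
  grep -n -- "$1" "${2:-/var/spool/cron/crontabs}"/* /dev/null 2>/dev/null || true
}

# Example (as root): cron_grep sort
# Root's own crontab only: crontab -l | grep -n sort
```

A miss here does not rule cron out, since the sort may sit inside a script that a cron entry calls, and schedulers other than cron can launch jobs too.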

I think we should continue tomorrow.

It's late at night here in my part of the world and my day should have been over a couple of hours ago.

 See you soon!

wmp
OK, sure thing. Have a good night, and thank you for your help.
I was able to find the parent of one of the sort -rn processes; the other sort -rn processes have 1 as the PPID.
The sort -rn processes keep starting again after I kill them.
Each time a sort -rn process is restarted, it gets a new PID and a new PPID.


# ps -ef | grep 405536
    root  405536 1196152   0 18:30:05      -  0:00 sort -rn    <<<
    root 1319156 1646674   0 18:31:40  pts/0  0:00 grep 405536
 # ps -ef | grep 1196152
    root  405536 1196152   0 18:30:05      -  0:00 sort -rn
    root  843894 1196152 120 18:30:05      -  0:16 du -xak /mnt/sapmnt
    root 1196152 1171574   0 18:30:05      -  0:00 head -20    <<<
    root 1319158 1646674   0 18:31:45  pts/0  0:00 grep 1196152
 # ps -ef | grep 1171574
    root 1024030 1646674   0 18:31:58  pts/0  0:00 grep 1171574
    root 1196152 1171574   0 18:30:05      -  0:00 head -20
    root 1171574 1675510   0 05:53:57      -  0:04 /usr/lpp/OV/lbin/eaagt/opcacta   <<< this is the constant parent at the top of the chain every time a new sort -rn process is created.
PID 1171574 spawns a new head -20 process, which in turn is the parent of a new sort -rn process.

 # ps -ef | grep 1675510
    root  753718 1646674   0 18:32:21  pts/0  0:00 grep 1675510
    root  778348 1675510   0 05:54:07      -  0:00 /usr/lpp/OV/lbin/eaagt/opcle -std
    root  786576 1675510   0 05:53:59      -  0:00 /usr/lpp/OV/lbin/conf/ovconfd
    root  847950 1675510   0 05:53:56      -  0:00 /usr/lpp/OV/bin/ovbbccb -nodaemon
    root  884870 1675510   0 05:53:57      -  0:08 /usr/lpp/OV/lbin/perf/coda
    root  921716 1675510   0 05:54:07      -  0:00 /usr/lpp/OV/lbin/eaagt/opcmsgi
    root 1028286 1675510   0 05:54:07      -  0:05 /usr/lpp/OV/lbin/eaagt/opcmona
    root 1171574 1675510   0 05:53:57      -  0:04 /usr/lpp/OV/lbin/eaagt/opcacta
    root 1392852 1675510   0 05:53:59      -  0:02 /usr/lpp/OV/lbin/eaagt/opcmsga
    root 1675510       1   0 05:53:56      -  0:08 /usr/lpp/OV/bin/ovcd   <<<
 
To clarify what I meant by ( root 1171574 1675510   0 05:53:57      -  0:04 /usr/lpp/OV/lbin/eaagt/opcacta ) being the constant PPID, see the example below.
The sort -rn and head -20 processes have different PIDs each time, but they all lead back to the same constant PPID 1171574.

# ps -ef | grep 1273900
        root 1273900 1347826   0 18:15:04      -  0:00 sort -rn
# ps -ef | grep 1347826
    root 1347826 1171574   0 18:15:04      -  0:00 head -20
# ps -ef | grep 1171574
       root 1171574 1675510   0 05:53:57      -  0:04 /usr/lpp/OV/lbin/eaagt/opcacta
# ps -ef | grep 1675510
    root 1675510       1   0 05:53:56      -  0:08 /usr/lpp/OV/bin/ovcd
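
The manual walk above (grep for a PID, read off its PPID, repeat) can be done in one pass over a single ps listing. A sketch, with ancestry as a made-up name; it reads "pid ppid args" lines on standard input and prints the chain from the given PID upwards until it reaches a parent that is not in the listing:

```shell
# ancestry PID: read "pid ppid args..." lines on stdin and print the line for
# PID, then its parent, grandparent, and so on. Hypothetical helper name.
ancestry() {
  awk -v pid="$1" '
    { ppid[$1] = $2; line[$1] = $0 }
    END {
      while (pid in line) {
        print line[pid]
        if (ppid[pid] == pid) break   # guard against a self-parented entry
        pid = ppid[pid]
      }
    }'
}

# Example: ps -e -o pid=,ppid=,args= | ancestry 1273900
```

Because it works on one snapshot, the chain cannot change between lookups, which matters when the sort processes only live for a short time. In the earlier listing, du -xak /mnt/sapmnt and sort -rn share the parent head -20, which would be consistent with opcacta launching a pipeline like du -xak /mnt/sapmnt | sort -rn | head -20, though the thread does not show the actual command.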


Please advise on the next step for managing this issue.
Another thing worth mentioning is that I only killed the sort -rn processes that had 1 as the PPID.
The sort -rn processes that have an actual PPID (other than 1) start and stop automatically and are assigned a new PID every time they start again.
OK,

all this comes from OpenView.

Let's see:

EaAgt is the Event Action Agent Application, and opcacta is the Action Agent itself.
The parent of all this, ovcd, is the OpenView Control Daemon.

Now that you've identified opcacta as the culprit, you can try to restart it.

Use the following with caution, because I'm only familiar with NetView, and OpenView seems rather different!

Issue ovc -stop opcacta and ovc -start opcacta
Check the new status with opcagt -status

Is opcacta running? Are new "sort" hooligans coming up?

You could also stop and start the whole Agent Subsystem and clean up its temp files in between.

1. opcagt -kill
2. Kill all remaining "opc..." processes, if any.
3. Remove all files under "/var/opt/OV/tmp/OpC"  
Note: Not sure if this directory exists with HPOV, if it doesn't search for something like "/usr/lpp/OV/tmp/OpC" or "/usr/opt/OV/tmp/OpC"
4. opcagt -start

Hope this helps. If any of the above commands does not exist or would complain about bad syntax - sorry for that, but it's not NetView!
In such a case you will have to consult your HPOV docs - or try to restart the whole OpenView application, this should be something like
ovc -stop ovcd
ovc -start ovcd
Attention! All HPOV application windows will close!

If the latter doesn't exist or work either - sorry again, please check the docs or see your HPOV admin.

wmp




What I forgot: It could well be that HPOV is manageable via smit!

Open smit (or smitty) and search for HPOV, either under "Communications Applications and Services" or "Applications".

If it's there, see what you can do. At least restarting the whole application should be possible!

Good luck!
I stopped the ovcd process and restarted it, but the sort -rn issue is still there. I will work further with the HPOV team.
HPOV team made changes to their application template from their end. Thanks.