Directory /var/tmp in filesystem /var keeps filling up on AIX

Hello

Our /var filesystem is growing every minute. On further investigation, I found that it is the tmp directory within /var that keeps growing.

Can anyone tell me the purpose of /var/tmp, why it could be growing constantly, and how to deal with this? Is /var corrupted or something?
The owner of /var/tmp is bin,
and the owner of the files within /var/tmp is root.
In the past 4 hours I have added 2.5 GB of space to /var because /var/tmp kept pushing it close to 100% full.

Is it actually only /var/tmp?

By default the only files which could grow significantly in /var/tmp are the snmp-related logs snmpdv3.log, snmpmibd.log and aixmibd.log.

Which other files do you find in /var/tmp? If in doubt, please post an ls -l sample!

Considerably more growth can happen in /var/adm and /var/spool! /var/adm holds the wtmp file, which can grow very large over time because it records logins and logoffs, and sometimes remote machines try to log in at short intervals in an automated way, maybe even with malicious intent. /var/spool holds the sendmail logs and all the print queues with their logs, which grow along with printing activity and print job size.

wmp
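
To check whether the growth is really confined to /var/tmp or also affects /var/adm and /var/spool, the directories can be ranked by size. A minimal sketch: the function name `top_dirs` is made up for illustration, and it assumes a du that accepts `-x` (stay on one filesystem) and `-k` (sizes in KB).

```shell
# top_dirs DIR [N]: print the N largest directories below DIR, sizes in KB.
# Hypothetical helper, not part of AIX.
top_dirs() {
    dir=$1
    n=${2:-10}
    # -x keeps du on the filesystem of DIR, -k reports kilobytes
    du -xk "$dir" 2>/dev/null | sort -rn | head -n "$n"
}
```

Running `top_dirs /var 10` twice, a few minutes apart, and comparing the two outputs shows which directory is actually growing.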

Find all processes accessing /var with:

for pid in $(fuser /dev/hd9var 2>/dev/null); do ps -o pid=,ppid=,user=,args= -p $pid; done

Maybe you deleted a file whose handle is still held open by some process writing lots of data to it?
In this case you won't see any growing file, but freespace will vanish nonetheless.
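
One rough way to test for this "deleted but still open" case is to compare what df reports as used with what du can find in the directory tree. A large and growing gap suggests that space is held by unlinked files. The sketch below assumes POSIX-format `df -kP` output (used KB in column 3); the function names are illustrative only.

```shell
# Used space of the filesystem containing $1, in KB (POSIX df format, column 3)
fs_used_kb() {
    df -kP "$1" | awk 'NR == 2 { print $3 }'
}

# Space reachable through the directory tree at $1, in KB, same filesystem only
tree_kb() {
    du -xsk "$1" 2>/dev/null | awk '{ print $1 }'
}

# Difference between the two figures
gap_kb() {
    used=$(fs_used_kb "$1")
    seen=$(tree_kb "$1")
    echo $(( used - seen ))
}
```

Run as root, `gap_kb /var` prints the difference in KB. Some gap is normal because of filesystem metadata, so only a gap that keeps growing is meaningful. The AIX fuser documentation also describes a `-d` flag for reporting unlinked open files; check the man page on your release before relying on it.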

ASKER

Yes, it is only /var/tmp.
It is files of the type shown below that are making /var/tmp grow.
/var/tmp is full of them and currently uses 5 GB.

-rw------- 1 root system 3848810 Nov 20 07:32 stm782474aaaad
-rw------- 1 root system 3838813 Nov 20 07:28 stm1130600aaaae
-rw------- 1 root system 3862524 Nov 20 07:27 stm335976aaaaa
-rw------- 1 root system 3838832 Nov 20 07:19 stm782474aaaac
-rw------- 1 root system 3848810 Nov 20 07:15 stm1130600aaaad
-rw------- 1 root system 3838836 Nov 20 07:08 stm782474aaaab
-rw------- 1 root system 3838832 Nov 20 07:05 stm1130600aaaac
-rw------- 1 root system 3862523 Nov 20 06:56 stm782474aaaaa

ASKER

From that command's output it seems as if no process is really accessing /var; correct me if I am wrong.

/var/tmp # for pid in $(fuser /dev/hd9var 2>/dev/null); do ps -o pid=,ppid=,user=,args= -p $pid; done
90330 1 root /usr/lib/errdemon
213234 282826 root /usr/sbin/rsct/bin/rmcd -a IBM.LPCommands -r
221330 282826 root /usr/sbin/syslogd
303278 282826 root /usr/sbin/muxatmd
307212 1 root /usr/sbin/cron
311460 282826 root /usr/sbin/aixmibd
352286 475194 pconsole /usr/java5/bin/java -Xmx512m -Xms20m -Xscmx10m -Xshareclasses -Dfile.encoding=UTF-8 -Xbootclasspath/a:/pconsole/lwi/runtime/core/
372934 282826 root /usr/sbin/nimsh -s
463090 282826 root /usr/sbin/rsct/bin/vac8/IBM.CSMAgentRMd
475194 417854 pconsole /bin/ksh /pconsole/lwi/bin/lwistart_src.sh
487446 282826 root /usr/sbin/rsct/bin/IBM.ServiceRMd
585800 282826 root /usr/sbin/rsct/bin/IBM.DRMd
913506 860330 root /usr/lpp/OV/lbin/eaagt/opcle -std
1687552 1679516 root -ksh

ASKER

This is what I get from running the following command:

/var/tmp # fuser -f /var/tmp/stm782474aaaad
/var/tmp/stm782474aaaad:

ASKER

I do not get any PID from that command. And OV, I believe, stands for HP OpenView, the monitoring tool.

ASKER

These are just some of the sort processes that are running.
Could these sort processes be causing the increase in space used in /var/tmp?

ps -ef | grep sort
root 286816 1 0 18:30:05 - 0:03 sort -rn
root 295130 1 0 12:15:04 - 0:08 sort -rn
root 335976 1 0 07:15:04 - 0:13 sort -rn
root 348410 1 0 11:45:06 - 0:08 sort -rn
root 360476 1 0 08:00:05 - 0:13 sort -rn
root 409648 1 0 08:30:05 - 0:12 sort -rn
root 450710 1 0 22:45:04 - 0:00 sort -rn
root 458798 1 0 13:45:06 - 0:06 sort -rn
root 483558 1 0 10:30:05 - 0:09 sort -rn
root 520394 1 0 18:45:05 - 0:03 sort -rn
root 573536 1 0 10:15:04 - 0:10 sort -rn
root 679994 1 0 14:30:03 - 0:06 sort -rn
root 692332 1 0 17:45:04 - 0:03 sort -rn
root 704668 1 0 11:30:06 - 0:08 sort -rn
root 749628 1 0 18:15:05 - 0:03 sort -rn
root 762054 1 0 11:15:06 - 0:08 sort -rn
root 778430 1 0 20:45:05 - 0:01 sort -rn

So this file is no longer in use. Was it actually the newest one or did you simply copy my example? Remember, I told you to use a still growing file?

Anyway, try this

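A=""
# Poll until the newest stm* file is held open by some process
while [[ -z $A ]]
do A=$(fuser -f "$(ls -rt /var/tmp/stm* | tail -1)" 2>/dev/null)
sleep 2
done
# fuser may report more than one PID, so look each one up separately
for pid in $A; do ps -ef | grep -w "$pid" | grep -v grep; done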

As soon as an open stm* file is found its associated process will be displayed.

Your comments are arriving too fast, it seems!

These sort processes are responsible for all those stm files, I'm sure.
Where do they come from? Can you kill them?

By the way, NetView is the IBM version of HP OpenView, so I was nearly right with my guess.
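
The link between the sort processes and the stm files can be checked from the file names. In the listings above, stm335976aaaaa matches the sort process with PID 335976, which suggests that the digits after "stm" are the PID of the writing process. That is an inference from this thread, not documented behaviour. Under that assumption, a sketch that maps each temp file back to its process (`stm_pid` and `stm_owners` are made-up names):

```shell
# Extract the digits from a name such as /var/tmp/stm335976aaaaa
stm_pid() {
    basename "$1" | sed -n 's/^stm\([0-9][0-9]*\).*/\1/p'
}

# For every stm* file in a directory (default /var/tmp), show the process
# whose PID appears in the file name, or note that it is gone
stm_owners() {
    for f in "${1:-/var/tmp}"/stm*; do
        [ -f "$f" ] || continue
        pid=$(stm_pid "$f")
        if [ -n "$pid" ]; then
            ps -o pid=,ppid=,args= -p "$pid" || echo "$pid (no longer running)"
        fi
    done | sort -u
}
```

`stm_owners` with no argument scans /var/tmp. Files whose PID no longer exists are leftovers from processes that have already exited.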

ASKER

Not sure where they are coming from.
Is there a way to find that out?

ps -ef shows that the owner of the sort -rn processes is root and the PPID is 1.

I will try killing the sort -rn processes by PID.

Since we have no valid PPID (1 is init, probably only the "adoptive father") it will be very hard to find the true origin of these processes.

Isn't there any sort process having a PPID other than 1? If so, what's this PPID's process?

Killing the sorts with PPID 1 is the right measure and will probably do no harm.
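
Picking out only the orphaned sorts (PPID 1) by hand is error-prone when there are dozens of them. A small sketch of a filter that does it; the function names are hypothetical, and it assumes the command line is exactly `sort -rn`, as in the ps output above.

```shell
# Read "pid ppid args..." lines on stdin and print the PIDs of
# "sort -rn" processes whose parent is init (PPID 1)
filter_orphan_sorts() {
    awk '$2 == 1 && $3 == "sort" && $4 == "-rn" { print $1 }'
}

# Apply the filter to the live process table
orphan_sorts() {
    ps -e -o pid=,ppid=,args= | filter_orphan_sorts
}
```

`orphan_sorts` only prints the candidate PIDs. After reviewing the list, they can be ended with `kill $(orphan_sorts)`.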

ASKER

I killed all the sort -rn processes; they all had 1 as the PPID.
After killing them, all the stmxxxxxx files in /var/tmp were removed automatically.

Thank you for your help :)

Well,

but it's somewhat unsatisfying not to have found out where those processes
might have come from, don't you think?

Is there perhaps a faulty cronjob (running every 15 minutes or so) containing a sort?

ASKER

Yes, I agree, it would be good to find the source.
I figured that some application or user started those "sort -rn" processes, which then got stuck in a loop or hung.

And it seems the issue is back.

sort -rn processes are running again with 1 as the PPID, and files like stm806999aaaaa are being generated again in /var/tmp.

How can we find out whether there is a faulty cronjob containing a sort?

crontab -l
as root, then check the commands it lists, including any scripts or programs they call.
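
To check every user's crontab rather than only root's, the spool directory can be searched directly. A sketch, assuming the crontabs live in /var/spool/cron/crontabs (verify the location on your system) and are readable only by root. The function name is hypothetical.

```shell
# List the crontab files in a directory that mention a given command word
crontabs_mentioning() {
    word=$1
    dir=${2:-/var/spool/cron/crontabs}
    # -l prints only the names of matching files, -w matches whole words
    grep -lw "$word" "$dir"/* 2>/dev/null
}
```

`crontabs_mentioning sort` lists the files to inspect. An empty result does not clear cron completely, because an entry may call a script that runs the sort, so the called scripts still need a look.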

I think we should continue tomorrow.

It's late at night here in my part of the world and my day should have been over a couple of hours ago.

See you soon!

wmp

ASKER

OK, sure thing. Have a good night, and thank you for your help.

ASKER

I was able to find the parent of one of the sort -rn processes; the other sort -rn processes have 1 as the PPID.
The sort -rn processes keep coming back after I kill them.
Each time a sort -rn process is started again, it gets a new PID and a new PPID.

# ps -ef | grep 405536
root 405536 1196152 0 18:30:05 - 0:00 sort -rn <<<
root 1319156 1646674 0 18:31:40 pts/0 0:00 grep 405536
# ps -ef | grep 1196152
root 405536 1196152 0 18:30:05 - 0:00 sort -rn
root 843894 1196152 120 18:30:05 - 0:16 du -xak /mnt/sapmnt
root 1196152 1171574 0 18:30:05 - 0:00 head -20 <<<
root 1319158 1646674 0 18:31:45 pts/0 0:00 grep 1196152
# ps -ef | grep 1171574
root 1024030 1646674 0 18:31:58 pts/0 0:00 grep 1171574
root 1196152 1171574 0 18:30:05 - 0:00 head -20
root 1171574 1675510 0 05:53:57 - 0:04 /usr/lpp/OV/lbin/eaagt/opcacta <<< this is the constant PPID every time a new sort -rn process is created.
PID 1171574 starts a new head -20 process, which in turn is the parent of a new sort -rn process.

# ps -ef | grep 1675510
root 753718 1646674 0 18:32:21 pts/0 0:00 grep 1675510
root 778348 1675510 0 05:54:07 - 0:00 /usr/lpp/OV/lbin/eaagt/opcle -std
root 786576 1675510 0 05:53:59 - 0:00 /usr/lpp/OV/lbin/conf/ovconfd
root 847950 1675510 0 05:53:56 - 0:00 /usr/lpp/OV/bin/ovbbccb -nodaemon
root 884870 1675510 0 05:53:57 - 0:08 /usr/lpp/OV/lbin/perf/coda
root 921716 1675510 0 05:54:07 - 0:00 /usr/lpp/OV/lbin/eaagt/opcmsgi
root 1028286 1675510 0 05:54:07 - 0:05 /usr/lpp/OV/lbin/eaagt/opcmona
root 1171574 1675510 0 05:53:57 - 0:04 /usr/lpp/OV/lbin/eaagt/opcacta
root 1392852 1675510 0 05:53:59 - 0:02 /usr/lpp/OV/lbin/eaagt/opcmsga
root 1675510 1 0 05:53:56 - 0:08 /usr/lpp/OV/bin/ovcd <<<

ASKER

To clarify what I meant by ( root 1171574 1675510 0 05:53:57 - 0:04 /usr/lpp/OV/lbin/eaagt/opcacta ) being the constant PPID, see the example below.
The sort -rn and head -20 processes get different PIDs each time, but they all lead back to the same PID, 1171574.

# ps -ef | grep 1273900
root 1273900 1347826 0 18:15:04 - 0:00 sort -rn
# ps -ef | grep 1347826
root 1347826 1171574 0 18:15:04 - 0:00 head -20
# ps -ef | grep 1171574
root 1171574 1675510 0 05:53:57 - 0:04 /usr/lpp/OV/lbin/eaagt/opcacta
# ps -ef | grep 1675510
root 1675510 1 0 05:53:56 - 0:08 /usr/lpp/OV/bin/ovcd

Please advise on the next step for dealing with this issue.
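
The chain of `ps -ef | grep <pid>` lookups above can be done in one pass by following the PPID field upwards until init is reached. A minimal sketch; `ancestry` is a made-up helper name.

```shell
# Print pid, ppid and command line for PID $1 and each of its ancestors
ancestry() {
    cur=$1
    while [ -n "$cur" ] && [ "$cur" -gt 0 ]; do
        ps -o pid=,ppid=,args= -p "$cur" || break   # process already gone
        [ "$cur" -eq 1 ] && break                    # reached init
        # Move one level up: the parent becomes the current process
        cur=$(ps -o ppid= -p "$cur" | tr -d ' ')
    done
}
```

Called with the PID of a live sort -rn process, it prints one line per ancestor, from the sort itself up to init. Because these sorts are short-lived, it has to be run while the process still exists.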

ASKER

Another thing worth mentioning is that I only killed the sort -rn processes that had 1 as the PPID.
The sort -rn processes that have an actual PPID (other than 1) start and stop automatically and get a new PID every time they start again.

OK,

all this comes from OpenView.

Let's see:

EaAgt is the Event Action Agent Application, and opcacta is the Action Agent itself.
The parent of this all, ovcd, is the OpenView Control Daemon.

Now that you've identified opcacta as the culprit, you can try to restart it.

Use the following with caution, because I'm only familiar with NetView, and OpenView seems rather different!

Issue ovc -stop opcacta and ovc -start opcacta
Check the new status with opcagt -status

Is opcacta running? Are new "sort" hooligans coming up?

You could just as well stop and start the whole Agent Subsystem and clean up its temp files in between.

1. opcagt -kill
2. Kill all remaining "opc..." processes, if any.
3. Remove all files under "/var/opt/OV/tmp/OpC"
Note: Not sure if this directory exists with HPOV. If it doesn't, search for something like "/usr/lpp/OV/tmp/OpC" or "/usr/opt/OV/tmp/OpC".
4. opcagt -start
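
The four steps can be rehearsed before touching anything. The sketch below only prints the commands unless DRY_RUN=0 is set. It inherits every uncertainty of the steps above: the `opcagt` options and the temp directory are taken from this post, not verified against HPOV documentation, and `run`, `restart_opc_agent` and `OPC_TMP` are hypothetical names.

```shell
# DRY_RUN=1 (the default here) only prints each command; DRY_RUN=0 executes it
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps 1-4 from the post; OPC_TMP overrides the directory the post was unsure about
restart_opc_agent() {
    opc_tmp=${OPC_TMP:-/var/opt/OV/tmp/OpC}
    run opcagt -kill
    # Step 2: kill any leftover processes whose command name starts with "opc"
    for p in $(ps -e -o pid=,comm= | awk '$2 ~ /^opc/ { print $1 }'); do
        run kill "$p"
    done
    # Step 3: remove regular files under the agent's temp directory, if it exists
    if [ -d "$opc_tmp" ]; then
        run find "$opc_tmp" -type f -exec rm -f {} \;
    fi
    run opcagt -start
}
```

Calling `restart_opc_agent` as is shows the plan. Only after the printed commands look right for your installation would `DRY_RUN=0 restart_opc_agent` be worth considering.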

Hope this helps. If any of the above commands does not exist or would complain about bad syntax - sorry for that, but it's not NetView!
In such a case you will have to consult your HPOV docs - or try to restart the whole OpenView application, this should be something like
ovc -stop ovcd
ovc -start ovcd
Attention! All HPOV application windows will close!

If the latter doesn't exist or work either - sorry again, please check the docs or see your HPOV admin.

wmp

What I forgot: It could well be that HPOV is manageable via smit!

Open smit (or smitty) and search for HPOV, either under "Communications Applications and Services" or "Applications".

If it's there, see what you can do. At least restarting the whole application should be possible!

Good luck!

ASKER

I stopped the ovcd process and restarted it, but the sort -rn problem is still there. I will work on it further with the HPOV team.

ASKER

The HPOV team changed their application template on their end. Thanks.