assistunix
Directory /var/tmp in filesystem /var keeps filling up on AIX
Hello
Our /var filesystem is growing continuously, minute by minute. On further investigation, I found that it is the tmp directory within /var that keeps growing.
Can anyone tell me the significance of /var/tmp, why it could be growing constantly, and how to deal with this? Is /var corrupted or something?
The owner of /var/tmp is bin,
and the owner of the files within /var/tmp is root.
In the past 4 hours I have added 2.5 GB of space to /var, because /var/tmp kept pushing it close to 100% full.
Find all processes accessing /var with:
for pid in $(fuser /dev/hd9var 2>/dev/null); do ps -o pid=,ppid=,user=,args= -p $pid; done
Maybe you deleted a file whose handle is still held open by some process that is writing lots of data to it?
In that case you won't see any growing file, but free space will vanish nonetheless.
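One way to test the deleted-but-still-open theory is to compare what the filesystem reports as used with what is reachable by walking the directory tree. The sketch below assumes a `df` that accepts the POSIX `-P` flag and a `du` that accepts `-x`; the function names are made up for illustration and do not come from the thread.

```shell
# Hypothetical helpers, not from the thread.

# KB in use according to the filesystem itself (3rd column of POSIX-format df)
fs_used_kb() {
  df -kP "$1" | awk 'NR == 2 { print $3 }'
}

# KB reachable by walking the directory tree, staying on one filesystem
tree_used_kb() {
  du -skx "$1" 2>/dev/null | awk '{ print $1 }'
}

# Difference between the two figures, in KB
gap_kb() {
  echo $(( $1 - $2 ))
}

# Example (run as root so du can read everything):
#   gap_kb "$(fs_used_kb /var)" "$(tree_used_kb /var)"
```

A small gap is expected because of filesystem metadata. A gap that keeps growing while no visible file grows would point to space held by unlinked files.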
ASKER
Yes, it is /var/tmp only.
Files of the type shown below are what is causing /var/tmp to grow.
/var/tmp is filled with files like these and currently uses 5 GB:
-rw------- 1 root system 3848810 Nov 20 07:32 stm782474aaaad
-rw------- 1 root system 3838813 Nov 20 07:28 stm1130600aaaae
-rw------- 1 root system 3862524 Nov 20 07:27 stm335976aaaaa
-rw------- 1 root system 3838832 Nov 20 07:19 stm782474aaaac
-rw------- 1 root system 3848810 Nov 20 07:15 stm1130600aaaad
-rw------- 1 root system 3838836 Nov 20 07:08 stm782474aaaab
-rw------- 1 root system 3838832 Nov 20 07:05 stm1130600aaaac
-rw------- 1 root system 3862523 Nov 20 06:56 stm782474aaaaa
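The digits after `stm` repeat across several of these names, which suggests (an assumption at this point) that they are the PID of whatever process created the files. A small sketch that groups an `ls -l` listing by that number shows how many distinct writers are involved; `stm_usage_by_pid` is a made-up name.

```shell
# Hypothetical helper: reads `ls -l` output on stdin and prints
# "<number-in-name> <file count> <total bytes>", largest total first.
stm_usage_by_pid() {
  awk '$NF ~ /^stm[0-9]+[a-z]*$/ {
         pid = $NF
         sub(/^stm/, "", pid)      # drop the "stm" prefix
         sub(/[a-z]+$/, "", pid)   # drop the trailing letter suffix
         bytes[pid] += $5          # 5th column of ls -l is the size
         n[pid]++
       }
       END { for (p in bytes) print p, n[p], bytes[p] }' | sort -k3,3nr
}

# Example:
#   ls -l /var/tmp | stm_usage_by_pid
```

For the eight files listed above this reports three distinct numbers, with 782474 accounting for four files and 15389001 bytes.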
ASKER
From that command it seems as if no process is really accessing /var; correct me if I am wrong.
/var/tmp # for pid in $(fuser /dev/hd9var 2>/dev/null); do ps -o pid=,ppid=,user=,args= -p $pid; done
90330 1 root /usr/lib/errdemon
213234 282826 root /usr/sbin/rsct/bin/rmcd -a IBM.LPCommands -r
221330 282826 root /usr/sbin/syslogd
303278 282826 root /usr/sbin/muxatmd
307212 1 root /usr/sbin/cron
311460 282826 root /usr/sbin/aixmibd
352286 475194 pconsole /usr/java5/bin/java -Xmx512m -Xms20m -Xscmx10m -Xshareclasses -Dfile.encoding=UTF-8 -Xbootclasspath/a:/pconsole/lwi/runtime/core/
372934 282826 root /usr/sbin/nimsh -s
463090 282826 root /usr/sbin/rsct/bin/vac8/IBM.CSMAgentRMd
475194 417854 pconsole /bin/ksh /pconsole/lwi/bin/lwistart_src.sh
487446 282826 root /usr/sbin/rsct/bin/IBM.ServiceRMd
585800 282826 root /usr/sbin/rsct/bin/IBM.DRMd
913506 860330 root /usr/lpp/OV/lbin/eaagt/opcle -std
1687552 1679516 root -ksh
ASKER
This is what I get from running the following command:
/var/tmp # fuser -f /var/tmp/stm782474aaaad
/var/tmp/stm782474aaaad:
ASKER
I do not get any PID with that command. And OV, I believe, stands for HP OpenView, the monitoring tool.
ASKER
# fuser -f /var/tmp/stm782474aaaad
/var/tmp/stm782474aaaad:
ASKER
These are just some of the sort processes that are running.
Could these sort processes be causing the increase in space used in /var/tmp?
ps -ef | grep sort
root 286816 1 0 18:30:05 - 0:03 sort -rn
root 295130 1 0 12:15:04 - 0:08 sort -rn
root 335976 1 0 07:15:04 - 0:13 sort -rn
root 348410 1 0 11:45:06 - 0:08 sort -rn
root 360476 1 0 08:00:05 - 0:13 sort -rn
root 409648 1 0 08:30:05 - 0:12 sort -rn
root 450710 1 0 22:45:04 - 0:00 sort -rn
root 458798 1 0 13:45:06 - 0:06 sort -rn
root 483558 1 0 10:30:05 - 0:09 sort -rn
root 520394 1 0 18:45:05 - 0:03 sort -rn
root 573536 1 0 10:15:04 - 0:10 sort -rn
root 679994 1 0 14:30:03 - 0:06 sort -rn
root 692332 1 0 17:45:04 - 0:03 sort -rn
root 704668 1 0 11:30:06 - 0:08 sort -rn
root 749628 1 0 18:15:05 - 0:03 sort -rn
root 762054 1 0 11:15:06 - 0:08 sort -rn
root 778430 1 0 20:45:05 - 0:01 sort -rn
So this file is no longer in use. Was it actually the newest one or did you simply copy my example? Remember, I told you to use a still growing file?
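Anyway, try this:
A=""
# Poll until the newest stm* file is held open by some process
while [[ -z $A ]]
do
  A=$(fuser -f $(ls -rt /var/tmp/stm* | tail -1) 2>/dev/null)
  sleep 2
done
# Show every process holding that file open (fuser may return several PIDs)
for pid in $A; do ps -f -p $pid; done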
As soon as an open stm* file is found its associated process will be displayed.
Your comments are arriving too fast, it seems!
These sort processes are responsible for the many stm files, I'm sure.
Where do they come from? Can you kill them?
By the way, NetView is the IBM version of HP OpenView, so I was nearly right with my guess.
ASKER
I am not sure where they are coming from.
Is there a way to find that out?
ps -ef shows the owner of sort -rn to be root, and the PPID is 1.
I will try killing the sort -rn processes by PID.
Since we have no valid PPID (1 is init, probably only the "adoptive father") it will be very hard to find the true origin of these processes.
Isn't there any sort process having a PPID other than 1? If so, what's this PPID's process?
Killing the sorts with PPID 1 is the right measure and will probably do no harm.
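To check quickly whether any sort still has a live parent, the process list can be filtered by PPID. This is only a sketch: `classify_sorts` is a made-up name, and it assumes `ps` accepts repeated `-o` options with empty headers.

```shell
# Hypothetical helper: reads "pid ppid command..." lines on stdin and
# splits sort processes into orphans (PPID 1) and ones with a live parent.
classify_sorts() {
  awk '$3 == "sort" {
         if ($2 == 1) print "orphan", $1
         else         print "child-of-" $2, $1
       }'
}

# Example:
#   ps -e -o pid= -o ppid= -o args= | classify_sorts
```

Any line starting with `child-of-` names a parent PID that is still alive and can be inspected with `ps -f -p` before it disappears.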
ASKER
I killed all the sort -rn processes; they all had 1 as the PPID.
After killing those processes, all the stmxxxxxx files in /var/tmp were removed automatically.
Thank you for your help :)
Well,
but it's somewhat unsatisfying not to have found out where those processes
might have come from, don't you think?
Is there perhaps a faulty cronjob (running every 15 minutes or so) containing a sort?
ASKER
Yes, I agree, it would be good to find the source of it.
I figured that some app or someone had started those "sort -rn" processes, and that they then went into a loop or hung or something like that.
And it seems the issue is back.
sort -rn processes are running again with 1 as the PPID, and stm806999aaaaa files are being generated again in /var/tmp.
How can we find out whether there is a faulty cronjob containing a sort?
crontab -l
as root, then check the listed commands and the scripts/programs they call.
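Since other users may have crontabs too, a sketch that searches every crontab file in a directory can save some time. The directory /var/spool/cron/crontabs is an assumption about where this system keeps them, and `cron_lines_matching` is a made-up name.

```shell
# Hypothetical helper: print non-comment crontab lines that contain the
# word $1, prefixed with the crontab's file name (normally the user name).
cron_lines_matching() {
  pattern=$1
  dir=$2
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    grep -v '^[[:space:]]*#' "$f" | grep -w -- "$pattern" |
      sed "s|^|$(basename "$f"): |"
  done
}

# Example (as root):
#   cron_lines_matching sort /var/spool/cron/crontabs
```

This only catches a sort written directly in a crontab line. A sort buried inside a script that cron calls still has to be found by reading that script.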
I think we should continue tomorrow.
It's late at night here in my part of the world and my day should have been over a couple of hours ago.
À bientôt!
wmp
ASKER
ok sure thing. have a good night, thank you for your help.
ASKER
I was able to find the parent of one of the sort -rn processes; the other sort -rn processes have 1 as the PPID.
The sort -rn processes keep starting again after I kill them.
Each time a sort -rn process is restarted, it gets a new PID and a new PPID.
# ps -ef | grep 405536
root 405536 1196152 0 18:30:05 - 0:00 sort -rn <<<
root 1319156 1646674 0 18:31:40 pts/0 0:00 grep 405536
# ps -ef | grep 1196152
root 405536 1196152 0 18:30:05 - 0:00 sort -rn
root 843894 1196152 120 18:30:05 - 0:16 du -xak /mnt/sapmnt
root 1196152 1171574 0 18:30:05 - 0:00 head -20 <<<
root 1319158 1646674 0 18:31:45 pts/0 0:00 grep 1196152
# ps -ef | grep 1171574
root 1024030 1646674 0 18:31:58 pts/0 0:00 grep 1171574
root 1196152 1171574 0 18:30:05 - 0:00 head -20
root 1171574 1675510 0 05:53:57 - 0:04 /usr/lpp/OV/lbin/eaagt/opcacta << this is the constant ancestor every time a new sort -rn process is created.
This PID 1171574 spawns a new head -20 process, which in turn shows up as the parent of a new sort -rn process.
# ps -ef | grep 1675510
root 753718 1646674 0 18:32:21 pts/0 0:00 grep 1675510
root 778348 1675510 0 05:54:07 - 0:00 /usr/lpp/OV/lbin/eaagt/opcle -std
root 786576 1675510 0 05:53:59 - 0:00 /usr/lpp/OV/lbin/conf/ovconfd
root 847950 1675510 0 05:53:56 - 0:00 /usr/lpp/OV/bin/ovbbccb -nodaemon
root 884870 1675510 0 05:53:57 - 0:08 /usr/lpp/OV/lbin/perf/coda
root 921716 1675510 0 05:54:07 - 0:00 /usr/lpp/OV/lbin/eaagt/opcmsgi
root 1028286 1675510 0 05:54:07 - 0:05 /usr/lpp/OV/lbin/eaagt/opcmona
root 1171574 1675510 0 05:53:57 - 0:04 /usr/lpp/OV/lbin/eaagt/opcacta
root 1392852 1675510 0 05:53:59 - 0:02 /usr/lpp/OV/lbin/eaagt/opcmsga
root 1675510 1 0 05:53:56 - 0:08 /usr/lpp/OV/bin/ovcd <<<
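The repeated `ps -ef | grep <pid>` steps above can be folded into one pass over a single process snapshot, which also avoids losing short-lived parents between commands. A minimal sketch, assuming `ps` accepts repeated `-o` options; `ancestry` is a made-up name.

```shell
# Hypothetical helper: reads "pid ppid command..." lines on stdin and prints
# the chain from PID $1 up to the oldest ancestor present in the snapshot.
ancestry() {
  awk -v start="$1" '
    {
      pid = $1; ppid[pid] = $2
      $1 = ""; $2 = ""; sub(/^ +/, "")   # keep only the command text
      cmd[pid] = $0
    }
    END {
      p = start
      while (p in ppid) {
        print p, cmd[p]
        next_p = ppid[p]
        delete ppid[p]                   # guards against loops (e.g. PID 0)
        p = next_p
      }
    }'
}

# Example:
#   ps -e -o pid= -o ppid= -o args= | ancestry 405536
```

Fed with the four processes traced above, it prints the same chain in one go: sort -rn, head -20, opcacta, ovcd.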
ASKER
To clarify what I meant by (root 1171574 1675510 0 05:53:57 - 0:04 /usr/lpp/OV/lbin/eaagt/opcacta) being the constant PPID, see the example below.
sort -rn and head -20 get different PIDs, but they all lead back to the same constant PPID 1171574.
# ps -ef | grep 1273900
root 1273900 1347826 0 18:15:04 - 0:00 sort -rn
# ps -ef | grep 1347826
root 1347826 1171574 0 18:15:04 - 0:00 head -20
# ps -ef | grep 1171574
root 1171574 1675510 0 05:53:57 - 0:04 /usr/lpp/OV/lbin/eaagt/opcacta
# ps -ef | grep 1675510
root 1675510 1 0 05:53:56 - 0:08 /usr/lpp/OV/bin/ovcd
Please advise on the next step for managing this issue.
ASKER
Another thing worth mentioning is that I only killed the sort -rn processes which had 1 as the PPID.
The sort -rn processes which have an actual PPID (other than 1) start and stop automatically and get a new PID every time they start again.
OK,
all this comes from OpenView.
Let's see:
EaAgt is the Event Action Agent Application, and opcacta is the Action Agent itself.
The parent of all this, ovcd, is the OpenView Control Daemon.
Now that you've identified opcacta as the culprit, you can try to restart it.
Use the following with caution, because I'm only familiar with NetView, and OpenView seems rather different!
Issue ovc -stop opcacta and ovc -start opcacta
Check the new status with opcagt -status
Is opcacta running? Are new "sort" hooligans coming up?
You could just as well stop and start the whole Agent Subsystem and clean up its temp files in between.
1. opcagt -kill
2. Kill all remaining "opc..." processes, if any.
3. Remove all files under "/var/opt/OV/tmp/OpC"
Note: Not sure if this directory exists with HPOV. If it doesn't, search for something like "/usr/lpp/OV/tmp/OpC" or "/usr/opt/OV/tmp/OpC"
4. opcagt -start
Hope this helps. If any of the above commands does not exist or complains about bad syntax - sorry for that, but it's not NetView!
In such a case you will have to consult your HPOV docs - or try to restart the whole OpenView application, which should be something like
ovc -stop ovcd
ovc -start ovcd
Attention! All HPOV application windows will close!
If the latter doesn't exist or work either - sorry again, please check the docs or see your HPOV admin.
wmp
What I forgot: It could well be that HPOV is manageable via smit!
Open smit (or smitty) and search for HPOV, either under "Communications Applications and Services" or "Applications".
If it's there, see what you can do. At least restarting the whole application should be possible!
Good luck!
ASKER
I stopped the ovcd process and restarted it, but the sort -rn issue is still there. I will work further with the HPOV team.
ASKER
The HPOV team made changes to their application template on their end. Thanks.
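The thread does not say what the HPOV team changed. Purely as an illustration: the processes traced earlier (`du -xak /mnt/sapmnt`, `sort -rn`, `head -20`) look like one pipeline, and a sort over a very large listing spills into temporary files. Assuming the local `sort` supports `-T` to choose its temp directory (common, but check the man page), a sketch that keeps that spill out of /var could look like this. `largest_entries` and `/scratch` are made-up names.

```shell
# Hypothetical wrapper, not the actual HPOV template.
# $1 = directory to measure
# $2 = how many entries to show
# $3 = scratch directory for sort's temporary files
largest_entries() {
  du -xak "$1" 2>/dev/null | sort -rn -T "$3" | head -n "$2"
}

# Example:
#   largest_entries /mnt/sapmnt 20 /scratch
```

Running such a job less often than it takes to finish, or dropping `-a` so that only directories are listed, would also shrink the amount of data sort has to hold.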
By default the only files which could grow significantly in /var/tmp are the SNMP-related logs snmpdv3.log, snmpmibd.log and aixmibd.log.
Which other files do you find in /var/tmp? If in doubt, please post an ls -l sample!
Considerably more growth can happen in /var/adm and /var/spool! /var/adm holds the wtmp file, which can grow very big over time because it records logins and logoffs, and sometimes remote machines try to log in at short intervals in an automated way, maybe even with malicious intent. /var/spool holds the logs of sendmail and all the print queues and their logs, which can grow along with printing activity and print job size.
wmp
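To see which of these candidates is actually large on a given system, a size filter over the whole filesystem is usually quicker than checking each path by hand. A minimal sketch using only POSIX `find` options; `big_files` is a made-up name.

```shell
# Hypothetical helper: list regular files larger than $2 KB under $1,
# without crossing into other filesystems.
# find -size counts in 512-byte blocks, hence the factor of 2.
big_files() {
  find "$1" -xdev -type f -size +"$(( $2 * 2 ))" 2>/dev/null
}

# Example: files over 100 MB anywhere in /var
#   big_files /var 102400
```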