Unusual Disk Reads from Primary Disks


We have a server running Red Hat Linux 9.0
Kernel: 2.4.20-20.9bigmem #1

An external RAID,
and two internal drives in RAID-0 (sharing the same cable).

All our data and applications are mounted on the external RAID.

The primary drives are for the root file system only. To be more precise, here is the output of df:

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md2              34336980   1965536  30627192   7% /
/dev/md0               2063440     65396   1893228   4% /boot
none                   3096472         0   3096472   0% /dev/shm
tiny:/home            75854568  27651104  44350280  39% /ftp
//fax/doze            77199360  50599936  26599424  66% /doze
/dev/Volume00/lvol1  203147960  64257220 128404980  34% /appl
/dev/Volume00/tmp     10485436    113016  10372420   2% /tmp
/dev/Volume00/proedi  10485436    867228   9618208   9% /appl/proedi
                      52427200  34969832  17457368  67% /appl/backups/sys3
                      36298264   3120520  31333888  10% /appl/backups/sys4b
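
As a quick cross-check of which device serves a given directory, here is a minimal Python 3 sketch that parses df output (the sample rows are copied from the listing above) and picks the longest mount point containing a path. It is an illustration only; the function names are made up for this example.

```python
# Sketch: map a path to the filesystem (and device) that backs it,
# using df output as input. Sample rows are copied from the df listing.

DF_OUTPUT = """\
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md2              34336980   1965536  30627192   7% /
/dev/md0               2063440     65396   1893228   4% /boot
none                   3096472         0   3096472   0% /dev/shm
tiny:/home            75854568  27651104  44350280  39% /ftp
/dev/Volume00/lvol1  203147960  64257220 128404980  34% /appl
/dev/Volume00/tmp     10485436    113016  10372420   2% /tmp
/dev/Volume00/proedi  10485436    867228   9618208   9% /appl/proedi
"""


def parse_df(text):
    """Return {mount point: device} from df output.

    df wraps long device names onto their own line and puts the numbers on
    the next line; such pairs are rejoined here. Rows whose device name is
    missing entirely (as with the two /appl/backups rows in the pasted
    listing) are skipped.
    """
    mounts = {}
    pending = None  # device name waiting for its wrapped numbers line
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) == 1:
            pending = fields[0]
        elif len(fields) == 6:
            mounts[fields[5]] = fields[0]
            pending = None
        elif len(fields) == 5 and pending:
            mounts[fields[4]] = pending
            pending = None
    return mounts


def backing_mount(path, mounts):
    """Return the longest mount point that contains `path`."""
    best = "/"
    for mp in mounts:
        prefix = mp.rstrip("/") + "/"
        if (path == mp or path.startswith(prefix)) and len(mp) > len(best):
            best = mp
    return best
```

With this data, backing_mount("/var/log/messages", parse_df(DF_OUTPUT)) returns "/", i.e. /dev/md2: /var is not a separate mount here, so anything read from or written to /var lives on the primary disks.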

Our usual load average is between 2 and 4. Now and then it peaks at 10, but it usually comes down within minutes.

A couple of weeks ago we started having a problem. Roughly every 7 days (which is another interesting factor), disk reads from the primary disks increase significantly, causing I/O wait for all the other web applications we run on our web server (shell scripts and so on). We have yet to find what exactly is causing all these reads, or why it always happens 6-7 days after a reboot. Each time it happens we have no choice but to reboot the server, because reads from dev3-0 and dev3-1 climb to 200-300 blocks/s and everything else hangs.

dev3-0 and dev3-1 are our primary drives. They are IDE drives and share the same IDE cable.

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev3-0            4.50        15.21        53.90      35564     126016
dev3-1            4.51        16.27        53.90      38032     126016
dev8-0          131.01      1928.25       937.88    4507854    2192568
dev8-1          131.19      2000.76       937.88    4677356    2192568

dev8 is the external RAID.
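
To watch the primaries against the 200-300 blocks/s level mentioned above, a minimal Python 3 sketch like this one can pull the Blk_read/s column out of an iostat device table (for example one captured periodically with `iostat -d 5`). The sample is the table above; the threshold and function names are illustrative assumptions.

```python
# Sketch: read the Blk_read/s column from an iostat device table and flag
# devices reading faster than a chosen threshold.

SECTOR_BYTES = 512  # iostat "blocks" are 512-byte sectors on 2.4+ kernels

IOSTAT_OUTPUT = """\
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dev3-0            4.50        15.21        53.90      35564     126016
dev3-1            4.51        16.27        53.90      38032     126016
dev8-0          131.01      1928.25       937.88    4507854    2192568
dev8-1          131.19      2000.76       937.88    4677356    2192568
"""


def parse_iostat(text):
    """Return {device: Blk_read/s} from an iostat device table."""
    rates = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[0] != "Device:":
            rates[fields[0]] = float(fields[2])
    return rates


def busy_readers(rates, devices, threshold):
    """Return the devices in `devices` reading more than `threshold` blocks/s."""
    return sorted(d for d in devices if rates.get(d, 0.0) > threshold)


def blocks_to_kib(blocks):
    """Convert a count of 512-byte iostat blocks to KiB."""
    return blocks * SECTOR_BYTES / 1024
```

Because iostat counts 512-byte sectors as blocks on 2.4 kernels, blocks_to_kib(200) and blocks_to_kib(300) give 100 and 150 KiB/s: a modest transfer rate for a single disk.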

I hope I have given enough information to picture the scenario; if not, I am more than willing to provide more. If anyone out there has encountered a problem like this, or knows something that could cause it, all comments are greatly appreciated.




DevZer0 (author) commented:
Here is some more info. We do have a swap partition on our primary drives, but it is set to be turned off when the RAID comes up, since there is swap on the RAID. We can't know for sure whether the primary swap is really unused, though: as far as I know, free -m will not show which swap device is being used. That is one theory I came up with, but I have no evidence to confirm it yet.

Normally there are one or more swap entries in /etc/fstab that show which partitions are used for swap, e.g.

/dev/hda7               swap                    swap    defaults        0 0

This line means partition 7 on the primary hard disk is used for swap, so every time you start or restart your Linux system, it reads fstab and enables that swap for you.
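
A minimal Python 3 sketch for listing which devices fstab configures as swap. Apart from the /dev/hda7 line, the sample fstab is hypothetical, as is the function name.

```python
# Sketch: list the devices that /etc/fstab configures as swap.
# The sample fstab is hypothetical apart from the /dev/hda7 line.

SAMPLE_FSTAB = """\
/dev/md2              /        ext3  defaults  1 1
/dev/hda7             swap     swap  defaults  0 0
# /dev/hdb7           swap     swap  defaults  0 0
/dev/Volume00/swap1   swap     swap  defaults  0 0
"""


def fstab_swap_devices(text):
    """Return the first field of every non-comment fstab line of type swap."""
    devices = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "swap":
            devices.append(fields[0])
    return devices
```

Keep in mind that fstab only says what gets enabled at boot; what is active right now is shown by swapon -s, which reads /proc/swaps.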
There may be many reasons why your hard disk I/O increases. For example, someone scanning your web server for vulnerabilities could increase your I/O, because they probe the server in many ways that keep the system busy, and your system is installed on dev3-0 and dev3-1.

Your application requesting system resources and generating a lot of system I/O may also be the reason; check your web application to make sure that is not the case.

I would suggest restarting the web service daily, or twice a week, until you find the cause.

I don't mean restarting or rebooting the server; I mean restarting the web (HTTP) service, like this:

/usr/bin/apachectl stop       <--- stops the web/HTTP service
/usr/bin/apachectl start      <--- starts the web/HTTP service without SSL
/usr/bin/apachectl startssl   <--- starts the web/HTTP service with SSL

I have an application running on our web server that creates lots of dead child processes and hangs the web server every 10-20 days. My programmer is still looking for the problem, but restarting the HTTP service kills all the dead children and keeps the web server running. I know this is not a solution, but at least I can keep my server alive.

I don't know whether restarting the service will help you, but at least you can try.

If this is happening every 7 days, I'd be inclined to look for a cron job doing something at weekly intervals. In a standard RH 9 configuration only the rpm log is rotated weekly (see /etc/logrotate.d/rpm), but you could have configured something else (such as the web server logs) to rotate on a weekly basis.
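
One way to hunt for such jobs is to scan /etc/crontab and the per-user crontabs for anything scheduled on a single weekday. The Python 3 sketch below is a heuristic, not a full cron parser; the sample crontab, the script paths in it, and the function name are hypothetical.

```python
import re

# A day-of-week field naming exactly one day (0 and 7 both mean Sunday).
SINGLE_DAY = re.compile(r"^([0-7]|sun|mon|tue|wed|thu|fri|sat)$", re.IGNORECASE)

# Hypothetical sample in /etc/crontab format (with a user column).
SAMPLE_CRONTAB = """\
SHELL=/bin/bash
02 4 * * * root run-parts /etc/cron.daily
22 4 * * 0 root run-parts /etc/cron.weekly
30 2 * * sun root /usr/local/bin/backup.sh
@weekly /usr/local/bin/report.sh
"""


def weekly_jobs(crontab_text):
    """Return crontab lines that look like weekly jobs.

    Flags @weekly entries, anything that runs /etc/cron.weekly, and lines
    whose day-of-week field (the 5th field) names a single day.
    """
    hits = []
    for raw in crontab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if fields[0] == "@weekly" or "cron.weekly" in line:
            hits.append(line)
        elif len(fields) >= 6 and "=" not in fields[0] and SINGLE_DAY.match(fields[4]):
            hits.append(line)
    return hits
```

Feed it /etc/crontab and the output of `crontab -l -u USER` for each user. Note that a job like this fires on a fixed weekday, so it would only line up with "7 days after a reboot" if the reboots themselves fall on the same weekday.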
DevZer0 (author) commented:
Hi, thanks for the comments, much appreciated. I checked cron.weekly, and we do have weekly cron jobs, but they have nothing to do with the primary drives. Apache is also on our external RAID. Pretty much what's on the root file system is:

/var      /mnt
/etc      /opt
/usr     /root
/bin     /sbin
/boot     /snap

I know for a fact that /var has a whole bunch of logs, such as FTP logs, telnet logs and other error logs, but the Apache logs and our application logs are all on the external RAID.

Do the log dates in /var/log match up with when the disk load occurs? How about the Apache logs?
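
To line up file timestamps with the time of an incident, something like the following Python 3 sketch lists files under a directory (e.g. /var/log) whose timestamp falls in a window. Since the problem here is reads rather than writes, it can also check access times, assuming the filesystem is not mounted with noatime. The function name is made up for this example.

```python
import os


def files_touched_between(root, start, end, attr="st_mtime"):
    """Return files under `root` whose timestamp lies in [start, end].

    `attr` is "st_mtime" (last write) or "st_atime" (last read; only useful
    if the filesystem is not mounted with noatime). Times are epoch seconds.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                stamp = getattr(os.stat(path), attr)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if start <= stamp <= end:
                hits.append(path)
    return sorted(hits)
```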

Note that activity on the system devices could be a result of the use of swap space or temp areas, even though Apache, the DBs, and whatever else are located on the external RAID. You can check which devices are actually being used for swap with 'swapon -s'.
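
A minimal Python 3 sketch of that check: parse the swapon -s listing (same layout as /proc/swaps; sizes are in KiB) and report any active swap area on a primary-disk device. The sample listing is hypothetical, and treating /dev/hd* and /dev/md* as "primary" is an assumption about this particular box.

```python
# Sketch: parse `swapon -s` output (same format as /proc/swaps) and report
# which active swap areas live on the primary disks.

SAMPLE_SWAPS = """\
Filename                        Type            Size    Used    Priority
/dev/hda7                       partition       1048568 0       -1
/dev/Volume00/swap1             partition       2097144 29292   -2
"""

# Assumption: IDE (/dev/hd*) and md (/dev/md*) devices are the primary disks,
# while LVM volumes under /dev/Volume00 are on the external RAID.
PRIMARY_PREFIXES = ("/dev/hd", "/dev/md")


def active_swaps(text):
    """Return [(device, used_kib)] for every swap area listed."""
    swaps = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[0] != "Filename":
            swaps.append((fields[0], int(fields[3])))
    return swaps


def swaps_on_primary(swaps, prefixes=PRIMARY_PREFIXES):
    """Return the swap devices whose names start with a primary-disk prefix."""
    return [dev for dev, _used in swaps if dev.startswith(prefixes)]
```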
DevZer0 (author) commented:
The Apache logs are also on the external RAID, and the problem is an increase in read activity on the primary disks, not writes. So even if a small log were being written to /var/log (not the Apache or other application logs, since those are written to the RAID), it wouldn't cause an increase in disk reads. And since the two primary disks are mirrored, the only explanation I can think of is that something is actually trying to read data...

Well, /tmp and swap are on the primary disks and anything that used those resources would cause disk activity on the primaries, even though the application and its data resides on the external RAID. The key to solving this is going to be in finding a correlation between the high disk I/O on the primaries and something that happens every 7 days. Once you know what triggers the I/O it then becomes a matter of figuring out why it beats on the primary disks.
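
One hedged way to look for that correlation is to record when each incident started and when the box was last booted (for example from `last reboot`), then compare the uptime at each incident with the day of the week. The Python 3 sketch below does just that; the dates and the function name are made up.

```python
from datetime import datetime

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

# Hypothetical data: boot times and the start of each high-read incident.
BOOTS = [datetime(2004, 1, 1, 8, 0), datetime(2004, 1, 9, 9, 0)]
INCIDENTS = [datetime(2004, 1, 7, 20, 0), datetime(2004, 1, 15, 21, 0)]


def describe_incidents(boots, incidents):
    """Return (uptime in days, weekday name) for each incident.

    Assumes `boots` contains the boot that preceded every incident.
    """
    rows = []
    for when in incidents:
        last_boot = max(b for b in boots if b <= when)
        uptime_days = (when - last_boot).total_seconds() / 86400
        rows.append((round(uptime_days, 1), DAY_NAMES[when.weekday()]))
    return rows
```

In the made-up data, uptime is 6.5 days both times while the weekday changes, which would point at something that builds up after boot rather than a weekly cron job; if the weekday stayed constant instead, cron would be the better suspect.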
DevZer0 (author) commented:
/tmp and swap are not on the primary disks.
Oops, I missed that /tmp is on the RAID. Have you verified with 'swapon -s' that none of the swap is on a primary disk? Could something be using /var/tmp?
DevZer0 (author) commented:
----- swapon -s results -----

/dev/Volume00/swap1             partition 2097144 29292 -2
/dev/Volume00/swap2             partition 2097144 0 -3
/dev/Volume00/swap3             partition 2097144 0 -4
/dev/Volume00/swap4             partition 2097144 0 -5
/dev/Volume00/swap5             partition 2097144 0 -6
/dev/Volume00/swap6             partition 2097144 0 -7

It looks like only the RAID swaps are active, and /var/tmp is not used by anything; in fact, it is empty as well.

Any other ideas?
DevZer0 (author) commented:
We changed the cabling used by the primary disks this weekend. Since both primary drives were sharing the same cable, and since they are mirrored, our most reasonable theory was that sharing the same IDE controller and cable could increase I/O time while a mirror operation is in progress. We have yet to see the results, but so far no problems have been noticed.

I assume you are using RAID 1 (mirroring); having each disk as a master on a different IDE controller is a good thing, reliability-wise. Using two controllers is also good with respect to disk I/O rates, and that should help. But it still doesn't explain what's happening or why. I think you are going to have to look carefully at what happens over the seven-day period.
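
To test the mirror-operation theory, it may help to check /proc/mdstat while an incident is happening: a resync or recovery generates sustained I/O on the member disks and shows a progress line there. A minimal Python 3 sketch, assuming the primaries really are mirrored (RAID 1); the sample mdstat contents, disk names and function name are hypothetical.

```python
import re

# Hypothetical /proc/mdstat contents from a 2.4 kernel with a resync running.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hdb3[1] hda3[0]
      34337024 blocks [2/2] [UU]
      [===>.................]  resync = 17.4% (5974848/34337024) finish=21.6min speed=21870K/sec
md0 : active raid1 hdb1[1] hda1[0]
      2096384 blocks [2/2] [UU]

unused devices: <none>
"""


def arrays_rebuilding(mdstat_text):
    """Return {array: percent done} for md arrays that are resyncing or recovering."""
    progress = {}
    current = None  # the md array whose status lines we are reading
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = re.search(r"(?:resync|recovery)\s*=\s*([\d.]+)%", line)
        if status and current:
            progress[current] = float(status.group(1))
    return progress
```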
