pdflush on Linux

I'm running a server which continuously receives data over a network,
processes the data, and stores the processed data in a large
database.  The database consists of multiple files which are very
large (> 50 GB each).  The database has a large cache of its own, to get
around I/O bottlenecks.

But what happens is that the system starts to slow down horribly
despite database cache, and the iostat monitoring tool along with atop
indicates constant disk activity.  When I shut the server down, disk
activity continues for some time, even though the server is off.

I'm wondering if this is an issue with pdflush.  There's no utility on
Linux to indicate which process is actually doing the disk I/O, but I
suspect pdflush is the cause of the performance bottleneck.  My
questions are:

1. Is pdflush a likely candidate for this problem?

2. If so, is there anything I can do about it, such as disabling
pdflush, or modifying how often it flushes pages to disk?
Kent Olsen (Data Warehouse Architect / DBA) commented:
Hi chsalvia,

pdflush may be involved, but my expectation would be that the root cause is index updating.

If you're updating large tables with large datasets, it's going to take some time to clean up the indexes.

Good Luck,
Duncan Roe (Software Developer) commented:
Following on from Kdo's excellent post, I suspect you are waiting for disk reads more than for disk writes. Since you are reading random file blocks from a file much larger than installed RAM, a disk read will be required for each (instead of the block's being cached) more often than not.
Statistically, you should see an improvement by increasing installed RAM, but how much of an improvement I would not like to estimate.
How much RAM does your system have? How much could you put on? (using largest readily-available DIMMs in all slots)
Duncan Roe (Software Developer) commented:
"the system starts to slow down horribly" - is that as more users get started, or is it a constant load from the start?
chsalvia (Author) commented:
I'm pretty confident this is not related to the DB itself (e.g., index updating), and has something to do with the OS cache or pdflush.

The reason I think this is that atop reports consistent disk activity even AFTER the server is shut down, and the only process which shows any activity at that point is pdflush.  The server and DB are not at all active.

So basically, it seems as though pdflush is always causing 100% disk activity, which seriously slows down the program.  

Also, the setup is a RAID 5 with 16GB ram.
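One way to check which process is actually submitting the I/O is the vm.block_dump sysctl (described in the proc.txt excerpt attached later in this thread): it logs each block I/O request, with the submitting process name, to the kernel log. A minimal sketch, assuming a root shell and a kernel that provides this sysctl; the 10-second window is an arbitrary choice:

```shell
#!/bin/sh
# Sketch only: trace which processes submit block I/O via vm.block_dump.
# Requires root, since it writes a sysctl under /proc/sys/vm.
if [ "$(id -u)" -eq 0 ]; then
    echo 1 > /proc/sys/vm/block_dump   # log each I/O request to the kernel log
    sleep 10                           # let some disk activity accumulate
    echo 0 > /proc/sys/vm/block_dump   # stop logging
    # Each logged line names the process that read, wrote, or dirtied a block
    dmesg | grep -E 'READ|WRITE|dirtied' | tail -n 20
else
    echo "block_dump tracing needs root; skipping"
fi
```

If pdflush dominates the WRITE lines, that only tells you who flushed the pages; the "dirtied" lines show which process created the dirty data in the first place.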
Duncan Roe (Software Developer) commented:
16GB is respectable - is that the maximum, or could it hold 32?
The function of pdflush is to write out "dirty" buffers - anything that needs writing to disk. If it keeps laboring after the application is closed down, maybe there was a ton of outstanding disk writes. Should there be a *lot* of disk writing going on? (high ratio of updates to reads)
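The size of that backlog of outstanding writes is visible in /proc/meminfo. A quick sketch (the three-sample loop is an arbitrary choice): after the application is shut down, both numbers should fall toward zero as pdflush drains the queue.

```shell
# Sketch: sample the writeback backlog a few times.
# Dirty     = modified pages not yet queued for writeout
# Writeback = pages actively being written to disk right now
for i in 1 2 3; do
    grep -E '^(Dirty|Writeback):' /proc/meminfo
    sleep 1
done
```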
chsalvia (Author) commented:
It's upgradeable to 32, but right now only has 16.

The application is IO intensive by nature, so it's not surprising that a lot of disk writing is going on.  I'm just wondering if there's anything I can do (short of purchasing more RAM/faster drives) which might improve performance somewhat.  pdflush is very active when the application is running, and remains active for some time after the application closes.  Is there a way to decrease the frequency that pdflush syncs to disk?

Also, since the DB does its own caching, maybe the OS caching is unnecessary and just getting in the way.  Is this a possibility, and if so, is it possible to bypass OS caching?
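On bypassing the OS cache: per-file bypass does exist via the O_DIRECT open(2) flag, and many databases expose it as a configuration option (check yours). A hedged way to see the effect from the shell is dd's direct flag; testfile below is just a scratch path, and note that some filesystems (tmpfs, for one) reject O_DIRECT, hence the fallback branch.

```shell
# Sketch: buffered write vs O_DIRECT write with dd.
# testfile is a scratch path used only for this demonstration.
dd if=/dev/zero of=testfile bs=1M count=16 2>/dev/null
echo "buffered write done (data may still sit in the page cache)"
if dd if=/dev/zero of=testfile bs=1M count=16 oflag=direct 2>/dev/null; then
    echo "direct write done (page cache bypassed)"
else
    echo "O_DIRECT not supported on this filesystem"
fi
rm -f testfile
```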
Duncan Roe (Software Developer) commented:
Googling for pdflush discovered a couple of items regarding excessive CPU use by pdflush - but that was 2.6.6 in 2004. Still looking...
Duncan Roe (Software Developer) commented:
pdflush had a 1-line change effective 2.6.23 or so (I'd just get 2.6.24 now it's out). I haven't found out yet what problem it's supposed to address, still looking...
Duncan Roe (Software Developer) commented:
This discussion thread entitled "Raid performance problems (pdflush / raid5 eats 100%)" is fairly recent: http://www.mail-archive.com/linux-raid@vger.kernel.org/msg09213.html
I don't suggest your problem is the same, but you might find the responder's suggestions useful (or you may not).
Duncan Roe (Software Developer) commented:
You can google for pdflush as well as I can - start at http://www.google.com/linux (or your national equivalent, e.g. com.au for Australia).
On another thread, there are some user controls available to you regarding pdflush - you may like to try them. In particular, you can increase how old a block becomes before pdflush will write it out - this may help with your database's caching mechanism. You can also increase the memory limit before dirty blocks are forced out.
This is documented in the linux source directory on your system, file Documentation/filesystems/proc.txt. I've attached the 2.6.24 version of the relevant section (the version that I run is actually the same).
To change any of these values, use "echo >", e.g. on my system (also showing current values):

12:55:08# for i in *;do echo -e "`cat $i`\t$i";done
0       block_dump
5       dirty_background_ratio
3000    dirty_expire_centisecs
10      dirty_ratio
500     dirty_writeback_centisecs
0       drop_caches
0       laptop_mode
0       legacy_va_layout
256     256     lowmem_reserve_ratio
65536   max_map_count
7604    min_free_kbytes
5       min_slab_ratio
1       min_unmapped_ratio
2       nr_pdflush_threads
0       overcommit_memory
50      overcommit_ratio
3       page-cluster
0       panic_on_oom
0       percpu_pagelist_fraction
1       stat_interval
60      swappiness
100     vfs_cache_pressure
0       zone_reclaim_mode
12:55:38# echo 10 >dirty_background_ratio
12:55:56# for i in *;do echo -e "`cat $i`\t$i";done
0       block_dump
10      dirty_background_ratio
3000    dirty_expire_centisecs
10      dirty_ratio
500     dirty_writeback_centisecs
0       drop_caches
0       laptop_mode
0       legacy_va_layout
256     256     lowmem_reserve_ratio
65536   max_map_count
7604    min_free_kbytes
5       min_slab_ratio
1       min_unmapped_ratio
2       nr_pdflush_threads
0       overcommit_memory
50      overcommit_ratio
3       page-cluster
0       panic_on_oom
0       percpu_pagelist_fraction
1       stat_interval
60      swappiness
100     vfs_cache_pressure
0       zone_reclaim_mode


I couldn't find what issue the 2.6.23 patch addresses.
2.4 /proc/sys/vm - The virtual memory subsystem
The files in this directory can be used to tune the operation of the virtual
memory (VM) subsystem of the Linux kernel.

vfs_cache_pressure
------------------
Controls the tendency of the kernel to reclaim the memory which is used for
caching of directory and inode objects.
At the default value of vfs_cache_pressure=100 the kernel will attempt to
reclaim dentries and inodes at a "fair" rate with respect to pagecache and
swapcache reclaim.  Decreasing vfs_cache_pressure causes the kernel to prefer
to retain dentry and inode caches.  Increasing vfs_cache_pressure beyond 100
causes the kernel to prefer to reclaim dentries and inodes.

dirty_background_ratio
----------------------
Contains, as a percentage of total system memory, the number of pages at which
the pdflush background writeback daemon will start writing out dirty data.

dirty_ratio
-----------
Contains, as a percentage of total system memory, the number of pages at which
a process which is generating disk writes will itself start writing out dirty
data.

dirty_writeback_centisecs
-------------------------
The pdflush writeback daemons will periodically wake up and write `old' data
out to disk.  This tunable expresses the interval between those wakeups, in
100'ths of a second.
Setting this to zero disables periodic writeback altogether.

dirty_expire_centisecs
----------------------
This tunable is used to define when dirty data is old enough to be eligible
for writeout by the pdflush daemons.  It is expressed in 100'ths of a second. 
Data which has been dirty in-memory for longer than this interval will be
written out next time a pdflush daemon wakes up.

legacy_va_layout
----------------
If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel
will use the legacy (2.4) layout for all processes.

lower_zone_protection
---------------------
For some specialised workloads on highmem machines it is dangerous for
the kernel to allow process memory to be allocated from the "lowmem"
zone.  This is because that memory could then be pinned via the mlock()
system call, or by unavailability of swapspace.
And on large highmem machines this lack of reclaimable lowmem memory
can be fatal.
So the Linux page allocator has a mechanism which prevents allocations
which _could_ use highmem from using too much lowmem.  This means that
a certain amount of lowmem is defended from the possibility of being
captured into pinned user memory.
(The same argument applies to the old 16 megabyte ISA DMA region.  This
mechanism will also defend that region from allocations which could use
highmem or lowmem).
The `lower_zone_protection' tunable determines how aggressive the kernel is
in defending these lower zones.  The default value is zero - no
protection at all.
If you have a machine which uses highmem or ISA DMA and your
applications are using mlock(), or if you are running with no swap then
you probably should increase the lower_zone_protection setting.
The units of this tunable are fairly vague.  It is approximately equal
to "megabytes," so setting lower_zone_protection=100 will protect around 100
megabytes of the lowmem zone from user allocations.  It will also make
those 100 megabytes unavailable for use by applications and by
pagecache, so there is a cost.
The effects of this tunable may be observed by monitoring
/proc/meminfo:LowFree.  Write a single huge file and observe the point
at which LowFree ceases to fall.
A reasonable value for lower_zone_protection is 100.

page-cluster
------------
page-cluster controls the number of pages which are written to swap in
a single attempt.  The swap I/O size.
It is a logarithmic value - setting it to zero means "1 page", setting
it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
The default value is three (eight pages at a time).  There may be some
small benefits in tuning this to a different value if your workload is
swap-intensive.

overcommit_memory
-----------------
Controls overcommit of system memory, possibly allowing processes
to allocate (but not use) more memory than is actually available.
0       -       Heuristic overcommit handling. Obvious overcommits of
                address space are refused. Used for a typical system. It
                ensures a seriously wild allocation fails while allowing
                overcommit to reduce swap usage.  root is allowed to
                allocate slightly more memory in this mode. This is the
                default.
1       -       Always overcommit. Appropriate for some scientific
                applications.
2       -       Don't overcommit. The total address space commit
                for the system is not permitted to exceed swap plus a
                configurable percentage (default is 50) of physical RAM.
                Depending on the percentage you use, in most situations
                this means a process will not be killed while attempting
                to use already-allocated memory but will receive errors
                on memory allocation as appropriate.

overcommit_ratio
----------------
Percentage of physical memory size to include in overcommit calculations
(see above.)
Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100)
        swapspace = total size of all swap areas
        physmem = size of physical memory in system

nr_hugepages and hugetlb_shm_group
----------------------------------
nr_hugepages configures number of hugetlb page reserved for the system.
hugetlb_shm_group contains group id that is allowed to create SysV shared
memory segment using hugetlb page.

hugepages_treat_as_movable
--------------------------
This parameter is only useful when kernelcore= is specified at boot time to
create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages
are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero
value written to hugepages_treat_as_movable allows huge pages to be allocated
from ZONE_MOVABLE.
Once enabled, the ZONE_MOVABLE is treated as an area of memory the huge
pages pool can easily grow or shrink within. Assuming that applications are
not running that mlock() a lot of memory, it is likely the huge pages pool
can grow to the size of ZONE_MOVABLE by repeatedly entering the desired value
into nr_hugepages and triggering page reclaim.

laptop_mode
-----------
laptop_mode is a knob that controls "laptop mode". All the things that are
controlled by this knob are discussed in Documentation/laptop-mode.txt.

block_dump
----------
block_dump enables block I/O debugging when set to a nonzero value. More
information on block I/O debugging is in Documentation/laptop-mode.txt.

swap_token_timeout
------------------
This file contains valid hold time of swap out protection token. The Linux
VM has token based thrashing control mechanism and uses the token to prevent
unnecessary page faults in thrashing situation. The unit of the value is
second. The value would be useful to tune thrashing behavior.

drop_caches
-----------
Writing to this will cause the kernel to drop clean caches, dentries and
inodes from memory, causing that memory to become free.
To free pagecache:
        echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes:
        echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes:
        echo 3 > /proc/sys/vm/drop_caches
As this is a non-destructive operation and dirty objects are not freeable, the
user should run `sync' first.
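As a worked example of the recipe above (root is required to write the sysctl; 3 is the "everything" value from the excerpt, and dirty data must be synced first because dirty objects are not freeable):

```shell
#!/bin/sh
# Sketch: free clean pagecache, dentries and inodes per the doc above.
if [ "$(id -u)" -eq 0 ]; then
    sync                               # make dirty pages clean, hence freeable
    echo 3 > /proc/sys/vm/drop_caches  # drop pagecache + dentries + inodes
fi
# Cached should shrink noticeably after the drop
grep -E '^(MemFree|Cached):' /proc/meminfo
```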
