Solved

pdflush on Linux

Posted on 2008-02-01
Medium Priority
4,326 Views
Last Modified: 2013-12-16
I'm running a server which continuously receives data over a network,
processes the data, and stores the processed data in a large
database.  The database consists of multiple very large files (> 50 GB),
and uses a large cache to work around I/O bottlenecks.

But what happens is that the system starts to slow down horribly
despite database cache, and the iostat monitoring tool along with atop
indicates constant disk activity.  When I shut the server down, disk
activity continues for some time, even though the server is off.

I'm wondering if this is an issue with pdflush.  There's no utility on
Linux to indicate which process is actually doing the disk I/O, but I
suspect pdflush is the cause of the performance bottleneck.  My
questions are:

1. Is pdflush a likely candidate for this problem?

2. If so, is there anything I can do about it, such as disabling
pdflush or changing how often it flushes pages to disk?
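(On the "which process is doing the I/O" point: 2.6 kernels do expose the vm.block_dump sysctl, which logs each block I/O with the responsible process name to the kernel ring buffer. A rough sketch, assuming root and a kernel that provides block_dump:)

```shell
# Sketch: attribute block I/O to processes via vm.block_dump (2.6 kernels).
# Each I/O is logged to the kernel ring buffer while it is enabled.
if [ -w /proc/sys/vm/block_dump ]; then
    echo 1 > /proc/sys/vm/block_dump
    sleep 5
    # Lines look like "pdflush(123): WRITE block 456 on sda1"
    dmesg | grep -E 'READ|WRITE|dirtied' | tail -20
    echo 0 > /proc/sys/vm/block_dump
else
    echo "block_dump not available or not writable (need root)"
fi
```

Leaving block_dump enabled is noisy, so it is best switched off again immediately after sampling.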
Question by:chsalvia
10 Comments
 
LVL 46

Expert Comment

by:Kent Olsen
ID: 20800551
Hi chsalvia,

pdflush may be involved, but my expectation would be that the root cause is index updating.

If you're updating large tables with large datasets, it's going to take some time to clean up the indexes.



Good Luck,
Kent
 
LVL 35

Expert Comment

by:Duncan Roe
ID: 20800915
Following on from Kdo's excellent post, I suspect you are waiting for disk reads more than disk writes. Since you are reading random blocks from a file much larger than installed RAM, more often than not a disk read will be required (rather than the block being found in cache).
Statistically, you should see an improvement by increasing installed RAM, but how much of an improvement I would not like to estimate.
How much RAM does your system have? How much could you put on? (using largest readily-available DIMMs in all slots)
 
LVL 35

Expert Comment

by:Duncan Roe
ID: 20801038
"the system starts to slow down horribly" - is that as more users get started, or is it a constant load from the start?

 

Author Comment

by:chsalvia
ID: 20801651
I'm pretty confident this is not related to the DB itself (e.g. index updating), and has something to do with the OS cache, or pdflush.

The reason I think this is that atop reports consistent disk activity even AFTER the server is shut down, and the only process which shows any activity at that point is pdflush.  The server and DB are not at all active.

So basically, it seems as though pdflush is always causing 100% disk activity, which seriously slows down the program.  

Also, the setup is a RAID 5 with 16GB ram.
 
LVL 35

Expert Comment

by:Duncan Roe
ID: 20803451
16GB is respectable - is that top weight or could it hold 32?
The function of pdflush is to write out "dirty" buffers - anything that needs writing to disk. If it keeps laboring after the application is closed down, maybe there was a ton of outstanding disk writes. Should there be a *lot* of disk writing going on? (high ratio of updates to reads)
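A quick way to check whether there really is a backlog of outstanding writes is to watch the writeback counters in /proc/meminfo:

```shell
# Dirty     = data modified in memory, not yet written to disk
# Writeback = data actively being written out right now
# Persistently large values here mean pdflush has a backlog.
grep -E '^(Dirty|Writeback):' /proc/meminfo
```

Sampling this every few seconds while the application runs (and again after shutdown) would show how big the backlog is and how long it takes to drain.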
 

Author Comment

by:chsalvia
ID: 20804894
It's upgradeable to 32, but right now only has 16.

The application is IO intensive by nature, so it's not surprising that a lot of disk writing is going on.  I'm just wondering if there's anything I can do (short of purchasing more RAM/faster drives) which might improve performance somewhat.  pdflush is very active when the application is running, and remains active for some time after the application closes.  Is there a way to decrease the frequency that pdflush syncs to disk?

Also, since the DB does its own caching, maybe the OS caching is unnecessary and just getting in the way.  Is this a possibility, and if so, is it possible to bypass OS caching?
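(On bypassing the OS cache: opening files with O_DIRECT does exactly that, and many databases expose it as a configuration option - check your DB's documentation, as the option name varies. A rough illustration with GNU dd, assuming a filesystem that supports O_DIRECT - tmpfs, for one, does not:)

```shell
# Write 16 MiB straight to disk, bypassing the page cache (O_DIRECT),
# then clean up. Pages written this way never become "dirty" cache,
# so pdflush has nothing to flush for them.
dd if=/dev/zero of=./direct_test.bin bs=1M count=16 oflag=direct
rm -f ./direct_test.bin
```

The trade-off is that O_DIRECT writes must be aligned and lose all kernel read-ahead and write coalescing, so it only helps when the application (here, the DB) does its own caching well.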
 
LVL 35

Expert Comment

by:Duncan Roe
ID: 20806583
Googling for pdflush turned up a couple of items regarding excessive CPU use by pdflush - but those concerned 2.6.6 in 2004. Still looking...
 
LVL 35

Expert Comment

by:Duncan Roe
ID: 20806773
pdflush had a 1-line change effective 2.6.23.2 or so (I'd just get 2.6.24 now it's out). I haven't found out yet what problem it's supposed to address, still looking...
 
LVL 35

Expert Comment

by:Duncan Roe
ID: 20806874
This discussion thread entitled "Raid performance problems (pdflush / raid5 eats 100%)" is fairly recent: http://www.mail-archive.com/linux-raid@vger.kernel.org/msg09213.html
I don't suggest your problem is the same, but you might find the responder's suggestions useful (or you may not).
 
LVL 35

Accepted Solution

by:Duncan Roe (earned 2000 total points)
ID: 20807013
You can google for pdflush as well as I can - start at http://www.google.com/linux (or your national equivalent, e.g. com.au for Australia).
From another thread: there are some user controls available to you regarding pdflush which you may like to try. In particular, you can increase how old a block must become before pdflush will write it out - this may help with your database's caching mechanism. You can also increase the memory limit before dirty blocks are forced out.
This is documented in the Linux source directory on your system, in the file Documentation/filesystems/proc.txt. I've attached the 2.6.24 version of the relevant section (the 2.6.22.9 that I run is actually the same).
To change any of these values, use "echo >", e.g. on my system (also showing current values):

12:55:08# for i in *;do echo -e "`cat $i`\t$i";done
0       block_dump
5       dirty_background_ratio
3000    dirty_expire_centisecs
10      dirty_ratio
500     dirty_writeback_centisecs
0       drop_caches
0       laptop_mode
0       legacy_va_layout
256     256     lowmem_reserve_ratio
65536   max_map_count
7604    min_free_kbytes
5       min_slab_ratio
1       min_unmapped_ratio
2       nr_pdflush_threads
0       overcommit_memory
50      overcommit_ratio
3       page-cluster
0       panic_on_oom
0       percpu_pagelist_fraction
1       stat_interval
60      swappiness
100     vfs_cache_pressure
0       zone_reclaim_mode
12:55:38# echo 10 >dirty_background_ratio
12:55:56# for i in *;do echo -e "`cat $i`\t$i";done
0       block_dump
10      dirty_background_ratio
3000    dirty_expire_centisecs
10      dirty_ratio
500     dirty_writeback_centisecs
0       drop_caches
0       laptop_mode
0       legacy_va_layout
256     256     lowmem_reserve_ratio
65536   max_map_count
7604    min_free_kbytes
5       min_slab_ratio
1       min_unmapped_ratio
2       nr_pdflush_threads
0       overcommit_memory
50      overcommit_ratio
3       page-cluster
0       panic_on_oom
0       percpu_pagelist_fraction
1       stat_interval
60      swappiness
100     vfs_cache_pressure
0       zone_reclaim_mode

======================

I couldn't find what issue the 2.6.23 patch addresses
2.4 /proc/sys/vm - The virtual memory subsystem
-----------------------------------------------
 
The files  in  this directory can be used to tune the operation of the virtual
memory (VM)  subsystem  of  the  Linux  kernel.
 
vfs_cache_pressure
------------------
 
Controls the tendency of the kernel to reclaim the memory which is used for
caching of directory and inode objects.
 
At the default value of vfs_cache_pressure=100 the kernel will attempt to
reclaim dentries and inodes at a "fair" rate with respect to pagecache and
swapcache reclaim.  Decreasing vfs_cache_pressure causes the kernel to prefer
to retain dentry and inode caches.  Increasing vfs_cache_pressure beyond 100
causes the kernel to prefer to reclaim dentries and inodes.
 
dirty_background_ratio
----------------------
 
Contains, as a percentage of total system memory, the number of pages at which
the pdflush background writeback daemon will start writing out dirty data.
 
dirty_ratio
-----------------
 
Contains, as a percentage of total system memory, the number of pages at which
a process which is generating disk writes will itself start writing out dirty
data.
 
dirty_writeback_centisecs
-------------------------
 
The pdflush writeback daemons will periodically wake up and write `old' data
out to disk.  This tunable expresses the interval between those wakeups, in
100'ths of a second.
 
Setting this to zero disables periodic writeback altogether.
 
dirty_expire_centisecs
----------------------
 
This tunable is used to define when dirty data is old enough to be eligible
for writeout by the pdflush daemons.  It is expressed in 100'ths of a second. 
Data which has been dirty in-memory for longer than this interval will be
written out next time a pdflush daemon wakes up.
 
legacy_va_layout
----------------
 
If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel
will use the legacy (2.4) layout for all processes.
 
lower_zone_protection
---------------------
 
For some specialised workloads on highmem machines it is dangerous for
the kernel to allow process memory to be allocated from the "lowmem"
zone.  This is because that memory could then be pinned via the mlock()
system call, or by unavailability of swapspace.
 
And on large highmem machines this lack of reclaimable lowmem memory
can be fatal.
 
So the Linux page allocator has a mechanism which prevents allocations
which _could_ use highmem from using too much lowmem.  This means that
a certain amount of lowmem is defended from the possibility of being
captured into pinned user memory.
 
(The same argument applies to the old 16 megabyte ISA DMA region.  This
mechanism will also defend that region from allocations which could use
highmem or lowmem).
 
The `lower_zone_protection' tunable determines how aggressive the kernel is
in defending these lower zones.  The default value is zero - no
protection at all.
 
If you have a machine which uses highmem or ISA DMA and your
applications are using mlock(), or if you are running with no swap then
you probably should increase the lower_zone_protection setting.
 
The units of this tunable are fairly vague.  It is approximately equal
to "megabytes," so setting lower_zone_protection=100 will protect around 100
megabytes of the lowmem zone from user allocations.  It will also make
those 100 megabytes unavailable for use by applications and by
pagecache, so there is a cost.
 
The effects of this tunable may be observed by monitoring
/proc/meminfo:LowFree.  Write a single huge file and observe the point
at which LowFree ceases to fall.
 
A reasonable value for lower_zone_protection is 100.
 
page-cluster
------------
 
page-cluster controls the number of pages which are written to swap in
a single attempt.  The swap I/O size.
 
It is a logarithmic value - setting it to zero means "1 page", setting
it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
 
The default value is three (eight pages at a time).  There may be some
small benefits in tuning this to a different value if your workload is
swap-intensive.
 
overcommit_memory
-----------------
 
Controls overcommit of system memory, possibly allowing processes
to allocate (but not use) more memory than is actually available.
 
 
0       -       Heuristic overcommit handling. Obvious overcommits of
                address space are refused. Used for a typical system. It
                ensures a seriously wild allocation fails while allowing
                overcommit to reduce swap usage.  root is allowed to
                allocate slightly more memory in this mode. This is the
                default.
 
1       -       Always overcommit. Appropriate for some scientific
                applications.
 
2       -       Don't overcommit. The total address space commit
                for the system is not permitted to exceed swap plus a
                configurable percentage (default is 50) of physical RAM.
                Depending on the percentage you use, in most situations
                this means a process will not be killed while attempting
                to use already-allocated memory but will receive errors
                on memory allocation as appropriate.
 
overcommit_ratio
----------------
 
Percentage of physical memory size to include in overcommit calculations
(see above.)
 
Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100)
 
        swapspace = total size of all swap areas
        physmem = size of physical memory in system
 
nr_hugepages and hugetlb_shm_group
----------------------------------
 
nr_hugepages configures number of hugetlb page reserved for the system.
 
hugetlb_shm_group contains group id that is allowed to create SysV shared
memory segment using hugetlb page.
 
hugepages_treat_as_movable
--------------------------
 
This parameter is only useful when kernelcore= is specified at boot time to
create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages
are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero
value written to hugepages_treat_as_movable allows huge pages to be allocated
from ZONE_MOVABLE.
 
Once enabled, the ZONE_MOVABLE is treated as an area of memory the huge
pages pool can easily grow or shrink within. Assuming that applications are
not running that mlock() a lot of memory, it is likely the huge pages pool
can grow to the size of ZONE_MOVABLE by repeatedly entering the desired value
into nr_hugepages and triggering page reclaim.
 
laptop_mode
-----------
 
laptop_mode is a knob that controls "laptop mode". All the things that are
controlled by this knob are discussed in Documentation/laptop-mode.txt.
 
block_dump
----------
 
block_dump enables block I/O debugging when set to a nonzero value. More
information on block I/O debugging is in Documentation/laptop-mode.txt.
 
swap_token_timeout
------------------
 
This file contains valid hold time of swap out protection token. The Linux
VM has token based thrashing control mechanism and uses the token to prevent
unnecessary page faults in thrashing situation. The unit of the value is
second. The value would be useful to tune thrashing behavior.
 
drop_caches
-----------
 
Writing to this will cause the kernel to drop clean caches, dentries and
inodes from memory, causing that memory to become free.
 
To free pagecache:
        echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes:
        echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes:
        echo 3 > /proc/sys/vm/drop_caches
 
As this is a non-destructive operation and dirty objects are not freeable, the
user should run `sync' first.
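As a usage note on the tunables above: values set with "echo >" do not survive a reboot. To inspect the writeback-related values in one go, and to persist a change, something like this (the 6000 figure is only an example value, not a recommendation):

```shell
# Show the current writeback tunables in one pass:
for f in dirty_background_ratio dirty_ratio \
         dirty_expire_centisecs dirty_writeback_centisecs; do
    printf '%-26s' "$f"
    cat /proc/sys/vm/$f
done
# To persist a change across reboots, add a line such as
#   vm.dirty_expire_centisecs = 6000
# to /etc/sysctl.conf (read at boot on most distributions).
```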

