O_SYNC/fsync interaction with disk controllers with battery-backed write cache

Posted on 2008-11-05
Last Modified: 2013-11-14
First, let me start off by stating what will soon be apparent: I am not a C developer but an infrastructure professional. ;-)

The Problem:

We have an in-house-developed application. One of the application's main responsibilities is to log incoming network transactions out to disk.  The inbound messages are relatively small (a few hundred bytes).  The application distributes the inbound messages across several threads that write to several open files (30 on average).  While the application is processing the incoming network streams, it is not uncommon to see around 2,000-3,000 IOPS being generated.  It is important that this data be written to disk as soon as possible, both to maintain a consistent application state and to know where each transaction was within the overall application.

In order to achieve and sustain that IOPS rate, it was decided to use an 8-drive RAID-10 configuration behind a battery-backed write cache.  In initial testing we discovered that with one thread running (1 open file) we were achieving acceptable latency while writing to disk, ~100us.  However, as soon as a second thread was added to the same block device, latency immediately increased to 8ms.

I have checked the usual suspects on the system side: write cache is indeed enabled on the controllers, the appropriate kernel tunings have been made to the I/O schedulers and vm dirty_ratio settings, filesystem tuning recommendations applied, etc., and all appears to be as one would expect, but the best we seem to achieve is 6-8ms response time.  However, when I use Bonnie++ to run tests, I get exceptional results.

In working closely with the developers, it was determined that they are using O_SYNC and/or fsync (and the Java equivalent) within their code to assure that the data is committed to disk before execution of the next step.  They put together a small sample test program (attached) for me to do my baselining.  I have actually recompiled this test program without the sync calls, and as I suspected, I can run multiple threads (up to 32 so far) and achieve latency numbers of around ~100ms as the writes go to the system buffers and are processed through the I/O scheduler.

So that was a long way of getting to the question(s), but here they are:

By using the O_SYNC or fsync system calls, is the write cache bypassed somehow?

If so, is there a better system call that would assure the data is in the hardware write cache (and not just in the system I/O buffers) before the program moves on to its next step?

Am I missing something? ;-)

System OS: SLES10 and RHEL 5.1
Hardware: HP DL380 w/p400i 512MB BBWC SAS 15K 76GB Drives
also tested with HDS AMS1000 with 4GB of CACHE 15K 76GB Fiber Channel Drives.
Filesystem: EXT3

Thanks for any help you can provide.



#include <pthread.h>
#include <iostream>
#include <string>
#include <sstream>
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <unistd.h>
#include <fstream>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <time.h>    /* clock_gettime (link with -lrt on older glibc) */
#include <limits.h>  /* INT_MAX / INT_MIN */

using namespace std;

static int g_bufferSize = 1000;
static int g_threadCount = 1;
static int g_maxLoops = 2000;
static char* g_buffer;
static string g_filePath;

static void
printTimings(int64_t bt, int64_t et, int64_t loops) {
        int64_t duration = et - bt;
        if (duration == 0)
                duration = 1;
        int64_t each = duration / loops;
        string timing = each >= 1000000 ? "milliseconds" : each >= 1000 ? "microseconds" : "nanoseconds";
        double divisor = each >= 1000000 ? 1000000 : each >= 1000 ? 1000 : 1;
        printf("Ran %" PRId64 " iterations in %.3f seconds for a rate of %" PRId64 " iterations / sec.  That is %.4g %s each\n",
                        loops, duration / 1000000000.0, loops * 1000000000 / duration, each / divisor, timing.c_str());
}

static int64_t
nanoTime() {
        timespec tp;
        int rv;
        if ((rv = clock_gettime(CLOCK_REALTIME, &tp)) != 0) {
                printf("clock_gettime error: %d\n", rv);
                exit(1);
        }
        int64_t v = tp.tv_sec;
        v *= 1000000000;
        v += tp.tv_nsec;
        return v;
}

extern "C"
void*
fileWriter(void* varg)
{
        intptr_t instanceNumber = (intptr_t)varg;
        int timing[g_maxLoops];
        ostringstream oss;
        if (g_filePath.length())
                oss << g_filePath << "/";
        oss << "speedTestData" << instanceNumber << ".dat";
        string filename = oss.str();
        //int out = open(filename.c_str(), O_WRONLY | O_CREAT | O_SYNC, 0644);
        int out = open(filename.c_str(), O_WRONLY | O_CREAT, 0644);
        int64_t bt = nanoTime();
        int64_t begin = bt;
        for (int i=0; i < g_maxLoops; ++i) {
                write(out, g_buffer, g_bufferSize);
                int64_t end = nanoTime();
                timing[i] = (int)(end - begin)/1000;
                begin = end;
        }
        int64_t et = nanoTime();
        printTimings(bt, et, g_maxLoops);
        int min=INT_MAX, max=INT_MIN;
        int64_t sum=0;
        int minIter=0, maxIter=0;
        for (int i=0; i < g_maxLoops; ++i) {
                if (timing[i] < min) {
                        min = timing[i];
                        minIter = i;
                }
                if (timing[i] > max) {
                        max = timing[i];
                        maxIter = i;
                }
                sum += timing[i];
        }
        printf("Min = %dus (iteration %d), max = %dus (iteration %d), avg = %0.3fus\n",
                        min, minIter, max, maxIter, (double)sum / g_maxLoops);
        ostringstream oss2;
        oss2 << "report" << instanceNumber << ".txt";
        ofstream report(oss2.str().c_str());
        for (int i=0; i < g_maxLoops; ++i)
                report << timing[i] << endl;
        close(out);
        return 0;
}

int
main(int argc, const char* argv[])
{
        if (argc > 1) {
                g_threadCount = atoi(argv[1]);
                if (argc > 2) {
                        g_maxLoops = atol(argv[2]);
                        if (argc > 3) {
                                g_bufferSize = atoi(argv[3]);
                                if (argc > 4) {
                                        g_filePath = argv[4];
                                        if (argc > 5) {
                                                printf("Usage: FileWriteSpeed2 [<thread-count>=1 [<loop-count>=2000 [<block-size>=1000 [<output-directory>=.]]]]\n");
                                                return 0;
                                        }
                                }
                        }
                }
        }
        g_buffer = new char[g_bufferSize];
        pthread_t threads[g_threadCount];
        for (intptr_t i=1; i <= g_threadCount; i++)
                pthread_create(&threads[i-1], 0, fileWriter, (void*)i);
        for (intptr_t i=0; i < g_threadCount; i++)
                pthread_join(threads[i], 0);
        return 0;
}

Question by:cp5c0tt

    Accepted Solution

    > By using the O_SYNC or fsync system calls, is the write cache bypassed somehow?
    No. You are just forcing the OS (Linux) page cache to flush its buffers to the HDD. What the controller/HDD then does with that data is up to its configuration.

    I have heard there is an unfortunate thing about Linux fsync: calling fsync on a single file flushes the whole page cache. Not really robust, is it?

    In your case I think you could try O_DIRECT rather than O_SYNC/fsync. O_DIRECT does require the in-memory buffers, the write offset, and the write size to all be aligned to 512 bytes, and 4K alignment is far better (see below). But that is something to program.

    Also, you may want to write exactly (4K * n)-byte buffers at (4K * m) offsets. That is because writing a partial filesystem block (which is 4K) causes a read-before-write.

    Expert Comment

    Could you please post the output of your sample test code?

    ".. achieve latency numbers of around ~100ms .. " -- or do you mean 100 us?

    Author Comment

    Thanks Gugro,

    You are correct, that was meant to be microseconds.  Typing too fast.

    Here are the STDOUT outputs for a single-thread and a two-thread run;
    also attached are report files that print out the timing of each individual iteration.

    dhq-dil-qa03:/appvol/tst # /home/cprojs/FileWriteSpeed2 1
    Ran 2000 iterations in 0.503 seconds for a rate of 3974 iterations / sec.  That is 251.6 microseconds each
    Min = 194us (iteration 42), max = 8277us (iteration 0), avg = 251.631us
    dhq-dil-qa03:/appvol/tst #

    and Two Threads:

    dhq-dil-qa03:/appvol/tst # /home/jwegher/cprojs/FileWriteSpeed2 2
    Ran 2000 iterations in 15.656 seconds for a rate of 127 iterations / sec.  That is 7.828 milliseconds each
    Min = 219us (iteration 525), max = 28328us (iteration 497), avg = 7828.003us
    Ran 2000 iterations in 15.689 seconds for a rate of 127 iterations / sec.  That is 7.845 milliseconds each
    Min = 214us (iteration 1956), max = 28397us (iteration 474), avg = 7844.596us
    dhq-dil-qa03:/appvol/tst #


    Expert Comment

    I think you have a problem with the write cache on your controllers:
    2000 iterations x 1000 bytes => 2 MB, which should easily fit in the controller cache.

    When you are running one thread, you are writing sequentially on your disk
    and are able to reach a high I/O rate.

    When you are running two threads, the controller has to jump between the
    two files located on different parts of your disks, so you are limited by the I/O rate
    of your disks. Normal disks have an access time of about 8 ms => 125 IOPS,
    and your write test gave you 127 iterations / sec!

    So either the write cache in your controller is not working, or something is disabling the cache.

    I'm sorry, I don't know the source code of RHEL, but it could be a problem with using fsync().
    Please see the following thread:


    Expert Comment

    Have you tried your code with the O_NOATIME option in open()?
