O_SYNC/fsync interaction with disk controllers that have a battery-backed write cache

First, let me start off by stating what will soon be obvious: I am not a C developer but an infrastructure professional. ;-)

The Problem:

We have an in-house developed application; one of its main responsibilities is to log incoming network transactions out to disk.  The inbound messages are relatively small (a few hundred bytes).  The application distributes the inbound messages across several threads that write to several open files (30 on average).  While the application is processing the incoming network streams, it is not uncommon to see around 2,000-3,000 IOPS being generated.  It is important that this data be written to disk as soon as possible to maintain a consistent application state and to know where each transaction stands within the overall application.

In order to achieve and sustain those IOPS, it was decided to use an 8-drive RAID-10 configuration behind a battery-backed write cache.  In initial testing we discovered that with one thread running (1 open file) we were achieving acceptable write latency of ~100us.  However, as soon as a second thread was added to the same block device, latency immediately increased to 8ms.

I have checked the usual suspects from the system side: write cache is indeed enabled on the controllers, appropriate kernel tunings have been made to the I/O schedulers, vm dirty_ratio settings, filesystem tuning recommendations, etc., and all appears to be as one would expect, but the best we can seem to achieve is a 6-8ms response time.  However, when I run tests with Bonnie++, I get exceptional results.

In working closely with the developers, it was determined that they are using O_SYNC and/or fsync (and the Java equivalent) within their code to ensure that the data is committed to disk before execution of the next step.  They put together a small sample test program (attached) for me to do my baselining.  I have actually removed the sync calls and recompiled this test program, and as I suspected, I can run multiple threads (up to 32 so far) and achieve latency numbers of around ~100ms, as the writes are just going to the system buffers and being processed through the I/O scheduler.

So that was a long way of getting to the question(s), but here they are:

By using the O_SYNC or fsync system calls, is the write cache bypassed somehow?

If so, is there a better system call that would ensure the data is in the hardware write cache, and not just in the system I/O buffers, before allowing the program to move on to the next step?
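For illustration only, here is the kind of substitution I have in mind; fdatasync() is just my guess at a lighter-weight alternative (as I understand it, it skips flushing file metadata that is not needed to read the data back).  A minimal sketch, not our actual application code:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Minimal sketch: same write-then-flush pattern as the attached test program,
// but using fdatasync() instead of fsync(). fdatasync() only forces the file
// data (plus the metadata needed to retrieve it) out to storage.
int main()
{
        char buf[1000];
        memset(buf, 'x', sizeof(buf));

        int fd = open("speedTestData1.dat", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        for (int i = 0; i < 2000; ++i) {
                write(fd, buf, sizeof(buf));
                fdatasync(fd);          // cheaper than fsync() when only the data matters
        }

        close(fd);
        return 0;
}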

Am I  missing something? ;-)

System OS: SLES10 and RHEL 5.1
Hardware: HP DL380 w/p400i 512MB BBWC SAS 15K 76GB Drives
also tested with an HDS AMS1000 with 4GB of cache and 15K 76GB Fibre Channel drives.
Filesystem: EXT3

Thanks for any help you can provide.

Chris


// Build example (SLES10/RHEL5): g++ -O2 -o FileWriteSpeed2 FileWriteSpeed2.cpp -lpthread -lrt
#define __STDC_FORMAT_MACROS
 
#include <pthread.h>
#include <iostream>
#include <string>
#include <sstream>
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <unistd.h>
#include <fstream>
#include <time.h>       // clock_gettime, timespec
#include <limits.h>     // INT_MAX, INT_MIN
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
 
using namespace std;
 
static int g_bufferSize = 1000;
static int g_threadCount = 1;
static int g_maxLoops = 2000;
static char* g_buffer;
static string g_filePath;
 
static void
printTimings(int64_t bt, int64_t et, int64_t loops) {
        int64_t duration = et - bt;
        if (duration == 0)
                duration = 1;
 
        int64_t each = duration / loops;
        string timing = each >= 1000000 ? "milliseconds" : each >= 1000 ? "microseconds" : "nanoseconds";
        double divisor = each >= 1000000 ? 1000000 : each >= 1000 ? 1000 : 1;
 
        printf("Ran %"PRId64" iterations in %.3f seconds for a rate of %"PRId64" iterations / sec.  That is %.4g %s each\n",
                        loops, duration / 1000000000.0, loops * 1000000000 / duration, each / divisor, timing.c_str());
}
 
static int64_t
nanoTime()
{
        timespec tp;
        int rv;
        if ((rv = clock_gettime(CLOCK_REALTIME, &tp)) != 0) {
                printf("clock_gettime error: %d\n", rv);
                exit(1);
        }
        int64_t v = tp.tv_sec;
        v *= 1000000000;
        v += tp.tv_nsec;
        return v;
}
 
extern "C"
void*
fileWriter(void* varg)
{
        intptr_t instanceNumber = (intptr_t)varg;
 
        int timing[g_maxLoops];
        ostringstream oss;
        if (g_filePath.length())
                oss << g_filePath << "/";
        oss << "speedTestData" << instanceNumber << ".dat";
 
        string filename = oss.str();
 
        // Toggle O_SYNC here to compare a synchronous open against explicit fsync().
        //int out = open(filename.c_str(), O_WRONLY | O_CREAT | O_SYNC, 0644);
        int out = open(filename.c_str(), O_WRONLY | O_CREAT, 0644);
        if (out < 0) {
                perror("open");
                return 0;
        }
 
        int64_t bt = nanoTime();
        int64_t begin = bt;
 
        for (int i=0; i < g_maxLoops; ++i) {
                write(out, g_buffer, g_bufferSize);
                fsync(out);                             // flush this file's dirty pages (data + metadata) to the device
                int64_t end = nanoTime();
                timing[i] = (int)((end - begin)/1000);  // per-iteration latency in microseconds
                begin = end;
        }
 
        int64_t et = nanoTime();
        close(out);
 
        printTimings(bt, et, g_maxLoops);
 
        int min=INT_MAX, max=INT_MIN;
        int64_t sum=0;
        int minIter=0, maxIter=0;
 
        for (int i=0; i < g_maxLoops; ++i) {
                if (timing[i] < min) {
                        min = timing[i];
                        minIter = i;
                }
                if (timing[i] > max) {
                        max = timing[i];
                        maxIter = i;
                }
                sum += timing[i];
        }
 
        printf("Min = %dus (iteration %d), max = %dus (iteration %d), avg = %0.3fus\n",
                        min, minIter, max, maxIter, (double)sum / g_maxLoops);
 
        ostringstream oss2;
        oss2 << "report" << instanceNumber << ".txt";
        ofstream report(oss2.str().c_str());
        for (int i=0; i < g_maxLoops; ++i)
                report << timing[i] << endl;
        report.flush();
 
        return 0;
}
 
int
main(int argc, const char* argv[])
{
        if (argc > 1) {
                g_threadCount = atoi(argv[1]);
                if (argc > 2) {
                        g_maxLoops = atol(argv[2]);
                        if (argc > 3) {
                                g_bufferSize = atoi(argv[3]);
                                if (argc > 4) {
                                        g_filePath = argv[4];
                                        if (argc > 5) {
                                                printf("Usage: FileWriteSpeed2 [<thread-count>=1 [<loop-count>=2000 [<block-size>=1000 [<output-directory>=.]]]]\n");
                                                return 0;
                                        }
                                }
                        }
                }
        }
 
        g_buffer = new char[g_bufferSize];
 
        pthread_t threads[g_threadCount];
        for (intptr_t i=1; i <= g_threadCount; i++)
                pthread_create(&threads[i-1], 0, fileWriter, (void*)i);
 
        for (intptr_t i=0; i < g_threadCount; i++)
                pthread_join(threads[i], 0);
 
        return 0;
}



ravenplCommented:
> By using the O_SYNC or fsync system calls, is the write cache bypassed somehow?
No. You are just forcing the OS (Linux) page cache to flush its buffers to the disk. What the controller/HDD then does with that data is up to its configuration.

I have heard there is an unfortunate thing about Linux fsync: calling fsync on a single file flushes the whole page cache. Not very efficient, is it?

In your case I think you could try O_DIRECT rather than O_SYNC/fsync. Note that O_DIRECT requires the in-memory buffer, as well as the write offset and write size, to be aligned to 512 bytes, and 4K alignment is far better (see below). But this is something that has to be programmed.

Also, you may want to write exactly (4K * n)-byte buffers at (4K * m) offsets, because writing a partial filesystem block (which is 4K) causes a read-before-write.
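For what it's worth, here is a minimal sketch of that approach (assuming a 4K filesystem block size and a throw-away test file; it only illustrates the alignment rules and is not a drop-in replacement for the attached program):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE                     // O_DIRECT needs this with plain gcc; g++ defines it already
#endif
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main()
{
        const size_t kBlock = 4096;     // assumed filesystem block size
        void* buf = NULL;

        // O_DIRECT requires an aligned buffer; 4K also satisfies the 512-byte minimum.
        if (posix_memalign(&buf, kBlock, kBlock) != 0) {
                perror("posix_memalign");
                return 1;
        }
        memset(buf, 'x', kBlock);

        // Throw-away test file; O_DIRECT bypasses the Linux page cache entirely.
        int fd = open("directTest.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        // Write one whole 4K block at a 4K-aligned offset (offset 0 here), so the
        // filesystem never has to read-before-write a partial block.
        if (write(fd, buf, kBlock) != (ssize_t)kBlock)
                perror("write");

        close(fd);
        free(buf);
        return 0;
}

Whether this actually helps depends on what the controller does with the writes once they reach it; O_DIRECT just takes the kernel page cache out of the picture.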
GugroCommented:
Could you please post the output of your test sample code ??

".. achieve latency numbers of around ~100ms .. "    or do you mean 100 us ??
cp5c0ttAuthor Commented:
Thanks Gugro,

You are correct, that was meant to be microseconds.  Typing too fast.

Here is the STDOUT output for a single-thread run and a two-thread run.
Also attached are report files that print out the timing of each individual iteration.

dhq-dil-qa03:/appvol/tst # /home/cprojs/FileWriteSpeed2 1
Ran 2000 iterations in 0.503 seconds for a rate of 3974 iterations / sec.  That is 251.6 microseconds each
Min = 194us (iteration 42), max = 8277us (iteration 0), avg = 251.631us
dhq-dil-qa03:/appvol/tst #

and Two Threads:

dhq-dil-qa03:/appvol/tst # /home/jwegher/cprojs/FileWriteSpeed2 2
Ran 2000 iterations in 15.656 seconds for a rate of 127 iterations / sec.  That is 7.828 milliseconds each
Min = 219us (iteration 525), max = 28328us (iteration 497), avg = 7828.003us
Ran 2000 iterations in 15.689 seconds for a rate of 127 iterations / sec.  That is 7.845 milliseconds each
Min = 214us (iteration 1956), max = 28397us (iteration 474), avg = 7844.596us
dhq-dil-qa03:/appvol/tst #




report2-one-thread.txt
report1-two-thread.txt
report2-two-thread.txt
GugroCommented:
I think you have a problem with the write cache on your controller.
2000 iterations x 1000 bytes => 2 MB, which should easily fit in the controller cache.

When you are running one thread, you are writing sequentially to your disk,
so you are able to reach a high I/O rate.

When you are running two threads, the controller has to jump between the
two files located on different parts of your disks, so you are limited to the I/O rate
of your disks. Normal disks have an access time of about 8 ms => 125 IOPS,
and your write test gave you... 127 iterations / sec!

So either the write cache in your controller is not working, or something is disabling it.

I'm sorry, I don't know the RHEL kernel source, but it could be a problem with the use of fsync().
Please see the following thread:
http://kerneltrap.org/node/14148


GugroCommented:
Have you tried your code with the O_NOATIME flag in open()?
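For example (a minimal sketch; O_NOATIME is Linux-specific, needs _GNU_SOURCE with plain gcc, and only works if the caller owns the file):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

// Minimal sketch: open the test file with O_NOATIME so the kernel skips
// access-time updates (likely a small effect here, since atime is mostly
// a read-path cost, but it is easy to try).
int main()
{
        int fd = open("speedTestData1.dat", O_WRONLY | O_CREAT | O_NOATIME, 0644);
        if (fd < 0) {
                perror("open with O_NOATIME");
                return 1;
        }
        close(fd);
        return 0;
}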