Link to home
Start Free TrialLog in
Avatar of modsiw
modsiw

asked on

speeding up text processing app

The app below takes about 50 seconds per gb of input. I'm looking for ways to speed it up.

It is CPU bound.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const char NUL = '\0';
const char CR = 0x0D;
const char LF = 0x0A;
const char TAB = '\t';
const char BS = '\\';
const char SPC = ' ';
const int CNT_FIELD = 292;

int main(int argc, char** argv) {
    int i;
    int minArg = 3;
    
    if (argc < minArg) {
        fputs("arguments: line_size prefix {outputs}\n",stderr);
        return EXIT_FAILURE;
    }
    
    const int sizeLine = atoi(argv[1]) + 1;
    if (sizeLine <= 0) {
        fputs("line_size <= 0",stderr);
        return EXIT_FAILURE;
    }

    char* line = malloc(sizeLine * sizeof(char));
    char* lineCursor = line;
    char* lineReset = line;
    
    char* tmp = malloc(sizeLine * sizeof(char));
    char* tmpCursor = tmp;
    char* tmpReset = tmp;
    
    if (strlen(argv[2]) > 0) {
        strcpy(line,argv[2]);
        lineReset = line + strlen(argv[2]);
        *lineReset = TAB;
        lineReset++;
        lineCursor = lineReset;
    }
    
    FILE* out;
    FILE** outs;
    int cntOuts = argc - minArg;
    if (cntOuts) {
        outs = malloc(cntOuts * sizeof(FILE));
        for (i = minArg; i < argc; i++)
            outs[i - minArg] = fopen(argv[i],"a+");
    }
    else
        out = stdout;

    int fields = 1;
    unsigned int cntLines = 0;
    
    char c;
    int afterSlash = 0;
    while (!feof(stdin)) {
        tmpCursor = tmpReset;
        fgets(tmpCursor, sizeLine, stdin);
        while(1) {
            c = *tmpCursor;
            tmpCursor++;
            if (afterSlash) {
                if (c == LF || c == TAB || c == CR) {
                    *lineCursor = SPC;
                    lineCursor++;
                }
                else if (c == NUL) {
                    *lineCursor = SPC;
                    lineCursor++;
                    afterSlash = 0;
                    break;
                }
                else {
                    *lineCursor = BS;
                    lineCursor++;
                    *lineCursor = c;
                    lineCursor++;
                }
                afterSlash = 0;
            }
            else if (c == TAB) {
                fields++;
                *lineCursor = TAB;
                lineCursor++;
            }
            else if (c == NUL) {
                *lineCursor = NUL;
                if (fields == CNT_FIELD) {
                    if (cntOuts)
                        out = outs[cntLines++ % cntOuts];
                    fputs(line,out);
                }
                else if (fields > CNT_FIELD)
                    fputs(line,stderr);
                fields = 1;
                lineCursor = lineReset;
                break;
            }
            else if (c == BS) 
                afterSlash = 1;
            else if (c == CR);
            else {
                *lineCursor = c;
                lineCursor++;
            }
        }
    }
    
    for (i = 0; i < cntOuts; i++)
        fclose(outs[i]);
    
    free(outs);
    free(tmp);
    free(line);
    return (EXIT_SUCCESS);
}

Open in new window

Avatar of Superdave
Superdave
Flag of United States of America image

You could try using a switch statement instead of the if...else if's.  The compiler will optimize that into some kind of lookup table that is probably faster than all the compares and conditional jumps.  I think that will make a little bit of speed-up, but probably not a whole lot.

You could also replace the fgets with a fixed size read, either fread or use open/read/write instead of the buffered versions.  You're already checking every character so it would not be much more overhead to deal with the end-of-lines as you find them.

Also line 48 should have sizeof(FILE *) rather that sizeof(FILE) but that won't make a difference--just allocates a bit more memory than it needs to.
Avatar of phoffric
phoffric

If you do no logic, how long does it take to just do I/O?
If willing to use non-portable code, then if using linux, you can memory map a file and treat it as an array for superior I/O performance (almost 10x).
    http://linux.die.net/man/3/mmap
and here is some sample code for write and read:
    http://www.linuxquestions.org/questions/programming-9/mmap-tutorial-c-c-511265/
(I changed the for (i = 1; i <=NUMINTS; ++i) {...} to start at 0; and it ran 9x faster than using fwrite)

I never did this in Windows, but here is a link to get started:
    http://msdn.microsoft.com/en-us/library/aa366556(v=vs.85).aspx
i would guess you have same time if you only do the read and writes. 20 mb per second is not so bad.

you could speed up by reading the file in bigger chunks and by storing the outputs in memory and flush them at end.

Sara
Avatar of modsiw

ASKER

superdave,

I tried using putc / getc. It showed the same sample to 70sec / gb.
I'll try switch, and let ya know.
Thanks for FILE*


phoffric,
I'll give it a shot with just IO.

In normal operation, the input is a pipe coming out of gzip. Output is either named pipe(s) or a normal pipe. I'm not sure what affect this would have on mmap; I've never used it, so I'll have to research. From looking at top, gzip is spending most of its time blocked, and therefore I presume (maybe wrongly?) that it is writing out data plenty fast enough.

I've also tried it reading from a normal file (sample_data) that has already been unzipped. It produces the same speed as from the gzip pipe (50sec / gb).

cat sample_data > named_pipe & cat named_pipe > /dev/null runs in 6-10sec / gb. This is based on SAN traffic I guess.

It's on a Solaris box atm, and slated to go to a RedHat box. I'm ok with non-portable code. This is small and easy enough to port as needed.


Sara,
The input is capable of upwards of 100mb/sec and top reports 100% cpu usage for this process. All the same, I'll reduce the app to just IO and see what happens.

I'm not sure if reading in bigger chunks would really benefit me. Input is from a pipe, no seeks.
I can't flush at the end; the data is too large. The output is also a pipe to a process, so again there are no disk seeks.
I didn't give the info about pipes in & out in the OP, oops. Do you still recommend larger chunks? If so, I'll give it a shot.
you won't get 100mb/sec if you read with fgets and have write operations while reading.

streaming operations on stdin are relatively slow.

Sara
ASKER CERTIFIED SOLUTION
Avatar of phoffric
phoffric

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
>> use read
I tried using open and read, but got no appreciable difference - a little surprised.
>> Output is either named pipe(s) or a normal pipe.
On my system it took ~2 sec to write 1GB to a file using mmap vs 17-19 sec using other approaches. But since pipe is serial, it cannot be mapped using mmap. If your design could change to mmap write to a file, and then pipe that file to the next process, perhaps that will improve. (Likewise, next process could mmap read.)

>> I've also tried it reading from a normal file (sample_data) that has already been unzipped. It produces the same speed as from the gzip pipe
>>  top reports 100% cpu usage
    Could you do a timing test with no I/O. Just hard-code a dummy data sequence that represents a worst-case scenario, and loop until 1GB of data is processed. That eliminates I/O and will confirm the evidence that it is your data processing that is the problem. If you then post a program with no I/O and we also see high cpu usage, then maybe see if we can suggest SW improvements.

For overall system performance improvements, perhaps HW will help:
     http://www.aha.com/show_prod.php?id=36
Avatar of modsiw

ASKER

Below is code with no IO. It runs at 9sec / gb.

It seems the IO is the issue. I'll try the IO things mentioned in this thread and post results.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const int NUL = '\0';
const int CR = 0x0D;
const int LF = 0x0A;
const int TAB = '\t';
const int BS = '\\';
const int SPC = ' ';
const int CNT_FIELD = 292;

const char* LINE0 = "1295414456\tss\ten-US,en;q=0.9,ja;q=\t2011-01-19 00:20:56\t405483083\t841936015\t200,135\tU\t24.43.46.75\t0\t\t\t\thttp://www.removedremoved.com/\t/removed/home/0,1904,,00.html\t\tMozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en; rv:1.9.0.19) Gecko/2010111021 Camino/2.0.6 (like\thome\t\t\t\t\thome\tremoved removed - Easy removeds, Healthy Eating Ideas and Chef removed Videos\t\tHOME,removed\tremoved-SECTION-32176-1\tsection\t\t\tRepeat\t\t\t\t\t\t\t\t\t\t\t\t\t\t\thttp://www.removedremoved.com/\t\t\t\t\tWed-12A\t\t\t\t\t/removed/home/0,1904,,00.html\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tMozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en; rv:1.9.0.19) Gecko/2010111021 Camino/2.0.6 (like Firefox/3.0.19)\t\t0\t0\t\t\t\t\t\t498\t600\t1243\tY\tY\tN\t2\t0\t300\trr.com\t18/0/2011 21:20:54 2 480\t8\t45\t2\t10003,00011,10020,10021\t71\t1295047392\t1295047392\t1295414456\t0\t0\thttp://www.removedremoved.com/removeds/pumpkins-pumpkin-pupcakes-removed/index.html\t/removeds/pumpkins-pumpkin-pupcakes-removed/index.html\thttp://www.removedremoved.com/\t/removed/home/0,1904,,00.html\thttp://www.facebook.com/l.php?u=http%3A%2F%2Fwww.removedremoved.com%2Fremoveds%2Fpumpkins-pumpkin-pupcakes-removed%2Findex.html&h=f130a\t\t0\t2\t1\t0\tlos angeles\tusa\tca\t0\t1\t1\t1\t0\t0\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tRepeat\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tRepeat\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t0\t\t0\t0\t\t340878110\tremovedremovednetnew\t2729687\t2781881361728110592\t4992312158249386792\t600\t1243\t24\tH.21\tY\t\t\t1.7\tY\tDefault Plug-in;Java Embedding Plugin 0.9.7.4;Moveremoveds Quantum Media Player;Shockwave Flash;QuickTime Plug-in 7.6.6;Silverlight Plug-In;DivX Web Player;\tN\t1280x800\t0\twww369.sj2.removed.com\t18/0/2011 21:20:54 2 480\t\t\t\tUSD\t0.000000000000\t2\tY\tN\t803\t\t1\t0\t0\t0\t0\t\t0\t1295414456\t\t\t\t\t1\t0\t405483083\t841936015\t\t\t\t1\tN\t0\t\t1\t\t0\t\t\t\t\t\t\t\t\t\n";
const char* LINE1 = "1295414463\tss\ten-US,en;q=0.9,ja;q=\t2011-01-19 00:21:03\t405483083\t841936015\t200,227,103,135,144\tU\t24.43.46.75\t0\t\t\t\thttp://www.removedremoved.com/search/delegate.do?fnSearchString=cashew+chicken&fnSearchType=site\t/removedredesign/site/search.do\t\tMozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en; rv:1.9.0.19) Gecko/2010111021 Camino/2.0.6 (like\tsearch\t\t\t\t\tsearch\tSearch Results removed removed\t\tSEARCH,removed\tremoved-SEARCH-1\tsearch\t\t\tRepeat\t\t\t\t\t\t\t\tcashew chicken\tcashew chicken\t\t\tsearch:cashew chicken\t\t\thttp://www.removedremoved.com/search/delegate.do?fnSearchString=cashew+chicken&fnSearchType=site\t\t\t\t\tWed-12A\t\t/removed/home/0,1904,,00.html\t\t\t/removedredesign/site/search.do\t\tCHICKEN\t\t\t\t\t\t\t\t\t\t\t\thttp://www.removedremoved.com/\t\tMozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en; rv:1.9.0.19) Gecko/2010111021 Camino/2.0.6 (like Firefox/3.0.19)\t\t0\t0\t\t\t\t\t\t498\t600\t1243\tY\tY\tN\t2\t0\t300\trr.com\t18/0/2011 21:20:54 2 480\t8\t45\t2\t10003,00011,10020,10021\t71\t1295414456\t1295047392\t1295414456\t0\t0\thttp://www.removedremoved.com/removeds/pumpkins-pumpkin-pupcakes-removed/index.html\t/removeds/pumpkins-pumpkin-pupcakes-removed/index.html\thttp://www.removedremoved.com/\t/removed/home/0,1904,,00.html\thttp://www.facebook.com/l.php?u=http%3A%2F%2Fwww.removedremoved.com%2Fremoveds%2Fpumpkins-pumpkin-pupcakes-removed%2Findex.html&h=f130a\t\t0\t2\t2\t3483642180\tlos angeles\tusa\tca\t0\t0\t0\t0\t0\t0\t\t\t\t\tcashew chicken\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tRepeat\t\t\t\t\t\t\t\t\tChicken removed Collection\t\t\t\t\t\t\t\t\tcashew chicken\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tRepeat\t\t\t\t\t\t\t\t\tChicken removed Collection\t\t\t\t\t\t\\\n                                                Search\\\n\t3\t/removed/home/0,1904,,00.html\t1\t0\tSUBMIT\t340878110\tremovedremovednetnew\t2729687\t2781881376760528896\t5725589719228350391\t600\t1243\t24\tH.21\tY\t\t\t1.7\tY\tDefault Plug-in;Java Embedding Plugin 0.9.7.4;Moveremoveds Quantum Media Player;Shockwave Flash;QuickTime Plug-in 7.6.6;Silverlight Plug-In;DivX Web Player;\tN\t1280x800\t0\twww429.sj2.removed.com\t18/0/2011 21:21:1 2 480\t\t\t\tUSD\t0.000000000000\t2\tY\tN\t803\t\t0\t0\t0\t0\t0\t\t0\t1295414463\t\t\t\t\t1\t0\t405483083\t841936015\t\t\t\t1\tN\t0\t\t1\t\t0\t\t\t\t\t\t\t\t\t\n";
const char* LINE2 = "1295414487\tss\ten-US,en;q=0.9,ja;q=\t2011-01-19 00:21:27\t405483083\t841936015\t200,135\tU\t24.43.46.75\t0\t\t\t\thttp://www.removedremoved.com/\t/removed/home/0,1904,,00.html\t\tMozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en; rv:1.9.0.19) Gecko/2010111021 Camino/2.0.6 (like\thome\t\t\t\t\thome\tremoved removed - Easy removeds, Healthy Eating Ideas and Chef removed Videos\t\tHOME,removed\tremoved-SECTION-32176-1\tsection\t\t\tRepeat\t\t\t\t\t\t\t\t\t\t\t\t\t\t\thttp://www.removedremoved.com/\t\t\t\t\tWed-12A\t70|49\t/removedredesign/site/search.do\t\t\t/removed/home/0,1904,,00.html\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tMozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en; rv:1.9.0.19) Gecko/2010111021 Camino/2.0.6 (like Firefox/3.0.19)\t\t0\t0\t\t\t\t\t\t498\t600\t1243\tY\tY\tN\t2\t0\t300\trr.com\t18/0/2011 21:20:54 2 480\t8\t45\t2\t10003,00011,10020,10021\t71\t1295414463\t1295047392\t1295414456\t0\t0\thttp://www.removedremoved.com/removeds/pumpkins-pumpkin-pupcakes-removed/index.html\t/removeds/pumpkins-pumpkin-pupcakes-removed/index.html\thttp://www.removedremoved.com/\t/removed/home/0,1904,,00.html\thttp://www.facebook.com/l.php?u=http%3A%2F%2Fwww.removedremoved.com%2Fremoveds%2Fpumpkins-pumpkin-pupcakes-removed%2Findex.html&h=f130a\t\t0\t2\t3\t1616801072\tlos angeles\tusa\tca\t0\t0\t0\t0\t0\t0\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tRepeat\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tcashew chicken\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tRepeat\t\t\t\t\t\t\t\t\tChicken removed Collection\t\t\t\t\t\t\t0\t\t0\t0\t\t340878110\tremovedremovednetnew\t2729687\t2781881428300103680\t5037348159352429215\t600\t1243\t24\tH.21\tY\t\t\t1.7\tY\tDefault Plug-in;Java Embedding Plugin 0.9.7.4;Moveremoveds Quantum Media Player;Shockwave Flash;QuickTime Plug-in 7.6.6;Silverlight Plug-In;DivX Web Player;\tN\t1280x800\t0\twww378.sj2.removed.com\t18/0/2011 21:21:25 2 480\t\t\t\tUSD\t0.000000000000\t2\tY\tN\t803\t\t0\t0\t0\t0\t0\t\t0\t1295414487\t\t\t\t\t1\t0\t405483083\t841936015\t\t\t\t1\tN\t0\t\t1\t\t0\t\t\t\t\t\t\t\t\t";

int main(int argc, char** argv) {
    int i;
    int minArg = 3;
    
    if (argc < minArg) {
        fputs("arguments: line_size prefix {outputs}\n",stderr);
        return EXIT_FAILURE;
    }
    
    const int sizeLine = atoi(argv[1]) + 1;
    if (sizeLine <= 0) {
        fputs("line_size <= 0",stderr);
        return EXIT_FAILURE;
    }

    char* line = malloc(sizeLine * sizeof(char));
    char* lineCursor = line;
    char* lineReset = line;
    
    char* tmp = malloc(sizeLine * sizeof(char));
    char* tmpCursor = tmp;
    char* tmpReset = tmp;
    
    if (strlen(argv[2]) > 0) {
        strcpy(line,argv[2]);
        lineReset = line + strlen(argv[2]);
        *lineReset = TAB;
        lineReset++;
        lineCursor = lineReset;
    }

    int fields = 1;
    unsigned int cntLines = 0;
    
    char c;
    int afterSlash = 0;
    int notEol;
    int k;
    for (k = 0; k < 3569516; k++) { //20gb = 3569516
        if (k % 3 == 0)
            tmpCursor = LINE0;
        if (k % 3 == 1)
            tmpCursor = LINE1;
        if (k % 3 == 2)
            tmpCursor = LINE2;
        notEol = 1;
        while(notEol) {
            c = *tmpCursor;
            tmpCursor++;
            if (afterSlash) {
                switch(c) {
                    case '\x0A':
                    case '\x0D':
                    case '\t':
                        *lineCursor = SPC;
                        lineCursor++;
                        break;
                    case '\0':
                        *lineCursor = SPC;
                        lineCursor++;
                        afterSlash = 0;
                        notEol = 0;
                        break;
                    default:
                        *lineCursor = BS;
                        lineCursor++;
                        *lineCursor = c;
                        lineCursor++;
                }
                afterSlash = 0;
            }
            else {
                switch(c) {
                    case '\0':
                        if (fields == CNT_FIELD) {
                            *lineCursor = LF;
                            lineCursor++;
                            *lineCursor = NUL;
                            fields = 1;
                            lineCursor = lineReset;
                        }
                        else if (fields > CNT_FIELD) {
                            *lineCursor = LF;
                            lineCursor++;
                            *lineCursor = NUL;
                            fields = 1;
                            lineCursor = lineReset;
                        }
                        notEol = 0;
                        break;
                    case '\\':
                        afterSlash = 1;
                        break;
                    case '\x0D':
                    case '\x0A':
                        break;
                    case '\t':
                        fields++;
                    default:
                        *lineCursor = c;
                        lineCursor++;
                }
            }
        }
    }
    
    free(tmp);
    free(line);
    return (EXIT_SUCCESS);
}

Open in new window

Avatar of modsiw

ASKER

The code below uses 16M buffers. It ran slower than 70sec / gb. I wasn't able to get an exact time bc I had to kill the process.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const char NUL = '\0';
const char CR = 0x0D;
const char LF = 0x0A;
const char TAB = '\t';
const char BS = '\\';
const char SPC = ' ';
const int CNT_FIELD = 292;
const int TMP_BUFFER_SIZE = 16 * 1024 * 1024;
const int OUT_BUFFER_SIZE = 16 * 1024 * 1024;

int main(int argc, char** argv) {
    int i;
    int minArg = 3;
    
    if (argc < minArg) {
        fputs("arguments: line_size prefix {outputs}\n",stderr);
        return EXIT_FAILURE;
    }
    
    const int sizeLine = atoi(argv[1]) + 1;
    if (sizeLine <= 0) {
        fputs("line_size <= 0",stderr);
        return EXIT_FAILURE;
    }

    char* line = malloc(sizeLine * sizeof(char));
    char* lineCursor = line;
    char* lineReset = line;
    
    char* tmp = malloc(TMP_BUFFER_SIZE * sizeof(char));
    
    if (strlen(argv[2]) > 0) {
        strcpy(line,argv[2]);
        lineReset = line + strlen(argv[2]);
        *lineReset = TAB;
        lineReset++;
        lineCursor = lineReset;
    }
    
    FILE* out;
    FILE** outs;
    int cntOuts = argc - minArg;
    char** outBuffer;
    if (cntOuts) {
        outBuffer = malloc(cntOuts * sizeof(char*));
        outs = malloc(cntOuts * sizeof(FILE*));
        for (i = minArg; i < argc; i++) {
            outs[i - minArg] = fopen(argv[i],"a+");
            outBuffer[i - minArg] = malloc(OUT_BUFFER_SIZE * sizeof(char));
            setvbuf(outs[i - minArg],outBuffer[i - minArg],_IOFBF,OUT_BUFFER_SIZE);
        }
    }
    else {
        outBuffer = malloc(sizeof(char*));
        outBuffer[0] = malloc(OUT_BUFFER_SIZE * sizeof(char));
        setvbuf(stdout,outBuffer[0],_IOFBF,OUT_BUFFER_SIZE);
        out = stdout;
    }
    
    char* inBuffer = malloc(TMP_BUFFER_SIZE * sizeof(char));
    setvbuf(stdin,inBuffer,_IOFBF,TMP_BUFFER_SIZE);

    int fields = 1;
    unsigned int cntLines = 0;
    
    char c;
    int afterSlash = 0;
    int inFileNo = fileno(stdin);
    int n;
    while (!feof(stdin)) {
        n = read(inFileNo,tmp,TMP_BUFFER_SIZE);
        for(i = 0; i < n; i++) {
            c = tmp[i];
            if (afterSlash) {
                switch(c) {
                    case '\x0A':
                    case '\x0D':
                    case '\t':
                        *lineCursor = SPC;
                        lineCursor++;
                        break;
                    case '\0':
                        *lineCursor = SPC;
                        lineCursor++;
                        afterSlash = 0;
                        break;
                    default:
                        *lineCursor = BS;
                        lineCursor++;
                        *lineCursor = c;
                        lineCursor++;
                }
                afterSlash = 0;
            }
            else {
                switch(c) {
                    case '\\':
                        afterSlash = 1;
                        break;
                    case '\x0A':
                        if (fields == CNT_FIELD) {
                            *lineCursor = LF;
                            lineCursor++;
                            *lineCursor = NUL;
                            if (cntOuts)
                                out = outs[cntLines++ % cntOuts];
                            fputs(line,out);
                            fields = 1;
                            lineCursor = lineReset;
                        }
                        else if (fields > CNT_FIELD) {
                            *lineCursor = LF;
                            lineCursor++;
                            *lineCursor = NUL;
                            fputs(line,stderr);
                            fields = 1;
                            lineCursor = lineReset;
                        }
                        break;
                    case '\0':
                    case '\x0D':
                        break;
                    case '\t':
                        fields++;
                    default:
                        *lineCursor = c;
                        lineCursor++;
                }
            }
        }
    }
    
    for (i = 0; i < cntOuts; i++) {
        fclose(outs[i]);
        free(outBuffer[i]);
    }
    
    free(inBuffer);
    free(outBuffer);
    free(outs);
    free(tmp);
    free(line);
    return (EXIT_SUCCESS);
}

Open in new window

Avatar of modsiw

ASKER

The 16M buffer code fluxed between 75% and 100% cpu usage. It consumed more cpu time than the original. All of my gb / sec s measured by wall clock time
Avatar of modsiw

ASKER

16k buffer is also slower than original.
Avatar of modsiw

ASKER

IO only code below.

IO only: 15 sec / gb
Everything but IO: 9 sec / gb

IO & Everything but IO: 50 sec /gb



Why does doing both take so much more time than doing each task separately?
#include <stdio.h>
#include <stdlib.h>

const int BUFFER_SIZE = 16 * 1024 * 1024;

int main(int argc, char** argv) {
    char* buffer = malloc(BUFFER_SIZE * sizeof(char));

    int inFileNo = fileno(stdin);
    int outFileNo = fileno(stdout);
    int n;
    
    while (1) {
        n = read(inFileNo,buffer,BUFFER_SIZE);
        if (n <= 0)
            break;
        write(outFileNo,buffer,n);
    }
    
    free(buffer);
    return (EXIT_SUCCESS);
}

Open in new window

Avatar of modsiw

ASKER

In the code below. There are two comment blocks. The changes between them seems to be what is slowing it down.

Anyway to speed it back up?

I'm going to see if I can replace the backslash-linefeed sequence with two spaces instead of one. If so, this would remove the requirement to "shift" the array and could opperate on buffer directly and not use line.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const char NUL = '\0';
const char CR = 0x0D;
const char LF = 0x0A;
const char TAB = '\t';
const char BS = '\\';
const char SPC = ' ';
const int CNT_FIELD = 292;
const int BUFFER_SIZE = 16 * 1024 * 1024;

int main(int argc, char** argv) {
    int i;
    int minArg = 3;
    
    if (argc < minArg) {
        fputs("arguments: line_size prefix {outputs}\n",stderr);
        return EXIT_FAILURE;
    }
    
    const int sizeLine = atoi(argv[1]) + 1;
    if (sizeLine <= 0) {
        fputs("line_size <= 0",stderr);
        return EXIT_FAILURE;
    }

    char* line = malloc(sizeLine * sizeof(char));
    
    char* buffer = malloc(BUFFER_SIZE * sizeof(char));
    
    int lineResetIdx = strlen(argv[2]);
    if (lineResetIdx > 0) {
        strcpy(line,argv[2]);
        line[lineResetIdx++] = TAB;
    }
    
    FILE* out;
    FILE** outs;
    int cntOuts = argc - minArg;
    if (cntOuts) {
        outs = malloc(cntOuts * sizeof(FILE*));
        for (i = minArg; i < argc; i++)
            outs[i - minArg] = fopen(argv[i],"a+");
    }
    else
        out = stdout;
    
    int* outNos;
    int outNo = fileno(stdout);
    if (cntOuts) {
        outNos = malloc(cntOuts * sizeof(int));
        for (i = 0; i < cntOuts; i++)
            outNos[i] = fileno(outs[i]);
    }
    
    int errNo = fileno(stderr);
    
    int fields = 1;
    unsigned int cntLines = 0;
    
    char c;
    int afterSlash = 0;
    int inNo = fileno(stdin);
    int n;
    int lineIdx = lineResetIdx;

    do {
        n = read(inNo,buffer,BUFFER_SIZE);
        
//32 sec/gb
/*
        for (i = 0; i < n; i++) {
            char c = buffer[i];
            line[lineIdx++] = c;
            if (c == LF) {
                write(outNo,line,lineIdx);            
                lineIdx = lineResetIdx;
            }
        }
*/
        
//9 sec/gb
        write(outNo,buffer,n);
        
//
    } while (n > 0);
    
    for (i = 0; i < cntOuts; i++)
        fclose(outs[i]);
    
    if (cntOuts) {
        free(outs);
        free(outNos);
    }
    free(buffer);
    free(line);

    return (EXIT_SUCCESS);
}

Open in new window

I just had time to glance at your posts, and I'll be away for 2 hours. In your mock-up tests, I don't understand why you had to kill a process. Shouldn't you exit as soon as you hit the 1GB mark for timing purposes?

Also, I'd like to try your code on my system. I think I mentioned that it took about 18 sec for me to write 1GB to a disk file. So, could you mention what input you use for each timing test - what is the size of the input file and what is the nature of the data in each test. If the data is complicated, the could you provide a template for creation (or place such a file in mediafire (200 MB max), but maybe we can read the file 5x.

In my quick glance I didn't see pipes or mention of gzip - isn't that what you are trying to test. Or, do you now believe that pipes and gzip is not part of the timing problem?
Avatar of modsiw

ASKER

I had to kill it bc someone else needed the box. I was using a 20gb sample (better average) and writing to /dev/null

I believe that pipes and gzip are not part of the problem bc I get the same results starting with a .gz file and piping it trough gzip -dc as I do having already deflated the file on disk. Also of note is that gz is free to run on a different processor. This code is obviously not multithreaded.


http://www.mediafire.com/?9guy1lykvgrzqj0


This is just 3 lines (well, what would be 3 lines without incorrect LFs). It is a tab delimited file with 292 fields. You'll need to double the file many times to get a meaningful test. I have to manually edit the file to remove things my company prob wouldn't want published, so it is a small sample.

I test with a 20gb file. The real data measures into the TBs.



Do note my last code sample. It demonstrates a part of the app that seems to be causing the slowness.
The I/O-only version did both input and output in huge blocks, but the version without the output commented out did line-at-a-time output; that must be the difference.  I think the easiest fix would be to use fopen/fwrite for the output.  I looked back at the original version and saw it was using fputs, which means it needs to scan for nulls pointlessly, when you already know the size.  With fwrite you can pass it the size, but it won't do as many OS-level I/O calls as the version with write..

You should probably make your buffer smaller, too.  I think it's well past the point of diminishing returns where a bigger buffer will help, and to the point where it will slow it down by making your cache useless.
=====================
re: your code post in http:#35037386 
    for (k = 0; k < 3569516; k++) { //20gb = 3569516
        VS 2010 61 sec

cygwin:
cc -O2 no_io.c

      for (k = 0; k < 3569516/6; k++) { //20gb = 3569516

$ time ./a 8192 X

real    0m4.672s
user    0m4.640s
sys     0m0.015s
=====================

For your next post, I had trouble using ddd to both redirect the stdin to an input file. When trying to use VS 2010, the mixture of unix and standard C I/O results in compiler and runtime issues. So, I wasn't able to proceed. I also wasn't sure how you wanted me to setup the arguments and the input file using that code, since the file you linked to was obviously small. If you elaborate to the point that I can reproduce the timing issues, I may be able to identify the problem.

BTW - A coworker complained that all he did was rebuild his embedded system (had a disk), and then the run-time performace dropped significantly. The problem was disk fragmentation. When reading and writing, that alone can cause the weakest link due to time-consuming head movements; but add onto that fragmentation, and now you are really head-moving.
If you wish to repost in standard portable C, I'll be able to better time your various programs using VS 2010. Please include detailed instructions on how to run each program (e.g., what arguments, what input file). Since the input file in your link is short, you may wish to loop to get the right amount of processed bytes.