Solved

speeding up text processing app

Posted on 2011-03-02
21
322 Views
Last Modified: 2012-06-21
The app below takes about 50 seconds per gb of input. I'm looking for ways to speed it up.

It is CPU bound.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const char NUL = '\0';
const char CR = 0x0D;
const char LF = 0x0A;
const char TAB = '\t';
const char BS = '\\';
const char SPC = ' ';
const int CNT_FIELD = 292;

int main(int argc, char** argv) {
    int i;
    int minArg = 3;
    
    if (argc < minArg) {
        fputs("arguments: line_size prefix {outputs}\n",stderr);
        return EXIT_FAILURE;
    }
    
    const int sizeLine = atoi(argv[1]) + 1;
    if (sizeLine <= 0) {
        fputs("line_size <= 0",stderr);
        return EXIT_FAILURE;
    }

    char* line = malloc(sizeLine * sizeof(char));
    char* lineCursor = line;
    char* lineReset = line;
    
    char* tmp = malloc(sizeLine * sizeof(char));
    char* tmpCursor = tmp;
    char* tmpReset = tmp;
    
    if (strlen(argv[2]) > 0) {
        strcpy(line,argv[2]);
        lineReset = line + strlen(argv[2]);
        *lineReset = TAB;
        lineReset++;
        lineCursor = lineReset;
    }
    
    FILE* out;
    FILE** outs;
    int cntOuts = argc - minArg;
    if (cntOuts) {
        outs = malloc(cntOuts * sizeof(FILE));
        for (i = minArg; i < argc; i++)
            outs[i - minArg] = fopen(argv[i],"a+");
    }
    else
        out = stdout;

    int fields = 1;
    unsigned int cntLines = 0;
    
    char c;
    int afterSlash = 0;
    while (!feof(stdin)) {
        tmpCursor = tmpReset;
        fgets(tmpCursor, sizeLine, stdin);
        while(1) {
            c = *tmpCursor;
            tmpCursor++;
            if (afterSlash) {
                if (c == LF || c == TAB || c == CR) {
                    *lineCursor = SPC;
                    lineCursor++;
                }
                else if (c == NUL) {
                    *lineCursor = SPC;
                    lineCursor++;
                    afterSlash = 0;
                    break;
                }
                else {
                    *lineCursor = BS;
                    lineCursor++;
                    *lineCursor = c;
                    lineCursor++;
                }
                afterSlash = 0;
            }
            else if (c == TAB) {
                fields++;
                *lineCursor = TAB;
                lineCursor++;
            }
            else if (c == NUL) {
                *lineCursor = NUL;
                if (fields == CNT_FIELD) {
                    if (cntOuts)
                        out = outs[cntLines++ % cntOuts];
                    fputs(line,out);
                }
                else if (fields > CNT_FIELD)
                    fputs(line,stderr);
                fields = 1;
                lineCursor = lineReset;
                break;
            }
            else if (c == BS) 
                afterSlash = 1;
            else if (c == CR);
            else {
                *lineCursor = c;
                lineCursor++;
            }
        }
    }
    
    for (i = 0; i < cntOuts; i++)
        fclose(outs[i]);
    
    free(outs);
    free(tmp);
    free(line);
    return (EXIT_SUCCESS);
}

Open in new window

0
Comment
Question by:modsiw
  • 8
  • 8
  • 3
  • +1
21 Comments
 
LVL 13

Expert Comment

by:Superdave
ID: 35022695
You could try using a switch statement instead of the if...else if's.  The compiler will optimize that into some kind of lookup table that is probably faster than all the compares and conditional jumps.  I think that will make a little bit of speed-up, but probably not a whole lot.

You could also replace the fgets with a fixed size read, either fread or use open/read/write instead of the buffered versions.  You're already checking every character so it would not be much more overhead to deal with the end-of-lines as you find them.

Also line 48 should have sizeof(FILE *) rather that sizeof(FILE) but that won't make a difference--just allocates a bit more memory than it needs to.
0
 
LVL 32

Expert Comment

by:phoffric
ID: 35023716
If you do no logic, how long does it take to just do I/O?
0
 
LVL 32

Expert Comment

by:phoffric
ID: 35025095
If willing to use non-portable code, then if using linux, you can memory map a file and treat it as an array for superior I/O performance (almost 10x).
    http://linux.die.net/man/3/mmap
and here is some sample code for write and read:
    http://www.linuxquestions.org/questions/programming-9/mmap-tutorial-c-c-511265/
(I changed the for (i = 1; i <=NUMINTS; ++i) {...} to start at 0; and it ran 9x faster than using fwrite)

I never did this in Windows, but here is a link to get started:
    http://msdn.microsoft.com/en-us/library/aa366556(v=vs.85).aspx
0
 
LVL 32

Expert Comment

by:sarabande
ID: 35026304
i would guess you have same time if you only do the read and writes. 20 mb per second is not so bad.

you could speed up by reading the file in bigger chunks and by storing the outputs in memory and flush them at end.

Sara
0
 
LVL 3

Author Comment

by:modsiw
ID: 35028129
superdave,

I tried using putc / getc. It showed the same sample to 70sec / gb.
I'll try switch, and let ya know.
Thanks for FILE*


phoffric,
I'll give it a shot with just IO.

In normal operation, the input is a pipe coming out of gzip. Output is either named pipe(s) or a normal pipe. I'm not sure what affect this would have on mmap; I've never used it, so I'll have to research. From looking at top, gzip is spending most of its time blocked, and therefore I presume (maybe wrongly?) that it is writing out data plenty fast enough.

I've also tried it reading from a normal file (sample_data) that has already been unzipped. It produces the same speed as from the gzip pipe (50sec / gb).

cat sample_data > named_pipe & cat named_pipe > /dev/null runs in 6-10sec / gb. This is based on SAN traffic I guess.

It's on a Solaris box atm, and slated to go to a RedHat box. I'm ok with non-portable code. This is small and easy enough to port as needed.


Sara,
The input is capable of upwards of 100mb/sec and top reports 100% cpu usage for this process. All the same, I'll reduce the app to just IO and see what happens.

I'm not sure if reading in bigger chunks would really benefit me. Input is from a pipe, no seeks.
I can't flush at the end; the data is too large. The output is also a pipe to a process, so again there are no disk seeks.
I didn't give the info about pipes in & out in the OP, oops. Do you still recommend larger chunks? If so, I'll give it a shot.
0
 
LVL 32

Expert Comment

by:sarabande
ID: 35028829
you won't get 100mb/sec if you read with fgets and have write operations while reading.

streaming operations on stdin are relatively slow.

Sara
0
 
LVL 32

Accepted Solution

by:
phoffric earned 250 total points
ID: 35032276
>> In normal operation, the input is a pipe coming out of gzip. Output is either named pipe(s) or a normal pipe.
Ah, your code suggested you were dealing with disk files, not pipe. mmap uses a virtual addressing mechanism that takes advantage of hardware page faults to quickly read/write random positions in your disk system. Pipe doesn't fall into that category, so you will not be able to use mmap.

Your question now changes its flavor quite a bit, IMO.
0
 
LVL 13

Assisted Solution

by:Superdave
Superdave earned 250 total points
ID: 35033534
A quick thing you could try is to use setvbuf on the files, like:
setvbuf(stdin,buffer,_IOFBF,sizeof buffer);
where buffer is a large character array (like maybe 16K, but you can experiment).
Do the same for the output files.  This means it uses bigger buffers so it needs to do fewer I/O operations.

To do it faster yet, instead of fopen and fgets, use read(0,buffer,sizeof(buffer)) to do input, where buffer is a large character array (like maybe 16K, but you can experiment).  Loop your pointer through that array instead of one line at a time.  When you come to a linefeed, deal with outputting that line.  Since it looks like you're writing to different files, it would be easiest to leave the output alone; just use setvbuf on the output files.
0
 
LVL 32

Expert Comment

by:phoffric
ID: 35033574
>> use read
I tried using open and read, but got no appreciable difference - a little surprised.
0
 
LVL 32

Expert Comment

by:phoffric
ID: 35033720
>> Output is either named pipe(s) or a normal pipe.
On my system it took ~2 sec to write 1GB to a file using mmap vs 17-19 sec using other approaches. But since pipe is serial, it cannot be mapped using mmap. If your design could change to mmap write to a file, and then pipe that file to the next process, perhaps that will improve. (Likewise, next process could mmap read.)

>> I've also tried it reading from a normal file (sample_data) that has already been unzipped. It produces the same speed as from the gzip pipe
>>  top reports 100% cpu usage
    Could you do a timing test with no I/O. Just hard-code a dummy data sequence that represents a worst-case scenario, and loop until 1GB of data is processed. That eliminates I/O and will confirm the evidence that it is your data processing that is the problem. If you then post a program with no I/O and we also see high cpu usage, then maybe see if we can suggest SW improvements.

For overall system performance improvements, perhaps HW will help:
     http://www.aha.com/show_prod.php?id=36
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 3

Author Comment

by:modsiw
ID: 35037386
Below is code with no IO. It runs at 9sec / gb.

It seems the IO is the issue. I'll try the IO things mentioned in this thread and post results.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const int NUL = '\0';
const int CR = 0x0D;
const int LF = 0x0A;
const int TAB = '\t';
const int BS = '\\';
const int SPC = ' ';
const int CNT_FIELD = 292;

const char* LINE0 = "1295414456\tss\ten-US,en;q=0.9,ja;q=\t2011-01-19 00:20:56\t405483083\t841936015\t200,135\tU\t24.43.46.75\t0\t\t\t\thttp://www.removedremoved.com/\t/removed/home/0,1904,,00.html\t\tMozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en; rv:1.9.0.19) Gecko/2010111021 Camino/2.0.6 (like\thome\t\t\t\t\thome\tremoved removed - Easy removeds, Healthy Eating Ideas and Chef removed Videos\t\tHOME,removed\tremoved-SECTION-32176-1\tsection\t\t\tRepeat\t\t\t\t\t\t\t\t\t\t\t\t\t\t\thttp://www.removedremoved.com/\t\t\t\t\tWed-12A\t\t\t\t\t/removed/home/0,1904,,00.html\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tMozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en; rv:1.9.0.19) Gecko/2010111021 Camino/2.0.6 (like Firefox/3.0.19)\t\t0\t0\t\t\t\t\t\t498\t600\t1243\tY\tY\tN\t2\t0\t300\trr.com\t18/0/2011 21:20:54 2 480\t8\t45\t2\t10003,00011,10020,10021\t71\t1295047392\t1295047392\t1295414456\t0\t0\thttp://www.removedremoved.com/removeds/pumpkins-pumpkin-pupcakes-removed/index.html\t/removeds/pumpkins-pumpkin-pupcakes-removed/index.html\thttp://www.removedremoved.com/\t/removed/home/0,1904,,00.html\thttp://www.facebook.com/l.php?u=http%3A%2F%2Fwww.removedremoved.com%2Fremoveds%2Fpumpkins-pumpkin-pupcakes-removed%2Findex.html&h=f130a\t\t0\t2\t1\t0\tlos angeles\tusa\tca\t0\t1\t1\t1\t0\t0\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tRepeat\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tRepeat\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t0\t\t0\t0\t\t340878110\tremovedremovednetnew\t2729687\t2781881361728110592\t4992312158249386792\t600\t1243\t24\tH.21\tY\t\t\t1.7\tY\tDefault Plug-in;Java Embedding Plugin 0.9.7.4;Moveremoveds Quantum Media Player;Shockwave Flash;QuickTime Plug-in 7.6.6;Silverlight Plug-In;DivX Web Player;\tN\t1280x800\t0\twww369.sj2.removed.com\t18/0/2011 21:20:54 2 480\t\t\t\tUSD\t0.000000000000\t2\tY\tN\t803\t\t1\t0\t0\t0\t0\t\t0\t1295414456\t\t\t\t\t1\t0\t405483083\t841936015\t\t\t\t1\tN\t0\t\t1\t\t0\t\t\t\t\t\t\t\t\t\n";
const char* LINE1 = "1295414463\tss\ten-US,en;q=0.9,ja;q=\t2011-01-19 00:21:03\t405483083\t841936015\t200,227,103,135,144\tU\t24.43.46.75\t0\t\t\t\thttp://www.removedremoved.com/search/delegate.do?fnSearchString=cashew+chicken&fnSearchType=site\t/removedredesign/site/search.do\t\tMozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en; rv:1.9.0.19) Gecko/2010111021 Camino/2.0.6 (like\tsearch\t\t\t\t\tsearch\tSearch Results removed removed\t\tSEARCH,removed\tremoved-SEARCH-1\tsearch\t\t\tRepeat\t\t\t\t\t\t\t\tcashew chicken\tcashew chicken\t\t\tsearch:cashew chicken\t\t\thttp://www.removedremoved.com/search/delegate.do?fnSearchString=cashew+chicken&fnSearchType=site\t\t\t\t\tWed-12A\t\t/removed/home/0,1904,,00.html\t\t\t/removedredesign/site/search.do\t\tCHICKEN\t\t\t\t\t\t\t\t\t\t\t\thttp://www.removedremoved.com/\t\tMozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en; rv:1.9.0.19) Gecko/2010111021 Camino/2.0.6 (like Firefox/3.0.19)\t\t0\t0\t\t\t\t\t\t498\t600\t1243\tY\tY\tN\t2\t0\t300\trr.com\t18/0/2011 21:20:54 2 480\t8\t45\t2\t10003,00011,10020,10021\t71\t1295414456\t1295047392\t1295414456\t0\t0\thttp://www.removedremoved.com/removeds/pumpkins-pumpkin-pupcakes-removed/index.html\t/removeds/pumpkins-pumpkin-pupcakes-removed/index.html\thttp://www.removedremoved.com/\t/removed/home/0,1904,,00.html\thttp://www.facebook.com/l.php?u=http%3A%2F%2Fwww.removedremoved.com%2Fremoveds%2Fpumpkins-pumpkin-pupcakes-removed%2Findex.html&h=f130a\t\t0\t2\t2\t3483642180\tlos angeles\tusa\tca\t0\t0\t0\t0\t0\t0\t\t\t\t\tcashew chicken\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tRepeat\t\t\t\t\t\t\t\t\tChicken removed Collection\t\t\t\t\t\t\t\t\tcashew chicken\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tRepeat\t\t\t\t\t\t\t\t\tChicken removed Collection\t\t\t\t\t\t\\\n                                                Search\\\n\t3\t/removed/home/0,1904,,00.html\t1\t0\tSUBMIT\t340878110\tremovedremovednetnew\t2729687\t2781881376760528896\t5725589719228350391\t600\t1243\t24\tH.21\tY\t\t\t1.7\tY\tDefault Plug-in;Java Embedding Plugin 0.9.7.4;Moveremoveds Quantum Media Player;Shockwave Flash;QuickTime Plug-in 7.6.6;Silverlight Plug-In;DivX Web Player;\tN\t1280x800\t0\twww429.sj2.removed.com\t18/0/2011 21:21:1 2 480\t\t\t\tUSD\t0.000000000000\t2\tY\tN\t803\t\t0\t0\t0\t0\t0\t\t0\t1295414463\t\t\t\t\t1\t0\t405483083\t841936015\t\t\t\t1\tN\t0\t\t1\t\t0\t\t\t\t\t\t\t\t\t\n";
const char* LINE2 = "1295414487\tss\ten-US,en;q=0.9,ja;q=\t2011-01-19 00:21:27\t405483083\t841936015\t200,135\tU\t24.43.46.75\t0\t\t\t\thttp://www.removedremoved.com/\t/removed/home/0,1904,,00.html\t\tMozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en; rv:1.9.0.19) Gecko/2010111021 Camino/2.0.6 (like\thome\t\t\t\t\thome\tremoved removed - Easy removeds, Healthy Eating Ideas and Chef removed Videos\t\tHOME,removed\tremoved-SECTION-32176-1\tsection\t\t\tRepeat\t\t\t\t\t\t\t\t\t\t\t\t\t\t\thttp://www.removedremoved.com/\t\t\t\t\tWed-12A\t70|49\t/removedredesign/site/search.do\t\t\t/removed/home/0,1904,,00.html\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tMozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en; rv:1.9.0.19) Gecko/2010111021 Camino/2.0.6 (like Firefox/3.0.19)\t\t0\t0\t\t\t\t\t\t498\t600\t1243\tY\tY\tN\t2\t0\t300\trr.com\t18/0/2011 21:20:54 2 480\t8\t45\t2\t10003,00011,10020,10021\t71\t1295414463\t1295047392\t1295414456\t0\t0\thttp://www.removedremoved.com/removeds/pumpkins-pumpkin-pupcakes-removed/index.html\t/removeds/pumpkins-pumpkin-pupcakes-removed/index.html\thttp://www.removedremoved.com/\t/removed/home/0,1904,,00.html\thttp://www.facebook.com/l.php?u=http%3A%2F%2Fwww.removedremoved.com%2Fremoveds%2Fpumpkins-pumpkin-pupcakes-removed%2Findex.html&h=f130a\t\t0\t2\t3\t1616801072\tlos angeles\tusa\tca\t0\t0\t0\t0\t0\t0\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tRepeat\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tcashew chicken\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tRepeat\t\t\t\t\t\t\t\t\tChicken removed Collection\t\t\t\t\t\t\t0\t\t0\t0\t\t340878110\tremovedremovednetnew\t2729687\t2781881428300103680\t5037348159352429215\t600\t1243\t24\tH.21\tY\t\t\t1.7\tY\tDefault Plug-in;Java Embedding Plugin 0.9.7.4;Moveremoveds Quantum Media Player;Shockwave Flash;QuickTime Plug-in 7.6.6;Silverlight Plug-In;DivX Web Player;\tN\t1280x800\t0\twww378.sj2.removed.com\t18/0/2011 21:21:25 2 480\t\t\t\tUSD\t0.000000000000\t2\tY\tN\t803\t\t0\t0\t0\t0\t0\t\t0\t1295414487\t\t\t\t\t1\t0\t405483083\t841936015\t\t\t\t1\tN\t0\t\t1\t\t0\t\t\t\t\t\t\t\t\t";

int main(int argc, char** argv) {
    int i;
    int minArg = 3;
    
    if (argc < minArg) {
        fputs("arguments: line_size prefix {outputs}\n",stderr);
        return EXIT_FAILURE;
    }
    
    const int sizeLine = atoi(argv[1]) + 1;
    if (sizeLine <= 0) {
        fputs("line_size <= 0",stderr);
        return EXIT_FAILURE;
    }

    char* line = malloc(sizeLine * sizeof(char));
    char* lineCursor = line;
    char* lineReset = line;
    
    char* tmp = malloc(sizeLine * sizeof(char));
    char* tmpCursor = tmp;
    char* tmpReset = tmp;
    
    if (strlen(argv[2]) > 0) {
        strcpy(line,argv[2]);
        lineReset = line + strlen(argv[2]);
        *lineReset = TAB;
        lineReset++;
        lineCursor = lineReset;
    }

    int fields = 1;
    unsigned int cntLines = 0;
    
    char c;
    int afterSlash = 0;
    int notEol;
    int k;
    for (k = 0; k < 3569516; k++) { //20gb = 3569516
        if (k % 3 == 0)
            tmpCursor = LINE0;
        if (k % 3 == 1)
            tmpCursor = LINE1;
        if (k % 3 == 2)
            tmpCursor = LINE2;
        notEol = 1;
        while(notEol) {
            c = *tmpCursor;
            tmpCursor++;
            if (afterSlash) {
                switch(c) {
                    case '\x0A':
                    case '\x0D':
                    case '\t':
                        *lineCursor = SPC;
                        lineCursor++;
                        break;
                    case '\0':
                        *lineCursor = SPC;
                        lineCursor++;
                        afterSlash = 0;
                        notEol = 0;
                        break;
                    default:
                        *lineCursor = BS;
                        lineCursor++;
                        *lineCursor = c;
                        lineCursor++;
                }
                afterSlash = 0;
            }
            else {
                switch(c) {
                    case '\0':
                        if (fields == CNT_FIELD) {
                            *lineCursor = LF;
                            lineCursor++;
                            *lineCursor = NUL;
                            fields = 1;
                            lineCursor = lineReset;
                        }
                        else if (fields > CNT_FIELD) {
                            *lineCursor = LF;
                            lineCursor++;
                            *lineCursor = NUL;
                            fields = 1;
                            lineCursor = lineReset;
                        }
                        notEol = 0;
                        break;
                    case '\\':
                        afterSlash = 1;
                        break;
                    case '\x0D':
                    case '\x0A':
                        break;
                    case '\t':
                        fields++;
                    default:
                        *lineCursor = c;
                        lineCursor++;
                }
            }
        }
    }
    
    free(tmp);
    free(line);
    return (EXIT_SUCCESS);
}

Open in new window

0
 
LVL 3

Author Comment

by:modsiw
ID: 35038569
The code below uses 16M buffers. It ran slower than 70sec / gb. I wasn't able to get an exact time bc I had to kill the process.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const char NUL = '\0';
const char CR = 0x0D;
const char LF = 0x0A;
const char TAB = '\t';
const char BS = '\\';
const char SPC = ' ';
const int CNT_FIELD = 292;
const int TMP_BUFFER_SIZE = 16 * 1024 * 1024;
const int OUT_BUFFER_SIZE = 16 * 1024 * 1024;

int main(int argc, char** argv) {
    int i;
    int minArg = 3;
    
    if (argc < minArg) {
        fputs("arguments: line_size prefix {outputs}\n",stderr);
        return EXIT_FAILURE;
    }
    
    const int sizeLine = atoi(argv[1]) + 1;
    if (sizeLine <= 0) {
        fputs("line_size <= 0",stderr);
        return EXIT_FAILURE;
    }

    char* line = malloc(sizeLine * sizeof(char));
    char* lineCursor = line;
    char* lineReset = line;
    
    char* tmp = malloc(TMP_BUFFER_SIZE * sizeof(char));
    
    if (strlen(argv[2]) > 0) {
        strcpy(line,argv[2]);
        lineReset = line + strlen(argv[2]);
        *lineReset = TAB;
        lineReset++;
        lineCursor = lineReset;
    }
    
    FILE* out;
    FILE** outs;
    int cntOuts = argc - minArg;
    char** outBuffer;
    if (cntOuts) {
        outBuffer = malloc(cntOuts * sizeof(char*));
        outs = malloc(cntOuts * sizeof(FILE*));
        for (i = minArg; i < argc; i++) {
            outs[i - minArg] = fopen(argv[i],"a+");
            outBuffer[i - minArg] = malloc(OUT_BUFFER_SIZE * sizeof(char));
            setvbuf(outs[i - minArg],outBuffer[i - minArg],_IOFBF,OUT_BUFFER_SIZE);
        }
    }
    else {
        outBuffer = malloc(sizeof(char*));
        outBuffer[0] = malloc(OUT_BUFFER_SIZE * sizeof(char));
        setvbuf(stdout,outBuffer[0],_IOFBF,OUT_BUFFER_SIZE);
        out = stdout;
    }
    
    char* inBuffer = malloc(TMP_BUFFER_SIZE * sizeof(char));
    setvbuf(stdin,inBuffer,_IOFBF,TMP_BUFFER_SIZE);

    int fields = 1;
    unsigned int cntLines = 0;
    
    char c;
    int afterSlash = 0;
    int inFileNo = fileno(stdin);
    int n;
    while (!feof(stdin)) {
        n = read(inFileNo,tmp,TMP_BUFFER_SIZE);
        for(i = 0; i < n; i++) {
            c = tmp[i];
            if (afterSlash) {
                switch(c) {
                    case '\x0A':
                    case '\x0D':
                    case '\t':
                        *lineCursor = SPC;
                        lineCursor++;
                        break;
                    case '\0':
                        *lineCursor = SPC;
                        lineCursor++;
                        afterSlash = 0;
                        break;
                    default:
                        *lineCursor = BS;
                        lineCursor++;
                        *lineCursor = c;
                        lineCursor++;
                }
                afterSlash = 0;
            }
            else {
                switch(c) {
                    case '\\':
                        afterSlash = 1;
                        break;
                    case '\x0A':
                        if (fields == CNT_FIELD) {
                            *lineCursor = LF;
                            lineCursor++;
                            *lineCursor = NUL;
                            if (cntOuts)
                                out = outs[cntLines++ % cntOuts];
                            fputs(line,out);
                            fields = 1;
                            lineCursor = lineReset;
                        }
                        else if (fields > CNT_FIELD) {
                            *lineCursor = LF;
                            lineCursor++;
                            *lineCursor = NUL;
                            fputs(line,stderr);
                            fields = 1;
                            lineCursor = lineReset;
                        }
                        break;
                    case '\0':
                    case '\x0D':
                        break;
                    case '\t':
                        fields++;
                    default:
                        *lineCursor = c;
                        lineCursor++;
                }
            }
        }
    }
    
    for (i = 0; i < cntOuts; i++) {
        fclose(outs[i]);
        free(outBuffer[i]);
    }
    
    free(inBuffer);
    free(outBuffer);
    free(outs);
    free(tmp);
    free(line);
    return (EXIT_SUCCESS);
}

Open in new window

0
 
LVL 3

Author Comment

by:modsiw
ID: 35038581
The 16M buffer code fluxed between 75% and 100% cpu usage. It consumed more cpu time than the original. All of my gb / sec s measured by wall clock time
0
 
LVL 3

Author Comment

by:modsiw
ID: 35038761
16k buffer is also slower than original.
0
 
LVL 3

Author Comment

by:modsiw
ID: 35039552
IO only code below.

IO only: 15 sec / gb
Everything but IO: 9 sec / gb

IO & Everything but IO: 50 sec /gb



Why does doing both take so much more time than doing each task separately?
#include <stdio.h>
#include <stdlib.h>

const int BUFFER_SIZE = 16 * 1024 * 1024;

int main(int argc, char** argv) {
    char* buffer = malloc(BUFFER_SIZE * sizeof(char));

    int inFileNo = fileno(stdin);
    int outFileNo = fileno(stdout);
    int n;
    
    while (1) {
        n = read(inFileNo,buffer,BUFFER_SIZE);
        if (n <= 0)
            break;
        write(outFileNo,buffer,n);
    }
    
    free(buffer);
    return (EXIT_SUCCESS);
}

Open in new window

0
 
LVL 3

Author Comment

by:modsiw
ID: 35040046
In the code below. There are two comment blocks. The changes between them seems to be what is slowing it down.

Anyway to speed it back up?

I'm going to see if I can replace the backslash-linefeed sequence with two spaces instead of one. If so, this would remove the requirement to "shift" the array and could opperate on buffer directly and not use line.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const char NUL = '\0';
const char CR = 0x0D;
const char LF = 0x0A;
const char TAB = '\t';
const char BS = '\\';
const char SPC = ' ';
const int CNT_FIELD = 292;
const int BUFFER_SIZE = 16 * 1024 * 1024;

int main(int argc, char** argv) {
    int i;
    int minArg = 3;
    
    if (argc < minArg) {
        fputs("arguments: line_size prefix {outputs}\n",stderr);
        return EXIT_FAILURE;
    }
    
    const int sizeLine = atoi(argv[1]) + 1;
    if (sizeLine <= 0) {
        fputs("line_size <= 0",stderr);
        return EXIT_FAILURE;
    }

    char* line = malloc(sizeLine * sizeof(char));
    
    char* buffer = malloc(BUFFER_SIZE * sizeof(char));
    
    int lineResetIdx = strlen(argv[2]);
    if (lineResetIdx > 0) {
        strcpy(line,argv[2]);
        line[lineResetIdx++] = TAB;
    }
    
    FILE* out;
    FILE** outs;
    int cntOuts = argc - minArg;
    if (cntOuts) {
        outs = malloc(cntOuts * sizeof(FILE*));
        for (i = minArg; i < argc; i++)
            outs[i - minArg] = fopen(argv[i],"a+");
    }
    else
        out = stdout;
    
    int* outNos;
    int outNo = fileno(stdout);
    if (cntOuts) {
        outNos = malloc(cntOuts * sizeof(int));
        for (i = 0; i < cntOuts; i++)
            outNos[i] = fileno(outs[i]);
    }
    
    int errNo = fileno(stderr);
    
    int fields = 1;
    unsigned int cntLines = 0;
    
    char c;
    int afterSlash = 0;
    int inNo = fileno(stdin);
    int n;
    int lineIdx = lineResetIdx;

    do {
        n = read(inNo,buffer,BUFFER_SIZE);
        
//32 sec/gb
/*
        for (i = 0; i < n; i++) {
            char c = buffer[i];
            line[lineIdx++] = c;
            if (c == LF) {
                write(outNo,line,lineIdx);            
                lineIdx = lineResetIdx;
            }
        }
*/
        
//9 sec/gb
        write(outNo,buffer,n);
        
//
    } while (n > 0);
    
    for (i = 0; i < cntOuts; i++)
        fclose(outs[i]);
    
    if (cntOuts) {
        free(outs);
        free(outNos);
    }
    free(buffer);
    free(line);

    return (EXIT_SUCCESS);
}

Open in new window

0
 
LVL 32

Expert Comment

by:phoffric
ID: 35040219
I just had time to glance at your posts, and I'll be away for 2 hours. In your mock-up tests, I don't understand why you had to kill a process. Shouldn't you exit as soon as you hit the 1GB mark for timing purposes?

Also, I'd like to try your code on my system. I think I mentioned that it took about 18 sec for me to write 1GB to a disk file. So, could you mention what input you use for each timing test - what is the size of the input file and what is the nature of the data in each test. If the data is complicated, the could you provide a template for creation (or place such a file in mediafire (200 MB max), but maybe we can read the file 5x.

In my quick glance I didn't see pipes or mention of gzip - isn't that what you are trying to test. Or, do you now believe that pipes and gzip is not part of the timing problem?
0
 
LVL 3

Author Comment

by:modsiw
ID: 35040316
I had to kill it bc someone else needed the box. I was using a 20gb sample (better average) and writing to /dev/null

I believe that pipes and gzip are not part of the problem bc I get the same results starting with a .gz file and piping it trough gzip -dc as I do having already deflated the file on disk. Also of note is that gz is free to run on a different processor. This code is obviously not multithreaded.


http://www.mediafire.com/?9guy1lykvgrzqj0


This is just 3 lines (well, what would be 3 lines without incorrect LFs). It is a tab delimited file with 292 fields. You'll need to double the file many times to get a meaningful test. I have to manually edit the file to remove things my company prob wouldn't want published, so it is a small sample.

I test with a 20gb file. The real data measures into the TBs.



Do note my last code sample. It demonstrates a part of the app that seems to be causing the slowness.
0
 
LVL 13

Expert Comment

by:Superdave
ID: 35040737
The I/O-only version did both input and output in huge blocks, but the version without the output commented out did line-at-a-time output; that must be the difference.  I think the easiest fix would be to use fopen/fwrite for the output.  I looked back at the original version and saw it was using fputs, which means it needs to scan for nulls pointlessly, when you already know the size.  With fwrite you can pass it the size, but it won't do as many OS-level I/O calls as the version with write..

You should probably make your buffer smaller, too.  I think it's well past the point of diminishing returns where a bigger buffer will help, and to the point where it will slow it down by making your cache useless.
0
 
LVL 32

Expert Comment

by:phoffric
ID: 35057338
=====================
re: your code post in http:#35037386
    for (k = 0; k < 3569516; k++) { //20gb = 3569516
        VS 2010 61 sec

cygwin:
cc -O2 no_io.c

      for (k = 0; k < 3569516/6; k++) { //20gb = 3569516

$ time ./a 8192 X

real    0m4.672s
user    0m4.640s
sys     0m0.015s
=====================

For your next post, I had trouble using ddd to both redirect the stdin to an input file. When trying to use VS 2010, the mixture of unix and standard C I/O results in compiler and runtime issues. So, I wasn't able to proceed. I also wasn't sure how you wanted me to setup the arguments and the input file using that code, since the file you linked to was obviously small. If you elaborate to the point that I can reproduce the timing issues, I may be able to identify the problem.

BTW - A coworker complained that all he did was rebuild his embedded system (had a disk), and then the run-time performace dropped significantly. The problem was disk fragmentation. When reading and writing, that alone can cause the weakest link due to time-consuming head movements; but add onto that fragmentation, and now you are really head-moving.
0
 
LVL 32

Expert Comment

by:phoffric
ID: 35057376
If you wish to repost in standard portable C, I'll be able to better time your various programs using VS 2010. Please include detailed instructions on how to run each program (e.g., what arguments, what input file). Since the input file in your link is short, you may wish to loop to get the right amount of processed bytes.
0

Featured Post

IT, Stop Being Called Into Every Meeting

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

Join & Write a Comment

Have you thought about creating an iPhone application (app), but didn't even know where to get started? Here's how: ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ Important pre-programming comments: I’ve never tri…
This is a short and sweet, but (hopefully) to the point article. There seems to be some fundamental misunderstanding about the function prototype for the "main" function in C and C++, more specifically what type this function should return. I see so…
The goal of this video is to provide viewers with basic examples to understand and use structures in the C programming language.
The goal of this video is to provide viewers with basic examples to understand how to create, access, and change arrays in the C programming language.

708 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

18 Experts available now in Live!

Get 1:1 Help Now