kayvey
asked on
error in stdio.h
I have a small C program that I am trying to compile for gdb, and I am getting a very troubling error that points into the libraries:
kayve@kayve-laptop:~/src/thesis/cs899$ gcc -g dirty_demog.c -O0 -o dirty_demog
dirty_demog.c: In function ‘main’:
dirty_demog.c:121: error: ‘statbuf’ undeclared (first use in this function)
dirty_demog.c:121: error: (Each undeclared identifier is reported only once
dirty_demog.c:121: error: for each function it appears in.)
dirty_demog.c:319: warning: passing argument 1 of ‘fflush’ makes pointer from integer without a cast
/usr/include/stdio.h:219: note: expected ‘struct FILE *’ but argument is of type ‘int’
kayve@kayve-laptop:~/src/thesis/cs899$
kayve@kayve-laptop:~/src/thesis/cs899$ uname -a
Linux kayve-laptop 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 04:38:19 UTC 2010 x86_64 GNU/Linux
kayve@kayve-laptop:~/src/thesis/cs899$
Please post (all) the code so we can examine it.
ASKER
Hmm, I added the include now. Did it compile this time? I have never encountered a "note" before:
Quit anyway? (y or n) y
kayve@kayve-laptop:~/src/thesis/cs899$ gcc -g dirty_demog.c -O0 -o dirty_demog
dirty_demog.c: In function ‘main’:
dirty_demog.c:320: warning: passing argument 1 of ‘fflush’ makes pointer from integer without a cast
/usr/include/stdio.h:219: note: expected ‘struct FILE *’ but argument is of type ‘int’
kayve@kayve-laptop:~/src/thesis/cs899$
ASKER
careful what you wish for... {:)
out_apr3 prots_loc uniprot_sprot.dat
kayve@kayve-laptop:~/src/thesis/cs899$ vi dirty_demog.c
kayve@kayve-laptop:~/src/thesis/cs899$ cat dirty_demog.c
#include <stdio.h>
#include <math.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
//#define BLOCKSIZE 1<<25
#define STDOUT 1
#define STDIN 0
int main (int argc, char *argv[]) {
const int MAXLINE = getpagesize();
const int GOOD_EXIT = 0;
const int FALSE = 0;
const int TRUE = 1;
const int BAD_ARGC = -1;
const int BAD_DATAFILE = -2 ;
const int BAD_FSTAT = -3 ;
//const int BLOCKSIZE = 1<<25;
const int BLOCKSIZE = 1<<15;
/*********************************************************************
*
* DATA STRUCTURES
*
*********************************************************************/
/*********************************************************************
*
* line_entry:
*
* A single protein is defined by a prot_entry,as described @
*
* http://expasy.org/sprot/usrman/html#Ter_line (deprecated)
* http://www.expasy.org/sprot/userman.html#CCIA
*
* Each prot_entry has a list of code_set(s), but some line_codes
* allow for multiple instances of single lines of the same code_set.
* The line_content is the wrapper struct for a single line's
* content.
*
*********************************************************************/
typedef struct {
char * content;
struct line_content* next;
} line_content;
/*********************************************************************
*
* code_set:
*
* Some content is associated uniquely with a line code, and other
* line_codes can have multiple instances of different contents.
* The code_set struct allows all content with the same line code
* to be associated in the same list.
*
*********************************************************************/
typedef struct {
char line_code[2];
struct line_content *content_list;
} code_set;
/*********************************************************************
*
* prot_entry:
*
* Each protein is defined by the protein entry, which according
* to the aforementioned URL (in prot_line_entry doc) is begun
* with a line_code that has a value of ID, and ends with the
* termination line, which is a line that begins with the "//"
* terminator code.
*
*********************************************************************/
typedef struct {
struct code_set *line_list;
struct prot_entry *next;
} prot_entry;
/*********************************************************************
*********************************************************************/
int fd, fd_plist, a,b,c, i, len, tot, j,k,l,m,n,got_EOL, line_WRAP;
int tot_proteins, tot_human_proteins, line_num, wrap_SHIFT;
int wrap_ANNEAL, done, used_block, n_prot_lines;
int max_prot_lines,max_prot_chars, this_prot_chars;
float r;
int max_line, zero_line;
long long this_seek, line_begin, seek_result, last_lbegin,status;
long long char_count;
char *line, *this_line, this_char, *block, *wrap_frag, err_msg[1024];
struct stat statbuf;
if ((argc < 2 ) || (argc > 2)) {
printf("USAGE: findmax [datafile].dat \n");
printf(" Finds length of maximum line\n");
return BAD_ARGC;
}
if ((fd = open( argv[1], O_RDONLY )) < 0) {
sprintf(err_msg,"CAN'T OPEN FILE: %s \ncause", argv[1]);
perror(err_msg);
close(fd);
return BAD_DATAFILE;
}
if ((fd_plist = open( "prots_loc", O_WRONLY )) < 0) {
sprintf(err_msg,"CAN'T OPEN FILE: %s \ncause", "prots_loc");
perror(err_msg);
close(fd_plist);
return BAD_DATAFILE;
}
if (fstat(fd, &statbuf) < 0 ) {
sprintf(err_msg,"BAD FILE STATS: %s \ncause", argv[1]);
perror(err_msg);
close(fd);
return BAD_FSTAT;
}/**/
/*line = (char *) malloc(MAXLINE);
if (line == NULL) {
printf("malloc failed line 103\n");
exit(-1);
}/**/
//printf("line address is '%d'\n",line);
i=0;
a=0;
b=0;
c=0;
tot_proteins = 0;
line_num = 0;
char_count = 0;
tot_human_proteins = 0;
max_prot_lines = 0;
char_count = 0;
zero_line = FALSE;
max_prot_chars = 0;
this_prot_chars = 0;
n_prot_lines = 0;
/********************************************************
*
* Data parameterization:
*
* BLOCKSIZE chunks of the files are removed as this_seek
* is incremented by that amount in the outer while
* loop with termination on read failure. The next
* level of while loop iterates on a variable line_begin,
* which must be less than BLOCKSIZE. This index steps
* through the currently examined block by leaps determined
* by an iterator b that checks for newline.
*
********************************************************/
this_seek = 0;
max_line = 0;
line_WRAP = FALSE;
done = FALSE;
used_block = FALSE;
wrap_ANNEAL = FALSE;
if ((seek_result = lseek(fd, (off_t) this_seek, SEEK_SET))<0)
done = TRUE;
/********************************************************
* *********** PUT THESE BACK IN!!!!!!!!!!!!!!!!
********************************************************/
/*printf("about to allocate block\n");
printf("BLOCKSIZE IS %d\n",BLOCKSIZE);
printf("MAXLINE IS %d\n",MAXLINE);
printf("seek_result IS %lld\n",seek_result);/**/
/********************************************************
* *********** PUT THESE BACK IN!!!!!!!!!!!!!!!!
*
********************************************************/
//system("freecolor");
//system("freecolor -o");
block = malloc(BLOCKSIZE);
line_begin = 0;
last_lbegin = 0;
n_prot_lines = 0;
//system("freecolor -o");
/*********************************************************************
* this_seek--ing BLOCKSIZE steps
*********************************************************************/
if ((status = read(fd, block, BLOCKSIZE)) <= 0)
done = TRUE;
while (!done) {
// printf ("entering while !done\n");
//printf("new block this_seek is %lld\n",this_seek);
//printf("seek_result IS %lld\n",seek_result);
//printf("read stats is %d\n",status);
/* if (line_WRAP) {
wrap_ANNEAL = TRUE;
line_WRAP = FALSE;
} /**/
/*****************************************************************
* hopping line by line through one block
* used_block indicates data in block has been exhausted.
*
*****************************************************************/
//while (line_begin < BLOCKSIZE) {
used_block = FALSE;
while (!used_block) {
// printf ("entering while !got_line\n");
/* if ((((int)line_begin / 1000)%1000)==0) {
printf("line_begin is %lld\n",line_begin);
//system("freecolor");
//system("freecolor -o");
}/**/
b=0;
/**************************************************************
* finding next line_begin with b
**************************************************************/
//while ((!got_EOL) && (!line_WRAP)){
got_EOL = FALSE;
while (!got_EOL) {
//printf ("e!gE.");
if (line_begin == BLOCKSIZE) {
this_seek += BLOCKSIZE;
if ((status= (lseek(fd, (off_t) this_seek, SEEK_SET)) < 0)) {
done = TRUE;
got_EOL = TRUE;
used_block = TRUE;
break;
}
if ((status = read(fd, block, BLOCKSIZE)) <= 0) {
got_EOL = TRUE;
used_block = TRUE;
done = TRUE;
break;
}
else
line_begin = 0;
}
b++;
if (line_begin + b >= BLOCKSIZE) {
/**************************************************************
* this line has wrapped across a block.
* shift the file pointer so that the beginning of the line
* is the beginning of the new block.
**************************************************************/
// line_WRAP = TRUE;
if ((line_begin + b) > status) {
got_EOL = TRUE;
used_block = TRUE;
done = TRUE;
break;
}
wrap_SHIFT = b;
//printf("this seek is %lld",this_seek);
this_seek += BLOCKSIZE - wrap_SHIFT;
if ((status= (lseek(fd, (off_t) this_seek, SEEK_SET)) < 0))
done = TRUE;
//printf("lseek status is %lld\n",status);
/*if (status == 1) {
printf("woo hoo\n");
}/*/ // printf("seek_result IS %lld\n",seek_result);
line_begin = 0;
last_lbegin = 0;
if ((status = read(fd, block, BLOCKSIZE)) <= 0) {
/**************************************************************
*
* there is no more file to read--end all loops.
*
**************************************************************/
//printf("status is %lld\n",status);
got_EOL = TRUE;
used_block = TRUE;
done = TRUE;
}
/*else
printf("status is %lld\n",status);/**/
}// if (line_begin + b == BLOCKSIZE) {
if ( block[line_begin + b ] == '\n') {
/*printf("got a \\n\n");
fflush(STDOUT);/**/
n_prot_lines++;
char_count += b;
this_prot_chars += b;
got_EOL = TRUE;
}
if ( b == MAXLINE) {
b -= 2;
got_EOL = TRUE;
}// if ( b == MAXLINE) {
}// while (!got_EOL) {
//printf("\n");
/**************************************************************
**************************************************************/
if (line_begin == BLOCKSIZE)
break;
if ( b > max_line) {
max_line = b;
/*if (max_line > 8200)
printf("max line:\n %s",this_line);/**/
}// if ( b > max_line) {
if (line_begin != BLOCKSIZE)
line_num++;
/*this_line = malloc(b+1);
if (this_line == NULL) {
printf("malloc failed line 210\n");
exit(-2);
}/**/
//printf("this_line address is '%d'\n",this_line);
/*printf("b is %d\n",b);
printf("lb:%lld-ln:%d:",line_begin,line_num);
fflush(STDOUT);/**/
if (b != 0)
printas(STDOUT,block,line_begin,line_begin+b-1);
fflush(STDOUT);/**/
/*printf("out of printas\n");
fflush(STDOUT);/**/
//system("vmstat");
//system("freecolor");
//system("freecolor -o");
a = 0;
if ((block[line_begin] == '/')&&(block[line_begin+1] == '/')) {
tot_proteins++;
if ( n_prot_lines > max_prot_lines)
max_prot_lines = n_prot_lines;
if (this_prot_chars > max_prot_chars)
max_prot_chars = this_prot_chars;
n_prot_lines = 0;
this_prot_chars= 0;
}// if ((block[line_begin] == '/')&&(block[line_begin+1] == '/')) {
if ((block[line_begin] == 'O')&&(block[line_begin+1] == 'S')) {
/*****************************************************************
*
* Organism Species (OS) line
*
* http://www.expasy.org/sprot/userman.html#OS_line
*
*****************************************************************/
/*printf("osl-lb:%lld-ln:%d:",line_begin,line_num);
printas(STDOUT, block, line_begin,line_begin);
fflush(STDOUT);
printas(STDOUT,block,line_begin+ 5, line_begin +8);
fflush(STDOUT);/**/
if ((block[line_begin+5] == 'H')&&(block[line_begin+7] == 'm') &&(block[line_begin+10] == 's')) {
/*****************************************************************
*
* Human protein (Homo Sapiens)
*
*****************************************************************/
tot_human_proteins++;
/*printf("human-lb:%lld-ln:%d:",line_begin,line_num);
printas(STDOUT, block,line_begin,line_begin);
fflush(STDOUT);/**/
/*****************************************************************
*
* PUT BACK IN!!!!?!?!!?
*
*****************************************************************/
/*printf("subtotal human proteins: %d\n",tot_human_proteins);
fflush(STDOUT);
printf("subtotal prteins: %d\n",tot_proteins);
fflush(STDOUT);/**/
/*****************************************************************
*
*****************************************************************/
}// if ((block[line_begin+5] == 'H')&&(block[line_begin+7] == 'm') &&(block[line_begin+10] == 's')) {
}// if ((block[line_begin] == 'O')&&(block[line_begin+1] == 'S')) {
last_lbegin = line_begin;
line_begin += b + 1;
if (line_begin >= BLOCKSIZE) {
used_block = TRUE;
}
}// while (!used_block) {
/*****************************************************************
*****************************************************************/
if (line_begin != BLOCKSIZE)
this_seek += BLOCKSIZE;
used_block = FALSE;
lseek(fd, (off_t) this_seek, SEEK_SET);
//printf("just updated this_seek to %lld\n",this_seek);
//system("freecolor");
//system("freecolor -o");
//if (!line_WRAP) {
/*if (this_line != NULL) {
free(this_line);
this_line = NULL;
}/**/
}// while (!done) was ((status = read(fd, block, BLOCKSIZE)) > 0) {
/*****************************************************************
*****************************************************************/
//printf("new block this_seek is %d\n",this_seek);
//printf("012345678901234567890123456789012345678901234567890\n");
/*****************************************************************
*
* !!!!!!!!!!!!! PUT BACK IN !!!!!!!!!!!!!!
*
*****************************************************************/
/*printf("The longest line has %d characters\n",max_line);
printf("There are a total of %d lines\n",line_num);
printf("subtotal human proteins: %d\n",tot_human_proteins);
printf("There are %d total proteins \n",tot_proteins);
printf("The protein with the most lines has %d lines\n",max_prot_lines);/**/
/*****************************************************************
*
*****************************************************************/
close(fd);
return GOOD_EXIT;
}// int main (int argc, char *argv[]) {
int printas (int fd, char * s, long long begin, long long end) {
/*****************************************************************
*
* PRINTAS--Print Array Segment
*
* Prints an array segment to file fd avoiding segmentation fault
*
* String s[...begin to end. ...]
*
*****************************************************************/
long long i;
int j;
//printf("about to sizeof(block)\n");
system("date");
printf("in printas printing %d to %d of file descriptor %d\n",(int)begin, (int)end,fd);
if (begin <= end) {
//system("date");
//printf("done testing sizeof(block)\n");/**/
i = begin;
j = 0;
while ( i <= end) {
if (!(j%10)) printf("%d",j/10);
else printf(" ");
i++;
j++;
}
printf("\n");
j = 0;
i = begin;
while ( i <= end) {
printf("%d",j);
i++;
j++;
if (j == 10) j = 0;
}
printf("\n");
write(fd,&s[begin],end-begin-1);
write(fd,"\n",1);
printf("in printas just wrote\n");
// i++;
//}
}
write(fd,"\n",1);
//fflush(fd);
}
kayve@kayve-laptop:~/src/thesis/cs899$
ASKER
I decided to comment out line 320, and now it is happily running for a long time.
/usr/include/stdio.h:219: note: expected ‘struct FILE *’ but argument is of type ‘int’
kayve@kayve-laptop:~/src/thesis/cs899$ gdb ./dirty_demog
GNU gdb (GDB) 7.0-ubuntu
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/kayve/src/thesis/cs899/dirty_demog...done.
(gdb) run uniprot_sprot.dat
Starting program: /home/kayve/src/thesis/cs899/dirty_demog uniprot_sprot.dat
Sat Apr 3 11:28:27 PDT 2010
in printas printing 0 to 53 of file descriptor 1
0 1 2 3 4 5
012345678901234567890123456789012345678901234567890123
ID 1001R_ASFK5 Reviewed; 122 A
in printas just wrote
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7ad67fd in fflush () from /lib/libc.so.6
(gdb) break 199
Breakpoint 1 at 0x400d7f: file dirty_demog.c, line 199.
(gdb) run uniprot_sprot.dat
The program being debugged has been started already.
Start it from the beginning? (y or n) uy
Please answer y or n.
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/kayve/src/thesis/cs899/dirty_demog uniprot_sprot.dat
Breakpoint 1, main (argc=2, argv=0x7fffffffe3c8) at dirty_demog.c:209
209 used_block = FALSE;
(gdb) break 218
Breakpoint 2 at 0x400d90: file dirty_demog.c, line 218.
(gdb) continue
Continuing.
Breakpoint 2, main (argc=2, argv=0x7fffffffe3c8) at dirty_demog.c:218
218 b=0;
(gdb) print b
$1 = 0
(gdb) print line_begin
$2 = 0
(gdb) print block[0]
$3 = 73 'I'
(gdb) continue 40
Will ignore next 39 crossings of breakpoint 2. Continuing.
Sat Apr 3 11:32:35 PDT 2010
in printas printing 0 to 53 of file descriptor 1
0 1 2 3 4 5
012345678901234567890123456789012345678901234567890123
ID 1001R_ASFK5 Reviewed; 122 A
in printas just wrote
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7ad67fd in fflush () from /lib/libc.so.6
(gdb) run uniprot_sprot.dat
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/kayve/src/thesis/cs899/dirty_demog uniprot_sprot.dat
Breakpoint 1, main (argc=2, argv=0x7fffffffe3c8) at dirty_demog.c:209
209 used_block = FALSE;
(gdb) break 199
Note: breakpoint 1 also set at pc 0x400d7f.
Breakpoint 3 at 0x400d7f: file dirty_demog.c, line 199.
(gdb) break 267
Breakpoint 4 at 0x400f5d: file dirty_demog.c, line 267.
(gdb) run uniprot_sprot.dat
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/kayve/src/thesis/cs899/dirty_demog uniprot_sprot.dat
Breakpoint 1, main (argc=2, argv=0x7fffffffe3c8) at dirty_demog.c:209
209 used_block = FALSE;
(gdb) print b
$4 = 0
(gdb) continue
Continuing.
Sat Apr 3 11:33:34 PDT 2010
in printas printing 0 to 53 of file descriptor 1
0 1 2 3 4 5
012345678901234567890123456789012345678901234567890123
ID 1001R_ASFK5 Reviewed; 122 A
in printas just wrote
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7ad67fd in fflush () from /lib/libc.so.6
(gdb) break 200
Note: breakpoints 1 and 3 also set at pc 0x400d7f.
Breakpoint 5 at 0x400d7f: file dirty_demog.c, line 200.
(gdb) run uniprot_sprot.dat
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/kayve/src/thesis/cs899/dirty_demog uniprot_sprot.dat
Breakpoint 1, main (argc=2, argv=0x7fffffffe3c8) at dirty_demog.c:209
209 used_block = FALSE;
(gdb) continue
Continuing.
Sat Apr 3 11:42:02 PDT 2010
in printas printing 0 to 53 of file descriptor 1
0 1 2 3 4 5
012345678901234567890123456789012345678901234567890123
ID 1001R_ASFK5 Reviewed; 122 A
in printas just wrote
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7ad67fd in fflush () from /lib/libc.so.6
(gdb) bt
#0 0x00007ffff7ad67fd in fflush () from /lib/libc.so.6
#1 0x00000000004010ce in main (argc=2, argv=0x7fffffffe3c8) at dirty_demog.c:320
(gdb) print line_begin
No symbol "line_begin" in current context.
(gdb) break 319
Breakpoint 6 at 0x401091: file dirty_demog.c, line 319.
(gdb) run uniprot_sprot.dat
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/kayve/src/thesis/cs899/dirty_demog uniprot_sprot.dat
Breakpoint 1, main (argc=2, argv=0x7fffffffe3c8) at dirty_demog.c:209
209 used_block = FALSE;
(gdb) continue
Continuing.
Breakpoint 6, main (argc=2, argv=0x7fffffffe3c8) at dirty_demog.c:319
319 printas(STDOUT,block,line_begin,line_begin+b-1);
(gdb) print line_begin
$5 = 0
(gdb) print b
$6 = 54
(gdb) print block[53]
$7 = 46 '.'
(gdb) print block[54]
$8 = 10 '\n'
(gdb) step
printas (fd=1,
s=0x603010 "ID 1001R_ASFK5", ' ' <repeats 13 times>, "Reviewed; 122 AA.\nAC P0C9F0;\nDT 05-MAY-2009, integrated into UniProtKB/Swiss-Prot.\nDT 05-MAY-2009, sequence version 1.\nDT 05-MAY-2009, entry version 1.\nDE "..., begin=0, end=53) at dirty_demog.c:432
432 system("date");
(gdb) step
Sat Apr 3 11:43:52 PDT 2010
433 printf("in printas printing %d to %d of file descriptor %d\n",(int)begin, (int)end,fd);
(gdb) step
in printas printing 0 to 53 of file descriptor 1
434 if (begin <= end) {
(gdb) print end
$9 = 53
(gdb) print begin
$10 = 0
(gdb) break 445
Breakpoint 7 at 0x401374: file dirty_demog.c, line 445.
(gdb) continue
Continuing.
Breakpoint 7, printas (fd=1,
s=0x603010 "ID 1001R_ASFK5", ' ' <repeats 13 times>, "Reviewed; 122 AA.\nAC P0C9F0;\nDT 05-MAY-2009, integrated into UniProtKB/Swiss-Prot.\nDT 05-MAY-2009, sequence version 1.\nDT 05-MAY-2009, entry version 1.\nDE "..., begin=0, end=53) at dirty_demog.c:445
445 printf("\n");
(gdb) print i
$11 = 54
(gdb) print j
$12 = 54
(gdb) break 454
Breakpoint 8 at 0x4013c6: file dirty_demog.c, line 454.
(gdb) continue
Continuing.
0 1 2 3 4 5
Breakpoint 8, printas (fd=1,
s=0x603010 "ID 1001R_ASFK5", ' ' <repeats 13 times>, "Reviewed; 122 AA.\nAC P0C9F0;\nDT 05-MAY-2009, integrated into UniProtKB/Swiss-Prot.\nDT 05-MAY-2009, sequence version 1.\nDT 05-MAY-2009, entry version 1.\nDE "..., begin=0, end=53) at dirty_demog.c:454
454 printf("\n");
(gdb) step
012345678901234567890123456789012345678901234567890123
455 write(fd,&s[begin],end-begin-1);
(gdb) step
ID 1001R_ASFK5 Reviewed; 122 A456 write(fd,"\n",1);
(gdb) step
457 printf("in printas just wrote\n");
(gdb) step
in printas just wrote
461 write(fd,"\n",1);
(gdb) step
463 }
(gdb) step
main (argc=2, argv=0x7fffffffe3c8) at dirty_demog.c:320
320 fflush(STDOUT);/**/
(gdb) step
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7ad67fd in fflush () from /lib/libc.so.6
(gdb) print STDOUT
No symbol "STDOUT" in current context.
(gdb) run uniprot_sprot.dat
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/kayve/src/thesis/cs899/dirty_demog uniprot_sprot.dat
Breakpoint 1, main (argc=2, argv=0x7fffffffe3c8) at dirty_demog.c:209
209 used_block = FALSE;
(gdb) continue
Continuing.
Breakpoint 6, main (argc=2, argv=0x7fffffffe3c8) at dirty_demog.c:319
319 printas(STDOUT,block,line_begin,line_begin+b-1);
(gdb) print STDOUT
No symbol "STDOUT" in current context.
(gdb) continue
Continuing.
Sat Apr 3 11:48:41 PDT 2010
in printas printing 0 to 53 of file descriptor 1
Breakpoint 7, printas (fd=1,
s=0x603010 "ID 1001R_ASFK5", ' ' <repeats 13 times>, "Reviewed; 122 AA.\nAC P0C9F0;\nDT 05-MAY-2009, integrated into UniProtKB/Swiss-Prot.\nDT 05-MAY-2009, sequence version 1.\nDT 05-MAY-2009, entry version 1.\nDE "..., begin=0, end=53) at dirty_demog.c:445
445 printf("\n");
(gdb) continue
Continuing.
0 1 2 3 4 5
Breakpoint 8, printas (fd=1,
s=0x603010 "ID 1001R_ASFK5", ' ' <repeats 13 times>, "Reviewed; 122 AA.\nAC P0C9F0;\nDT 05-MAY-2009, integrated into UniProtKB/Swiss-Prot.\nDT 05-MAY-2009, sequence version 1.\nDT 05-MAY-2009, entry version 1.\nDE "..., begin=0, end=53) at dirty_demog.c:454
454 printf("\n");
(gdb) continue
Continuing.
012345678901234567890123456789012345678901234567890123
ID 1001R_ASFK5 Reviewed; 122 A
in printas just wrote
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7ad67fd in fflush () from /lib/libc.so.6
(gdb) quit
A debugging session is active.
Inferior 7 [process 16684] will be killed.
Quit anyway? (y or n) y
kayve@kayve-laptop:~/src/thesis/cs899$ gcc -g dirty_demog.c -O0 -o dirty_demog
kayve@kayve-laptop:~/src/thesis/cs899$ ./dirty_demog uniprot_sprot.dat > out_apr3b
ASKER
Not happy: it should be duplicating the file "uniprot_sprot.dat", but the output has overrun (out_apr3c is already larger than the input).
kayve@kayve-laptop:~/src/thesis/cs899$ ls -l
total 3198380
-rwxr-xr-x 1 kayve kayve 17882 2010-04-03 12:00 dirty_demog
-rwxr--r-- 1 kayve kayve 15781 2010-04-03 12:00 dirty_demog.c
-rw-r--r-- 1 kayve kayve 378681 2010-02-15 20:33 ln_good_bad.png
lrwxrwxrwx 1 kayve kayve 16 2010-02-15 16:40 mini_uni.dat -> ../mini_uni.data
lrwxrwxrwx 1 kayve kayve 20 2010-02-15 16:43 mylink -> ../uniprot_sprot.dat
-rw-r--r-- 1 kayve kayve 83 2010-04-03 09:48 out_apr3
-rw-r--r-- 1 kayve kayve 11976739 2010-04-03 11:58 out_apr3b
-rw-r--r-- 1 kayve kayve 3262730240 2010-04-03 12:07 out_apr3c
-rw-r--r-- 1 kayve kayve 83 2010-03-31 21:25 out_mar31
-rw-r--r-- 1 kayve kayve 0 2010-02-24 12:51 prots_loc
lrwxrwxrwx 1 kayve kayve 6 2010-02-18 13:37 stupid -> stupid
lrwxrwxrwx 1 kayve kayve 20 2010-02-15 16:44 uniprot_sprot.dat -> ../uniprot_sprot.dat
kayve@kayve-laptop:~/src/thesis/cs899$ ls ../uniprot_sprot.dat
../uniprot_sprot.dat
kayve@kayve-laptop:~/src/thesis/cs899$ ls -l ../uniprot_sprot.dat
-rwxrwxr-x 1 kayve kayve 2008143039 2010-02-12 19:07 ../uniprot_sprot.dat
kayve@kayve-laptop:~/src/thesis/cs899$
ASKER
^C
Terminated
kayve@kayve-laptop:~/src/thesis/cs899$
kayve@kayve-laptop:~/src/thesis/cs899$ gcc -g dirty_demog.c -O0 -o dirty_demog
kayve@kayve-laptop:~/src/thesis/cs899$ ./dirty_demog uniprot_sprot.dat > out_apr3c
Terminated
kayve@kayve-laptop:~/src/thesis/cs899$
root@kayve-laptop:~# ps -aux | grep dirty
Warning: bad ps syntax, perhaps a bogus '-'? See http://procps.sf.net/faq.html
kayve 14936 0.0 0.0 21100 1928 pts/1 S+ 11:39 0:00 vi dirty_demog.c
kayve 18371 89.7 0.0 3940 500 pts/2 R+ 12:00 6:15 ./dirty_demog uniprot_sprot.dat
root 19721 0.0 0.0 7336 876 pts/0 R+ 12:07 0:00 grep dirty
root@kayve-laptop:~# kill 18371
root@kayve-laptop:~#
ASKER
kayve@kayve-laptop:~/src/thesis/cs899$ tail out_apr3c
CC Distributed under the Creative Commons Attribution-NoDerivs Licen
CC ---------------------------------------------------------------------
DR EMBL; CP001191; ACI57138.1; -; Genomic_DN
DR RefSeq; YP_002283364.1;
DR GeneID; 6982640;
kayve@kayve-laptop:~/src/thesis/cs899$ tail uniprot_sprot.dat
FT CHAIN 2 95 RING finger protein Z.
FT /FTId=PRO_0000361042.
FT ZN_FING 38 74 RING-type; atypical.
FT MOTIF 88 91 PSAP motif late-budding.
FT LIPID 2 2 N-myristoyl glycine; by host (By
FT similarity).
SQ SEQUENCE 95 AA; 10916 MW; D60C0C51C3619E42 CRC64;
MGLRYSKDVK DRYGDREPEG RIPITLNMPQ SLYGRYNCKS CWFANKGLLK CSNHYLCLKC
LTLMLRRSDY CGICGEVLPK KLVFENSPSA PPYEA
//
kayve@kayve-laptop:~/src/thesis/cs899$
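fflush:
http://www.cplusplus.com/reference/clibrary/cstdio/fflush/
takes a FILE * parameter, not an integer file descriptor.
You're mixing ANSI C standard library usage with Unix system-call usage: you open your files with open rather than fopen, and then call fflush, which belongs to the fopen family, on the integer STDOUT. So, be consistent.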
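To make the distinction concrete, here is a minimal sketch of the two styles side by side. The helper names emit_buffered and emit_unbuffered are made up for illustration and are not part of the posted program; stdout and STDOUT_FILENO are the standard names from stdio.h and unistd.h.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Buffered stdio path: fflush() takes a FILE *, so pass stdout, not the integer 1. */
int emit_buffered(const char *text)
{
    if (fputs(text, stdout) == EOF)
        return -1;
    return fflush(stdout);              /* 0 on success, EOF on error */
}

/* Descriptor path: write() has no user-space buffer, so there is nothing to fflush. */
ssize_t emit_unbuffered(int fd, const char *text)
{
    return write(fd, text, strlen(text));   /* bytes written, or -1 on error */
}
```

In the posted program this would probably mean changing fflush(STDOUT) to fflush(stdout) wherever printf output needs to appear before a following write(), and dropping the call where only write() is involved.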
ASKER
is there a flush for open?
Have you found a problem without the flushing? I see you are running Ubuntu; I'll check whether anything has changed in this respect. In the mid-90s, when Linux was in its infancy (I think), the Unix file I/O system (open, read, write) was often referred to as unbuffered I/O. The ANSI C standard library I/O system (fopen, fwrite, etc.), on the other hand, is buffered, meaning that a standard library call does not necessarily result in a kernel call.
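As a rough sketch of what "flushing" can mean at each layer: fflush() only drains the stdio buffer into the kernel, while the POSIX call fsync() asks the kernel to commit a descriptor's data to the device. The two helper names below are hypothetical, and whether you need fsync() at all depends on your situation.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Stream from fopen(): drain the stdio buffer, then ask the kernel to commit. */
int flush_stream_to_disk(FILE *fp)
{
    if (fflush(fp) == EOF)          /* user-space buffer -> kernel */
        return -1;
    return fsync(fileno(fp));       /* kernel cache -> storage device */
}

/* Descriptor from open(): there is no stdio buffer, only the kernel step. */
int flush_fd_to_disk(int fd)
{
    return fsync(fd);
}
```

For plain terminal or pipe output neither step is normally needed after write(); fsync() mainly matters when the data has to survive a crash.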
open ubuntu man page:
http://manpages.ubuntu.com/manpages/dapper/man2/open.2.html
I haven't reviewed your 462-line post in full detail, but perhaps the open flag O_SYNC is what you want. It is an old flag, not unique to Ubuntu, and it forces a flush on every write. The performance can degrade by a factor of 6-12, however, because each write waits for I/O completion (at least that was the case in the mid-90s; on newer systems with large disk caches the delay may be smaller).
O_SYNC
The file is opened for synchronous I/O. Any write()s on the resulting file descriptor will block the calling process until the data has been physically written to the underlying hardware.
Also, take a look at the O_DIRECT flag, which is newer than the mid-90s.
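A minimal sketch of opening an output file with O_SYNC might look like the following. The helper names and the 0644 permission bits are assumptions for illustration; the flags themselves are the ones documented on the open man page.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Open (or create) a file whose write() calls block until the data reaches the device. */
int open_synchronous(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
}

/* Write one line through a descriptor; returns 0 only if every byte was written. */
int write_line(int fd, const char *line)
{
    size_t len = strlen(line);
    ssize_t n = write(fd, line, len);
    return (n == (ssize_t)len) ? 0 : -1;
}
```

With O_SYNC every write_line() call pays the full device latency, so for a multi-gigabyte output it is probably better kept for small files such as an index or log.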
ASKER
I think I was.. I guess I sort of swatted the flies with a sledgehammer here..
I am very concerned with efficiency. How can I justify using C instead of Java or even C++ without being so? The file I am operating on is quite large, about 2GB
kayve@kayve-laptop:~/Pictures/monkeyview$ ls -l /home/kayve/src/thesis/uniprot_sprot.dat
-rwxrwxr-x 1 kayve kayve 2008143039 2010-02-12 19:07 /home/kayve/src/thesis/uniprot_sprot.dat
kayve@kayve-laptop:~/Pictures/monkeyview$
Currently I am using a BLOCKSIZE of 2^25. I wrote this program about 10 months ago and haven't looked at it for about 7 months until I started looking at it a few weeks ago. I guess I need to work on my own efficiency {:P.. I haven't altered too much so far..(since August) I am finally getting to the point where I know what is happening.
I have been hemming and hawing over how I am doing things. I have a bunch of variables that are basically pointers, but they operate on the array that I read in, BLOCKSIZE bytes at a time. It is sort of a problem when the records I am trying to parse cross block boundaries. The array is called "block". I have ints like line_begin and line_end to see where I am. For the time being I have been trying to ensure that it can simply duplicate the input file so I can validate that my code is correct. I have been trying to get counts of the records and of records with given attributes, but garbage in, garbage out, right? If it doesn't duplicate the file correctly, I can't trust the statistics.
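One common way to deal with records that cross block boundaries is to process only the complete lines in each block and carry the unfinished tail over to the front of the buffer before the next read. The sketch below illustrates that idea under stated assumptions: copy_lines_blockwise and write_all are hypothetical names, "processing" is reduced to counting and copying lines, block_size is assumed to be greater than zero, and a line longer than one block is simply passed through in pieces.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all len bytes, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: copy in_fd to out_fd one block at a time, handing only
 * complete lines to the processing step (here: count them and write them).
 * Returns the number of newline-terminated lines, or -1 on error. */
long long copy_lines_blockwise(int in_fd, int out_fd, size_t block_size)
{
    char *block = malloc(block_size);
    if (block == NULL)
        return -1;

    long long lines = 0;
    size_t held = 0;              /* bytes of an unfinished line at block[0] */
    ssize_t n;

    while ((n = read(in_fd, block + held, block_size - held)) > 0) {
        size_t total = held + (size_t)n;   /* valid bytes, not block_size */
        size_t end = total;                /* one past the last '\n' found */

        while (end > 0 && block[end - 1] != '\n')
            end--;

        if (end == 0 && total < block_size) {
            held = total;         /* no complete line yet: keep reading */
            continue;
        }
        if (end == 0)
            end = total;          /* line longer than a block: pass it through */

        for (size_t i = 0; i < end; i++)
            if (block[i] == '\n')
                lines++;
        if (write_all(out_fd, block, end) < 0) {
            free(block);
            return -1;
        }

        held = total - end;       /* partial line left after the last '\n' */
        memmove(block, block + end, held);
    }

    /* At end of input, flush a final line that has no trailing newline. */
    if (n < 0 || write_all(out_fd, block, held) < 0) {
        free(block);
        return -1;
    }
    free(block);
    return lines;
}
```

Because the carried-over bytes stay in the same buffer, no lseek is needed to re-read the partial line, and the output is a byte-for-byte copy whatever block size is chosen, which makes it easy to test with a tiny block size first.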
Hello kayvey,
I take it that you are now able to compile and build without errors as described in your OP.
In general, as you well know, performance improvements may come from an interplay between your algorithm, your architecture, and the OS. As this may get complex, I suggest that you open a new question about improving the performance, stating the performance level currently achieved and what your required goals are (e.g., 2x, 5x, 100x?). If you can include a performance profile showing where the time is being spent, that would also help the contributing Experts.
Be sure to include the target platform, OS, and compiler that you are running the program on, since these can affect the best solution.
Also, opening a new question keeps each of the two threads focused on one question, which not only results in cleaner threads but also improves the quality of the PAQ entry for others searching for help.
BTW - your performance problem does sound pretty interesting and challenging.
I looked again at some of your earlier posts and see that you have commented out some lines (besides the fflush). Do you need to put those lines back in to regain functionality? If so, and if you are getting errors, then we can continue with that thread here.
It may take me a little while to really wrap my head around your long post. But I noticed your comments about your BLOCKSIZE of 2^25 (i.e., 32MB) and "a problem when the records I am trying to parse cross block boundaries". You have divided your large file into only about 60 blocks to improve performance.
But, your problem may not be performance; it may just be a bug in your program.
You also say: "For the time being I have been trying to ensure that it can simply duplicate the input file so I can validate my code is correct. I have been trying to get counts of the records and records with given attributes, but garbage in garbage out, right?"
I don't know how long your program runs, and whether that is an impediment to your solving your problem. You should shed light on this as well.
Normally, the mantra you hear is "don't worry about performance, since when there is a problem 90% of the problem is in only 10% of the code". But maybe not so in your case.
Sounds like you may need help in getting your program to be functionally correct. Do you have two relatively small files you can work with: one with no block boundary crossing, and one with just one crossing of the block boundary? I know from experience that if a program takes days (or in my case, I accepted a program that took months to run), then to determine its correctness you must work on performance from the start. (BTW - I got that program down to less than 1 minute, but it took me 6 weeks to do that!! And yes, I was working with huge files whose records had variable lengths.)
ASKER
I am so happy to have so much company here. I feel like I should give points soon. I don't think I really need the flushes.. I wrote another program to debug my program because I ran a diff on my generated file versus the original file (that took.. I don't know maybe hours) and the diff output file STILL had like 20 million lines, but most of them contained simply a "<," a space and a carriage return
kayve@kayve-laptop:~/src/thesis/CSc899$ cat parse_diff.c
#include <stdio.h>
#include <limits.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/mman.h>
int main (int argc, char *argv[]) {
    const int TRUE = 1;
    const int FALSE = 0;
    const int STDOUT = 1;
    const int STDIN = 0;
    const int BAD_INFILE = -1;
    const int OUT_OF_MEMORY = -2;
    const int MAXLINE = 4096;
    const long long BLOCKSIZE = 1<<29;
    int fd, done, PAGESIZE, status;
    char err_msg[1024], line[MAXLINE], *block;
    long long i = 0, inf_size, numlines = 0, j = 0, last_nl = 0, k = 0;
    long long line_begin = 0, last_line_begin = 0;
    struct stat fstat;
    PAGESIZE = getpagesize();
    /* malloc returns NULL on failure, so test for NULL rather than < 0 */
    if ((block = malloc(BLOCKSIZE)) == NULL) {
        sprintf(err_msg, "CAN'T MALLOC %lld BYTES \ncause:",BLOCKSIZE);
        perror(err_msg);
        return OUT_OF_MEMORY;
    } //----- if ((block = malloc(BLOCKSIZE)) == NULL) -----//
    if ((fd = open( argv[1], O_RDONLY )) < 0) {
        sprintf(err_msg, "CAN'T OPEN FILE: %s \ncause:",argv[1]);
        perror(err_msg);
        close(fd);
        return BAD_INFILE;
    } //----- if ((fd = open( argv[1], O_RDONLY )) < 0) -----//
    stat(argv[1],&fstat);
    inf_size = (long long) fstat.st_size;
    printf("fd is %d\n",fd);
    /* long long values need %lld, not %d */
    printf("the \"%s\" file is %lld\n bytes.\n",argv[1],inf_size);
    printf("BLOCKSIZE-filesize= %lld\n",BLOCKSIZE-inf_size);
    done = FALSE;
    if ((status = read(fd, block, BLOCKSIZE)) <= 0 )
        done = TRUE;
    /*if (done) printf ("DONE IS TRUE\n");
    else printf ("DONE IS FALSE\n");/**/
    printf("status is %d\n",status);
    while (!done) {
        i++;
        printf("read block %lld\n",i);
        line_begin = 0;
        last_line_begin = 0;
        for(j=2; j<BLOCKSIZE; j++) {
            if ( block[j] == '\n') {
                numlines++;
                last_line_begin = line_begin;
                line_begin = j + 1;
                if ((block[j-1] != ' ') && (block[j-2] != '<')
                    &&(!((block[last_line_begin] >= '0')
                    &&(block[last_line_begin] <= '9')))) {
                    printf("line %lld:",numlines);
                    for(k=last_line_begin; k<line_begin; k++)
                        printf("%c",block[k]);
                } //----- if ((block[j-1] != ' ') && (block[j-2] != '<')) -----//
            } //----- if ( block[j] == '\n') -----//
        } //----- for(j=0; j<BLOCKSIZE; j++) -----//
        lseek(fd, (off_t)((i-1)*BLOCKSIZE+line_begin), SEEK_SET);
        if ((status = read(fd, block, BLOCKSIZE)) <= 0 )
            done = TRUE;
    } //----- while (!done) -----//
    printf("there are %lld lines in the file\n",numlines);
    close(fd);
} //----- int main (int argc, char *argv[]) -----//
kayve@kayve-laptop:~/src/thesis/CSc899$
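Since the goal described above is a byte-for-byte duplicate of the input, a full diff may be more than is needed; the standard cmp utility reports only the first differing byte, and the same check is easy to sketch in C. The helper name first_difference and its return codes are assumptions chosen for this illustration.

```c
#include <stdio.h>

/* Hypothetical helper: return the 0-based offset of the first byte at which
 * two files differ (a shorter file counts as differing at its end),
 * -1 if they are identical, or -2 if either file cannot be opened. */
long long first_difference(const char *path_a, const char *path_b)
{
    FILE *a = fopen(path_a, "rb");
    FILE *b = fopen(path_b, "rb");
    long long offset = 0, result = -1;

    if (a == NULL || b == NULL) {
        if (a) fclose(a);
        if (b) fclose(b);
        return -2;
    }
    for (;;) {
        int ca = getc(a);
        int cb = getc(b);
        if (ca != cb) {          /* includes the case where only one hit EOF */
            result = offset;
            break;
        }
        if (ca == EOF)           /* both ended together: identical */
            break;
        offset++;
    }
    fclose(a);
    fclose(b);
    return result;
}
```

A single pass like this stops at the first mismatch, so on a 2GB file whose copy goes wrong early it can answer in seconds, and the offset it returns shows which block the problem falls in.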
ASKER
This was an interesting bit:
if ((block[j-1] != ' ') && (block[j-2] != '<')
&&(!((block[last_line_begin] >= '0')
&&(block[last_line_begin] <= '9')))) {
printf("line %lld:",numlines);
for(k=last_line_begin; k<line_begin; k++)
printf("%c",block[k]);
After getting rid of the lines I described above, I realized there were also tons of lines in the diff output of the form [line#]d[line#], which of course is diff's way of telling you that a line was missing. Now I am running gdb on the code.. I started it up, walked about a block to get a sparkling mineral water and back, puttered around on this legal help website, and it's STILL running. My program above told me that, in addition to whatever is creating all those silly one-space lines (I realized that is another problem on the way to the mineral water cafe), the only difference is at the end of the file.. on the 40 millionth line (give or take a few hundred thousand lines)!
So I hope I didn't gdb->continue too many steps {:P I will have to start that aaalllllll over! {:P
Curious.. When using cygwin, I had to change BLOCKSIZE and PAGESIZE to other names to remove errors.
If your problem is debugging, why don't you make the problem more manageable by setting blockSize to maybe 1024 and working with smaller files, so that one debugging session doesn't take hours and you don't have to start all over if you forget to stop at the right location :)
Using the setting below, the input file is 1050 bytes. If your program is supposed to output the same file, it got lost at page boundaries. I commented out the informational output and extra items like "Line #:".
The resultant file, ana.cpp, was 1059 bytes, and the middle of the file was repeated twice, clobbering the end.
>> for(j=2; j<blockSize; j++) {
why do you start at j=2?
fyi - your lseek is called twice with line_begin = 1005
what is the purpose of line_begin?
const long long blockSize = 1<<10;
./a anagram_no_stl.cpp > ana.cpp
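One possible explanation for the repeated middle section, offered here as a guess rather than a confirmed diagnosis, is that the scanning loop runs up to the full block size even when read() returned fewer bytes, so bytes left over from the previous block are scanned again. A minimal sketch of a loop bounded by the return value of read() follows; count_newlines is a hypothetical name, and the caller is assumed to supply the buffer.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: count '\n' bytes on fd, scanning only the bytes that
 * each read() call actually returned. Returns the count, or -1 on error. */
long long count_newlines(int fd, char *buf, size_t buf_size)
{
    long long lines = 0;
    ssize_t n;

    while ((n = read(fd, buf, buf_size)) > 0) {
        /* n may be smaller than buf_size, e.g. for the last block of a file;
         * bytes from index n onward are left over from an earlier read. */
        for (ssize_t j = 0; j < n; j++)
            if (buf[j] == '\n')
                lines++;
    }
    return n < 0 ? -1 : lines;
}
```

With a 1050-byte input and a 1024-byte buffer, the second read() would return only 26 bytes, so a loop bounded by n touches 26 bytes instead of rescanning the remainder of the buffer.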
ASKER
the point is not to do "small files." that would defeat the whole purpose.
Stevens & Rago, "Advanced Programming in the UNIX Environment, Second Edition", Addison-Wesley, 2005, has a reference for block size.. I made those variables myself.. oops.. hope I didn't step on anything. Stevens & Rago note the significant performance gains from increasing the block size. My code might be slightly spaghetti-y; I can fix that. In order to optimize efficiency, all operations are on my block array, resident in memory, for the fastest possible processing. I have a number of variables that keep track of places in the block. I am going to switch this all to mmap at some point. I read in Stevens & Rago that mmap is faster than read/write, plus it will aid me in my transition to multiprocessor/multicore tuning.
I started with a small BLOCKSIZE, and found the performance difference was highly significant. Yesterday I did some tests and found that a BLOCKSIZE of 2^30 suddenly caused a performance hit, perhaps because my block didn't fit in some cache? 2^29 showed no problems, so that is what I will be going with.
I am not running an emulator (is that not what cygwin is? I have worked with that myself, but currently happily casting off Uncle Bill Gates) -- I have installed Ubuntu Linux directly on its very own barebones internal for my laptop. I have my Uncle Bill HD lying around somewhere. I can mount it if needed. Specs of Ubuntu:
root@kayve-laptop:~# uname -a
Linux kayve-laptop 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 04:38:19 UTC 2010 x86_64 GNU/Linux
root@kayve-laptop:~#
BLOCKSIZE29.png
BLOCKSIZE30-killed.png
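For the planned switch to mmap, a minimal sketch might look like the following. It maps the whole file read-only and counts newlines; count_lines_mmap is a hypothetical name, and the sketch assumes a 64-bit system such as the x86_64 machine above, where a 2GB file fits in the address space in one mapping.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count '\n' bytes in a file through a read-only mapping.
 * Returns the count, or -1 on error. */
long long count_lines_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {       /* mmap() rejects a zero-length mapping */
        close(fd);
        return 0;
    }

    size_t size = (size_t)st.st_size;
    char *data = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                   /* the mapping stays valid after close() */
    if (data == MAP_FAILED)
        return -1;

    long long lines = 0;
    const char *p = data;
    const char *end = data + size;
    /* memchr jumps from newline to newline; there are no block boundaries. */
    while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        lines++;
        p++;
    }
    munmap(data, size);
    return lines;
}
```

Because the file appears as one contiguous array, records never straddle a buffer boundary, which removes the bookkeeping that the block-based version needs. Whether it is actually faster on a given machine is something to measure rather than assume.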
>> the point is not to do "small files." that would defeat the whole purpose.
I understand that you have a large file, and that there are performance issues. But ..
>> I wrote another program to debug my program because I ran a diff on my generated file versus the original file (that took.. I don't know maybe hours) and the diff output file STILL had like 20 million lines, but most of them contained simply a "<," a space and a carriage return
So, maybe I misunderstood you. I thought now you have two problems - performance, and worse, the program does not work correctly, and earlier you suspected that the problem is related to the case where a record crosses block boundaries.
To figure out what is wrong with a program that takes hours, you want to try to modify the program so that a run takes only seconds. Then you may have a chance to debug it in your lifetime. (If you try to debug a program that takes many hours to get through one run, then for each minute that it runs, it takes one day off your expected lifespan.)
That is why I rewrote your test program. I was hoping to live a little longer. In any case, I'm not even sure what exactly your test program is supposed to do. I was guessing, but given your logic, I won't hazard any further guesses. I'll just let you tell me what it is supposed to do, in as much gory detail as you have time for.
kayvey:
You can close this and award multiple posts. See:
https://www.experts-exchange.com/help.jsp#hs=8&hi=100
ASKER
I hit the dang button. The buttons are gone. WTF is going on here??
ASKER
There were TWO buttons before and they were gone. I was at a webpage with check boxes. The multiple points buttons have disappeared. You guys have some debugging to do on your website!
>> WTF
Maybe because there is a "close" operation taking place. I'll contact the zone administrator.
ASKER
Won't mind learning something tho..