Solved

# N - Way Merge

Posted on 2001-06-11
Medium Priority
1,205 Views
Hi All,
I want to merge a set of files efficiently. The files
that I am going to merge are large in size ( > 2 MB) and
are ASCII files.
Now the problem is that I need to develop an algorithm
by which merging takes place efficiently in such a way
that each file is read only once. Once I open a file, it
should be read only once .
I have been suggested to use an n-way merge technique
by which all the files are opened simultaneously, the
first line is read, compared lexicographically and then
written to the new file. Then the second file is read,
and the process is repeated and so on till the end of all
the files are reached.
But this technique seems to take a lot of time. I want
the algorithm to merge and sort with a complexity of o(n).

Can any body suggest me a better technique ?
With some sample code if possible !
Thanks
Ace

0
Question by:aceventura
• 10
• 7

LVL 2

Expert Comment

ID: 6177735
I think your proposition was correct, and most efficient;
Make a list (or an array if you have a fixed or a maximum number of files) of structs that contain a file pointer and a line buffer. - Something like

struct line_post {
FILE *f;
char line[256]; /* or some other maximum line length */
int finnished; /* flag to indicate end of file */
/* struct line_post *next; if a list is required */
} line_array[MAX_FILES];

Assign each file pointer a corresponding file:
for (i = 0; i < files; i++)
if ((line_array[i].f = fopen(file[i], "r")) == NULL)
error();
... or something alike.

Read one line from each file (with fgets). Write the lexically smallest (or biggest) line to the ouput file, and read a new line into the appropriate line buffer. On end-of-file, (fgets returns NULL) set the flag finnished to exclude it in the future. You can also increase a counter, and compare it to the number of files to indicate that everything is read.

I you have a linked list, you can simply remove the line_post on EOF, and continue until the list is empty.

This process gets complexity O(n).

I actually have written a program that does this in DOS (ages ago) but I don't remember if it was in C or Pascal. If you want, I could post it if I manage to dig it up.
0

Author Comment

ID: 6180002
Obg,
Yes, that is what I was looking at. The technique that Im following is something similar to that.
It'd be great if you could manage to dig up the source code and post it .
Thanks
Ace
0

LVL 2

Expert Comment

ID: 6180354
Ok, I will give it a try this evening (CET).
0

Author Comment

ID: 6181173
The main problem here is that the files that I am sorting
tend to get large. And by large I mean > 2-3 MB.

Thats why Im stressing on the complexity part.

Regs
Ace
0

LVL 2

Expert Comment

ID: 6181349
I know what you mean. I can post the whole program (once I've found it). It was primarily intended to sort text files that were ~5M in DOS, and it worked just fine...
0

LVL 2

Expert Comment

ID: 6183643
Oh, this looks like crap. I would not write it like this today... :-( Furthermore it contains inline assembly statements and comments in Swedish... I really need to clean it up a little bit before I make it public. Can you wait until tomorrow? (I sure hope so!)
0

Author Comment

ID: 6184562
No probs Obq...
Take your time ...
Im waiting

ace
0

LVL 2

Accepted Solution

obg earned 300 total points
ID: 6184992
Ok, here it comes. It's still nothing I'm especially proud of, but at least it compiles and runs under unix/gcc.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXTMPS            1000
#define MAXLINES      10000
#define LINESIZE      256

#define DEBUG

#define stricmp(a, b) strcasecmp(a, b)
#define min(a, b) ((a) < (b) ? (a) : (b))

typedef char line[LINESIZE];

struct tmpfile {
FILE *fil;
line str;
int eof;
};

struct tmpfile *tmpfiles[MAXTMPS], *tmpptr;
char *lines[MAXLINES];
int ignorecase = 0, reverse = 0, offset = 0;

void cleanup(void *p[], unsigned int ant)
{
unsigned int i;

for (i = 0; i < ant; i++)
free(((char **)p)[i]);
}

void incfname(char *name)
{
int i;

sprintf(name + 9, "%03d", atoi(name + 9) + 1);
}

void decfname(char *name)
{
int i;

sprintf(name + 9, "%03d", atoi(name + 9) - 1);
}

void erase(char *fname)
{
do {
#ifdef DEBUG
fprintf(stderr, "Removing %s\n", fname);
#endif
remove(fname);
decfname(fname);
} while (strcmp(fname, "~TMPFIL~.-01"));
}

int cmpstr(char *s1, char *s2)
{
int flag;

s1 += min(strlen(s1), offset);
s2 += min(strlen(s2), offset);
if (ignorecase)
flag = stricmp(s1, s2);
else
flag = strcmp(s1, s2);
if (reverse)
return -flag;
else
return flag;
}

int cmpfunc(const void *a, const void *b)
{
return cmpstr(*(char **)a, *(char **)b);
}

int sorttotmp(char *buffer[], unsigned int ant, char *fname)
{
FILE *fil;
int i;

qsort((void *)buffer, ant, sizeof(char *), cmpfunc);
#ifdef DEBUG
fprintf(stderr, "Trying to open %s\n", fname);
#endif
if ((fil = fopen(fname, "w")) == NULL) {
#ifdef DEBUG
fprintf(stderr, "Failed!\n");
#endif
return -1;
}
#ifdef DEBUG
fprintf(stderr, "Success!\n");
#endif
for (i = 0; i < ant; i++) {
fputs(buffer[i], fil);
}
fclose(fil);
return 0;
}

void readstr(struct tmpfile *p)
{
if (fgets(p->str, LINESIZE, p->fil) == NULL) {
p->eof = 1;
fclose(p->fil);
}
}

void usage(void)
{
puts("Sorts input and writes results to the screen, a file, or another device.");
puts("Works like DOS's sort, but this one can handle large files. The argument");
puts("/I is also added. Written by Olle Borg 920910.");
puts("");
puts("XLSORT [arguments] [<[drive1:][path1]filename1] [>[drive2:][path2]filename2]");
puts("[command |] XLSORT [arguments] [> [drive2:][path2]filename2]");
puts("");
puts("Arguments:");
puts("  /I                         Ignores case (that is: A < b and a < B).");
puts("  /R                         Reverses the sort order; that is, sorts Z to A,");
puts("                             then 9 to 0.");
puts("  /+n                        Sorts the file according to characters in");
puts("                             column n.");
puts("");
puts("  [drive1:][path1]filename1  Specifies a file to be sorted.");
puts("  [drive2:][path2]filename2  Specifies a file where the sorted input is to be");
puts("                             stored.");
puts("  command                    Specifies a command whose output is to be sorted.");
}

int main(int argc, char *argv[])
{
unsigned int cnt_tmps = 0, cnt_lines = 0;
line tmpline;
unsigned int i, nrempty = 0;
char tmpfilename[16];

strcpy(tmpfilename, "~TMPFIL~.000");
while (--argc)                        /* Parse arguments. */
if (argv[argc][0] == '/')
switch (argv[argc][1]) {
case 'R':
case 'r':
reverse = 1;
break;
case 'I':
case 'i':
ignorecase = 1;
break;
case '+':
if (argv[argc][2] <= '9' && argv[argc][2] >= '0')
offset = atoi(&argv[argc][2]) - 1;
else {
fprintf(stderr, "Unsupported argument: %s\n",
argv[argc]);
return -1;
}
break;
case '?':
usage();
return 0;
default:
fprintf(stderr, "Unsupported argument: %s\n", argv[argc]);
return -1;
}
else {
fprintf(stderr, "Unsupported argument: %s\n", argv[argc]);
return -1;
}

while (fgets(tmpline, LINESIZE, stdin))
if (cnt_lines < MAXLINES && (lines[cnt_lines] =
malloc(strlen(tmpline) + 1)) != NULL) {
strcpy(lines[cnt_lines++], tmpline);
} else {
#ifdef DEBUG
fprintf(stderr, "Memory full after %d lines!\n", cnt_lines);
#endif
if (sorttotmp(lines, cnt_lines, tmpfilename)) {
erase(tmpfilename);
fprintf(stderr, "Can't open tempfile (1) %s.\n", tmpfilename);
return -1;
}
cnt_tmps++;
incfname(tmpfilename);
cleanup((void *)lines, cnt_lines);
cnt_lines = 0;
if ((lines[cnt_lines] = malloc(strlen(tmpline) + 1)) != NULL)
strcpy(lines[cnt_lines++], tmpline);
else {
fprintf(stderr, "Memory full after cleanup!\n");
return -1;
}
}
if (cnt_lines) {
if (sorttotmp(lines, cnt_lines, tmpfilename)) {
erase(tmpfilename);
fprintf(stderr, "Can't open tempfile (2) %s.\n", tmpfilename);
return -1;
}
cnt_tmps++;
incfname(tmpfilename);
cleanup((void *)lines, cnt_lines);
}

/* Here is the merge part */
for (i = 0; i < cnt_tmps; i++) {
decfname(tmpfilename);
tmpfiles[i] = (struct tmpfile *)malloc(sizeof(struct tmpfile));
tmpfiles[i]->eof = 0;
if ((tmpfiles[i]->fil = fopen(tmpfilename, "r")) == NULL) {
fprintf(stderr, "Can't open more than %d of %d files.\n",
i + 1, cnt_tmps + 1);
fprintf(stderr, "Increase 'FILES=...' in CONFIG.SYS\n");
while (i--)
fclose(tmpfiles[i]->fil);
sprintf(&tmpfilename[9], "%3d", cnt_tmps + 1);
erase(tmpfilename);
return -1;
}
}
while (nrempty < cnt_tmps) {
for (i = 0; tmpfiles[i]->eof; i++);
tmpptr = tmpfiles[i];
while (++i < cnt_tmps)
if (!tmpfiles[i]->eof && cmpstr(tmpfiles[i]->str, tmpptr->str) < 0)
tmpptr = tmpfiles[i];
fputs(tmpptr->str, stdout);
if (tmpptr->eof)
nrempty++;
}
sprintf(&tmpfilename[9], "%03d", cnt_tmps);
erase(tmpfilename);
return 0;
}
0

Author Comment

ID: 6189605
Obq ..
I just got the prog ..
Let me try it out and then I'll get back to you...

Ace
0

LVL 2

Expert Comment

ID: 6189990
Sure! I'm in no hurry... ;-)
0

LVL 2

Expert Comment

ID: 6253808
Ace... Any luck? (I'm still in no hurry, just curious.)
0

Author Comment

ID: 6254486
Hi Obq,
I tried out your program, but thats not exactly what I ]
was looking for.
Anyway thanks for the effort and the time.

Ace
0

LVL 2

Expert Comment

ID: 6254671
Ok, should I get less curious after that...? Do you mind specifing what you miss?
0

LVL 2

Expert Comment

ID: 6272697
Hello again Ace!

I posted a project for you, that I worked quite a lot with a couple of years ago. As for the input I got from you, this seems (to me) to be exactly what you were looking for. Please specify!

Please forgive me if I've got things wrong here, but this is what I see:

I can see that you have asked 7 questions on EE. You have received quite good answers and suggestions to all of those, yet you have not graded even one. - That does not look too good...
0

Author Comment

ID: 6272759
Hi Obq,
Well,sorry for the delay.
Your program did help me to quite an extent and I thank you for that.
I did have to make some modifications to the program to
make it suit my needs.
Thanks again

Ace
0

LVL 2

Expert Comment

ID: 6272796
Thanks for the score, and I am sorry if my suspiscions were wrong. However, I know this program, and all I thought was that if it did not fulfill your needs, I might be the one to modify it...
0

Author Comment

ID: 6272837
Hey Obq ..
No need to feel bad man ... life goes on...
How about mailing me (ashishsudhakar@yahoo.com)..
maybe we can be frens :)
I notice that you are a teacher ... I like teaching
Ace

0

## Featured Post

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

## Join & Write a Comment Already a member? Login.

Preface I don't like visual development tools that are supposed to write a program for me. Even if it is Xcode and I can use Interface Builder. Yes, it is a perfect tool and has helped me a lot, mainly, in the beginning, when my programs were smallâ€¦
Examines three attack vectors, specifically, the different types of malware used in malicious attacks, web application attacks, and finally, network based attacks.  Concludes by examining the means of securing and protecting critical systems and infâ€¦
The goal of this video is to provide viewers with basic examples to understand recursion in the C programming language.
The goal of this video is to provide viewers with basic examples to understand how to create, access, and change arrays in the C programming language.
###### Suggested Courses
Course of the Month5 days, 16 hours left to enroll

#### 569 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.