Solved

N - Way Merge

Posted on 2001-06-11
17
1,155 Views
Last Modified: 2010-05-18
Hi All,
  I want to merge a set of files efficiently. The files
that I am going to merge are large in size ( > 2 MB) and
are ASCII files.
  Now the problem is that I need to develop an algorithm
by which merging takes place efficiently in such a way
that each file is read only once. Once I open a file, it
should be read only once .
  I have been suggested to use an n-way merge technique
by which all the files are opened simultaneously, the
first line is read, compared lexicographically and then
written to the new file. Then the second file is read,
and the process is repeated and so on till the end of all
the files are reached.
  But this technique seems to take a lot of time. I want
the algorithm to merge and sort with a complexity of o(n).
 
  Can any body suggest me a better technique ?
  With some sample code if possible !
Thanks
Ace

0
Comment
Question by:aceventura
  • 10
  • 7
17 Comments
 
LVL 2

Expert Comment

by:obg
ID: 6177735
I think your proposition was correct, and most efficient;
Make a list (or an array if you have a fixed or a maximum number of files) of structs that contain a file pointer and a line buffer. - Something like

struct line_post {
  FILE *f;
  char line[256]; /* or some other maximum line length */
  int finnished; /* flag to indicate end of file */
  /* struct line_post *next; if a list is required */
} line_array[MAX_FILES];

Assign each file pointer a corresponding file:
for (i = 0; i < files; i++)
  if ((line_array[i].f = fopen(file[i], "r")) == NULL)
    error();
... or something alike.

Read one line from each file (with fgets). Write the lexically smallest (or biggest) line to the ouput file, and read a new line into the appropriate line buffer. On end-of-file, (fgets returns NULL) set the flag finnished to exclude it in the future. You can also increase a counter, and compare it to the number of files to indicate that everything is read.

I you have a linked list, you can simply remove the line_post on EOF, and continue until the list is empty.

This process gets complexity O(n).

I actually have written a program that does this in DOS (ages ago) but I don't remember if it was in C or Pascal. If you want, I could post it if I manage to dig it up.
0
 

Author Comment

by:aceventura
ID: 6180002
Obg,
  Yes, that is what I was looking at. The technique that Im following is something similar to that.
  It'd be great if you could manage to dig up the source code and post it .
Thanks
Ace
0
 
LVL 2

Expert Comment

by:obg
ID: 6180354
Ok, I will give it a try this evening (CET).
0
 

Author Comment

by:aceventura
ID: 6181173
The main problem here is that the files that I am sorting
tend to get large. And by large I mean > 2-3 MB.
 
Thats why Im stressing on the complexity part.

Regs
Ace
0
 
LVL 2

Expert Comment

by:obg
ID: 6181349
I know what you mean. I can post the whole program (once I've found it). It was primarily intended to sort text files that were ~5M in DOS, and it worked just fine...
0
 
LVL 2

Expert Comment

by:obg
ID: 6183643
Oh, this looks like crap. I would not write it like this today... :-( Furthermore it contains inline assembly statements and comments in Swedish... I really need to clean it up a little bit before I make it public. Can you wait until tomorrow? (I sure hope so!)
0
 

Author Comment

by:aceventura
ID: 6184562
No probs Obq...
  Take your time ...
   Im waiting

ace
0
 
LVL 2

Accepted Solution

by:
obg earned 100 total points
ID: 6184992
Ok, here it comes. It's still nothing I'm especially proud of, but at least it compiles and runs under unix/gcc.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXTMPS            1000
#define MAXLINES      10000
#define LINESIZE      256

#define DEBUG

#define stricmp(a, b) strcasecmp(a, b)
#define min(a, b) ((a) < (b) ? (a) : (b))

typedef char line[LINESIZE];

struct tmpfile {
  FILE *fil;
  line str;
  int eof;
};

struct tmpfile *tmpfiles[MAXTMPS], *tmpptr;
char *lines[MAXLINES];
int ignorecase = 0, reverse = 0, offset = 0;

void cleanup(void *p[], unsigned int ant)
{
  unsigned int i;

  for (i = 0; i < ant; i++)
    free(((char **)p)[i]);
}

void incfname(char *name)
{
  int i;

  sprintf(name + 9, "%03d", atoi(name + 9) + 1);
}

void decfname(char *name)
{
  int i;

  sprintf(name + 9, "%03d", atoi(name + 9) - 1);
}

void erase(char *fname)
{
  do {
#ifdef DEBUG
    fprintf(stderr, "Removing %s\n", fname);
#endif
    remove(fname);
    decfname(fname);
  } while (strcmp(fname, "~TMPFIL~.-01"));
}

int cmpstr(char *s1, char *s2)
{
  int flag;

  s1 += min(strlen(s1), offset);
  s2 += min(strlen(s2), offset);
  if (ignorecase)
    flag = stricmp(s1, s2);
  else
    flag = strcmp(s1, s2);
  if (reverse)
    return -flag;
  else
    return flag;
}

int cmpfunc(const void *a, const void *b)
{
  return cmpstr(*(char **)a, *(char **)b);
}

int sorttotmp(char *buffer[], unsigned int ant, char *fname)
{
  FILE *fil;
  int i;

  qsort((void *)buffer, ant, sizeof(char *), cmpfunc);
#ifdef DEBUG
  fprintf(stderr, "Trying to open %s\n", fname);
#endif
  if ((fil = fopen(fname, "w")) == NULL) {
#ifdef DEBUG
    fprintf(stderr, "Failed!\n");
#endif
    return -1;
  }
#ifdef DEBUG
  fprintf(stderr, "Success!\n");
#endif
  for (i = 0; i < ant; i++) {
    fputs(buffer[i], fil);
  }
  fclose(fil);
  return 0;
}

void readstr(struct tmpfile *p)
{
  if (fgets(p->str, LINESIZE, p->fil) == NULL) {
    p->eof = 1;
    fclose(p->fil);
  }
}

void usage(void)
{
  puts("Sorts input and writes results to the screen, a file, or another device.");
  puts("Works like DOS's sort, but this one can handle large files. The argument");
  puts("/I is also added. Written by Olle Borg 920910.");
  puts("");
  puts("XLSORT [arguments] [<[drive1:][path1]filename1] [>[drive2:][path2]filename2]");
  puts("[command |] XLSORT [arguments] [> [drive2:][path2]filename2]");
  puts("");
  puts("Arguments:");
  puts("  /I                         Ignores case (that is: A < b and a < B).");
  puts("  /R                         Reverses the sort order; that is, sorts Z to A,");
  puts("                             then 9 to 0.");
  puts("  /+n                        Sorts the file according to characters in");
  puts("                             column n.");
  puts("");
  puts("  [drive1:][path1]filename1  Specifies a file to be sorted.");
  puts("  [drive2:][path2]filename2  Specifies a file where the sorted input is to be");
  puts("                             stored.");
  puts("  command                    Specifies a command whose output is to be sorted.");
}

int main(int argc, char *argv[])
{
  unsigned int cnt_tmps = 0, cnt_lines = 0;
  line tmpline;
  unsigned int i, nrempty = 0;
  char tmpfilename[16];

  strcpy(tmpfilename, "~TMPFIL~.000");
  while (--argc)                        /* Parse arguments. */
    if (argv[argc][0] == '/')
      switch (argv[argc][1]) {
      case 'R':
      case 'r':
      reverse = 1;
      break;
      case 'I':
      case 'i':
      ignorecase = 1;
      break;
      case '+':
      if (argv[argc][2] <= '9' && argv[argc][2] >= '0')
        offset = atoi(&argv[argc][2]) - 1;
      else {
        fprintf(stderr, "Unsupported argument: %s\n",
              argv[argc]);
        return -1;
      }
      break;
      case '?':
      usage();
      return 0;
      default:
      fprintf(stderr, "Unsupported argument: %s\n", argv[argc]);
      return -1;
      }
    else {
      fprintf(stderr, "Unsupported argument: %s\n", argv[argc]);
      return -1;
    }
 
  while (fgets(tmpline, LINESIZE, stdin))
    if (cnt_lines < MAXLINES && (lines[cnt_lines] =
                        malloc(strlen(tmpline) + 1)) != NULL) {
      strcpy(lines[cnt_lines++], tmpline);
    } else {
#ifdef DEBUG
      fprintf(stderr, "Memory full after %d lines!\n", cnt_lines);
#endif
      if (sorttotmp(lines, cnt_lines, tmpfilename)) {
      erase(tmpfilename);
      fprintf(stderr, "Can't open tempfile (1) %s.\n", tmpfilename);
      return -1;
      }
      cnt_tmps++;
      incfname(tmpfilename);
      cleanup((void *)lines, cnt_lines);
      cnt_lines = 0;
      if ((lines[cnt_lines] = malloc(strlen(tmpline) + 1)) != NULL)
      strcpy(lines[cnt_lines++], tmpline);
      else {
      fprintf(stderr, "Memory full after cleanup!\n");
      return -1;
      }
    }
  if (cnt_lines) {
    if (sorttotmp(lines, cnt_lines, tmpfilename)) {
      erase(tmpfilename);
      fprintf(stderr, "Can't open tempfile (2) %s.\n", tmpfilename);
      return -1;
    }
    cnt_tmps++;
    incfname(tmpfilename);
    cleanup((void *)lines, cnt_lines);
  }

  /* Here is the merge part */
  for (i = 0; i < cnt_tmps; i++) {
    decfname(tmpfilename);
    tmpfiles[i] = (struct tmpfile *)malloc(sizeof(struct tmpfile));
    tmpfiles[i]->eof = 0;
    if ((tmpfiles[i]->fil = fopen(tmpfilename, "r")) == NULL) {
      fprintf(stderr, "Can't open more than %d of %d files.\n",
            i + 1, cnt_tmps + 1);
      fprintf(stderr, "Increase 'FILES=...' in CONFIG.SYS\n");
      while (i--)
      fclose(tmpfiles[i]->fil);
      sprintf(&tmpfilename[9], "%3d", cnt_tmps + 1);
      erase(tmpfilename);
      return -1;
    }
    readstr(tmpfiles[i]);
  }
  while (nrempty < cnt_tmps) {
    for (i = 0; tmpfiles[i]->eof; i++);
    tmpptr = tmpfiles[i];
    while (++i < cnt_tmps)
      if (!tmpfiles[i]->eof && cmpstr(tmpfiles[i]->str, tmpptr->str) < 0)
      tmpptr = tmpfiles[i];
    fputs(tmpptr->str, stdout);
    readstr(tmpptr);
    if (tmpptr->eof)
      nrempty++;
  }
  sprintf(&tmpfilename[9], "%03d", cnt_tmps);
  erase(tmpfilename);
  return 0;
}
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 

Author Comment

by:aceventura
ID: 6189605
Obq ..
  I just got the prog ..
  Let me try it out and then I'll get back to you...

Ace
0
 
LVL 2

Expert Comment

by:obg
ID: 6189990
Sure! I'm in no hurry... ;-)
0
 
LVL 2

Expert Comment

by:obg
ID: 6253808
Ace... Any luck? (I'm still in no hurry, just curious.)
0
 

Author Comment

by:aceventura
ID: 6254486
Hi Obq,
  I tried out your program, but thats not exactly what I ]
was looking for.
  Anyway thanks for the effort and the time.

Ace
0
 
LVL 2

Expert Comment

by:obg
ID: 6254671
Ok, should I get less curious after that...? Do you mind specifing what you miss?
0
 
LVL 2

Expert Comment

by:obg
ID: 6272697
Hello again Ace!

I posted a project for you, that I worked quite a lot with a couple of years ago. As for the input I got from you, this seems (to me) to be exactly what you were looking for. Please specify!

Please forgive me if I've got things wrong here, but this is what I see:

I can see that you have asked 7 questions on EE. You have received quite good answers and suggestions to all of those, yet you have not graded even one. - That does not look too good...
0
 

Author Comment

by:aceventura
ID: 6272759
Hi Obq,
   Well,sorry for the delay.
   Your program did help me to quite an extent and I thank you for that.
   I did have to make some modifications to the program to
make it suit my needs.
  Thanks again

Ace
0
 
LVL 2

Expert Comment

by:obg
ID: 6272796
Thanks for the score, and I am sorry if my suspiscions were wrong. However, I know this program, and all I thought was that if it did not fulfill your needs, I might be the one to modify it...
0
 

Author Comment

by:aceventura
ID: 6272837
Hey Obq ..
   No need to feel bad man ... life goes on...
   How about mailing me (ashishsudhakar@yahoo.com)..
     maybe we can be frens :)
       I notice that you are a teacher ... I like teaching
Ace
 
0

Featured Post

How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

Join & Write a Comment

Preface I don't like visual development tools that are supposed to write a program for me. Even if it is Xcode and I can use Interface Builder. Yes, it is a perfect tool and has helped me a lot, mainly, in the beginning, when my programs were small…
This is a short and sweet, but (hopefully) to the point article. There seems to be some fundamental misunderstanding about the function prototype for the "main" function in C and C++, more specifically what type this function should return. I see so…
The goal of this video is to provide viewers with basic examples to understand and use pointers in the C programming language.
The goal of this video is to provide viewers with basic examples to understand how to use strings and some functions related to them in the C programming language.

757 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

22 Experts available now in Live!

Get 1:1 Help Now