Solved

code to remove dups

Posted on 2012-04-09
Last Modified: 2012-04-11
Hello there,
I am using this code to remove duplicates from a text file. For now it works perfectly fine, but the issue is that the whole list is loaded into memory, and if list.txt is about 1 GB in size it takes all the available RAM. I would like to know what I can do to prevent this, using C# or C++ (I am using VB2010).


#include "StdAfx.h"
#include <fstream>
#include <iostream>
#include <string>
#include <set>
using namespace std;

int main () {

	printf("software running please wait!!!");

ifstream is1("dups.txt");
ofstream os("no_dups.txt");
set<string> data;

while(!is1.eof()) {

  string sLine;

  getline(is1,sLine);

  data.insert(sLine);
}

set<string>::iterator i;

for (i = data.begin(); i != data.end(); ++i) {

  os << *i << endl;
}

return 0;
}


Question by:XK8ER
12 Comments
 
LVL 8

Expert Comment

by:lomo74
ID: 37826903
You could use a DB library to track duplicates.
An example is attached, using the GDBM library (http://sourceforge.net/projects/gnuwin32/files/gdbm/1.8.3-1/).
main.cpp
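As a rough illustration of the idea (not the attached main.cpp, and assuming the plain <gdbm.h> header rather than the gnuwin32 one; the database file name "seen.db" is just an example), a minimal sketch might look like this. Every line is stored as a key with GDBM_INSERT, which fails when the key already exists, so only the first occurrence of each line is written out, and the duplicate tracking lives on disk rather than in RAM.

// Minimal sketch of the GDBM idea (not the attached main.cpp).
#include <gdbm.h>
#include <fstream>
#include <string>

int main()
{
    std::ifstream in("dups.txt");
    std::ofstream out("no_dups.txt");

    // the duplicate tracking lives on disk, not in RAM
    GDBM_FILE db = gdbm_open(const_cast<char*>("seen.db"), 0, GDBM_NEWDB, 0644, 0);
    if (db == NULL)
        return 1;

    std::string line;
    while (std::getline(in, line))
    {
        datum key;
        key.dptr  = const_cast<char*>(line.c_str());
        key.dsize = (int)line.size();

        // GDBM_INSERT returns 1 if the key already exists (a duplicate),
        // 0 if it was stored for the first time
        if (gdbm_store(db, key, key, GDBM_INSERT) == 0)
            out << line << '\n';
    }

    gdbm_close(db);
    return 0;
}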
 
LVL 8

Expert Comment

by:lomo74
ID: 37826915
BTW, the GDBM library needs a few fixes before it can be used successfully:
1) change this line
#include <gdbm-dll.h>
to
#include "gdbm-dll.h"
2) you have to fetch gdbm-dll.h from the sources, because the installer skips it:
http://sourceforge.net/projects/gnuwin32/files/gdbm/1.8.3-1/gdbm-1.8.3-1-src.zip/download
3) not strictly required for this example, but change this line
GDBM_DLL_IMPEXP gdbm_error gdbm_errno;

to

#if defined(__cplusplus) || defined(c_plusplus)
extern "C" {
#endif
GDBM_DLL_IMPEXP gdbm_error gdbm_errno;
#if defined(__cplusplus) || defined(c_plusplus)
}
#endif
 
LVL 32

Expert Comment

by:sarabande
ID: 37827063
In a first step you could split the file into a set of multiple smaller files, for example by using the first one or two characters of each string to decide which file it goes into. You could also use the text length, or a sensible combination of criteria, so that you end up with roughly 100 files of fairly even size out of the one huge file; the important thing is that duplicates must always land in the same file. In a second step you would reopen each file, remove the duplicates within it, and finally write the combined result to the output file.

Sara
 
LVL 1

Author Comment

by:XK8ER
ID: 37829994
How exactly do I do that with my current code?
 
LVL 32

Expert Comment

by:phoffric
ID: 37830136
If your lines are long, then, knowing the nature of your data, you could insert into a map a short key built from a subset of the characters in each line, with the value being a set of file offsets pointing at the start of the line(s) it came from. The key could be, say, the first, middle and last characters, or the first or last 5 characters - your choice, depending on your data. When reading a new line, you build this abbreviated key and check whether it already exists in the map. If it does, you retrieve the earlier line using the stored offset and compare it with the line you just read: if they match, discard the current line; if they don't, add the new offset to that map node. (From then on, a key match means comparing the current line against the earlier lines at each stored offset.) This keeps only line metadata in your in-memory map, which may let you avoid running out of heap memory.
 
LVL 32

Expert Comment

by:phoffric
ID: 37830143
A better approach, which may lead to more accurate results, is to use a 32-bit CRC of the line as the key rather than a substring. You still probably want to compare the current line with the previously stored line; however, you will find that when the CRC matches, the entire line will also match.
 
LVL 1

Author Comment

by:XK8ER
ID: 37830728
Hmm, I'm not really an expert on C++, so I would need a few sample source files to look at.
 
LVL 32

Expert Comment

by:sarabande
ID: 37831929
A CRC always reduces information, so a matching CRC is a necessary but not a sufficient condition for equality.

I would try to make categories (files) from the first and last character of each string plus its length; these three criteria would also be used to build the filename.

std::map<std::string, std::ofstream*> files;
....
while (std::getline(is1, sLine))
{
      // extract 1st word
      size_t len = sLine.length();
      if (len == 0)
           continue;
      size_t pos = sLine.find_first_of(" \t\r\n"); // find whitespace
      if (pos != std::string::npos)
      {
            len = pos;
            sLine.resize(len);
      }
      // build the file key from first char, last char and word length
      std::ostringstream oss;
      oss << sLine[0] << sLine[len-1] << (int)len;
      std::string filekey = oss.str();
      // get pointer to ofstream from map (created on first use)
      std::ofstream *& pFile  = files[filekey];
      if (pFile == NULL)
      {
             pFile = new std::ofstream((filekey + ".txt").c_str());
      }
      if (!(*pFile << sLine << std::endl))
      {
            // failed to write to file
            ...
      }
}


After is1 hits EOF the loop breaks. You would then iterate over all the files in the map 'files', close them (close the ofstream), and reopen each one for reading (as an ifstream). Then run your current duplicate check on each file with a fresh set and append the results to the output file; a sketch of that second pass follows below.

Sara
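A minimal sketch of that second pass, assuming the split files were named "<filekey>.txt" as in the snippet above (the function name dedup_split_files is just for illustration): only one split file's contents sits in memory at a time.

// Sketch of the second pass: dedup each split file with a set,
// then append the result to the final output.
#include <fstream>
#include <map>
#include <set>
#include <string>

void dedup_split_files(std::map<std::string, std::ofstream*>& files,
                       std::ofstream& os)
{
    for (std::map<std::string, std::ofstream*>::iterator it = files.begin();
         it != files.end(); ++it)
    {
        it->second->close();          // flush and close the ofstream
        delete it->second;

        std::ifstream in((it->first + ".txt").c_str());
        std::set<std::string> data;   // holds one split file only

        std::string line;
        while (std::getline(in, line))
            data.insert(line);

        for (std::set<std::string>::iterator s = data.begin(); s != data.end(); ++s)
            os << *s << '\n';
    }
}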
 
LVL 22

Expert Comment

by:ambience
ID: 37832007
CRC32 isn't exact, because CRCs can collide (at least theoretically). It also only lowers the amount of RAM needed; memory usage would still grow linearly with the number of distinct lines.

With multiple files you have a worst case too: when every line begins with the same letters, or only a few combinations (like log files where every line starts with ERR, INFO, etc.), you end up in the same situation - it lowers the RAM requirement but doesn't really scale.

My vote is also for using a lightweight embedded DB engine like SQLite: http://www.sqlite.org/cintro.html
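A minimal sketch of that route, using only the documented SQLite C API (the file name "dedup.db" and table name "lines" are just for illustration): a UNIQUE column plus INSERT OR IGNORE makes SQLite do the duplicate tracking on disk, and sqlite3_changes() tells us whether a line was new.

// Sketch: dedup via SQLite. "INSERT OR IGNORE" into a UNIQUE column
// silently drops duplicates; sqlite3_changes() reports whether the
// row was actually inserted, i.e. whether the line was seen before.
#include <sqlite3.h>
#include <fstream>
#include <string>

int main()
{
    std::ifstream in("dups.txt");
    std::ofstream out("no_dups.txt");

    sqlite3* db = NULL;
    if (sqlite3_open("dedup.db", &db) != SQLITE_OK)
        return 1;

    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS lines (txt TEXT UNIQUE)", 0, 0, 0);
    sqlite3_exec(db, "BEGIN", 0, 0, 0);   // one big transaction for speed

    sqlite3_stmt* stmt = NULL;
    sqlite3_prepare_v2(db, "INSERT OR IGNORE INTO lines (txt) VALUES (?)", -1, &stmt, 0);

    std::string line;
    while (std::getline(in, line))
    {
        sqlite3_bind_text(stmt, 1, line.c_str(), (int)line.size(), SQLITE_TRANSIENT);
        sqlite3_step(stmt);
        if (sqlite3_changes(db) > 0)      // inserted -> first occurrence
            out << line << '\n';
        sqlite3_reset(stmt);
    }

    sqlite3_exec(db, "COMMIT", 0, 0, 0);
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}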
 
LVL 32

Expert Comment

by:sarabande
ID: 37832317
I don't think it is so difficult to find a good "hash" string for your data. The suggested criteria would give a reasonably good distribution for real or randomly generated names of different lengths. If the keys mostly have the same length, or all share the same prefix, you could compute the sum of all character codes modulo 256 (or a higher value) to get a filename, which limits the maximum number of files to a sensible value (a small sketch of such a key function follows below). You could also abort the first step, discard the results if you got too many or too few files, and try another algorithm.

A database solution is also a good choice, though probably slower; I would not recommend it if you are not experienced in accessing a database programmatically.

Another option is to use the filesystem itself: in an empty temporary folder, open and close a file named exactly like the word you extracted from each line; when you are done, the list of filenames in that folder is the required result. An advantage (or disqualifier) of that method on Windows is that filenames are not case sensitive.

I would guess it is the slowest of all these methods, but depending on your environment I could be wrong (for example, if you were using an SSD).

Sara
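A minimal sketch of the "sum of character codes modulo 256" key mentioned above (the function and file names are illustrative, not from the thread):

// Sketch: bucket key from the sum of character codes modulo 256.
// This caps the number of split files at 256 no matter how the
// input strings are distributed.
#include <sstream>
#include <string>

std::string bucket_name(const std::string& line)
{
    unsigned sum = 0;
    for (std::string::size_type i = 0; i < line.size(); ++i)
        sum += static_cast<unsigned char>(line[i]);

    std::ostringstream oss;
    oss << "bucket_" << (sum % 256) << ".txt";   // e.g. "bucket_173.txt"
    return oss.str();
}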
 
LVL 32

Expert Comment

by:phoffric
ID: 37833790
I left out a word - likely:
>> You still probably want to check the current line with the previous line; however, you will find that when the CRC matches, the entire line will likely also match.
That's why I said you still need to compare the current line with the previously stored line. CRC32 hashes each line to one of about 4 billion integer values, so two identical CRC32 values will most likely come from a duplicate; but, as with all hashes, you still have to check for collisions. (A sketch of this CRC-keyed approach follows below.)

If scalability is actually an issue for you (rather than this just being a C++ learning project), then adding more memory is one solution up to a point. But adding larger disk drives and a database is the natural choice.
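A minimal sketch of that idea, as an illustration rather than code from the thread, assuming zlib's crc32() is available: the map keys are CRC32 values, each mapped to the offsets of lines already written with that CRC; on a CRC match the earlier line is re-read through a second stream and compared byte for byte, so a collision never drops a genuinely new line. Memory holds only a CRC and an offset per distinct line, not the lines themselves.

// Sketch: dedup keyed on CRC32 of each line, with file offsets kept so
// that CRC collisions can be resolved by re-reading the earlier line.
#include <zlib.h>       // crc32()
#include <fstream>
#include <map>
#include <string>
#include <vector>

int main()
{
    std::ifstream in("dups.txt");
    std::ifstream back("dups.txt");   // second handle for random access
    std::ofstream out("no_dups.txt");

    // CRC32 of a line -> offsets of lines already written with that CRC
    std::map<unsigned long, std::vector<std::streampos> > seen;

    std::string line, earlier;
    for (std::streampos pos = in.tellg(); std::getline(in, line); pos = in.tellg())
    {
        unsigned long crc = crc32(0L, Z_NULL, 0);
        crc = crc32(crc, (const Bytef*)line.data(), (uInt)line.size());

        std::vector<std::streampos>& offs = seen[crc];
        bool dup = false;
        for (size_t i = 0; i < offs.size() && !dup; ++i)
        {
            back.clear();
            back.seekg(offs[i]);
            std::getline(back, earlier);
            dup = (earlier == line);   // collision check: compare full lines
        }

        if (!dup)
        {
            offs.push_back(pos);
            out << line << '\n';
        }
    }
    return 0;
}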
 
LVL 23

Accepted Solution

by:wdosanjos (earned 500 total points)
ID: 37834090
Here is a dedup-and-sort C# solution. On my local system, for a 1 GB file, the process takes between 1 minute (all records are duplicates) and 4 minutes (all records are unique) to complete.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Reflection;
using System.IO;
using System.Threading.Tasks;

namespace ConsoleApplication7
{
    public class Program
    {
        const string INPUT_FILE = @"c:\temp\dups.txt";
        const string OUTPUT_FILE = @"c:\temp\no_dups.txt";
        const long BUCKET_SIZE = 1024 * 1024 * 256; // 256 MB

        static void Main(string[] args)
        {
            Console.WriteLine("{0}: Start", DateTime.Now);

            var fileSize = new FileInfo(INPUT_FILE).Length;
            var bucketCount = 1 + (int)(fileSize / BUCKET_SIZE);

            if (bucketCount == 1)
            {
                var data = new SortedSet<string>();

                using (var reader = new StreamReader(INPUT_FILE))
                {
                    for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
                    {
                        data.Add(line);
                    }
                }

                using (var writer = new StreamWriter(OUTPUT_FILE))
                {
                    foreach (var line in data)
                    {
                        writer.WriteLine(line);
                    }
                }
            }
            else
            {
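                // External dedup-and-sort: split the input into temp "bucket" files,
                // dedup and sort each bucket independently in memory, then merge the
                // sorted buckets into the output, skipping lines repeated across buckets.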
                var buckets = new string[bucketCount];

                for (int i = 0; i < bucketCount; i++)
                {
                    buckets[i] = Path.GetTempFileName();
                }

                Console.WriteLine("{0}: Spliting {1} in {2} buckets", DateTime.Now, INPUT_FILE, bucketCount);

                using (var reader = new StreamReader(INPUT_FILE))
                {
                    string line = reader.ReadLine();
                    string prevLine = null;
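                    // prevLine only drops consecutive duplicates during the split;
                    // the full dedup happens per bucket in the next phase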

                    for (int i = 0; i < bucketCount; i++)
                    {
                        Console.WriteLine("Bucket {0}", i);

                        using (var writer = new StreamWriter(buckets[i]))
                        {
                            while (line != null && writer.BaseStream.Length < BUCKET_SIZE)
                            {
                                if (line != prevLine)
                                {
                                    writer.WriteLine(line);

                                    prevLine = line;
                                }

                                line = reader.ReadLine();
                            }
                        }
                    }
                }

                Console.WriteLine("{0}: Dedup & Sort", DateTime.Now);

                var data = new SortedSet<string>();

                for (int i = 0; i < bucketCount; i++)
                {
                    data.Clear();

                    Console.WriteLine("Bucket {0}", i);

                    using (var reader = new StreamReader(buckets[i]))
                    {
                        for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
                        {
                            data.Add(line);
                        }
                    }

                    using (var writer = new StreamWriter(buckets[i]))
                    {
                        foreach (var line in data)
                        {
                            writer.WriteLine(line);
                        }
                    }
                }

                data = null;

                Console.WriteLine("{0}: Writing to {1}", DateTime.Now, OUTPUT_FILE);

                using (var writer = new StreamWriter(OUTPUT_FILE))
                {
                    var readers = new StreamReader[bucketCount];
                    var lines = new string[bucketCount];

                    try
                    {
                        for (int i = 0; i < bucketCount; i++)
                        {
                            readers[i] = new StreamReader(buckets[i]);
                            lines[i] = readers[i].ReadLine();
                        }

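                        // k-way merge: every bucket is sorted, so lines.Min() is the
                        // next line in global order; advancing every reader whose
                        // current line equals it also removes cross-bucket duplicates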
                        for (string line = lines.Min(); line != null; line = lines.Min())
                        {
                            writer.WriteLine(line);

                            for (int i = 0; i < bucketCount; i++)
                            {
                                if (lines[i] == line)
                                {
                                    lines[i] = readers[i].ReadLine();
                                }
                            }
                        }
                    }
                    finally
                    {
                        for (int i = 0; i < bucketCount; i++)
                        {
                            readers[i].Close();
                        }
                    }
                }

                Console.WriteLine("{0}: Clean up", DateTime.Now);

                for (int i = 0; i < bucketCount; i++)
                {
                    File.Delete(buckets[i]);
                }
            }

            Console.WriteLine("{0}: End", DateTime.Now);
        }
    }
}


