code to remove dups

Hello there,
I am using the code below to remove duplicates from a text file.
For now it works perfectly fine, but the issue is that the whole list is loaded into memory, so if list.txt is about 1 GB in size it takes all the available RAM. What can I do to prevent this, using C# or C++? I am using VB2010.


#include "StdAfx.h"
#include <fstream>
#include <iostream>
#include <string>
#include <set>
using namespace std;

int main () {

	printf("software running please wait!!!");

ifstream is1("dups.txt");
ofstream os("no_dups.txt");
set<string> data;

while(!is1.eof()) {

  string sLine;

  getline(is1,sLine);

  data.insert(sLine);
}

set<string>::iterator i;

for (i = data.begin(); i != data.end(); ++i) {

  os << *i << endl;
}

return 0;
}


XK8ER asked:

lomo74 commented:
You could use a DB library to keep track of the duplicates on disk instead of in memory.
Example attached (main.cpp), using the GDBM library (http://sourceforge.net/projects/gnuwin32/files/gdbm/1.8.3-1/).
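A rough sketch of that idea (the attached main.cpp may differ; the file names here are just examples): gdbm_store with the GDBM_INSERT flag reports when a key is already in the on-disk database, which is all that is needed to filter duplicates.

#include <fstream>
#include <iostream>
#include <string>
#include "gdbm-dll.h"   // GDBM header from the GnuWin32 package linked above

int main()
{
    std::ifstream is1("dups.txt");
    std::ofstream os("no_dups.txt");

    // the set of already-seen lines lives in an on-disk hash database, not in RAM
    GDBM_FILE db = gdbm_open(const_cast<char*>("seen.gdbm"), 0, GDBM_NEWDB, 0644, 0);
    if (db == NULL)
        return 1;

    std::string line;
    while (std::getline(is1, line))
    {
        datum key;
        key.dptr  = const_cast<char*>(line.c_str());
        key.dsize = (int)line.size();

        // with GDBM_INSERT, gdbm_store returns 1 if the key already exists
        if (gdbm_store(db, key, key, GDBM_INSERT) == 0)
            os << line << '\n';   // first occurrence -> keep it
    }

    gdbm_close(db);
    return 0;
}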
lomo74 commented:
BTW, the GDBM library needs some fixes before it can be used successfully:
1) change this line
#include <gdbm-dll.h>
to
#include "gdbm-dll.h"
2) you have to fetch gdbm-dll.h from the sources because the installer skips it lol.
http://sourceforge.net/projects/gnuwin32/files/gdbm/1.8.3-1/gdbm-1.8.3-1-src.zip/download
3) not strictly required for this example... change this line
GDBM_DLL_IMPEXP gdbm_error gdbm_errno;

to

#if defined(__cplusplus) || defined(c_plusplus)
extern "C" {
#endif
GDBM_DLL_IMPEXP gdbm_error gdbm_errno;
#if defined(__cplusplus) || defined(c_plusplus)
}
#endif
sarabande commented:
In a first step you could split the file into a set of multiple smaller files, for example by using the first one or two characters of each string to decide which file it gets written to. You could also use the text length, or a sensible combination of criteria, so that you get about 100 files of fairly even size out of the one huge file. The important thing is that duplicates must always be written to the same file. In a second step you would rewind/reopen the files, remove the duplicates within each file, and finally write the resulting set to the output file.
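A minimal sketch of that first (partitioning) step, keyed here on just the first character; the bucket file names are only an example:

#include <fstream>
#include <map>
#include <sstream>
#include <string>

// pass 1: split dups.txt into bucket files so that identical lines always
// land in the same bucket (bucket chosen by the first character of the line)
int main()
{
    std::ifstream in("dups.txt");
    std::map<char, std::ofstream*> buckets;
    std::string line;

    while (std::getline(in, line))
    {
        char key = line.empty() ? '\0' : line[0];

        std::ofstream *&bucket = buckets[key];
        if (bucket == NULL)
        {
            std::ostringstream name;               // use the numeric character code in the name
            name << "bucket_" << (int)(unsigned char)key << ".txt";
            bucket = new std::ofstream(name.str().c_str());
        }
        *bucket << line << '\n';
    }

    for (std::map<char, std::ofstream*>::iterator it = buckets.begin(); it != buckets.end(); ++it)
    {
        it->second->close();
        delete it->second;
    }
    return 0;
}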

Sara

XK8ER (author) commented:
How exactly do I do that with my current code?
phoffric commented:
If your lines are long then, knowing the nature of your data, you could insert into a map a short key string taken from a subset of the characters in the line, where the value is a set of file offsets pointing at the start of the matching line(s). The key could be built from the first, middle, or last characters - your choice, depending on your data; it could be the first or last 5 characters of the string. When you read a new line, you build this abbreviated key and check whether it already exists in the map. If it does, retrieve the stored line via the offset and compare it with the line just read. If it matches, discard the current line. If it doesn't match, add the new offset to that node. (From then on, when that key matches again, you will have to compare the current line against the lines at both offsets.) This may let the line metadata fit in your in-memory map without running out of memory on the heap.
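A minimal sketch of that approach, assuming the abbreviated key is built from the first and last 5 characters (the key choice and file names are only examples):

#include <fstream>
#include <map>
#include <string>
#include <vector>

int main()
{
    // binary mode so tellg/seekg offsets are exact byte positions
    std::ifstream in("dups.txt", std::ios::binary);
    std::ifstream lookup("dups.txt", std::ios::binary);   // second handle to re-read candidate lines
    std::ofstream out("no_dups.txt", std::ios::binary);

    std::map<std::string, std::vector<std::streampos> > seen;   // abbreviated key -> offsets of kept lines
    std::string line;

    for (std::streampos pos = in.tellg(); std::getline(in, line); pos = in.tellg())
    {
        // abbreviated key: first 5 and last 5 characters (adjust to your data)
        std::string key = line.substr(0, 5);
        if (line.size() > 5)
            key += line.substr(line.size() - 5);

        std::vector<std::streampos> &offsets = seen[key];
        bool duplicate = false;
        for (size_t i = 0; i < offsets.size() && !duplicate; ++i)
        {
            std::string stored;
            lookup.clear();
            lookup.seekg(offsets[i]);
            std::getline(lookup, stored);
            duplicate = (stored == line);     // full compare: the short key alone is not enough
        }

        if (!duplicate)
        {
            offsets.push_back(pos);           // remember where this line starts
            out << line << '\n';
        }
    }
    return 0;
}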
phoffric commented:
A better approach, which may lead to more accurate results, is to use a 32-bit CRC value of the line as the key rather than a string. You still probably want to check the current line against the previous line; however, you will find that when the CRC matches, then the entire line will also match.
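A small CRC-32 helper (standard reflected polynomial 0xEDB88320) that could replace the abbreviated-string key in the sketch above, e.g. with a std::map<unsigned int, std::vector<std::streampos> >:

#include <string>

// bitwise CRC-32 (polynomial 0xEDB88320); slow but tiny and table-free
unsigned int crc32_of(const std::string &s)
{
    unsigned int crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < s.size(); ++i)
    {
        crc ^= (unsigned char)s[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}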
XK8ER (author) commented:
Hmm, I'm not really an expert in C++, so I would need some sample source code.
sarabande commented:
A CRC always reduces information, so it is a necessary but not a sufficient condition for equality.

I would try to make categories (files) keyed on the first and last character of the string and its length. These three criteria would also be used to determine the file name.

std::map<std::string, std::ofstream*> files;
....
while (std::getline(is1, sLine))
{
      // extract 1st word
      size_t len = sLine.length();
      if (len == 0)
           continue;
      size_t pos = sLine.find_first_of(" \t\r\n"); // find whitespace
      if (pos != std::string::npos)
      {
            len = pos;
            sLine.resize(len);
      }
      // build the file key from first character, last character and length
      std::ostringstream oss;
      oss << sLine[0] << sLine[len-1] << (int)len;
      std::string filekey = oss.str();
      // get pointer to ofstream from map (created on first use)
      std::ofstream *& pFile = files[filekey];
      if (pFile == NULL)
      {
            pFile = new std::ofstream((filekey + ".txt").c_str());
      }
      if (!(*pFile << sLine << std::endl))
      {
            // failed to write to file
            ...
      }
}


After EOF of is1 the loop breaks. You would then iterate over all files in the map 'files', close them (close the ofstream), and reopen them for reading (as ifstream). Then do your current duplicate check on each one, using a new map.
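A minimal sketch of that second pass, reusing the hypothetical 'files' map and the filekey + ".txt" naming from the snippet above:

#include <fstream>
#include <map>
#include <set>
#include <string>

// pass 2: for each bucket written in pass 1, re-read it, drop duplicates with a
// set that only ever holds one bucket's lines, and append the result to 'os'
void dedup_buckets(std::map<std::string, std::ofstream*> &files, std::ofstream &os)
{
    for (std::map<std::string, std::ofstream*>::iterator it = files.begin(); it != files.end(); ++it)
    {
        it->second->close();      // flush and close the bucket stream from pass 1
        delete it->second;

        std::ifstream bucket((it->first + ".txt").c_str());
        std::set<std::string> unique;
        std::string line;
        while (std::getline(bucket, line))
            unique.insert(line);

        for (std::set<std::string>::iterator s = unique.begin(); s != unique.end(); ++s)
            os << *s << '\n';
    }
}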

Sara
ambience commented:
CRC32 isn't exact, because CRC values can collide (at least theoretically). It also only lowers the amount of RAM needed; the usage still grows linearly with the input.

With multiple files, in the worst case - when every line begins with the same letters or only a few combinations (like log files where every line starts with ERR, INFO, etc.) - you end up in the same situation: it lowers the RAM usage but isn't linearly scalable.

My vote is also for using a lightweight embedded DB engine like SQLite: http://www.sqlite.org/cintro.html
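A minimal sketch using the SQLite C API, where a PRIMARY KEY column does the duplicate detection on disk (the database and file names are just examples):

#include <fstream>
#include <string>
#include "sqlite3.h"   // link against the SQLite library or compile the amalgamation

int main()
{
    sqlite3 *db = NULL;
    if (sqlite3_open("dedup.db", &db) != SQLITE_OK)
        return 1;

    // the PRIMARY KEY constraint plus INSERT OR IGNORE makes SQLite drop duplicates
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS lines (txt TEXT PRIMARY KEY);", NULL, NULL, NULL);
    sqlite3_exec(db, "BEGIN;", NULL, NULL, NULL);   // one transaction: far faster than per-row commits

    sqlite3_stmt *ins = NULL;
    sqlite3_prepare_v2(db, "INSERT OR IGNORE INTO lines VALUES (?);", -1, &ins, NULL);

    std::ifstream in("dups.txt");
    std::string line;
    while (std::getline(in, line))
    {
        sqlite3_bind_text(ins, 1, line.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_step(ins);          // duplicate rows are silently ignored
        sqlite3_reset(ins);
    }
    sqlite3_finalize(ins);
    sqlite3_exec(db, "COMMIT;", NULL, NULL, NULL);

    // write the distinct lines back out (sorted, because txt is the primary key index)
    sqlite3_stmt *sel = NULL;
    sqlite3_prepare_v2(db, "SELECT txt FROM lines ORDER BY txt;", -1, &sel, NULL);
    std::ofstream out("no_dups.txt");
    while (sqlite3_step(sel) == SQLITE_ROW)
        out << (const char *)sqlite3_column_text(sel, 0) << '\n';
    sqlite3_finalize(sel);

    sqlite3_close(db);
    return 0;
}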
sarabande commented:
I don't think it is so difficult to find a good "hash" string for your data. The suggested criteria would give a reasonably good distribution for real or randomly generated names of different lengths. If the keys mostly have the same length, or all share the same prefix, you could compute the sum of all character codes modulo 256 (or higher) to get a file name; that also limits the maximum number of files to a sensible value. You could also abort the first step and discard the results if you got too many or too few files, and try another algorithm.
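A small helper along those lines (the bucket count and file name pattern are just examples):

#include <sstream>
#include <string>

// map a line to a bucket file name by summing its character codes modulo a
// fixed bucket count; duplicates always get the same sum, hence the same file
std::string bucket_name(const std::string &line, unsigned buckets = 256)
{
    unsigned sum = 0;
    for (size_t i = 0; i < line.size(); ++i)
        sum += (unsigned char)line[i];

    std::ostringstream oss;
    oss << "bucket_" << (sum % buckets) << ".txt";
    return oss.str();
}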

A database solution is also a good choice, though probably slower. I would not recommend it if you are not experienced in accessing a database programmatically.

Another solution is to use the file system for your purpose: in an empty temporary folder, open and close a file named exactly like the word you extracted from the line. In the end, the list of file names in that folder is the required result. An advantage (or disqualifier) of that method on Windows is that file names are not case sensitive.

I would guess it is the slowest of all these methods, but depending on your environment I could be wrong (for example, if you were using an SSD).

Sara
phoffric commented:
I left out a word - likely:
>> You still probably want to check the current line with the previous line; however, you will find that when the CRC matches, then the entire line will likely also match.
That's why I said that you still need to check the current line against the previous line. CRC32 hashes your line into one of about 4 billion integer values, so two identical CRC32 values will most likely be due to a duplicate. But, as with all hashes, you still have to check for collisions.

If scalability is actually an issue for you (or is this just a C++ learning project), then adding more memory is one solution up to a point. But adding larger disk drives and a database is the natural choice.
wdosanjos commented:
Here is a dedup-and-sort C# solution. On my local system, for a 1 GB file the process takes between 1 minute (when all records are duplicates) and 4 minutes (when all records are unique) to complete.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Reflection;
using System.IO;
using System.Threading.Tasks;

namespace ConsoleApplication7
{
    public class Program
    {
        const string INPUT_FILE = @"c:\temp\dups.txt";
        const string OUTPUT_FILE = @"c:\temp\no_dups.txt";
        const long BUCKET_SIZE = 1024 * 1024 * 256; // 256 MB

        static void Main(string[] args)
        {
            Console.WriteLine("{0}: Start", DateTime.Now);

            var fileSize = new FileInfo(INPUT_FILE).Length;
            var bucketCount = 1 + (int)(fileSize / BUCKET_SIZE);

            if (bucketCount == 1)
            {
                var data = new SortedSet<string>();

                using (var reader = new StreamReader(INPUT_FILE))
                {
                    for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
                    {
                        data.Add(line);
                    }
                }

                using (var writer = new StreamWriter(OUTPUT_FILE))
                {
                    foreach (var line in data)
                    {
                        writer.WriteLine(line);
                    }
                }
            }
            else
            {
                var buckets = new string[bucketCount];

                for (int i = 0; i < bucketCount; i++)
                {
                    buckets[i] = Path.GetTempFileName();
                }

                Console.WriteLine("{0}: Spliting {1} in {2} buckets", DateTime.Now, INPUT_FILE, bucketCount);

                using (var reader = new StreamReader(INPUT_FILE))
                {
                    string line = reader.ReadLine();
                    string prevLine = null;

                    for (int i = 0; i < bucketCount; i++)
                    {
                        Console.WriteLine("Bucket {0}", i);

                        using (var writer = new StreamWriter(buckets[i]))
                        {
                            while (line != null && writer.BaseStream.Length < BUCKET_SIZE)
                            {
                                if (line != prevLine)
                                {
                                    writer.WriteLine(line);

                                    prevLine = line;
                                }

                                line = reader.ReadLine();
                            }
                        }
                    }
                }

                Console.WriteLine("{0}: Dedup & Sort", DateTime.Now);

                var data = new SortedSet<string>();

                for (int i = 0; i < bucketCount; i++)
                {
                    data.Clear();

                    Console.WriteLine("Bucket {0}", i);

                    using (var reader = new StreamReader(buckets[i]))
                    {
                        for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
                        {
                            data.Add(line);
                        }
                    }

                    using (var writer = new StreamWriter(buckets[i]))
                    {
                        foreach (var line in data)
                        {
                            writer.WriteLine(line);
                        }
                    }
                }

                data = null;

                Console.WriteLine("{0}: Writing to {1}", DateTime.Now, OUTPUT_FILE);

                using (var writer = new StreamWriter(OUTPUT_FILE))
                {
                    var readers = new StreamReader[bucketCount];
                    var lines = new string[bucketCount];

                    try
                    {
                        for (int i = 0; i < bucketCount; i++)
                        {
                            readers[i] = new StreamReader(buckets[i]);
                            lines[i] = readers[i].ReadLine();
                        }

                        for (string line = lines.Min(); line != null; line = lines.Min())
                        {
                            writer.WriteLine(line);

                            for (int i = 0; i < bucketCount; i++)
                            {
                                if (lines[i] == line)
                                {
                                    lines[i] = readers[i].ReadLine();
                                }
                            }
                        }
                    }
                    finally
                    {
                        for (int i = 0; i < bucketCount; i++)
                        {
                            readers[i].Close();
                        }
                    }
                }

                Console.WriteLine("{0}: Clean up", DateTime.Now);

                for (int i = 0; i < bucketCount; i++)
                {
                    File.Delete(buckets[i]);
                }
            }

            Console.WriteLine("{0}: End", DateTime.Now);
        }
    }
}

