Solved

distinct rows in a file

Posted on 2002-05-26
10
149 Views
Last Modified: 2012-05-04
If I have a file that looks like this:

123
123
56723
123
45632
56723
123
56723
4235423
234423
123

etc.
What is the fastest way to get only the distinct (unique) rows out of it? So that in this case the result (in another file) would be:
123
56723
45632
4235423
234423

(The values in the input file are not sorted, and there's no need for the output file to be sorted either.)
I really need the fastest way to do this. The files I need to process like this are about 1 GB in size, so I really need the most optimized solution to handle this.

thanx!
0
Question by:jakac
10 Comments
 
LVL 19

Expert Comment

by:Kim Ryan
ID: 7036559
What platform are you running on? If you are using Unix, the fastest approach could be to use the supplied sort utility as this will be optimised for the platform. You may need to check your ulimit if you are processing giant files.
sort -u infile > outfile
0
 
LVL 1

Author Comment

by:jakac
ID: 7036564
I am using Debian Linux, but maybe it would be good for this to be a multi-platform solution.
0
 
LVL 19

Expert Comment

by:Kim Ryan
ID: 7036572
sort should be available on all flavours of Unix, and I guess there should be a Windows port as well. A Perl solution would be simple, but as speed is so important, sort should be much faster.
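(For reference, a minimal in-memory Perl sketch of that idea is the classic one-liner below; like any pure hash approach it keeps every distinct value in memory, which matters for a 1 GB file.)

perl -ne 'print unless $seen{$_}++' infile > outfile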
0
 
LVL 6

Expert Comment

by:sstouk
ID: 7037797
I guess this could be the fastest way to do it:
########################################
# Assign the file name where your data is:
$File = "srs.txt";

# Open it
open(FILE, "<$File") || die "Could not open $File\n";

# Read all the lines and use each line as a key of the
# hash. A hash can hold a given key only once, so if the
# same key is written again, it just overwrites the first one.
while (<FILE>)
{
    $Unique{$_} = 1;
}

# Close the file
close FILE;

# Now check your unique lines and do with them whatever
# you want:
foreach $key (sort (keys %Unique))
{
    print "$key";
}
##########################################
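(To get the result into another file, as the question asks, the script's output can simply be redirected, e.g. perl dedup.pl > outfile, where dedup.pl is whatever name the script above is saved under.)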
0
 
LVL 6

Expert Comment

by:sstouk
ID: 7037800
The above should be a multiplatform solution.
It does not use any platform-specific functions.
I tested it on Windows NT.
0
 
LVL 19

Expert Comment

by:Kim Ryan
ID: 7037932
But will this work with files up to 1 GB? You would need to monitor memory usage closely. The sort utility works by creating intermediate sort files.
0
 
LVL 1

Author Comment

by:jakac
ID: 7038562
sstouk: I tried a solution like yours myself, but I just get an "out of memory" error message after about 50% of my file is processed... So the "sort" solution by teraplane is still the best for handling the big files...
0
 
LVL 6

Expert Comment

by:holli
ID: 7041799
You could keep the above solution, but tie the hash to a disk-based DBM, like SDBM:

use SDBM_File;
use Fcntl;

tie(%Unique, 'SDBM_File', 'filename', O_RDWR|O_CREAT, 0666)
   or die "Couldn't tie SDBM file 'filename': $!; aborting";

Then it will work on a file, not in memory.
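For illustration, a minimal end-to-end sketch of that tied-hash approach (the names infile.txt, outfile.txt and the DBM file "seen" are just placeholders, not from this thread):

#!/usr/bin/perl
use strict;
use warnings;
use SDBM_File;
use Fcntl;

# Disk-backed hash, so memory use stays flat even on a 1 GB input.
my %seen;
tie(%seen, 'SDBM_File', 'seen', O_RDWR|O_CREAT, 0666)
    or die "Couldn't tie SDBM file 'seen': $!; aborting";

open(my $in,  '<', 'infile.txt')  or die "Could not open infile.txt: $!";
open(my $out, '>', 'outfile.txt') or die "Could not open outfile.txt: $!";

while (my $line = <$in>) {
    chomp $line;
    next if exists $seen{$line};   # this value was already written out
    $seen{$line} = 1;
    print $out "$line\n";
}

close $in;
close $out;
untie %seen;

Note that SDBM has a small per-pair size limit (roughly 1 KB of key plus value), which is fine for short numeric rows like these, and the per-key disk access will still be slower than sort -u.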
0
 
LVL 19

Expert Comment

by:Kim Ryan
ID: 7049301
Has this helped?
sort -u infile > outfile
0
 
LVL 19

Accepted Solution

by:
Kim Ryan earned 100 total points
ID: 7058423
sort -u infile > outfile
0
