distinct rows in a file

jakac asked:
If I have a file that looks like this:

123
123
56723
123
45632
56723
123
56723
4235423
234423
123

etc.
What is the fastest way to get only the distinct (unique) rows out of it, so that in this case the result (in another file) would be:
123
56723
45632
4235423
234423

(The values in the input file are not sorted, and the output file does not need to be sorted either.)
I really need the fastest way to do this. The files I need to process are about 1 GB each, so I need the most optimized solution to handle this.

Thanks!
 
Kim Ryan (IT Consultant) commented:
What platform are you running on? If you are using Unix, the fastest approach could be to use the supplied sort utility as this will be optimised for the platform. You may need to check your ulimit if you are processing giant files.
sort -u infile > outfile
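For example, on most shells you can check the current limits (including the maximum size of files a process may write) with:

ulimit -a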
 
jakac (Author) commented:
I am using Debian Linux, but it might be good for this to be a multi-platform solution.
 
Kim Ryan (IT Consultant) commented:
sort should be available on all flavours of Unix, and I would guess there is a Windows port as well. A Perl solution would be simple, but as speed is so important, 'sort' should be much faster.
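As an aside, a minimal Perl version of that idea can be written as a one-liner (infile and outfile are placeholder names); like any in-memory hash approach, it has to hold every distinct line in RAM:

perl -ne 'print unless $seen{$_}++' infile > outfile

It prints each line only the first time it is seen, so the output keeps first-occurrence order rather than sorted order.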
 
sstouk commented:
I guess this could be the fastest way to do it:
########################################
# Assign the file name where your data is:
$File = "srs.txt";

# Open it
open (FILE, "<$File") || die "Could not open $File\n";

# Read all the lines and store each line as a key of the
# hash. A hash can hold each key only once, so storing the
# same line again simply overwrites the existing entry.
while (<FILE>)
{
    $Unique{$_} = 1;
}

# Close the file
close FILE;

# Now check your unique lines and do with them whatever
# you want:
foreach $key (sort (keys %Unique))
{
    print "$key";
}
##########################################
 
sstouk commented:
The above should be a multi-platform solution. It does not use any platform-specific functions; I tested it on Windows NT.
 
Kim Ryan (IT Consultant) commented:
But will this work with files up to 1 GB? You would need to monitor memory usage closely. The sort utility works by creating intermediate sort files.
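If the default temporary directory turns out to be too small for those intermediate files, GNU sort (the version shipped with Debian) can, for example, be pointed at a roomier directory and given a larger in-memory buffer; /var/tmp and 512M below are just example values:

sort -u -T /var/tmp -S 512M infile > outfile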
 
jakac (Author) commented:
sstouk: I tried a solution like yours myself, but I just get an "out of memory" error after about 50% of my file is processed... So the "sort" solution by teraplane is still the best for handling the big files...
 
holli commented:
You could keep the above solution, but tie the hash to an on-disk store, like SDBM:

tie(%h, 'SDBM_File', 'filename', O_RDWR|O_CREAT, 0666)
   or die "Couldn't tie SDBM file 'filename': $!; aborting";

Then the hash is backed by a file rather than by memory.
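For completeness, a minimal self-contained sketch of that approach (the file names srs.txt, distinct.txt and seen.dbm are assumptions; adjust them as needed):

use strict;
use warnings;
use Fcntl;         # for the O_RDWR and O_CREAT flags
use SDBM_File;

# The "seen" hash is backed by on-disk SDBM files (seen.dbm.*), not RAM.
tie(my %seen, 'SDBM_File', 'seen.dbm', O_RDWR|O_CREAT, 0666)
    or die "Couldn't tie SDBM file 'seen.dbm': $!; aborting";

open(my $in,  '<', 'srs.txt')      or die "Could not open srs.txt: $!";
open(my $out, '>', 'distinct.txt') or die "Could not open distinct.txt: $!";

while (my $line = <$in>) {
    chomp $line;
    next if exists $seen{$line};   # skip lines already written once
    $seen{$line} = 1;
    print $out "$line\n";
}

close $in;
close $out;
untie %seen;

SDBM keeps each key/value pair small (roughly 1 kB), which is fine for short numeric lines like these; the trade-off is that every lookup now touches disk, so the external sort is still likely to be faster.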
 
Kim Ryan (IT Consultant) commented:
Has this helped?
sort -u infile > outfile