  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 153

distinct rows in a file

If I have a file that looks like this:

123
123
56723
123
45632
56723
123
56723
4235423
234423
123

etc.
What is the fastest way to get only the distinct (unique) rows out of it, so that in this case the result (in another file) would be:
123
56723
45632
4235423
234423

(The values in the input file are not sorted, and there is no need for the output file to be sorted either.)
I really need the fastest way to do this. The files I need to process are about 1 GB each, so I need the most optimized solution for handling them.

Thanks!
Asked by: jakac
1 Solution
 
Kim Ryan (IT Consultant) commented:
What platform are you running on? If you are using Unix, the fastest approach could be to use the supplied sort utility as this will be optimised for the platform. You may need to check your ulimit if you are processing giant files.
sort -u infile > outfile
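If the 1 GB input leaves sort short of temporary space, GNU sort (and most BSD sorts) also accept a -T option to point it at a scratch directory with more room; for example (the /var/tmp path is just a placeholder for any directory with enough free space):
sort -u -T /var/tmp infile > outfile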
 
jakac (Author) commented:
I am using Debian Linux, but it might be good for this to be a multi-platform solution.
 
Kim Ryan (IT Consultant) commented:
sort should be available on all flavours of Unix, and there is probably a Windows port as well. A Perl solution would be simple, but as speed is so important, sort should be much faster.
 
sstouk commented:
I guess this could be the fastest way to do it:
########################################
# Assign the file name where your data is:
$File = "srs.txt";

# Open it
open (FILE, "<$File") || die "Could not open $File\n";

# Read all the lines and assign each line as a key of the
# hash. A hash can only hold each key once, so if the same
# key is written again, it simply overwrites the existing entry.
while (<FILE>)
{
    $Unique{$_} = 1;
}

# Close the file
close FILE;

# Now check your unique lines and do with them whatever
# you want:
foreach $key (sort (keys %Unique))
{
    print "$key";
}
##########################################
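To get the result into another file, as the question asks, you could simply redirect the script's output when you run it (dedup.pl is just a placeholder name for wherever you save the script above):
perl dedup.pl > outfile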
 
sstouk commented:
The above should be a multiplatform solution.
It does not use any platform-specific functions.
I tested it on Windows NT.
 
Kim Ryan (IT Consultant) commented:
But will this work with files up to 1 GB? You would need to monitor memory usage closely. The sort utility works by creating intermediate sort files.
 
jakac (Author) commented:
sstouk: I tried a solution like yours myself, but I just get an "out of memory" error after about 50% of my file is processed... So the "sort" solution by teraplane is still the best for handling the big files...
 
holli commented:
You could keep the above solution, but tie the hash to a disk-backed store, like SDBM:

use Fcntl;        # O_RDWR, O_CREAT
use SDBM_File;
tie(%h, 'SDBM_File', 'filename', O_RDWR|O_CREAT, 0666)
   or die "Couldn't tie SDBM file 'filename': $!; aborting";

Then it will work against a file on disk instead of in memory.
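
A minimal sketch of that idea, assuming the input is srs.txt and the unique rows go to outfile.txt (the file names, including the "seen" DBM files, are just placeholders); the hash lives in an on-disk SDBM database rather than in RAM, and the output keeps the input order:

#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;        # O_RDWR, O_CREAT
use SDBM_File;

# Tie the hash to an on-disk SDBM database (creates seen.dir/seen.pag),
# so the keys are stored on disk instead of in memory.
tie my %seen, 'SDBM_File', 'seen', O_RDWR | O_CREAT, 0666
    or die "Couldn't tie SDBM file 'seen': $!; aborting";

open my $in,  '<', 'srs.txt'     or die "Could not open srs.txt: $!";
open my $out, '>', 'outfile.txt' or die "Could not open outfile.txt: $!";

while (my $line = <$in>) {
    chomp $line;
    next if exists $seen{$line};    # skip values we have already written
    $seen{$line} = 1;
    print {$out} "$line\n";
}

close $in;
close $out;
untie %seen;

This trades memory for disk: every lookup now goes through the DBM file, so on a 1 GB input it will likely still be slower than sort -u, but it should not run out of memory.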
 
Kim Ryan (IT Consultant) commented:
Has this helped?
sort -u infile > outfile
 
