
Solved

distinct rows in a file

Posted on 2002-05-26
Medium Priority
152 Views
Last Modified: 2012-05-04
If I have a file that looks like this:

123
123
56723
123
45632
56723
123
56723
4235423
234423
123

etc.
What is the fastest way to get only the distinct (unique) rows out of it? In this case the result (in another file) would be:
123
56723
45632
4235423
234423

(the values in the input file are not sorted, and there is no need for the output file to be sorted)
I really need the fastest way to do this. The files I need to process like this are about 1 GB each, so I really need the most optimized solution to handle them.

Thanks!
Question by: jakac
10 Comments
 
LVL 19

Expert Comment

by: Kim Ryan
ID: 7036559
What platform are you running on? If you are using Unix, the fastest approach could be to use the supplied sort utility, as it will be optimised for the platform. You may need to check your ulimit if you are processing giant files.
sort -u infile > outfile
 
LVL 1

Author Comment

by: jakac
ID: 7036564
I am using Debian Linux, but maybe it would be good for this to be a multi-platform solution.
 
LVL 19

Expert Comment

by: Kim Ryan
ID: 7036572
sort should be available on all flavours of Unix, and I would guess there is a Windows port as well. A Perl solution would be simple, but as speed is so important, 'sort' should be much faster.
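
For reference, the simple Perl approach mentioned above can be written as a one-liner (a sketch; infile and outfile are placeholder names). It remembers every line seen so far in an in-memory hash, which becomes important for large files, as discussed below:

perl -ne 'print unless $seen{$_}++' infile > outfile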
 
LVL 6

Expert Comment

by: sstouk
ID: 7037797
I guess this could be the fastest way to do it:
########################################
# Assign the file name where your data is:
$File = "srs.txt";

# Open it
open (FILE, "<$File") || die "Could not open $File\n";

# Read all the lines and assign each line as a key of the
# hash. A hash can only hold a given key once, so writing
# the same key again simply overwrites the previous entry.
while (<FILE>)
{
    $Unique{$_} = 1;
}
# Close the file
close FILE;

# Now check your unique lines and do with them whatever
# you want:
foreach $key (sort (keys %Unique))
{
    print "$key";
}
##########################################
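
If the unique rows must end up in another file, as the question asks, the final loop could print to a filehandle instead. This sketch uses the placeholder name "out.txt" and drops the sort, since the question says the output does not need to be sorted:

########################################
# Write the unique lines to a second file instead of STDOUT.
# "out.txt" is a placeholder name.
open (OUT, ">out.txt") || die "Could not open out.txt\n";
foreach $key (keys %Unique)
{
    print OUT $key;
}
close OUT;
##########################################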
 
LVL 6

Expert Comment

by: sstouk
ID: 7037800
The above should be a multiplatform solution.
It does not use any platform-specific functions.
I tested it on Windows NT.
 
LVL 19

Expert Comment

by: Kim Ryan
ID: 7037932
But will this work with files up to 1 GB? You would need to monitor memory usage closely. The sort utility works by creating intermediate sort files on disk, so it is not limited by available memory.
 
LVL 1

Author Comment

by: jakac
ID: 7038562
sstouk: I tried a solution like yours myself, but I just get an "out of memory" error message after about 50% of my file is processed... So the "sort" solution by teraplane is still the best for handling the big files...
 
LVL 6

Expert Comment

by: holli
ID: 7041799
You could keep the above solution, but tie the hash to a serialization engine like SDBM:

use Fcntl;       # needed for O_RDWR and O_CREAT
use SDBM_File;

tie(%h, 'SDBM_File', 'filename', O_RDWR|O_CREAT, 0666)
   or die "Couldn't tie SDBM file 'filename': $!; aborting";

Then it will work on a file, not in memory.
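
Putting the two ideas together, a minimal end-to-end sketch might look like this. The file names "srs.txt" and "unique.txt" and the database name "seen" are placeholders; SDBM_File and Fcntl ship with Perl:

########################################
use Fcntl;       # supplies O_RDWR and O_CREAT
use SDBM_File;

# Disk-backed hash: SDBM keeps the keys in "seen.pag"/"seen.dir"
# files on disk instead of in RAM.
tie(%seen, 'SDBM_File', 'seen', O_RDWR|O_CREAT, 0666)
    or die "Couldn't tie SDBM file 'seen': $!";

open (IN,  "<srs.txt")    || die "Could not open srs.txt\n";
open (OUT, ">unique.txt") || die "Could not open unique.txt\n";

while ($line = <IN>)
{
    # Print each value the first time it is seen; later
    # duplicates are skipped because the key already exists.
    unless (exists $seen{$line})
    {
        $seen{$line} = 1;
        print OUT $line;
    }
}

close IN;
close OUT;
untie %seen;
##########################################

The trade-off: every lookup now touches the disk, so this will likely be slower than sort -u, but it should not run out of memory. (SDBM also limits the size of each key/value pair to roughly 1 KB, which is no problem for short numeric rows like these.)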
 
LVL 19

Expert Comment

by: Kim Ryan
ID: 7049301
Has this helped?
sort -u infile > outfile
 
LVL 19

Accepted Solution

by: Kim Ryan (earned 300 total points)
ID: 7058423
sort -u infile > outfile

