Solved

distinct rows in a file

Posted on 2002-05-26
10
146 Views
Last Modified: 2012-05-04
If I have a file that looks like this:

123
123
56723
123
45632
56723
123
56723
4235423
234423
123

etc.
What is the fastest way to get only distinct (unique) rows out of it? So that in this case result (in another file) would be:
123
56723
45632
4235423
234423

(values of input file are not sorted - also there's no need for output file to be sorted)
I really need the fastest way to do this. The files I need to process like this are about 1GB long so I really need the most optimized solution to handle this.

thanx!
0
Comment
Question by:jakac
  • 5
  • 2
  • 2
  • +1
10 Comments
 
LVL 19

Expert Comment

by:Kim Ryan
ID: 7036559
What platform are you running on? If you are using Unix, the fastest approach could be to use the supplied sort utility as this will be optimised for the platform. You may need to check your ulimit if you are processing giant files.
sort -u infile > outfile
0
 
LVL 1

Author Comment

by:jakac
ID: 7036564
I am using Debian Linux but maybe it would be good for this to be multi-platform solution
0
 
LVL 19

Expert Comment

by:Kim Ryan
ID: 7036572
sort should be avaialable on all flavours of Unix and there should be a windows port as well I guess. A perl solution would be simple, but as speed is so important, 'sort' should be much faster.
0
 
LVL 6

Expert Comment

by:sstouk
ID: 7037797
I guess this could be the fastest way to do it:
########################################
# Assign a fine name where your data is:
$File = "srs.txt";

# Open it
open (FILE ,"<$File") || die "Could not open $File\n";
     
# Read all the lines and assign each line as a key of the
# hash. Hash can only have one unique key, so if the
# second key is written, it ovewrites the first one.          
while (<FILE>)
{
$Unique{$_} = 1;    
};
# Close the file
close FILE;

# Now Check Your Unique lines and do with them whatever
# you want:
foreach $key (sort (keys %Unique))
{
print "$key";    
};
##########################################
0
 
LVL 6

Expert Comment

by:sstouk
ID: 7037800
The above should be a multiplatform solution.
It does not use any platform-specific functions.
I tested it on Windows NT.
0
How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

 
LVL 19

Expert Comment

by:Kim Ryan
ID: 7037932
but will this work woth files up to 1 GB? You would need to monitor memory usage closely. sort utility works by creating intermediate sort files
0
 
LVL 1

Author Comment

by:jakac
ID: 7038562
sstouk: I tried a solution like yours myself but I just get "out of memory" error message after about 50% of my file is processed... So "sort" solution by teraplane is still the best for handling the big files...
0
 
LVL 6

Expert Comment

by:holli
ID: 7041799
you could keep the above solution, but tie the hash to a serialization-engine, like SDBM:

tie(%h, 'SDBM_File', 'filename', O_RDWR|O_CREAT, 0666)
   or die "Couldn't tie SDBM file 'filename': $!; aborting";

then it will work on a file, not on memory.
0
 
LVL 19

Expert Comment

by:Kim Ryan
ID: 7049301
Has this helped?
sort -u infile > outfile
0
 
LVL 19

Accepted Solution

by:
Kim Ryan earned 100 total points
ID: 7058423
sort -u infile > outfile
0

Featured Post

Free Trending Threat Insights Every Day

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

Join & Write a Comment

I've just discovered very important differences between Windows an Unix formats in Perl,at least 5.xx.. MOST IMPORTANT: Use Unix file format while saving Your script. otherwise it will have ^M s or smth likely weird in the EOL, Then DO NOT use m…
There are many situations when we need to display the data in sorted order. For example: Student details by name or by rank or by total marks etc. If you are working on data driven based projects then you will use sorting techniques very frequently.…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
This video explains how to create simple products associated to Magento configurable product and offers fast way of their generation with Store Manager for Magento tool.

708 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

15 Experts available now in Live!

Get 1:1 Help Now