Solved

Remove Duplicate Lines When Parsing CSV

Posted on 2007-12-04
Last Modified: 2008-02-01
I am using Java to read lines from a text file, in this case a CSV. I parse the file and process each line according to my application's needs. I noticed that the file sometimes contains duplicate lines. Instead of processing a duplicate line multiple times, I would like to remove the duplicates from the CSV and then process each remaining line. Can someone suggest one of the more efficient ways of doing this? Thanks.
Question by:pcarrollnf
8 Comments
 
LVL 86

Accepted Solution

by:
CEHJ earned 125 total points
ID: 20405022
If the file is not too big, you can read its lines into a Set<String>. That will ensure uniqueness.
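A minimal sketch of the Set idea, assuming the whole file fits in memory and that exact string equality is what counts as a duplicate. The class name, method name, and inline sample data are made up for illustration; a LinkedHashSet is used here (rather than a plain HashSet) only so that the first occurrence of each line keeps its original position.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

class UniqueLines {
    // Reads every line into a LinkedHashSet. Adding a line that is already
    // present is a no-op, so repeats are dropped and first-seen order is kept.
    static Set<String> readUnique(BufferedReader in) {
        Set<String> unique = new LinkedHashSet<>();
        in.lines().forEach(unique::add);
        return unique;
    }

    public static void main(String[] args) {
        // Inline sample data stands in for the real CSV file (hypothetical content).
        String csv = "id,name\n1,alice\n2,bob\n1,alice\n";
        for (String line : readUnique(new BufferedReader(new StringReader(csv)))) {
            System.out.println(line); // replace with the real per-line processing
        }
    }
}
```

For a real file, the StringReader could be swapped for a FileReader on the CSV path. Memory use grows with the number of distinct lines, since each one is held in the set.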
 

Author Comment

by:pcarrollnf
ID: 20405043
That would be an issue. These CSV files can become large and may contain thousands of lines.
 
LVL 86

Expert Comment

by:CEHJ
ID: 20405108
What OS are you using?
 

Author Comment

by:pcarrollnf
ID: 20405193
Windows 2000, 2003
 
LVL 86

Expert Comment

by:CEHJ
ID: 20405968
Get a Windows port of textutils from http://gnuwin32.sourceforge.net. You can then do:

cat orig.csv | sort | uniq >uniq.csv

I doubt you'll get much more efficient than that.
 
LVL 9

Expert Comment

by:brunoguimaraes
ID: 20406102
There is a program called Clippy that does what you want.

http://www.snapfiles.com/get/clippy.html
 
LVL 92

Expert Comment

by:objects
ID: 20406402
You don't need to store the entire CSV in memory, just the unique keys.
That will allow you to check each line as you process it.
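A rough sketch of that approach, under a few assumptions: the key is taken to be the first comma-separated field (the thread does not say what the key is), the naive split ignores quoted commas, and the names StreamingDedup, processUnique, and keyOf are invented for this example. Only the keys are kept in a HashSet; each line is handed to the handler as it is read and then discarded.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Function;

class StreamingDedup {
    // Streams the input line by line. A line is passed to handler only the
    // first time its key is seen; the number of skipped lines is returned.
    static int processUnique(BufferedReader in,
                             Function<String, String> keyOf,
                             Consumer<String> handler) {
        Set<String> seenKeys = new HashSet<>();
        int skipped = 0;
        Iterator<String> lines = in.lines().iterator();
        while (lines.hasNext()) {
            String line = lines.next();
            // Set.add returns false when the key was already present.
            if (seenKeys.add(keyOf.apply(line))) {
                handler.accept(line);
            } else {
                skipped++;
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        // Inline sample data stands in for the real CSV file (hypothetical content).
        String csv = "1,alice\n2,bob\n1,alice\n";
        int skipped = processUnique(
                new BufferedReader(new StringReader(csv)),
                line -> line.split(",", 2)[0],        // assumed key: first field
                line -> System.out.println("processing " + line));
        System.out.println("skipped " + skipped + " duplicate line(s)");
    }
}
```

If whole lines must match to count as duplicates, passing `line -> line` as keyOf gives that behavior, at the cost of holding every distinct line in the set.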
 
LVL 86

Expert Comment

by:CEHJ
ID: 20781397
:-)