Remove Duplicate Lines When Parsing CSV

I am using Java to read lines from a text file; in this case, it is a CSV. I parse the file and process each line based on my application. I noticed that sometimes the file contains duplicate lines. Instead of processing a duplicate line multiple times, I would like to remove the duplicate lines from the CSV and then process each remaining line in the file. Can someone suggest one of the more efficient ways of doing this? Thanks
pcarrollnf Asked:
 
CEHJ Commented:
If it's not too big, you can read the file into a Set<String>. That will ensure uniqueness.
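A minimal sketch of that idea (class name and file handling are illustrative): reading every line into a LinkedHashSet drops duplicates while preserving the original line order, so the file can then be processed line by line with no repeats.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

public class DedupeCsv {
    // Load all lines into a LinkedHashSet: exact duplicate lines are
    // dropped, and first-seen order is preserved.
    public static Set<String> uniqueLines(Path csv) throws IOException {
        return new LinkedHashSet<>(Files.readAllLines(csv));
    }

    public static void main(String[] args) throws IOException {
        for (String line : uniqueLines(Paths.get(args[0]))) {
            // Replace with the application's own per-line processing.
            System.out.println(line);
        }
    }
}
```

Note that this holds the whole file in memory at once, which is the limitation raised below.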
 
pcarrollnf (Author) Commented:
That would be an issue. These CSV files can become large and may contain thousands of lines.
 
CEHJCommented:
What OS are you using?
 
pcarrollnf (Author) Commented:
Windows 2000, 2003
 
CEHJCommented:
Get a Windows port of textutils from http://gnuwin32.sourceforge.net. You can then do

cat orig.csv | sort | uniq >uniq.csv

I doubt you'll get much more efficient than that.
 
brunoguimaraesCommented:
There is a software called Clippy that does what you want.

http://www.snapfiles.com/get/clippy.html
 
objectsCommented:
You don't need to store the entire CSV in memory, just the unique keys.
That will allow you to check each line as you process it.
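A sketch of that streaming approach, assuming (purely for illustration) that the first CSV column is the record's key: Set.add returns false when the key has already been seen, so duplicates are skipped without holding the whole file in memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StreamDedupe {
    // Stream the input line by line, remembering only each record's key
    // (assumed here to be the first comma-separated column). Set.add
    // returns false for a key already seen, so duplicate records are
    // skipped while only the keys, not the lines, are kept in memory.
    public static List<String> processUnique(BufferedReader in) throws IOException {
        Set<String> seenKeys = new HashSet<>();
        List<String> processed = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            String key = line.split(",", 2)[0];
            if (seenKeys.add(key)) {
                processed.add(line); // the real app would process(line) here
            }
        }
        return processed;
    }
}
```

Memory then grows with the number of distinct keys rather than the file size.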
 
CEHJCommented:
:-)
Question has a verified solution.