• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1002

Remove Duplicate Lines When Parsing CSV

I am using Java to read lines from a text file, in this case a CSV. I parse the file and process each line based on my application. I noticed that sometimes the file contains duplicate lines. Instead of processing a duplicate line multiple times, I would like to remove the duplicate lines from the CSV and then process each remaining line. Can someone show me one of the more efficient ways of doing this? Thanks
Asked by: pcarrollnf

1 Solution
 
CEHJ commented:
If it's not too big, you can read the file into a Set<String>. That will ensure uniqueness.
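A minimal sketch of that idea: a LinkedHashSet drops duplicate lines while keeping the first occurrence of each in original order. The class and method names are placeholders, as is the "data.csv" path in main.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

public class DedupeCsv {

    // Copies the lines into a LinkedHashSet: repeated lines are
    // silently dropped, and iteration order matches the file order.
    // Note this holds every distinct line in memory at once.
    static Set<String> uniqueLines(Iterable<String> lines) {
        Set<String> unique = new LinkedHashSet<>();
        for (String line : lines) {
            unique.add(line);
        }
        return unique;
    }

    public static void main(String[] args) throws IOException {
        // "data.csv" is a placeholder path.
        for (String line : uniqueLines(Files.readAllLines(Paths.get("data.csv")))) {
            System.out.println(line); // process each distinct line here
        }
    }
}
```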
 
pcarrollnf (author) commented:
That would be an issue. These CSV files can get large and may contain thousands of lines.
 
CEHJ commented:
What OS are you using?
 
pcarrollnf (author) commented:
Windows 2000, 2003
 
CEHJ commented:
Get a Windows port of textutils from http://gnuwin32.sourceforge.net. You can then run:

cat orig.csv | sort | uniq >uniq.csv

I doubt you'll get much more efficient than that.
 
brunoguimaraes commented:
There is a piece of software called Clippy that does what you want:

http://www.snapfiles.com/get/clippy.html
 
objects commented:
You don't need to store the entire CSV in memory, just the unique keys. That will allow you to check each line as you process it.
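A sketch of that streaming approach, under the assumption that the whole line serves as the key (a specific key column could be substituted): Set.add returns false when the element is already present, so duplicates are detected as lines are read, and only the keys seen so far are kept in memory. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class StreamingDedupe {

    // Reads line by line; only the set of keys seen so far is held in
    // memory, not the whole file. Returns how many distinct lines
    // were processed.
    static int processUnique(BufferedReader reader) throws IOException {
        Set<String> seen = new HashSet<>();
        int processed = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            if (seen.add(line)) {   // add() is false for a repeat
                processed++;        // ...process the line here...
            }                       // duplicates are skipped
        }
        return processed;
    }
}
```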
 
CEHJ commented:
:-)
