Solved

Remove duplicate lines from a text file

Posted on 2008-06-24
Last Modified: 2013-11-23
Hi,

I am trying to remove duplicate lines from a text file. To make things difficult, the duplicate lines have different timestamps but share the same reference number. Some groups of duplicates run to 10 lines, whereas others have only 2.

1. Here are some examples of duplicate lines, in the format <timestamp>,<reference>,<error message>:

08:47:22,95847170050,Problem inputting data.
08:47:29,95847170050,Problem inputting data.
08:47:35,95847170050,Problem inputting data.
08:53:28, 96672540040, More problems inputting data.
08:53:35, 96672540040, More problems inputting data.
08:53:41, 96672540040, More problems inputting data.

I want to delete all but the most recent duplicate line.

I am new to Java, so can you tell me the best way of doing this?

Thank you in advance.
Question by:mistermuv
14 Comments
 
LVL 6

Expert Comment

by:RishadanPort
ID: 21860630
Step 1. Read all the data into an array of Strings.
Step 2. Remove the unwanted lines.
Step 3. Write the remaining lines back to the file.

For Steps 1 and 3 you can use a FileReader and a FileWriter (wrapped in a BufferedReader and a BufferedWriter); this is a relatively easy step.

For Step 2, you will need to:
1. Extract the reference number using the substring method.
2. Compare this reference number with the reference numbers of the other lines.
   2a. If the reference number is the same, remove one of the lines.
   2b. Otherwise, skip it.
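
Step 2.1 above could be sketched roughly as follows. This is only an illustration, assuming every line has the form <timestamp>,<reference>,<error message> with at least two commas, as in the question; the names ReferenceExtractor and referenceOf are hypothetical and not part of any library.

```java
// Hypothetical helper: pulls the reference number out of one log line.
class ReferenceExtractor {

    // Returns the text between the first and second commas, trimmed.
    // Assumption: the line contains at least two commas.
    static String referenceOf(String line) {
        int firstComma = line.indexOf(',');
        int secondComma = line.indexOf(',', firstComma + 1);
        return line.substring(firstComma + 1, secondComma).trim();
    }

    public static void main(String[] args) {
        System.out.println(referenceOf("08:47:22,95847170050,Problem inputting data."));
        System.out.println(referenceOf("08:53:28, 96672540040, More problems inputting data."));
    }
}
```

The trim() call matters for the second sample line, where the reference number is preceded by a space.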
 
LVL 6

Expert Comment

by:RishadanPort
ID: 21860717
Here is something very rough I did in a few minutes...
		// needs: import java.util.ArrayList; import java.util.List;
		String[] fileLines = new String[1000];
		List<String> updatedLines = new ArrayList<String>();
 
		//populate the fileLines here
 
		String currentString;
		String referenceNumber;
 
		for (int index = 0; index < fileLines.length; index++)
		{
			currentString = fileLines[index];
 
			if(currentString == null)
			{
				continue;
			}
 
			// the reference number sits between the first and second commas
			int firstComma = currentString.indexOf(",");
			int secondComma = currentString.indexOf(",", firstComma + 1);
			referenceNumber = currentString.substring(firstComma + 1, secondComma).trim();
 
			// blank out every later line that carries the same reference number
			for (int index2 = index + 1; index2 < fileLines.length; index2++)
			{
				if(fileLines[index2] != null && fileLines[index2].contains(referenceNumber))
				{
					fileLines[index2] = null;
				}
			}
		}
 
		// collect the lines that survived
		for(int index = 0; index < fileLines.length; index++)
		{
			if(fileLines[index] != null){
				updatedLines.add(fileLines[index]);
			}
		}

 
LVL 6

Assisted Solution

by:RishadanPort
RishadanPort earned 20 total points
ID: 21860749
Here is a slightly better version...
Basically, it copies every line with a not-yet-seen reference number to the updatedLines list.

After that step, you will need to write all of updatedLines back to the file, using a BufferedOutputStream or, more conveniently for text, a BufferedWriter.

		// needs: import java.util.ArrayList; import java.util.List;
		String[] fileLines = new String[1000];
		List<String> updatedLines = new ArrayList<String>();
 
		//populate the fileLines here
 
		String currentString;
		String referenceNumber;
 
		for (int index = 0; index < fileLines.length; index++)
		{
			currentString = fileLines[index];
 
			if(currentString == null)
			{
				continue;
			}
			else{
				updatedLines.add(fileLines[index]);
			}
 
			// the reference number sits between the first and second commas
			int firstComma = currentString.indexOf(",");
			int secondComma = currentString.indexOf(",", firstComma + 1);
			referenceNumber = currentString.substring(firstComma + 1, secondComma).trim();
 
			// blank out every later line that carries the same reference number
			for (int index2 = index + 1; index2 < fileLines.length; index2++)
			{
				if(fileLines[index2] != null && fileLines[index2].contains(referenceNumber))
				{
					fileLines[index2] = null;
				}
			}
		}

 

Author Comment

by:mistermuv
ID: 21860827
RishadanPort,

As I am new to Java, can you tell me where in the code it differentiates the duplicate lines by timestamp, so that only the most recent duplicate line remains and all the others are deleted?

Thank you
 
LVL 6

Expert Comment

by:RishadanPort
ID: 21860877
Ah, I didn't know you wanted to keep the line with the most recent timestamp as well... That's another issue altogether.

Now you will need to do this:

1. Find all the Strings that contain the same reference number.
2. For each of these Strings, parse out the timestamp as well and compare the timestamps as strings, keeping the line with the latest one.
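
The two steps above might be combined along these lines. This is a sketch under stated assumptions, not a definitive implementation: it assumes zero-padded HH:mm:ss timestamps (so string comparison orders them correctly), skips lines with fewer than three fields, and uses hypothetical names (LatestPerReference, keepLatest, timestampOf). It also uses generics, so it needs Java 1.5 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: for each reference number, keep the line with the latest timestamp.
class LatestPerReference {

    static List<String> keepLatest(List<String> lines) {
        // reference number -> best line seen so far; first-seen order of references is preserved
        Map<String, String> latest = new LinkedHashMap<String, String>();
        for (String line : lines) {
            String[] fields = line.split("\\s*,\\s*", 3);
            if (fields.length < 3) {
                continue; // assumption: malformed lines are simply skipped
            }
            String timestamp = fields[0].trim();
            String reference = fields[1];
            String current = latest.get(reference);
            // String comparison is only valid for zero-padded HH:mm:ss timestamps
            if (current == null || timestamp.compareTo(timestampOf(current)) >= 0) {
                latest.put(reference, line);
            }
        }
        return new ArrayList<String>(latest.values());
    }

    // Returns the text before the first comma, trimmed.
    static String timestampOf(String line) {
        return line.substring(0, line.indexOf(',')).trim();
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "08:47:22,95847170050,Problem inputting data.",
            "08:47:29,95847170050,Problem inputting data.",
            "08:47:35,95847170050,Problem inputting data.",
            "08:53:28, 96672540040, More problems inputting data.",
            "08:53:41, 96672540040, More problems inputting data.");
        for (String line : keepLatest(input)) {
            System.out.println(line);
        }
    }
}
```

Using a map keyed by reference number avoids the nested loop over all lines, since each line is compared only against the current best line for its own reference.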
 
LVL 6

Expert Comment

by:RishadanPort
ID: 21860954
For comparing the timestamps, you can use a simple string comparison:

Example:
String timestamp1 = "08:47:22";
String timestamp2 = "08:57:41";
int compareInt = timestamp1.compareTo(timestamp2);

if(compareInt < 0){
   //timestamp1 is earlier than timestamp2
}
else{
   //timestamp2 is earlier than (or equal to) timestamp1
}

Note that this only works if the timestamps always keep the same zero-padded HH:mm:ss format...

For example, if your data contains something like this:
timestamp1 = 0:00:10
and then something like this:
timestamp2 = 01:00:00
then this is not a good way to do it, and you will have to use a date/time class (for example java.text.SimpleDateFormat) to handle it for you.
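
To illustrate the caveat above, here is a minimal sketch that parses the timestamps with java.text.SimpleDateFormat before comparing them. The pattern "H:mm:ss" is an assumption about the data, and the class and method names (TimestampCompare, parse, isEarlier) are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical sketch: compares times of day that may or may not be zero-padded.
class TimestampCompare {

    // Parses an "H:mm:ss" style time; when parsing, a single H accepts "0" as well as "01".
    static Date parse(String timestamp) {
        try {
            return new SimpleDateFormat("H:mm:ss").parse(timestamp);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a valid time: " + timestamp);
        }
    }

    // Returns true if a is strictly earlier than b.
    static boolean isEarlier(String a, String b) {
        return parse(a).before(parse(b));
    }

    public static void main(String[] args) {
        String a = "0:00:10";
        String b = "01:00:00";
        System.out.println(a.compareTo(b) < 0); // false: ':' sorts after '1', so the string order is wrong
        System.out.println(isEarlier(a, b));    // true: the parsed times compare correctly
    }
}
```

SimpleDateFormat is not thread-safe, which is why the sketch creates a new instance per call rather than sharing one.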
 
LVL 86

Expert Comment

by:CEHJ
ID: 21864093
You can use a Set to remove duplicates, together with a Wrapper class for the file line:
import java.io.*;
import java.util.*;
 
public class Uniq {
	public static void main(String[] args) {
		if (args.length < 2) {
		    System.err.println("Usage: java Uniq <infile> <outfile>");
		    System.exit(1);
		}
		Uniq u = new Uniq();
		Set<Line> lines = u.read(args[0]);
		u.write(lines, args[1]);
	}
 
	public Set<Line> read(String fileName) {
		Set<Line> lines = new LinkedHashSet<Line>();
		Scanner s = null;
		try {
			s = new Scanner(new File(fileName));
			while (s.hasNextLine()) {
				lines.add(new Line(s.nextLine()));
			}
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			// s is still null if the file could not be opened
			if (s != null) {
				s.close();
			}
		}
		return lines;
	}
 
	public void write(Set<Line> lines, String fileName) {
		PrintWriter out = null;
		try {
			out = new PrintWriter(new FileWriter(fileName));
			for (Line line : lines) {
				out.println(line);
			}
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			// out is still null if the file could not be created
			if (out != null) {
				out.close();
			}
		}
	}
}
 
 
class Line {
	private String line;
	private String[] atoms;
 
	public Line(String line) {
		this.line = line;
		atoms = line.split("\\s*,\\s*");
	}
 
	public String getLine() {
		return this.line;
	}
 
	public void setLine(String line) {
		this.line = line;
	}
 
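	public boolean equals(Object o) {
		// two lines are "equal" when their reference numbers (second field) match
		if (!(o instanceof Line)) {
			return false;
		}
		Line otherLine = (Line) o;
		return atoms[1].equals(otherLine.atoms[1]);
	}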
 
	public int hashCode() {
		return atoms[1].hashCode();
	}
 
	public String toString() {
	    return line;
	}
}
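
One point worth noting about the Set approach: adding an element to a LinkedHashSet that already contains an equal element leaves the set unchanged, so the first line seen for each reference number is the one that is kept. The sketch below demonstrates this with a hypothetical Entry class standing in for the Line wrapper; FirstWinsDemo and firstPerReference are likewise illustrative names only.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical demo: a LinkedHashSet keeps the first of several "equal" elements.
class FirstWinsDemo {

    // Stand-in for the Line wrapper: equality is based on the second field only.
    static final class Entry {
        final String text;
        final String reference;

        Entry(String text) {
            this.text = text;
            this.reference = text.split("\\s*,\\s*")[1];
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Entry && reference.equals(((Entry) o).reference);
        }

        @Override
        public int hashCode() {
            return reference.hashCode();
        }

        @Override
        public String toString() {
            return text;
        }
    }

    static Set<Entry> firstPerReference(String[] lines) {
        Set<Entry> unique = new LinkedHashSet<Entry>();
        for (String line : lines) {
            // add() returns false and changes nothing if an equal Entry is already present
            unique.add(new Entry(line));
        }
        return unique;
    }

    public static void main(String[] args) {
        String[] lines = {
            "08:47:22,95847170050,Problem inputting data.",
            "08:47:35,95847170050,Problem inputting data."
        };
        System.out.println(firstPerReference(lines)); // prints the 08:47:22 line only
    }
}
```

If the most recent line is wanted instead, one option is a Map keyed by the reference number, since put() replaces the value stored for an existing key.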

 

Author Comment

by:mistermuv
ID: 21864165
Hi CEHJ,

When I compile your code I get the following errors, although I can't see anything wrong with it.
I am compiling against j2sdk1.4.2_17 due to work constraints.

>javac Uniq.java
Uniq.java:15: <identifier> expected
      public Set<Line> read(String fileName) {
                  ^
Uniq.java:44: ';' expected
}
^
2 errors
 
LVL 86

Expert Comment

by:CEHJ
ID: 21864236
Yes, my code uses generics, which require Java 1.5 or later. Remove everything in angle brackets and cast appropriately.
 
LVL 86

Accepted Solution

by:CEHJ
CEHJ earned 55 total points
ID: 21864260
Try the following:
import java.io.*;
import java.util.*;
 
public class Uniq {
	public static void main(String[] args) {
		if (args.length < 2) {
		    System.err.println("Usage: java Uniq <infile> <outfile>");
		    System.exit(1);
		}
		Uniq u = new Uniq();
		Set lines = u.read(args[0]);
		u.write(lines, args[1]);
	}
 
	public Set read(String fileName) {
		Set lines = new LinkedHashSet();
		Scanner s = null;
		try {
			s = new Scanner(new File(fileName));
			while (s.hasNextLine()) {
				lines.add(new Line(s.nextLine()));
			}
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			// s is still null if the file could not be opened
			if (s != null) {
				s.close();
			}
		}
		return lines;
	}
 
	public void write(Set lines, String fileName) {
		PrintWriter out = null;
		try {
			out = new PrintWriter(new FileWriter(fileName));
			Iterator i = lines.iterator();
			while (i.hasNext()) {
				out.println(i.next());
			}
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			// out is still null if the file could not be created
			if (out != null) {
				out.close();
			}
		}
	}
}
 
 
class Line {
	private String line;
	private String[] atoms;
 
	public Line(String line) {
		this.line = line;
		atoms = line.split("\\s*,\\s*");
	}
 
	public String getLine() {
		return this.line;
	}
 
	public void setLine(String line) {
		this.line = line;
	}
 
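	public boolean equals(Object o) {
		// two lines are "equal" when their reference numbers (second field) match
		if (!(o instanceof Line)) {
			return false;
		}
		Line otherLine = (Line) o;
		return atoms[1].equals(otherLine.atoms[1]);
	}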
 
	public int hashCode() {
		return atoms[1].hashCode();
	}
 
	public String toString() {
	    return line;
	}
}

 

Author Comment

by:mistermuv
ID: 21864338
I think the "scanner" class was introduced in v1.5 too.

>javac Uniq.java
Uniq.java:17: cannot resolve symbol
symbol  : class Scanner
location: class Uniq
      Scanner s = null;
                ^
Uniq.java:19: cannot resolve symbol
symbol  : class Scanner
location: class Uniq
      s = new Scanner(new File(fileName));
                                ^
2 errors
 

Author Comment

by:mistermuv
ID: 21864349
I suppose I will have to use the traditional approach of FileReader, BufferedReader and the like.
 
LVL 86

Expert Comment

by:CEHJ
ID: 21864634
>>I think the "scanner" class was introduced in v1.5 too.

Sorry, yes. BufferedReader can be used instead. Running the app I posted produces:

08:47:22,95847170050,Problem inputting data.
08:53:28, 96672540040, More problems inputting data.
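
For readers who cannot use Scanner, the reading step might look something like the sketch below, which relies only on BufferedReader and FileReader (both available in Java 1.4). The class and method names (LineReader, readLines) are hypothetical, and the generic type parameters are used only for readability; on 1.4 they would be removed as described earlier in the thread.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: reads a text file line by line without java.util.Scanner.
class LineReader {

    // Reads every line of the named file into a list, in file order.
    static List<String> readLines(String fileName) throws IOException {
        List<String> lines = new ArrayList<String>();
        BufferedReader in = null;
        try {
            in = new BufferedReader(new FileReader(fileName));
            String line;
            // readLine() returns null once the end of the file is reached
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } finally {
            // in is still null if the file could not be opened
            if (in != null) {
                in.close();
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: java LineReader <infile>");
            return;
        }
        for (String line : readLines(args[0])) {
            System.out.println(line);
        }
    }
}
```

Each line read this way can be wrapped in a Line object and added to the Set exactly as the Scanner loop does.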
 
LVL 86

Expert Comment

by:CEHJ
ID: 21868610
:-)