Solved

Getting rid of duplicate lines

Posted on 2013-11-27
20
313 Views
Last Modified: 2013-11-30
I need something that will read a text file and pull out every line that DOES NOT have a duplicate elsewhere in the file. I would like to see a .bat file do this so I can understand how to set and reset variables within a for loop. I am not quite getting what is going on.

I would appreciate any help here.

Thanks,
Jim
0
Question by:scuzz1
20 Comments
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39682728
A simpler way (but manual) is to import your text file into Excel and use conditional formatting to highlight all your duplicate lines.

HTH,
Dan
0
 

Author Comment

by:scuzz1
ID: 39682732
There are almost 200000 lines.

That would take a while. I am also hoping to learn something from this solution.

Thanks,
Jim
0
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39682738
I don't know how one would implement a hashtable/array in batch or how efficient that would be.

But if you have access to a Linux box, the following will copy only the lines without duplicates to output.txt (uniq only compares adjacent lines, so the input is sorted first):

sort input.txt | uniq -u > output.txt

Gotta love Linux sometimes :)
0
 

Author Comment

by:scuzz1
ID: 39682744
That only tells me it can be done....Only thing I need now is an algorithm that works in Winders...
0
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39682749
The algorithm is simple:
1. Create a sort file that will contain the unique lines from the input and the number of repetitions of each:
create sort file
for each line in input file
  if line not in sort file
    add it with repetition number 1
  else
    read repetition number
    increment repetition number
    replace repetition number with incremented one
  end if
end for

2. Read the sort file and output only the lines with repetition number 1.

The problem is how efficient that would be in batch with a 200,000-line input, especially if duplicates are scarce.
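
For comparison, the two steps above can be sketched outside batch. This is a minimal Python sketch, not something posted in the thread: it assumes whole-line, case-sensitive matching, and `lines_without_duplicates` is a hypothetical helper name. An in-memory counter plays the role of the "sort file".

```python
from collections import Counter


def lines_without_duplicates(lines):
    """Return the lines that occur exactly once, in their original order.

    `lines` is assumed to be a list of strings (one entry per line).
    """
    # Step 1: count how many times each distinct line occurs.
    counts = Counter(lines)
    # Step 2: keep only the lines whose repetition number is 1.
    return [line for line in lines if counts[line] == 1]


if __name__ == "__main__":
    sample = ["a99553", "a99554", "a99554", "a99555"]
    print(lines_without_duplicates(sample))
```

With the sample list this prints ['a99553', 'a99555']. Because the counter is a hash table, each line is looked up in roughly constant time, which is the part that is awkward to reproduce in plain batch.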
0
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39682815
Could we see a few sample lines, please? Also, is a duplicate based on a completely identical line or only a certain part of it?

As long as it is the whole line, a field at the start of the line, or a certain number of chars from the left, then a sort solution would work.

This would leave your output in alphabetical order, and any blank lines etc. will have gone.

Are there likely to be any non batch friendly chars like <>|& etc. in there?

Steve
0
 
LVL 35

Assisted Solution

by:Dan Craciun
Dan Craciun earned 50 total points
ID: 39682826
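If you're stuck on Windows, you can use a more modern approach, in PowerShell:

Get-Content test.txt | Select-Object -Unique > out.txt

That keeps one copy of each repeated line. To drop every line that has a duplicate, group the lines and keep only the groups of one:

Get-Content test.txt | Group-Object | Where-Object { $_.Count -eq 1 } | ForEach-Object { $_.Name } > out.txt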
0
 
LVL 27

Expert Comment

by:tliotta
ID: 39683250
Since this is in the MS DOS topic, are you seriously asking if this can be done in an actual DOS .BAT file? So far, it sounds like you are. It's possible that (1) it can't be reasonably done in a DOS .BAT file, and (2) no one remembers how to do it assuming that it's even possible.

Tom
0
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39683265
@Tom: I'm sure oBda and probably others can prove that they have a long memory :)

The OP did say this is a learning exercise. Why someone would put a lot of time and effort into learning batch in 2013, I don't know, but I guess some things never die.
0
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39683361
If the OP can come back with the info mentioned before, it is not that difficult in batch; it all depends on what is classed as unique. It can be as simple as sorting the file and looping through the lines, outputting each one if it differs from the last one, 2-3 lines of batch.

Mind you, for a one-off job I wouldn't do it in batch, and if there were any possibility of control characters in there you would be much better off doing a similar job in VBScript, or even easier PowerShell, as has been shown already.

In batch though, roughly something like this: sort it to temp.txt, loop over the lines in temp.txt and, if a line is different from the last one, echo it; redirect all that to output.txt.

Steve

@echo off
setlocal enabledelayedexpansion
sort < input.txt > temp.txt
set lastline=###
(for /f "tokens=*" %%a in (temp.txt) do (
  if NOT "!lastline!"=="%%~a" echo %%~a
)) > output.txt

Steve
0
 

Author Comment

by:scuzz1
ID: 39683513
Q. Is duplicate based on a completely identical line or a certain part of it?
A. Lines are completely identical.

Q. Are there likely to be any non batch friendly chars like <>|& etc. in there?
A. No special characters.

Text is very short. Maybe an average of 10 chars per line.

I am going to try Steve's approach.

Thanks for all of your input.

Jim
0
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39683527
It would help if it kept note of the last line...

Add
set lastline=%%~a

before the last line, i.e.

@echo off
setlocal enabledelayedexpansion
sort < input.txt > temp.txt
set lastline=###
(for /f "tokens=*" %%a in (temp.txt) do (
  if NOT "!lastline!"=="%%~a" echo %%~a
  set lastline=%%~a
)) > output.txt
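
The same sort-then-compare loop, written as a minimal Python sketch to show why the "remember the last line" step matters. It is an illustration only, not part of the posted script; `unique_sorted` is a hypothetical name, and like the batch version it keeps one copy of each line rather than dropping both.

```python
def unique_sorted(lines):
    """Sort the lines and keep one copy of each distinct line."""
    result = []
    last_line = None  # plays the role of the batch variable lastline
    for line in sorted(lines):
        if line != last_line:
            result.append(line)
        # Without this update, last_line would keep its initial value,
        # every line would look "different", and nothing would be dropped.
        last_line = line
    return result


if __name__ == "__main__":
    print(unique_sorted(["b", "a", "b"]))
```

In batch the update has to be read back with !lastline! rather than %lastline%, because %...% inside a parenthesised for block is expanded once, before the loop starts running.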

0
 

Author Comment

by:scuzz1
ID: 39683561
OK. That only deleted one of the duplicated lines. I need to delete both lines.

i.e.

a99552
a99553
a99554 < --- Delete
a99554 < --- Delete
a99555
a99556
a9955g
a9955r < --- Delete
a9955r < --- Delete
a9956
a99568

There will only be pairs like that too.

Jim
0
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39683674
Oh, then I misunderstood; I thought you just wanted the list made unique. Will consider now.

Steve
0
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39683677
Just as a curiosity: what's wrong with the one-line PS approach?
0
 
LVL 55

Expert Comment

by:Bill Prew
ID: 39683720
Here's an approach, adjust to your file names and give it a try.

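@echo off
setlocal EnableDelayedExpansion

set InFile=in.txt
set Outfile=out.txt

rem For each line, count the lines of the file that contain it and keep it if the count is 1.
rem Caveat: find /c matches substrings, so a line contained in a longer line is also counted.
(
  for /f "usebackq tokens=*" %%A in ("%InFile%") do (
    for /f %%B in ('find /c "%%~A" ^< "%InFile%"') do (
      if %%B EQU 1 echo %%~A
    )
  )
) > "%OutFile%"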

~bp
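
One thing to watch with a per-line count: find /c "text" counts the lines that contain the text, not the lines that equal it. The Python sketch below (an illustration with hypothetical helper names, using values from the sample data in this thread) shows the difference between the two kinds of count.

```python
def count_containing(lines, text):
    """Count lines that contain `text` anywhere (substring match)."""
    return sum(1 for line in lines if text in line)


def count_exact(lines, text):
    """Count lines that are exactly equal to `text`."""
    return sum(1 for line in lines if line == text)


if __name__ == "__main__":
    sample = ["a9955r", "a9955r", "a9956", "a99568"]
    # "a9956" is also a prefix of "a99568", so the substring count is higher.
    print(count_containing(sample, "a9956"), count_exact(sample, "a9956"))
```

Here the substring count for a9956 is 2 while the exact count is 1, so a filter based on the substring count would drop a line that actually has no duplicate.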
0
 
LVL 43

Accepted Solution

by:
Steve Knight earned 450 total points
ID: 39684322
This is what I came up with earlier but didn't have time to post. It is a bit more complicated than Bill's, but it doesn't need to run a "find" for each line.

Steve

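@echo off
setlocal enabledelayedexpansion
rem Sort first so that identical lines end up next to each other
sort < input.txt > temp.txt

rem lastline holds the previous line and Dup records whether it has repeated
Set lastline=#FIRST#
set Dup=#NO#

(for /f "tokens=*" %%a in (temp.txt) do (
  rem Current line equals the previous one, so flag the run as duplicated
  if "!DUP!!lastline!"=="#NO#%%~a" set Dup=#YES#
  if "!DUP!!lastline!"=="#YES#%%~a" set Dup=#YES#
  if "!lastline!"=="#FIRST#" set lastline=%%~a
  rem A new value starts here, so print the previous line only if it never repeated
  if NOT "!lastline!"=="%%~a" (
    IF "!Dup!"=="#NO#" echo !lastline!
    set Dup=#NO#
  )
  set lastline=%%~a
)
rem Flush the final line, which the loop never compares against a successor
if "!Dup!"=="#NO#" echo !lastline!
) > output.txt
start output.txt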
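
For readers following the logic rather than the batch syntax, here is the same single-pass idea as a minimal Python sketch. It is an illustration under the thread's assumptions (whole-line comparison, no special characters), not a translation checked against the script above line by line; `singles_from_sorted` is a hypothetical name.

```python
def singles_from_sorted(lines):
    """Return lines that occur exactly once, using one pass over sorted input."""
    result = []
    last_line = None    # previous line, like lastline in the batch script
    duplicated = False  # True once last_line has been seen more than once
    for line in sorted(lines):
        if last_line is not None and line == last_line:
            # Same value as before: mark the whole run as duplicated.
            duplicated = True
            continue
        # A new value starts: emit the previous one if it never repeated.
        if last_line is not None and not duplicated:
            result.append(last_line)
        last_line = line
        duplicated = False
    # Flush the final run, which has no successor to trigger the check.
    if last_line is not None and not duplicated:
        result.append(last_line)
    return result


if __name__ == "__main__":
    print(singles_from_sorted(["a99553", "a99554", "a99554", "a99555"]))
```

The key design point is that a line can only be written out once the next, different line has been read, which is why both versions need a final check after the loop ends.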

0
 

Author Closing Comment

by:scuzz1
ID: 39687245
That is it Steve. I knew it could be done. I just could not get it in my head. Thanks.

Dan, this is just something that bothered me. Your solution did work. I am not too familiar with PowerShell. I guess I should be. I am going to take Steve's version and see if I can convert it to VBS. I just like to play around. That is the only way I can learn.

Thank you all.
Jim
0
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39687284
No problem. EE is where I learnt most of the fancier batch stuff and VBScript, from people like SteveGTR, Bill Prew, Qlemo and many others, and it has come in so useful, frankly... Good luck with it.

Now to get the kids asleep...
0
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39687287
Is there an EE zone I have missed for teaching kids how to sleep at night, get up in the morning and eat the food you put in front of them?
0

Question has a verified solution.