Getting rid of duplicate lines

I need something that will read a text file and pull out all of the lines that DO NOT have a duplicate line in the file. I would like to see a .bat file do this so I can understand how to set and reset variables within a for loop. I am not quite getting what is going on.

I would appreciate any help here.

Thanks,
Jim
Asked by: scuzz1
Dan Craciun (IT Consultant) commented:
A simpler way (but manual) is to import your text file into Excel and use conditional formatting to highlight all your duplicate lines.

HTH,
Dan
scuzz1 (Author) commented:
There are almost 200000 lines.

That would take a while. I am also hoping to learn something from this solution.

Thanks,
Jim
Dan Craciun (IT Consultant) commented:
I don't know how one would implement a hashtable/array in batch or how efficient that would be.

But if you have access to a Linux box, the following will copy only the lines without duplicates to output.txt (uniq only compares adjacent lines, so the input is sorted first):

sort input.txt | uniq -u > output.txt

Gotta love Linux sometimes :)
scuzz1 (Author) commented:
That only tells me it can be done... The only thing I need now is an algorithm that works in Windows.
Dan Craciun (IT Consultant) commented:
The algorithm is simple:
1. Create a sort file that will contain the unique lines from the input and the number of repetitions of each:
create sort file
for each line in input file
  if line not in sort file
    add it with repetition number 1
  else
    read repetition number
    increment repetition number
    replace repetition number with incremented one
  end if
end for

2. Read the sort file and output only the lines with repetition number 1.

The problem is how efficient that would be in batch with a 200,000-line input, especially if duplicates are scarce.
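
For comparison outside batch, the two-pass counting idea above can be sketched in Python, where a dictionary-like counter plays the role of the "sort file". This is only an illustration under assumptions: the function name `lines_without_duplicates` is made up here, lines are compared as whole strings, and the input is assumed to be a list that fits in memory.

```python
from collections import Counter


def lines_without_duplicates(lines):
    """Return the lines (a list of strings) that occur exactly once, in input order."""
    # Pass 1: count how many times each whole line appears.
    counts = Counter(lines)
    # Pass 2: keep only the lines whose repetition number is 1.
    return [line for line in lines if counts[line] == 1]


if __name__ == "__main__":
    sample = ["a", "b", "b", "c"]
    print(lines_without_duplicates(sample))  # ['a', 'c']
```

Looking a line up in the counter takes roughly constant time, which is the part that is awkward to reproduce in batch.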
Steve Knight (IT Consultancy) commented:
Could we see a few sample lines, please? Also, is a duplicate based on a completely identical line or a certain part of it?

As long as it is the whole line, a field at the start of the line, or certain chars from the left, then a sort solution would work.

This would leave your output in alphabetical order, and any blank lines etc. will have gone.

Are there likely to be any non batch friendly chars like <>|& etc. in there?

Steve
Dan Craciun (IT Consultant) commented:
If you're stuck on Windows, you can use a more modern approach, in PowerShell:

Get-Content test.txt | Select-Object -Unique > out.txt
tliotta commented:
Since this is in the MS DOS topic, are you seriously asking if this can be done in an actual DOS .BAT file? So far, it sounds like you are. It's possible that (1) it can't be reasonably done in a DOS .BAT file, and (2) no one remembers how to do it assuming that it's even possible.

Tom
Dan Craciun (IT Consultant) commented:
@Tom: I'm sure oBda and probably others can prove that they have a long memory :)

The OP did say this is a learning exercise. Why someone would put a lot of time and effort into learning batch in 2013, I don't know, but I guess some things never die.
Steve Knight (IT Consultancy) commented:
If the OP can come back with the info mentioned before, it is not that difficult in batch, all depending upon what is classed as unique. It can be as simple as sorting the file and looping through the lines, outputting each one if it is different from the last: 2-3 lines of batch.

For a one-off job I wouldn't do it in batch, mind, and if there were any possibility of control characters in there then you would be much better off doing a similar job in VBScript, or even easier, PowerShell, as has been shown already.

In batch though, roughly something like this: sort it to temp.txt, loop over the lines in temp.txt, echo each line that is different from the last, and redirect all that to output.txt.

@echo off
setlocal enabledelayedexpansion
sort < input.txt > temp.txt
set lastline=###
(for /f "tokens=*" %%a in (temp.txt) do (
  if NOT "!lastline!"=="%%~a" echo %%~a
)) > output.txt

Steve
scuzz1 (Author) commented:
? is duplicate based on a completely identical line or a certain part of it?
? long as it is the whole line

a. Lines are completely identical

? Are there likely to be any non batch friendly chars like <>|& etc. in there

a. No special characters

Text is very short. Maybe an average of 10 chars per line.

I am going to try Steve's approach.

Thanks for all of your input.

Jim
Steve Knight (IT Consultancy) commented:
It would help if it kept note of the last line...

Add
set lastline=%%~a

before the last line, i.e.

@echo off
setlocal enabledelayedexpansion
sort < input.txt > temp.txt
set lastline=###
(for /f "tokens=*" %%a in (temp.txt) do (
  if NOT "!lastline!"=="%%~a" echo %%~a
  set lastline=%%~a
)) > output.txt

scuzz1 (Author) commented:
OK. That only deleted one line of each duplicated pair. I need to delete both lines.

i.e.

a99552
a99553
a99554 < --- Delete
a99554 < --- Delete
a99555
a99556
a9955g
a9955r < --- Delete
a9955r < --- Delete
a9956
a99568

There will only be pairs like that too.

Jim
Steve Knight (IT Consultancy) commented:
Oh, then I misunderstood; I thought you wanted the list just made unique. Will reconsider now.

Steve
Dan Craciun (IT Consultant) commented:
Just as a curiosity: what's wrong with the one-line PS approach?
Bill Prew (IT / Software Engineering Consultant) commented:
Here's an approach, adjust to your file names and give it a try.

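@echo off
setlocal EnableDelayedExpansion

set InFile=in.txt
set OutFile=out.txt

rem For each line, count the lines that match it exactly and echo it only
rem if that count is 1. findstr /x /l /c: does a literal whole-line match,
rem so a9956 is not counted as a match for a99568.
(
  for /f "usebackq tokens=*" %%A in ("%InFile%") do (
    for /f %%B in ('findstr /x /l /c:"%%~A" "%InFile%" ^| find /c /v ""') do (
      if %%B EQU 1 echo %%~A
    )
  )
) > "%OutFile%"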
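
One thing to check with any count-per-line approach like this is that the count is of whole-line matches rather than substring matches, since one line can be contained in another (a9956 is contained in a99568 in the sample above). A small Python illustration of the difference, with made-up helper names:

```python
def count_containing(lines, text):
    """Count lines that contain text anywhere (substring match)."""
    return sum(1 for line in lines if text in line)


def count_equal(lines, text):
    """Count lines that are exactly equal to text (whole-line match)."""
    return sum(1 for line in lines if line == text)


sample = ["a9955g", "a9956", "a99568"]
print(count_containing(sample, "a9956"))  # 2: a99568 also contains it
print(count_equal(sample, "a9956"))       # 1: only the exact line
```

With the substring count, a9956 would look like a duplicate and be dropped even though it appears only once.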

~bp
Steve Knight (IT Consultancy) commented:
This is what I came up with earlier but didn't have time to post. It is a bit more complicated than Bill's but doesn't need to run a "find" for each line.

Steve

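@echo off
rem Delayed expansion makes the exclamation-mark form of a variable get
rem re-read on every loop iteration instead of once when the block is parsed.
setlocal enabledelayedexpansion
rem Sort first so that identical lines end up next to each other.
sort < input.txt > temp.txt

rem lastline holds the previous line, Dup records whether it was repeated.
set lastline=#FIRST#
set Dup=#NO#

(for /f "tokens=*" %%a in (temp.txt) do (
  if "!DUP!!lastline!"=="#NO#%%~a" set Dup=#YES#
  if "!DUP!!lastline!"=="#YES#%%~a" set Dup=#YES#
  if "!lastline!"=="#FIRST#" set lastline=%%~a
  if NOT "!lastline!"=="%%~a" (
    if "!Dup!"=="#NO#" echo !lastline!
    set Dup=#NO#
  )
  set lastline=%%~a
)
if "!Dup!"=="#NO#" echo !lastline!
) > output.txt
start output.txt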
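
The same sort-then-compare-neighbours idea can be sketched in Python for anyone who wants to port it. This is a rough illustration rather than a translation of the batch file: the function name is hypothetical, and `itertools.groupby` takes over the lastline/Dup bookkeeping by collecting each run of identical adjacent lines.

```python
from itertools import groupby


def unique_after_sort(lines):
    """Sort the lines, then keep those whose run of identical neighbours has length 1."""
    result = []
    # groupby yields one (value, run) pair per run of equal adjacent items,
    # so sorting first guarantees that all copies of a line share one run.
    for value, run in groupby(sorted(lines)):
        if len(list(run)) == 1:
            result.append(value)
    return result


sample = ["a99554", "a99552", "a99554", "a9955r", "a99553", "a9955r"]
print(unique_after_sort(sample))  # ['a99552', 'a99553']
```

As with the batch version, the output comes back in sorted order rather than in the original order.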

scuzz1 (Author) commented:
That is it, Steve. I knew it could be done. I just could not get it in my head. Thanks.

Dan, this is just something that bothered me. Your solution did work. I am not too familiar with PowerShell. I guess I should be. I am going to take Steve's version and see if I can convert it to VBScript. I just like to play around. That is the only way I can learn.

Thank you all.
Jim
Steve Knight (IT Consultancy) commented:
No problem. EE is where I learnt most of the fancier batch stuff and VBScript, from people like SteveGTR, Bill Prew, Qlemo and many others, and it comes in so useful, frankly... good luck with it.

Now to get the kids asleep...
Steve Knight (IT Consultancy) commented:
Is there an EE zone I have missed for teaching kids how to sleep at night, get up in the morning and eat the food you put in front of them?