Solved

Getting rid of duplicate lines

Posted on 2013-11-27
Last Modified: 2013-11-30
I need something that will read a text file and pull out all of the text that DOES NOT have a duplicate line in the file. I would like to see a .bat file do this so I can understand how to set and reset variables within a for loop. I am not quite getting what is going on.

I would appreciate any help here.

Thanks,
Jim
Question by:scuzz1
20 Comments
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39682728
A simpler way (but manual) is to import your text file into Excel and use conditional formatting to highlight all your duplicate lines.

HTH,
Dan
 

Author Comment

by:scuzz1
ID: 39682732
There are almost 200000 lines.

That would take a while. I am also hoping to learn something from this solution.

Thanks,
Jim
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39682738
I don't know how one would implement a hashtable/array in batch or how efficient that would be.

But if you have access to a Linux box, the following will copy only lines without duplicates to output.txt (uniq compares adjacent lines only, so the input is sorted first):

sort input.txt | uniq -u > output.txt

Gotta love Linux sometimes :)

 

Author Comment

by:scuzz1
ID: 39682744
That only tells me it can be done....Only thing I need now is an algorithm that works in Winders...
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39682749
The algorithm is simple:
1. Create a sort file that will contain the unique lines from the input and the number of repetitions of each:
create sort file
for each line read from input file
  if line not in sort file
    add it with repetition number 1
  else
    read repetition number
    increment repetition number
    replace repetition number with incremented one
  end if
end for

2. Read the sort file and output only the lines with repetition number 1.

The problem is how efficient that would be in batch with a 200,000-line input. Especially if duplicates are scarce.
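
As a rough illustration of the counting idea above (in Python rather than batch, purely to show the logic), a dictionary of counts can stand in for the sort file. The function name `lines_without_duplicates` and the sample data are assumptions made for this sketch, not something from the thread:

```python
from collections import Counter


def lines_without_duplicates(lines):
    """Return the lines that occur exactly once, in their original order."""
    # Step 1: count how many times each line occurs (the "sort file").
    counts = Counter(lines)
    # Step 2: keep only the lines whose repetition number is 1.
    return [line for line in lines if counts[line] == 1]


sample = ["a99553", "a99554", "a99554", "a99555", "a9956", "a99568"]
print(lines_without_duplicates(sample))
```

Because a dictionary lookup takes constant time on average, this makes two passes over the input no matter how scarce the duplicates are, which is the part that is hard to reproduce in plain batch.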
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39682815
Could we see a few sample lines, please? Also, is a duplicate based on a completely identical line or a certain part of it?

As long as it is the whole line, a field at the start of the line, or certain chars from the left, then a sort solution would work.

This would leave your output in alphabetical order, and any blank lines etc. will have gone.

Are there likely to be any non-batch-friendly chars like <>|& etc. in there?

steve
 
LVL 35

Assisted Solution

by:Dan Craciun
Dan Craciun earned 200 total points
ID: 39682826
If you're stuck on Windows, you can use a more modern approach, in PowerShell:

Get-Content test.txt | Select-Object -Unique > out.txt
 
LVL 27

Expert Comment

by:tliotta
ID: 39683250
Since this is in the MS DOS topic, are you seriously asking if this can be done in an actual DOS .BAT file? So far, it sounds like you are. It's possible that (1) it can't be reasonably done in a DOS .BAT file, and (2) no one remembers how to do it assuming that it's even possible.

Tom
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39683265
@Tom: I'm sure oBda and probably others can prove that they have a long memory :)

The OP did say this is a learning exercise. Why would someone put a lot of time and effort in learning batch in 2013, I don't know, but I guess some things never die.
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39683361
If the OP can come back with the info mentioned before, it is not that difficult in batch, all depending upon what is classed as unique. It can be as simple as sorting the file and looping through the lines, outputting each one if it is different from the last one; 2-3 lines of batch.

For a one off job I wouldn't do it in batch mind, and if there was any possibility of any control characters in there then you are much better off doing similar job in VBScript, or even easier PS as has been shown already.

In batch though, roughly something like this... sort it to temp.txt, loop over lines in temp.txt and if the lines are different to the last then echo it, redirect all that to output.txt

Steve

@echo off
setlocal enabledelayedexpansion
sort < input.txt > temp.txt
set lastline=###
(for /f "tokens=*" %%a in (temp.txt) do (
  if NOT "!lastline!"=="%%~a" echo %%~a
)) > output.txt

Steve
 

Author Comment

by:scuzz1
ID: 39683513
? is duplicate based on a completely identical line or a certain part of it?
? long as it is the whole line

a. Lines are completely identical

? Are there likely to be any non batch friendly chars like <>|& etc. in there

a. No special characters

Text is very short. Maybe an average of 10 chars per line.

I am going to try Steve's approach.

Thanks for all of your input.

Jim
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39683527
It would help if it kept note of the last line...

Add
set lastline=%%~a

before the last line, i.e.

@echo off
setlocal enabledelayedexpansion
sort < input.txt > temp.txt
set lastline=###
(for /f "tokens=*" %%a in (temp.txt) do (
  if NOT "!lastline!"=="%%~a" echo %%~a
  set lastline=%%~a
)) > output.txt

 

Author Comment

by:scuzz1
ID: 39683561
OK. That only deleted one of the duplicated lines. I need to delete both lines.

i.e.

a99552
a99553
a99554 < --- Delete
a99554 < --- Delete
a99555
a99556
a9955g
a9955r < --- Delete
a9955r < --- Delete
a9956
a99568

There will only be pairs like that too.

Jim
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39683674
Oh, then I misunderstood, I thought you wanted a list just made unique, will consider now.

Steve
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39683677
Just as a curiosity: what's wrong with the one-line PS approach?
 
LVL 56

Expert Comment

by:Bill Prew
ID: 39683720
Here's an approach, adjust to your file names and give it a try.

@echo off
setlocal EnableDelayedExpansion

set InFile=in.txt
set Outfile=out.txt

(
  for /f "usebackq tokens=*" %%A in ("%InFile%") do (
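    rem Match whole lines only: a plain find /c would also count substring matches
    for /f %%B in ('findstr /x /l /c:"%%~A" "%InFile%" ^| find /c /v ""') do (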
      if %%B EQU 1 echo %%~A
    )
  )
) > "%OutFile%"

~bp
 
LVL 43

Accepted Solution

by:
Steve Knight earned 1800 total points
ID: 39684322
This is what I came up with earlier but didn't have time to post, bit more complicated than Bill's but doesn't need to run a "find" for each line.

Steve

@echo off
setlocal enabledelayedexpansion
sort < input.txt > temp.txt

Set lastline=#FIRST#
set Dup=#NO#

(for /f "tokens=*" %%a in (temp.txt) do (
  if "!DUP!!lastline!"=="#NO#%%~a" set Dup=#YES#
  if "!DUP!!lastline!"=="#YES#%%~a" set Dup=#YES#
  if "!lastline!"=="#FIRST#" set lastline=%%~a
  if NOT "!lastline!"=="%%~a" (
    IF "!Dup!"=="#NO#" echo !lastline!
    set Dup=#NO#
  )
  set lastline=%%~a
)
if "!Dup!"=="#NO#" echo !lastline!
) > output.txt
start output.txt
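
The same sort-then-compare idea can be sketched outside batch. The following Python version is only an illustration under assumed names (`unique_after_sort` and the sample list are hypothetical, not from the thread); it sorts the lines, groups identical neighbours, and keeps a line only when its group has exactly one member:

```python
from itertools import groupby


def unique_after_sort(lines):
    """Sort the lines, then keep those that have no identical neighbour."""
    result = []
    # After sorting, identical lines sit next to each other, so groupby
    # collects each run of equal lines into a single group.
    for line, group in groupby(sorted(lines)):
        if len(list(group)) == 1:
            result.append(line)
    return result


sample = ["a99552", "a99554", "a9955r", "a99554", "a9956", "a9955r"]
print(unique_after_sort(sample))
```

As with the batch script, the output comes back in sorted order rather than in the order of the input file.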

 

Author Closing Comment

by:scuzz1
ID: 39687245
That is it Steve. I knew it could be done. I just could not get it in my head. Thanks.

Dan, this is just something that bothered me. Your solution did work. I am not too familiar with PowerShell. I guess I should be. I am going to take Steve's version and see if I can convert it to VBS. I just like to play around. That is the only way I can learn.

Thank you all.
Jim
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39687284
No problem, EE is where I learnt most of the fancier batch stuff and VBScript, from people like SteveGTR, Bill Prew, Qlemo and many others, and it has come in so useful, frankly... good luck with it.

Now to get the kids asleep...
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39687287
Is there an EE zone I have missed for teaching kids how to sleep at night, get up in the morning and eat the food you put in front of them?