Solved

Getting rid of duplicate lines

Posted on 2013-11-27
Last Modified: 2013-11-30
I need something that will read a text file and pull out all of the text that DOES NOT have a duplicate line in the file. I would like to see a .bat file do this so I can understand how to set and reset variables within a for loop. I am not quite getting what is going on.

I would appreciate any help here.

Thanks,
Jim
Question by:scuzz1
20 Comments
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 39682728
A simpler way (but manual) is to import your text file into Excel and use conditional formatting to highlight all your duplicate lines.

HTH,
Dan
0
 

Author Comment

by:scuzz1
ID: 39682732
There are almost 200000 lines.

That would take a while. I am also hoping to learn something from this solution.

Thanks,
Jim
0
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 39682738
I don't know how one would implement a hashtable/array in batch or how efficient that would be.

But if you have access to a Linux box, the following will copy only the lines without duplicates to output.txt:

sort input.txt | uniq -u > output.txt

Gotta love Linux sometimes :)
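
One caveat worth knowing: `uniq` only compares adjacent lines, so on unsorted input it can miss duplicates that sit far apart. Sorting first (for example `sort input.txt | uniq -u`) avoids that. A minimal Python sketch of the adjacent-only behaviour, where the function name `uniq_u` is made up for illustration:

```python
from itertools import groupby

def uniq_u(lines):
    """Mimic `uniq -u`: keep a line only if its run of adjacent identical lines has length 1."""
    result = []
    for line, run in groupby(lines):
        if len(list(run)) == 1:
            result.append(line)
    return result

unsorted_lines = ["b", "a", "b", "c"]
print(uniq_u(unsorted_lines))          # the two "b" lines are not adjacent, so both survive
print(uniq_u(sorted(unsorted_lines)))  # after sorting, both "b" lines are dropped
```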
0
 

Author Comment

by:scuzz1
ID: 39682744
That only tells me it can be done....Only thing I need now is an algorithm that works in Winders...
0
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 39682749
The algorithm is simple:
1. Create a sort file that will contain the unique lines from the input and the number of repetitions of each:
create sort file
for each line in input file
  if line not in sort file
    add it with repetition number 1
  else
    read repetition number
    increment repetition number
    replace repetition number with incremented one
  end if
end for

2. Read the sort file and output only the lines with repetition number 1.

The problem is how efficient that would be in batch with a 200,000-line input. Especially if duplicates are scarce.
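
For anyone who wants to check the logic of the two steps outside batch, here is a rough Python sketch. It is an illustration only: it swaps the sort file for an in-memory counter, and the function name `lines_without_duplicates` is invented for the example.

```python
from collections import Counter

def lines_without_duplicates(lines):
    """Return the lines that occur exactly once, in their original order."""
    counts = Counter(lines)                                # step 1: count repetitions of each line
    return [line for line in lines if counts[line] == 1]   # step 2: keep lines with count 1

sample = ["a99553", "a99554", "a99554", "a99555"]
print(lines_without_duplicates(sample))
```

A counter keyed on the line text makes the "is this line already in the sort file" check a lookup rather than a search, which is the part that would likely be slow in batch.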
0
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39682815
Could we see a sample few lines please? Also, is a duplicate based on a completely identical line or a certain part of it?

As long as it is the whole line, a field at the start of the line, or certain chars from the left, then a sort solution would work.

This would leave your output in alphabetical order, and any blank lines etc. will have gone.

Are there likely to be any non batch friendly chars like <>|& etc. in there?

steve
0
 
LVL 34

Assisted Solution

by:Dan Craciun
Dan Craciun earned 50 total points
ID: 39682826
If you're stuck on Windows, you can use a more modern approach, in PowerShell:

Get-Content test.txt | Select-Object -Unique > out.txt
0
 
LVL 27

Expert Comment

by:tliotta
ID: 39683250
Since this is in the MS DOS topic, are you seriously asking if this can be done in an actual DOS .BAT file? So far, it sounds like you are. It's possible that (1) it can't be reasonably done in a DOS .BAT file, and (2) no one remembers how to do it assuming that it's even possible.

Tom
0
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 39683265
@Tom: I'm sure oBda and probably others can prove that they have a long memory :)

The OP did say this is a learning exercise. Why would someone put a lot of time and effort in learning batch in 2013, I don't know, but I guess some things never die.
0
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39683361
If the OP can come back with the info mentioned before, it is not that difficult in batch, all depending upon what is classed as unique. It can be as simple as sorting the file and looping through the lines, outputting each one if it is different from the last: 2-3 lines of batch.

For a one-off job I wouldn't do it in batch, mind, and if there was any possibility of control characters in there then you are much better off doing a similar job in VBScript, or even easier in PS as has been shown already.

In batch though, roughly something like this: sort it to temp.txt, loop over the lines in temp.txt, echo each line that is different from the last, and redirect all that to output.txt.

Steve

@echo off
setlocal enabledelayedexpansion
sort < input.txt > temp.txt
set lastline=###
(for /f "tokens=*" %%a in (temp.txt) do (
  if NOT "!lastline!"=="%%~a" echo %%~a
)) > output.txt

Steve
0
 

Author Comment

by:scuzz1
ID: 39683513
? Is duplicate based on a completely identical line or a certain part of it?
? As long as it is the whole line

a. Lines are completely identical

? Are there likely to be any non batch friendly chars like <>|& etc. in there

a. No special characters

Text is very short. Maybe an average of 10 chars per line.

I am going to try Steve's approach.

Thanks for all of your input.

Jim
0
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39683527
It would help if it kept note of the last line...

Add
set lastline=%%~a

before the last line, i.e.

@echo off
setlocal enabledelayedexpansion
sort < input.txt > temp.txt
set lastline=###
(for /f "tokens=*" %%a in (temp.txt) do (
  if NOT "!lastline!"=="%%~a" echo %%~a
  set lastline=%%~a
)) > output.txt

0
 

Author Comment

by:scuzz1
ID: 39683561
Ok. That only deleted one of the duplicated lines. I need to delete both lines.

i.e.

a99552
a99553
a99554 < --- Delete
a99554 < --- Delete
a99555
a99556
a9955g
a9955r < --- Delete
a9955r < --- Delete
a9956
a99568

There will only be pairs like that too.

Jim
0
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39683674
Oh, then I misunderstood, I thought you wanted a list just made unique, will consider now.
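
The two readings give different results on the sample posted above. As a quick illustration in Python (not batch, and only a sketch): making the list unique keeps one copy of each repeated line, which is also what `Select-Object -Unique` and the sort-and-compare loop produce, while the request here is to drop every line that has a duplicate.

```python
from collections import Counter

sample = ["a99552", "a99553", "a99554", "a99554", "a99555", "a99556",
          "a9955g", "a9955r", "a9955r", "a9956", "a99568"]

# Reading 1: make the list unique (one copy of each repeated line is kept)
deduplicated = list(dict.fromkeys(sample))

# Reading 2: drop every line that has a duplicate (both copies go)
counts = Counter(sample)
singletons = [line for line in sample if counts[line] == 1]

print(len(deduplicated), len(singletons))  # prints: 9 7
```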

Steve
0
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 39683677
Just as a curiosity: what's wrong with the one line in PS approach?
0
 
LVL 53

Expert Comment

by:Bill Prew
ID: 39683720
Here's an approach, adjust to your file names and give it a try.

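@echo off
setlocal EnableDelayedExpansion

set InFile=in.txt
set OutFile=out.txt

rem For each line, count the lines in the file that match it exactly
rem and echo it only when that count is 1.
rem findstr /x /l matches whole lines literally, so a9956 does not match a99568.
(
  for /f "usebackq tokens=*" %%A in ("%InFile%") do (
    for /f %%B in ('findstr /x /l /c:"%%~A" "%InFile%" ^| find /c /v ""') do (
      if %%B EQU 1 echo %%~A
    )
  )
) > "%OutFile%"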
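
A caveat for any count-based approach: a substring count, which is what a plain `find /c "text"` gives, is not the same as an exact-line count. That matters for data like the sample in this thread, where a9956 is also the start of a99568. A tiny Python sketch of the difference, using two lines from that sample:

```python
lines = ["a9956", "a99568"]
target = "a9956"

containing = sum(1 for line in lines if target in line)  # substring match
equal = sum(1 for line in lines if line == target)       # whole-line match

print(containing, equal)  # prints: 2 1
```

With a substring count, a9956 would look duplicated and be dropped even though it occurs only once.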

~bp
0
 
LVL 43

Accepted Solution

by:Steve Knight
Steve Knight earned 450 total points
ID: 39684322
This is what I came up with earlier but didn't have time to post, bit more complicated than Bill's but doesn't need to run a "find" for each line.

Steve

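@echo off
setlocal enabledelayedexpansion
rem Sort so that identical lines end up next to each other
sort < input.txt > temp.txt

rem lastline = previous line read, Dup = whether that line has repeated
set lastline=#FIRST#
set Dup=#NO#

(for /f "tokens=*" %%a in (temp.txt) do (
  rem Current line repeats the previous one, so flag the run as a duplicate
  if "!lastline!"=="%%~a" set Dup=#YES#
  rem Very first line: nothing to compare against yet
  if "!lastline!"=="#FIRST#" set lastline=%%~a
  rem Line changed: print the previous line only if it never repeated
  if NOT "!lastline!"=="%%~a" (
    if "!Dup!"=="#NO#" echo !lastline!
    set Dup=#NO#
  )
  set lastline=%%~a
)
rem The final line has no successor, so check it after the loop
if "!Dup!"=="#NO#" echo !lastline!
) > output.txt
start output.txt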
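
Since the question was about setting and resetting variables inside a loop, the same state machine may be easier to follow in a language with less quoting. This is a Python sketch of the idea, not a translation guaranteed to match the batch file in every edge case. The names `singles_from_sorted`, `last` and `dup` are made up, with `last` and `dup` standing in for lastline and Dup.

```python
def singles_from_sorted(sorted_lines):
    """Yield lines that occur exactly once. The input must already be sorted."""
    last = None   # previous line; None plays the role of #FIRST#
    dup = False   # True once the previous line has been seen again
    for line in sorted_lines:
        if last is not None and line == last:
            dup = True            # same as previous line: flag the run as duplicated
        elif last is not None:
            if not dup:
                yield last        # line changed and the previous one never repeated
            dup = False           # reset the flag for the new run
        last = line
    if last is not None and not dup:
        yield last                # the final line has no successor, so check it here

print(list(singles_from_sorted(sorted(["b", "a", "b", "c"]))))  # prints: ['a', 'c']
```

The key point is that a line can only be printed one step late, once the next line shows whether it repeated, which is why the last line needs its own check after the loop.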

0
 

Author Closing Comment

by:scuzz1
ID: 39687245
That is it Steve. I knew it could be done. I just could not get it in my head. Thanks.

Dan, this is just something that bothered me. Your solution did work. I am not too familiar with PowerShell. I guess I should be. I am going to take Steve's version and see if I can convert it to vbs. I just like to play around. That is the only way I can learn.

Thank you all.
Jim
0
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39687284
No problem, EE is where I learnt most of the fancier batch stuff and VBScript, from people like SteveGTR, Bill Prew, Qlemo and many others, and it comes in so useful, frankly... good luck with it.

Now to get the kids asleep...
0
 
LVL 43

Expert Comment

by:Steve Knight
ID: 39687287
Is there an EE zone I have missed for teaching kids how to sleep at night, get up in the morning and eat the food you put in front of them?
0
