Solved

Getting rid of duplicate lines

Posted on 2013-11-27
Last Modified: 2013-11-30
I need something that will read a text file and pull out all of the text that DOES NOT have a duplicate line in the file. I would like to see a .bat file do this so I can understand how to set and reset variables within a for loop. I am not quite getting what is going on.

I would appreciate any help here.

Thanks,
Jim
Question by:scuzz1
20 Comments
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 39682728
A simpler way (but manual) is to import your text file into Excel and use conditional formatting to highlight all your duplicate lines.

HTH,
Dan

Author Comment

by:scuzz1
ID: 39682732
There are almost 200000 lines.

That would take a while. I am also hoping to learn something from this solution.

Thanks,
Jim
LVL 34

Expert Comment

by:Dan Craciun
ID: 39682738
I don't know how one would implement a hashtable/array in batch or how efficient that would be.

But if you have access to a Linux box, the following will copy only lines without duplicates to output.txt (uniq only compares adjacent lines, so sort the input first):

sort input.txt | uniq -u > output.txt

Gotta love Linux sometimes :)

Author Comment

by:scuzz1
ID: 39682744
That only tells me it can be done....Only thing I need now is an algorithm that works in Winders...
LVL 34

Expert Comment

by:Dan Craciun
ID: 39682749
The algorithm is simple:
1. Create a sort file that will contain the unique lines from the input and the number of repetitions of each:
create sort file
for each line in input file
  if line not in sort file
    add it with repetition number 1
  else
    read repetition number
    increment repetition number
    replace repetition number with incremented one
  end if
end for

2. Read the sort file and output only the lines with repetition number 1.

The problem is how efficient that would be in batch with a 200,000-line input. Especially if duplicates are scarce.
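
As a rough illustration of the two steps above (in Python rather than batch, purely to show the logic; the function name and the sample values are made up for this sketch), an in-memory counter can play the role of the sort file:

```python
from collections import Counter


def lines_without_duplicates(lines):
    """Return the lines that occur exactly once, in their original order."""
    # Step 1: build the "sort file" in memory: line -> number of repetitions.
    counts = Counter(lines)
    # Step 2: keep only the lines whose repetition number is 1.
    return [line for line in lines if counts[line] == 1]


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the real input file.
    sample = ["a99553", "a99554", "a99554", "a99555"]
    for line in lines_without_duplicates(sample):
        print(line)
```

A hash-based counter makes the lookup in step 1 cheap, which is the part that would likely be slow if every check meant rescanning a file from batch.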
LVL 43

Expert Comment

by:Steve Knight
ID: 39682815
Could we see a few sample lines please? Also, is a duplicate based on a completely identical line or a certain part of it?

As long as it is the whole line, a field at the start of the line, or certain chars from the left, then a sort solution would work.

This would leave your output in alphabetical order, and any blank lines etc. will have gone.

Are there likely to be any non batch friendly chars like <>|& etc. in there?

steve
LVL 34

Assisted Solution

by:Dan Craciun
Dan Craciun earned 50 total points
ID: 39682826
If you're stuck on Windows, you can use a more modern approach, in PowerShell:

Get-Content test.txt | Select-Object -Unique > out.txt
LVL 27

Expert Comment

by:tliotta
ID: 39683250
Since this is in the MS DOS topic, are you seriously asking if this can be done in an actual DOS .BAT file? So far, it sounds like you are. It's possible that (1) it can't be reasonably done in a DOS .BAT file, and (2) no one remembers how to do it assuming that it's even possible.

Tom
LVL 34

Expert Comment

by:Dan Craciun
ID: 39683265
@Tom: I'm sure oBda and probably others can prove that they have a long memory :)

The OP did say this is a learning exercise. Why would someone put a lot of time and effort in learning batch in 2013, I don't know, but I guess some things never die.
LVL 43

Expert Comment

by:Steve Knight
ID: 39683361
If the OP can come back with the info mentioned before, it is not that difficult in batch, all depending upon what is classed as unique. It can be as simple as sorting the file and looping through the lines, outputting each one if it differs from the last: 2-3 lines of batch.

For a one-off job I wouldn't do it in batch, mind, and if there is any possibility of control characters in there then you are much better off doing a similar job in VBScript or, even easier, PowerShell, as has been shown already.

In batch though, roughly something like this: sort the file to temp.txt, loop over the lines in temp.txt, echo each line that is different from the last, and redirect all of that to output.txt.

Steve

@echo off
setlocal enabledelayedexpansion
sort < input.txt > temp.txt
set lastline=###
(for /f "tokens=*" %%a in (temp.txt) do (
  if NOT "!lastline!"=="%%~a" echo %%~a
)) > output.txt

Steve

Author Comment

by:scuzz1
ID: 39683513
? is duplicate based on a completely identical line or a certain part of it?
? as long as it is the whole line

a. Lines are completely identical

? Are there likely to be any non batch friendly chars like <>|& etc. in there

a. No special characters

Text is very short. Maybe an average of 10 chars per line.

I am going to try Steve's approach.

Thanks for all of your input.

Jim
LVL 43

Expert Comment

by:Steve Knight
ID: 39683527
It would help if it kept note of the last line...

Add
set lastline=%%~a

before the last line, i.e.

@echo off
setlocal enabledelayedexpansion
sort < input.txt > temp.txt
set lastline=###
(for /f "tokens=*" %%a in (temp.txt) do (
  if NOT "!lastline!"=="%%~a" echo %%~a
  set lastline=%%~a
)) > output.txt


Author Comment

by:scuzz1
ID: 39683561
Ok. That only deleted one of the duplicated lines. I need to delete both lines.

i.e.

a99552
a99553
a99554 < --- Delete
a99554 < --- Delete
a99555
a99556
a9955g
a9955r < --- Delete
a9955r < --- Delete
a9956
a99568

There will only be pairs like that too.

Jim
LVL 43

Expert Comment

by:Steve Knight
ID: 39683674
Oh, then I misunderstood, I thought you wanted a list just made unique, will consider now.

Steve
LVL 34

Expert Comment

by:Dan Craciun
ID: 39683677
Just as a curiosity: what's wrong with the one-line PS approach?
LVL 51

Expert Comment

by:Bill Prew
ID: 39683720
Here's an approach, adjust to your file names and give it a try.

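@echo off
setlocal EnableDelayedExpansion

set InFile=in.txt
set OutFile=out.txt

rem For each line, count the lines in the file that match it exactly and
rem echo the line only when that count is 1.
rem findstr /x /l /c:"..." matches whole lines literally; a plain find /c "..."
rem would also count longer lines that merely contain the text.
(
  for /f "usebackq tokens=*" %%A in ("%InFile%") do (
    for /f %%B in ('findstr /x /l /c:"%%~A" "%InFile%" ^| find /c /v ""') do (
      if %%B EQU 1 echo %%~A
    )
  )
) > "%OutFile%"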

~bp
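
One thing to watch with a per-line search: `find /c "text"` counts the lines that contain the text, not the lines that are equal to it, so a short line can be counted inside a longer one. A small Python sketch of the difference (the helper names are hypothetical; the values come from the sample posted earlier in the thread):

```python
def count_containing(lines, text):
    """Count lines that contain text anywhere (how a substring search counts)."""
    return sum(1 for line in lines if text in line)


def count_equal(lines, text):
    """Count lines that are exactly equal to text (whole-line match)."""
    return sum(1 for line in lines if line == text)


if __name__ == "__main__":
    sample = ["a9955r", "a9955r", "a9956", "a99568"]
    # "a9956" appears once as a whole line, but it is also the start of "a99568".
    print(count_containing(sample, "a9956"))
    print(count_equal(sample, "a9956"))
```

With a substring count, "a9956" would look like a duplicate and be dropped even though it occurs only once as a whole line.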
LVL 43

Accepted Solution

by:
Steve Knight earned 450 total points
ID: 39684322
This is what I came up with earlier but didn't have time to post, bit more complicated than Bill's but doesn't need to run a "find" for each line.

Steve

@echo off
setlocal enabledelayedexpansion
sort < input.txt > temp.txt

Set lastline=#FIRST#
set Dup=#NO#

(for /f "tokens=*" %%a in (temp.txt) do (
  if "!DUP!!lastline!"=="#NO#%%~a" set Dup=#YES#
  if "!DUP!!lastline!"=="#YES#%%~a" set Dup=#YES#
  if "!lastline!"=="#FIRST#" set lastline=%%~a
  if NOT "!lastline!"=="%%~a" (
    IF "!Dup!"=="#NO#" echo !lastline!
    set Dup=#NO#
  )
  set lastline=%%~a
)
if "!Dup!"=="#NO#" echo !lastline!
) > output.txt
start output.txt
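
For comparison, the same sort-then-compare idea can be sketched in Python (not batch; the function name is hypothetical and this is only an illustration of the logic). Sorting puts identical lines next to each other, so a line is kept only when its run of identical neighbours has length 1:

```python
from itertools import groupby


def singles_after_sort(lines):
    """Sort the lines, then keep only values whose run of identical lines has length 1."""
    result = []
    # groupby yields one (value, run) pair per block of equal adjacent items,
    # which is what the lastline/Dup bookkeeping tracks by hand in the batch loop.
    for value, run in groupby(sorted(lines)):
        if len(list(run)) == 1:
            result.append(value)
    return result


if __name__ == "__main__":
    sample = ["a99555", "a99554", "a99553", "a99554"]
    for line in singles_after_sort(sample):
        print(line)
```

As with the batch version, the output comes out in sorted order rather than in the order of the input file.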


Author Closing Comment

by:scuzz1
ID: 39687245
That is it Steve. I knew it could be done. I just could not get it in my head. Thanks.

Dan, this is just something that bothered me. Your solution did work. I am not too familiar with PowerShell. I guess I should be. I am going to take Steve's version and see if I can convert it to VBS. I just like to play around. That is the only way I can learn.

Thank you all.
Jim
LVL 43

Expert Comment

by:Steve Knight
ID: 39687284
No problem. EE is where I learnt most of the fancier batch stuff and VBScript, from people like SteveGTR, Bill Prew, Qlemo and many others, and it has come in so useful, frankly... good luck with it.

Now to get the kids asleep...
LVL 43

Expert Comment

by:Steve Knight
ID: 39687287
Is there an EE zone I have missed for teaching kids how to sleep at night, get up in the morning and eat the food you put in front of them?