Link to home
Start Free TrialLog in
Avatar of matthewdacruz
matthewdacruzFlag for United Kingdom of Great Britain and Northern Ireland

asked on

Remove duplicate lines .bat file

Hi Experts

I have created a .bat file that merges all .csv files into one master file to import.

I have used

copy *.csv importfile.csv

I want to add a function that once the master file is created it the removes all duplicate lines from the master file.

Is this possible in a .bat file?
If it is how would it be written?
Avatar of Bill Prew
Bill Prew

Will the duplicate lines already be right next to each other, or would they need to be sorted first, and if so is it okay for the ordering of the file to change?

BAT scripts are somewhat limited in this area, but we might be able to work something out.

Can you provide a sample before and after file?

~bp
Does this work for you?

@echo off

copy nul importfile.tmp >nul

for /f "tokens=*" %%a in (importfile.csv) do (
  find "%%a" importfile.tmp >nul || echo %%a>>importfile.tmp
)

move /y importfile.tmp importfile.csv

Open in new window

I think FINDSTR is needed here, rather than FIND, since FIND will remove substring matches.

@echo off

copy NUL in.tmp >NUL

for /f "tokens=*" %%A in (in.txt) do (
  findstr /B /I /C:"%%~A" in.tmp >NUL || echo %%A>>in.tmp
)

move /y in.tmp in.txt

Open in new window

~bp
Also be aware that in both these approach, all blank lines are removed.

~bp
Do you really want to consider the whole line for detecting duplicates? Or is there a key column? If the complete line "is key", and the order of entries (lines) does not matter:
@echo off
setlocal EnableDelayedExpansion

set src=importfile.csv
set dst=unique.csv

set line=
(for /F "delims=" %%L in ('sort %src%') do (
   if not ~!line!~ == ~%%L~ echo %%L
   set line=%%L
)) > "%dst%"

Open in new window

In line 6 of my code above (http:#a37902584) you could insert an '/i' to ignore case-sensitivity, like this:

find /i "%%a" importfile.tmp >nul || echo %%a>>importfile.tmp

Open in new window

Avatar of matthewdacruz

ASKER

Hi
Thanks for the responses

Ordering is not an issue.
the key column would be: Keyword

The column header is merged into the file so we would want to preserve the first row and then remove the iterations or header that follow as all files are merged into one file.

Sometime we end up with over 100 .csv files and we want to merge into one file to analyse.

So ideally we would want the following format

Column header
Keywords & keyword data for columns
bridal-makeup-artist-creamy-keyw.csv
bill
since FIND will remove substring matches
I did consider using FINDSTR however, in the probable case where all lines follow the same CSV format and consist of the same number of fields, using FIND will suffice.

However, if FIND is not sufficient then the following modified line 6 should be used instead:

findstr /b /c:"%%a" /i importfile.tmp >nul || echo %%a>>importfile.tmp

Open in new window

Try this:

Drop this batch file into the same folder as all your CSV files and run it from there. It will process all your CSV files and produce a master file named IMPORTFILE.TXT.

@echo off

copy nul importfile.txt >nul

for %%a in (*.csv) do (
  for /f "tokens=* usebackq" %%b in ("%%a") do (
    find "%%b" /i importfile.txt>nul || echo %%b>>importfile.txt
  )
)

Open in new window

Will make one LARGE improvement in a mo....
Hi
So far none of the .bats return any data.
I added to notepad, saved as test.bat in the folder with all the files and run as an administrator. Only blank .csv created

the only code that returned data was this but it was the file names and not the actual content in the file.
===========================================

@echo off

copy nul importfile.csv >nul

for %%a in (*.csv) do (
  for /f "tokens=* usebackq" %%b in ("%%a") do (
    findstr /b /c:"%%b" /i importfile.csv>nul || echo %%a>>importfile.csv
  )
)

echo See results in IMPORTFILE.CSV

===========================================

I have run

copy *.csv importfile.csv

and it created the attached importfile.csv.

All that needs to be sorted out in the .csv is the duplicates.
So if you open in excel you will see the duplicates when you sort by keyword.
importfile.csv
matthewdacruz
the only code that returned data was this but it was the file names and not the actual content
Yes, it should have been 'echo %%b', not 'echo %%a'. Try the code above (http:#a37903099).

Or try this code here, it sorts the data as well. The file is saved as IMPORTFILE.TXT (not CSV) to prevent this file from being processed along with all the others on subsequent runs).

@echo off

copy nul importfile.tmp >nul

for %%a in (*.csv) do (
  for /f "tokens=* usebackq" %%b in ("%%a") do (
    find "%%b" /i importfile.tmp>nul || echo %%b>>importfile.tmp
  )
)

(set /p header=)<importfile.tmp
echo %header%>importfile.txt

more +1 importfile.tmp|sort>>importfile.txt

Open in new window

So, are you just trying to remove the duplicated header lines, or also any data lines that turned up identical in multiple CSVs?

~bp
matthewdacruz

The attached file is the result after processing your attached CSV file in http:#a37903131.

This was processed using the code in http:#a37903305.
imortfile.txt
And I'm pretty sure that 37902667 was a working solution, as long as you changed the in.txt filename to your merged filename (2 places).

~bp
If you really only want to eliminate the header, try this:
@echo off

set header=
(for %%F in (*.csv) do (
  if not defined header (
    set header=y
    type %%F
  ) else (
    more +1 %%F <nul
  )
)) > result.txt

Open in new window

The output file is named *.txt intentionally - keeps the batch easier, as else we would have to exclude that file from the FOR loop. If you have unique names, you can also use those; just make sure the file mask in FOR does not apply to the output file.
Hi Paul
Your latest file was great the way to built out the file and ordered it was perfect.
We just got duplicate rows

 
So, are you just trying to remove the duplicated header lines, or also any data lines that turned up identical in multiple CSVs?

~bp

Yes I am wanting to get rid of the duplicate rows that have been carried across.
See the results of Pauls .bat code.

It has dealt with the duplicate headers.
Just the rows are now needing to be  de duped.

If this can be achieved that will be brilliant.

Thanks for the help so far
importfile.txt
And if you wanted a speedy and simple solution to code, get yourself a copy of gnu sort for windows (freeware) at:

http://gnuwin32.sourceforge.net/packages/coreutils.htm

and do:

sort -u *.csv > all.txt

~bp
I used ID: 37903305 code
matthewdacruz

Here are the first few lines from the output file after processing. Please advise on the following comments (I have used code-view for displaying the lines so that they do not wrap around on the screen):

The original file (which you posted) contains 1249 lines. This includes 35 header lines. After processing the file we end up with 929 lines. So even after removing 34 (because we need to keep one) header lines, we have still ended up having removed a further 286 duplicate lines.

So, please tell us where the other duplicate lines are. Below is a sample of the file after processing it.

This is the first line. It is the header.
Keyword,Lower Cost/Click,Upper Cost/click,Lower Clicks/day,Upper Clicks/day,Google Searches/day,Google Competing Pages,Yahoo Competing Pages,LSI,Bayes,Uses,Type,LARI

Open in new window

These two lines do not have the same keyword.
airbrush make-up,0.55,0.55,0.04,0.04,325.48,118000,500000,13.959,-9.06,1,pure diverse,66.39785444
airbrush makeup,0.52,0.52,7.66,7.66,325.48,64500,247000,13.959,-9.06,1,partial diverse,66.39785444

Open in new window

The following three lines have duplicate keynames but the data differs in the 9th field onwards. If only ONE had to be kept, which two would you want deleting, and on what do you base this decision?
airbrushing,0.74,0.74,17.9,17.9,2975.34,493000,700000,2.927,-5.966,2,pure diverse,45.19468688
airbrushing,0.74,0.74,17.9,17.9,2975.34,493000,700000,3.388,-6.551,2,pure diverse,44.92439007
airbrushing,0.74,0.74,17.9,17.9,2975.34,493000,700000,3.474,-6.397,2,pure diverse,45.36898628

Open in new window

No problems here.
art,2.51,2.51,936.44,936.44,364931.51,631000000,142000000,2.057,-6.879,1,pure diverse,41.33708945

Open in new window

Again, multiple keynames however, the data differs in field 9 onwards. Which to keep and which to delete?
artist,0.71,0.71,9.04,9.04,134465.75,184000000,50700000,11.154,-3.35,2,partial diverse,65.8745245
artist,0.71,0.71,9.04,9.04,134465.75,184000000,50700000,11.165,-3.547,8,partial diverse,66.91089178
artist,0.71,0.71,9.04,9.04,134465.75,184000000,50700000,11.251,-3.305,1,partial diverse,66.48910367
artist,0.71,0.71,9.04,9.04,134465.75,184000000,50700000,11.783,-3.871,3,pure diverse,63.42166535
artist,0.71,0.71,9.04,9.04,134465.75,184000000,50700000,12.011,-3.111,9,partial diverse,67.69931044
artist,0.71,0.71,9.04,9.04,134465.75,184000000,50700000,8.431,-4.047,4,pure diverse,59.302896
artist,0.71,0.71,9.04,9.04,134465.75,184000000,50700000,8.766,-4.467,3,pure diverse,64.26229594
artist,0.71,0.71,9.04,9.04,134465.75,184000000,50700000,9.918,-3.108,5,partial diverse,66.94341154

Open in new window

Same problem here.
artistic,0.97,1.21,18,27,3616.44,16600000,12100000,8.431,-4.047,1,pure diverse,59.302896
artistic,0.97,1.21,18,27,3616.44,16600000,12100000,9.918,-3.108,1,pure diverse,66.94341154

Open in new window

etc...
Thanks for the feedback Paul.

Can we filter by column "keyword"
I see what you are saying and those numbers are not relevant to what we are doing.
All we want is a unique keyword

Thanks
So what should the correct output be then?

~bp
matthewdacruz

Please try this code then...

@echo off

copy nul importfile.tmp >nul

for %%a in (orig.csv) do (
  for /f "tokens=1* usebackq delims=," %%b in ("%%a") do (
    findstr /b /c:"%%b" /i importfile.tmp>nul || echo %%b,%%c>>importfile.tmp
  )
)

(set /p header=)<importfile.tmp
echo %header%>importfile.txt

more +1 importfile.tmp|sort>>importfile.txt

Open in new window

Hi Paul

the importfile.txt just has

ECHO is off.

in the file once it is run. No other data.
Oh, that's strange, I must have changed something prior to copy & pasting. Let me check...
matthewdacruz

In line 5, did you change 'orig.csv' to the filename of your own csv file?

Does it work if it's changed to importfile.csv? Try this:

@echo off

copy nul importfile.tmp >nul

for %%a in (importfile.csv) do (
  for /f "tokens=1* usebackq delims=," %%b in ("%%a") do (
    findstr /b /c:"%%b" /i importfile.tmp>nul || echo %%b,%%c>>importfile.tmp
  )
)

(set /p header=)<importfile.tmp
echo %header%>importfile.txt

more +1 importfile.tmp|sort>>importfile.txt

Open in new window

Incidentally, if you change the filespec 'importfile.csv' in line 5 to just *.* then it will process all your CSV files without you needing to create a 'master' importfile.csv.

In fact, if you are happy to compile your own importfile.csv and all you want to do is process that one file then the following code (taken from above) is all you really need:

@echo off

copy nul importfile.tmp >nul

for /f "tokens=1* delims=," %%b in (importfile.csv) do findstr /b /c:"%%b" /i importfile.tmp>nul || echo %%b,%%c>>importfile.tmp

(set /p header=)<importfile.tmp
echo %header%>importfile.txt

more +1 importfile.tmp|sort>>importfile.txt

Open in new window

Hi Paul
Thanks for the help

That file removed the duplicates but it has also removed the header.
Is it possible to still keep the header?

Here is what I have done.
I ran merger.bat

@echo off

copy nul importfile.tmp >nul

for %%a in (*.csv) do (
  for /f "tokens=* usebackq" %%b in ("%%a") do (
    find "%%b" /i importfile.tmp>nul || echo %%b>>importfile.tmp
  )
)

(set /p header=)<importfile.tmp
echo %header%>importfile.txt

more +1 importfile.tmp|sort>>importfile.txt

Open in new window


Then I copied and renamed importlist.txt, remaned it to importfile.csv

I then ran

@echo off

copy nul importfile.tmp >nul

for %%a in (importfile.csv) do (
  for /f "tokens=1* usebackq delims=," %%b in ("%%a") do (
    findstr /b /c:"%%b" /i importfile.tmp>nul || echo %%b,%%c>>importfile.tmp
  )
)

(set /p header=)<importfile.tmp
echo %header%>importfile.txt

more +1 importfile.tmp|sort>>importfile_cleaned.txt

Open in new window


I ran this and it removed all the duplicates perfectly but the column headers where missing.

Is it possible to merge the to files into one .bat so it creates the importlist.csv and then runs the dedupe process?

the procedure
1) merge all .csv files
2) create importlist.csv from importlist.txt
3) dedupe file
4) save as importlist_cleaned.txt.
Apologies for the dely. Was out.
1) merge all .csv files
2) create importlist.csv from importlist.txt
3) dedupe file
4) save as importlist_cleaned.txt.
Do you mean:

Step 1. merge all CSV files ---> importlist.csv
Step 2. dedupe importlist.csv ---> importlist_cleaned.txt
Do you actually need a 'importlist.txt' file?
Do you actually need the separate 'importlist.csv' file left over from Step.2?
Would you prefer the final output file named 'importlist_cleaned.csv' instead of '.txt'?
HI Paul
No I don't need the import.txt file

In an ideal situation if we can end up just with importlist_cleaned.csv whicj is deduped that would be brilliant.

Saves unnecessary files being in the folder.
Thank you for your reply. No probs...
Is this possible?
Yes of course it is... unfortunately, I am not able to respond to this question until tomorrow mid-morning....
ASKER CERTIFIED SOLUTION
Avatar of Paul Tomasi
Paul Tomasi
Flag of United Kingdom of Great Britain and Northern Ireland image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Wow that ran perfect.
Thanks for the help. This is a massive help.

Thanks Paul