How to remove duplication within a combined text file?

The following script is what I am starting with:

setlocal

REM Define file location
set File1=C:\worktemp\newfile.txt
set Workfile=C:\worktemp\_workfile_.txt

REM Skip ahead if the file does not exist
if not exist "%File1%" (
goto :next_Script
)

REM Remove duplicate lines from the file
copy NUL "%Workfile%" >NUL
for /f "tokens=* usebackq" %%B in ("%File1%") do (
findstr /b /e /c:"%%B" /i "%Workfile%">NUL || echo.%%B>>"%Workfile%"
)
copy /y "%Workfile%" "%File1%" >NUL
if exist "%Workfile%" del "%Workfile%"

My goal is to work on a file in a specific folder, e.g. Dir1=E:\Folder. Assuming it is already a combined file called newfile.txt, how can the script be modified to eliminate duplication under the following condition?
Each set of data within the file starts with a line beginning with ISA*00* ... and ends with a line beginning with IEA*1* ... There can be several such sets, one after another. The script needs to look at all the sets and remove any that are duplicates, leaving one copy of each. The script above is good at eliminating duplicate lines, but if an identical line appears in more than one set of data it sometimes removes that line too, which is not the goal.

A shortened example of what a file could contain:
ISA*00*...
..data here
..data here
..data here
IEA*1*....
ISA*00*...
...data here..
...data here..
...data here..
IEA*1*....
ISA*00*...
...
..
...
IEA*1*....

becraig

If this is related to the other question, this should work without resorting:

Get-Content file.txt | Select-Object -Unique | out-file newfile.txt

E=mc2

ASKER

Thanks, however this script does not compare the data from the start of an ISA line to the end of an IEA line. It simply looks for duplicate lines and removes any it finds. The problem is that exactly the same line may legitimately appear in another set of data that starts with ISA and ends with IEA.
Please see above; I hope this clarifies matters further.
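
To illustrate the problem, here is a small sketch with made-up sample lines (not taken from a real file). Two different sets share their ISA line, their IEA line, and the line "1", so line-level de-duplication strips those lines from the second set:

```powershell
# Made-up sample data: two DIFFERENT sets that happen to share some lines
$lines = 'ISA*00*', '1', '2', 'IEA*1*',
         'ISA*00*', '1', '4', 'IEA*1*'

# Line-level de-duplication keeps only the first occurrence of each line
$unique = @($lines | Select-Object -Unique)
$unique
# Output: ISA*00*, 1, 2, IEA*1*, 4
# The second set has lost its ISA line, its "1" line and its IEA line.
```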

becraig

OK, so you only want the duplicates removed if they fall between:
ISA*00*...

IEA*1*....

any duplicates within another block
ISA*00*...

IEA*1*....

are allowed?

If so I can make some modifications and give you what you need.

E=mc2

ASKER

I will clarify.
Within a text file, there may be several sets of data, each starting with a line that begins with ISA and ending with a line that begins with IEA.
The script needs to look at every such set and determine whether an identical set already exists in the file. If it does, the duplicate needs to be deleted.

I will oversimplify as an example:

ISA*00*...
1...
2...
3...
IEA*1*...
ISA*00*...
1...
4...
5...
6...
2...
3...
IEA*1*...
ISA*00*...
1...
2...
3...
IEA*1*...

In the above oversimplified example, you'll notice that there are two sets of data which are exactly the same: the first set and the third set. Therefore the script should keep only one of those two sets, not both.

The resulting file would look like this, again oversimplified.

ISA*00*...
1...
2...
3...
IEA*1*...
ISA*00*...
1...
4...
5...
6...
2...
3...
IEA*1*...

The constant in the text file is that each set of data begins with a line starting with ISA and ends with a line starting with IEA.
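
The rule described above (compare whole ISA-to-IEA sets and keep the first copy of each) could be sketched in PowerShell roughly as follows. The function name Remove-DuplicateBlock and the output file name deduped.txt are hypothetical. The sketch assumes every set starts with a line beginning with ISA and ends with a line beginning with IEA; any lines outside such a pair are dropped.

```powershell
# Sketch only: Remove-DuplicateBlock is a hypothetical helper name.
function Remove-DuplicateBlock {
    param([string[]]$Lines)

    $seen   = New-Object 'System.Collections.Generic.HashSet[string]'  # sets already kept
    $block  = New-Object 'System.Collections.Generic.List[string]'     # set being collected
    $result = New-Object 'System.Collections.Generic.List[string]'     # lines to keep

    foreach ($line in $Lines) {
        # A line beginning with ISA starts a new set
        if ($line.StartsWith('ISA', [System.StringComparison]::Ordinal)) { $block.Clear() }
        $block.Add($line)

        # A line beginning with IEA closes the set; keep it only if it is new
        if ($line.StartsWith('IEA', [System.StringComparison]::Ordinal)) {
            $key = $block -join "`n"
            if ($seen.Add($key)) { $result.AddRange($block) }
            $block.Clear()
        }
    }
    return $result
}
```

It might be used like this (the output path is an assumption):

```powershell
$lines = Get-Content 'C:\worktemp\newfile.txt'
Remove-DuplicateBlock -Lines $lines | Set-Content 'C:\worktemp\deduped.txt'
```

Tracking whole sets in a HashSet keeps the first copy of each set in its original position, and identical lines in different sets are left alone because only complete sets are compared.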

E=mc2

ASKER

If this cannot be done in batch, even a PowerShell script would be fine.

becraig

I was waiting to see whether one of the batch gurus would give you something (batch is not my strength). I might be a little busy today, but I will take a few minutes to script something out.

SOLUTION

footech

This solution is only available to members.

ASKER CERTIFIED SOLUTION

becraig

This solution is only available to members.

E=mc2

ASKER

@becraig - Thanks. Is this a batch script or PowerShell? When I put it in a .bat file and double-click it, it simply opens and closes and does nothing.

E=mc2

ASKER

@becraig - I see it's PowerShell. It worked well from what I can see. Now, how can I change the script so that it writes the file to the directory where the script is located, or to a specific path, say c:\data\newfiles\ ?

E=mc2

ASKER

@footech - Thanks, your script works well too. How can I change your script so that the new file is created where I want and not in the root? For instance, at c:\data\newfiles\ ?

becraig

Just change out-file cleanfile

To

out-file c:\folder\file.txt
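
If the target folder might not exist yet, or the file should land next to the script itself, something along these lines may help. It is only a sketch: the file name dedupe.txt and the sample text are placeholders, and $PSScriptRoot assumes PowerShell 3.0 or later running from a saved script.

```powershell
# Target folder from the question; adjust as needed
$outDir = 'C:\data\newfiles'
# Alternatively, to write beside the script itself (PowerShell 3.0 or later):
# $outDir = $PSScriptRoot

# Out-File does not create missing parent folders, so create the folder first
if (!(Test-Path $outDir)) { New-Item -ItemType Directory -Path $outDir | Out-Null }

# Build the full output path and write to it
$outFile = Join-Path $outDir 'dedupe.txt'
'sample line' | Out-File $outFile -Encoding ASCII   # stand-in for the real pipeline output
```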

E=mc2

ASKER

Thanks. How do I change the input path, for instance if the file I want to use is found at c:\data\input\ ?

becraig

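Change the first line as below, giving the full path including the file name (file.txt here is a placeholder for your own file):

$input = gc c:\data\input\file.txt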

E=mc2

ASKER

Thanks. One last question: these scripts will run and create an output file even if the input file does not exist. I don't want the script to run if the input file does not exist, or, if it does run, it should not create an output file.

becraig

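$file = "C:\path-to-file\file.txt"
# Stop before producing any output if the input file is missing
if (!(Test-Path $file)) {write-host "Input file not valid !!! ... Exiting..." -fore Red;exit}
else
{
# Read the file and join it into one string so the regex can span lines
# ($lines is used because $input is an automatic variable in PowerShell)
$lines = gc $file
# Capture each ISA...IEA block, then keep one copy of each distinct block
[regex]::Matches(($lines | Out-String),"(?s)(ISA.*?IEA.*?)\r?\n") | % {$_.groups[1]} | % {$_.value} | Group | Select -ExpandProperty Values | Out-File dedupe.txt -Encoding ASCII
}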

In this version simply change the value for file:
$file = "C:\path-to-file\file.txt"

E=mc2

ASKER

Thank you.