How to remove duplication within a combined text file?

The following script is what I am starting with:

setlocal

REM Define file location
set File1=C:\worktemp\newfile.txt
set Workfile=C:\worktemp\_workfile_.txt

REM Error if file does not exist
if not exist "%File1%" (
  goto :next_Script
)

REM Remove duplicate lines from the file
copy NUL "%Workfile%" >NUL
for /f "tokens=* usebackq" %%B in ("%File1%") do (
  findstr /b /e /c:"%%B" /i "%Workfile%">NUL || echo.%%B>>"%Workfile%"
)
copy /y "%Workfile%" "%File1%" >NUL
if exist "%Workfile%" del "%Workfile%"

My goal is to process a file in a specific folder, e.g. Dir1=E:\Folder. Assuming it is already a combined file called newfile.txt, how can the script be modified to eliminate duplication under the following conditions?
1. Each set of data within the file starts with a line beginning with ISA*00*... and ends with a line beginning with IEA*1*... There can be several sets of data that start and end this way, one after another. The script needs to look at all the sets of data and remove any sets that are duplicates, leaving a single copy of each. The script above is good at eliminating duplicate lines, but if an identical line appears in several different sets of data it will sometimes remove that line, which is not the goal.

Summarized short form of what a file could contain..
ISA*00*...
..data here
..data here
..data here
IEA*1*....
ISA*00*...
...data here..
...data here..
...data here..
IEA*1*....
ISA*00*...
...
..
...
IEA*1*....
100questionsAsked:

becraigCommented:
If this is related to the other question, this should work without re-sorting:


Get-Content file.txt | Select-Object -Unique | out-file newfile.txt


100questionsAuthor Commented:
Thanks, however this script does not specifically compare the data from the start of an ISA line to the end of an IEA line. It simply looks for duplicate lines and removes any it finds. The problem is that exactly the same line may legitimately appear within another set of data which starts with ISA and ends with the IEA line.
Please see above, I hope this clarifies matters further.
becraigCommented:
OK, so you only want the duplicates removed if they fall between:
ISA*00*...

IEA*1*....

and any duplicates within another block
ISA*00*...

IEA*1*....

are allowed?

If so I can make some modifications and give you what you need.

100questionsAuthor Commented:
I will clarify.
Within a text file, there may be various sets of data, each of which starts with a line beginning with ISA and ends with a line beginning with IEA.
The script would need to look at every such set of data and determine whether that exact set already exists in the file. If it does, the duplicate needs to be deleted.

I will oversimplify as an example:

ISA*00*...
1...
2...
3...
IEA*1*...
ISA*00*...
1...
4...
5...
6...
2...
3...
IEA*1*...
ISA*00*...
1...
2...
3...
IEA*1*...

In the above oversimplified example, you'll notice that there are two sets of data which are exactly the same: the first set and the third set. Therefore the script should keep only one of those duplicate sets, not both.

The resulting file would look like this, again oversimplified.

ISA*00*...
1...
2...
3...
IEA*1*....
ISA*00*...
1...
4...
5...
6...
2...
3...
IEA*1*...

The constant in the text file is that each set of data begins with a line starting with ISA and ends with a line starting with IEA.
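The set-level comparison described above can also be done one line at a time, without a regular expression. The following is a minimal sketch, not something posted in the thread: `Get-UniqueBlocks` is a hypothetical helper name, and it assumes a set always begins on a line starting with ISA and ends on the next line starting with IEA. Lines outside any set are ignored.

```powershell
# Sketch only: Get-UniqueBlocks is a hypothetical helper, not part of the thread.
function Get-UniqueBlocks {
    param([string[]]$Lines)

    $seen    = New-Object 'System.Collections.Generic.HashSet[string]'
    $current = New-Object 'System.Collections.Generic.List[string]'
    $inBlock = $false

    foreach ($line in $Lines) {
        if ($line -clike 'ISA*') {
            # A new set starts here; discard any unfinished set before it
            $current.Clear()
            $inBlock = $true
        }
        if (-not $inBlock) { continue }   # skip lines outside ISA...IEA

        $current.Add($line)

        if ($line -clike 'IEA*') {
            # Set is complete: emit it only the first time this exact text is seen
            $key = $current -join "`n"
            if ($seen.Add($key)) { $current.ToArray() }
            $current.Clear()
            $inBlock = $false
        }
    }
}

# Usage with made-up sample data (first and third sets are identical)
$sample = 'ISA*00*...','1...','2...','3...','IEA*1*...',
          'ISA*00*...','1...','4...','IEA*1*...',
          'ISA*00*...','1...','2...','3...','IEA*1*...'
Get-UniqueBlocks -Lines $sample   # emits the first two sets only (9 lines)
```

Because each finished set is reduced to a single string key, a line that repeats across different sets never counts as a duplicate on its own.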
100questionsAuthor Commented:
If this cannot be done in batch, even a Powershell script would be fine.
becraigCommented:
I was waiting to see if one of the batch gurus gave you something (since batch is not my strength), I might be a little busy today but will take a few minutes to script something out.
footechCommented:
I have no idea how well this will perform as file size goes up.
$data = Get-Content yourfile.txt
[regex]::Matches(($data | Out-String),"(?s)(ISA.*?IEA.*?)\r?\n") | % {$_.groups[1]} | % {$_.value} | Group | Select -ExpandProperty Values | Out-File dedupe.txt -Encoding ASCII


becraigCommented:
It looks like foo has beat me to it, I was going down the same path but saved the input as an array then deduped the array.

$input = gc subtest.txt
$tarray = @()
# [^\r\n]* keeps the rest of the IEA line; a trailing lazy .*? would stop right after "IEA"
[regex]::Matches(($input | Out-String),"(?s)ISA.*?IEA[^\r\n]*") | % {$tarray += $_.value.trim()}
$tarray | Select-Object -Unique | out-file cleanfile.txt



As far as execution goes, I think it depends on which has less overhead, but the two scripts do pretty much the same thing.

Depending on the size of your files, you can time each one to see which runs faster.
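One way to check execution time is `Measure-Command`. The sketch below is an illustration, not something posted in the thread: `Measure-Dedupe` is a hypothetical helper that extracts the sets once with the regex from footech's script and then times the two de-duplication steps (`Group-Object` versus `Select-Object -Unique`) on the same data.

```powershell
# Sketch only: Measure-Dedupe is a hypothetical helper name.
function Measure-Dedupe {
    param([string]$Text)

    # Same pattern as footech's script: one match per ISA...IEA set
    $pattern = '(?s)(ISA.*?IEA.*?)\r?\n'
    $blocks  = [regex]::Matches($Text, $pattern) | ForEach-Object { $_.Groups[1].Value }

    [pscustomobject]@{
        Blocks         = @($blocks).Count
        GroupObjectMs  = (Measure-Command { $blocks | Group-Object | Select-Object -ExpandProperty Values }).TotalMilliseconds
        SelectUniqueMs = (Measure-Command { $blocks | Select-Object -Unique }).TotalMilliseconds
    }
}

# Usage (subtest.txt is the file name from the script above):
# Measure-Dedupe -Text (Get-Content subtest.txt | Out-String)
```

Timings from a single run can vary a lot, so it is worth running this a few times on a file of realistic size before drawing a conclusion.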

100questionsAuthor Commented:
@becraig - Thanks. Is this a batch script or Powershell?  When I enter the info in a .bat file and double click it it simply opens and closes and does nothing.
100questionsAuthor Commented:
@becraig - I see it's Powershell.  It worked well from what I can see.  Now how can I change the script so that it produces the file in the directory where the script is located, or in a certain path?  Say in c:\data\newfiles\ ?
100questionsAuthor Commented:
@footech - Thanks, your script works well too.  How can I change your script so that the new file is created where I want and not in the root?  For instance at c:\data\newfiles\ ?
becraigCommented:
Just change out-file cleanfile

To

out-file c:\folder\file.txt
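Out-File will fail if the target folder does not exist yet, so it can help to build the output path first. A minimal sketch, assuming a hypothetical helper named `Get-OutputPath`; the folder and file names in the usage lines are examples only. In PowerShell 3.0 and later, the automatic variable `$PSScriptRoot` holds the folder of the running script file, which covers the "directory where the script is located" case.

```powershell
# Sketch only: Get-OutputPath is a hypothetical helper name.
function Get-OutputPath {
    param([string]$Directory, [string]$FileName)

    # Create the folder if it is missing so Out-File has somewhere to write
    if (-not (Test-Path -Path $Directory -PathType Container)) {
        New-Item -ItemType Directory -Path $Directory -Force | Out-Null
    }
    Join-Path -Path $Directory -ChildPath $FileName
}

# Usage: write to a fixed folder...
#   ... | Out-File (Get-OutputPath 'C:\data\newfiles' 'cleanfile.txt')
# ...or next to the script itself (only when run from a saved .ps1 file):
#   ... | Out-File (Get-OutputPath $PSScriptRoot 'cleanfile.txt')
```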
100questionsAuthor Commented:
Thanks. How do you change the input path, for instance if the file I want to use is found at c:\data\input\ ?
becraigCommented:
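Change the first line as below, putting your own file name after the folder (gc needs a file, not just a directory; yourfile.txt is a placeholder).

$input = gc c:\data\input\yourfile.txt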
100questionsAuthor Commented:
Thanks.  One last question: these scripts will run and create an output file even if the input file does not exist.  I don't want the script to run if the input file does not exist.  Or, if it does run, an output file should not be created when there is no input file.
becraigCommented:
$file = "C:\path-to-file\file.txt"
# Exit before anything is written if the input file is missing
if (!(Test-Path $file -PathType Leaf)) {write-host "Input file not valid !!! ... Exiting..." -fore Red;exit}
else
{
$input = gc $file
[regex]::Matches(($input | Out-String),"(?s)(ISA.*?IEA.*?)\r?\n") | % {$_.groups[1]} | % {$_.value} | Group | Select -ExpandProperty Values | Out-File dedupe.txt -Encoding ASCII
}



In this version simply change the value for file (keep the quotes around the path):
$file = "C:\path-to-file\file.txt"
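If the script is going to be reused, the same check can be wrapped in a function that takes both paths as parameters. This is a rough sketch under assumptions, not part of the answer above: `Invoke-BlockDedupe` is a hypothetical name, it reuses the regex from the script above, and it returns `$false` without writing anything when the input file is missing or contains no ISA...IEA sets.

```powershell
# Sketch only: Invoke-BlockDedupe is a hypothetical helper name.
function Invoke-BlockDedupe {
    param([string]$InputFile, [string]$OutputFile)

    # No input file: warn and stop before any output file is created
    if (-not (Test-Path -Path $InputFile -PathType Leaf)) {
        Write-Warning "Input file not found: $InputFile"
        return $false
    }

    $text   = Get-Content -Path $InputFile | Out-String
    $blocks = [regex]::Matches($text, '(?s)(ISA.*?IEA.*?)\r?\n') |
        ForEach-Object { $_.Groups[1].Value } |
        Select-Object -Unique

    # Nothing matched: do not create an empty output file
    if (-not $blocks) { return $false }

    $blocks | Out-File -FilePath $OutputFile -Encoding ASCII
    return $true
}

# Usage (paths are examples only):
# Invoke-BlockDedupe -InputFile 'C:\data\input\yourfile.txt' -OutputFile 'C:\data\newfiles\dedupe.txt'
```

Using `return` instead of `exit` keeps the PowerShell session open when the function is called interactively.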
100questionsAuthor Commented:
Thank you.
Windows Batch