Solved

How to remove duplication within a combined text file?

Posted on 2014-09-24
Last Modified: 2014-09-26
The following script is what I am starting with:

setlocal

REM Define file location
set File1=C:\worktemp\newfile.txt
set Workfile=C:\worktemp\_workfile_.txt

REM Error if file does not exist
if not exist "%File1%" (
  goto :next_Script
)

REM Remove duplicate lines from the file
copy NUL "%Workfile%" >NUL
for /f "tokens=* usebackq" %%B in ("%File1%") do (
  findstr /b /e /c:"%%B" /i "%Workfile%">NUL || echo.%%B>>"%Workfile%"
)
copy /y "%Workfile%" "%File1%" >NUL
if exist "%Workfile%" del "%Workfile%"

My goal is to look at a file in a specific folder, e.g. Dir1=E:\Folder. Assuming it is already a combined file called newfile.txt, how can the script be modified to eliminate duplication based on the following conditions?
1. Each set of data within the file starts with a line which has the characters ISA*00* .... and ends with a line which starts with IEA*1*... There can be several sets of data which start and end this way, one after another. The script needs to look at all the sets of data and remove any sets which are duplicates, leaving one copy of each. The script above is good at eliminating duplicate lines; however, if an identical line appears in different sets of data it will sometimes remove it, which is not the goal.

Summarized short form of what a file could contain:
ISA*00*...
..data here
..data here
..data here
IEA*1*....
ISA*00*...
...data here..
...data here..
...data here..
IEA*1*....
ISA*00*...
...
..
...
IEA*1*....
Question by:100questions
17 Comments
 
LVL 28

Expert Comment

by:becraig
If this is related to the other question, this should work without resorting:


Get-Content file.txt | Select-Object -Unique | out-file newfile.txt

 

Author Comment

by:100questions
Thanks, however this script does not compare the data from the start of an ISA line to the end of an IEA line; it simply looks for duplicate lines and removes any it finds. The problem is that exactly the same line may legitimately appear within another set of data which starts with ISA and ends with the IEA line.
Please see above, I hope this clarifies matters further.
 
LVL 28

Expert Comment

by:becraig
OK, so you only want the duplicates removed if they fall between:
ISA*00*...

IEA*1*....

any duplicates within another block
ISA*00*...

IEA*1*....

are allowed?

If so I can make some modifications and give you what you need.
 

Author Comment

by:100questions
I will clarify.
Within a text file, there may be various sets of data, each starting with a line which begins with ISA and ending with a line which begins with IEA...
The script would need to look at all of these sets and determine whether an identical set of data already exists in the file. If it does, the duplicate needs to be deleted.

I will oversimplify as an example:

ISA*00*...
1...
2...
3...
IEA*1*...
ISA*00*...
1...
4...
5...
6...
2...
3...
IEA*1*...
ISA*00*...
1...
2...
3...
IEA*1*...

In the above oversimplified example, you'll notice that there are two sets of data which are exactly the same: the first set and the third set. Therefore the script should keep only one of them, not both.

The resulting file would look like this, again oversimplified.

ISA*00*...
1...
2...
3...
IEA*1*....
ISA*00*...
1...
4...
5...
6...
2...
3...
IEA*1*...

The constant in the text file is that each set of data begins with a line starting with ISA and ends with a line starting with IEA.
 

Author Comment

by:100questions
If this cannot be done in batch, even a PowerShell script would be fine.
 
LVL 28

Expert Comment

by:becraig
I was waiting to see if one of the batch gurus would give you something (since batch is not my strength). I might be a little busy today, but I will take a few minutes to script something out.
 
LVL 39

Assisted Solution

by:footech
footech earned 250 total points
I have no idea how well this will perform as file size goes up.
$data = Get-Content yourfile.txt
[regex]::Matches(($data | Out-String),"(?s)(ISA.*?IEA.*?)\r?\n") | % {$_.groups[1]} | % {$_.value} | Group | Select -ExpandProperty Values | Out-File dedupe.txt -Encoding ASCII

 
LVL 28

Accepted Solution

by:becraig
becraig earned 250 total points
It looks like footech has beaten me to it. I was going down the same path, but saved the matches in an array and then deduped the array.

$input = gc subtest.txt
$tarray = @()
# Match each set from its ISA line through the end of its IEA line
[regex]::Matches(($input | Out-String),"(?sm)^ISA.*?^IEA[^\r\n]*") | % {$tarray += $_.value.trim()}
$tarray | Select-Object  -Unique | out-file cleanfile.txt



As far as execution time goes, it depends on which approach has less overhead, but the two are pretty much the same.

Depending on the size of your files, you can time both to compare.
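
For comparison, here is a minimal sketch of a block-level dedupe that walks the file line by line instead of using a regex. It assumes every set starts with a line beginning with ISA and ends with a line beginning with IEA, as described in the question. The function name Remove-DuplicateBlock is made up for this example, and any lines that fall outside an ISA...IEA set are dropped.

```powershell
# Hypothetical helper (not a built-in cmdlet): keeps the first copy of
# every ISA...IEA block and returns the surviving lines in original order.
function Remove-DuplicateBlock {
    param([string[]]$Lines)

    $seen  = New-Object 'System.Collections.Generic.HashSet[string]'
    $block = New-Object 'System.Collections.Generic.List[string]'
    $kept  = New-Object 'System.Collections.Generic.List[string]'
    $ord   = [System.StringComparison]::Ordinal

    foreach ($line in $Lines) {
        # A line starting with ISA opens a new block; anything collected
        # outside a finished block is discarded here.
        if ($line.StartsWith('ISA', $ord)) { $block.Clear() }
        $block.Add($line)

        # A line starting with IEA closes the block.
        if ($line.StartsWith('IEA', $ord)) {
            $key = $block -join "`n"
            # HashSet.Add returns $false when this exact block was seen before.
            if ($seen.Add($key)) { $kept.AddRange($block) }
            $block.Clear()
        }
    }

    $kept.ToArray()
}
```

Usage might look like: Remove-DuplicateBlock (Get-Content subtest.txt) | Out-File cleanfile.txt. Because whole blocks are compared, a line that repeats inside different sets is left alone.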

 

Author Comment

by:100questions
@becraig - Thanks. Is this a batch script or PowerShell? When I put it in a .bat file and double-click it, a window simply opens and closes and nothing happens.
 

Author Comment

by:100questions
@becraig - I see it's PowerShell. It worked well from what I can see. Now, how can I change the script so that it writes the output file to the directory where the script is located, or to a specific path, say c:\data\newfiles\ ?
 

Author Comment

by:100questions
@footech - Thanks, your script works well too. How can I change your script so that the new file is created where I want and not in the root? For instance, at c:\data\newfiles\ ?
 
LVL 28

Expert Comment

by:becraig
Just change out-file cleanfile.txt

to

out-file c:\folder\file.txt
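
To write next to the script rather than to a fixed folder, $PSScriptRoot (available inside script files from PowerShell 3.0 onward) holds the script's own directory. A small sketch, assuming you want the target folder created when it is missing; Get-OutputPath is a made-up helper name, not a built-in cmdlet, and cleanfile.txt is just the default file name used earlier in this thread:

```powershell
# Hypothetical helper: returns the full output path and creates the
# folder first if it does not exist yet.
function Get-OutputPath {
    param(
        [string]$Folder,
        [string]$Name = 'cleanfile.txt'
    )

    if (-not (Test-Path -LiteralPath $Folder)) {
        # -Force also creates any missing parent folders
        New-Item -ItemType Directory -Path $Folder -Force | Out-Null
    }

    Join-Path $Folder $Name
}
```

The last line of the script could then end in out-file (Get-OutputPath -Folder 'c:\data\newfiles'), or use -Folder $PSScriptRoot to write beside the script.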
 

Author Comment

by:100questions
Thanks. How do you change the input path, for instance if the file I want to use is found at c:\data\input\ ?
 
LVL 28

Expert Comment

by:becraig
Change the first line as below, putting your file name after the folder (subtest.txt here, as in my script above).

$input = gc c:\data\input\subtest.txt
 

Author Comment

by:100questions
Thanks. One last question: these scripts will run and create an output file even if the input file does not exist. I don't want the script to run if the input file does not exist. Or, if it runs, an output file should not be created if an input file does not exist.
 
LVL 28

Expert Comment

by:becraig
$file = "C:\path-to-file\file.txt"
# Stop before any output file is created if the input file is missing
if (!(Test-Path $file)) {write-host "Input file not valid !!! ... Exiting..." -fore Red;exit}
else
{
# Read the file, then keep one copy of each ISA...IEA set
$input = gc $file
[regex]::Matches(($input | Out-String),"(?s)(ISA.*?IEA.*?)\r?\n") | % {$_.groups[1]} | % {$_.value} | Group | Select -ExpandProperty Values | Out-File dedupe.txt -Encoding ASCII
}



In this version, simply change the value of $file:
$file = "C:\path-to-file\file.txt"
 

Author Comment

by:100questions
Thank you.
