Solved

How to remove duplication within a combined text file?

Posted on 2014-09-24
Last Modified: 2014-09-26
The following script is what I am starting with:

setlocal

REM Define file location
set File1=C:\worktemp\newfile.txt
set Workfile=C:\worktemp\_workfile_.txt

REM Error if file does not exist
if not exist "%File1%" (
  goto :next_Script
)

REM Remove duplicate lines from the file
copy NUL "%Workfile%" >NUL
for /f "tokens=* usebackq" %%B in ("%File1%") do (
  findstr /b /e /c:"%%B" /i "%Workfile%">NUL || echo.%%B>>"%Workfile%"
)
copy /y "%Workfile%" "%File1%" >NUL
if exist "%Workfile%" del "%Workfile%"

REM Jump target used when the input file does not exist
:next_Script

My goal is to process a file in a specific folder, e.g. Dir1=E:\Folder. Assuming it is already a combined file called newfile.txt, how can the script be modified to eliminate duplication under the following condition?
1. Each set of data within the file starts with a line beginning with the characters ISA*00*... and ends with a line beginning with IEA*1*... There can be several such sets of data, one after another. The script needs to look at all the sets of data and remove any set that duplicates another, leaving exactly one copy of each. The script above is good at eliminating duplicate lines, but if an identical line appears in different sets of data it will sometimes remove it, which is not the goal.

Summarized short form of what a file could contain..
ISA*00*...
..data here
..data here
..data here
IEA*1*....
ISA*00*...
...data here..
...data here..
...data here..
IEA*1*....
ISA*00*...
...
..
...
IEA*1*....
Question by:100questions
17 Comments
 
LVL 29

Expert Comment

by:becraig
ID: 40342157
If this is related to the other question, this should work without re-sorting:


Get-Content file.txt | Select-Object -Unique | out-file newfile.txt


Author Comment

by:100questions
ID: 40342240
Thanks, but this script does not compare the data from an ISA line through to the matching IEA line; it simply looks for duplicate lines and removes any it finds. The problem is that exactly the same line may legitimately appear in another set of data that starts with ISA and ends with IEA.
Please see above; I hope this clarifies matters further.
LVL 29

Expert Comment

by:becraig
ID: 40342257
OK, so you only want the duplicates removed if they fall within the same block:
ISA*00*...

IEA*1*....

and any duplicates that appear in another block
ISA*00*...

IEA*1*....

are allowed?

If so I can make some modifications and give you what you need.

Author Comment

by:100questions
ID: 40342291
I will clarify.
Within a text file, there may be several sets of data, each starting with a line that begins with ISA and ending with a line that begins with IEA.
The script needs to look at every such set and determine whether that exact set of data already exists in the file. If it does, the duplicate needs to be deleted.

I will oversimplify as an example:

ISA*00*...
1...
2...
3...
IEA*1*...
ISA*00*...
1...
4...
5...
6...
2...
3...
IEA*1*...
ISA*00*...
1...
2...
3...
IEA*1*...

In the above oversimplified example, you'll notice that there are two sets of data which are exactly the same: the first set and the third set. Therefore the script should keep only one of them, not both.

The resulting file would look like this, again oversimplified.

ISA*00*...
1...
2...
3...
IEA*1*....
ISA*00*...
1...
4...
5...
6...
2...
3...
IEA*1*...

The constant in the text file is that each set of data begins with a line starting with ISA and ends with a line starting with IEA.
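
The requirement described above can be sketched in PowerShell as a line-by-line pass that collects each ISA...IEA set and keeps it only the first time that exact text is seen. This is a minimal sketch, not a tested production script: the function name Remove-DuplicateBlock is hypothetical, and it assumes every set starts with a line beginning with ISA and ends with a line beginning with IEA, as stated above. Lines outside any set are ignored.

```powershell
# Sketch: keep the first copy of each ISA...IEA set, drop later exact copies.
# Remove-DuplicateBlock is a hypothetical helper name, not part of the thread.
function Remove-DuplicateBlock {
    param([string[]]$Lines)

    $seen    = New-Object 'System.Collections.Generic.HashSet[string]'
    $block   = New-Object 'System.Collections.Generic.List[string]'
    $inBlock = $false

    foreach ($line in $Lines) {
        if ($line.StartsWith('ISA')) {
            # A new set of data begins; start collecting from scratch.
            $block.Clear()
            $inBlock = $true
        }
        if (-not $inBlock) { continue }   # ignore anything outside a set

        $block.Add($line)

        if ($line.StartsWith('IEA')) {
            # The set is complete; emit its lines only if this exact text is new.
            $key = $block -join "`n"
            if ($seen.Add($key)) { $block.ToArray() }
            $inBlock = $false
        }
    }
}

# Demo with the oversimplified data from above: the third set repeats the first.
$sample = 'ISA*00*...', '1...', '2...', '3...', 'IEA*1*...',
          'ISA*00*...', '1...', '4...', '5...', '6...', '2...', '3...', 'IEA*1*...',
          'ISA*00*...', '1...', '2...', '3...', 'IEA*1*...'
$result = @(Remove-DuplicateBlock -Lines $sample)
$result.Count   # 13 lines: the first two sets, with the repeated third set dropped
```

For a real file, the lines would come from Get-Content and the result could be piped to Set-Content or Out-File. Because whole sets are compared, a line that repeats inside a different set is left alone.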

Author Comment

by:100questions
ID: 40344141
If this cannot be done in batch, even a Powershell script would be fine.
LVL 29

Expert Comment

by:becraig
ID: 40344293
I was waiting to see if one of the batch gurus gave you something (since batch is not my strength), I might be a little busy today but will take a few minutes to script something out.
LVL 39

Assisted Solution

by:footech
footech earned 250 total points
ID: 40345238
I have no idea how well this will perform as file size goes up.
$data = Get-Content yourfile.txt
[regex]::Matches(($data | Out-String),"(?s)(ISA.*?IEA.*?)\r?\n") | % {$_.groups[1]} | % {$_.value} | Group | Select -ExpandProperty Values | Out-File dedupe.txt -Encoding ASCII

LVL 29

Accepted Solution

by:
becraig earned 250 total points
ID: 40345375
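It looks like foo has beaten me to it. I was going down the same path, but saved the matches in an array and then deduped the array.

$input = gc subtest.txt
$tarray = @()
# [^\r\n]* keeps the rest of the IEA line; a trailing lazy .*? matches nothing and would cut the line off after "IEA"
[regex]::Matches(($input | Out-String),"(?s)ISA.*?IEA[^\r\n]*") | % {$tarray += $_.value.trim()}
# Keep the first copy of each block
$tarray | Select-Object  -Unique | out-file cleanfile.txt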


As far as execution time goes, it depends on which approach has less overhead, but I think the two are pretty much the same.

Depending on the size of your files, you can time both versions to see which runs faster.
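
One way to check execution time is Measure-Command, which runs a script block and returns the elapsed time. The sketch below is only an illustration: it builds a small synthetic sample in the temp folder (the file name dedupe-timing-sample.txt and the sample contents are made up), then times a Group-Object variant and a Select-Object -Unique variant of the same regex approach. Real timings will depend on the actual files.

```powershell
# Sketch: compare two dedupe variants on synthetic data with Measure-Command.
# The sample file name and contents are assumptions for illustration only.
$path  = Join-Path ([IO.Path]::GetTempPath()) 'dedupe-timing-sample.txt'
$block = 'ISA*00*...', '1...', '2...', 'IEA*1*...'
Set-Content -Path $path -Value ($block * 500)   # 500 identical sets

$text    = Get-Content $path | Out-String
$pattern = '(?s)ISA.*?IEA[^\r\n]*'

# Variant 1: group identical sets and take one name per group.
$tGroup = Measure-Command {
    [regex]::Matches($text, $pattern) | ForEach-Object { $_.Value } |
        Group-Object | ForEach-Object { $_.Name } | Out-Null
}

# Variant 2: let Select-Object drop the repeats.
$tUnique = Measure-Command {
    [regex]::Matches($text, $pattern) | ForEach-Object { $_.Value } |
        Select-Object -Unique | Out-Null
}

'Group-Object:          {0:N0} ms' -f $tGroup.TotalMilliseconds
'Select-Object -Unique: {0:N0} ms' -f $tUnique.TotalMilliseconds

Remove-Item $path
```

Running each variant several times and on a copy of a real file gives a fairer picture than a single run.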

Author Comment

by:100questions
ID: 40345903
@becraig - Thanks. Is this a batch script or Powershell?  When I enter the info in a .bat file and double click it it simply opens and closes and does nothing.

Author Comment

by:100questions
ID: 40345910
@becraig - I see it's PowerShell.  It worked well from what I can see.  Now how can I change the script so that it produces the file in the directory where the script is located, or in a specific path, say c:\data\newfiles\ ?

Author Comment

by:100questions
ID: 40345923
@footech - Thanks, your script works well too.  How can I change your script so that the new file is created where I want and not in the root?  For instance at c:\data\newfiles\ ?
LVL 29

Expert Comment

by:becraig
ID: 40346212
Just change out-file cleanfile.txt

To

out-file c:\folder\file.txt
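
If the target folder might not exist yet, Out-File will fail rather than create it, so it can help to build the path with Join-Path and create the folder first. The following is a rough sketch under that assumption; Save-DedupedFile is a hypothetical helper name, and the demo writes to a made-up folder under the temp directory rather than to c:\data\newfiles.

```powershell
# Sketch: write lines to <Directory>\<Name>, creating the folder if needed.
# Save-DedupedFile is a hypothetical helper, not part of the scripts above.
function Save-DedupedFile {
    param(
        [string[]]$Lines,
        [string]$Directory,
        [string]$Name
    )
    # Out-File does not create missing folders, so make sure the folder exists.
    if (-not (Test-Path $Directory -PathType Container)) {
        New-Item -ItemType Directory -Path $Directory | Out-Null
    }
    $target = Join-Path $Directory $Name
    $Lines | Out-File $target -Encoding ASCII
    $target   # return the full path that was written
}

# Demo in a throwaway folder under the temp directory (assumed name).
$demoDir = Join-Path ([IO.Path]::GetTempPath()) 'newfiles-demo'
$written = Save-DedupedFile -Lines 'ISA*00*...', 'IEA*1*...' -Directory $demoDir -Name 'cleanfile.txt'
$written
```

In the case discussed here, -Directory would be c:\data\newfiles. To write next to the script itself, $PSScriptRoot (available inside script files from PowerShell 3.0) holds the script's folder and can be passed as the directory.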

Author Comment

by:100questions
ID: 40346341
Thanks. How do you change the input path, for instance if the file I want to use is found in c:\data\input\ ?
LVL 29

Expert Comment

by:becraig
ID: 40346393
Change the first line as below, giving the full path including the file name (the folder alone is not enough).

$input = gc c:\data\input\subtest.txt

Author Comment

by:100questions
ID: 40346406
Thanks.  One last question: these scripts will run and create a file even if the input file does not exist.  I don't want the script to run if the input file does not exist.  Or, if it runs, an output file should not be created when there is no input file.
LVL 29

Expert Comment

by:becraig
ID: 40346420
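$file = "C:\path-to-file\file.txt"
# Stop before any output is written if the input file is missing
if (!(Test-Path $file)) {
    write-host "Input file not valid !!! ... Exiting..." -fore Red
    exit
}
# Read the whole file as an array of lines
$lines = gc $file
# Capture each ISA...IEA block up to the end of the IEA line, then keep one copy of each
[regex]::Matches(($lines | Out-String),"(?s)(ISA.*?IEA.*?)\r?\n") | % {$_.groups[1]} | % {$_.value} | Group | Select -ExpandProperty Values | Out-File dedupe.txt -Encoding ASCII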


In this version, simply change the value of $file:
$file = "C:\path-to-file\file.txt"
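
One detail worth knowing about the check: Test-Path on its own also returns True for a folder, so a path such as c:\data\input\ would pass it. Adding -PathType Leaf restricts the check to files. The sketch below only demonstrates that difference using the temp folder; the file name no-such-input-file.txt is made up and assumed not to exist.

```powershell
# Sketch: Test-Path with and without -PathType Leaf.
$folder  = [IO.Path]::GetTempPath()
$missing = Join-Path $folder 'no-such-input-file.txt'   # assumed not to exist

$folderPlain = Test-Path $folder                   # True: the folder exists
$folderLeaf  = Test-Path $folder -PathType Leaf    # False: a folder is not a file
$missingLeaf = Test-Path $missing -PathType Leaf   # False: nothing at that path

if (-not $missingLeaf) {
    Write-Host "Input file not valid: $missing" -ForegroundColor Red
}
```

In the script above, the guard could therefore be written with Test-Path $file -PathType Leaf so that a folder path is rejected as well.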

Author Comment

by:100questions
ID: 40346593
Thank you.
Question has a verified solution.
