Solved

How to remove duplication within a combined text file?

Posted on 2014-09-24
Last Modified: 2014-09-26
The following script is what I am starting with:

setlocal

REM Define file location
set File1=C:\worktemp\newfile.txt
set Workfile=C:\worktemp\_workfile_.txt

REM Error if file does not exist
if not exist "%File1%" (
  goto :next_Script
)

REM Remove duplicate lines from the file (compares whole lines, ignoring case)
copy NUL "%Workfile%" >NUL
for /f "tokens=* usebackq" %%B in ("%File1%") do (
  findstr /b /e /c:"%%B" /i "%Workfile%">NUL || echo.%%B>>"%Workfile%"
)
copy /y "%Workfile%" "%File1%" >NUL
if exist "%Workfile%" del "%Workfile%"

My goal is to process a file in a specific folder, e.g. Dir1=E:\Folder. Assuming it is already a combined file called newfile.txt, how can the script be modified to eliminate duplication based on the following condition?
1. Each set of data within the file starts with a line beginning with the characters ISA*00*... and ends with a line beginning with IEA*1*... There can be several sets of data which start and end this way, one after another. The script needs to look at all the sets of data and remove any sets which are duplicates, leaving one copy of each. The script above is good at eliminating duplicate lines; however, if exactly the same line appears in different sets of data, it sometimes removes that line, which is not the goal.

Summarized short form of what a file could contain:
ISA*00*...
..data here
..data here
..data here
IEA*1*....
ISA*00*...
...data here..
...data here..
...data here..
IEA*1*....
ISA*00*...
...
..
...
IEA*1*....
Question by:100questions
17 Comments
 
LVL 29

Expert Comment

by:becraig
ID: 40342157
If this is related to the other question, this should work without resorting:


Get-Content file.txt | Select-Object -Unique | out-file newfile.txt

 

Author Comment

by:100questions
ID: 40342240
Thanks, however this script does not specifically compare the data from the start of an ISA line to the end of an IEA line. It simply looks for duplicate lines and removes any it finds. The problem is that exactly the same line may legitimately appear within another set of data which starts with an ISA line and ends with an IEA line.
Please see above, I hope this clarifies matters further.
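
To illustrate the difference described here, a minimal sketch with made-up sample data (the `A`/`B` suffixes and the data lines are invented for illustration, not taken from a real file). Two distinct sets share the lines `1...` and `2...`, and a line-level dedupe drops the second copies even though neither set is a duplicate of the other.

```powershell
# Made-up sample: two different sets that happen to share two data lines
$lines = @(
    'ISA*00*A', '1...', '2...', 'IEA*1*A',
    'ISA*00*B', '1...', '2...', 'IEA*1*B'
)

# Line-level dedupe keeps only the first occurrence of each distinct line
$lineLevel = @($lines | Select-Object -Unique)

# The second set has lost its '1...' and '2...' lines: 6 of the 8 lines remain
"{0} of {1} lines remain" -f $lineLevel.Count, $lines.Count
```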
 
LVL 29

Expert Comment

by:becraig
ID: 40342257
OK, so you only want the duplicates removed if they fall between:
ISA*00*...

IEA*1*....

any duplicates within another block
ISA*00*...

IEA*1*....

are allowed?

If so I can make some modifications and give you what you need.

Author Comment

by:100questions
ID: 40342291
I will clarify.
Within a text file, there may be various sets of data, each starting with a line that begins with ISA and ending with a line that begins with IEA.
The script needs to look at every such set and determine whether an identical set already exists in the file. If it does, the duplicate needs to be deleted.

I will oversimplify as an example:

ISA*00*...
1...
2...
3...
IEA*1*...
ISA*00*...
1...
4...
5...
6...
2...
3...
IEA*1*...
ISA*00*...
1...
2...
3...
IEA*1*...

In the above oversimplified example, you'll notice that there are two sets of data which are exactly the same: the first set and the third set. Therefore the script should keep only one of the duplicate sets, not both.

The resulting file would look like this, again oversimplified.

ISA*00*...
1...
2...
3...
IEA*1*....
ISA*00*...
1...
4...
5...
6...
2...
3...
IEA*1*...

The constant in the text file is that each set of data is delimited by a line starting with ISA and a line starting with IEA.
 

Author Comment

by:100questions
ID: 40344141
If this cannot be done in batch, even a Powershell script would be fine.
 
LVL 29

Expert Comment

by:becraig
ID: 40344293
I was waiting to see if one of the batch gurus would give you something (batch is not my strength). I might be a little busy today, but I will take a few minutes to script something out.
 
LVL 40

Assisted Solution

by:footech
footech earned 250 total points
ID: 40345238
I have no idea how well this will perform as file size goes up.
$data = Get-Content yourfile.txt
[regex]::Matches(($data | Out-String),"(?s)(ISA.*?IEA.*?)\r?\n") | % {$_.groups[1]} | % {$_.value} | Group | Select -ExpandProperty Values | Out-File dedupe.txt -Encoding ASCII
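
On the concern about larger files, here is a sketch of a line-by-line alternative that avoids building one large string for the regex. This is not footech's method; the function name `Remove-DuplicateSet` and its parameter are hypothetical. It assumes every set starts on a line beginning with ISA and ends on a line beginning with IEA, and, like the regex, it drops any lines that fall outside a set.

```powershell
function Remove-DuplicateSet {
    param([string[]]$Lines)

    # Sets already written out, keyed by their full text (case-sensitive)
    $seen  = New-Object 'System.Collections.Generic.HashSet[string]'
    # Lines of the set currently being read
    $block = New-Object 'System.Collections.Generic.List[string]'
    $inSet = $false

    foreach ($line in $Lines) {
        if ($line.StartsWith('ISA')) {
            # A new set begins; an unfinished earlier set is discarded
            $block.Clear()
            $inSet = $true
        }
        if ($inSet) { $block.Add($line) }
        if ($inSet -and $line.StartsWith('IEA')) {
            # HashSet.Add returns $true only the first time a set is seen
            if ($seen.Add(($block -join "`n"))) { $block.ToArray() }
            $inSet = $false
        }
    }
}

# Example use (file names as in the post above):
# Remove-DuplicateSet -Lines (Get-Content yourfile.txt) | Out-File dedupe.txt -Encoding ASCII
```

Each set is compared once against a hash set, so the work grows roughly in proportion to the number of lines.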

 
LVL 29

Accepted Solution

by:becraig
becraig earned 250 total points
ID: 40345375
It looks like footech has beaten me to it. I was going down the same path, but saved the input as an array and then deduped the array.

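$input = gc subtest.txt
$tarray = @()
# Match from ISA through the end of the IEA line (a trailing lazy .*? would stop right after IEA)
[regex]::Matches(($input | Out-String),"(?s)ISA.*?IEA[^\r\n]*") | % {$tarray += $_.value.trim()}
# Keep the first copy of each distinct set
$tarray | Select-Object  -Unique | out-file cleanfile.txt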

As far as execution goes, it depends on which has less overhead, but they are both pretty much the same.

Depending on the size of your files, you can time either one to compare execution.
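
One way to compare the two on your own files is `Measure-Command`. A small sketch follows; the helper name `Measure-AverageMs` and the default run count are assumptions for illustration, and the commented calls assume the two pipelines have been saved under the hypothetical script names shown.

```powershell
function Measure-AverageMs {
    param(
        [scriptblock]$Script,
        [ValidateRange(1, 1000)][int]$Runs = 3
    )
    $total = 0.0
    for ($i = 0; $i -lt $Runs; $i++) {
        # Measure-Command returns a TimeSpan and discards the script's own output
        $total += (Measure-Command -Expression $Script).TotalMilliseconds
    }
    # Average wall-clock time per run, in milliseconds
    $total / $Runs
}

# Hypothetical script names, one per approach:
# Measure-AverageMs -Script { .\footech.ps1 }
# Measure-AverageMs -Script { .\becraig.ps1 }
```

Averaging a few runs smooths out the first-run cost of loading the file from disk.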
 

Author Comment

by:100questions
ID: 40345903
@becraig - Thanks. Is this a batch script or PowerShell? When I put it in a .bat file and double-click it, a window simply opens and closes and nothing happens.
 

Author Comment

by:100questions
ID: 40345910
@becraig - I see it's PowerShell. It worked well from what I can see. Now how can I change the script so that it produces the file in the directory where the script is located, or in a certain path, say c:\data\newfiles\ ?
 

Author Comment

by:100questions
ID: 40345923
@footech - Thanks, your script works well too. How can I change your script so that the new file is created where I want and not in the root, for instance at c:\data\newfiles\ ?
 
LVL 29

Expert Comment

by:becraig
ID: 40346212
Just change out-file cleanfile.txt

To

out-file c:\folder\file.txt
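
The other half of the question, writing next to the script itself, can be sketched with `$PSScriptRoot`, which holds the script's folder when the code runs from a saved .ps1 file in PowerShell 3.0 or later. The helper name `Get-OutputPath` is hypothetical; it falls back to the current location when `$PSScriptRoot` is empty, for example when the code is pasted into the console.

```powershell
function Get-OutputPath {
    param(
        [string]$FileName,
        [string]$Folder = $PSScriptRoot
    )
    # $PSScriptRoot is empty outside a saved script; use the current folder then
    if ([string]::IsNullOrEmpty($Folder)) { $Folder = (Get-Location).Path }
    [System.IO.Path]::Combine($Folder, $FileName)
}

# Next to the script:
# $tarray | Select-Object -Unique | Out-File (Get-OutputPath 'cleanfile.txt')
# In a fixed folder (the folder must already exist):
# $tarray | Select-Object -Unique | Out-File (Get-OutputPath 'cleanfile.txt' 'c:\data\newfiles')
```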
 

Author Comment

by:100questions
ID: 40346341
Thanks. How do you change the input path, for instance if the file I want to use is found at c:\data\input\ ?
 
LVL 29

Expert Comment

by:becraig
ID: 40346393
Change the first line as below, with your file name after the folder (subtest.txt is the name used in my script above).

$input = gc c:\data\input\subtest.txt
 

Author Comment

by:100questions
ID: 40346406
Thanks. One last question: these scripts will run and create an output file even if the input file does not exist. I don't want the script to run if the input file does not exist, or if it does run, an output file should not be created.
 
LVL 29

Expert Comment

by:becraig
ID: 40346420
$file = "C:\path-to-file\file.txt"
# Stop before any output file is created when the input file is missing
if (!(Test-Path $file)) {write-host "Input file not valid !!! ... Exiting..." -fore Red;exit}
else
{
# Read the file, pull out each ISA...IEA set, and write one copy of each distinct set
$input = gc $file
[regex]::Matches(($input | Out-String),"(?s)(ISA.*?IEA.*?)\r?\n") | % {$_.groups[1]} | % {$_.value} | Group | Select -ExpandProperty Values | Out-File dedupe.txt -Encoding ASCII
}

In this version simply change the value for $file:
$file = "C:\path-to-file\file.txt"
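
A slightly stricter variant of the check, as a sketch: `Test-Path` on its own also returns true for a folder, so `-PathType Leaf` can be added to require an actual file. The function name `Test-InputFile` is hypothetical.

```powershell
function Test-InputFile {
    param([string]$Path)
    # Leaf = a file; a folder such as c:\data\input\ fails this test
    (-not [string]::IsNullOrEmpty($Path)) -and (Test-Path -LiteralPath $Path -PathType Leaf)
}

# Possible use in place of the plain Test-Path check:
# if (-not (Test-InputFile $file)) { Write-Host "Input file not valid" -ForegroundColor Red; exit }
```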
 

Author Comment

by:100questions
ID: 40346593
Thank you.
