batch file(s) to manipulate text file 2 different ways

I have a text file which has duplicate lines... I need to remove any dupes and leave just one of them. The file also needs  2 different strings of words replaced with different strings of words IF either of the 2 original strings exist.

Is this even possible via Windows batch?
LVL 1
BobbyAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

NVITEnd-user supportCommented:
If you post a real sample containing several records each, it should expedite a solution.
BobbyAuthor Commented:
It s a pretty big file with lots of sensitive info, so maybe this explanation will help:

The line that's duplicated, sometimes twice, is:
ACCNT Accounts Receivable - KM         OCCASET

There are no numbers attached to it or anything, that’s the entire line.

The other things are text strings that need to be replaced… if the file says “Inventory Dropship” I have to change that to “Cost Of Goods Drop Ship”, and if it says “Inventory Consignment(Problem)” I have to change that to ”Cost Of Goods Sold-Consignment”.
Joe Winograd, Fellow&MVEDeveloperCommented:
Hi Bobby,
The line that's duplicated, sometimes twice, is: ACCNT Accounts Receivable - KM OCCASET
Do you mean that it's duplicated on consecutive lines or that the duplicates can be anywhere in the file? For example, does the input file look like this:

ACCNT Accounts Receivable - KM         OCCASET
ACCNT Accounts Receivable - KM         OCCASET
another line
and another line
Inventory Dropship
another one
ACCNT Accounts Receivable - KM         OCCASET
ACCNT Accounts Receivable - KM         OCCASET
Inventory Consignment(Problem)

Or like this:

ACCNT Accounts Receivable - KM         OCCASET
another line
and another line
ACCNT Accounts Receivable - KM         OCCASET
ACCNT Accounts Receivable - KM         OCCASET
Inventory Dropship
another one
ACCNT Accounts Receivable - KM         OCCASET
Inventory Consignment(Problem)
ACCNT Accounts Receivable - KM         OCCASET

If the former, it's clear what you want, i.e., for each pair of duplicate lines, remove one of them. But if it's the latter, what should the output file look like? Regards, Joe
Maximize Customer Retention with Superior Service

The IT Service Excellence Tool Kit delivers expert advice for technology solution providers. Get your free copy for valuable how-to assets including sample agreements, checklists, flowcharts, and more to help build customer satisfaction and retention.

Qlemo"Batchelor", Developer and EE Topic AdvisorCommented:
In case duplicates are not on consecutive lines, does sequence of lines matter?
BobbyAuthor Commented:
Joe, it's the second one, where the duplicate line can be a few lines away from the other. The file should look the exact same as it does originally, except any duplicate lines removed and the resulting blank line stripped, so there's not a gap where the removed duplicate line was.

Qlemo, yes, the sequence of lines matters.
Joe Winograd, Fellow&MVEDeveloperCommented:
it's the second one
OK. Next questions:

(1) What should the resulting file be? Like this:

ACCNT Accounts Receivable - KM         OCCASET
another line
and another line
Cost Of Goods Drop Ship
another one
Cost Of Goods Sold-Consignment

Or something else?

(2) Is "ACCNT Accounts Receivable - KM         OCCASET" the only text (line) that is duplicated in the original file or are there other duplicate lines (with text that is different from that line) that also need to be removed?

It's difficult to understand what you're looking for without a sample file. I realize that you said it's a pretty big file with lots of sensitive info, but maybe you can create a small subset of it and remove the sensitive info...should be fairly easy to do with a plain text file. Regards, Joe
Qlemo"Batchelor", Developer and EE Topic AdvisorCommented:
Having to keep the sequence and having non-consequtive duplicates combined with a big file is an issue. That requires us to keep all content in memory, or check against the current new file for each line. If we use a higher level language like PowerShell, this should be feasible with less effort (e.g. lines could be hashed to detect dups, and we then only need to keep hashes in memory).
BobbyAuthor Commented:
Joe,

I made a version of the file to attach here. You will see the line "ACCNT Accounts Receivable - KM         OCCASET" 3 times... I need to just retain the first one and then remove any line breaks left by the removed 2. You will also see “Inventory Dropship”... I have to change that to “Cost Of Goods Drop Ship”, and you will also see “Inventory Consignment(Problem)” ... I have to change that to ”Cost Of Goods Sold-Consignment”.
example.txt
Joe Winograd, Fellow&MVEDeveloperCommented:
Thanks for the sample file, Bobby...very helpful! I can now see that there aren't spaces between "ACCNT" and "Accounts" or between "KM" and "OCASSET"...they each have a tab char separating them.
BobbyAuthor Commented:
Yes, it's a tab delimited txt file.
Joe Winograd, Fellow&MVEDeveloperCommented:
Yes, I can also see that your posts with the duplicated text are wrong...it is OCASSET, not OCCASET.
BobbyAuthor Commented:
Oops. sorry. Yet another reason i attached the file, so you could see the real deal (with dummy sensitive data, of course).
Joe Winograd, Fellow&MVEDeveloperCommented:
another reason i attached the file
And that's one of the reasons why multiple experts asked for it...always helpful on questions like this one.

For a Windows batch file, I'll leave that to the folks who are real experts in it...I'm not. If you're willing to use a different scripting language, here's a solution written in the AutoHotkey language:

#Warn,UseUnsetLocal ; warn when local variable is used before it is set
#NoEnv ; do not check empty variables to see if they are environment variables
#SingleInstance force ; skip dialog box and replace old instance immediately
SetBatchLines,-1 ; run at maximum speed

; change these to whatever you want
InputFile:="c:\temp\bobby-example.txt"
OutputFile:="c:\temp\bobby-output.txt"

; delete output file if it exists so that appending creates new file
FileDelete,%OutputFile%

; made this a variable in case the duplicate text ever needs changing
DuplicateText:="ACCNT`tAccounts Receivable - KM`tOCASSET"

; read entire input file and check for failure
FileRead,InputFileVar,%InputFile% ; read entire input file
If (ErrorLevel!=0)
{
  MsgBox,4112,Fatal Error 01,Error Level %ErrorLevel% trying to read input file`n`n%InputFile%
  ExitApp
}

; change all occurrences of "Inventory Dropship" to "Cost Of Goods Drop Ship"
InputFileVar:=StrReplace(InputFileVar,"Inventory Dropship","Cost Of Goods Drop Ship")

; change all occurrences of "Inventory Consignment(Problem)" to "Cost Of Goods Sold-Consignment"
InputFileVar:=StrReplace(InputFileVar,"Inventory Consignment(Problem)","Cost Of Goods Sold-Consignment")

FoundFirstOccurrence:=False ; flag for first occurrence found
; loop through all lines in file
Loop,Parse,InputFileVar,`n,`r
{
  CurrentLine:=A_LoopField
  If (A_Index!=1) ; prefix output lines with a new line char unless it's the first line
    CurrentLineOut:="`n" . CurrentLine
  Else
    CurrentLineOut:=CurrentLine
  If (CurrentLine!=DuplicateText)
  {
    ; current line is not a dup, so write it out and keep looking
    FileAppend,%CurrentLineOut%,%OutputFile%
    If (ErrorLevel!=0)
    {
      MsgBox,4112,Fatal Error 02,Error Level %ErrorLevel% trying to append to output file`n`n%OutputFile%
      ExitApp
    }
    Continue
  }
  ; current line contains the duplicate text - see if it is the first occurrence
  If (FoundFirstOccurrence)
    Continue ; first occurrence already found, so don't append this one and keep looking
  ; first occurrence - write this one out and set FoundFirstOccurrence flag
  FileAppend,%CurrentLineOut%,%OutputFile%
  If (ErrorLevel!=0)
  {
    MsgBox,4112,Fatal Error 03,Error Level %ErrorLevel% trying to append to file`n`n%OutputFile%
    ExitApp
  }
  FoundFirstOccurrence:=True
}
MsgBox,4100,Successfully Finished,Created output file:`n%OutputFile%`n`nDo you want to open it before exiting?
IfMsgBox,Yes
  Run,%OutputFile%
ExitApp

Open in new window

I put in lots of comments and gave the variables long, self-descriptive names as a form of documentation for the program.

Change the InputFile and OutputFile assignment statements to whatever you want. When the program is finished, it gives this dialog:

bobby finished
The result of running the script on your sample input is attached.

If you're not familiar with AutoHotkey, this Experts Exchange article will get you going on it:
AutoHotkey - Getting Started

These other EE articles have more AutoHotkey code that should be helpful in learning the language, if you are interested:
https://www.experts-exchange.com/articles/29416/
https://www.experts-exchange.com/articles/29533/
https://www.experts-exchange.com/articles/31560/

If you're not interested in the language, just run the script that I wrote for you as is, simply changing the input and output files. Regards, Joe
bobby-output.txt
BobbyAuthor Commented:
Joe, thanks very much. Ill look into the auto hot key thing, but how would I run this as is? The reason I asked for a batch is because this needs to be able to be run on it's own, without user intervention.
BobbyAuthor Commented:
oh, I think you meant i must use AHK with this, but I also have the option of learning more. Correct?
Joe Winograd, Fellow&MVEDeveloperCommented:
Hi Bobby,
This doesn't require any user intervention. It doesn't even require AutoHotkey to be installed on the target machine. As my AutoHotkey article shows, you can compile it into an EXE file on your dev machine that can then run on any machine...no AutoHotkey needed on the other machines. In fact, I can compile it here for you and send you the EXE, but I think it's better if you install AutoHotkey on your dev machine and compile it yourself.

If you want other users to be able to set the input and output files easily, I would change the assignment statements to Browse-for-File statements. If you're using it just for yourself, simply install AutoHotkey, as shown in my article, and double-click the AHK file that I wrote for you (that works because AutoHotkey will own the AHK file extension after you install it). Regards, Joe

Edit#1: Leaving my office for a while. In case you want Browse-for-File dialogs instead of hard-coded assignment statements, change these lines:

InputFile:="c:\temp\bobby-example.txt"
OutputFile:="c:\temp\bobby-output.txt"

Open in new window

to these lines:

FileSelectFile,InputFile,1,,Browse to input file and click Open (or click Cancel to exit),Text files (*.txt)
If (ErrorLevel=1)
  ExitApp
FileSelectFile,OutputFile,S24,,Browse to output file and click Open (or click Cancel to exit),Text files (*.txt)
If (ErrorLevel=1)
  ExitApp

Open in new window

Edit#2: One other thing that just occurred to me...it would be a good idea to make sure the input file and output aren't the same, especially since the code deletes the output file at the beginning so that the appends, in essence, create a new file. I attached an updated AutoHotkey script to this post with such a check (and it also contains the Browse-for-File code instead of the hard-coded assignment statements for the input and output files). When it determines that they're the same file, it gives this error dialog:

bobby input output same
Edit#3: Although the Browse-for-File approach is a more general solution, it does require "user intervention" to browse to the files. There are many ways to avoid that, such as implementing a configuration file (config.ini), but if you want a simple solution without user intervention, then go back to the hard-coded assignment statements for the input and output files.
BobbyDup.ahk
BobbyAuthor Commented:
Joe,

Good golly miss molly, that works great! One thing though... after running it, Cost Of Goods Drop Ship only has a single space after it, where before it had at least two spaces (like a tab delimiter). The other one looks like the original tab spacing after it, and  the duplicate line thing works fine. I don't really know the science of tabs... doesn't a tab delimiter have to have at least two spaces? i guess the bigger question is: why did that one change the tab after it but the other one didn't?
BobbyAuthor Commented:
btw, I did look at your code and there's nothing in either of those lines that have anything to do with a tab, so I'm not sure how this could even happen.
BobbyAuthor Commented:
and Joe, I assume this will be ok to install on a Windows 2008 Terminal Server (which our staff RDP into to do various work), and it won't bother all the end users? I assume if I install it as the server admin, it will just be available there, or at least won't prompt other users about anything. I need this to be invisible, so i don't have to explain to all the users to ignore prompts from some new program they'll never use (it's sales staff).
Joe Winograd, Fellow&MVEDeveloperCommented:
> Good golly miss molly, that works great!

Glad to hear it!

> Cost Of Goods Drop Ship only has a single space after it

No, it doesn't. It has a tab character after it (x'09'), same as before. What it looks like in a text editor depends on where your tab stops are. Here's what a hex editor reveals:

C  o  s  t     O  f     G  o  o  d  s     D  r  o  p     S  h  i  p     -  3  4  0  8  .  6  4
43 6f 73 74 20 4f 66 20 47 6f 6f 64 73 20 44 72 6f 70 20 53 68 69 70 09 2d 33 34 30 38 2e 36 34

Open in new window

Note the hex 09 character as opposed to the hex 20 character (see comment below for further explanation).

> I don't really know the science of tabs... doesn't a tab delimiter have to have at least two spaces?

No. Tab is simply a character...ASCII code x'09' (a space is ASCII code x'20'). What the tab character spacing looks like depends on the text viewer you're using and the tab settings in effect.

> why did that one change the tab after it but the other one didn't?

Neither changed. It just so happens that the spacing of the two changes look different in your text editor due to tab settings.

> btw, I did look at your code and there's nothing in either of those lines that have anything to do with a tab

Correct! Nothing in my AutoHotkey code changes the tabs that are in the lines of the input file. Regards, Joe

Edit: Our messages crossed...just saw your last question. Yes, it's fine to install on a Windows 2008 Terminal Server. It won't "bother" the end-users, but how is it going to run? Or who is going to run it?
BobbyAuthor Commented:
It will run via an existing VB routine in our Access 2003 program... user hits an existing button which creates the source file, and immediately after that, your code will run to clean up the source file and then save it to overwrite the original source file. So, the input and output file paths will be the exact same. That should be fine, correct?
Bill PrewIT / Software Engineering ConsultantCommented:
Why not just clean up the data in Access via VBA, or queries to replace data, and remove dupes, etc.  Seems like it would be a simpler approach, but since you asked for a Batch solution we didn't know the data was coming out of Access.


»bp
BobbyAuthor Commented:
Bill,

I did exactly that to start, on another thread, and it didnt get me where I needed to be. I didnt create the VB for this, and I'm not looking forward to recreating it. I know batch better than I know VB :-)
Joe Winograd, Fellow&MVEDeveloperCommented:
> That should be fine, correct?

No, not as it is written now. In fact, I put in an error check to make sure that the input and output files are not the same file. I'll remove that code and put in some new code to allow the input and output files to be the same file. And I'll remove all dialog boxes except for Fatal Errors, so that there is never any user intervention needed (except for those Fatal Errors, which should never happen, but proper coding always checks for error returns). Give me a half-hour or so to revise and test. Regards, Joe
Joe Winograd, Fellow&MVEDeveloperCommented:
When I hit submit, I saw Bill's comment. I think he is spot-on. I certainly didn't realize that this would be called from Access. I see your response to Bill about the failed attempt at a VBA solution in another thread, but I have no doubts that Bill can do it. In the meantime, I'll fix the AutoHotkey script for you, but Bill's approach is likely to be the better way to go. Regards, Joe
BobbyAuthor Commented:
Both of you,

Thanks, and I do know that fixing the source is the right way, but at this point after failing to get it fixed after a few days of poking at it, I just need to get this fixed so my job stays intact :-) I'll learn how to do this in VB at some point. Plus, if we wouldn't have gone this route I wouldn't know about AHK! I've already recommended it to a few colleagues elsewhere.
Joe Winograd, Fellow&MVEDeveloperCommented:
Bobby,
OK, I commented out the Browse-for-File code rather than deleting it (in case you ever need it). I put back in the hard-coded assignment statement for the input file, which you can use so that there's no user intervention required. I also commented out the "Successfully Finished" dialog box (same reason...so that no user intervention is required). Updated AutoHotkey script is attached. Regards, Joe

Edit: Updated script with an error check on the FileMove command. Use V3, not V2.
BobbyDupV3.ahk

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
BobbyAuthor Commented:
Joe,

You are truly amazing. Thank you so very much.
Joe Winograd, Fellow&MVEDeveloperCommented:
You're welcome, Bobby, I'm glad that does the trick for you. And thanks back at you for the compliment...I really appreciate hearing it! Regards, Joe
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Programming

From novice to tech pro — start learning today.