dwcronin

asked on

What is a Perl command to make a substitution?

I have a bunch of Microsoft Word files, and I would like to make substitutions that are common to each of them. I remember that I used Perl before and did something similar to this, but I don't remember the command. Does anyone know what to type?
wilcoxon

Are these Word 2003 or Word 2007 format (i.e. .doc or .docx)?

In either case, it won't be a simple command as you'll have to use modules to properly handle Word docs.
dwcronin (ASKER)

.doc files. I think that it is actually an older version of Microsoft Word than what you quoted. I have no idea what a module is. I thought you could use an asterisk.
To give some background: basically, there have been (at least) 3 different incompatible Word document formats: 2007-2010, 1996-2003 (may be off by a year on the early end), and pre-1996.

The 2007+ version (docx) has the advantage of being zipped XML which can be handled via normal XML means (but is non-trivial due to the terrible design Microsoft used for the XML).

The 2003 version is a binary format but has the advantage of having been around a long time so quite a bit of detail has been worked out about it.

A module is basically a Perl library that adds support for something (in this case, working with Word docs).  I may not have time to look at this until Saturday but I'll try to throw something together.
Thank you. I didn't know this was so involved. After what you said, I started thinking that maybe I didn't apply this to multiple files but rather that I used an external file that had multiple substitutions in it. I'm sorry, but this was about four years ago and I seriously don't remember. I will start going through the Perl manual also. If I find anything out, I'll post it.
For a little background, what I'm trying to do is this:
I have about 600 files, and I want to substitute postal abbreviations for the state names in each of them. I really don't want to do this individually, and I was thinking that maybe Perl would do it.
Perl should be able to do it.  I'm just not sure how involved it will be yet (I've done lots with Excel files but not much with Word files).
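The substitution step on its own, once the text is in a Perl string, might look like the sketch below. It is only an illustration: the `%STATE_ABBREV` table is a small hypothetical subset (a real one would list every state), and `abbreviate_states` is a name made up for this example, not something from the thread.

```perl
use strict;
use warnings;

# Illustrative subset only; a real table would list every state.
my %STATE_ABBREV = (
    'Illinois'      => 'IL',
    'Missouri'      => 'MO',
    'New York'      => 'NY',
    'Virginia'      => 'VA',
    'West Virginia' => 'WV',
);

# Build one alternation from the table keys. Longest names go first, a
# common precaution so a shorter name can never pre-empt a longer one
# that starts at the same position.
my $STATE_RE = do {
    my $alt = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) || $a cmp $b } keys %STATE_ABBREV;
    qr/\b($alt)\b/;
};

# Return a copy of $text with every listed state name abbreviated.
sub abbreviate_states {
    my ($text) = @_;
    $text =~ s/$STATE_RE/$STATE_ABBREV{$1}/g;
    return $text;
}

print abbreviate_states("Born in West Virginia, moved to Missouri."), "\n";
```

The word boundaries (`\b`) keep the pattern from touching longer words such as "Virginian". The match is case-sensitive, which is usually what you want for proper names.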
dagesi

Are the documents REALLY Word documents or do you just use Word to view/work on them...?
Meaning, do they display correctly as text in WordPad, for instance...?
I never thought about trying it.  I just tried it and the files did not show up in WordPad.
Specifically: I have an old book on my family history. I bought Dragon NaturallySpeaking, have been playing with it, and typed in the contents of the book. I have about 600 pages of files I wrote in Microsoft Word. I wanted to neaten them up by replacing the state names with their postal abbreviations, but I don't want to go through each file and do them individually. That's why I was trying to use Perl. I am essentially just learning Perl/Dragon/webpages.
Sorry it took so long.  Unfortunately this seems to be a very tricky problem.  Word modules are much newer and more primitive than Excel modules.

If you are okay with losing any formatting in the documents, it's fairly straightforward (use Text::Extract::Word to get all the text, modify it, and write a new file using Win32::Word::Writer).

However, if you want/need to alter the existing documents in-place, then your options are to either use Win32::Word::Declarative or go directly to Win32::OLE, both of which will require significant reading of the Microsoft VBA references and lots of trial-and-error (hence make sure to back up the docs).
I'm not too concerned about the formatting. I did very little of that when I realized how many files there are. I was simply trying to go through and do all of the files en masse when I realized there were many similar substitutions. The big one that I noticed was that the state names were spelled out, and I realized I could just use the two-letter postal codes. I thought Perl could do that type of substitution pretty easily.

Can I use your method of Text::Extract::Word with the whole group, or do I have to do the files individually? I think there are 540 files. My knowledge of Perl is rather minimal, and this seemed like a good excuse to learn it better. That, coupled with the fact that the family album files are rather unimportant, made it seem a good place to learn Perl better.
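On the whole-group question: a script can loop over every .doc file in a folder, so nothing has to be done by hand per file. The sketch below shows only that looping part, using core Perl. The helper names `doc_files_in` and `for_each_doc` are hypothetical, and the per-file callback is a placeholder where the extract, substitute, and write steps would go.

```perl
use strict;
use warnings;
use File::Spec;

# Return the .doc files in $dir (any letter case), sorted by name.
# opendir/readdir is used instead of glob so that folder and file
# names containing spaces need no special quoting.
sub doc_files_in {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @names = sort grep { /\.doc\z/i } readdir $dh;
    closedir $dh;
    return map { File::Spec->catfile($dir, $_) } @names;
}

# Call $per_file->($path) once for each .doc file in $dir and return
# how many files were visited.
sub for_each_doc {
    my ($dir, $per_file) = @_;
    my @files = doc_files_in($dir);
    $per_file->($_) for @files;
    return scalar @files;
}

# Example: only list what would be processed in the current folder.
my $count = for_each_doc('.', sub { print "would process: $_[0]\n" });
print "$count file(s) found\n";
```

Running it first with a callback that only prints, as here, is a cheap way to confirm the file list before anything gets rewritten.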

ASKER CERTIFIED SOLUTION
wilcoxon
That was great!!! I misunderstood what you meant. I thought I would have to go back into Word somehow. This is exactly what I was looking for!
I got a little overanxious when I saw the solution. I've copied the file, and the script does not work.

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

C:\Documents and Settings\Danny>dir
 Volume in drive C has no label.
 Volume Serial Number is 0842-5308

 Directory of C:\Documents and Settings\Danny

04/04/2011  02:53 PM    <DIR>          .
04/04/2011  02:53 PM    <DIR>          ..
02/22/2011  09:25 PM                 0 .javafx_eula_accepted
03/05/2011  06:44 PM    <DIR>          .nbi
02/22/2011  09:53 PM    <DIR>          .netbeans
03/06/2011  02:09 PM    <DIR>          .netbeans-derby
02/22/2011  09:51 PM    <DIR>          .netbeans-registration
02/15/2011  08:49 AM         2,521,132 asound.wav
04/02/2011  10:08 AM    <DIR>          Desktop
02/08/2011  05:15 PM    <DIR>          Favorites
04/02/2011  03:30 PM    <DIR>          My Documents
12/22/2010  12:41 PM    <DIR>          Start Menu
03/01/2011  04:21 PM    <DIR>          wurm
               2 File(s)      2,521,132 bytes
              11 Dir(s)   4,510,662,656 bytes free

C:\Documents and Settings\Danny> CD "my documents"

C:\Documents and Settings\Danny\My Documents> DIR
 Volume in drive C has no label.
 Volume Serial Number is 0842-5308

 Directory of C:\Documents and Settings\Danny\My Documents

04/02/2011  03:30 PM    <DIR>          .
04/02/2011  03:30 PM    <DIR>          ..
02/04/2011  06:49 PM        25,561,546 AbsoluteBeginnersGuidetoiPodandiTunesThirdEdition.pdf
03/05/2011  05:50 PM            64,000 alice.doc
01/02/2011  07:11 PM    <DIR>          books on tape
04/02/2011  10:00 AM    <DIR>          Downloads
01/07/2011  11:33 PM         1,790,803 Dragon manual.pdf
04/01/2011  02:25 PM    <DIR>          family album
01/06/2011  05:44 PM    <DIR>          Freecorder 4
02/09/2011  03:36 PM    <DIR>          InnerWorkings
02/08/2011  03:49 PM    <DIR>          InnerWorkings Content
02/16/2011  05:37 PM    <DIR>          Java how to program books
03/29/2011  10:38 AM    <DIR>          Michael and Priscilla Cronin
02/22/2011  01:23 PM    <DIR>          mini recorder training
03/29/2011  10:31 AM            14,848 my accounts.xls
03/11/2011  05:42 PM    <DIR>          My Music
04/02/2011  10:07 AM    <DIR>          My Pictures
01/10/2011  07:51 PM    <DIR>          My Videos
02/21/2011  06:46 PM    <DIR>          NetBeans IDE
03/06/2011  05:26 PM    <DIR>          NetBeansProjects
04/04/2011  02:59 PM    <DIR>          Perl programs
03/05/2011  02:56 PM    <DIR>          recipes
02/08/2011  11:26 AM            29,696 resume_programming_focus.doc
               5 File(s)     27,460,893 bytes
              18 Dir(s)   4,519,280,640 bytes free

C:\Documents and Settings\Danny\My Documents> CD "Perl programs"

C:\Documents and Settings\Danny\My Documents\Perl programs> DIR
 Volume in drive C has no label.
 Volume Serial Number is 0842-5308

 Directory of C:\Documents and Settings\Danny\My Documents\Perl programs

04/04/2011  02:59 PM    <DIR>          .
04/04/2011  02:59 PM    <DIR>          ..
03/30/2011  09:10 AM            25,088 Aaron_AR_3_1.doc
03/31/2011  09:45 AM            21,504 Aaron_E_AR_5_101.doc
03/31/2011  03:17 PM            19,968 Agnes_Christena_AR_6_1032.doc
03/21/2011  01:16 PM            19,968 Agnes_Crawford_AR_6_143.doc
03/21/2011  03:31 PM            20,480 Alberta_AR_6_1510.doc
03/31/2011  03:21 PM            19,968 Alexander_AR_4_10.doc
03/21/2011  04:11 PM            19,968 Alice_AR_6_178.doc
03/22/2011  03:11 PM            39,424 all relatives.doc
03/31/2011  03:27 PM            19,968 Anne_Belle_AR_5_92.doc
03/31/2011  03:29 PM            19,968 Arnie_Oscar_AR_6_171.doc
03/31/2011  03:31 PM            19,968 Arthur_Rosendale_AR_6_141.doc
03/31/2011  03:34 PM            19,968 Barbara_Jean_AR_7_10141.doc
03/31/2011  03:35 PM            19,968 Beatrice_AR_6_1026.doc
03/31/2011  03:38 PM            19,968 Bernadine_Mae_AR_7_7335.doc
03/22/2011  03:10 PM            19,968 Bernice_AR_6_1024.doc
03/21/2011  04:03 PM            19,968 Bessie_AR_6_176.doc
03/31/2011  03:42 PM            19,968 Betty_Malone_AR_7_1442.doc
03/22/2011  01:13 PM            19,968 Beulah_AR_6_944.doc
03/23/2011  08:41 AM            19,968 Bonnie_Jean_AR_7_1584.doc
03/22/2011  03:44 PM            19,968 Burdett_Alexander_AR_6_1025.doc
03/23/2011  11:54 AM            19,968 Carl_Eric_AR_7_1734.doc
03/24/2011  07:27 AM            19,968 Carolyn_Sue_AR_7_10256.doc
03/21/2011  03:49 PM            19,968 Casper_AR_6_172.doc
03/21/2011  01:35 PM            22,528 Casper_Carl_AR_6_153.doc
03/23/2011  07:09 AM            20,480 Casper_Carl_AR_7_1533.doc
03/22/2011  01:47 PM            20,480 Cecil_AR_6_1014.doc
03/21/2011  01:26 PM            20,480 Cedric_Isaac_AR_6_144.doc
03/21/2011  03:57 PM            19,968 Charles_AR_6_174.doc
04/04/2011  02:15 PM            19,968 Charlotte_June_AR_7_1791.doc
03/22/2011  06:31 AM            19,968 Chester_AR_6_174.doc
03/21/2011  06:10 PM            19,968 Chester_L_AR_6_718.doc
03/24/2011  06:04 AM            19,968 choice_Mae_AR_7_7341.doc
03/21/2011  06:18 PM            19,968 Clara_Millie_AR_6_731.doc
03/21/2011  06:28 PM            19,968 Clarence_I_AR_6_733.doc
03/24/2011  05:59 AM            19,968 Clarence_I_AR_7_7336.doc
03/21/2011  05:35 PM            19,968 Cora_Agnes_AR_6_712.doc
03/21/2011  10:42 AM            19,968 Crawford_AR_5_14.doc
03/21/2011  12:00 PM            19,968 Crawford_E_AR_5_76.doc
03/21/2011  03:54 PM            19,968 Curtis_AR_6_173.doc
03/24/2011  06:41 AM            19,968 Curtis_Lee_AR_7_10223.doc
03/23/2011  06:17 AM            20,480 Cyril_AR_6_1031.doc
03/21/2011  05:45 PM            19,968 Daisy_AR_6_713.doc
03/21/2011  03:41 PM            19,968 Daniel_Webster_AR_6_161.doc
03/24/2011  06:53 AM            19,968 David_Lee_AR_7_10251.doc
03/23/2011  08:27 AM            19,968 Donald_Joseph_AR_7_1583.doc
03/24/2011  07:06 AM            19,968 Donald_Ray_AR_7_10252.doc
03/21/2011  01:36 PM            20,480 Edith_Emma_AR_6_154.doc
03/21/2011  06:05 PM            19,968 Edna_AR_6_716.doc
03/22/2011  02:42 PM            19,968 Elbert_W_AR_6_1023.doc
03/21/2011  10:40 AM            19,968 Elizabeth_E_AR_5_11.doc
03/21/2011  11:03 AM            19,968 Eliza_E_AR_5_72.doc
04/04/2011  02:20 PM            19,968 Elsie_Luria_AR_6_151.doc
03/24/2011  07:49 AM            19,968 Ernest_Dee_AR_7_10272.doc
03/30/2011  10:35 AM            26,624 essay -- Aaron Cronin (AR 3-1).doc
03/22/2011  02:25 PM            19,968 Estel_E_AR_6_1021.doc
03/24/2011  07:13 AM            19,968 Estel_Wayne_AR_7_10253.doc
03/22/2011  07:19 AM            24,064 Ethel_AR_6_736.doc
03/21/2011  06:00 PM            19,968 Eva_Marie_AR_6_715.doc
04/04/2011  12:48 PM             1,115 formatting_script.pl
03/23/2011  07:24 AM            19,968 Francis_Letoice_AR_7_1535.doc
03/24/2011  08:14 AM            19,968 George_Paul_AR_8_15813.doc
03/23/2011  02:42 PM            19,968 Georgia_Lee_AR_7_7182 okay.doc
03/23/2011  10:16 AM            19,968 Gertrude_C_AR_7_1613.doc
03/22/2011  02:32 PM            19,968 Glenn_Elwood_AR_6_1022.doc
03/21/2011  01:29 PM            19,968 Hallie_Myrtle_AR_6_145.doc
03/21/2011  05:55 PM            19,968 Hazel_AR_6_714.doc
03/21/2011  12:24 PM            20,480 Herbert_AR_5_102.doc
03/22/2011  01:42 PM            19,968 Homer_Lessel_AR_6_1011.doc
03/21/2011  06:22 PM            19,968 Howard_Addison_AR_6_732.doc
03/21/2011  10:26 AM            20,480 Howard_AR_4_7.doc
03/21/2011  10:44 AM            24,576 Howard_R_AR_5_15.doc
03/22/2011  10:17 AM            19,968 ida_AR_6_941.doc
03/23/2011  08:17 AM            20,992 Irvin_Howard_AR_7_1581.doc
03/21/2011  10:13 AM            20,480 Isaac_AR_4_1.doc
03/21/2011  11:01 AM            20,480 Isaac_W_AR_5_71.doc
03/24/2011  05:51 AM            19,968 James_Edward_AR_7_7334.doc
03/21/2011  12:25 PM            19,968 James_F_AR_5_103.doc
03/23/2011  09:56 AM            20,480 James_W._AR_7_1611.doc
03/23/2011  08:59 AM            19,968 Jerry_Keith_AR_7_1588.doc
03/21/2011  02:49 PM            20,992 Jesse_Roscoe_AR_6_158.doc
03/23/2011  08:53 AM            19,968 Jimmy_Joe_AR_7_1587.doc
03/23/2011  07:15 AM            19,968 Joan_Arc_AR_7-1534.doc
03/23/2011  06:47 AM            19,968 Joe_Arthur_AR_7_1443.doc
03/23/2011  06:53 AM            19,968 Judith_Ann_AR_7_1444.doc
03/22/2011  07:29 AM            23,552 Kenneth_AR_6_737.doc
03/23/2011  04:44 PM            19,968 Kenneth_A_AR_7_7322.doc
03/21/2011  04:16 PM            19,968 Kenneth_H_AR_6_179.doc
03/24/2011  08:08 AM            19,968 Kenneth_Irvin_AR_8_15811.doc
03/23/2011  08:00 AM            19,968 Leitha_Laveena_AR_7_1538.doc
03/22/2011  07:00 AM            20,480 Leonard_AR_6_734.doc
03/22/2011  02:10 PM            19,968 Leonard_Earl_AR_6_1016.doc
03/24/2011  06:07 AM            19,968 Leonard_Edward_AR_7_7342.doc
03/21/2011  03:01 PM            19,968 Leona_AR_6_159.doc
03/21/2011  02:41 PM            19,968 Lester_Edward_AR_6_155.doc
03/21/2011  01:33 PM            20,480 Lindsay_Bryant_AR_6_152.doc
03/24/2011  06:25 AM            19,968 Lois_Lorene_AR_7_10221.doc
03/23/2011  06:59 AM            19,968 Lois_Pearl_AR_7_1531.doc
03/23/2011  01:35 PM            20,992 Lorene_AR_7_7111.doc
03/21/2011  11:59 AM            19,968 Louisa_AR_5_75.doc
03/21/2011  03:34 PM            19,968 Luetta_AR_6_1511.doc
03/23/2011  07:53 AM            19,968 Marietta_Bretina_AR_7_1538.doc
03/24/2011  07:57 AM            19,968 Marilyn_E_AR_7_10311.doc
03/23/2011  02:19 PM            19,968 Martha_Fay_AR_7_7114.doc
03/24/2011  05:46 AM            19,968 Mary_Alma_AR_7_7333.doc
03/23/2011  06:36 AM            19,968 Mary_Ann_AR_7_1422.doc
03/21/2011  10:16 AM            19,968 Mary_AR_4_2.doc
03/21/2011  12:27 PM            20,480 Mary_AR_5_105.doc
03/21/2011  12:16 PM            19,968 Mary_Eliza_AR_5_93.doc
03/23/2011  03:08 PM            20,480 Mary_Helen_AR_7_7321.doc
03/22/2011  02:16 PM            19,968 Mary_Mabel_AR_6_1017.doc
03/22/2011  10:26 AM            19,968 Mary_Susanna_AR_6_942.doc
03/23/2011  10:23 AM            19,968 May_Louise_AR_7_1614.doc
03/21/2011  12:04 PM            19,968 Minnie_C_AR_5_78.doc
03/21/2011  10:46 AM            19,968 Montville_AR_5_16.doc
03/22/2011  02:03 PM            19,968 Nellie_AR_6_1015.doc
03/21/2011  04:24 PM            19,968 Noah_AR_6_711.doc
03/23/2011  12:24 PM            19,968 Opal_May_AR_7_1736.doc
03/21/2011  10:55 AM            20,480 Otha_AR_5_17.doc
03/23/2011  11:24 AM            19,968 Paul_Cooper_AR_7_1733.doc
03/21/2011  10:18 AM            19,968 Priscilla_AR_4_3.doc
03/21/2011  12:08 PM            19,968 Priscilla_J_AR_5_91.doc
03/21/2011  10:27 AM            19,968 Rachel_AR_4_8.doc
03/23/2011  08:50 AM            19,968 Ralph_Curtis_AR_7_1585.doc
03/22/2011  02:20 PM            19,968 Ralph_E_AR_6_1018.doc
03/24/2011  06:18 AM            19,968 Ralph_Leon_AR_7_10161.doc
03/23/2011  02:05 PM            19,968 Raymond_Vincent_AR 7_7113.doc
03/22/2011  06:13 AM            24,064 Rebecca_Ann_AR_5_74.doc
03/21/2011  10:19 AM            19,968 Rebecca_AR_4_4.doc
03/24/2011  07:38 AM            19,968 Ricky_Eugene_AR_7_102510.doc
03/23/2011  11:11 AM            19,968 Robert_AR_7_1732.doc
03/23/2011  02:28 PM            19,968 Robert_Dewey_AR_7_7181.doc
03/24/2011  07:21 AM            19,968 Robert_Earl_AR_7_10254.doc
03/24/2011  06:33 AM            19,968 Robert_Elwood_AR_7_10222.doc
03/24/2011  07:44 AM            19,968 Ronald_E_AR_7_10271.doc
03/24/2011  08:32 AM            19,968 Ronald_Lee_AR_8_73222.doc
03/21/2011  12:29 PM            19,968 Russell_Winfred_AR_6_142.doc
03/23/2011  06:30 AM            19,968 Russell_Winfred_AR_7_1421.doc
03/21/2011  12:02 PM            20,480 Samuel_Guy_AR_5_77.doc
03/21/2011  12:19 PM            19,968 Sarah_Elizabeth_AR_5_95.doc
03/21/2011  10:24 AM            19,968 Sarah_J_AR_4_6.doc
03/23/2011  06:09 AM            19,968 Selmer_AR_6_1027.doc
03/23/2011  12:19 PM            20,480 Shirley_Allen_AR_7_1735.doc
03/24/2011  08:01 AM            19,968 Suzanne_Virginia_AR_a_14212.doc
03/21/2011  10:31 AM            20,992 Sylvester_AR_4_9.doc
03/21/2011  04:06 PM            19,968 Sylvia_AR_6_177.doc
03/23/2011  10:56 AM            19,968 Vernon_AR_7_1731.doc
03/23/2011  07:48 AM            19,968 Vesta_Conchita_AR_7_1536.doc
03/24/2011  08:25 AM            19,968 Vicky_Kay_AR_8_73221.doc
03/21/2011  11:41 AM            19,968 Walter_AR_5_74.doc
03/21/2011  12:17 PM            19,968 Walter_R_AR_5_94.doc
03/23/2011  06:29 PM            19,968 Wanda_Lee_AR_7_7331.doc
03/24/2011  08:18 AM            19,968 Wanda_Lee_AR_8_17321.doc
03/21/2011  11:55 AM            20,992 Willard_AR_5_73.doc
03/21/2011  10:21 AM            20,992 William_AR_4_5.doc
03/23/2011  07:50 PM            19,968 William_B_AR_7_7332.doc
03/23/2011  06:12 PM            19,968 William_Donald_AR_7_7323.doc
04/01/2011  08:07 PM            19,968 William_Walter_AR_5_104.doc
             157 File(s)      3,183,195 bytes
               2 Dir(s)   4,519,145,472 bytes free

C:\Documents and Settings\Danny\My Documents\Perl programs>ls formatting_script.pl
'ls' is not recognized as an internal or external command,
operable program or batch file.

C:\Documents and Settings\Danny\My Documents\Perl programs>dir formatting_script.pl
 Volume in drive C has no label.
 Volume Serial Number is 0842-5308

 Directory of C:\Documents and Settings\Danny\My Documents\Perl programs

04/04/2011  12:48 PM             1,115 formatting_script.pl
               1 File(s)          1,115 bytes
               0 Dir(s)   4,518,797,312 bytes free

C:\Documents and Settings\Danny\My Documents\Perl programs> Perl formatting_script.pl
Can't locate Text/Extract/Word.pm in @INC (@INC contains: C:/Perl/site/lib C:/Perl/lib .) at formatting_script.pl line 6.
BEGIN failed--compilation aborted at formatting_script.pl line 6.

C:\Documents and Settings\Danny\My Documents\Perl programs>
That message is saying you need to install Text::Extract::Word.  If you are using ActiveState Perl, check their site for how to install modules (or the standard way may work as well).  The normal way to install a module is using CPAN:

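perl -MCPAN -e "install Text::Extract::Word"

(At a Windows command prompt the double quotes are needed; single quotes only work in Unix-style shells.) I think that's right - if it doesn't work, try: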

perl -MCPAN -e shell
install Text::Extract::Word
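To confirm whether the install worked before re-running the script, a small check like the following can help. It is a sketch using only core Perl; `module_available` is a hypothetical helper name. It tries to load the module from the same @INC directories listed in the error message.

```perl
use strict;
use warnings;

# True if the named module can be loaded from the directories in @INC.
sub module_available {
    my ($module) = @_;
    # Text::Extract::Word -> Text/Extract/Word.pm, the form shown in
    # the "Can't locate ..." message.
    (my $file = "$module.pm") =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module ('Text::Extract::Word') {
    printf "%-22s %s\n", $module,
        module_available($module) ? 'installed' : 'missing';
}
print "Perl searches: @INC\n";
```

If the module still shows as missing, the directories printed on the last line are the places Perl looked, which is useful to compare against where the installer put the files.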