Solved

How do you find and replace just the text in Rich Text Format files, using sed, ignoring useless formatting codes?

Posted on 2004-09-07
Last Modified: 2010-04-21
Here is an outline of the problem:

I am building an application to search and replace rich text files, for example, being able to search for some text in bold formatting and being able to replace it with text in italic formatting.

I looked into using sed addressing to solve the problem, whereby sed searches only between the curly brackets { and } within the rich text document file, ***but this doesn't work where the curly brackets span multiple lines***.  For example, the sed script would look like the following:

/{/,/}/{
s/search text/replace text/g
}

I am using a Windows port of the Unix utility sed, but that should make little difference, I think.  I am building the application itself in the Windows environment; I posted this question in the UNIX area because sed is a Unix tool, and the solution to my problem will likely involve it.

I really want some sed script code (or otherwise) that can search through just the bulk of the actual printed text, and replace specific words.  Searching and replacing outright on an RTF file is a bad idea, as you can end up replacing RTF codes you don't want to touch.
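
One way to replace only the printed text is to split the RTF into tokens first and run the replacement on the plain-text runs alone. The following is a minimal Python sketch of that idea, not sed and not anything posted in this thread. The names `TOKEN` and `replace_rtf_text` are made up for illustration. It assumes control words look like a backslash followed by letters, an optional numeric parameter and an optional delimiting space, and it knows nothing about which groups hold document text, so font names and similar entries inside header groups would be replaced as well. A word that is interrupted by a control word or a line break will not be found.

```python
import re

# Hypothetical tokenizer: alternatives are tried left to right.
TOKEN = re.compile(
    r"\\[a-zA-Z]+-?\d* ?"       # control word, optional parameter, optional delimiter space
    r"|\\'[0-9a-fA-F]{2}"       # hex-escaped character such as \'e9
    r"|\\[^a-zA-Z]"             # control symbol such as \{ \} \\
    r"|[{}]"                    # group delimiters
    r"|(?P<text>[^\\{}]+)"      # run of plain text
    r"|\\"                      # stray trailing backslash, copied through unchanged
)

def replace_rtf_text(rtf, old, new):
    """Replace old with new in plain-text runs only; copy every other token as-is."""
    out = []
    for m in TOKEN.finditer(rtf):
        text = m.group("text")
        if text is not None:
            out.append(text.replace(old, new))
        else:
            out.append(m.group(0))
    return "".join(out)

if __name__ == "__main__":
    sample = r"{\rtf1\ansi \b b\b0  is b\par}"
    print(replace_rtf_text(sample, "b", "X"))
```

A plain search and replace of `b` on the same sample would also rewrite the `\b` and `\b0` control words, which is exactly the damage described above.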

Any suggestions, ideas, or new approaches to this problem?

Thanks,
Matt (ANSI C++/ANSI C/VB Programmer)
Question by:amadataset
10 Comments
 
LVL 20

Expert Comment

by:Gns
ID: 11996374
Hm, strange. With GNU sed it works as expected....
With file aaa:
$ cat aaa
kjashjkasdhdjklhaskldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj text lslösdflösd}lkdfsljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj text lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj text lslösdflösd}lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj text lstextlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj text lslösdftextlösdlkdfsljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj text lslösdflösd}lkdfsljfkals
$ sed -e '/{/,/}/ s/text/TEXT/g' aaa
kjashjkasdhdjklhaskldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösd}lkdfsljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdflösd}lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lsTEXTlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdfTEXTlösdlkdfsljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösd}lkdfsljfkals
$
... Which is equivalent to what you're trying to do. If your sed implementation doesn't want to play, get the cygwin one from http://www.cygwin.com ... That is GNU sed...

-- Glenn
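
To see why whole lines are affected, it can help to spell out what a `/{/,/}/` range does. The sketch below emulates it in Python. It is only an illustration of the range semantics as GNU sed documents them for regex addresses (the end pattern is first checked on the line after the one that opened the range); the function name `sed_range_substitute` is hypothetical.

```python
import re

def sed_range_substitute(lines, start_pat, end_pat, old, new):
    """Emulate sed '/start/,/end/ s/old/new/g' on a list of lines."""
    out = []
    in_range = False
    for line in lines:
        if in_range:
            # Inside the range: substitute, then check whether this line closes it.
            out.append(line.replace(old, new))
            if re.search(end_pat, line):
                in_range = False
        elif re.search(start_pat, line):
            # The opening line is part of the range; the end pattern is
            # only tested from the next line on.
            out.append(line.replace(old, new))
            in_range = True
        else:
            out.append(line)
    return out

if __name__ == "__main__":
    demo = ["a text", "b{ text }c text", "d text", "e} text", "f text"]
    for line in sed_range_substitute(demo, r"\{", r"\}", "text", "TEXT"):
        print(line)
```

Note that every `text` on a line inside the range is replaced, including the parts of the line that sit outside the braces.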
0
 
LVL 48

Expert Comment

by:Tintin
ID: 12001057
Using Perl would be more portable.
0
 
LVL 38

Expert Comment

by:yuzh
ID: 12002353
If you have Perl installed (most systems have it these days), you can do:


perl -i -pe "s/oldstr/newstr/" $file

You can put it in a shell script if you want.
0
 
LVL 20

Expert Comment

by:Gns
ID: 12004004
Perl would be fine... as portable as GNU sed, in a way:-):-)
Note the example above, Greg: we need to find the starting { and the ending }, and only replace the regex pattern between them. So one might do something like
perl -pe '$f=1 if(/{/); s/text/TEXT/ if(defined($f)); undef($f) if(defined($f) && /}/);' aaa
(which might actually fail if you have a line "skaalas} saas text jkas{jsdksd")

-- Glenn
0
 
LVL 38

Expert Comment

by:yuzh
ID: 12004097
Good point Glenn, thanks for the correction!
0

 

Author Comment

by:amadataset
ID: 12004688
OK

Here's an update to the question.

Technically the above should work, so it's probably just a problem with my port of sed.  Not that it matters.  Newlines in rich-text files are purely optional: they have no effect on the Rich Text Format file whatsoever, so I can remove all the newlines in the file and just use sed that way.

However, still this doesn't solve the problem of finding and replacing only the words themselves within the RTF file (none of the formatting codes etc.), which is what the essence of this problem was.

QUOTE ------------>
     I really want some sed script code (or otherwise) that can search through just
     the bulk of the actual printed text, and replace with specific words.  Just searching
     and replacing outright on an RTF file is a bad idea, as you can end up searching and
     replacing the rtf codes you dont want to.
<-------------------

Thanks for the input though...

-Matt
0
 
LVL 20

Expert Comment

by:Gns
ID: 12005151
Ah, we're fighting the line-oriented nature of the tools, sort of... With sed, we can use the fact that "Newlines in rich-text files are purely optional" to insert some well-placed newlines... Like:
---------------------------------------------------------
#### Indata
$ cat aaa
kjashjkasdhdjklhaskldaskl
söajdkl text asjdlkjaskl{ ldfkjgkldfj text lslösdflösd}lkdfstextljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj text lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj text lslösdflösd}lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj text lstextlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj text lslösdftextlösdlkdfsljfkals
söajdklasjdlkjaskl{ ldfkjg}kldfj text lslösdflösd}lkdfsljfkals

#### Bad one, look at the first line.
$ sed -e '/{/,/}/ s/text/TEXT/g' aaa
kjashjkasdhdjklhaskldaskl
söajdkl TEXT asjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösd}lkdfsTEXTljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdflösd}lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lsTEXTlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdfTEXTlösdlkdfsljfkals
söajdklasjdlkjaskl{ ldfkjg}kldfj TEXT lslösdflösd}lkdfsljfkals

#### "good" one
$ sed -e 's/{/\
{\
/; s/}/\
}\
/' aaa | sed -e '/{/,/}/ s/text/TEXT/g'
kjashjkasdhdjklhaskldaskl
söajdkl text asjdlkjaskl
{
 ldfkjgkldfj TEXT lslösdflösd
}
lkdfstextljfkals
söajdklasjdlkjaskl
{
 ldfkjgkldfj TEXT lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdflösd
}
lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl
{
 ldfkjgkldfj TEXT lsTEXTlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdfTEXTlösdlkdfsljfkals
söajdklasjdlkjaskl
{
 ldfkjg
}
kldfj text lslösdflösd}lkdfsljfkals
-------------------------------------------------------------------------
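
The same trick of putting each brace on a line of its own can be sketched in Python. This is an illustration only; the helper names `isolate_braces` and `replace_between_brace_lines` are hypothetical. Unlike the sed command above, which has no `g` flag and therefore splits only the first brace of each kind per line, this version splits at every brace. It relies on the thread's assumption that extra newlines do not change the RTF.

```python
import re

def isolate_braces(text):
    """Put every { and } on a line of its own."""
    return re.sub(r"([{}])", r"\n\1\n", text)

def replace_between_brace_lines(text, old, new):
    """Replace old with new only on lines between a lone { and a lone }."""
    out = []
    inside = False
    for line in isolate_braces(text).split("\n"):
        if line == "{":
            inside = True
        elif line == "}":
            inside = False
        elif inside:
            line = line.replace(old, new)
        out.append(line)
    return "\n".join(out)

if __name__ == "__main__":
    print(replace_between_brace_lines("x text{a text}b text", "text", "TEXT"))
```

Text before the opening brace and after the closing brace is left alone, which is what the "bad one" above gets wrong.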
Now, with a more proper Perl program we could do this without inserting newlines... Let me frob/tweak a bit and I'll get back to you.

-- Glenn
0
 
LVL 20

Expert Comment

by:Gns
ID: 12005696
Oh so very crude, but working:-). Large files put a bit of a load on the memory:-):-)...

$ cat aaa
kjashjkasdhdjklhaskldaskl
söajdkl text asjdlkjaskl{ ldfkjgkldfj text lslösdflösd}lkdfstextljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj text lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj text lslösdflösd}lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj text lstextlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj text lslösdftextlösdlkdfsljfkals
söajdklasjdlkjaskl{ ldfkjg}kldfj text lslösdflösd}lkdfsljfkals

$ cat aa.pl
#!/usr/bin/perl -0
$match = "text";
$replace = "TEXT";
open(R,aaa);
$str=<R>;
@s=split(//,$str);
$f=0;
for($i=0;$i<=$#s;$i++)
{
  if($s[$i] =~ /{/) { $f=1; };
  if($s[$i] =~ /}/) { $f=0; };
  if(($f == 1) and ($s[$i] eq 't')) {
    $tmp=substr($str,$i,length($match));
    if($tmp eq $match) {
      print $replace;
      $i+=length($match)-1;
      next;
    }
  }
  print $s[$i];
}


$ ./aa.pl
kjashjkasdhdjklhaskldaskl
söajdkl text asjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösd}lkdfstextljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdflösd}lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lsTEXTlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdfTEXTlösdlkdfsljfkals
söajdklasjdlkjaskl{ ldfkjg}kldfj text lslösdflösd}lkdfsljfkals

$

Enjoy
-- Glenn
0
 
LVL 20

Accepted Solution

by:
Gns earned 500 total points
ID: 12005764
Argh. The script should've been:
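#!/usr/bin/perl -0
# -0 sets the input record separator to NUL, so <R> reads the whole file at once.
$match = "text";
$replace = "TEXT";
$firstchar = substr($match,0,1);
open(R, "<", "aaa") or die "cannot open aaa: $!";
$str=<R>;
close(R);
@s=split(//,$str);
$f=0;   # 1 while we are between a { and the next }
for($i=0;$i<=$#s;$i++)
{
  if($s[$i] eq '{') { $f=1; }
  if($s[$i] eq '}') { $f=0; }
  # Strings must be compared with eq; == would compare them as numbers.
  if(($f == 1) and ($s[$i] eq $firstchar)) {
    $tmp=substr($str,$i,length($match));
#print "<$tmp>";
    if($tmp eq $match) {
      print $replace;
      $i+=length($match)-1;   # skip the rest of the matched text
      next;
    }
  }
  print $s[$i];
}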

Sorry for that .... (the hardcoded 't' compare...).

-- Glenn
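
The script above uses a single flag, so the first `}` switches replacement off even when an outer group is still open. RTF groups can nest, so a depth counter may be closer to what is wanted. The following Python sketch shows that variation under stated assumptions: the name `replace_inside_braces` is hypothetical, the whole input is held in memory just like the Perl version, and escaped braces (`\{` and `\}`) are not treated specially.

```python
def replace_inside_braces(data, match, replacement):
    """Replace match with replacement wherever at least one brace group is open."""
    if not match:
        return data  # nothing to search for
    out = []
    depth = 0  # number of currently open { groups
    i = 0
    while i < len(data):
        ch = data[i]
        if ch == "{":
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
        if depth > 0 and data.startswith(match, i):
            out.append(replacement)
            i += len(match)  # skip past the matched text
            continue
        out.append(ch)
        i += 1
    return "".join(out)

if __name__ == "__main__":
    print(replace_inside_braces("x text {a text {b} text} text", "text", "TEXT"))
```

With the single-flag logic, the third `text` in that sample would be left unchanged because the inner `}` resets the flag.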
0
 

Author Comment

by:amadataset
ID: 12006189
This is, by far, not the most elegant of solutions, but probably the only practical way.  Well done, Glenn.

FYI:  I am giving up on searching and replacing in RTFs, as they are a maze of codes, making the task near-impossible without some long-winded program I don't have time to create.

- Matt
0
