Solved

How do you find and replace just the text in Rich Text Format files, using sed, ignoring useless formatting codes?

Posted on 2004-09-07
Last Modified: 2010-04-21
Here is an outline of the problem:

I am building an application to search and replace rich text files, for example, being able to search for some text in bold formatting and being able to replace it with text in italic formatting.

I looked into using sed addressing to solve the problem, whereby sed searches only between the curly brackets { and } within the rich text file, ***but this doesn't work where the curly brackets span multiple lines***.  For example, the sed script would look like the following:

/{/,/}/{
s/search text/replace text/g
}

I am using a Windows port of the Unix utility sed, but I think that should make little difference.  I am building the actual application itself in the Windows environment; I posted this question in the UNIX area because sed is a UNIX tool, and it is likely that the solution to my problem will involve this tool.

I really want some sed script code (or otherwise) that can search through just the bulk of the actual printed text and replace specific words.  Searching and replacing outright on an RTF file is a bad idea, as you can end up replacing RTF codes you don't want to touch.

Any suggestions, ideas, or new approaches to this problem?

Thanks,
Matt (ANSI C++/ANSI C/VB Programmer)
Question by:amadataset
10 Comments
 
LVL 20

Expert Comment

by:Gns
ID: 11996374
Hm, strange. With GNU sed on it works as expected....
With file aaa:
$ cat aaa
kjashjkasdhdjklhaskldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj text lslösdflösd}lkdfsljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj text lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj text lslösdflösd}lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj text lstextlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj text lslösdftextlösdlkdfsljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj text lslösdflösd}lkdfsljfkals
$ sed -e '/{/,/}/ s/text/TEXT/g' aaa
kjashjkasdhdjklhaskldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösd}lkdfsljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdflösd}lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lsTEXTlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdfTEXTlösdlkdfsljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösd}lkdfsljfkals
$
... Which is equivalent to what you're trying to do. If your sed implementation doesn't want to play, get the cygwin one from http://www.cygwin.com ... That is GNU sed...

-- Glenn
0
 
LVL 48

Expert Comment

by:Tintin
ID: 12001057
Using Perl would be more portable.
0
 
LVL 38

Expert Comment

by:yuzh
ID: 12002353
If you have Perl installed (most systems have it these days), you can do:


perl -i -pe "s/oldstr/newstr/" $file

you can put it in a shell script if you wanted.

 
LVL 20

Expert Comment

by:Gns
ID: 12004004
Perl would be fine... as portable as GNU sed, in a way:-):-)
Note the example above, Greg. We need to find the starting { and the ending }, and only replace the re pattern between those... So one might do something like
perl -pe '$f=1 if(/\{/); s/text/TEXT/ if(defined($f)); undef($f) if(defined($f) && /\}/);' aaa
(which might actually fail if you have a line "skaalas} saas text jkas{jsdksd")

-- Glenn
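
To see that failure case concretely, here is a minimal sketch using the sed range form from earlier in the thread on that single sample line. The only "text" on the line lies outside any { } pair, yet it is still replaced, because the range is switched on for the whole line as soon as the line contains a {.

```shell
#!/bin/sh
# The sample line from the comment above: "text" sits between a closing }
# and the next opening {, i.e. outside any group.
line='skaalas} saas text jkas{jsdksd'

# The range /{/,/}/ starts on this line because it contains a {,
# so the substitution is applied to the entire line.
printf '%s\n' "$line" | sed -e '/{/,/}/ s/text/TEXT/g'
# prints: skaalas} saas TEXT jkas{jsdksd
```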
0
 
LVL 38

Expert Comment

by:yuzh
ID: 12004097
Good point Glenn, thanks for the correction!
0
 

Author Comment

by:amadataset
ID: 12004688
OK

Here's an update to the question.

Technically the above should work, and therefore it's probably just a problem with my port of sed.  Not that it matters.  Newlines in rich-text files are purely optional - they have no effect on the Rich Text Format file whatsoever, so I can remove all the newlines in the file and just use sed that way.

However, still this doesn't solve the problem of finding and replacing only the words themselves within the RTF file (none of the formatting codes etc.), which is what the essence of this problem was.

QUOTE ------------>
     I really want some sed script code (or otherwise) that can search through just
     the bulk of the actual printed text, and replace with specific words.  Just searching
     and replacing outright on an RTF file is a bad idea, as you can end up searching and
     replacing the rtf codes you don't want to.
<-------------------

Thanks for the input though...

-Matt
0
 
LVL 20

Expert Comment

by:Gns
ID: 12005151
Ah, we're fighting the line-orientedness of the tools, sort of... With sed, we can use the fact that "Newlines in rich-text files are purely optional" to insert some well-placed newlines... Like:
---------------------------------------------------------
#### Indata
$ cat aaa
kjashjkasdhdjklhaskldaskl
söajdkl text asjdlkjaskl{ ldfkjgkldfj text lslösdflösd}lkdfstextljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj text lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj text lslösdflösd}lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj text lstextlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj text lslösdftextlösdlkdfsljfkals
söajdklasjdlkjaskl{ ldfkjg}kldfj text lslösdflösd}lkdfsljfkals

#### Bad one, look at the first line containing braces: the text outside them is replaced too.
$ sed -e '/{/,/}/ s/text/TEXT/g' aaa
kjashjkasdhdjklhaskldaskl
söajdkl TEXT asjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösd}lkdfsTEXTljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdflösd}lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lsTEXTlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdfTEXTlösdlkdfsljfkals
söajdklasjdlkjaskl{ ldfkjg}kldfj TEXT lslösdflösd}lkdfsljfkals

#### "good" one
$ sed -e 's/{/\
{\
/; s/}/\
}\
/' aaa | sed -e '/{/,/}/ s/text/TEXT/g'
kjashjkasdhdjklhaskldaskl
söajdkl text asjdlkjaskl
{
 ldfkjgkldfj TEXT lslösdflösd
}
lkdfstextljfkals
söajdklasjdlkjaskl
{
 ldfkjgkldfj TEXT lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdflösd
}
lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl
{
 ldfkjgkldfj TEXT lsTEXTlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdfTEXTlösdlkdfsljfkals
söajdklasjdlkjaskl
{
 ldfkjg
}
kldfj text lslösdflösd}lkdfsljfkals
-------------------------------------------------------------------------
Now, with a more proper Perl program we could do this without inserting newlines... Let me frob/tweak a bit and I'll get back to you.

-- Glenn
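
A possible variation on the "good" sed pipeline above, offered as a sketch rather than a drop-in fix: adding the g flag puts every brace on a line of its own, not just the first { and the first } of each line, and anchoring the range on lines that consist only of a brace keeps the range from starting or ending anywhere else. The function name replace_between_braces is made up for illustration, and nested groups are still not handled.

```shell
#!/bin/sh
# Split the input so that each { and } stands alone on its own line,
# then replace only on the lines between a lone { and the next lone }.
replace_between_braces() {
    sed -e 's/[{}]/\
&\
/g' | sed -e '/^{$/,/^}$/ s/text/TEXT/g'
}

# Two groups on one line: only the occurrences inside them become TEXT.
printf '%s\n' 'a text {b text} c text {d text} e text' | replace_between_braces
```

The output still contains the inserted newlines, as in the pipeline above.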
0
 
LVL 20

Expert Comment

by:Gns
ID: 12005696
Oh so very crude, but working:-). Large files put a bit of a load on the memory:-):-)...

$ cat aaa
kjashjkasdhdjklhaskldaskl
söajdkl text asjdlkjaskl{ ldfkjgkldfj text lslösdflösd}lkdfstextljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj text lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj text lslösdflösd}lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj text lstextlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj text lslösdftextlösdlkdfsljfkals
söajdklasjdlkjaskl{ ldfkjg}kldfj text lslösdflösd}lkdfsljfkals

$ cat aa.pl
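#!/usr/bin/perl -0
# -0 sets the input record separator to NUL, so <R> reads the whole file.
$match = "text";
$replace = "TEXT";
open(R, "<", "aaa") or die "Cannot open aaa: $!";
$str=<R>;
close(R);
@s=split(//,$str);      # one array element per character
$f=0;                   # 1 while we are between { and }
for($i=0;$i<=$#s;$i++)
{
  if($s[$i] eq '{') { $f=1; }
  if($s[$i] eq '}') { $f=0; }
  # eq, not ==: a numeric compare of two letters is always true
  if(($f == 1) and ($s[$i] eq 't')) {
    $tmp=substr($str,$i,length($match));
    if($tmp eq $match) {
      print $replace;
      $i+=length($match)-1;
      next;
    }
  }
  print $s[$i];
}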


$ ./aa.pl
kjashjkasdhdjklhaskldaskl
söajdkl text asjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösd}lkdfstextljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdflösd}lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lsTEXTlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdfTEXTlösdlkdfsljfkals
söajdklasjdlkjaskl{ ldfkjg}kldfj text lslösdflösd}lkdfsljfkals

$

Enjoy
-- Glenn
0
 
LVL 20

Accepted Solution

by:
Gns earned 500 total points
ID: 12005764
Argh. The script should've been:
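#!/usr/bin/perl -0
# -0 sets the input record separator to NUL, so <R> reads the whole file.
$match = "text";
$replace = "TEXT";
$firstchar = substr($match,0,1);
open(R, "<", "aaa") or die "Cannot open aaa: $!";
$str=<R>;
close(R);
@s=split(//,$str);      # one array element per character
$f=0;                   # 1 while we are between { and }
for($i=0;$i<=$#s;$i++)
{
  if($s[$i] eq '{') { $f=1; }
  if($s[$i] eq '}') { $f=0; }
  # eq, not ==: a numeric compare of two letters is always true
  if(($f == 1) and ($s[$i] eq $firstchar)) {
    $tmp=substr($str,$i,length($match));
    if($tmp eq $match) {
      print $replace;
      $i+=length($match)-1;
      next;
    }
  }
  print $s[$i];
}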

Sorry for that .... (the hardcoded 't' compare...).

-- Glenn
0
 

Author Comment

by:amadataset
ID: 12006189
This is, by far, not the most elegant of solutions, but probably the only practical way.  Well done, Glenn.

FYI:  I am giving up on searching and replacing in RTFs, as they are a maze of codes, making the task near-impossible without some long-winded program I don't have time to create.

- Matt
0
