  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1035

How do you find and replace just the text in Rich Text Format files, using sed, ignoring useless formatting codes?

Here is an outline of the problem:

I am building an application to search and replace rich text files, for example, being able to search for some text in bold formatting and being able to replace it with text in italic formatting.

I looked into using sed addressing to solve the problem, whereby sed searches only between curly brackets { and } within the rich text document file, ***but this doesn't work where the curly brackets span multiple lines***.  For example, the sed script would look like the following:

/{/,/}/{
s/search text/replace text/g
}

I am using a Windows port of the Unix utility sed, but I think that should make little difference.  I am building the actual application itself in the Windows environment; I posted this question in the UNIX area because sed is a UNIX tool, and it is likely the solution to my problem will involve the use of this tool.

I really want some sed script code (or otherwise) that can search through just the bulk of the actual printed text, and replace specific words.  Just searching and replacing outright on an RTF file is a bad idea, as you can end up replacing RTF codes you don't want to touch.

Any suggestions, ideas, or new approaches to this problem?

Thanks,
Matt (ANSI C++/ANSI C/VB Programmer)
Asked by: amadataset
1 Solution
 
Gns commented:
Hm, strange. With GNU sed it works as expected....
With file aaa:
$ cat aaa
kjashjkasdhdjklhaskldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj text lslösdflösd}lkdfsljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj text lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj text lslösdflösd}lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj text lstextlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj text lslösdftextlösdlkdfsljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj text lslösdflösd}lkdfsljfkals
$ sed -e '/{/,/}/ s/text/TEXT/g' aaa
kjashjkasdhdjklhaskldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösd}lkdfsljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdflösd}lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lsTEXTlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdfTEXTlösdlkdfsljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösd}lkdfsljfkals
$
... Which is equivalent to what you're trying to do. If your sed implementation doesn't want to play, get the cygwin one from http://www.cygwin.com ... That is GNU sed...

-- Glenn

Tintin commented:
Using Perl would be more portable.

yuzh commented:
If you have Perl installed (most systems have it these days), you can do:


perl -i -pe "s/oldstr/newstr/" $file

You can put it in a shell script if you want.

Gns commented:
Perl would be fine... as portable as GNU sed, in a way:-):-)
Note the example above, Greg: we need to find the starting { and the ending }, and replace the regex pattern only between those. So one might do something like
perl -pe '$f=1 if(/\{/); s/text/TEXT/ if(defined($f)); undef($f) if(defined($f) && /\}/);' aaa
(which might actually fail if you have a line "skaalas} saas text jkas{jsdksd")

-- Glenn

yuzh commented:
Good point Glenn, thanks for the correction!

amadataset (author) commented:
OK

Here's an update to the question.

Technically the above should work, so it's probably just a problem with my port of sed.  Not that it matters: newlines in rich-text files are purely optional.  They have no effect on the Rich Text Format file whatsoever, so I can remove all the newlines in the file and just use sed that way.

However, still this doesn't solve the problem of finding and replacing only the words themselves within the RTF file (none of the formatting codes etc.), which is what the essence of this problem was.

QUOTE ------------>
     I really want some sed script code (or otherwise) that can search through just
     the bulk of the actual printed text, and replace with specific words.  Just searching
     and replacing outright on an RTF file is a bad idea, as you can end up searching and
     replacing RTF codes you don't want to touch.
<-------------------

Thanks for the input though...

-Matt

Gns commented:
Ah, we're fighting the "lineorientationed-ness" of the tools, sort of... With sed, we can use the fact that "newlines in rich-text files are purely optional" to insert some well-placed newlines... Like:
---------------------------------------------------------
#### Indata
$ cat aaa
kjashjkasdhdjklhaskldaskl
söajdkl text asjdlkjaskl{ ldfkjgkldfj text lslösdflösd}lkdfstextljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj text lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj text lslösdflösd}lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj text lstextlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj text lslösdftextlösdlkdfsljfkals
söajdklasjdlkjaskl{ ldfkjg}kldfj text lslösdflösd}lkdfsljfkals

#### Bad one, look at the first line.
$ sed -e '/{/,/}/ s/text/TEXT/g' aaa
kjashjkasdhdjklhaskldaskl
söajdkl TEXT asjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösd}lkdfsTEXTljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdflösd}lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lsTEXTlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdfTEXTlösdlkdfsljfkals
söajdklasjdlkjaskl{ ldfkjg}kldfj TEXT lslösdflösd}lkdfsljfkals

#### "good" one
$ sed -e 's/{/\
{\
/; s/}/\
}\
/' aaa | sed -e '/{/,/}/ s/text/TEXT/g'
kjashjkasdhdjklhaskldaskl
söajdkl text asjdlkjaskl
{
 ldfkjgkldfj TEXT lslösdflösd
}
lkdfstextljfkals
söajdklasjdlkjaskl
{
 ldfkjgkldfj TEXT lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdflösd
}
lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl
{
 ldfkjgkldfj TEXT lsTEXTlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdfTEXTlösdlkdfsljfkals
söajdklasjdlkjaskl
{
 ldfkjg
}
kldfj text lslösdflösd}lkdfsljfkals
-------------------------------------------------------------------------
Now, with a more proper Perl program we could do this without inserting newlines... Let me frob/tweak a bit and I'll get back to you.

-- Glenn

Gns commented:
Oh so very crude, but working:-). Large files put a bit of a load on the memory:-):-)...

$ cat aaa
kjashjkasdhdjklhaskldaskl
söajdkl text asjdlkjaskl{ ldfkjgkldfj text lslösdflösd}lkdfstextljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj text lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj text lslösdflösd}lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj text lstextlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj text lslösdftextlösdlkdfsljfkals
söajdklasjdlkjaskl{ ldfkjg}kldfj text lslösdflösd}lkdfsljfkals

$ cat aa.pl
#!/usr/bin/perl -0
$match = "text";
$replace = "TEXT";
open(R,aaa);
$str=<R>;
@s=split(//,$str);
$f=0;
for($i=0;$i<=$#s;$i++)
{
  if($s[$i] =~ /{/) { $f=1; };
  if($s[$i] =~ /}/) { $f=0; };
  if(($f == 1) and ($s[$i] eq 't')) {
    $tmp=substr($str,$i,length($match));
    if($tmp eq $match) {
      print $replace;
      $i+=length($match)-1;
      next;
    }
  }
  print $s[$i];
}


$ ./aa.pl
kjashjkasdhdjklhaskldaskl
söajdkl text asjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösd}lkdfstextljfkals
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lslösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdflösd}lkdfsljfkals
kjashjkasdhdjklhaskldaskl
kjashjkasdhdjklhas text kldaskl
söajdklasjdlkjaskl{ ldfkjgkldfj TEXT lsTEXTlösdflösdlkdfsljfkals
söajdklasjdlkjaskl ldfkjgkldfj TEXT lslösdfTEXTlösdlkdfsljfkals
söajdklasjdlkjaskl{ ldfkjg}kldfj text lslösdflösd}lkdfsljfkals

$

Enjoy
-- Glenn

Gns commented:
Argh. The script should've been:
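#!/usr/bin/perl -0
# Replace $match with $replace, but only between a { and the next }.
# The -0 switch makes <R> read the whole file at once (assuming no NUL bytes).
$match = "text";
$replace = "TEXT";
$firstchar = substr($match,0,1);
open(R, "<", "aaa") or die "Cannot open aaa: $!";
$str=<R>;
close(R);
@s=split(//,$str);   # one array element per character
$f=0;                # 1 while we are between { and }
for($i=0;$i<=$#s;$i++)
{
  if($s[$i] eq "{") { $f=1; }
  if($s[$i] eq "}") { $f=0; }
  # Strings must be compared with eq; == compares numerically.
  if(($f == 1) and ($s[$i] eq $firstchar)) {
    $tmp=substr($str,$i,length($match));
    if($tmp eq $match) {
      print $replace;
      $i+=length($match)-1;   # skip the rest of the matched text
      next;
    }
  }
  print $s[$i];
}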

Sorry for that .... (the hardcoded 't' compare...).

-- Glenn

amadataset (author) commented:
This is, by far, not the most elegant of solutions, but probably the only practical way.  Well done, Glenn.

FYI:  I am giving up on searching and replacing in RTFs, as they are a maze of codes, making the task near-impossible without some long-winded program I don't have time to create.

- Matt