gvector1

Rich Text Regular Expression

I am trying to do some manual editing of an RTF file and would like to use regular expressions to accomplish this.  What I need to do is locate certain sections of the file.  For example:

I am trying to find the following type of section:

{\header \trowd \ts11\trgaph108{\trleft0\trkeep\trftsWidth1\trpaddl108{\trpaddr108\trpaddfl3\trpaddfr3}\clvertalt\clbrdrt\brdrtbl \clbrdrl\brdrtbl{ \clbrdrb{\brdrtbl}\clbrdrr}\brdrtbl \cltxlrtb}

The previous was just a rough example.  I would like to match the whole section.  I know I am looking for "{\header" and I need to match all the way to the corresponding "}", but there are more embedded {}'s in the text.  Does anyone know how I can build a regex that will match corresponding {}'s?

Thanks,
Kendal
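
One possible answer to the question as asked: the .NET regular expression engine supports balancing group definitions, which can count nested braces. The sketch below is an illustration under assumptions, not something posted in the thread. The class and method names (RtfGroupRegex, FindGroup) are hypothetical, the pattern only understands braces and backslash escapes rather than the full RTF grammar, and whether balancing groups behave the same way on .NET 1.1 should be verified before relying on it.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: finds the first group that starts with {\keyword
// and returns it up to and including its matching closing brace.
public class RtfGroupRegex
{
    public static string FindGroup(string rtf, string keyword)
    {
        string pattern =
            @"\{\\" + Regex.Escape(keyword) + @"\b" + // opening brace plus the control word
            @"(?>" +
            @"\\." +              // backslash plus one character, covers \{ \} and \\
            @"|[^{}\\]+" +        // a run of characters that are not braces or backslashes
            @"|\{(?<depth>)" +    // nested opening brace: push onto the 'depth' stack
            @"|\}(?<-depth>)" +   // nested closing brace: pop, fails when the stack is empty
            @")*" +
            @"(?(depth)(?!))" +   // fail if any nested brace is still open
            @"\}";                // the closing brace of the group itself

        // Singleline lets "\\." also match a backslash followed by a line break.
        Match m = Regex.Match(rtf, pattern, RegexOptions.Singleline);
        return m.Success ? m.Value : null;
    }
}
```

Calling RtfGroupRegex.FindGroup(rtfText, "header") returns the whole header group, nested groups included, or null when no balanced group is found.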
Bob Learned

May I be so bold to ask what you are hoping to accomplish with this?
gvector1

ASKER

Well, I have an application that reads rich text files sent from an off-site location, which generates them automatically.  The problem is that these files are generated with header and footer information embedded, and the typical RichTextBoxes, and even the extended RichTextBoxes that I have run across and coded, do not handle viewing headers and footers.  These rich text files also have tables embedded in them.  For testing purposes I am using WordPad to view the file to determine whether it is visible with a rich text viewer.  WordPad will show the embedded tables unless a cell is right justified, in which case the contents of the cell are not viewable.  All of this would be viewable using MS Word, but I would have to purchase Word for every workstation.  So now you can see my predicament.  I was going to attempt to remove the header indicator (\header) from the raw file, and move the footer to the end of the file and remove the footer indicator (\footer).  I was also going to remove the right-justification formatting code from the cells of the embedded table.  How would you recommend I accomplish this task?

Thanks,
Kendal
Kendal,

Regular expressions that cover the full RTF specification would be darn near impossible to implement.  RTF is a horse of a different color, and doesn't have easy mechanisms for parsing.

I am not sure that I completely understand what you mean by embedded "header" and "footer" information.
Attached is a very basic RTF file that includes a header, a footer, and a table with the 2nd column center justified and the 3rd column right justified.  If you view it with Word, everything displays just fine.  If you view it with WordPad, the header and footer cannot be seen and the 2nd and 3rd columns of the table appear to be empty.  I have got to find some way to view this information in my application without having to purchase software for every workstation that will be running this application.  Any suggestions?
Nothing attached :(
The last post did not take the file.  I had to change the extension to .doc.  When you get it, just change the extension back to .rtf and you can see what I am talking about.
TestDoc.doc
Are you looking to pull the table text out of the RTF text, without the header or footer?
No, I actually need to be able to view the header and the footer as well as the table data.  What I determined is that if I parse the text in the file and remove the \header and \footer control words from the text, the header and footer become part of the document.  The only problem is that the footer is at the top of the document.  So I wanted to use regular expressions to pull the entire footer section from the top of the document and move it to the bottom, but I will have to identify the correct closing brace to match the opening brace of that footer section, as there will be other {}s within that footer section.  I was also planning on removing the formatting codes from the table that set a column to center or right justified, and letting them all be left justified.  Any thoughts?
I had this started to parse RTF text.  It doesn't really do anything other than build a stack of RTF entries (between braces), and splits each section into separate elements so that you can look for a particular keyword (like footer or header).

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
 
public class RtfParser
{
 
    private Stack<RtfEntry> _RTFStack = null;
 
    public RtfParser(string fileName)
    {
        _RTFStack = new Stack<RtfEntry>();
 
        this.Parse(File.ReadAllText(fileName));
    }
 
    private void Parse(string fileText)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char ch in fileText)
        {
            if (ch == '{')
            {
                if (sb.Length > 0)
                {
                    _RTFStack.Push(new RtfEntry(sb.ToString()));
                }
                sb.Length = 0;
            } 
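            else if (ch == '}')
            {
                // Closing braces are ignored, so nesting depth is not tracked.
            }
            else
                sb.Append(ch);
        }

        // Push whatever text is left after the last opening brace.
        if (sb.Length > 0)
        {
            _RTFStack.Push(new RtfEntry(sb.ToString()));
        }
    }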
 
    private class RtfEntry
    {
        public string Text = "";
 
        public List<string> Elements = null;
 
        public RtfEntry(string lineText)
        {
            this.Text = lineText;
            this.Elements = new List<string>();
            this.Elements.AddRange(lineText.Split('\\'));
        }
    }
 
}


Okay, 2 things.
1st:  I can't use generics, as this code is in Visual Studio 2003 and .NET 1.1.
2nd:  It does not look like it will catch the large sections in the rtf.

For example:
  {\header\xx1\xx2{\section1\xx1\xx2}{\section2\xx1\xx2{\section3\xx1}}\xx3}

If I am trying to capture the entire header section, logically I should get the entire line above.  The previous code you posted looks like it will separate the line into the following sections:

\header\xx1\xx2
\section1\xx1\xx2
\section2\xx1\xx2
\section3\xx1\xx3

Ideally, the sections would be separated into groups like this:
\header\xx1\xx2{\section1\xx1\xx2}{\section2\xx1\xx2{\section3\xx1}}\xx3
\section1\xx1\xx2
\section2\xx1\xx2{\section3\xx1}

Ideally I was looking for a way to use regex to parse this information, to avoid having to process it on a character-by-character basis.  Am I correct in saying that I will probably have to process it char by char?
How do you think RegEx works?  It does a complex, iterative character-by-character processing of the text, trying to find matches.  What you probably need to do is get the text between the braces and push each section on a stack when the starting brace { is found.  Visual Studio 2003 should have a Stack class that is not type-specific.
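
A minimal sketch of the stack idea suggested above, written without generics so it should fit the .NET 1.1 constraint. It uses the non-generic System.Collections.Stack and ArrayList. The class and method names (RtfGroupCollector, CollectGroups) are made up for illustration, and the code only tracks braces and backslash escapes, not the rest of the RTF grammar.

```csharp
using System;
using System.Collections;

// Hypothetical helper: collects every brace-delimited group, nested ones included.
public class RtfGroupCollector
{
    // Returns the groups as strings, in the order their closing braces appear,
    // so inner groups come before the groups that contain them.
    public static ArrayList CollectGroups(string rtf)
    {
        Stack openPositions = new Stack();   // indexes of '{' not yet closed
        ArrayList groups = new ArrayList();

        for (int i = 0; i < rtf.Length; i++)
        {
            char ch = rtf[i];
            if (ch == '\\')
            {
                i++; // skip the character after a backslash so \{ and \} are not counted
            }
            else if (ch == '{')
            {
                openPositions.Push(i);
            }
            else if (ch == '}' && openPositions.Count > 0)
            {
                int start = (int)openPositions.Pop();
                groups.Add(rtf.Substring(start, i - start + 1));
            }
        }
        return groups;
    }
}
```

For the {\header ...} example earlier in the thread this yields four entries: the \section1 group, the \section3 group, the \section2 group that contains it, and finally the complete \header group. Each entry can then be checked for a keyword such as \header or \footer.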
So my logic is that I will search for the beginning of the section I need (for example, {\header).  Then I will move character by character looking for the ending brace, skipping nested pairs of opening and closing braces until I find the matching brace for that section.  What do you think?
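
The logic described above could be sketched roughly as follows: find the marker, then walk forward keeping a brace depth counter until it returns to zero. The names RtfGroupScanner, FindGroupEnd, ExtractGroup and MoveGroupToEnd are hypothetical. The code assumes the marker starts with an opening brace, treats a backslash as escaping the next character, and does not distinguish \footer from longer control words such as \footerl. MoveGroupToEnd illustrates the plan mentioned earlier in the thread of relocating the footer group to just before the document's final closing brace. Removing the \footer control word itself would be a separate step.

```csharp
using System;

// Hypothetical helper for locating and relocating one brace-delimited group.
public class RtfGroupScanner
{
    // Returns the index of the '}' that closes the '{' at openIndex, or -1.
    public static int FindGroupEnd(string rtf, int openIndex)
    {
        if (openIndex < 0 || openIndex >= rtf.Length || rtf[openIndex] != '{')
            return -1;

        int depth = 0;
        for (int i = openIndex; i < rtf.Length; i++)
        {
            char ch = rtf[i];
            if (ch == '\\')
            {
                i++; // skip the character after a backslash so \{ and \} are not counted
            }
            else if (ch == '{')
            {
                depth++;
            }
            else if (ch == '}')
            {
                depth--;
                if (depth == 0) return i;
            }
        }
        return -1; // braces never balanced
    }

    // Returns the first group that begins with marker (for example "{\\header"), or null.
    public static string ExtractGroup(string rtf, string marker)
    {
        int start = rtf.IndexOf(marker);
        int end = FindGroupEnd(rtf, start);
        if (end < 0) return null;
        return rtf.Substring(start, end - start + 1);
    }

    // Cuts the group out and reinserts it before the last '}' of the text.
    // Returns the input unchanged when the group cannot be found.
    public static string MoveGroupToEnd(string rtf, string marker)
    {
        int start = rtf.IndexOf(marker);
        int end = FindGroupEnd(rtf, start);
        if (end < 0) return rtf;

        string group = rtf.Substring(start, end - start + 1);
        string rest = rtf.Remove(start, end - start + 1);
        int last = rest.LastIndexOf('}');
        if (last < 0) return rtf;
        return rest.Insert(last, group);
    }
}
```

This does the same job as a regex would, but the depth counter makes the matching rule explicit and it is easy to step through in a debugger.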
Sounds like one direction that may get you where you want to be.
Okay, I seem to be able to do most of this by following the logic stated above, but I have run into a problem.  When dealing with tables, if a cell contains multiple lines, WordPad views it with no problem, but if I load it into a RichTextBox, the view is all corrupted.  I will attach another test document to show the difference.  View it with WordPad and then load it into a RichTextBox and see the difference.  Are there any suggestions on this matter?
Sorry forgot the file
TestDoc.doc
Any recommendations??
I have been on vacation for a week, so I didn't get a chance to even think about this one.  RTF is very difficult to parse, as far as I am concerned, so while I started to get something that works, it continually gets pushed into the background.  It is nice to get the old code out, and look at it every once in a while to see if there is a course of action.  If you give me a little time to think about this, hopefully we can find a solution.
This last RTF file doesn't have a 'header' or 'footer' section.
Yes, it is to simplify the table formatting issue.  I was able to get the header put in the correct place in the rtf, but the formatting of the table is as stated above.  Just trying to break the problem down to be simplified.  Hope you enjoyed your vacation and thanks again for your assistance.
Ok, I am trying to get back on the problem, so I guess that I missed the change.

"When dealing with tables, if a cell has multiple lines contained within, wordpad views it with no problem, but if I were to load it within a richtextbox, the view is all corrupted."

Can you attach a .png screenshot, so that I can see what you mean?
Here is a screenshot of the difference between opening the same file in wordpad and loading it into a richtextbox.
ScreenShot.png
Ok, I don't think that there is anything that I can do about the RichTextBox control not recognizing the carriage returns in the lines.
Unfortunately, the RichTextBox does not fully support the RTF specification and does some really strange things with the contents.
ASKER CERTIFIED SOLUTION
Bob Learned
This solution is only available to members.
Thanks TheLearnedOne,

That has saved my life.  My final resolution will be to move the header and footer into the document itself and use a custom RichTextBox, built on msftedit.dll, so it can view tables correctly.

Thanks a million,
Kendal
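
For readers wondering what a custom RichTextBox built on msftedit.dll might look like, here is a commonly used pattern, offered as a sketch only. The accepted solution text is not visible in this thread, so this is not necessarily what was posted there. The class name RichEdit50 is hypothetical, and the code assumes a Windows machine where msftedit.dll is installed. If the library cannot be loaded, the control falls back to the standard RichTextBox window class.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Windows.Forms;

// Hypothetical RichTextBox subclass that asks Windows for the newer
// rich edit window class hosted in msftedit.dll.
public class RichEdit50 : RichTextBox
{
    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    private static extern IntPtr LoadLibrary(string fileName);

    protected override CreateParams CreateParams
    {
        get
        {
            CreateParams cp = base.CreateParams;

            // Only switch the window class when the library actually loads.
            if (LoadLibrary("msftedit.dll") != IntPtr.Zero)
            {
                cp.ClassName = "RICHEDIT50W";
            }
            return cp;
        }
    }
}
```

The control is then used on a form in place of the stock RichTextBox, for example by calling LoadFile or setting the Rtf property as usual.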
TheLearnedOne,

I have run into one problem I was hoping you could help me with.  When using the RichEdit50W class, it seems like every time I check the SelectionFont property, it is null, regardless of whether any text is selected or not.  Do you have any suggestions on this?

Thanks,
Kendal
Is there any way to view the source code of the RichEdit controls so I can see how the SelectionFont property is getting populated?