Solved

Rich Text Regular Expression

Posted on 2008-06-23
Last Modified: 2013-12-16
I am trying to do some manual editing of an RTF file and would like to use regular expressions to accomplish this. What I need to do is locate certain sections of the file. For example, I am trying to find the following type of section:

{\header \trowd \ts11\trgaph108{\trleft0\trkeep\trftsWidth1\trpaddl108{\trpaddr108\trpaddfl3\trpaddfr3}\clvertalt\clbrdrt\brdrtbl \clbrdrl\brdrtbl{ \clbrdrb{\brdrtbl}\clbrdrr}\brdrtbl \cltxlrtb}

The previous was just a rough example. I would like to match the whole section. I know I am looking for "{\header" and I need to match all the way to the corresponding "}", but there are more embedded {}'s in the text. Does anyone know how I can build a regex that will match corresponding {}'s?

Thanks,
Kendal
Question by:gvector1
27 Comments
 
LVL 96

Expert Comment

by:Bob Learned
May I be so bold as to ask what you are hoping to accomplish with this?
 

Author Comment

by:gvector1
Well, I have an application that reads rich text files sent from an off-site location, which generates them automatically. The problem is that these files are generated with header and footer information embedded, and the typical RichTextBoxes, and even the extended RichTextBoxes that I have run across and coded, do not handle viewing headers and footers. These rich text files also have tables embedded in them. For testing purposes I am using WordPad to view the file to determine whether it is visible with a rich text viewer. WordPad will display the embedded tables unless a cell is right justified, in which case the contents of the cell are not viewable. All of this would be viewable using MS Word, but I would have to purchase Word for every workstation. So now you can see my predicament. I was going to attempt to remove the header indicator (\header) from the raw file, and move the footer to the end of the file and remove the footer indicator (\footer). I was also going to remove the right justification formatting code from the cells of the embedded table. How would you recommend I accomplish this task?

Thanks,
Kendal
 
LVL 96

Expert Comment

by:Bob Learned
Kendal,

Regular expressions that cope with the full RTF specification would be darn near impossible to implement. RTF is a horse of a different color, and doesn't have easy mechanisms for parsing.

I am not sure that I completely understand what you mean by embedded "header" and "footer" information.
 

Author Comment

by:gvector1
Attached is a very basic RTF file that includes a header, a footer, and a table with the 2nd column center justified and the 3rd column right justified. If you view it with Word, everything displays just fine. If you view it with WordPad, the header and footer cannot be seen and the 2nd and 3rd columns of the table appear to be empty. I have got to find some way to view this information in my application without having to purchase software for every workstation that will be running it. Any suggestions?
 
LVL 96

Expert Comment

by:Bob Learned
Nothing attached :(
 

Author Comment

by:gvector1
The last post did not take the file. I had to change the extension to .doc. When you get it, just change the extension back to .rtf and you can see what I am talking about.
TestDoc.doc
 
LVL 96

Expert Comment

by:Bob Learned
Are you looking to pull the table text out of the RTF text, without the header or footer?
 

Author Comment

by:gvector1
No, I actually need to be able to view the header and the footer as well as the table data. What I determined is that if I parse the text in the file and remove the \header and \footer control words, the header and footer become part of the document. The only problem is that the footer ends up at the top of the document. So I wanted to use regular expressions to pull the entire footer section from the top of the document and move it to the bottom, but I will have to identify the closing brace that matches the opening brace of that footer section, as there will be other {}s within it. I was also planning on removing the formatting codes from the table that set a column to center or right justified, leaving them all left justified. Any thoughts?
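
The keyword removal and the alignment change described above can be sketched with two regular expression replacements. This is only a rough sketch under assumptions: the class and method names (RtfCleanup, LeftAlignAll, StripHeaderFooterKeywords) are made up for illustration, and it assumes the center and right justification in the table cells comes from the RTF paragraph alignment control words \qc and \qr. It does not guard against an escaped backslash sitting directly in front of text such as "qr", and it avoids generics so it should also suit .NET 1.1.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public class RtfCleanup
{
    // Rewrites centered (\qc) and right-aligned (\qr) paragraphs as
    // left-aligned (\ql). The negative lookahead leaves longer control
    // words that merely start with the same letters untouched.
    public static string LeftAlignAll(string rtf)
    {
        return Regex.Replace(rtf, @"\\q[cr](?![a-zA-Z])", @"\ql");
    }

    // Drops the \header and \footer keywords together with the single
    // space that may delimit them, so the group content becomes part of
    // the body. Variants such as \headerl or \footery are not matched.
    public static string StripHeaderFooterKeywords(string rtf)
    {
        return Regex.Replace(rtf, @"\\(header|footer)(?![a-zA-Z]) ?", "");
    }
}
```

For example, StripHeaderFooterKeywords applied to {\header \pard Hi\par} leaves {\pard Hi\par}. Moving the footer group itself still requires finding its matching closing brace, which these replacements do not attempt.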
 
LVL 96

Expert Comment

by:Bob Learned
I had started this to parse RTF text. It doesn't really do anything other than build a stack of RTF entries (between braces) and split each section into separate elements so that you can look for a particular keyword (like footer or header).

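using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public class RtfParser
{
    // Entries in the order their text was encountered (most recent on top).
    private Stack<RtfEntry> _RTFStack = null;

    public RtfParser(string fileName)
    {
        _RTFStack = new Stack<RtfEntry>();
        this.Parse(File.ReadAllText(fileName));
    }

    private void Parse(string fileText)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char ch in fileText)
        {
            if (ch == '{')
            {
                // An opening brace ends the current run of text.
                if (sb.Length > 0)
                {
                    _RTFStack.Push(new RtfEntry(sb.ToString()));
                }
                sb.Length = 0;
            }
            else if (ch == '}')
            {
                // Closing braces are not tracked yet, so text that follows
                // a nested group is appended to the current entry.
            }
            else
                sb.Append(ch);
        }

        // Flush the text collected after the last opening brace.
        if (sb.Length > 0)
        {
            _RTFStack.Push(new RtfEntry(sb.ToString()));
        }
    }

    private class RtfEntry
    {
        public string Text = "";
        public List<string> Elements = null;

        public RtfEntry(string lineText)
        {
            this.Text = lineText;
            this.Elements = new List<string>();
            // Split the entry into its backslash-delimited control words.
            this.Elements.AddRange(lineText.Split('\\'));
        }
    }
}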
 

Author Comment

by:gvector1
Okay, 2 things.
1st: I can't use generics, as this code is in Visual Studio 2003 and .NET 1.1.
2nd: It does not look like it will catch the large sections in the RTF.

IE:
  {\header\xx1\xx2{\section1\xx1\xx2}{\section2\xx1\xx2{\section3\xx1}}\xx3}

If I am trying to capture the entire header section, logically I should get the entire line above.  The previous code you posted looks like it will separate the line into the following sections:

\header\xx1\xx2
\section1\xx1\xx2
\section2\xx1\xx2
\section3\xx1\xx3

Ideally, the sections would be separated into groups like this:
\header\xx1\xx2{\section1\xx1\xx2}{\section2\xx1\xx2{\section3\xx1}}\xx3
\section1\xx1\xx2
\section2\xx1\xx2{\section3\xx1}

Ideally I was looking for a way to use regex to parse this information, to avoid having to process it character by character. Am I correct in saying that I will probably have to process it char by char?
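
For what it is worth, the .NET regex engine has a feature aimed at exactly this kind of nesting: balancing group definitions. The following is a hedged sketch rather than a full RTF parser. The class name HeaderGroupRegex and the group name "depth" are made up for illustration, only the first {\header group is returned, and the only RTF subtlety it handles is escaped \{, \} and \\. Balancing groups should be available in .NET 1.1 as far as I know, but that is worth verifying on the target runtime.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public class HeaderGroupRegex
{
    // \{\\header(?![a-zA-Z])  the "{\header" that opens the group
    // (?> ... )*              one of the following, repeated:
    //   \\[{}\\]                an escaped brace or backslash (not structural)
    //   \\                      a backslash that starts a control word
    //   [^{}\\]+                plain text
    //   \{(?<depth>)            a nested opening brace: push
    //   \}(?<-depth>)           a nested closing brace: pop (fails when empty)
    // (?(depth)(?!))          fail if a nested brace is still open
    // \}                      the brace that closes the header group
    private static readonly Regex HeaderGroup = new Regex(
        @"\{\\header(?![a-zA-Z])(?>\\[{}\\]|\\|[^{}\\]+|\{(?<depth>)|\}(?<-depth>))*(?(depth)(?!))\}");

    // Returns the whole header group including its outer braces, or null.
    public static string FindHeaderGroup(string rtf)
    {
        Match m = HeaderGroup.Match(rtf);
        return m.Success ? m.Value : null;
    }
}
```

Applied to the example line above, this returns the entire {\header ... \xx3} group, nested sections included. Whether this is more maintainable than an explicit character loop is a judgment call.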
 
LVL 96

Expert Comment

by:Bob Learned
How do you think RegEx works? It does a complex, iterative character-by-character processing of the text, trying to find matches. What you probably need to do is to get the text between the braces, and push each section on a stack when the starting brace { is found. 2003 should have a Stack class that is not type-specific.
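
A minimal sketch of the stack idea without generics, using the System.Collections.Stack and ArrayList classes that exist in .NET 1.1. The class and method names (RtfGroupSplitter, SplitGroups) are assumptions for illustration. Instead of pushing text, it pushes the position of each opening brace, so every group can be returned whole, nested groups included, which is the grouping asked for earlier in the thread.

```csharp
using System;
using System.Collections;

// Hypothetical helper, not part of any library.
public class RtfGroupSplitter
{
    // Returns the content of every brace-delimited group (without its outer
    // braces) in the order the groups close, so an inner group appears
    // before the group that contains it.
    public static ArrayList SplitGroups(string rtf)
    {
        ArrayList groups = new ArrayList();
        Stack openPositions = new Stack(); // boxed indexes of unmatched '{'

        for (int i = 0; i < rtf.Length; i++)
        {
            char ch = rtf[i];
            if (ch == '\\')
            {
                // Skip the character after a backslash so that the escaped
                // braces \{ and \} are not treated as group delimiters.
                i++;
            }
            else if (ch == '{')
            {
                openPositions.Push(i);
            }
            else if (ch == '}' && openPositions.Count > 0)
            {
                int start = (int)openPositions.Pop();
                groups.Add(rtf.Substring(start + 1, i - start - 1));
            }
        }
        return groups;
    }
}
```

For the example line with \header, \section1, \section2 and \section3, this yields four entries, the last of which is the complete header group. A caller can then scan the list for the entry that starts with \header or \footer.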
 

Author Comment

by:gvector1
So my logic is that I will search for the beginning of the section I need (IE: {\header). Then I will move character by character looking for the ending brace, ignoring opening and closing braces that match, until I find the matching brace for that section. What do you think?
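
The logic described above can be sketched as a depth counter. The names RtfGroupMover, FindMatchingBrace and MoveGroupToEnd are hypothetical, and the sketch makes a few assumptions: the marker (for example "{\footer") occurs once, the last '}' in the text is the one that closes the document, and a prefix match such as {\footerl is acceptable or absent. It is written without generics so it should also compile under .NET 1.1.

```csharp
using System;

// Hypothetical helper, not part of any library.
public class RtfGroupMover
{
    // Returns the index of the '}' that closes the '{' at openIndex,
    // or -1 if the braces are unbalanced.
    public static int FindMatchingBrace(string rtf, int openIndex)
    {
        int depth = 0;
        for (int i = openIndex; i < rtf.Length; i++)
        {
            char ch = rtf[i];
            if (ch == '\\')
            {
                i++; // skip the escaped character, e.g. \{ or \}
            }
            else if (ch == '{')
            {
                depth++;
            }
            else if (ch == '}')
            {
                depth--;
                if (depth == 0) return i;
            }
        }
        return -1;
    }

    // Cuts the first group that starts with marker and re-inserts it
    // just before the last '}' of the text.
    public static string MoveGroupToEnd(string rtf, string marker)
    {
        int start = rtf.IndexOf(marker);
        if (start < 0) return rtf;

        int end = FindMatchingBrace(rtf, start);
        if (end < 0) return rtf;

        string group = rtf.Substring(start, end - start + 1);
        string rest = rtf.Remove(start, group.Length);

        int docEnd = rest.LastIndexOf('}');
        if (docEnd < 0) return rtf;
        return rest.Insert(docEnd, group);
    }
}
```

A call such as MoveGroupToEnd(text, "{\\footer") moves the footer group to the end of the document body. The \footer keyword itself is left in place, so it would still need to be removed separately.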
 
LVL 96

Expert Comment

by:Bob Learned
Sounds like one direction that may get you where you want to be.
 

Author Comment

by:gvector1
Okay, I seem to be able to do most of this by following the logic stated above, but I have run into a problem. When dealing with tables, if a cell contains multiple lines, WordPad displays it with no problem, but if I load it into a RichTextBox, the view is all corrupted. I will attach another test document to show the difference. View it with WordPad and then load it into a RichTextBox and see the difference. Are there any suggestions on this matter?
 

Author Comment

by:gvector1
Sorry, forgot the file.
TestDoc.doc
 

Author Comment

by:gvector1
Any recommendations?
 
LVL 96

Expert Comment

by:Bob Learned
I have been on vacation for a week, so I didn't get a chance to even think about this one.  RTF is very difficult to parse, as far as I am concerned, so while I started to get something that works, it continually gets pushed into the background.  It is nice to get the old code out, and look at it every once in a while to see if there is a course of action.  If you give me a little time to think about this, hopefully we can find a solution.
 
LVL 96

Expert Comment

by:Bob Learned
This last RTF file doesn't have a 'header' or 'footer' section.
 

Author Comment

by:gvector1
Yes, that was to simplify the table formatting issue. I was able to get the header put in the correct place in the RTF, but the formatting of the table is as stated above. Just trying to break the problem down into simpler pieces. Hope you enjoyed your vacation, and thanks again for your assistance.
 
LVL 96

Expert Comment

by:Bob Learned
Ok, I am trying to get back on the problem, so I guess I missed the change.

"When dealing with tables, if a cell has multiple lines contained within, wordpad views it with no problem, but if I were to load it within a richtextbox, the view is all corrupted."

Can you attach a .png screenshot, so that I can see what you mean?
 

Author Comment

by:gvector1
Here is a screenshot of the difference between opening the same file in WordPad and loading it into a RichTextBox.
ScreenShot.png
 
LVL 96

Expert Comment

by:Bob Learned
Ok, I don't think there is anything I can do about the RichTextBox control not recognizing the carriage returns in the lines.
 
LVL 96

Expert Comment

by:Bob Learned
Unfortunately, the RichTextBox does not fully support the RTF specification and does some really strange things with the contents.
 
LVL 96

Accepted Solution

by:Bob Learned (earned 500 total points)
Try this:


// Source:
// http://www.dotnetjunkies.com/WebLog/johnwood/archive/2006/07/04/transparent_richtextbox.aspx

// It seems there are 4 versions of the RichEdit control out there - when I'm talking about the
// RichEdit control, I'm talking about the C DLL that either comes with Windows or some version
// of Office. The files are named either RICHEDXX.DLL (XX is the version number), or MSFTEDIT.DLL
// and they're in the System32 folder.

// .Net RichTextBox control is bound to version 2. The biggest problem with this version (at least
// for me) is that it does not render properly if you try to make the window transparent. Later versions,
// however, do.

// We can fix that. If you create a control deriving from the original RichTextBox control, but overriding
// the CreateParams property, you can put in a new Windows class name (this is the window class name,
// nothing to do with classes in the C# sense). This effectively gives us a free upgrade. When the .Net
// RichTextBox control instantiates, it will now use the latest RichEdit control and not the old, archaic,
// version 2.

// There are other benefits too - version 3 and beyond of the RichEdit control support quite an extensive
// array of layout features, such as tables and full text justification. This is the version of the RichEdit
// that WordPad uses in Windows XP. To really see what it's capable of displaying you can create documents in
// Word and save them in RTF, load these into the new RichEdit and in a lot of cases it'll look identical,
// it's that powerful. A full list of features can be found here:
// http://msdn.microsoft.com/library/default.asp?url=/library/en-us/shellcc/platform/commctls/richedit/richeditcontrols/aboutricheditcontrols.asp

// There are a couple of caveats:
//
// 1. The control that this is bound to was shipped with Windows XP, and so this code won't work in
//    Windows 2000 or earlier.
//
// 2. The RichTextBox control in C# only knows about version 2, so the interface doesn't include
//    all the new features. You can wrap a few of the features yourself through new methods on the
//    RichEdit class.

using System;
using System.Runtime.InteropServices;
using System.Windows.Forms;

public class RichEdit : RichTextBox
{
    [DllImport("kernel32.dll", CharSet = CharSet.Auto)]
    private static extern IntPtr LoadLibrary(string lpFileName);

    protected override CreateParams CreateParams
    {
        get
        {
            CreateParams parameters = base.CreateParams;
            if (LoadLibrary("msftedit.dll") != IntPtr.Zero)
            {
                parameters.ExStyle |= 0x020; // transparent
                parameters.ClassName = "RICHEDIT50W";
            }
            return parameters;
        }
    }
}
 

Author Closing Comment

by:gvector1
Thanks TheLearnedOne,

That has saved my life. My final resolution will be to move the header and footer into the document itself and use a custom RichTextBox, backed by the RichEdit control in msftedit.dll, so it can display tables correctly.

Thanks a million,
Kendal
 

Author Comment

by:gvector1
TheLearnedOne,

I have run into one problem I was hoping you could help me with. When using the RICHEDIT50W class, it seems like every time I check the SelectionFont property it is null, regardless of whether any text is selected or not. Do you have any suggestions on this?

Thanks,
Kendal
 

Author Comment

by:gvector1
Is there any way to view the source code of the RichEdit controls so I can see how the SelectionFont property is getting populated?
