Convert RTF to HTML and HTML to RTF

DanRollins
This article describes a technique for converting RTF (Rich Text Format) data to HTML and provides C++ source that does it all in just a few lines of code.
Fig. 1: GUI version illustrates the conversion

Although RTF is coming to be considered a "legacy" format, it is still in common use, possibly because of the ease with which a programmer can drop a Rich Edit control onto a form and let the user set fonts, colors, and text formatting, and even embed objects such as pictures and tables.  Also, Windows comes with (and always has, and probably always will come with) the WordPad application for creating, viewing, and modifying RTF data.

But HTML is the lingua franca of the day, especially for email communications.  

Anyway, it's not unusual to have an RTF file and need a way to generate HTML from it.  Office Word automation can do that pretty easily, but you can't be certain that Word will be installed on your customer's computer.

There are some utility programs out there that try to convert the RTF tokens directly into HTML, but they are usually very limited; they tend to fail when the RTF data is at all complex (e.g., they can make text bold or italic, but that's about all).

An Easier Easy Way?

In searching for a simple, clean solution, I noticed that WordPad was easily able to convert HTML fragments (e.g., as copied from a browser with Ctrl+C) into RTF.  I wondered if the reverse was true... Could I paste RTF from the clipboard into an HTML editor window?

In fact, WordPad and the underlying Rich Edit 2.0 control will automatically place HTML data on the clipboard along with the raw RTF and the text-only data.

So there is a simple solution:
Load the RTF into a Rich Edit Control, copy it all, and paste it into an Html Edit Control, then save it out to a file or otherwise grab the HTML source text.

The Html Edit Control is relatively new... It's supported in .NET Windows Forms, and in MFC in Visual Studio 2008 and later.  Anyway, it's basically just a WebBrowser control with the "Allow Editing" property turned on, so it's no big deal.  For a comprehensive look at how to use the control, see "Use CHtmlEditCtrl to Create a Simple HTML Editor" (listed in the References).

Here's a complete C++ Console Application that will do the RTF-to-HTML conversion:
// Rtf2Html.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include "Rtf2Html2.h"
#include "afxhtml.h"

CWinApp theApp;  // Win32 App with MFC support

CRichEditCtrl g_ctlRichEdit;  // the two controls
CHtmlEditCtrl g_ctlEditHtml;

// Stream-in callback: the Rich Edit control calls this to read the file
static DWORD CALLBACK
MyStreamInCallback(DWORD_PTR dwCookie, LPBYTE pbBuff, LONG cb, LONG *pcb)
{
    CFile* pFile = (CFile*) dwCookie;   // cookie carries the open CFile
    *pcb = pFile->Read(pbBuff, cb);     // report how many bytes were read
    return 0;                           // 0 means "no error, keep going"
}

void LoadRtfFile( LPCTSTR pszFilename )
{
    TCHAR szFilter[] = _T("RTF files (*.rtf)|*.rtf|")
                       _T("All Files (*.*)|*.*||");
    CFileDialog dlg(TRUE,0,pszFilename,OFN_HIDEREADONLY|OFN_OVERWRITEPROMPT,szFilter );
    if ( dlg.DoModal()!=IDOK ) {
        return;
    }
    CFile cf( dlg.GetPathName(),CFile::modeRead );

    // CFile cf( pszFilename,CFile::modeRead );
    EDITSTREAM es = { 0 };
    es.dwCookie = (DWORD_PTR)&cf;   // pointer-sized, so safe in 64-bit builds
    es.pfnCallback = MyStreamInCallback;
    g_ctlRichEdit.StreamIn( SF_RTF, es ); // load from the file
}

int _tmain(int argc, TCHAR* argv[], TCHAR* envp[])
{
    AfxWinInit(::GetModuleHandle(NULL), NULL, ::GetCommandLine(), 0);
    AfxInitRichEdit2(); // needed for using CRichEditCtrl

    CWnd* pWnd = CWnd::GetDesktopWindow();
    CRect r(0,0,200,200);

    g_ctlRichEdit.Create( ES_MULTILINE, r, pWnd, 1111);
    g_ctlEditHtml.Create( 0,0, r, pWnd, 2222 );

    LoadRtfFile( argc > 1 ? argv[1] : NULL ); // read the RTF file into the ctrl
    g_ctlRichEdit.SetSel(0,-1); // select all in the RTF ctrl
    g_ctlRichEdit.Copy();       // copy to clipboard
    g_ctlEditHtml.Paste();      // paste into the Html Edit ctrl
    g_ctlEditHtml.SaveAs( _T("C:\\temp\\test.html") );  // save HTML

    return 0;
}



What's Going On?

All of the magic takes place on line 51, during the call to g_ctlRichEdit.Copy().  If you use a clipboard spy, you'll see that the Rich Edit control has placed HTML on the clipboard at that point.  I was unable to locate an API call that would do the conversion directly, but this clipboard-based technique is so simple to implement that I stopped looking.
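
If you would rather pull the HTML straight off the clipboard than paste it into an Html Edit control, it helps to know how such data is usually laid out. The sketch below assumes the Rich Edit control uses the registered "HTML Format" clipboard format (often called CF_HTML), in which a short text header gives byte offsets (StartFragment, EndFragment) into the payload. The article does not confirm which format the control uses, so check with a clipboard spy first. ExtractHtmlFragment and ReadOffset are hypothetical helper names, and the code is plain standard C++ that works on a payload you have already read from the clipboard.

```cpp
#include <cstddef>
#include <string>

// Reads the decimal number that follows `key` (e.g. "StartFragment:").
// Returns false if the key is missing or is not followed by a digit.
static bool ReadOffset(const std::string& payload, const std::string& key,
                       std::size_t& value)
{
    std::size_t pos = payload.find(key);
    if (pos == std::string::npos) return false;
    pos += key.size();

    value = 0;
    bool anyDigit = false;
    while (pos < payload.size() && payload[pos] >= '0' && payload[pos] <= '9') {
        value = value * 10 + static_cast<std::size_t>(payload[pos] - '0');
        ++pos;
        anyDigit = true;
    }
    return anyDigit;
}

// Assumption: `payload` is the raw "HTML Format" clipboard text.
// Copies the bytes between the StartFragment and EndFragment offsets
// into `fragment`; returns false if the header is missing or inconsistent.
bool ExtractHtmlFragment(const std::string& payload, std::string& fragment)
{
    std::size_t start = 0, end = 0;
    if (!ReadOffset(payload, "StartFragment:", start)) return false;
    if (!ReadOffset(payload, "EndFragment:", end)) return false;
    if (start > end || end > payload.size()) return false;

    fragment = payload.substr(start, end - start);
    return true;
}
```

The offsets are byte counts from the start of the payload, so the function does no character-set conversion; the fragment comes back in whatever encoding the payload used.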

The HTML is a "valid HTML fragment."  For instance, the RTF you can see in Fig. 1 produces the following HTML, which looks the same on the clipboard right after the copy, after pasting into the HTML editor, and in the output file:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META content="text/html; charset=unicode" http-equiv=Content-Type>
<META name=GENERATOR content="MSHTML 8.00.7600.16535"></HEAD>
<BODY><FONT size=3>
<P>This is a </FONT><B><FONT color=#ff0000 size=3><FONT color=#ff0000 
size=3>test </B></FONT></FONT><FONT size=3>This </FONT><B><I><FONT 
size=6>is</B></I></FONT><FONT size=3> a </FONT><B><I><FONT size=4 
face="Arial Black"><FONT size=4 
face="Arial Black">Test</P></B></I></FONT></FONT><FONT size=3></FONT>
<TABLE dir=ltr border=1 cellSpacing=1 borderColor=#000000 cellPadding=7 
width=157>
  <TBODY>
  <TR>
    <TD width="58%"><FONT size=3>
      <P>1a</FONT></P></TD>
    <TD width="42%"><FONT size=3>
      <P>1b</FONT></P></TD></TR>
  <TR>
    <TD width="58%"><FONT size=3>
      <P>2a</FONT></P></TD>
    <TD width="42%"><B><FONT color=#9b00d3 size=3 face=Broadway><FONT 
      color=#9b00d3 size=3 face=Broadway><FONT color=#9b00d3 size=3 
      face=Broadway>
      <P>2b</B></FONT></FONT></FONT></P></TD></TR></TBODY></TABLE><FONT color=#ff0000 
size=3><FONT color=#ff0000 size=3></FONT></FONT><FONT size=3>
<P>end of test</P></FONT></BODY></HTML>


Note that in addition to setting the fonts and coloring, it has correctly created and populated a <TABLE> element.  (Incidentally, I used Office Word to generate the RTF because WordPad does not support creating tables.)  In other words, the HTML conversion appears to be quite robust and complete.

Also note the <META> tag on line 4, which indicates that MSHTML was used to generate the HTML.  Perhaps that's a clue to avoiding the clipboard operations.  Like I said, I stopped researching when I found this simple clipboard-based solution.

HTML to RTF

The opposite conversion also works, though it's not shown in the sample programs.  If you copy HTML to the clipboard and paste it into a Rich Edit control, the result is RTF data.  You can pull it directly from the clipboard, or use the StreamOut function to save the RTF data.
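
To put HTML on the clipboard from your own code rather than copying it from a browser, one approach is to wrap it in the "HTML Format" (CF_HTML) envelope, whose header lists byte offsets to the document and to the fragment. The sketch below only builds that payload string. BuildCfHtml, MakeHeader and Pad10 are hypothetical helpers, not part of the sample project, and the fragment is assumed to be UTF-8 already. Whether the Rich Edit control accepts this format on paste is the article's observation, so treat the round trip as something to verify.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Formats n as a zero-padded 10-digit decimal so the header length is fixed.
static std::string Pad10(std::size_t n)
{
    char buf[32];
    std::snprintf(buf, sizeof buf, "%010lu", static_cast<unsigned long>(n));
    return buf;
}

// Builds the CF_HTML text header for the given byte offsets.
static std::string MakeHeader(std::size_t startHtml, std::size_t endHtml,
                              std::size_t startFrag, std::size_t endFrag)
{
    return "Version:0.9\r\nStartHTML:" + Pad10(startHtml) +
           "\r\nEndHTML:" + Pad10(endHtml) +
           "\r\nStartFragment:" + Pad10(startFrag) +
           "\r\nEndFragment:" + Pad10(endFrag) + "\r\n";
}

// Wraps an HTML fragment (assumed UTF-8) in a CF_HTML payload.
std::string BuildCfHtml(const std::string& fragment)
{
    const std::string pre  = "<html><body>\r\n<!--StartFragment-->";
    const std::string post = "<!--EndFragment-->\r\n</body></html>";

    // Every offset is padded to the same width, so a dummy header
    // has the same length as the real one.
    const std::size_t startHtml = MakeHeader(0, 0, 0, 0).size();
    const std::size_t startFrag = startHtml + pre.size();
    const std::size_t endFrag   = startFrag + fragment.size();
    const std::size_t endHtml   = endFrag + post.size();

    return MakeHeader(startHtml, endHtml, startFrag, endFrag) +
           pre + fragment + post;
}
```

On Windows, the resulting bytes would then go into a global memory block passed to SetClipboardData, using the format id returned by RegisterClipboardFormat for the name "HTML Format".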

Sample Project

Here's a link to a complete Visual Studio 2008 project file that includes a GUI that you can use to experiment with this technique:

          Rtf2HtmlProj ZIP file https://filedb.experts-exchange.com/incoming/ee-stuff/7925-Rtf2HtmlProj.zip

Notes:

You will see some of the same anomalies in the RTF conversion that you see when pasting into Outlook Express or Windows Live Mail.  For instance, line ends are converted to <P> tags, which results in line spacing that is greater than in the original RTF.  The font typeface may be somewhat off, especially if the RTF uses the "default" font (whatever that is).  Background coloring does not seem to be supported, and the TABLE conversion also has some limitations.
Picture objects don't make it across to the HTML.  The HTML would need an IMG source address, but the RTF data contains the picture as embedded data.  It would take a bit of sophisticated coding to extract the image, save it to disk, and set up the <IMG> tag.
In the attached project, I use an object that I derived from CHtmlEditCtrl.  But it boils down to a thin wrapper that provides a simple means to stick it into a dialog box without the support of the VS Class Wizard.  I understand that VS 2010 provides more direct support for the control.
About the above Console App source:  
My goal was to show the sequence as simply as possible.  If you try to bypass the "Open File" and "Save As" dialogs, the sequence will fail.  That's because in a Console App, there is no message pump that would allow the two control windows to be created normally.  If you want to create a command-line utility that does the conversion without interaction, I suggest that you start with a dialog-based app (as in the attached project) and just automate the process of getting the two filenames from the command line.
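
As a starting point for the picture extraction mentioned in the notes above, here is a rough sketch that pulls the hex-encoded bytes out of the first \pict group in an RTF string. It assumes the picture is stored as a plain \pict group with hex data (which I believe is how Word usually writes it), and it ignores binary \bin data and OLE-embedded objects. ExtractFirstPict and HexValue are hypothetical helpers, not part of the sample project; saving the bytes to disk and writing the <IMG> tag are left out.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Returns 0..15 for a hex digit, or -1 for anything else.
static int HexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decodes the hex payload of the first \pict group found in `rtf`.
// Returns false if no \pict group exists or the hex data is empty or odd-length.
bool ExtractFirstPict(const std::string& rtf, std::vector<unsigned char>& bytes)
{
    bytes.clear();
    std::size_t i = rtf.find("\\pict");
    if (i == std::string::npos) return false;
    i += 5;               // skip past "\pict"

    int depth = 0;        // nesting level inside the \pict group
    int high = -1;        // pending high nibble, or -1 if none
    while (i < rtf.size()) {
        const char c = rtf[i];
        if (c == '{') {
            ++depth; ++i;
        } else if (c == '}') {
            if (depth == 0) break;        // end of the \pict group
            --depth; ++i;
        } else if (c == '\\') {
            // Skip a control word: letters, optional signed number, one space.
            ++i;
            while (i < rtf.size() && std::isalpha(static_cast<unsigned char>(rtf[i]))) ++i;
            if (i < rtf.size() && rtf[i] == '-') ++i;
            while (i < rtf.size() && std::isdigit(static_cast<unsigned char>(rtf[i]))) ++i;
            if (i < rtf.size() && rtf[i] == ' ') ++i;
        } else if (depth == 0 && HexValue(c) >= 0) {
            if (high < 0) {
                high = HexValue(c);
            } else {
                bytes.push_back(static_cast<unsigned char>(high * 16 + HexValue(c)));
                high = -1;
            }
            ++i;
        } else {
            ++i;          // whitespace, line breaks, or text in nested groups
        }
    }
    return high < 0 && !bytes.empty();
}
```

The \pngblip or \jpegblip control word inside the group tells you which file extension to use, though this sketch skips over it rather than reporting it.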

Summary:

I searched far and wide for a utility program that would convert RTF to HTML, and could find nothing that was satisfactory.  But it turned out that the ability is built into standard Windows controls and that a simple trick with the clipboard is all that's needed.

References:

   Use CHtmlEditCtrl to Create a Simple HTML Editor

    CHtmlEditCtrl Class  (MFC)
    http://msdn.microsoft.com/en-us/library/h14ht0dh(VS.80).aspx 

    CRichEditCtrl Class  (MFC)
    http://msdn.microsoft.com/en-us/library/76a787xf(VS.80).aspx 

