• Status: Solved

Delphi TEmbeddedWB/TWebBrowser, remove JavaScript from Document Source

Is it possible to get the exact contents of a TEmbeddedWB/TWebBrowser component, as rendered? This means that any JavaScript needs to be excluded, and most certainly any JavaScript that may change the page.

For example, if you go to http://www.cnn.com, you'll receive a version of the page. If you refresh the page, some of the images are rotated (ads, etc). What I'm looking to do is get the exact rendered version of the page, so I can load the contents from a file or string at a later time and have the same data that was there before (but without the JavaScript).

The page must be successfully rendered at least once. This allows the page to get its appearance. Once it has rendered, I want the rendered contents version (which will be different than the Document.DocumentSource version).

I've tried a call like this:

function GetContentWithoutScripts(ASource: IHTMLDocument2): String;
var
  colScripts: IHTMLElementCollection;
  myScript: IHTMLScriptElement;
  slData: TStringList;
  x: Integer;
begin
  // Blank out the text of every script element in the loaded document.
  colScripts := ASource.scripts as IHTMLElementCollection;
  for x := 0 to colScripts.length - 1 do
  begin
    myScript := colScripts.item(x, '') as IHTMLScriptElement;
    myScript.Text := '';
  end;

  // Save the document to a string list and return it as one string.
  slData := TStringList.Create();
  try
    SaveDocToStrings(ASource as IDispatch, TStrings(slData));
    Result := slData.Text;
  finally
    slData.Free;
  end;
end;

So after the page is rendered, it gets rid of all the JavaScript data and returns the contents of the document. However, the assignment myScript.Text := '' doesn't seem to update the document contents that get saved, as slData.Text still comes back with all the JavaScript in it. If I loop through the scripts a second time, they're all blank.

How do I get an exact rendered copy? I completely understand that some functionality will be lost with this method, and I'm fine with that. For example, onMouseOver events that call a function in a script block will no longer function. I do not need page functionality, I only need appearance.

I would also like a method to remove script data from intrinsic events. That is not covered in the test code above, as it wouldn't fit there and isn't found under document.scripts.
1 Solution
You will need to use SaveDocToStrings and parse out the JavaScript. Search for <script> ... </script> using compiler-style token recognition. As you read from your in stream, write to your out stream only when you are not in a script area. Be aware that script tags are not the only way to mark scripting. While I do not do JavaScript often, I believe there is a shortcut something like <% %>, though as far as I know those delimiters are server-side (ASP/JSP) and should not appear in what the browser receives; intrinsic event attributes such as onclick and javascript: URLs are the other places to look. Take them all into account and you will have your stripped version.

In general, you start by skipping everything that is not a <. Then scan ahead to see whether it begins one of the tags you are looking for. If it does, you are in a code section: set a Boolean variable to indicate that nothing should be written to your out stream. Continue reading until you find the matching end tag, then reset the variable.
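
Here is a minimal sketch of that idea, assuming the page source has already been captured into a string (for example from slData.Text). The function name StripScriptBlocks is made up for illustration. Instead of reading character by character with a Boolean flag, it uses PosEx from StrUtils to jump from one tag to the next, but the rule is the same: copy text only while outside a script area. It does not handle intrinsic event attributes or javascript: URLs, and it would be fooled by the text "<script" appearing inside a comment or attribute value, so treat it as a starting point only.

```pascal
program StripScriptsDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, StrUtils;

// Hypothetical helper: returns AHtml with every <script ...>...</script>
// block removed. Tag matching is case-insensitive. If a <script> tag is
// never closed, everything after it is dropped.
function StripScriptBlocks(const AHtml: string): string;
var
  LowerHtml: string;
  CopyFrom, OpenPos, ClosePos, GtPos: Integer;
begin
  Result := '';
  // LowerCase only touches A..Z, so positions match the original string.
  LowerHtml := LowerCase(AHtml);
  CopyFrom := 1;
  repeat
    OpenPos := PosEx('<script', LowerHtml, CopyFrom);
    if OpenPos = 0 then
    begin
      // No more script areas: copy the remainder and stop.
      Result := Result + Copy(AHtml, CopyFrom, MaxInt);
      Break;
    end;
    // Copy the text that precedes the script area.
    Result := Result + Copy(AHtml, CopyFrom, OpenPos - CopyFrom);
    // Skip ahead to the end of the matching closing tag.
    ClosePos := PosEx('</script', LowerHtml, OpenPos);
    if ClosePos = 0 then
      Break;
    GtPos := PosEx('>', LowerHtml, ClosePos);
    if GtPos = 0 then
      Break;
    CopyFrom := GtPos + 1;
  until CopyFrom > Length(AHtml);
end;

begin
  Writeln(StripScriptBlocks(
    '<p>a</p><SCRIPT type="text/javascript">var i = 1;</script><p>b</p>'));
end.
```

In the original function, this would be applied to slData.Text before assigning Result.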

Let me know if you need more.
Forced accept.

EE Admin