  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1444

Delphi TEmbeddedWB/TWebBrowser, remove JavaScript from Document Source

Is it possible to get the exact contents of a TEmbeddedWB/TWebBrowser component, as rendered? This means that any JavaScript needs to be excluded, and most certainly any JavaScript that may change the page.

For example, if you go to http://www.cnn.com, you'll receive a version of the page. If you refresh the page, some of the images are rotated (ads, etc). What I'm looking to do is get the exact rendered version of the page, so I can load the contents from a file or string at a later time and have the same data that was there before (but without the JavaScript).

The page must be successfully rendered at least once. This allows the page to get its appearance. Once it has rendered, I want the rendered contents version (which will be different than the Document.DocumentSource version).

I've tried a call like this:

function GetContentWithoutScripts(ASource: IHTMLDocument2): String;
var
  colScripts: IHTMLElementCollection;
  myScript: IHTMLScriptElement;
  slData: TStringList;
  x: Integer;
begin
  // Blank out the text of every script element in the document
  colScripts := ASource.scripts as IHTMLElementCollection;
  for x := 0 to colScripts.length - 1 do
  begin
    myScript := colScripts.item(x, '') as IHTMLScriptElement;
    myScript.text := '';
  end;

  // Save the document and return its contents as a string
  slData := TStringList.Create();
  try
    SaveDocToStrings(ASource as IDispatch, TStrings(slData));
    Result := slData.Text;
  finally
    slData.Free;
  end;
end;

So after the page is rendered, this clears the text of every script element and returns the contents of the document. However, the myScript.text := ''; assignment doesn't seem to update what gets saved, as slData.Text still contains all the JavaScript. If I loop through the scripts a second time, they're all blank.

How do I get an exact rendered copy? I completely understand that some functionality will be lost with this method, and I'm fine with that. For example, onMouseOver events that call a function in a script block will no longer function. I do not need page functionality, I only need appearance.

I also need a method to remove script from intrinsic event attributes. The test code above does not cover this, since those handlers are not part of document.scripts.
1 Solution
You will need to use SaveDocToStrings and parse out the JavaScript yourself. Search for <script> </script> using compiler-style token recognition. As you read from your in stream, write to your out stream only when you are not in a script area. Be aware that script tags are not the only way to mark scripting: intrinsic event attributes such as onMouseOver also carry script, as do javascript: URLs. (Delimiters like <% %> mark server-side code such as ASP, so they should not appear in what the browser receives.) Take them all into account and you will have your stripped version.

In general, you start by skipping everything that is not a <. Then scan ahead and see whether it is one of the tags you are looking for. If it is, you are IN a code section: set a boolean variable to indicate that nothing should be written to your out stream. Continue reading until you find the matching end tag, then reset the boolean variable.
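
A minimal sketch of that scan, written as a small console program so it can be tried on its own. StripScriptBlocks and FindFrom are hypothetical names, not part of EmbeddedWB or MSHTML. It assumes the page source is already in a string (for example, the slData.Text value from the question), and it only handles <script> ... </script> blocks, not intrinsic event attributes. The matching is purely textual, so any tag whose name merely starts with "script" would also be treated as a script block.

```pascal
program StripScriptsDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns Html with every <script ...>...</script>
// block removed. Matching is case-insensitive and purely textual.
function StripScriptBlocks(const Html: string): string;
var
  Lower: string;
  Cur, OpenPos, ClosePos, EndPos: Integer;

  // Finds Sub in Lower starting at 1-based index From; returns 0 if absent.
  function FindFrom(const Sub: string; From: Integer): Integer;
  var
    P: Integer;
  begin
    P := Pos(Sub, Copy(Lower, From, MaxInt));
    if P = 0 then
      Result := 0
    else
      Result := P + From - 1;
  end;

begin
  Result := '';
  Lower := LowerCase(Html);  // lowercase copy used only for searching
  Cur := 1;
  while Cur <= Length(Html) do
  begin
    OpenPos := FindFrom('<script', Cur);
    if OpenPos = 0 then
      Break;  // no more script blocks
    // Not in a script area: copy everything before the opening tag
    Result := Result + Copy(Html, Cur, OpenPos - Cur);
    // In a script area: skip ahead to the matching end tag
    ClosePos := FindFrom('</script', OpenPos);
    if ClosePos = 0 then
      EndPos := 0
    else
      EndPos := FindFrom('>', ClosePos);
    if EndPos = 0 then
      Cur := Length(Html) + 1  // unterminated block: drop the rest
    else
      Cur := EndPos + 1;       // resume just after the closing tag
  end;
  // Copy whatever follows the last script block
  Result := Result + Copy(Html, Cur, MaxInt);
end;

begin
  WriteLn(StripScriptBlocks(
    '<p>a</p><SCRIPT type="x">var a=1;</Script><p>b</p>'));
  // prints <p>a</p><p>b</p>
end.
```

The output is built by concatenation to keep the sketch short. For large pages a TStringBuilder or a stream would likely be a better fit, which is closer to the in stream / out stream description above.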

Let me know if you need more.
Forced accept.

EE Admin