Solved

Download webpage and strip out everything except text

Posted on 2006-11-21
Last Modified: 2010-04-05
Hi,

I'm looking for some code that will download the HTML of a web page (no images) and then strip out everything except the text. By text I mean the sentences.

Thanks
Question by:zattz

Accepted Solution

by:TName (earned 500 total points)
ID: 17985952
Hi,
here is a very simple example using TWebBrowser. It writes the page text to C:\Test.txt:


uses {...}  SHDocVw, mshtml;

{Main form declaration section}  
private
 procedure DocComplete(Sender: TObject; const pDisp: IDispatch; var URL: OleVariant);


{...}

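procedure TForm1.Button1Click(Sender: TObject);
var
  wb: TWebBrowser;
begin
  wb := TWebBrowser.Create(nil);
  try
    wb.OnDocumentComplete := DocComplete;
    wb.ParentWindow := Self.Handle;
    wb.Navigate('www.google.com');
    // Busy can still be False right after Navigate returns, so wait on
    // ReadyState instead. DocComplete fires while messages are pumped here.
    while wb.ReadyState <> READYSTATE_COMPLETE do
      Application.ProcessMessages;
  finally
    wb.Free;
  end;
end;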

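procedure TForm1.DocComplete(Sender: TObject;
  const pDisp: IDispatch; var URL: OleVariant);
var
  doc: IHTMLDocument2;
  aText: string;
  fs: TFileStream;
begin
  // Document can be nil or a non-HTML object, so query the interface
  // with Supports (SysUtils) instead of hard-casting it.
  if (not Supports(TWebBrowser(Sender).Document, IHTMLDocument2, doc))
    or (doc.body = nil) then
    Exit;
  aText := doc.body.innerText;
  fs := TFileStream.Create('C:\Test.txt', fmCreate);
  try
    // Length counts characters, so multiply by SizeOf(Char) to get bytes.
    // Where string is Unicode, this writes UTF-16 without a byte order mark.
    if aText <> '' then
      fs.WriteBuffer(Pointer(aText)^, Length(aText) * SizeOf(Char));
  finally
    fs.Free;
  end;
end;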

Expert Comment

by:TName
ID: 17985965
And if you don't want the webbrowser to show up at all, you can say:

with wb do begin
     OnDocumentComplete:=DocComplete;
     ParentWindow:=Self.Handle;
     Left:=-500; //<-------------------Just an example. Not so nice, but it works...

Author Comment

by:zattz
ID: 17986172
or visible:=false ;)

Thanks for the help

Author Comment

by:zattz
ID: 18005842
By the way,

do you know if there is a way to filter out all the links before saving the text?
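
The thread has no reply to this question. One possible approach, sketched here under assumptions rather than taken from the accepted solution, is to delete the link elements from the loaded document before reading innerText. BodyTextWithoutLinks is a hypothetical helper name, and the sketch assumes that "filter out" means dropping each link together with its visible text. It relies on the MSHTML links collection (all A elements with an href, plus AREA elements) and on IHTMLDOMNode.removeNode.

```pascal
uses
  SysUtils, SHDocVw, MSHTML;

// Hypothetical helper (not from the thread): returns the body text of the
// page currently loaded in Browser, with every hyperlink removed first.
function BodyTextWithoutLinks(Browser: TWebBrowser): string;
var
  doc: IHTMLDocument2;
  links: IHTMLElementCollection;
  node: IHTMLDOMNode;
  i: Integer;
begin
  Result := '';
  if (not Supports(Browser.Document, IHTMLDocument2, doc))
    or (doc.body = nil) then
    Exit;
  links := doc.links;
  // The collection is live and shrinks as nodes are removed,
  // so walk it from the last item down to the first.
  for i := links.length - 1 downto 0 do
    if Supports(links.item(i, 0), IHTMLDOMNode, node) then
      node.removeNode(True); // True removes the link's child nodes (its text) too
  Result := doc.body.innerText;
end;
```

To use it with the example above, the innerText assignment in DocComplete could become aText := BodyTextWithoutLinks(TWebBrowser(Sender)). Passing False to removeNode would remove only the link element and keep its text in place, which leaves the saved text essentially unchanged.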

