Extract URL, <Title> & <Description>

Hello all,

I am using the following code to extract URLs from a page.
=========================================

procedure TForm1.ButtonGetClick(Sender: TObject);
var
  i: Integer;
  URL: OleVariant; // This comes from WebBrowser1.OnDocumentComplete (not used below).
  // Probably not the best way to do it (I do not think), but it works great.
begin
  Memo1.Lines.Clear;
  for i := 0 to WB1.OleObject.Document.links.Length - 1 do
    Memo1.Lines.Add(WB1.OleObject.Document.Links.Item(i));
end;

=========================================

Take the following page, for example:
http://directory.google.com/Top/Kids_and_Teens/Arts/

I would like to grab not only the URLs from this page
(the ones in the [Web Pages] area, not the [Categories] area),
but also the text that goes with each one.

Example:
the first URL is:
==========================================================
Web Gallery of Art - http://www.wga.hu/ 
Virtual museum of European painting and sculpture of the Gothic, Renaissance and Baroque periods (1100-1800). With commentaries on pictures, biographies of artists, and guided tours.
==========================================================

So, from the above, I would like to grab:

Web Gallery of Art <--   Title
http://www.wga.hu/  <-- URL
Virtual museum of European painting...  <-- Description


Here is the actual HTML source code for the entry above:
==========================================================
<FONT face=arial,sans-serif><A href="http://www.wga.hu/" target=_top>Web Gallery
of Art</A>&nbsp;<FONT color=#6f6f6f
size=-1>-&nbsp;<SPAN>http://www.wga.hu/</SPAN></FONT> <BR><FONT size=-1>Virtual
museum of European painting and sculpture of the Gothic, Renaissance and Baroque
periods (1100-1800). With commentaries on pictures, biographies of artists, and
guided tours.</FONT></FONT>
==========================================================


I also have this code:
http://www.experts-exchange.com/Programming/Programming_Languages/Delphi/Q_21153731.html

It grabs the URL and title for the [Categories] but not for the [Web Pages].

Any ideas on this one?

Thanks All;
Carrzkiss

Wayne Barron, Author, Web Developer Asked:

Russell Libby, Software Engineer, Advisory Commented:
Perhaps a slight improvement.

Russell

----

procedure TForm1.ButtonGetClick(Sender: TObject);
var  i, j, x:    Integer;
     ovTable:    OleVariant;
     szText:     String;
     szItems:    Array [0..2] of String;   // [0] = title, [1] = URL, [2] = description
     dwPos:      Integer;
begin

  Memo1.Lines.BeginUpdate;
  try
     Memo1.Lines.Clear;
     for x:=7 to WB1.OleObject.Document.all.tags('TABLE').Length - 4 do
     begin
        ovTable:=WB1.OleObject.Document.all.tags('TABLE').item(x);
        for i:=0 to (ovTable.Rows.Length - 1) do
        begin
           for j:=0 to (ovTable.Rows.Item(i).Cells.Length - 1) do
           begin
              szText:=ovTable.Rows.Item(i).Cells.Item(j).InnerText;
              if Length(Trim(szText)) = 0 then
                 Continue
              else
              begin
                 szItems[0]:=EmptyStr;
                 szItems[1]:=EmptyStr;
                 szItems[2]:=EmptyStr;
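                 // Site entries read "Title - http://url", then a line break, then the description.
                 dwPos:=Pos(' - http', LowerCase(szText));
                 if (dwPos > 0) then
                 begin
                    szItems[0]:=Trim(Copy(szText, 1, dwPos));      // title
                    Delete(szText, 1, dwPos + 2);                  // drop the title and ' - '
                    dwPos:=Pos(#13#10, szText);
                    if (dwPos > 0) then
                    begin
                       szItems[1]:=Trim(Copy(szText, 1, dwPos));   // URL
                       Delete(szText, 1, dwPos + 1);               // drop the URL and the line break
                       szItems[2]:=Trim(szText);                   // description
                    end
                    else
                       szItems[1]:=Trim(szText);                   // no line break: the rest is the URL
                 end;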
                 Memo1.Lines.Add(szItems[0]+'|'+szItems[1]+'|'+szItems[2]);
              end;
           end;
        end;
     end;
  finally
     Memo1.Lines.EndUpdate;
  end;

end;



Mohammed Nasman, Software Developer Commented:

Wayne Barron, Author, Web Developer (Author) Commented:
Nice.
I like the demo for:
[Extract Text]
(This would probably work if I could find a way to break the lines up,
and also to have it grab the text from a given URL instead of my physically copying and pasting it.)
I like this one better, but the next one might be good as well.

==========

& [Links Plugin]

I looked at the code for the "DIHtmlLinksPlugin.pas" Component.
And if I could somehow add in the information for:
==========================================================
<FONT face=arial,sans-serif><A href="http://www.wga.hu/" target=_top>Web Gallery
of Art</A>&nbsp;<FONT color=#6f6f6f
size=-1>-&nbsp;<SPAN>http://www.wga.hu/</SPAN></FONT> <BR><FONT size=-1>Virtual
museum of European painting and sculpture of the Gothic, Renaissance and Baroque
periods (1100-1800). With commentaries on pictures, biographies of artists, and
guided tours.</FONT></FONT>
==========================================================
Then maybe I could use this type of setup to do what I need.
I would need to look for:
URL <-- Already implemented in the code.

target=_top> </a>
<SPAN> </SPAN>

Other than that, any ideas
on how to use these components to do what I need?

Basically.
Type in a given URL
Extract links, Title, Description
From the page.
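
A rough sketch of the marker-based idea above, in case it helps anyone reading along. It assumes the page's HTML is already in a string (for example from WB1.OleObject.Document.body.innerHTML, which is an assumption about how you would feed it in). TextBetween is a hypothetical helper written for this example and is not part of DIHtmlLinksPlugin.pas. Plain string matching like this is brittle and only works while the markup keeps the shape shown above.

```pascal
program MarkerExtract;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns the trimmed text found between StartMark and
// the next EndMark in S, or '' when either marker is missing.
// The search ignores case, so '</a>' also matches '</A>'.
function TextBetween(const S, StartMark, EndMark: string): string;
var
  LowerS: string;
  StartPos, EndPos: Integer;
begin
  Result := '';
  LowerS := LowerCase(S);
  StartPos := Pos(LowerCase(StartMark), LowerS);
  if StartPos = 0 then
    Exit;
  Inc(StartPos, Length(StartMark)); // first character after the start marker
  // Look for the end marker only in the text that follows the start marker.
  EndPos := Pos(LowerCase(EndMark), Copy(LowerS, StartPos, MaxInt));
  if EndPos = 0 then
    Exit;
  Result := Trim(Copy(S, StartPos, EndPos - 1));
end;

const
  // One entry from the page, joined onto a single line for the example.
  Sample =
    '<FONT face=arial,sans-serif><A href="http://www.wga.hu/" target=_top>' +
    'Web Gallery of Art</A>&nbsp;<FONT color=#6f6f6f size=-1>-&nbsp;' +
    '<SPAN>http://www.wga.hu/</SPAN></FONT> <BR><FONT size=-1>' +
    'Virtual museum of European painting...</FONT></FONT>';
begin
  WriteLn('Title: ', TextBetween(Sample, 'target=_top>', '</a>'));
  WriteLn('URL:   ', TextBetween(Sample, '<SPAN>', '</SPAN>'));
  WriteLn('Desc:  ', TextBetween(Sample, '<BR><FONT size=-1>', '</FONT>'));
end.
```

For a whole page you would have to call it repeatedly on the remainder of the string after each match, and line breaks inside a title (as in the real source above) would still need to be collapsed.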

Wayne Barron, Author, Web Developer (Author) Commented:
mnasman;

If you have any suggestions on my last comment, please let me know.
If not, I am going to close this question and open another one.
(Hopefully you can assist on this some more.)

Wayne Barron, Author, Web Developer (Author) Commented:
mnasman.
Never mind about the HTMLParser (http://www.zeitungsjunge.de/delphi/htmlparser/).
I had forgotten that it is shareware.
Right now, it is not in the budget for something like this.
Hopefully one day soon, as it does seem to be a great component to use.


(If you cannot think of anything else, I am going to do a PAQ/Refund on this one.)

Take Care and thanks.

Wayne Barron, Author, Web Developer (Author) Commented:
mnasman.

I figured this one out on my own.
But I need a better way than writing code for every single table on the page,
starting at Table #7 through Table #9 in this example. (In my case, I am going all the way to Table #30.)

So, anyway, this is the code that I am using now, which, I should say, works pretty dag-on good.
And pretty fast.

Take Care
I am going to PAQ/Refund this one.
Carrzkiss

(Grabs Tables 7,8 & 9 from a page.)
==============================================
procedure TfrmExtractText.ButtonGetClick(Sender: TObject);
var
  i, j: integer;
  ovTable, ovTable1, ovTable2: OleVariant;
begin
  // Table 7 (the posted code read item(8) here, which added table 8 twice)
  ovTable2 := WB1.OleObject.Document.all.tags('TABLE').item(7);
  for i := 0 to (ovTable2.Rows.Length - 1) do
  begin
    for j := 0 to (ovTable2.Rows.Item(i).Cells.Length - 1) do
    begin
      memo1.Lines.Add(ovTable2.Rows.Item(i).Cells.Item(j).InnerText);
    end;
  end;

  ovTable := WB1.OleObject.Document.all.tags('TABLE').item(8);
  for i := 0 to (ovTable.Rows.Length - 1) do
  begin
    for j := 0 to (ovTable.Rows.Item(i).Cells.Length - 1) do
    begin
      memo1.Lines.Add(ovTable.Rows.Item(i).Cells.Item(j).InnerText);
    end;
  end;
  ovTable1 := WB1.OleObject.Document.all.tags('TABLE').item(9);
  for i := 0 to (ovTable1.Rows.Length - 1) do
  begin
    for j := 0 to (ovTable1.Rows.Item(i).Cells.Length - 1) do
    begin
      memo1.Lines.Add(ovTable1.Rows.Item(i).Cells.Item(j).InnerText);
    end;
  end;

end;
=======================

Wayne Barron, Author, Web Developer (Author) Commented:
Keeping this one open for a while longer.

Wayne Barron, Author, Web Developer (Author) Commented:
Another great job.

Basically, you took what I was using and modified it to work great.

for x:=7 to WB1.OleObject.Document.all.tags('TABLE').Length - 4 do

The [7] is the starting table number.
But how does it know where the last table under [Web Sites] is?

Great. Thank you once again Russell.
Wayne
(Up the Points from 250 - 350)

Russell Libby, Software Engineer, Advisory Commented:
Thanks Wayne.

I took the total table count and subtracted the static tables at the end of the page, which leaves the desired tables. The loop stops at Length - 4, so the last three tables are skipped, thus giving you:

     for x:=7 to WB1.OleObject.Document.all.tags('TABLE').Length - 4 do

Russell

Wayne Barron, Author, Web Developer (Author) Commented:
OK. Gotcha.
Just ran it through another test, and it still works, but it misses about 7 items.

Give it a shot here.
http://directory.google.com/Top/Kids_and_Teens/Arts/Music/Bands_and_Ensembles/Drum_and_Bugle_Corps/Junior_Corps/

On this page the table count starts at [6] instead of [7].
(Dag-on Google crap)

Anyway, I will have to do some checking and might have to change up the code on certain pages
and so forth, which is not really a big deal; it will just slow progress down a tad.

Take Care and thank you Russell.

Russell Libby, Software Engineer, Advisory Commented:
It's only tested on that one page, as parsing pages is difficult to do generically. You may have to build the procedure so that it takes a starting / ending table index in order to use it on multiple similar (but not identical) pages.

Russell
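
For readers following along, the suggestion above could look roughly like the unit below. This is only a sketch under assumptions: the unit name, ExtractTableRange and LastContentTable are hypothetical names made up for the example, and the procedure only collects the raw cell text (the title / URL / description splitting from the earlier code would still be applied to each cell). It needs a loaded document from the browser control, so it has not been run against the Google pages discussed here.

```pascal
unit TableRangeExtract;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

interface

uses
  SysUtils, Classes,
  ComObj; // enables late-bound (OleVariant) method calls

// Index of the last table to read when the page ends with TrailingStatic
// tables that should be skipped (3 trailing tables gives Length - 4).
function LastContentTable(TableCount, TrailingStatic: Integer): Integer;

// Adds the text of every non-empty cell in tables FirstTable..LastTable
// (zero-based, inclusive) of the document Doc to Output.
procedure ExtractTableRange(const Doc: OleVariant;
  FirstTable, LastTable: Integer; Output: TStrings);

implementation

function LastContentTable(TableCount, TrailingStatic: Integer): Integer;
begin
  Result := TableCount - TrailingStatic - 1;
end;

procedure ExtractTableRange(const Doc: OleVariant;
  FirstTable, LastTable: Integer; Output: TStrings);
var
  Tables, Table, Row: OleVariant;
  TableCount, x, i, j: Integer;
  CellText: string;
begin
  Tables := Doc.all.tags('TABLE'); // look the collection up once
  TableCount := Tables.Length;
  // Clamp the requested range to the tables that actually exist.
  if FirstTable < 0 then
    FirstTable := 0;
  if LastTable > TableCount - 1 then
    LastTable := TableCount - 1;

  Output.BeginUpdate;
  try
    for x := FirstTable to LastTable do
    begin
      Table := Tables.item(x);
      for i := 0 to Table.Rows.Length - 1 do
      begin
        Row := Table.Rows.Item(i);
        for j := 0 to Row.Cells.Length - 1 do
        begin
          CellText := Row.Cells.Item(j).InnerText;
          if Trim(CellText) <> '' then
            Output.Add(CellText);
        end;
      end;
    end;
  finally
    Output.EndUpdate;
  end;
end;

end.
```

A call for the first page might then be ExtractTableRange(WB1.OleObject.Document, 7, LastContentTable(WB1.OleObject.Document.all.tags('TABLE').Length, 3), Memo1.Lines), with 6 as the starting index for a page whose tables start one earlier.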

Wayne Barron, Author, Web Developer (Author) Commented:
Russell.
I just tried something. (It might not be the best way to do it, but with your code, the way you wrote it, it seems to work pretty well.)

OK, here:

for x:=7 to WB1.OleObject.Document.all.tags('TABLE').Length - 4 do

Changed it to

for x:=1 to WB1.OleObject.Document.all.tags('TABLE').Length - 1 do

What this does is grab everything from the start of the tables (including the ones I need) to the [End].

For the tables that come before and after the [Web Pages] tables, it adds empty entries to Memo1, like this:

||
||
||
||
||
||
<title>|<Link>|<Description>  (All of them)
||
||
||

So all I need to do now is a simple [Search and Replace] for every invalid || line
and replace it with a [ ] blank.

So, your code works GREAT!!!!!!!!! now.

Thank you so very much once again.

Wayne
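
The clean-up step described above could be sketched like this. It is an illustration only: IsEmptyEntry and RemoveEmptyEntries are hypothetical names, and it assumes the output uses '|' as the separator exactly as in the code earlier in the thread. Instead of replacing the || lines with a blank, it removes them.

```pascal
program FilterEntries;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// True when Line holds nothing but '|' separators and whitespace,
// for example the '||' written for a cell that is not a site entry.
function IsEmptyEntry(const Line: string): Boolean;
begin
  Result := Trim(StringReplace(Line, '|', '', [rfReplaceAll])) = '';
end;

// Deletes every empty entry from Lines.
procedure RemoveEmptyEntries(Lines: TStrings);
var
  i: Integer;
begin
  Lines.BeginUpdate;
  try
    // Walk backwards so that deleting a line does not shift the
    // indexes of the lines that are still to be checked.
    for i := Lines.Count - 1 downto 0 do
      if IsEmptyEntry(Lines[i]) then
        Lines.Delete(i);
  finally
    Lines.EndUpdate;
  end;
end;

var
  Sample: TStringList;
begin
  Sample := TStringList.Create;
  try
    Sample.Add('||');
    Sample.Add('Web Gallery of Art|http://www.wga.hu/|Virtual museum of European painting...');
    Sample.Add('||');
    RemoveEmptyEntries(Sample);
    WriteLn('Entries kept: ', Sample.Count);
    WriteLn(Sample.Text);
  finally
    Sample.Free;
  end;
end.
```

In the form itself this would be a single call, RemoveEmptyEntries(Memo1.Lines), after the loop has finished. The same IsEmptyEntry check could also be made just before Memo1.Lines.Add so the empty lines never get written.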

Russell Libby, Software Engineer, Advisory Commented:
Cool ;-)

Russell

Question has a verified solution.