Solved

Extract URL, <Title> & <Description>

Posted on 2006-07-16
Last Modified: 2010-04-05
Hello All;

I am using the following code to extract URLs from a page.
=========================================

procedure TForm1.ButtonGetClick(Sender: TObject);
var
  i: Integer;
  URL: OleVariant; // This comes from WebBrowser1.OnDocumentComplete,
  // which is probably not the best way to do it, but it works great.
begin
  Memo1.Lines.Clear;
  // Read each link's href explicitly instead of relying on the element's default property
  for i := 0 to WB1.OleObject.Document.links.Length - 1 do
    Memo1.Lines.Add(WB1.OleObject.Document.links.item(i).href);
end;

=========================================

The following link for example:
http://directory.google.com/Top/Kids_and_Teens/Arts/

OK.
I would like to grab not only the URLs from this page
(the ones located in the [Web Pages] area, not the [Categories] area)
but also the text that goes with each one.

Example:
the first URL is:
==========================================================
Web Gallery of Art - http://www.wga.hu/ 
Virtual museum of European painting and sculpture of the Gothic, Renaissance and Baroque periods (1100-1800). With commentaries on pictures, biographies of artists, and guided tours.
==========================================================

So from the above, I would like to grab:

Web Gallery of Art <--   Title
http://www.wga.hu/  <-- URL
Virtual museum of European painting...  <-- Description


Here is the actual HTML SourceCode for the information above:
==========================================================
<FONT face=arial,sans-serif><A href="http://www.wga.hu/" target=_top>Web Gallery
of Art</A>&nbsp;<FONT color=#6f6f6f
size=-1>-&nbsp;<SPAN>http://www.wga.hu/</SPAN></FONT> <BR><FONT size=-1>Virtual
museum of European painting and sculpture of the Gothic, Renaissance and Baroque
periods (1100-1800). With commentaries on pictures, biographies of artists, and
guided tours.</FONT></FONT>
==========================================================


I also have this code
http://www.experts-exchange.com/Programming/Programming_Languages/Delphi/Q_21153731.html

Which grabs the URL & Title for the [Categories] but not for the [Web Pages].

Any ideas on this one?

Thanks All;
Carrzkiss

Question by:Wayne Barron
13 Comments
 
LVL 22

Expert Comment

by:Mohammed Nasman
ID: 17119983
0
 
LVL 31

Author Comment

by:Wayne Barron
ID: 17120085
Nice.
I like the demo for:
[Extract Text]
(This would probably work if I could find a way to break the lines up,
and also have it grab the text from a given URL instead of me physically copying and pasting it.)
I like this one better, but the next one might be good as well.

==========

& [Links Plugin]

I looked at the code for the "DIHtmlLinksPlugin.pas" Component.
And if I could somehow add in the information for:
==========================================================
<FONT face=arial,sans-serif><A href="http://www.wga.hu/" target=_top>Web Gallery
of Art</A>&nbsp;<FONT color=#6f6f6f
size=-1>-&nbsp;<SPAN>http://www.wga.hu/</SPAN></FONT> <BR><FONT size=-1>Virtual
museum of European painting and sculpture of the Gothic, Renaissance and Baroque
periods (1100-1800). With commentaries on pictures, biographies of artists, and
guided tours.</FONT></FONT>
==========================================================
Then I could maybe use this type of setup to do what I need?
As I would need to look for:
URL <-- Already implemented in the code.

target=_top> </a>
<SPAN> </SPAN>

Other than that, any ideas
on how to use these components to do what I need?

Basically.
Type in a given URL
Extract links, Title, Description
From the page.
0
 
LVL 31

Author Comment

by:Wayne Barron
ID: 17126992
mnasman;

If you have any suggestions on my last comment, please let me know?
If not, I am going to close this question down, and reopen another one.
(Hopefully you can assist on this some more)
0

 
LVL 31

Author Comment

by:Wayne Barron
ID: 17127455
mnasman.
Never mind on the HTMLParser http://www.zeitungsjunge.de/delphi/htmlparser/
I had forgotten that it is shareware.
Right now, it is not in the budget for something like this.
Hopefully one day soon, as it does seem to be a great component to use.


(If you cannot think of anything else, I am going to do a PAQ/Refund on this one.)

Take Care and thanks.
0
 
LVL 31

Author Comment

by:Wayne Barron
ID: 17127561
mnasman.

I figured this one out on my own.
But I need a better way than writing code for every single table on the page.
This example starts at Table #7 and runs through Table #9. (In my case, I am going all the way to Table #30.)

So, anyway, this is the code that I am using now, which I should say works pretty dag-on good.
And pretty fast.

Take Care
I am going to PAQ/Refund this one.
Carrzkiss

(Grabs Tables 7,8 & 9 from a page.)
==============================================
procedure TfrmExtractText.ButtonGetClick(Sender: TObject);
var
  i, j: integer;
  ovTable, ovTable1, ovTable2: OleVariant;
begin
  ovTable2 := WB1.OleObject.Document.all.tags('TABLE').item(7);
  for i := 0 to (ovTable2.Rows.Length - 1) do
  begin
    for j := 0 to (ovTable2.Rows.Item(i).Cells.Length - 1) do
    begin
      memo1.Lines.Add(ovTable2.Rows.Item(i).Cells.Item(j).InnerText);
    end;
  end;

  ovTable := WB1.OleObject.Document.all.tags('TABLE').item(8);
  for i := 0 to (ovTable.Rows.Length - 1) do
  begin
    for j := 0 to (ovTable.Rows.Item(i).Cells.Length - 1) do
    begin
      memo1.Lines.Add(ovTable.Rows.Item(i).Cells.Item(j).InnerText);
    end;
  end;
  ovTable1 := WB1.OleObject.Document.all.tags('TABLE').item(9);
  for i := 0 to (ovTable1.Rows.Length - 1) do
  begin
    for j := 0 to (ovTable1.Rows.Item(i).Cells.Length - 1) do
    begin
      memo1.Lines.Add(ovTable1.Rows.Item(i).Cells.Item(j).InnerText);
    end;
  end;

end;
=======================
0
 
LVL 31

Author Comment

by:Wayne Barron
ID: 17128044
Keeping this one open for a while longer.
0
 
LVL 26

Accepted Solution

by:
Russell Libby earned 350 total points
ID: 17133902
Perhaps a slight improvement.

Russell

----

procedure TfrmExtractText.ButtonGetClick(Sender: TObject);
var  i, j, x:    Integer;
     ovTable:    OleVariant;
     szText:     String;
     szItems:    Array [0..2] of String;  // 0 = title, 1 = URL, 2 = description
     dwPos:      Integer;
begin

  Memo1.Lines.BeginUpdate;
  try
     Memo1.Lines.Clear;
     // Skip the first 7 tables and the last 3 (static layout on the tested page)
     for x:=7 to WB1.OleObject.Document.all.tags('TABLE').Length - 4 do
     begin
        ovTable:=WB1.OleObject.Document.all.tags('TABLE').item(x);
        for i:=0 to (ovTable.Rows.Length - 1) do
        begin
           for j:=0 to (ovTable.Rows.Item(i).Cells.Length - 1) do
           begin
              szText:=ovTable.Rows.Item(i).Cells.Item(j).InnerText;
              // Ignore cells that contain only whitespace
              if Length(Trim(szText)) = 0 then
                 Continue
              else
              begin
                 szItems[0]:=EmptyStr;
                 szItems[1]:=EmptyStr;
                 szItems[2]:=EmptyStr;
                 // The title is everything before ' - http'
                 dwPos:=Pos(' - http', LowerCase(szText));
                 if (dwPos > 0) then
                 begin
                    szItems[0]:=Trim(Copy(szText, 1, dwPos));
                    // Drop the title and the ' - ' separator, leaving 'http...'
                    Delete(szText, 1, dwPos + 2);
                    // The URL ends at the first line break; the rest is the description
                    dwPos:=Pos(#13#10, szText);
                    if (dwPos > 0) then
                    begin
                       szItems[1]:=Trim(Copy(szText, 1, dwPos));
                       Delete(szText, 1, dwPos + 1);
                       szItems[2]:=Trim(szText);
                    end;
                 end;
                 // One line per cell: title|URL|description
                 Memo1.Lines.Add(szItems[0]+'|'+szItems[1]+'|'+szItems[2]);
              end;
           end;
        end;
     end;
  finally
     Memo1.Lines.EndUpdate;
  end;

end;
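
The string splitting in the code above can also be pulled out into a standalone function, which makes it easier to try on sample text without a loaded page. This is only a sketch: `TDirectoryEntry` and `ParseEntry` are made-up names, and it assumes the cell text has the same "Title - http://... / line break / description" shape as the example in the question. Unlike the loop above, it keeps the URL even when no description line follows.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // only needed when compiling with Free Pascal
uses
  SysUtils;

type
  // Hypothetical record holding the three pieces of one directory entry
  TDirectoryEntry = record
    Title, Url, Description: string;
  end;

// Splits 'Title - http://url' + CRLF + 'Description' into its parts.
// All fields stay empty when the ' - http' marker is not found.
function ParseEntry(const CellText: string): TDirectoryEntry;
var
  Rest: string;
  P: Integer;
begin
  Result.Title := '';
  Result.Url := '';
  Result.Description := '';
  P := Pos(' - http', LowerCase(CellText));
  if P = 0 then
    Exit;
  Result.Title := Trim(Copy(CellText, 1, P - 1));
  // Skip the three characters of ' - ' so Rest starts at 'http'
  Rest := Copy(CellText, P + 3, MaxInt);
  P := Pos(#13#10, Rest);
  if P = 0 then
    Result.Url := Trim(Rest)  // no description line present
  else
  begin
    Result.Url := Trim(Copy(Rest, 1, P - 1));
    Result.Description := Trim(Copy(Rest, P + 2, MaxInt));
  end;
end;
```

Inside the cell loop it could be called as `Entry := ParseEntry(szText);` and the three fields joined with `|` as before.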



0
 
LVL 31

Author Comment

by:Wayne Barron
ID: 17134623
Another great job.

Basically, took what I was using and modified it to work great.

for x:=7 to WB1.OleObject.Document.all.tags('TABLE').Length - 4 do

The [7] is the table number.
But how does it know where the last table under [Web Sites] is?

Great. Thank you once again Russell.
Wayne
(Up the Points from 250 - 350)
0
 
LVL 26

Expert Comment

by:Russell Libby
ID: 17134706
Thanks Wayne.

I counted the static tables at the end of the page and subtracted them from the total table count, which leaves just the desired tables, thus giving you:

     for x:=7 to WB1.OleObject.Document.all.tags('TABLE').Length - 4 do

Russell
0
 
LVL 31

Author Comment

by:Wayne Barron
ID: 17134919
OK. Gotcha.
Just ran it through another test, and it still works. But it misses about 7 items.

Give it a shot here.
http://directory.google.com/Top/Kids_and_Teens/Arts/Music/Bands_and_Ensembles/Drum_and_Bugle_Corps/Junior_Corps/

This Table count starts at: [6] instead of [7]
(Dag-on Google crap)

Anyway. I will have to do some checking, and might have to change the code on certain pages
and so forth, which is not really a big deal; it will just slow progress down a tad.

Take Care and thank you Russell.
0
 
LVL 26

Expert Comment

by:Russell Libby
ID: 17134966
It's only tested on that one page, as parsing pages is something that is difficult to do generically. You may have to build the procedure so that it takes a starting / ending table index in order to use it on multiple similar (but not identical) pages.

Russell
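
A rough sketch of that idea, with the table range passed in instead of hard-coded. `TableRangeIsValid` and `ExtractTableText` are hypothetical names, and the procedure assumes it is handed a fully loaded document such as `WB1.OleObject.Document`; it only collects the raw cell text, so the title/URL/description splitting would still be applied to each line.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // only needed when compiling with Free Pascal
uses
  SysUtils, Classes, Variants;

// Hypothetical helper: True when First..Last is a usable zero-based range
// for a page that contains TableCount tables.
function TableRangeIsValid(TableCount, First, Last: Integer): Boolean;
begin
  Result := (First >= 0) and (First <= Last) and (Last < TableCount);
end;

// Hypothetical procedure: adds the text of every non-empty cell found in
// tables First..Last of the document to Output.
procedure ExtractTableText(const Doc: OleVariant; First, Last: Integer;
  Output: TStrings);
var
  Tables, Table: OleVariant;
  Count, x, i, j: Integer;
  CellText: string;
begin
  Tables := Doc.all.tags('TABLE');
  Count := Tables.Length;
  if not TableRangeIsValid(Count, First, Last) then
    raise ERangeError.CreateFmt('Table range %d..%d is outside 0..%d',
      [First, Last, Count - 1]);
  Output.BeginUpdate;
  try
    for x := First to Last do
    begin
      Table := Tables.item(x);
      for i := 0 to Table.Rows.Length - 1 do
        for j := 0 to Table.Rows.Item(i).Cells.Length - 1 do
        begin
          CellText := Table.Rows.Item(i).Cells.Item(j).InnerText;
          CellText := Trim(CellText);
          if CellText <> '' then
            Output.Add(CellText);
        end;
    end;
  finally
    Output.EndUpdate;
  end;
end;
```

For the first test page this might be called as `ExtractTableText(WB1.OleObject.Document, 7, 9, Memo1.Lines);`, with a different start index (6, for instance) on pages laid out slightly differently.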
0
 
LVL 31

Author Comment

by:Wayne Barron
ID: 17135074
Russell.
I just tried something. (It might not be the best way to do it, but with your code, the way you wrote it, it seems to work pretty well.)

OK. for here

for x:=7 to WB1.OleObject.Document.all.tags('TABLE').Length - 4 do

Changed it to

for x:=1 to WB1.OleObject.Document.all.tags('TABLE').Length - 1 do

What this will do is grab everything from the start of the Table(s) (That I need to get) to the [End].

For the tables that come before and after the [Web Pages] tables, it adds this to Memo1:

||
||
||
||
||
||
<title>|<Link>|<Description>  (All of them)
||
||
||

So, all I need to do now is a simple [Search and Replace] for every line that is not valid (||)
and replace it with a blank.
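
That cleanup step could also be done in code rather than by hand. A minimal sketch, assuming the memo holds one `title|link|description` line per cell as described above; `IsEmptyEntry` and `RemoveEmptyEntries` are made-up names.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // only needed when compiling with Free Pascal
uses
  SysUtils, Classes;

// Hypothetical helper: True for a line that holds only the two separators,
// meaning title, link and description were all empty.
function IsEmptyEntry(const Line: string): Boolean;
begin
  Result := Trim(Line) = '||';
end;

// Hypothetical helper: deletes every empty entry from Lines and returns
// how many lines were removed.
function RemoveEmptyEntries(Lines: TStrings): Integer;
var
  i: Integer;
begin
  Result := 0;
  Lines.BeginUpdate;
  try
    // Walk backwards so deleting a line does not shift the ones still to check
    for i := Lines.Count - 1 downto 0 do
      if IsEmptyEntry(Lines[i]) then
      begin
        Lines.Delete(i);
        Inc(Result);
      end;
  finally
    Lines.EndUpdate;
  end;
end;
```

Calling `RemoveEmptyEntries(Memo1.Lines);` after the extraction loop would leave only the lines that actually contain an entry.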

So, your code works GREAT!!!!!!!!! now.

Thank you so very much once again.

Wayne
0
 
LVL 26

Expert Comment

by:Russell Libby
ID: 17135256
Cool ;-)

Russell
0
