Solved

Extract URL, <Title> & <Description>

Posted on 2006-07-16
Last Modified: 2010-04-05
Hello All;

 I am using the following code to extract URLs from a page.
=========================================

procedure TForm1.ButtonGetClick(Sender: TObject);
var
  i: Integer;
  URL: OleVariant; // This comes from " WebBrowser1.OnDocumentComplete "
  // Which is not the best way to do ( I do not think ) but works great,
begin
  Memo1.Lines.Clear;
  for i := 0 to WB1.OleObject.Document.links.Length - 1 do
    Memo1.Lines.Add(WB1.OleObject.Document.Links.Item(i));
end;

=========================================

The following link for example:
http://directory.google.com/Top/Kids_and_Teens/Arts/

OK.
I would like to grab not only the URLs from this page
(the ones located in the [Web Pages] area, not the [Categories] area),
but also the <text> that is located with each of them.

Example:
the first URL is:
==========================================================
Web Gallery of Art - http://www.wga.hu/ 
Virtual museum of European painting and sculpture of the Gothic, Renaissance and Baroque periods (1100-1800). With commentaries on pictures, biographies of artists, and guided tours.
==========================================================

So, from the above, I would like to grab:

Web Gallery of Art <--   Title
http://www.wga.hu/  <-- URL
Virtual museum of European painting...  <-- Description


Here is the actual HTML SourceCode for the information above:
==========================================================
<FONT face=arial,sans-serif><A href="http://www.wga.hu/" target=_top>Web Gallery
of Art</A>&nbsp;<FONT color=#6f6f6f
size=-1>-&nbsp;<SPAN>http://www.wga.hu/</SPAN></FONT> <BR><FONT size=-1>Virtual
museum of European painting and sculpture of the Gothic, Renaissance and Baroque
periods (1100-1800). With commentaries on pictures, biographies of artists, and
guided tours.</FONT></FONT>
==========================================================


I also have this code
http://www.experts-exchange.com/Programming/Programming_Languages/Delphi/Q_21153731.html

Which grabs the URL & Title for the [Categories] but not for the [Web Pages].

Any ideas on this one?

Thanks All;
Carrzkiss

Question by:Wayne Barron
13 Comments
 
LVL 22

Expert Comment

by:Mohammed Nasman
ID: 17119983
 
LVL 30

Author Comment

by:Wayne Barron
ID: 17120085
Nice.
I like the demo for:
[Extract Text]
(This would probably work if I could find a way to break the lines up,
and also have it "grab" the text from a given URL instead of me physically copying and pasting it.)
I like this one better, but the next one might be good as well.

==========

& [Links Plugin]

I looked at the code for the "DIHtmlLinksPlugin.pas" Component.
And if I could somehow add in the information for:
==========================================================
<FONT face=arial,sans-serif><A href="http://www.wga.hu/" target=_top>Web Gallery
of Art</A>&nbsp;<FONT color=#6f6f6f
size=-1>-&nbsp;<SPAN>http://www.wga.hu/</SPAN></FONT> <BR><FONT size=-1>Virtual
museum of European painting and sculpture of the Gothic, Renaissance and Baroque
periods (1100-1800). With commentaries on pictures, biographies of artists, and
guided tours.</FONT></FONT>
==========================================================
Then I could use this type of setup to maybe do what I need?
As I would need to look for:
URL <-- Already implemented in the code.

target=_top> </a>
<SPAN> </SPAN>

Other than that, any ideas
on how to use these components to do what I need?

Basically.
Type in a given URL
Extract links, Title, Description
From the page.
 
LVL 30

Author Comment

by:Wayne Barron
ID: 17126992
mnasman;

If you have any suggestions on my last comment, please let me know.
If not, I am going to close this question down, and reopen another one.
(Hopefully you can assist on this some more)

 
LVL 30

Author Comment

by:Wayne Barron
ID: 17127455
mnasman.
Never mind on the HTMLParser http://www.zeitungsjunge.de/delphi/htmlparser/
I had forgotten that it was shareware.
Right now, it is not in the budget for something like this.
Hopefully one day soon, as it does seem to be a great component to use.


(If you cannot think of anything else, I am going to do a PAQ/Refund on this one.)

Take Care and thanks.
 
LVL 30

Author Comment

by:Wayne Barron
ID: 17127561
mnasman.

I figured this one out on my own.
But need a better way instead of writing code for every single Table on the page.
Starting at Table #7 -thru- Table #9 in this example. (In my case, I am going all the way to Table #30)

So, anyway, this is the code, that I am using now, which I should say, works pretty dag-on good.
And pretty fast.

Take Care
I am going to PAQ/REFUND this one.
Carrzkiss

(Grabs Tables 7,8 & 9 from a page.)
==============================================
procedure TfrmExtractText.ButtonGetClick(Sender: TObject);
var
  i, j: integer;
  ovTable, ovTable1, ovTable2: OleVariant;
begin
  ovTable2 := WB1.OleObject.Document.all.tags('TABLE').item(7);
  for i := 0 to (ovTable2.Rows.Length - 1) do
  begin
    for j := 0 to (ovTable2.Rows.Item(i).Cells.Length - 1) do
    begin
      memo1.Lines.Add(ovTable2.Rows.Item(i).Cells.Item(j).InnerText);
    end;
  end;

  ovTable := WB1.OleObject.Document.all.tags('TABLE').item(8);
  for i := 0 to (ovTable.Rows.Length - 1) do
  begin
    for j := 0 to (ovTable.Rows.Item(i).Cells.Length - 1) do
    begin
      memo1.Lines.Add(ovTable.Rows.Item(i).Cells.Item(j).InnerText);
    end;
  end;
  ovTable1 := WB1.OleObject.Document.all.tags('TABLE').item(9);
  for i := 0 to (ovTable1.Rows.Length - 1) do
  begin
    for j := 0 to (ovTable1.Rows.Item(i).Cells.Length - 1) do
    begin
      memo1.Lines.Add(ovTable1.Rows.Item(i).Cells.Item(j).InnerText);
    end;
  end;

end;
=======================
 
LVL 30

Author Comment

by:Wayne Barron
ID: 17128044
Keeping this one open for a while longer.
 
LVL 26

Accepted Solution

by:
Russell Libby earned 350 total points
ID: 17133902
Perhaps a slight improvement.

Russell

----

procedure TfrmExtractText.ButtonGetClick(Sender: TObject);
var  i, j, x:    Integer;
     ovTable:    OleVariant;
     szText:     String;
     szItems:    Array [0..2] of String;
     dwPos:      Integer;
begin

  Memo1.Lines.BeginUpdate;
  try
     Memo1.Lines.Clear;
     for x:=7 to WB1.OleObject.Document.all.tags('TABLE').Length - 4 do
     begin
        ovTable:=WB1.OleObject.Document.all.tags('TABLE').item(x);
        for i:=0 to (ovTable.Rows.Length - 1) do
        begin
           for j:=0 to (ovTable.Rows.Item(i).Cells.Length - 1) do
           begin
              szText:=ovTable.Rows.Item(i).Cells.Item(j).InnerText;
              if Length(Trim(szText)) = 0 then
                 Continue
              else
              begin
                 szItems[0]:=EmptyStr;
                 szItems[1]:=EmptyStr;
                 szItems[2]:=EmptyStr;
                 dwPos:=Pos(' - http', LowerCase(szText));
                 if (dwPos > 0) then
                 begin
                    szItems[0]:=Trim(Copy(szText, 1, dwPos));
                    Delete(szText, 1, dwPos + 2);
                    dwPos:=Pos(#13#10, szText);
                    if (dwPos > 0) then
                    begin
                       szItems[1]:=Trim(Copy(szText, 1, dwPos));
                       Delete(szText, 1, dwPos + 1);
                       szItems[2]:=Trim(szText);
                    end;
                 end;
                 Memo1.Lines.Add(szItems[0]+'|'+szItems[1]+'|'+szItems[2]);
              end;
           end;
        end;
     end;
  finally
     Memo1.Lines.EndUpdate;
  end;

end;
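
The string handling inside the loop above can also be pulled out into a standalone function, which makes it easier to check without a browser. This is only a sketch: the names TEntry and SplitEntry are hypothetical, and it assumes the cell text has the shape "Title - http://..." followed by a CR/LF and the description, as on the sample page. Unlike the loop above, it still returns the URL when no line break follows it.

```pascal
// Needs SysUtils (Trim, LowerCase) in the uses clause.
type
  TEntry = record
    Title: string;
    URL: string;
    Description: string;
  end;

// Splits 'Title - http://...'#13#10'Description' into its three parts.
// All fields stay empty when the ' - http' marker is not found.
function SplitEntry(const CellText: string): TEntry;
var
  sRest: string;
  iPos: Integer;
begin
  Result.Title := '';
  Result.URL := '';
  Result.Description := '';
  iPos := Pos(' - http', LowerCase(CellText));
  if iPos = 0 then
    Exit;
  Result.Title := Trim(Copy(CellText, 1, iPos - 1));
  sRest := Copy(CellText, iPos + 3, MaxInt); // skip the ' - ' separator
  iPos := Pos(#13#10, sRest);
  if iPos > 0 then
  begin
    Result.URL := Trim(Copy(sRest, 1, iPos - 1));
    Result.Description := Trim(Copy(sRest, iPos + 2, MaxInt));
  end
  else
    Result.URL := Trim(sRest); // no line break: the rest is the URL
end;
```

Inside the cell loop, the result could then be written out with Memo1.Lines.Add(E.Title + '|' + E.URL + '|' + E.Description), where E holds the value returned by SplitEntry(szText).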



 
LVL 30

Author Comment

by:Wayne Barron
ID: 17134623
Another great job.

Basically, took what I was using and modified it to work great.

for x:=7 to WB1.OleObject.Document.all.tags('TABLE').Length - 4 do

The [7] is the Table #
But how does it know where the last Table under the [Web Sites] is at?

Great. Thank you once again Russell.
Wayne
(Upped the points from 250 to 350)
 
LVL 26

Expert Comment

by:Russell Libby
ID: 17134706
Thanks Wayne.

I had to subtract the number of static tables at the end of the page from the total table count, which gives the index of the last desired table, thus giving you:

     for x:=7 to WB1.OleObject.Document.all.tags('TABLE').Length - 4 do
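
Spelled out as arithmetic, the upper bound works like this. The function name and parameters are hypothetical, and the figure of 3 trailing tables is only inferred from "Length - 4" (the last valid index is Length - 1, so stopping at Length - 4 skips three tables); it is not stated in the thread.

```pascal
// TableCount is the value of ...tags('TABLE').Length; indexes run 0..TableCount-1.
// TrailingStatic is the number of fixed tables that follow the listing.
function LastListingTable(TableCount, TrailingStatic: Integer): Integer;
begin
  // Last valid index, minus the tables to skip at the end of the page.
  Result := (TableCount - 1) - TrailingStatic;
end;
```

With TrailingStatic = 3 this reduces to TableCount - 4, the bound used in the loop.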

Russell
 
LVL 30

Author Comment

by:Wayne Barron
ID: 17134919
OK. Gotcha.
Just ran it through another test, and it still works, but it misses about 7 items.

Give it a shot here.
http://directory.google.com/Top/Kids_and_Teens/Arts/Music/Bands_and_Ensembles/Drum_and_Bugle_Corps/Junior_Corps/

This Table count starts at: [6] instead of [7]
(Dag-on Google crap)

Anyway. I will have to do some checking, might have to change up the code on certain pages
And so forth, which is not really a big deal, just will slow progress down a tad.

Take Care and thank you Russell.
 
LVL 26

Expert Comment

by:Russell Libby
ID: 17134966
It's only tested on that one page, as parsing pages is something that is difficult to do generically. You may have to build the procedure so that it takes a starting / ending table index in order to use it on multiple similar (but not identical) pages.
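
A rough sketch of what such a parameterized procedure might look like. The name ExtractTableText and its parameters are hypothetical, the indexes are 0-based positions in the TABLE collection, and the range is clamped so that a wrong guess on a different page does not index past the end of the collection.

```pascal
// Needs SysUtils (Trim), Classes (TStrings) and Variants (VarToStr).
// Doc is assumed to be the loaded document, e.g. WB1.OleObject.Document.
procedure ExtractTableText(const Doc: OleVariant;
  FirstTable, LastTable: Integer; Output: TStrings);
var
  x, i, j, nCount: Integer;
  ovTables, ovTable, ovRow: OleVariant;
  szText: string;
begin
  ovTables := Doc.all.tags('TABLE');
  nCount := ovTables.Length;
  // Clamp the requested range to the tables that actually exist.
  if FirstTable < 0 then
    FirstTable := 0;
  if LastTable > nCount - 1 then
    LastTable := nCount - 1;
  Output.BeginUpdate;
  try
    for x := FirstTable to LastTable do
    begin
      ovTable := ovTables.item(x);
      for i := 0 to ovTable.Rows.Length - 1 do
      begin
        ovRow := ovTable.Rows.Item(i);
        for j := 0 to ovRow.Cells.Length - 1 do
        begin
          szText := VarToStr(ovRow.Cells.Item(j).InnerText);
          if Trim(szText) <> '' then
            Output.Add(szText); // skip cells that contain no text
        end;
      end;
    end;
  finally
    Output.EndUpdate;
  end;
end;
```

For the first sample page the call would be something like ExtractTableText(WB1.OleObject.Document, 7, 9, Memo1.Lines); for a page whose listing starts one table earlier, only the first index changes.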

Russell
 
LVL 30

Author Comment

by:Wayne Barron
ID: 17135074
Russell.
I just tried something. (It might not be the best way to do it, but with your code, the way you wrote it, it seems
to work pretty well.)

OK. for here

for x:=7 to WB1.OleObject.Document.all.tags('TABLE').Length - 4 do

Changed it to

for x:=1 to WB1.OleObject.Document.all.tags('TABLE').Length - 1 do

What this will do is grab everything from the start of the Table(s) (That I need to get) to the [End].

For the Tables that are "Before" and "After" the [Web Pages] Tables, it adds this to Memo1:

||
||
||
||
||
||
<title>|<Link>|<Description>  (All of them)
||
||
||

So, all I need to do now is a simple [Search and Replace] for every line that is not valid (the bare || lines)
And replace it with a [ ] Blank space.
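
The clean-up step could be sketched as follows. The names IsEmptyEntry and RemoveEmptyEntries are hypothetical, and this version deletes the bare || lines outright rather than replacing them with a blank.

```pascal
// Needs SysUtils (Trim) and Classes (TStrings) in the uses clause.

// True for a line that holds only the two separators and no data.
function IsEmptyEntry(const Line: string): Boolean;
begin
  Result := Trim(Line) = '||';
end;

procedure RemoveEmptyEntries(Lines: TStrings);
var
  i: Integer;
begin
  Lines.BeginUpdate;
  try
    // Walk backwards so Delete does not shift the lines still to be checked.
    for i := Lines.Count - 1 downto 0 do
      if IsEmptyEntry(Lines[i]) then
        Lines.Delete(i);
  finally
    Lines.EndUpdate;
  end;
end;
```

Calling RemoveEmptyEntries(Memo1.Lines) after the extraction loop would leave only the title|link|description lines. An alternative would be to skip the Add call in the loop whenever the title part is empty, so the empty lines never reach the memo.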

So, your code works GREAT!!!!!!!!! now.

Thank you so very much once again.

Wayne
 
LVL 26

Expert Comment

by:Russell Libby
ID: 17135256
Cool ;-)

Russell

Question has a verified solution.
