Solved

Extract URL, <Title> & <Description>

Posted on 2006-07-16
Last Modified: 2010-04-05
Hello All;

I am using the following code to extract URLs from a page.
=========================================

procedure TForm1.ButtonGetClick(Sender: TObject);
var
  i: Integer;
  URL: OleVariant; // This comes from "WebBrowser1.OnDocumentComplete",
  // which is probably not the best way to do it, but it works great.
begin
  Memo1.Lines.Clear;
  // Add the href of every link in the loaded document to the memo.
  for i := 0 to WB1.OleObject.Document.links.Length - 1 do
    Memo1.Lines.Add(WB1.OleObject.Document.links.item(i).href);
end;

=========================================

The following link for example:
http://directory.google.com/Top/Kids_and_Teens/Arts/

OK.
I would like to grab not only the URLs from this page
(the ones located in the [Web Pages] area, not the [Categories] area),
but also the text that goes with each one.

Example:
the first URL is:
==========================================================
Web Gallery of Art - http://www.wga.hu/
Virtual museum of European painting and sculpture of the Gothic, Renaissance and Baroque periods (1100-1800). With commentaries on pictures, biographies of artists, and guided tours.
==========================================================

So, from the above, I would like to grab:

Web Gallery of Art <--   Title
http://www.wga.hu/  <-- URL
Virtual museum of European painting...  <-- Description


Here is the actual HTML SourceCode for the information above:
==========================================================
<FONT face=arial,sans-serif><A href="http://www.wga.hu/" target=_top>Web Gallery
of Art</A>&nbsp;<FONT color=#6f6f6f
size=-1>-&nbsp;<SPAN>http://www.wga.hu/</SPAN></FONT> <BR><FONT size=-1>Virtual
museum of European painting and sculpture of the Gothic, Renaissance and Baroque
periods (1100-1800). With commentaries on pictures, biographies of artists, and
guided tours.</FONT></FONT>
==========================================================
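
For the markup above, the title and URL can be read straight from each anchor element. The unit below is only a sketch under stated assumptions: ListLinkTitles is a hypothetical helper name, Delphi 6 or later is assumed, and Doc is assumed to be the fully loaded document (WB1.OleObject.Document). It relies on the standard links collection plus the innerText and href properties of each link. It does not pick up the description, which sits in a separate FONT element after the link, and it lists every link on the page, not only the [Web Pages] ones.

```pascal
unit LinkListSketch;

interface

uses
  Classes;

// Hypothetical helper: appends one 'title|url' line per link in Doc.
procedure ListLinkTitles(const Doc: OleVariant; Lines: TStrings);

implementation

uses
  SysUtils, ComObj; // ComObj enables late-bound calls on OleVariant

procedure ListLinkTitles(const Doc: OleVariant; Lines: TStrings);
var
  i: Integer;
  Links, Link: OleVariant;
  Title, Href: string;
begin
  Links := Doc.links;
  Lines.BeginUpdate;
  try
    for i := 0 to Links.length - 1 do
    begin
      Link := Links.item(i);
      Title := Link.innerText; // visible link text, e.g. 'Web Gallery of Art'
      Href := Link.href;       // target URL, e.g. 'http://www.wga.hu/'
      Lines.Add(Trim(Title) + '|' + Href);
    end;
  finally
    Lines.EndUpdate;
  end;
end;

end.
```

A call from the button handler might look like ListLinkTitles(WB1.OleObject.Document, Memo1.Lines);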


I also have this code:
http://www.experts-exchange.com/Programming/Programming_Languages/Delphi/Q_21153731.html

It grabs the URL & Title for the [Categories], but not for the [Web Pages].

Any ideas on this one?

Thanks All;
Carrzkiss

Question by:Wayne Barron
13 Comments
 
LVL 22

Expert Comment

by:mnasman
 
LVL 30

Author Comment

by:Wayne Barron
Nice.
I like the demo for:
[Extract Text]
(This would probably work if I could find a way to break the lines up,
and also to have it "grab" the text from a given URL instead of me physically copying and pasting it.)
I like this one better, but the next one might be good as well.

==========

& [Links Plugin]

I looked at the code for the "DIHtmlLinksPlugin.pas" Component.
And if I could somehow add in the information for:
==========================================================
<FONT face=arial,sans-serif><A href="http://www.wga.hu/" target=_top>Web Gallery
of Art</A>&nbsp;<FONT color=#6f6f6f
size=-1>-&nbsp;<SPAN>http://www.wga.hu/</SPAN></FONT> <BR><FONT size=-1>Virtual
museum of European painting and sculpture of the Gothic, Renaissance and Baroque
periods (1100-1800). With commentaries on pictures, biographies of artists, and
guided tours.</FONT></FONT>
==========================================================
Then I could use this type of setup to maybe do what I need?
I would need to look for:
URL <-- Already implemented in the code.

target=_top> </a>
<SPAN> </SPAN>

Other than that, any ideas
on how to use these components to do what I need?

Basically:
Type in a given URL.
Extract the links, titles, and descriptions
from the page.
 
LVL 30

Author Comment

by:Wayne Barron
mnasman;

If you have any suggestions on my last comment, please let me know.
If not, I am going to close this question down, and reopen another one.
(Hopefully you can assist on this some more)
 
LVL 30

Author Comment

by:Wayne Barron
mnasman.
Never mind about the HTMLParser http://www.zeitungsjunge.de/delphi/htmlparser/
I had forgotten that it is shareware.
Right now, something like this is not in the budget.
Hopefully one day soon, as it does seem to be a great component to use.


(If you cannot think of anything else, I am going to do a PAQ/Refund on this one.)

Take Care and thanks.
 
LVL 30

Author Comment

by:Wayne Barron
mnasman.

I figured this one out on my own,
but I need a better way than writing code for every single table on the page,
starting at Table #7 through Table #9 in this example. (In my case, I am going all the way to Table #30.)

So, anyway, this is the code that I am using now, which, I should say, works pretty dag-on good.
And pretty fast.

Take Care
I am going to PAQ/Refund this one.
Carrzkiss

(Grabs Tables 7,8 & 9 from a page.)
==============================================
procedure TfrmExtractText.ButtonGetClick(Sender: TObject);
var
  i, j: integer;
  ovTable, ovTable1, ovTable2: OleVariant;
begin
  // Table 7: add the text of every cell to the memo
  ovTable2 := WB1.OleObject.Document.all.tags('TABLE').item(7);
  for i := 0 to (ovTable2.Rows.Length - 1) do
  begin
    for j := 0 to (ovTable2.Rows.Item(i).Cells.Length - 1) do
    begin
      memo1.Lines.Add(ovTable2.Rows.Item(i).Cells.Item(j).InnerText);
    end;
  end;

  // Table 8
  ovTable := WB1.OleObject.Document.all.tags('TABLE').item(8);
  for i := 0 to (ovTable.Rows.Length - 1) do
  begin
    for j := 0 to (ovTable.Rows.Item(i).Cells.Length - 1) do
    begin
      memo1.Lines.Add(ovTable.Rows.Item(i).Cells.Item(j).InnerText);
    end;
  end;

  // Table 9
  ovTable1 := WB1.OleObject.Document.all.tags('TABLE').item(9);
  for i := 0 to (ovTable1.Rows.Length - 1) do
  begin
    for j := 0 to (ovTable1.Rows.Item(i).Cells.Length - 1) do
    begin
      memo1.Lines.Add(ovTable1.Rows.Item(i).Cells.Item(j).InnerText);
    end;
  end;

end;
=======================
 
LVL 30

Author Comment

by:Wayne Barron
Keeping this one open for a while longer.
 
LVL 26

Accepted Solution

by:
Russell Libby earned 350 total points
Perhaps a slight improvement.

Russell

----

procedure TfrmExtractText.ButtonGetClick(Sender: TObject);
var  i, j, x:    Integer;
     ovTable:    OleVariant;
     szText:     String;
     szItems:    Array [0..2] of String;  // 0 = title, 1 = URL, 2 = description
     dwPos:      Integer;
begin

  Memo1.Lines.BeginUpdate;
  try
     Memo1.Lines.Clear;
     // Start at table 7 and stop before the static tables at the end of the page
     for x:=7 to WB1.OleObject.Document.all.tags('TABLE').Length - 4 do
     begin
        ovTable:=WB1.OleObject.Document.all.tags('TABLE').item(x);
        for i:=0 to (ovTable.Rows.Length - 1) do
        begin
           for j:=0 to (ovTable.Rows.Item(i).Cells.Length - 1) do
           begin
              szText:=ovTable.Rows.Item(i).Cells.Item(j).InnerText;
              // Skip empty cells
              if Length(Trim(szText)) = 0 then
                 Continue
              else
              begin
                 szItems[0]:=EmptyStr;
                 szItems[1]:=EmptyStr;
                 szItems[2]:=EmptyStr;
                 // The title is everything before ' - http'
                 dwPos:=Pos(' - http', LowerCase(szText));
                 if (dwPos > 0) then
                 begin
                    szItems[0]:=Trim(Copy(szText, 1, dwPos));
                    // Drop the title and the ' - ' separator
                    Delete(szText, 1, dwPos + 2);
                    // The URL ends at the first line break; the rest is the description
                    dwPos:=Pos(#13#10, szText);
                    if (dwPos > 0) then
                    begin
                       szItems[1]:=Trim(Copy(szText, 1, dwPos));
                       Delete(szText, 1, dwPos + 1);
                       szItems[2]:=Trim(szText);
                    end;
                 end;
                 // Cells without ' - http' are written as '||'
                 Memo1.Lines.Add(szItems[0]+'|'+szItems[1]+'|'+szItems[2]);
              end;
           end;
        end;
     end;
  finally
     Memo1.Lines.EndUpdate;
  end;

end;
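
The string handling in the loop above can also be pulled out into a small function that can be tried without a browser. This is only a sketch: SplitEntry is a hypothetical name, and it assumes the cell text has the shape 'Title - http://url', then a line break, then the description, as in the wga.hu example. One deliberate difference from the code above: when no line break follows the URL, the sketch keeps the remaining text as the URL instead of leaving it empty.

```pascal
program SplitEntrySketch;

{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

uses
  SysUtils;

// Hypothetical helper: splits 'Title - http://url<CRLF>Description'
// into its three parts. Returns False when no ' - http' marker is found.
function SplitEntry(const CellText: string;
  out Title, Url, Description: string): Boolean;
var
  Rest: string;
  P: Integer;
begin
  Title := '';
  Url := '';
  Description := '';
  P := Pos(' - http', LowerCase(CellText));
  Result := P > 0;
  if not Result then
    Exit;
  Title := Trim(Copy(CellText, 1, P - 1));
  Rest := Copy(CellText, P + 3, MaxInt); // skip the ' - ' separator
  P := Pos(#13#10, Rest);
  if P > 0 then
  begin
    Url := Trim(Copy(Rest, 1, P - 1));
    Description := Trim(Copy(Rest, P + 2, MaxInt));
  end
  else
    Url := Trim(Rest); // no line break: everything left is the URL
end;

var
  Title, Url, Description: string;
begin
  if SplitEntry('Web Gallery of Art - http://www.wga.hu/'#13#10 +
    'Virtual museum of European painting...', Title, Url, Description) then
    WriteLn(Title, '|', Url, '|', Description);
end.
```

Inside the cell loop, the function result could decide whether a line is added to Memo1 at all, which avoids the empty '||' lines.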



 
LVL 30

Author Comment

by:Wayne Barron
Another great job.

Basically, you took what I was using and modified it to work great.

for x:=7 to WB1.OleObject.Document.all.tags('TABLE').Length - 4 do

The [7] is the Table #.
But how does it know where the last table under the [Web Sites] is?

Great. Thank you once again Russell.
Wayne
(Upped the points from 250 to 350.)
 
LVL 26

Expert Comment

by:Russell Libby
Thanks Wayne.

I took the total table count and subtracted the static tables at the end of the page, which left the desired tables, thus giving you:

     for x:=7 to WB1.OleObject.Document.all.tags('TABLE').Length - 4 do

Russell
 
LVL 30

Author Comment

by:Wayne Barron
OK. Gotcha.
Just ran it through another test, and it still works, but misses about 7 items.

Give it a shot here.
http://directory.google.com/Top/Kids_and_Teens/Arts/Music/Bands_and_Ensembles/Drum_and_Bugle_Corps/Junior_Corps/

This table count starts at [6] instead of [7].
(Dag-on Google crap)

Anyway, I will have to do some checking, and might have to change the code for certain pages
and so forth, which is not really a big deal; it will just slow progress down a tad.

Take Care and thank you Russell.
 
LVL 26

Expert Comment

by:Russell Libby
It's only tested on that one page, as parsing pages is difficult to do generically. You may have to build the procedure so that it takes a starting and ending table index in order to use it on multiple similar (but not identical) pages.

Russell
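
A procedure that takes the starting and ending table index, as suggested above, might look roughly like the unit below. It is a sketch, not tested against the Google pages: AddTableCells is a hypothetical name, Delphi 6 or later is assumed, and Doc is assumed to be the loaded document (WB1.OleObject.Document). The range is clamped to the tables that actually exist, and each non-empty cell is added as raw text; the title/URL/description split would still be applied to each cell afterwards.

```pascal
unit TableTextSketch;

interface

uses
  Classes;

// Hypothetical helper: appends the non-empty cell text of tables
// FirstTable..LastTable (0-based, inclusive) of Doc to Lines.
procedure AddTableCells(const Doc: OleVariant;
  FirstTable, LastTable: Integer; Lines: TStrings);

implementation

uses
  SysUtils, ComObj; // ComObj enables late-bound calls on OleVariant

procedure AddTableCells(const Doc: OleVariant;
  FirstTable, LastTable: Integer; Lines: TStrings);
var
  x, i, j, Count: Integer;
  Tables, Table, Row: OleVariant;
  CellText: string;
begin
  Tables := Doc.all.tags('TABLE');
  Count := Tables.length;
  // Clamp the range so a page with fewer tables does not raise an error.
  if FirstTable < 0 then
    FirstTable := 0;
  if LastTable > Count - 1 then
    LastTable := Count - 1;
  Lines.BeginUpdate;
  try
    for x := FirstTable to LastTable do
    begin
      Table := Tables.item(x);
      for i := 0 to Table.rows.length - 1 do
      begin
        Row := Table.rows.item(i);
        for j := 0 to Row.cells.length - 1 do
        begin
          CellText := Row.cells.item(j).innerText;
          if Trim(CellText) <> '' then
            Lines.Add(CellText);
        end;
      end;
    end;
  finally
    Lines.EndUpdate;
  end;
end;

end.
```

For the first page, the call would be AddTableCells(Doc, 7, Doc.all.tags('TABLE').length - 4, Memo1.Lines); for a page whose listing starts one table earlier, only the start index changes to 6.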
 
LVL 30

Author Comment

by:Wayne Barron
Russell.
I just tried something. (It might not be the best way to do it, but with your code, the way you wrote it, it seems
to work pretty well.)

OK, here:

for x:=7 to WB1.OleObject.Document.all.tags('TABLE').Length - 4 do

I changed it to:

for x:=1 to WB1.OleObject.Document.all.tags('TABLE').Length - 1 do

This grabs everything from the first tables through to the [End], including the tables that I need.

For the tables that come "before" and "after" the [Web Pages] tables, it adds this to Memo1:

||
||
||
||
||
||
<title>|<Link>|<Description>  (All of them)
||
||
||

So all I need to do now is a simple [Search and Replace] for every line that is not valid (||)
and replace it with a blank.

So, your code works GREAT!!!!!!!!! now.

Thank you so very much once again.

Wayne
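
The clean-up step described above can also be done in code instead of by hand. The sketch below assumes the invalid lines are exactly '||' (possibly with surrounding spaces); DropEmptyRecords is a hypothetical name. It deletes those lines outright instead of replacing them with blanks, and it works on any TStrings, so Memo1.Lines can be passed in directly.

```pascal
program DropEmptyRecordsSketch;

{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

uses
  SysUtils, Classes;

// Hypothetical helper: deletes every line that is only '||' (an empty record).
procedure DropEmptyRecords(Lines: TStrings);
var
  i: Integer;
begin
  Lines.BeginUpdate;
  try
    // Walk backwards so deleting a line does not shift the ones still to visit.
    for i := Lines.Count - 1 downto 0 do
      if Trim(Lines[i]) = '||' then
        Lines.Delete(i);
  finally
    Lines.EndUpdate;
  end;
end;

var
  L: TStringList;
begin
  L := TStringList.Create;
  try
    L.Add('||');
    L.Add('Web Gallery of Art|http://www.wga.hu/|Virtual museum...');
    L.Add('||');
    DropEmptyRecords(L);
    WriteLn(L.Count); // 1
    WriteLn(L[0]);    // the one valid record
  finally
    L.Free;
  end;
end.
```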
 
LVL 26

Expert Comment

by:Russell Libby
Cool ;-)

Russell