Solved

Parsing images from HTML

Posted on 2004-04-29
Last Modified: 2010-04-05
Hi,

I want to parse images, together with the links for those images, from an HTML file. Most components I have found parse only images or only links, but I need the link that belongs to each image!

Could anybody help me?

Question by:k4hvd77
23 Comments
 
LVL 7

Expert Comment

by:sftweng
ID: 10947255
0
 
LVL 17

Expert Comment

by:mokule
ID: 10947259
I suggest using regular expressions.

For example, you can download TRegExpr:
http://regexpstudio.com/TRegExpr/TRegExpr.html

It's quite powerful and easy to use.
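
As a rough illustration of the regular-expression approach, here is a minimal console sketch. It assumes the TRegExpr class from the RegExpr unit (the library linked above; Free Pascal ships a copy of it). The pattern itself and the name SampleHtml are assumptions made for this example, not part of the library. It only handles double-quoted attributes and an <img> that directly follows the opening <a ...> tag, so treat it as a starting point rather than a general HTML parser.

```pascal
program ImgLinkRegex;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, RegExpr;  // RegExpr is the unit that declares TRegExpr

const
  // Sample input (hypothetical name), taken from the HTML file in the question.
  SampleHtml =
    '<p><a href="http://www.google.de">' + #13#10 +
    '<img border="0" src="http://www.google.de/intl/de_de/images/logo.gif"' +
    ' width="800" height="600"></a></p>';

var
  Re: TRegExpr;
  N: Integer;
begin
  Re := TRegExpr.Create;
  try
    Re.ModifierI := True;  // match tag and attribute names case-insensitively
    // Group 1: href of the anchor. Group 2: src of an image that follows the
    // opening <a ...> tag with only whitespace in between.
    Re.Expression :=
      '<a\s[^>]*href\s*=\s*"([^"]*)"[^>]*>\s*<img\s[^>]*src\s*=\s*"([^"]*)"';
    N := 0;
    if Re.Exec(SampleHtml) then
      repeat
        Inc(N);
        WriteLn(Format('[Link%.2d]', [N]));  // %.2d pads to two digits
        WriteLn('image= ', Re.Match[2]);
        WriteLn('Link= ', Re.Match[1]);
      until not Re.ExecNext;                 // move on to the next match
  finally
    Re.Free;
  end;
end.
```

Pages with single-quoted or unquoted attributes, or with other markup between the <a> and the <img>, would need a looser pattern or a tag-by-tag scan instead.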
0
 
LVL 7

Expert Comment

by:sftweng
ID: 10947305
Re: "LinkGrabber": The whole project is available in the following zip file: http://members.rogers.com/alan.bu/LinkGrabber.zip
0
 
LVL 4

Author Comment

by:k4hvd77
ID: 10947548
sftweng,

I cannot understand how LinkGrabber could help me do that!

What I need is the following:

I have an HTML file:


------------------------------------------------------------------------------------------------------------------------------------------

<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Neue Seite 1</title>
</head>

<body>

<p><a href="http://www.google.de">
<img border="0" src="http://www.google.de/intl/de_de/images/logo.gif" width="800" height="600"></a></p>

</body>

</html>


------------------------------------------------------------------------------------------------------------------------------------------

Clicking the image redirects me to Google. I want to extract the image URL (http://www.google.de/intl/de_de/images/logo.gif) and the link for this image (http://www.google.de), and get output like this:

[Link01]
image= http://www.google.de/intl/de_de/images/logo.gif
Link= http://www.google.de



0
 
LVL 7

Expert Comment

by:sftweng
ID: 10947678
You could modify the LinkGrabber "TestForLink" procedure to pull out all "<img>" directives.
0
 
LVL 4

Author Comment

by:k4hvd77
ID: 10947728
Could you send me some examples?
0
 
LVL 7

Expert Comment

by:sftweng
ID: 10948358
I have to go to a meeting for a couple of hours but I'll try to get back to this later today. Sorry.
0
 
LVL 4

Author Comment

by:k4hvd77
ID: 10948378
no problem ;)
0
 
LVL 6

Expert Comment

by:Amir Azhdari
ID: 10950724
k4hvd77,
Place a WebBrowser, a Memo and 2 Buttons on the form and try this code.
By the way, first navigate the WebBrowser to the page (e.g. www.yahoo.com, an HTML file, ...).

Regards
Azhdari


unit Unit1;

interface

uses
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,

  Dialogs,activex,comctrls,olectrls, mshtml, StdCtrls, SHDocVw,clipbrd;

type
  TForm1 = class(TForm)
    WebBrowser1: TWebBrowser;
    Button1: TButton;
    Memo1: TMemo;
    Button2: TButton;
    procedure Button2Click(Sender: TObject);
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}



procedure TForm1.Button2Click(Sender: TObject);
var
  li, i, j: Integer;   // Integer, not Word: images.length - 1 is -1 for a page without images
  s1, s2: string;
  images: OleVariant;
begin
  Memo1.Lines.Clear;
  // Late-bound access to the document's image collection (the page must be loaded).
  images := WebBrowser1.OleObject.document.images;
  for li := 0 to images.length - 1 do
  begin
    s1 := images.item(li).src;
    Memo1.Lines.Add('[LINK' + IntToStr(li) + ']');
    Memo1.Lines.Add('image= ' + s1);
    if (StrPos(PChar(s1), 'http') <> nil) or (StrPos(PChar(s1), 'ftp') <> nil) then
    begin
      // Copy everything up to and including the third '/', i.e. the
      // scheme and host part of the image URL ("http://host/").
      s2 := '';
      j := 0;
      for i := 1 to Length(s1) do
      begin
        if j = 3 then
          Break;
        s2 := s2 + s1[i];
        if s1[i] = '/' then
          Inc(j);
      end;
      Memo1.Lines.Add('Link= ' + s2);
    end
    else
      Memo1.Lines.Add('Link= Load From Drive');
  end;
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
webbrowser1.Navigate('www.yahoo.com');
end;

end.


0
 
LVL 4

Author Comment

by:k4hvd77
ID: 10950999
AmirAzhdari,

That's not what I'm looking for!

To repeat what I need (the sample HTML file is in my earlier comment, ID 10947548): clicking the image redirects me to Google, and I want to extract the image URL together with the link for this image, with output like this:

[Link01]
image= http://www.google.de/intl/de_de/images/logo.gif
Link= http://www.google.de
0
 
LVL 22

Expert Comment

by:Mohammed Nasman
ID: 10951436
0
 
LVL 7

Expert Comment

by:sftweng
ID: 10951656
Change to "LinkGrabber":
{==============================================================================}
procedure TLGForm.Parse1Click(Sender: TObject);
VAR
  il, ic, lc, lb, rb, linkCount : INTEGER;
  s, t, d : String;
  collecting : BOOLEAN;
  parentNode : TTreeNode;
  currNode : TTreeNode;
  currText, hrefText, srcText : String;
  deltaTime : INTEGER;
  lineCount, lineCountTick : INTEGER;
{------------------------------------------------------------------------------}
procedure TestForLink(ts : String; VAR cnt : INTEGER);
var hrefPos : INTEGER;
begin
  currText := UpperCase(ts);
  IF (Pos('<A',currText) = 1) THEN
  BEGIN
    hrefPos := Pos('HREF="',currText);
    IF (hrefPos > 0) THEN
    BEGIN
      Delete(ts,1,hrefPos+5);
      hrefPos := Pos('"',ts);
      currText := UpperCase(ts);
      IF (hrefPos > 0)
      {AND (Pos('.JPG',currText) = Length(currText)-5)} THEN
      BEGIN
        hrefText := Copy(ts,1,hrefPos-1);
        ListBoxLinks.Items.Add('Link='+hrefText);
        INC(cnt);
      END {IF};
    END {IF};
  END {IF};
end {TestForLink};
{------------------------------------------------------------------------------}
procedure TestForImage(ts : String; VAR cnt : INTEGER);
var srcPos : INTEGER;
begin
  currText := UpperCase(ts);
  IF (Pos('<IMG',currText) = 1) THEN
  BEGIN
    srcPos := Pos('SRC="',currText);
    IF (srcPos > 0) THEN
    BEGIN
      Delete(ts,1,srcPos+4);
      srcPos := Pos('"',ts);
      currText := UpperCase(ts);
      IF (srcPos > 0)
      {AND (Pos('.JPG',currText) = Length(currText)-5)} THEN
      BEGIN
        srcText := Copy(ts,1,srcPos-1);
        ListBoxLinks.Items.Add('Image='+srcText);
        INC(cnt);
      END {IF};
    END {IF};
  END {IF};
end {TestForImage};
{------------------------------------------------------------------------------}
begin {Parse1Click}
  startTime := Now;
  rootNode.Text := EditURL.Text;
  parentNode := rootNode;
  ListBoxLinks.Items.Clear;
  linkCount := 0;
  WITH MemoRawHTML DO BEGIN
    lc := Lines.Count;
    lineCount := lc;
    lineCountTick := LineCount DIV 20;
    collecting := FALSE;
    t := ''; d := '';
    TreeViewParsed.Visible := FALSE;
    ListBoxLinks.Visible := FALSE;
    FOR il := 0 TO lc-1
    DO BEGIN
      s := Lines[il];
//      StatusBar.SimpleText := s;
      FOR ic := 1 TO Length(s) DO
      BEGIN
        IF s[ic] = '<'
        THEN BEGIN
          collecting := TRUE;
          WITH TreeViewParsed.Items DO
          IF d <> '' THEN BEGIN
            AddChild(parentNode,d);
            TestForLink(d,linkCount);
            TestForImage(d,linkCount);
          END {IF};
          d := '';
        END {IF};
        IF NOT collecting THEN d := d+s[ic];
        IF collecting  THEN t := t+s[ic];
        IF s[ic] = '>'
        THEN BEGIN
          collecting := FALSE;
          WITH TreeViewParsed.Items DO
          BEGIN
            IF Pos('</',t) = 1
            THEN AddChild(parentNode,t)
            ELSE parentNode := Add(rootNode,t);
            TestForLink(t,linkCount);
            TestForImage(t,linkCount);
          END {WITH };
          t := '';
        END {IF};
        IF (linkCount >= StrToInt(EditLinkLimit.Text)) THEN Break;
      END {FOR};
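      { lineCountTick is 0 for input shorter than 20 lines; guard the MOD against division by zero }
      IF (lineCountTick > 0) AND ((il MOD lineCountTick) = 0) THEN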
      BEGIN
        ProgressBar1.Position := (il * 100 DIV lineCount);
      END {IF};
    END {FOR};
    ListBoxLinks.Visible := TRUE;
    TreeViewParsed.Visible := TRUE;
  END {WITH };
  endTime := Now;
  deltaTime := SecondsBetween(startTime,endTime);
  StatusBar.SimpleText := Format('Done parse in %d seconds',[deltaTime]);
  MemoDiag.Lines.Add(StatusBar.SimpleText);
//  TreeView1.FullExpand;
end {Parse1Click};
0
 
LVL 4

Author Comment

by:k4hvd77
ID: 10952121
sftweng,

I'm using Delphi 7 and cannot compile the project!
1. I have no FastNet (NMHTTP) components.
2. I don't have the JVCL.
0
 
LVL 7

Expert Comment

by:sftweng
ID: 10952181
I'll have to check carefully but I don't think you need them. Just write a program that puts your HTML into a string, passes it into the procedure (Parse1Click) and stores or uses the results.

Concentrate on the "TestFor" procedures and just pass them HTML strings from whatever source you choose.
0
 
LVL 7

Expert Comment

by:sftweng
ID: 10952192
I don't think "Parse1Click" needs either NMHTTP or JVCL.
0
 
LVL 4

Author Comment

by:k4hvd77
ID: 10952303
Sorry, I cannot understand how to get it to work!

Could you send me the project at admin@titaniumserver.de?


thanks
k4hvd77
0
 
LVL 7

Accepted Solution

by:
sftweng earned 250 total points
ID: 10952470
k4hvd77, Ex-Ex rules don't allow me to use email. You should have been able to download the original project from the URL I posted earlier and then cut and paste the replacement code from my earlier posting. I'd like to help you on this, but I'm prevented by Ex-Ex rules from using email correspondence.

But if you don't have the NMHTTP and JVCL components, anyway, you should just take the source code for "Parse1Click", written above, and remove all of the component references, e.g., an edit box, treenode and listbox, and replace them with string equivalents.

The core of the code is the "TestFor" procedures - just feed them HTML lines, acquired from whatever source you like, and feed the results (added via Listbox.Add) back to the client (caller) software.
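
A minimal, component-free sketch of that idea follows. It is not the LinkGrabber code: the names AttrValue, ExtractImageLinks and SampleHtml are hypothetical, and the scan assumes double-quoted attributes and an opening anchor written as "<a " followed by its attributes. It remembers the href of the anchor that is currently open, so each <img> found inside it can be reported together with that link in the [LinkNN] format asked for in the question.

```pascal
program ImgLinkScan;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, StrUtils;  // StrUtils provides PosEx

// Returns the double-quoted value of attribute AttrName (given in upper
// case, e.g. 'HREF') inside the tag text Tag, or '' if it is not present.
function AttrValue(const Tag, AttrName: string): string;
var
  P, Q: Integer;
  Rest: string;
begin
  Result := '';
  P := Pos(AttrName + '="', UpperCase(Tag));
  if P = 0 then Exit;
  Rest := Copy(Tag, P + Length(AttrName) + 2, MaxInt);  // text after the opening quote
  Q := Pos('"', Rest);
  if Q > 0 then
    Result := Copy(Rest, 1, Q - 1);
end;

// Scans Html tag by tag and appends one [LinkNN] block to Output for
// every <img> that sits inside an <a href="...">.
procedure ExtractImageLinks(const Html: string; Output: TStrings);
var
  I, TagEnd, Count: Integer;
  Tag, UpTag, CurrentHref: string;
begin
  CurrentHref := '';
  Count := 0;
  I := 1;
  while I <= Length(Html) do
  begin
    if Html[I] <> '<' then
    begin
      Inc(I);
      Continue;
    end;
    TagEnd := PosEx('>', Html, I);
    if TagEnd = 0 then Break;            // unterminated tag: stop scanning
    Tag := Copy(Html, I, TagEnd - I + 1);
    UpTag := UpperCase(Tag);
    if Pos('<A ', UpTag) = 1 then
      CurrentHref := AttrValue(Tag, 'HREF')   // entering an anchor
    else if Pos('</A>', UpTag) = 1 then
      CurrentHref := ''                       // leaving the anchor
    else if (Pos('<IMG', UpTag) = 1) and (CurrentHref <> '') then
    begin
      Inc(Count);
      Output.Add(Format('[Link%.2d]', [Count]));
      Output.Add('image= ' + AttrValue(Tag, 'SRC'));
      Output.Add('Link= ' + CurrentHref);
    end;
    I := TagEnd + 1;
  end;
end;

const
  // Body of the sample page from the question.
  SampleHtml =
    '<body><p><a href="http://www.google.de">' + #13#10 +
    '<img border="0" src="http://www.google.de/intl/de_de/images/logo.gif"' +
    ' width="800" height="600"></a></p></body>';

var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    ExtractImageLinks(SampleHtml, Lines);
    Write(Lines.Text);
  finally
    Lines.Free;
  end;
end.
```

Because the results go into a TStrings, the caller can pass a TStringList, Memo1.Lines or ListBox1.Items without changing the parsing code. Images that are not wrapped in a link are skipped in this sketch.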
0
 
LVL 7

Expert Comment

by:sftweng
ID: 10952652
k4hvd77, when I said "you should have been able to", I meant no criticism and I recognize the fact that we are dealing with different versions of Delphi (6 & 7) and libraries. My intention was to focus on the key software, the "TestFor" procedure, which should be more portable than the rest of the application.

I do recommend, however, that you take a good look at using, at the least, the JCL and JVCL components, available from http://www.jedi-delphi.org

Good luck, and do, please, continue to ask questions - I'll be pleased to help.
0
 
LVL 7

Expert Comment

by:sftweng
ID: 10952678
Sorry, that should be http://delphi-jedi.org.
0
 
LVL 7

Expert Comment

by:sftweng
ID: 13190230
I believe my solution met the requirements
0
