Solved

Parsing images from HTML

Posted on 2004-04-29
Last Modified: 2010-04-05
Hi,

What I want is to parse images and the links for these images from an HTML file. Most components I found parse only images or only links, but I need the image together with its link!

Could anybody help me?

k4hvd77
Question by:k4hvd77
23 Comments
 
LVL 7

Expert Comment

by:sftweng
ID: 10947255
0
 
LVL 17

Expert Comment

by:mokule
ID: 10947259
I suggest using regular expressions.

For example, you can download TRegExpr:
http://regexpstudio.com/TRegExpr/TRegExpr.html

It's quite powerful and easy to use.
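
To illustrate the regular-expression route, here is a minimal sketch, assuming the TRegExpr unit (RegExpr.pas) is on the search path. The pattern, the program name and the SampleHtml constant are illustrative assumptions, not part of the library. The pattern only covers an <a href="..."> tag followed directly by an <img ... src="..."> tag with double-quoted attributes, as in the sample page from this question; it is not a general HTML parser.

```pascal
program RegexImageLinks;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, RegExpr;  // RegExpr.pas provides TRegExpr

const
  // Hypothetical sample input, shortened from the page in the question
  SampleHtml =
    '<p><a href="http://www.google.de">'#13#10 +
    '<img border="0" src="http://www.google.de/intl/de_de/images/logo.gif"></a></p>';

var
  Re: TRegExpr;
  Count: Integer;
begin
  Re := TRegExpr.Create;
  try
    Re.ModifierI := True;  // match tag and attribute names case-insensitively
    // Group 1 = href of the anchor, group 2 = src of the image that follows it
    Re.Expression := '<a\s[^>]*href="([^"]*)"[^>]*>\s*<img\s[^>]*src="([^"]*)"';
    Count := 0;
    if Re.Exec(SampleHtml) then
      repeat
        Inc(Count);
        WriteLn(Format('[Link%.2d]', [Count]));
        WriteLn('image= ', Re.Match[2]);
        WriteLn('Link= ', Re.Match[1]);
      until not Re.ExecNext;
  finally
    Re.Free;
  end;
end.
```

Images that are not wrapped in a link, or that have other markup between the <a> and the <img>, are simply skipped by this pattern.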
0
 
LVL 7

Expert Comment

by:sftweng
ID: 10947305
Re: "LinkGrabber": The whole project is available in the following zip file: http://members.rogers.com/alan.bu/LinkGrabber.zip
0
 
LVL 4

Author Comment

by:k4hvd77
ID: 10947548
sftweng,

I cannot understand how LinkGrabber could help me do that!

What I need is the following:

I have a HTML File:


------------------------------------------------------------------------------------------------------------------------------------------

<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Neue Seite 1</title>
</head>

<body>

<p><a href="http://www.google.de">
<img border="0" src="http://www.google.de/intl/de_de/images/logo.gif" width="800" height="600"></a></p>

</body>

</html>


------------------------------------------------------------------------------------------------------------------------------------------

Clicking on the image redirects me to Google. Now I want to extract the image ("http://www.google.de/intl/de_de/images/logo.gif") and the link for this image (http://www.google.de), and get output like this:

[Link01]
image= http://www.google.de/intl/de_de/images/logo.gif
Link= http://www.google.de



0
 
LVL 7

Expert Comment

by:sftweng
ID: 10947678
You could modify the LinkGrabber "TestForLink" procedure to pull out all "<img>" tags.
0
 
LVL 4

Author Comment

by:k4hvd77
ID: 10947728
Could you send me some examples?
0
 
LVL 7

Expert Comment

by:sftweng
ID: 10948358
I have to go to a meeting for a couple of hours but I'll try to get back to this later today. Sorry.
0
 
LVL 4

Author Comment

by:k4hvd77
ID: 10948378
no problem ;)
0
 
LVL 6

Expert Comment

by:Amir Azhdari
ID: 10950724
k4hvd77
Place a WebBrowser, a Memo and 2 Buttons on the form and try this code.
By the way, first navigate the WebBrowser to the page (e.g. www.yahoo.com, an HTML file, ...).

Regards
Azhdari


unit Unit1;

interface

uses
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,

  Dialogs,activex,comctrls,olectrls, mshtml, StdCtrls, SHDocVw,clipbrd;

type
  TForm1 = class(TForm)
    WebBrowser1: TWebBrowser;
    Button1: TButton;
    Memo1: TMemo;
    Button2: TButton;
    procedure Button2Click(Sender: TObject);
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}



procedure TForm1.Button2Click(Sender: TObject);
var
  li, i, j: Integer;  // Integer, not Word: the upper loop bound is -1 when the page has no images
  s1, s2: string;
begin
  Memo1.Lines.Clear;
  // Walk the images collection of the loaded document (late-bound via OleObject)
  for li := 0 to WebBrowser1.OleObject.document.images.length - 1 do
  begin
    s1 := WebBrowser1.OleObject.document.images.item(li).src;
    with Memo1.Lines do
    begin
      Add('[LINK' + IntToStr(li) + ']');
      Add('image= ' + s1);
      if (StrPos(PChar(s1), 'http') <> nil) or (StrPos(PChar(s1), 'ftp') <> nil) then
      begin
        // Copy the image URL up to and including the third '/', i.e. scheme and host
        s2 := '';
        j := 0;
        for i := 1 to Length(s1) do
        begin
          if j = 3 then
            Break;
          s2 := s2 + s1[i];
          if s1[i] = '/' then
            Inc(j);
        end;
        Add('Link= ' + s2);
      end
      else
        Add('Link= Load From Drive');
    end;
  end;
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
webbrowser1.Navigate('www.yahoo.com');
end;

end.


0
 
LVL 4

Author Comment

by:k4hvd77
ID: 10950999
AmirAzhdari,

that's not what I'm looking for!


RE:

What I need is the following:

I have a HTML File:


------------------------------------------------------------------------------------------------------------------------------------------

<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Neue Seite 1</title>
</head>

<body>

<p><a href="http://www.google.de">
<img border="0" src="http://www.google.de/intl/de_de/images/logo.gif" width="800" height="600"></a></p>

</body>

</html>


------------------------------------------------------------------------------------------------------------------------------------------

Clicking on the image redirects me to Google. Now I want to extract the image ("http://www.google.de/intl/de_de/images/logo.gif") and the link for this image (http://www.google.de), and get output like this:

[Link01]
image= http://www.google.de/intl/de_de/images/logo.gif
Link= http://www.google.de
0
 
LVL 22

Expert Comment

by:Mohammed Nasman
ID: 10951436
0
 
LVL 7

Expert Comment

by:sftweng
ID: 10951656
Change to "LinkGrabber":
{==============================================================================}
procedure TLGForm.Parse1Click(Sender: TObject);
VAR
  il, ic, lc, lb, rb, linkCount : INTEGER;
  s, t, d : String;
  collecting : BOOLEAN;
  parentNode : TTreeNode;
  currNode : TTreeNode;
  currText, hrefText, srcText : String;
  deltaTime : INTEGER;
  lineCount, lineCountTick : INTEGER;
{------------------------------------------------------------------------------}
procedure TestForLink(ts : String; VAR cnt : INTEGER);
var hrefPos : INTEGER;
begin
  currText := UpperCase(ts);
  IF (Pos('<A',currText) = 1) THEN
  BEGIN
    hrefPos := Pos('HREF="',currText);
    IF (hrefPos > 0) THEN
    BEGIN
      Delete(ts,1,hrefPos+5);
      hrefPos := Pos('"',ts);
      currText := UpperCase(ts);
      IF (hrefPos > 0)
      {AND (Pos('.JPG',currText) = Length(currText)-5)} THEN
      BEGIN
        hrefText := Copy(ts,1,hrefPos-1);
        ListBoxLinks.Items.Add('Link='+hrefText);
        INC(cnt);
      END {IF};
    END {IF};
  END {IF};
end {TestForLink};
{------------------------------------------------------------------------------}
procedure TestForImage(ts : String; VAR cnt : INTEGER);
var srcPos : INTEGER;
begin
  currText := UpperCase(ts);
  IF (Pos('<IMG',currText) = 1) THEN
  BEGIN
    srcPos := Pos('SRC="',currText);
    IF (srcPos > 0) THEN
    BEGIN
      Delete(ts,1,srcPos+4);
      srcPos := Pos('"',ts);
      currText := UpperCase(ts);
      IF (srcPos > 0)
      {AND (Pos('.JPG',currText) = Length(currText)-5)} THEN
      BEGIN
        srcText := Copy(ts,1,srcPos-1);
        ListBoxLinks.Items.Add('Image='+srcText);
        INC(cnt);
      END {IF};
    END {IF};
  END {IF};
end {TestForImage};
{------------------------------------------------------------------------------}
begin {Parse1Click}
  startTime := Now;
  rootNode.Text := EditURL.Text;
  parentNode := rootNode;
  ListBoxLinks.Items.Clear;
  linkCount := 0;
  WITH MemoRawHTML DO BEGIN
    lc := Lines.Count;
    lineCount := lc;
    lineCountTick := LineCount DIV 20;
    collecting := FALSE;
    t := ''; d := '';
    TreeViewParsed.Visible := FALSE;
    ListBoxLinks.Visible := FALSE;
    FOR il := 0 TO lc-1
    DO BEGIN
      s := Lines[il];
//      StatusBar.SimpleText := s;
      FOR ic := 1 TO Length(s) DO
      BEGIN
        IF s[ic] = '<'
        THEN BEGIN
          collecting := TRUE;
          WITH TreeViewParsed.Items DO
          IF d <> '' THEN BEGIN
            AddChild(parentNode,d);
            TestForLink(d,linkCount);
            TestForImage(d,linkCount);
          END {IF};
          d := '';
        END {IF};
        IF NOT collecting THEN d := d+s[ic];
        IF collecting  THEN t := t+s[ic];
        IF s[ic] = '>'
        THEN BEGIN
          collecting := FALSE;
          WITH TreeViewParsed.Items DO
          BEGIN
            IF Pos('</',t) = 1
            THEN AddChild(parentNode,t)
            ELSE parentNode := Add(rootNode,t);
            TestForLink(t,linkCount);
            TestForImage(t,linkCount);
          END {WITH };
          t := '';
        END {IF};
        IF (linkCount >= StrToInt(EditLinkLimit.Text)) THEN Break;
      END {FOR};
      IF (lineCountTick > 0) AND ((il MOD lineCountTick) = 0) THEN  // lineCountTick is 0 for input under 20 lines; avoid MOD by zero
      BEGIN
        ProgressBar1.Position := (il * 100 DIV lineCount);
      END {IF};
    END {FOR};
    ListBoxLinks.Visible := TRUE;
    TreeViewParsed.Visible := TRUE;
  END {WITH };
  endTime := Now;
  deltaTime := SecondsBetween(startTime,endTime);
  StatusBar.SimpleText := Format('Done parse in %d seconds',[deltaTime]);
  MemoDiag.Lines.Add(StatusBar.SimpleText);
//  TreeView1.FullExpand;
end {Parse1Click};
0
 
LVL 4

Author Comment

by:k4hvd77
ID: 10952121
sftweng,

I'm using Delphi 7 and cannot compile the project!
1. I have no FastNet (NMHTTP) components,
2. I don't have the JVCL.
0
 
LVL 7

Expert Comment

by:sftweng
ID: 10952181
I'll have to check carefully but I don't think you need them. Just write a program that puts your HTML into a string, passes it into the procedure (Parse1Click) and stores or uses the results.

Concentrate on the "TestFor" procedures and just pass them HTML strings from whatever source you choose.
0
 
LVL 7

Expert Comment

by:sftweng
ID: 10952192
I don't think "Parse1Click" needs either NMHTTP or JVCL.
0
 
LVL 4

Author Comment

by:k4hvd77
ID: 10952303
Sorry, I cannot understand how to get it to work!

Could you send me the project to  admin@titaniumserver.de


thanks
k4hvd77
0
 
LVL 7

Accepted Solution

by:
sftweng earned 250 total points
ID: 10952470
k4hvd77, Ex-Ex rules don't allow me to use email. You should have been able to download the original project from the URL I posted earlier and then cut and paste the replacement code from my earlier posting. I'd like to help you on this, but I'm prevented by Ex-Ex rules from using email correspondence.

In any case, if you don't have the NMHTTP and JVCL components, you should just take the source code for "Parse1Click", written above, remove all of the component references (e.g., the edit box, tree node and list box), and replace them with string equivalents.

The core of the code is the "TestFor" procedures - just feed them HTML lines, acquired from whatever source you like, and feed the results (added via Listbox.Add) back to the client (caller) software.
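
As an illustration of that advice, here is a minimal sketch with no visual components and no NMHTTP or JVCL, assuming Delphi 7 with only SysUtils and Classes. ExtractAttr and ExtractImageLinks are hypothetical names, not part of LinkGrabber. The sketch goes one step beyond the "TestFor" procedures by remembering the href of the currently open <a> tag so that each image can be reported together with its link. It only handles double-quoted attributes and an "<a " tag written with a space after the "a".

```pascal
program ImageLinks;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

{ Returns the value of a double-quoted attribute inside one tag, or '' if absent. }
function ExtractAttr(const Tag, AttrName: string): string;
var
  Rest: string;
  P: Integer;
begin
  Result := '';
  P := Pos(UpperCase(AttrName) + '="', UpperCase(Tag));
  if P = 0 then
    Exit;
  Rest := Copy(Tag, P + Length(AttrName) + 2, MaxInt);  // text after the opening quote
  P := Pos('"', Rest);
  if P > 0 then
    Result := Copy(Rest, 1, P - 1);
end;

{ Scans Html tag by tag and adds one [LinkNN] block per <img> tag to Output. }
procedure ExtractImageLinks(const Html: string; Output: TStrings);
var
  I, TagStart, Count: Integer;
  Tag, UpTag, CurrentHref: string;
begin
  CurrentHref := '';
  Count := 0;
  TagStart := 0;
  for I := 1 to Length(Html) do
    if Html[I] = '<' then
      TagStart := I
    else if (Html[I] = '>') and (TagStart > 0) then
    begin
      Tag := Copy(Html, TagStart, I - TagStart + 1);
      UpTag := UpperCase(Tag);
      TagStart := 0;
      if Pos('</A>', UpTag) = 1 then
        CurrentHref := ''                        // the link ends here
      else if Pos('<A ', UpTag) = 1 then
        CurrentHref := ExtractAttr(Tag, 'href')  // remember the enclosing link
      else if Pos('<IMG', UpTag) = 1 then
      begin
        Inc(Count);
        Output.Add(Format('[Link%.2d]', [Count]));
        Output.Add('image= ' + ExtractAttr(Tag, 'src'));
        Output.Add('Link= ' + CurrentHref);      // empty when the image is not inside a link
      end;
    end;
end;

var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    ExtractImageLinks(
      '<p><a href="http://www.google.de">' +
      '<img border="0" src="http://www.google.de/intl/de_de/images/logo.gif"></a></p>',
      Lines);
    Write(Lines.Text);
  finally
    Lines.Free;
  end;
end.
```

For the sample page from the question this should produce a single [Link01] block with the image= and Link= lines in the requested layout. To process a file, the HTML could be read with TStringList.LoadFromFile and its Text property passed in.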
0
 
LVL 7

Expert Comment

by:sftweng
ID: 10952652
k4hvd77, when I said "you should have been able to", I meant no criticism and I recognize the fact that we are dealing with different versions of Delphi (6 & 7) and libraries. My intention was to focus on the key software, the "TestFor" procedure, which should be more portable than the rest of the application.

I do recommend, however, that you take a good look at using at least the JCL and JVCL components, available from http://www.jedi-delphi.org

Good luck, and do, please, continue to ask questions - I'll be pleased to help.
0
 
LVL 7

Expert Comment

by:sftweng
ID: 10952678
Sorry, that should be http://delphi-jedi.org.
0
 
LVL 7

Expert Comment

by:sftweng
ID: 13190230
I believe my solution met the requirements.
0
