Solved

Parsing images from HTML

Posted on 2004-04-29
Last Modified: 2010-04-05
Hi,

What I want is to parse images and the links for these images from an HTML file. Most components I found parse only images or only links, but I need the image together with its link!

Could anybody help me?

Question by:k4hvd77
 
LVL 7

Expert Comment

by:sftweng
ID: 10947255
 
LVL 17

Expert Comment

by:mokule
ID: 10947259
I suggest using a regular expression.

For example, you can download TRegExpr:
http://regexpstudio.com/TRegExpr/TRegExpr.html

It's quite powerful and easy to use.
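
A minimal sketch of the regular-expression approach, assuming the TRegExpr class from the RegExpr unit (its Expression, ModifierI, Exec, ExecNext and Match members). The pattern and the SampleHtml constant are illustrative assumptions, not part of any posted project: the pattern only handles double-quoted attributes and an <img> that directly follows the opening <a> tag, so treat it as a starting point rather than a general HTML parser.

```pascal
program RegexImageLinks;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils, RegExpr; // RegExpr is the unit that declares TRegExpr

const
  // Hypothetical sample input, modelled on the HTML posted in this thread.
  SampleHtml =
    '<p><a href="http://www.google.de">'#13#10 +
    '<img border="0" src="http://www.google.de/intl/de_de/images/logo.gif">' +
    '</a></p>';

var
  Re: TRegExpr;
  N: Integer;
begin
  Re := TRegExpr.Create;
  try
    Re.ModifierI := True; // match <A HREF=...> as well as <a href=...>
    // Group 1 = href of the anchor, group 2 = src of the image that
    // immediately follows the opening anchor tag.
    Re.Expression :=
      '<a\s[^>]*href\s*=\s*"([^"]*)"[^>]*>\s*<img\s[^>]*src\s*=\s*"([^"]*)"';
    N := 0;
    if Re.Exec(SampleHtml) then
      repeat
        Inc(N);
        WriteLn(Format('[Link%.2d]', [N]));
        WriteLn('image= ', Re.Match[2]);
        WriteLn('Link= ', Re.Match[1]);
      until not Re.ExecNext;
  finally
    Re.Free;
  end;
end.
```

Matching the anchor and the image in one expression is what keeps the two values paired; running separate expressions for links and for images would lose that association.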
 
LVL 7

Expert Comment

by:sftweng
ID: 10947305
Re: "LinkGrabber": The whole project is available in the following zip file: http://members.rogers.com/alan.bu/LinkGrabber.zip
 
LVL 4

Author Comment

by:k4hvd77
ID: 10947548
sftweng,

I cannot understand how LinkGrabber could help me do that!

What I need is the following:

I have an HTML file:


------------------------------------------------------------------------------------------------------------------------------------------

<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Neue Seite 1</title>
</head>

<body>

<p><a href="http://www.google.de">
<img border="0" src="http://www.google.de/intl/de_de/images/logo.gif" width="800" height="600"></a></p>

</body>

</html>


------------------------------------------------------------------------------------------------------------------------------------------

By clicking on the image I am redirected to Google. Now I want to extract the image (http://www.google.de/intl/de_de/images/logo.gif) and the link for this image (http://www.google.de), and get output like this:

[Link01]
image= http://www.google.de/intl/de_de/images/logo.gif
Link= http://www.google.de



 
LVL 7

Expert Comment

by:sftweng
ID: 10947678
You could modify the LinkGrabber "TestForLink" procedure to pull out all "<img>" tags.
 
LVL 4

Author Comment

by:k4hvd77
ID: 10947728
Could you send me some examples?
 
LVL 7

Expert Comment

by:sftweng
ID: 10948358
I have to go to a meeting for a couple of hours but I'll try to get back to this later today. Sorry.
 
LVL 4

Author Comment

by:k4hvd77
ID: 10948378
no problem ;)
 
LVL 6

Expert Comment

by:Amir Azhdari
ID: 10950724
k4hvd77,
Place a WebBrowser, a Memo and two Buttons on the form and try this code.
By the way, first navigate the WebBrowser to the page (e.g. www.yahoo.com, an HTML file, ...).

Regards
Azhdari


unit Unit1;

interface

uses
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,
  Dialogs, ActiveX, ComCtrls, OleCtrls, MSHTML, StdCtrls, SHDocVw, Clipbrd;

type
  TForm1 = class(TForm)
    WebBrowser1: TWebBrowser;
    Button1: TButton;
    Memo1: TMemo;
    Button2: TButton;
    procedure Button2Click(Sender: TObject);
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

// Lists every image of the loaded document in the memo. Note that "Link" is
// derived from the image's own URL (scheme and host), not from an <a> element
// around the image.
procedure TForm1.Button2Click(Sender: TObject);
var
  li, i, j: Integer; // Integer, not Word: "length - 1" is -1 when the page has no images
  s1, s2: string;
begin
  Memo1.Lines.Clear;
  for li := 0 to WebBrowser1.OleObject.document.images.length - 1 do
  begin
    s1 := WebBrowser1.OleObject.document.images.item(li).src;
    with Memo1.Lines do
    begin
      Add('[LINK' + IntToStr(li) + ']');
      Add('image= ' + s1);
      if (StrPos(PChar(s1), 'http') <> nil) or (StrPos(PChar(s1), 'ftp') <> nil) then
      begin
        // Copy everything up to and including the third '/',
        // e.g. "http://host/".
        s2 := '';
        j := 0;
        for i := 1 to Length(s1) do
        begin
          if j = 3 then
            Break;
          s2 := s2 + s1[i];
          if s1[i] = '/' then
            Inc(j);
        end;
        Add('Link= ' + s2);
      end
      else
        Add('Link= Load From Drive');
    end;
  end;
end;

// Loads the page to be analysed into the browser control.
procedure TForm1.Button1Click(Sender: TObject);
begin
  WebBrowser1.Navigate('www.yahoo.com');
end;

end.


 
LVL 4

Author Comment

by:k4hvd77
ID: 10950999
AmirAzhdari,

that's not what I'm looking for!


RE:

What I need is the following:

I have an HTML file:


------------------------------------------------------------------------------------------------------------------------------------------

<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Neue Seite 1</title>
</head>

<body>

<p><a href="http://www.google.de">
<img border="0" src="http://www.google.de/intl/de_de/images/logo.gif" width="800" height="600"></a></p>

</body>

</html>


------------------------------------------------------------------------------------------------------------------------------------------

By clicking on the image I am redirected to Google. Now I want to extract the image (http://www.google.de/intl/de_de/images/logo.gif) and the link for this image (http://www.google.de), and get output like this:

[Link01]
image= http://www.google.de/intl/de_de/images/logo.gif
Link= http://www.google.de
 
LVL 22

Expert Comment

by:Mohammed Nasman
ID: 10951436
 
LVL 7

Expert Comment

by:sftweng
ID: 10951656
Change to "LinkGrabber":
{==============================================================================}
procedure TLGForm.Parse1Click(Sender: TObject);
VAR
  il, ic, lc, lb, rb, linkCount : INTEGER;
  s, t, d : String;
  collecting : BOOLEAN;
  parentNode : TTreeNode;
  currNode : TTreeNode;
  currText, hrefText, srcText : String;
  deltaTime : INTEGER;
  lineCount, lineCountTick : INTEGER;
{------------------------------------------------------------------------------}
procedure TestForLink(ts : String; VAR cnt : INTEGER);
var hrefPos : INTEGER;
begin
  currText := UpperCase(ts);
  IF (Pos('<A',currText) = 1) THEN
  BEGIN
    hrefPos := Pos('HREF="',currText);
    IF (hrefPos > 0) THEN
    BEGIN
      Delete(ts,1,hrefPos+5);
      hrefPos := Pos('"',ts);
      currText := UpperCase(ts);
      IF (hrefPos > 0)
      {AND (Pos('.JPG',currText) = Length(currText)-5)} THEN
      BEGIN
        hrefText := Copy(ts,1,hrefPos-1);
        ListBoxLinks.Items.Add('Link='+hrefText);
        INC(cnt);
      END {IF};
    END {IF};
  END {IF};
end {TestForLink};
{------------------------------------------------------------------------------}
procedure TestForImage(ts : String; VAR cnt : INTEGER);
var srcPos : INTEGER;
begin
  currText := UpperCase(ts);
  IF (Pos('<IMG',currText) = 1) THEN
  BEGIN
    srcPos := Pos('SRC="',currText);
    IF (srcPos > 0) THEN
    BEGIN
      Delete(ts,1,srcPos+4);
      srcPos := Pos('"',ts);
      currText := UpperCase(ts);
      IF (srcPos > 0)
      {AND (Pos('.JPG',currText) = Length(currText)-5)} THEN
      BEGIN
        srcText := Copy(ts,1,srcPos-1);
        ListBoxLinks.Items.Add('Image='+srcText);
        INC(cnt);
      END {IF};
    END {IF};
  END {IF};
end {TestForImage};
{------------------------------------------------------------------------------}
begin {Parse1Click}
  startTime := Now;
  rootNode.Text := EditURL.Text;
  parentNode := rootNode;
  ListBoxLinks.Items.Clear;
  linkCount := 0;
  WITH MemoRawHTML DO BEGIN
    lc := Lines.Count;
    lineCount := lc;
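    lineCountTick := LineCount DIV 20;
    IF lineCountTick = 0 THEN lineCountTick := 1; {inputs under 20 lines would otherwise divide by zero in the MOD below}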
    collecting := FALSE;
    t := ''; d := '';
    TreeViewParsed.Visible := FALSE;
    ListBoxLinks.Visible := FALSE;
    FOR il := 0 TO lc-1
    DO BEGIN
      s := Lines[il];
//      StatusBar.SimpleText := s;
      FOR ic := 1 TO Length(s) DO
      BEGIN
        IF s[ic] = '<'
        THEN BEGIN
          collecting := TRUE;
          WITH TreeViewParsed.Items DO
          IF d <> '' THEN BEGIN
            AddChild(parentNode,d);
            TestForLink(d,linkCount);
            TestForImage(d,linkCount);
          END {IF};
          d := '';
        END {IF};
        IF NOT collecting THEN d := d+s[ic];
        IF collecting  THEN t := t+s[ic];
        IF s[ic] = '>'
        THEN BEGIN
          collecting := FALSE;
          WITH TreeViewParsed.Items DO
          BEGIN
            IF Pos('</',t) = 1
            THEN AddChild(parentNode,t)
            ELSE parentNode := Add(rootNode,t);
            TestForLink(t,linkCount);
            TestForImage(t,linkCount);
          END {WITH };
          t := '';
        END {IF};
        IF (linkCount >= StrToInt(EditLinkLimit.Text)) THEN Break;
      END {FOR};
      IF ((il MOD lineCountTick) = 0) THEN
      BEGIN
        ProgressBar1.Position := (il * 100 DIV lineCount);
      END {IF};
    END {FOR};
    ListBoxLinks.Visible := TRUE;
    TreeViewParsed.Visible := TRUE;
  END {WITH };
  endTime := Now;
  deltaTime := SecondsBetween(startTime,endTime);
  StatusBar.SimpleText := Format('Done parse in %d seconds',[deltaTime]);
  MemoDiag.Lines.Add(StatusBar.SimpleText);
//  TreeView1.FullExpand;
end {Parse1Click};
 
LVL 4

Author Comment

by:k4hvd77
ID: 10952121
sftweng,

I'm using Delphi 7 and cannot compile the project!
1. I have no FastNet (NMHTTP) components.
2. I don't have the JVCL.
 
LVL 7

Expert Comment

by:sftweng
ID: 10952181
I'll have to check carefully but I don't think you need them. Just write a program that puts your HTML into a string, passes it into the procedure (Parse1Click) and stores or uses the results.

Concentrate on the "TestFor" procedures and just pass them HTML strings from whatever source you choose.
 
LVL 7

Expert Comment

by:sftweng
ID: 10952192
I don't think "Parse1Click" needs either NMHTTP or JVCL.
 
LVL 4

Author Comment

by:k4hvd77
ID: 10952303
Sorry, I cannot understand how to get it to work!

Could you send me the project to admin@titaniumserver.de?


thanks
k4hvd77
 
LVL 7

Accepted Solution

by:
sftweng earned 250 total points
ID: 10952470
k4hvd77, Ex-Ex rules don't allow me to use email. You should have been able to download the original project from the URL I posted earlier and then cut and paste the replacement code from my earlier posting. I'd like to help you with this, but Ex-Ex rules prevent me from corresponding by email.

But if you don't have the NMHTTP and JVCL components, anyway, you should just take the source code for "Parse1Click", written above, and remove all of the component references, e.g., an edit box, treenode and listbox, and replace them with string equivalents.

The core of the code is the "TestFor" procedures - just feed them HTML lines, acquired from whatever source you like, and feed the results (added via Listbox.Add) back to the client (caller) software.
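
One way the component-free variant described above might look, as a sketch only. The names GetAttr, ExtractImageLinks and SampleHtml are hypothetical and not part of LinkGrabber. It reuses the Pos/Copy idea of the "TestFor" procedures but remembers the href of the currently open <a> element, so each image is reported together with the link around it. It assumes double-quoted attribute values and no '>' inside attribute values.

```pascal
program ImageLinkPairs;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils, Classes;

// Returns the double-quoted value of AttrName inside Tag, or '' if absent.
// Simplification: the name is matched as a plain substring, so 'src' would
// also match inside an attribute such as data-src.
function GetAttr(const Tag, AttrName: string): string;
var
  P: Integer;
  Rest: string;
begin
  Result := '';
  P := Pos(UpperCase(AttrName) + '="', UpperCase(Tag));
  if P = 0 then
    Exit;
  Rest := Copy(Tag, P + Length(AttrName) + 2, MaxInt); // text after name="
  P := Pos('"', Rest);
  if P > 0 then
    Result := Copy(Rest, 1, P - 1);
end;

// Appends one "[LinkNN] / image= / Link=" block to Output for every <img>
// that has a src, using the href of the enclosing <a> element (if any).
procedure ExtractImageLinks(const Html: string; Output: TStrings);
var
  I, TagStart, Count: Integer;
  Tag, UpTag, CurrentHref, Src: string;
begin
  CurrentHref := '';
  Count := 0;
  TagStart := 0;
  for I := 1 to Length(Html) do
    if Html[I] = '<' then
      TagStart := I
    else if (Html[I] = '>') and (TagStart > 0) then
    begin
      Tag := Copy(Html, TagStart, I - TagStart + 1);
      UpTag := UpperCase(Tag);
      TagStart := 0;
      if Pos('</A>', UpTag) = 1 then
        CurrentHref := '' // the link ends here
      else if (Pos('<A', UpTag) = 1) and (Length(UpTag) > 2) and
        (UpTag[3] in [' ', #9, #10, #13]) then
        CurrentHref := GetAttr(Tag, 'href') // a real <a ...>, not <address> etc.
      else if Pos('<IMG', UpTag) = 1 then
      begin
        Src := GetAttr(Tag, 'src');
        if Src <> '' then
        begin
          Inc(Count);
          Output.Add(Format('[Link%.2d]', [Count]));
          Output.Add('image= ' + Src);
          Output.Add('Link= ' + CurrentHref);
        end;
      end;
    end;
end;

const
  // Hypothetical sample input, modelled on the HTML posted in this thread.
  SampleHtml =
    '<p><a href="http://www.google.de">'#13#10 +
    '<img border="0" src="http://www.google.de/intl/de_de/images/logo.gif" ' +
    'width="800" height="600"></a></p>';

var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    ExtractImageLinks(SampleHtml, Lines);
    Write(Lines.Text);
  finally
    Lines.Free;
  end;
end.
```

Because the result goes into a TStrings, the caller can pass a TStringList as here, or Memo1.Lines or ListBox1.Items in a form application. An image outside any <a> element is reported with an empty Link= value.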
 
LVL 7

Expert Comment

by:sftweng
ID: 10952652
k4hvd77, when I said "you should have been able to", I meant no criticism and I recognize the fact that we are dealing with different versions of Delphi (6 & 7) and libraries. My intention was to focus on the key software, the "TestFor" procedure, which should be more portable than the rest of the application.

I do recommend, however, that you take a good look at using (at least) the JCL and JVCL components, available from http://www.jedi-delphi.org

Good luck, and do, please, continue to ask questions - I'll be pleased to help.
 
LVL 7

Expert Comment

by:sftweng
ID: 10952678
Sorry, that should be http://delphi-jedi.org.
 
LVL 7

Expert Comment

by:sftweng
ID: 13190230
I believe my solution met the requirements.
