Parsing images from HTML

Hi,

What I want is to parse the images and the links for these images from an HTML file. Most components I found parse only images or only links, but I need the link that belongs to each image!

Could anybody help me?

 
sftweng commented:

mokule commented:
I suggest using regular expressions.

For example, you can download TRegExpr:
http://regexpstudio.com/TRegExpr/TRegExpr.html

It's quite powerful and easy to use.
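
A minimal sketch of the regular-expression route, assuming the TRegExpr unit (RegExpr.pas) from the link above is on the library path. The program name, the Html constant and the pattern are made up for illustration. The pattern only handles double-quoted attributes and an <img> that directly follows the opening <a> tag, so treat it as a starting point rather than a general HTML parser.

```pascal
program ImgLinksRegex;
{$APPTYPE CONSOLE}

uses
  SysUtils, RegExpr;

const
  // Sample input taken from the question (hypothetical constant name).
  Html =
    '<p><a href="http://www.google.de">'#13#10 +
    '<img border="0" src="http://www.google.de/intl/de_de/images/logo.gif" ' +
    'width="800" height="600"></a></p>';

var
  re: TRegExpr;
  n: Integer;
begin
  re := TRegExpr.Create;
  try
    re.ModifierI := True;  // match tag and attribute names case-insensitively
    // Group 1 = href of the <a>, group 2 = src of the <img> that follows it.
    re.Expression :=
      '<a\s[^>]*href="([^"]*)"[^>]*>\s*<img\s[^>]*src="([^"]*)"';
    n := 0;
    if re.Exec(Html) then
      repeat
        Inc(n);
        WriteLn(Format('[Link%.2d]', [n]));
        WriteLn('image= ', re.Match[2]);
        WriteLn('Link= ', re.Match[1]);
      until not re.ExecNext;
  finally
    re.Free;
  end;
end.
```

For the sample page this should print a single [Link01] block with the image URL and the link URL. Pages that use single quotes, unquoted attributes, or other markup between <a> and <img> would need a more tolerant pattern.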

sftweng commented:
Re: "LinkGrabber": The whole project is available in the following zip file: http://members.rogers.com/alan.bu/LinkGrabber.zip
 
k4hvd77 (author) commented:
sftweng,

I cannot understand how LinkGrabber could help me do that!

What I need is the following:

I have an HTML file:


------------------------------------------------------------------------------------------------------------------------------------------

<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Neue Seite 1</title>
</head>

<body>

<p><a href="http://www.google.de">
<img border="0" src="http://www.google.de/intl/de_de/images/logo.gif" width="800" height="600"></a></p>

</body>

</html>


------------------------------------------------------------------------------------------------------------------------------------------

By clicking on the image I will be redirected to Google. Now I want to extract the image (http://www.google.de/intl/de_de/images/logo.gif) and the link for this image (http://www.google.de), and get output like this:

[Link01]
image= http://www.google.de/intl/de_de/images/logo.gif
Link= http://www.google.de




sftweng commented:
You could modify the LinkGrabber "TestForLink" procedure to pull out all "<img>" tags as well.

k4hvd77 (author) commented:
Could you send me some examples?

sftweng commented:
I have to go to a meeting for a couple of hours but I'll try to get back to this later today. Sorry.

k4hvd77 (author) commented:
No problem ;)

Amir Azhdari commented:
k4hvd77,
place a WebBrowser, a Memo and two Buttons on the form and try this code.
By the way, first navigate the WebBrowser to the page (e.g. www.yahoo.com, an HTML file, ...).

Regards
Azhdari


unit Unit1;

interface

uses
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,

  Dialogs,activex,comctrls,olectrls, mshtml, StdCtrls, SHDocVw,clipbrd;

type
  TForm1 = class(TForm)
    WebBrowser1: TWebBrowser;
    Button1: TButton;
    Memo1: TMemo;
    Button2: TButton;
    procedure Button2Click(Sender: TObject);
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}



procedure TForm1.Button2Click(Sender: TObject);
var
  li, i, j: Integer;  // Integer rather than Word, so a page without images (length - 1 = -1) is safe
  s1, s2: string;
begin
  Memo1.Lines.Clear;
  // Walk the images collection of the loaded document (late-bound via OleObject).
  for li := 0 to WebBrowser1.OleObject.document.images.length - 1 do
  begin
    s1 := WebBrowser1.OleObject.document.images.item(li).src;
    with Memo1.Lines do
    begin
      Add('[LINK' + IntToStr(li) + ']');
      Add('image= ' + s1);
      if (StrPos(PChar(s1), 'http') <> nil) or (StrPos(PChar(s1), 'ftp') <> nil) then
      begin
        // Copy everything up to and including the third '/', i.e. scheme://host/.
        // Note: this is the host of the image URL, not the href of an enclosing <a>.
        s2 := '';
        j := 0;
        for i := 1 to Length(s1) do
        begin
          if j = 3 then
            Break;
          s2 := s2 + s1[i];
          if s1[i] = '/' then
            Inc(j);
        end;
        Add('Link= ' + s2);
      end
      else
        Add('Link= Load From Drive');
    end;
  end;
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
webbrowser1.Navigate('www.yahoo.com');
end;

end.
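
The code above reports the host of each image URL. To report the link that actually wraps an image, one option is to climb from each image to its enclosing <a> element through the same late-bound document object. The following is a sketch only: the unit name ImageLinksDom and the procedure ListImageLinks are hypothetical, and it assumes the page has finished loading in the TWebBrowser before it is called. It relies on the MSHTML members images, parentElement, tagName, src and href.

```pascal
unit ImageLinksDom;

interface

uses
  Classes, SHDocVw;

// Appends one [LinkNN] block to Output for every image inside an <a> element.
procedure ListImageLinks(Browser: TWebBrowser; Output: TStrings);

implementation

uses
  SysUtils, Variants;

procedure ListImageLinks(Browser: TWebBrowser; Output: TStrings);
var
  doc, img, anc: OleVariant;
  i, n: Integer;
  tagName, src, href: string;
begin
  doc := Browser.OleObject.document;  // assumes the document is fully loaded
  n := 0;
  for i := 0 to doc.images.length - 1 do
  begin
    img := doc.images.item(i);
    // Climb from the image towards the root until an <a> element is found.
    anc := img.parentElement;
    while not VarIsClear(anc) do
    begin
      tagName := anc.tagName;
      if UpperCase(tagName) = 'A' then
        Break;
      anc := anc.parentElement;
    end;
    // anc is clear here when the image is not inside any link.
    if not VarIsClear(anc) then
    begin
      src := img.src;
      href := anc.href;
      Inc(n);
      Output.Add(Format('[Link%.2d]', [n]));
      Output.Add('image= ' + src);
      Output.Add('Link= ' + href);
    end;
  end;
end;

end.
```

It could be called from the button handler as ListImageLinks(WebBrowser1, Memo1.Lines). The browser may return URLs in normalized form, so the strings are not necessarily identical to the text in the HTML source.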



k4hvd77 (author) commented:
AmirAzhdari,

That's not what I'm looking for!


RE:

What I need is the following:

I have an HTML file:


------------------------------------------------------------------------------------------------------------------------------------------

<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Neue Seite 1</title>
</head>

<body>

<p><a href="http://www.google.de">
<img border="0" src="http://www.google.de/intl/de_de/images/logo.gif" width="800" height="600"></a></p>

</body>

</html>


------------------------------------------------------------------------------------------------------------------------------------------

By clicking on the image I will be redirected to Google. Now I want to extract the image (http://www.google.de/intl/de_de/images/logo.gif) and the link for this image (http://www.google.de), and get output like this:

[Link01]
image= http://www.google.de/intl/de_de/images/logo.gif
Link= http://www.google.de

Mohammed Nasman (Software Developer) commented:

sftweng commented:
Change to "LinkGrabber":
{==============================================================================}
procedure TLGForm.Parse1Click(Sender: TObject);
VAR
  il, ic, lc, lb, rb, linkCount : INTEGER;
  s, t, d : String;
  collecting : BOOLEAN;
  parentNode : TTreeNode;
  currNode : TTreeNode;
  currText, hrefText, srcText : String;
  deltaTime : INTEGER;
  lineCount, lineCountTick : INTEGER;
{------------------------------------------------------------------------------}
procedure TestForLink(ts : String; VAR cnt : INTEGER);
var hrefPos : INTEGER;
begin
  currText := UpperCase(ts);
  IF (Pos('<A',currText) = 1) THEN
  BEGIN
    hrefPos := Pos('HREF="',currText);
    IF (hrefPos > 0) THEN
    BEGIN
      Delete(ts,1,hrefPos+5);
      hrefPos := Pos('"',ts);
      currText := UpperCase(ts);
      IF (hrefPos > 0)
      {AND (Pos('.JPG',currText) = Length(currText)-5)} THEN
      BEGIN
        hrefText := Copy(ts,1,hrefPos-1);
        ListBoxLinks.Items.Add('Link='+hrefText);
        INC(cnt);
      END {IF};
    END {IF};
  END {IF};
end {TestForLink};
{------------------------------------------------------------------------------}
procedure TestForImage(ts : String; VAR cnt : INTEGER);
var srcPos : INTEGER;
begin
  currText := UpperCase(ts);
  IF (Pos('<IMG',currText) = 1) THEN
  BEGIN
    srcPos := Pos('SRC="',currText);
    IF (srcPos > 0) THEN
    BEGIN
      Delete(ts,1,srcPos+4);
      srcPos := Pos('"',ts);
      currText := UpperCase(ts);
      IF (srcPos > 0)
      {AND (Pos('.JPG',currText) = Length(currText)-5)} THEN
      BEGIN
        srcText := Copy(ts,1,srcPos-1);
        ListBoxLinks.Items.Add('Image='+srcText);
        INC(cnt);
      END {IF};
    END {IF};
  END {IF};
end {TestForImage};
{------------------------------------------------------------------------------}
begin {Parse1Click}
  startTime := Now;
  rootNode.Text := EditURL.Text;
  parentNode := rootNode;
  ListBoxLinks.Items.Clear;
  linkCount := 0;
  WITH MemoRawHTML DO BEGIN
    lc := Lines.Count;
    lineCount := lc;
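    lineCountTick := lineCount DIV 20;
    IF lineCountTick = 0 THEN lineCountTick := 1; // avoid MOD by zero below when the HTML has fewer than 20 lines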
    collecting := FALSE;
    t := ''; d := '';
    TreeViewParsed.Visible := FALSE;
    ListBoxLinks.Visible := FALSE;
    FOR il := 0 TO lc-1
    DO BEGIN
      s := Lines[il];
//      StatusBar.SimpleText := s;
      FOR ic := 1 TO Length(s) DO
      BEGIN
        IF s[ic] = '<'
        THEN BEGIN
          collecting := TRUE;
          WITH TreeViewParsed.Items DO
          IF d <> '' THEN BEGIN
            AddChild(parentNode,d);
            TestForLink(d,linkCount);
            TestForImage(d,linkCount);
          END {IF};
          d := '';
        END {IF};
        IF NOT collecting THEN d := d+s[ic];
        IF collecting  THEN t := t+s[ic];
        IF s[ic] = '>'
        THEN BEGIN
          collecting := FALSE;
          WITH TreeViewParsed.Items DO
          BEGIN
            IF Pos('</',t) = 1
            THEN AddChild(parentNode,t)
            ELSE parentNode := Add(rootNode,t);
            TestForLink(t,linkCount);
            TestForImage(t,linkCount);
          END {WITH };
          t := '';
        END {IF};
        IF (linkCount >= StrToInt(EditLinkLimit.Text)) THEN Break;
      END {FOR};
      IF ((il MOD lineCountTick) = 0) THEN
      BEGIN
        ProgressBar1.Position := (il * 100 DIV lineCount);
      END {IF};
    END {FOR};
    ListBoxLinks.Visible := TRUE;
    TreeViewParsed.Visible := TRUE;
  END {WITH };
  endTime := Now;
  deltaTime := SecondsBetween(startTime,endTime);
  StatusBar.SimpleText := Format('Done parse in %d seconds',[deltaTime]);
  MemoDiag.Lines.Add(StatusBar.SimpleText);
//  TreeView1.FullExpand;
end {Parse1Click};

k4hvd77 (author) commented:
sftweng,

I'm using Delphi 7 and cannot compile the project!
1. I have no FastNet (NMHTTP) components.
2. I don't have the JVCL.

sftweng commented:
I'll have to check carefully but I don't think you need them. Just write a program that puts your HTML into a string, passes it into the procedure (Parse1Click) and stores or uses the results.

Concentrate on the "TestFor" procedures and just pass them HTML strings from whatever source you choose.

sftweng commented:
I don't think "Parse1Click" needs either NMHTTP or JVCL.

k4hvd77 (author) commented:
Sorry, I cannot understand how to get it to work!

Could you send me the project to admin@titaniumserver.de?


thanks
k4hvd77

sftweng commented:
k4hvd77, Ex-Ex rules don't allow me to use email. You should have been able to download the original project from the URL I posted earlier and then cut and paste the replacement code from my earlier posting. I'd like to help you on this, but I'm prevented by Ex-Ex rules from using email correspondence.

But if you don't have the NMHTTP and JVCL components, anyway, you should just take the source code for "Parse1Click", written above, and remove all of the component references, e.g., an edit box, treenode and listbox, and replace them with string equivalents.

The core of the code is the "TestFor" procedures - just feed them HTML lines, acquired from whatever source you like, and feed the results (added via Listbox.Add) back to the client (caller) software.
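
As a rough illustration of that suggestion, here is a component-free sketch that takes the HTML as a string and returns the results in a TStrings. It is not the LinkGrabber code: the names AttrValue and ExtractImageLinks are hypothetical, only double-quoted attributes are recognized, and it remembers the href of the currently open <a> so that each <img> inside it can be paired with its link.

```pascal
program ImageLinks;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Returns the value of a double-quoted attribute (e.g. href="...") from one
// tag, or '' if it is absent. The attribute name is matched case-insensitively.
function AttrValue(const Tag, AttrName: string): string;
var
  p: Integer;
  rest: string;
begin
  Result := '';
  p := Pos(UpperCase(AttrName) + '="', UpperCase(Tag));
  if p = 0 then
    Exit;
  rest := Copy(Tag, p + Length(AttrName) + 2, MaxInt);
  p := Pos('"', rest);
  if p > 0 then
    Result := Copy(rest, 1, p - 1);
end;

// Scans Html tag by tag and appends one [LinkNN] block to Output for every
// <img> found while an <a href="..."> is open.
procedure ExtractImageLinks(const Html: string; Output: TStrings);
var
  i, n: Integer;
  tag, upTag, href: string;
  inTag: Boolean;
begin
  href := '';
  tag := '';
  inTag := False;
  n := 0;
  for i := 1 to Length(Html) do
  begin
    if Html[i] = '<' then
    begin
      inTag := True;
      tag := '';
    end;
    if inTag then
      tag := tag + Html[i];
    if inTag and (Html[i] = '>') then
    begin
      inTag := False;
      upTag := UpperCase(tag);
      if Pos('<A ', upTag) = 1 then
        href := AttrValue(tag, 'href')  // entering a link
      else if Pos('</A>', upTag) = 1 then
        href := ''                      // leaving the link
      else if (Pos('<IMG', upTag) = 1) and (href <> '') then
      begin
        Inc(n);
        Output.Add(Format('[Link%.2d]', [n]));
        Output.Add('image= ' + AttrValue(tag, 'src'));
        Output.Add('Link= ' + href);
      end;
    end;
  end;
end;

var
  lines: TStringList;
begin
  lines := TStringList.Create;
  try
    // Sample input taken from the question.
    ExtractImageLinks(
      '<p><a href="http://www.google.de">'#13#10 +
      '<img border="0" src="http://www.google.de/intl/de_de/images/logo.gif" ' +
      'width="800" height="600"></a></p>', lines);
    Write(lines.Text);
  finally
    lines.Free;
  end;
end.
```

Because the scan works on the whole string rather than line by line, a tag that is split across lines is still seen as one tag. Images that are not inside a link are skipped here; emitting them with an empty Link= line would be a small change.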

sftweng commented:
k4hvd77, when I said "you should have been able to", I meant no criticism and I recognize the fact that we are dealing with different versions of Delphi (6 & 7) and libraries. My intention was to focus on the key software, the "TestFor" procedure, which should be more portable than the rest of the application.

I do recommend, however, that you take a good look at using, at least, the JCL and JVCL components, available from http://www.jedi-delphi.org

Good luck, and do, please, continue to ask questions - I'll be pleased to help.

sftweng commented:
Sorry, that should be http://delphi-jedi.org.

sftweng commented:
I believe my solution met the requirements