Softtech

asked on

HTML parser

I am using Delphi 5.

I have a need that I will describe below, and which I would like to receive advice on the best way to proceed.

When opening/reading an HTML file, I need to obtain a list of all the hyperlinks on the page. I need the actual URL as found in the <a href=...> tag, and also a parallel list containing the text found between the opening <a href> tag and the closing </a> tag.
 
In other words, look at the source code for this page: http://www.lasercuts.info/test.htm
 
What I want from the above page are 2 string lists, (let's call them ParsedURL and ParsedLink), which will contain:
 
ParsedURL[0] = http://www.yahoo.com
ParsedURL[1] = http://www.google.com
ParsedURL[2] = http://www.msn.com
ParsedURL[3] = http://www.cnn.com
 
ParsedLink[0] = Yahoo
ParsedLink[1] = Google
ParsedLink[2] = MSN
ParsedLink[3] = CNN

Any recommendations on how to proceed? I'd prefer not to have to create an HTML parser from scratch.
Softtech

ASKER

I should clarify that the HTML page in question is a local file residing on the hard drive. I am not looking for an HTTP component; I'll handle the fetching/creation of the HTML file myself. I simply need a way to parse an existing local, disk-based HTML file as described above.
Well, it wouldn't be too hard to create a parser just for gathering URLs/links from an HTML page, if that's all you want to gather.
Just open the file, read it in, and grab the URLs/links by searching for "<a href=", etc. A rough sketch of that idea is below.
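For example, something along these lines (ExtractAnchors is just an illustrative name, not an existing routine; it assumes Delphi 5, double-quoted hrefs, and no nested tags inside the anchors, so it is a naive scan rather than a real parser):

 uses
   Classes, SysUtils;

 { Sketch only: loads a local HTML file and collects every
   <a href="..."> URL plus the text up to the matching </a>.
   Single-quoted or unquoted hrefs and tags nested inside the
   anchor text are not handled. }
 procedure ExtractAnchors(const FileName: string; URLs, Links: TStrings);
 const
   Marker = '<a href="';
 var
   Work: string;
   P, Q: Integer;
   SL: TStringList;
 begin
   SL := TStringList.Create;
   try
     SL.LoadFromFile(FileName);
     Work := SL.Text;
   finally
     SL.Free;
   end;
   P := Pos(Marker, LowerCase(Work));
   while P > 0 do
   begin
     Delete(Work, 1, P + Length(Marker) - 1); // cut up to the opening quote
     Q := Pos('"', Work);
     if Q = 0 then Break;
     URLs.Add(Copy(Work, 1, Q - 1));          // the URL itself
     Delete(Work, 1, Q);
     P := Pos('>', Work);                     // end of the <a ...> tag
     Q := Pos('</a>', LowerCase(Work));
     if (P > 0) and (Q > P) then
       Links.Add(Copy(Work, P + 1, Q - P - 1)) // the visible link text
     else
       Links.Add('');
     P := Pos(Marker, LowerCase(Work));
   end;
 end;

Called as ExtractAnchors('c:\test.htm', ParsedURL, ParsedLink), it would fill your two lists in parallel, but again, this is a quick substring scan, not a robust parser.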

I have fooled around with a few third-party parser components like EL HTML, DIHtmlParser, and CoolDevs HTML Tools; all of them can get the job done, but none of them are free, and they are overkill for what you want to do. Just check out Torry; there are some free parsers and code there:

http://www.torry.net/pages.php?id=216&SID=8d6e8bf151adee12b81f4736acf011cf
 


Very easy HTML parser
SOLUTION
ILE
>> very easy if I understand you correctly!!

I'm sure you understood me correctly; however, the code you posted (although appreciated) is not very bullet-proof or reliable. Take a large web page like one at eBay.com, and you will find that your code fails to parse 95% of the <a href> links.

I didn't intend for you to write my code for me.  I was simply hoping someone knew of existing code that does the parsing rather than having to reinvent the wheel.
SOLUTION
Ferruccio Accalai
Well, you don't need to use string lists if you want your program to run fast and accurately.
I suggest changing your approach to HTML parsing: avoid storing intermediate variables as much as possible, and try to process the page only once, not two or more times.

Hope it helped.
ASKER CERTIFIED SOLUTION
Since you are looking for anchors in local files, change this call:

 GetAnchorList(sh.get('http://www.borland.com'),
               Memo1.Lines,
               Memo2.Lines);

to this:

 GetAnchorList(sh.get('file:///myhtmlfile.htm'),
               Memo1.Lines,
               Memo2.Lines);
I personally would use state-table analysis (a character-by-character state machine) to parse the links:

https://www.experts-exchange.com/questions/21192880/Search-through-list-of-HTML-files-to-find-file-names.html

This technique would cope with all eventualities once they had been seen 'in the field'.
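For what it's worth, here is a bare-bones sketch of that state-machine idea (the state names and the ScanAnchors routine are mine for illustration, not code from the linked question): walk the text one character at a time, switching state as tags open and close.

 uses
   Classes, SysUtils;

 type
   TScanState = (ssText, ssTag, ssLinkText);

 { Sketch of a single-pass state-machine scan that collects hrefs and
   link text. Tags nested inside an anchor end the link text early, and
   an unterminated final anchor is dropped; a fuller state table would
   add states for comments, scripts and attribute quoting. }
 procedure ScanAnchors(const Html: string; URLs, Links: TStrings);
 var
   I, P, Q: Integer;
   State: TScanState;
   Tag, Text, S: string;
 begin
   State := ssText;
   Tag := '';
   Text := '';
   for I := 1 to Length(Html) do
     case State of
       ssText:
         if Html[I] = '<' then
         begin
           Tag := '';
           State := ssTag;
         end;
       ssTag:
         if Html[I] = '>' then
         begin
           S := LowerCase(Tag);
           // an anchor tag is 'a' followed by whitespace (or nothing)
           if (S <> '') and (S[1] = 'a') and
              ((Length(S) = 1) or (S[2] in [' ', #9, #10, #13])) then
           begin
             P := Pos('href="', S);
             if P > 0 then
             begin
               Q := Pos('"', Copy(S, P + 6, MaxInt));
               if Q > 0 then
                 URLs.Add(Copy(Tag, P + 6, Q - 1)) // keep original case
               else
                 URLs.Add('');
             end
             else
               URLs.Add('');
             Text := '';
             State := ssLinkText;
           end
           else
             State := ssText;
         end
         else
           Tag := Tag + Html[I];
       ssLinkText:
         if Html[I] = '<' then
         begin
           Links.Add(Text); // next tag assumed to be the closing </a>
           Tag := '';
           State := ssTag;
         end
         else
           Text := Text + Html[I];
     end;
 end;

Feed it the file contents (e.g. a TStringList.Text after LoadFromFile) plus your two lists; each oddity you later meet 'in the field' then becomes one more state or branch rather than another ad-hoc substring search.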