HTML parser
asked by Softtech
I am using Delphi 5.
I have a requirement, described below, and I would like advice on the best way to proceed.
When opening/reading an HTML file, I need to obtain a list of all the hyperlinks on the page. I need the actual URL as found in the <a href=> tag, and also a parallel list containing the text found between the <a href> and the closing </a> tags.
In other words, look at the source code for this page: http://www.lasercuts.info/test.htm
What I want from the above page are two string lists (let's call them ParsedURL and ParsedLink), which will contain:
ParsedURL[0] = http://www.yahoo.com
ParsedURL[1] = http://www.google.com
ParsedURL[2] = http://www.msn.com
ParsedURL[3] = http://www.cnn.com
ParsedLink[0] = Yahoo
ParsedLink[1] = Google
ParsedLink[2] = MSN
ParsedLink[3] = CNN
Any recommendations on how to proceed? I'd prefer not to have to create an HTML parser from scratch.
Well, it wouldn't be too hard to create a parser just for gathering URLs/links from an HTML page, if that's all you want to gather.
Just open the file, read the lines, and grab the URLs/links by searching for "<a href=", etc.
I have fooled around with a few third-party parser components like EL HTML, DIHtmlParser and CoolDev's HTML Tools, and all can get the job done, but none of them are free and they are overkill for what you want to do. Just check out Torry; there are some free parsers and code there:
http://www.torry.net/pages.php?id=216&SID=8d6e8bf151adee12b81f4736acf011cf
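A minimal Delphi sketch of that search-based approach (the procedure name and structure are my own, not from any library): it does a case-insensitive scan for `<a href="` and collects each URL and its anchor text into two TStrings lists. It assumes quoted, single-line anchors of the form `<a href="...">text</a>`, so it will miss plenty of real-world markup.

```pascal
{ Needs Classes (TStrings) and SysUtils (UpperCase) in the uses clause.
  A sketch, not production code: unquoted hrefs, extra attributes before
  href, and anchors split across lines are not handled. }
procedure ExtractAnchors(const Html: string; URLs, Links: TStrings);
var
  S: string;            // upper-cased copy for case-insensitive searching
  I, J, TagEnd: Integer;

  function FindFrom(const Sub: string; Start: Integer): Integer;
  var
    P: Integer;
  begin
    P := Pos(Sub, Copy(S, Start, MaxInt));
    if P = 0 then Result := 0 else Result := Start + P - 1;
  end;

begin
  S := UpperCase(Html);
  I := FindFrom('<A HREF="', 1);
  while I > 0 do
  begin
    Inc(I, Length('<A HREF="'));
    J := FindFrom('"', I);              // closing quote of the URL
    if J = 0 then Break;
    URLs.Add(Copy(Html, I, J - I));
    TagEnd := FindFrom('>', J);         // end of the opening <a ...> tag
    if TagEnd = 0 then Break;
    J := FindFrom('</A>', TagEnd + 1);  // end of the anchor text
    if J = 0 then Break;
    Links.Add(Copy(Html, TagEnd + 1, J - TagEnd - 1));
    I := FindFrom('<A HREF="', J);      // continue after this anchor
  end;
end;
```

You would call it with something like ExtractAnchors(PageSource, ParsedURL, ParsedLink) after loading the file into a string or a TStringList.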
Very easy HTML parser:
ASKER
>> very easy if I understand you correctly!!
I'm sure you understand me correctly; however, the code you posted (although appreciated) is not very bullet-proof or reliable. Take a large web page, like one at eBay.com, and you will find that your code fails to parse 95% of the <a href> links.
I didn't intend for you to write my code for me. I was simply hoping someone knew of existing code that does the parsing, rather than having to reinvent the wheel.
Well, you don't need to use string lists if you want your program to run fast and accurately.
I suggest changing your approach to HTML parsing and avoiding stored variables as much as possible; also, try to process the page only once, not twice or more.
Hope it helped.
ASKER CERTIFIED SOLUTION
Since you are looking for anchors in local files, change this line:
GetAnchorList(sh.get('http://www.borland.com'),
              Memo1.Lines,
              Memo2.Lines);
to this:
GetAnchorList(sh.get('file:///myhtmlfile.htm'),
              Memo1.Lines,
              Memo2.Lines);
I personally would use State Table analysis to parse the links:
https://www.experts-exchange.com/questions/21192880/Search-through-list-of-HTML-files-to-find-file-names.html
This technique would cope with all eventualities once they had been seen 'in the field'.
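For anyone without access to the linked question, here is a rough sketch of what a state-table scanner for this job could look like (the states and names are my own invention, not the code from that thread). It walks the HTML one character at a time and switches state on '<', '>', and '"', which copes with anchors split across lines; the simplifying assumption that the first tag after the anchor text closes it would need refining for nested tags like <b>.

```pascal
{ Needs Classes (TStrings) and SysUtils (UpperCase) in the uses clause. }
procedure ScanAnchors(const Html: string; URLs, Links: TStrings);
type
  TState = (stText, stTag, stQuote, stAnchor);
var
  I: Integer;
  C: Char;
  State: TState;
  Tag, Url, Txt: string;
begin
  State := stText;
  Tag := ''; Url := ''; Txt := '';
  for I := 1 to Length(Html) do
  begin
    C := Html[I];
    case State of
      stText:                            // outside any tag
        if C = '<' then begin State := stTag; Tag := ''; end;
      stTag:                             // inside a tag, collecting its text
        if C = '"' then begin State := stQuote; Url := ''; end
        else if C = '>' then
        begin
          // an opening <a href="..."> switches us into anchor-text mode
          if (Pos('A HREF', UpperCase(Tag)) = 1) and (Url <> '') then
          begin
            Txt := '';
            State := stAnchor;
          end
          else
            State := stText;
        end
        else
          Tag := Tag + C;
      stQuote:                           // inside a quoted attribute value
        if C = '"' then State := stTag else Url := Url + C;
      stAnchor:                          // collecting the visible link text
        if C = '<' then
        begin
          // simplifying assumption: the next tag is the closing </a>
          URLs.Add(Url);
          Links.Add(Txt);
          State := stTag; Tag := '';
        end
        else
          Txt := Txt + C;
    end;
  end;
end;
```

The appeal of the state-table style is exactly what the comment above says: each time a page "in the field" breaks the scanner, you add or refine a state or a transition rather than rewriting string-search logic.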