pede
Extracting links, frames and text from HTML
Hi, I have actually already made a version of this, but I'm not completely satisfied, and I'm thinking of rewriting it from scratch.
Given an HTML file, I need to extract all visible text and all links. I'm using this to do some web crawling.
Anyone know of a good way to do this? I will most likely have to use an HTML parser (I use TLegHTMLParser now), but I have to do a LOT of string parsing myself when using that one. I would like it to be much simpler, like getting all links in a TStringList, or something like that. And that means _qualified_ links, btw: a link like '/main.asp' should become 'www.somewebsite.com/main.asp'. The parser could have a source property or something like that, so it can complete the links.
I hope you understand what I'm looking for ;)
Listening.
ASKER
Hey, stop listening and find me a parser :-p
Anyway, the function CombineURL in unit UrlMon will do the combining (what I call qualifying above), so I just need to get all links (and text, but all parsers do that). I can't imagine there isn't a component for this, but I haven't found one!
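For illustration only, here is a minimal sketch of the interface described above: a parser that is given the page's source URL, collects qualified links (including frame sources) into a list, and gathers the visible text. It is written in Python with the standard library rather than Delphi, and it is not the component recommended in this thread. The class name `LinkAndTextExtractor`, its `source_url`, `links` and `text` members, and the choice of which tags count as links are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextExtractor(HTMLParser):
    """Hypothetical helper: collects qualified links and visible text."""

    # Text inside these elements is not rendered, so it is skipped.
    HIDDEN_TAGS = {"script", "style"}
    # Which attribute carries the URL for each tag we treat as a link.
    URL_ATTRS = {"a": "href", "area": "href", "frame": "src", "iframe": "src"}

    def __init__(self, source_url):
        super().__init__()
        self.source_url = source_url  # base used to complete relative links
        self.links = []               # qualified URLs, in document order
        self.text = []                # visible text fragments
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN_TAGS:
            self._hidden_depth += 1
        wanted = self.URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # urljoin resolves relative links against the source URL
                # and leaves absolute links unchanged.
                self.links.append(urljoin(self.source_url, value))

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._hidden_depth == 0 and data.strip():
            self.text.append(data.strip())
```

Usage would look roughly like this: create `LinkAndTextExtractor("http://www.somewebsite.com/dir/page.html")`, call `feed(html)` with the page contents, then read `links` and `text`. With that source URL, a link written as '/main.asp' ends up in `links` as 'http://www.somewebsite.com/main.asp'. The sketch ignores `<base href>`, comments are dropped by the parser's defaults, and text hidden through CSS is still reported, so a real crawler would need more than this.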
ASKER CERTIFIED SOLUTION
ASKER
Hi Hinnack, that one looks promising - thanks!