pede

Extracting links, frames and text from HTML

Hi, I have actually already made a version of this, but I'm not completely satisfied with it, and I'm thinking of rewriting it from scratch.

Given an HTML file, I need to extract all visible text and all links. I'm using this to do some web crawling.

Does anyone know of a good way to do this? I will most likely have to use an HTML parser (I use TLegHTMLParser now), but I have to do a LOT of string parsing myself with that one. I would like something much simpler, like getting all links in a TStringList. And that means _qualified_ links, by the way: a link like '/main.asp' should become 'www.somewebsite.com/main.asp'. The parser could have a source property or something similar, so it can complete the links.

I hope you understand what I'm looking for ;)
Wim ten Brink

Listening.
pede

ASKER

Hey, stop listening and find me a parser :-p

Anyway, the function CombineURL in unit UrlMon will do the combining (what I call qualifying above), so I just need to get all links (and text, but all parsers do that). I can't imagine there isn't a component for this, but I haven't found one!
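
For illustration, here is a minimal sketch of the behaviour described above: collect every link and frame source, complete each one against the page's own address, and gather the visible text. It is written in Python rather than Delphi purely to keep it short, and it is not the component being asked for. The names LinkAndTextExtractor, source_url and extract are hypothetical, and the http:// scheme on the example address is an assumption, since the address in the question has none.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextExtractor(HTMLParser):
    """Hypothetical helper: collects qualified links, frame sources and visible text."""

    # Attribute that carries the URL, keyed by tag name.
    URL_ATTRS = {"a": "href", "area": "href", "frame": "src", "iframe": "src"}
    # Elements whose text content is never rendered on the page.
    HIDDEN = {"script", "style"}

    def __init__(self, source_url):
        super().__init__()
        self.source_url = source_url  # the "source property" used to complete links
        self.links = []
        self.text = []
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden_depth += 1
        wanted = self.URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # urljoin plays the role of the URL-combining step:
                # '/main.asp' becomes an absolute address on the source host.
                self.links.append(urljoin(self.source_url, value))

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._hidden_depth and data.strip():
            self.text.append(data.strip())


def extract(html, source_url):
    """Return (links, text) for one HTML document."""
    parser = LinkAndTextExtractor(source_url)
    parser.feed(html)
    parser.close()
    return parser.links, parser.text


if __name__ == "__main__":
    page = '<p>Hello <a href="/main.asp">main</a></p><frame src="menu.htm">'
    links, text = extract(page, "http://www.somewebsite.com/dir/index.asp")
    print(links)
    print(text)
```

The point of the design is that the caller only ever sees two flat lists, much like a pair of TStringLists, while all tag and attribute handling stays inside the parser class.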
ASKER CERTIFIED SOLUTION
hinnack

pede

ASKER

Hi Hinnack, that one looks promising - thanks!