What I generally recommend is to run the HTML through tidy (http://tidy.sourceforge.n
The reason is that a lot of the HTML out there on the internet is malformed, so even a specialized parser can easily "misinterpret" it. Tidy is designed specifically to clean up such HTML.
Of course, you can see if a specialized parser like the one evilrix suggested works for you - if so it would be a bit easier to implement ... I've never tried any, so I can't comment on that :)
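To make the "run it through tidy first" step concrete, here is a minimal sketch in Python that pipes markup through the tidy command-line tool before parsing. It assumes a `tidy` executable is on the PATH; the helper names `tidy_command` and `clean_html` are made up for illustration, and the options chosen (quiet mode, XHTML output, UTF-8) are just one plausible setup, not the only way to do it.

```python
import shutil
import subprocess


def tidy_command(executable="tidy"):
    """Build the argument list for a quiet XHTML conversion (hypothetical helper)."""
    return [
        executable,
        "-q",                     # suppress non-essential output
        "-asxhtml",               # emit well-formed XHTML
        "-utf8",                  # treat input and output as UTF-8
        "--show-warnings", "no",  # keep stderr limited to real errors
    ]


def clean_html(raw_html, executable="tidy"):
    """Return tidied markup, or None if tidy is missing or reports errors."""
    if shutil.which(executable) is None:
        return None  # assumption: the caller handles a missing tidy binary
    result = subprocess.run(
        tidy_command(executable),
        input=raw_html.encode("utf-8"),  # tidy reads stdin when no file is given
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 0 on success, 1 if there were warnings, 2 if there were errors
    if result.returncode > 1:
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

The cleaned output can then be handed to whatever parser you prefer; since it is XHTML, even a strict XML parser becomes an option.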

by: evilrix, posted on 2009-07-13 at 23:45:24 (ID: 24846781)
Try El Kabong. It's a very simple (and very forgiving) SAX-style HTML parser.
http://sourceforge.net/projects/ekhtml/
"El-Kabong is a high-speed, forgiving, sax-style HTML parser. Its aim is to provide consumers with a very fast, clean, lightweight library which parses HTML quickly, while forgiving syntactically incorrect tags."
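For anyone unfamiliar with what "SAX-style" means in practice: the parser walks the markup and fires a callback for each start tag, end tag, and run of text, rather than building a tree. The sketch below shows the idea using Python's standard `html.parser` module as a stand-in. It is not El Kabong's API (El Kabong is a separate library that I'm not reproducing here), and the `LinkCollector` class and `collect_links` function are names invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Event-driven handler: collects href values and visible text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

    def handle_data(self, data):
        # Called for text between tags; whitespace-only runs are skipped.
        if data.strip():
            self.text_chunks.append(data.strip())


def collect_links(markup):
    """Feed markup to the parser and return (links, text_chunks)."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.links, parser.text_chunks


# Deliberately sloppy input: unclosed <p>, <a> and <b>, an unquoted
# attribute value, and mismatched tag case.
sloppy = '<p>Hello <a href="one.html">first<p><a href=two.html>second</A> <b>bold'
links, text = collect_links(sloppy)
print(links)
print(text)
```

The point is that the callbacks still fire in document order on broken input, so you can pull out what you need without the markup being well-formed.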