TheRealLoki asked:

HTML to XML with HTML Tidy but cannot parse the XML after

I am trying to extract tags from HTML files, specifically the <a> and <p> elements.
I noticed several similar questions in this group that recommended using HTML Tidy:
http://tidy.sourceforge.net/

However, when I parse the tidied file, all I see is a single NODE_DOCUMENT_TYPE node with the name "html".
FWIW, Internet Explorer displays the XML file fine.
Am I just forgetting to do something really obvious?

I use Delphi, but the code should be fairly similar in other languages:

// Create an MSXML 4.0 DOM document through COM
Work_DOMDocument := nil;
OleCheck(CoCreateInstance(Class_DOMDocument40, nil, CLSCTX_ALL, IXMLDOMDocument, Work_DOMDocument));
// load returns False when the file cannot be read or parsed
if not Work_DOMDocument.load('test.xml') then
  ShowMessage('Error loading DOMDocument')
else
  DisplayXMLStructure(Work_DOMDocument); // simple routine that walks the nodes
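
One thing worth ruling out here (an assumption, not something confirmed in this thread): an MSXML document loads asynchronously by default, so walking the nodes straight after load can show only part of the tree, and the XHTML DOCTYPE that Tidy emits may make the parser try to fetch and validate against the DTD. The sketch below is a minimal, late-bound version of the same load that switches both behaviours off and prints the parser's own error text. The program and procedure names are hypothetical; 'test.xml' is the file name from the question.

```delphi
program LoadTidyXml;
{$APPTYPE CONSOLE}

uses
  SysUtils, ActiveX, ComObj, Variants;

procedure Run;
var
  Doc, Err: OleVariant;
  Loaded: Boolean;
  Reason: string;
  LineNo, Col: Integer;
begin
  // Late-bound MSXML 4.0 document (ProgID of the DOMDocument40 class)
  Doc := CreateOleObject('Msxml2.DOMDocument.4.0');
  Doc.async := False;             // block until the whole file is parsed
  Doc.validateOnParse := False;   // check well-formedness only
  Doc.resolveExternals := False;  // do not fetch the external XHTML DTD
  Loaded := Doc.load('test.xml');
  if Loaded then
    Writeln('Root element: ', string(Doc.documentElement.nodeName))
  else
  begin
    // parseError explains why the load failed
    Err := Doc.parseError;
    Reason := Err.reason;
    LineNo := Err.line;
    Col := Err.linepos;
    Writeln(Format('Parse error at line %d, column %d: %s',
      [LineNo, Col, Reason]));
  end;
end;

begin
  CoInitialize(nil);
  try
    Run;  // the variants are released when Run returns, before CoUninitialize
  finally
    CoUninitialize;
  end;
end.
```

With external resolution turned off, named entities such as &nbsp; are likely to be reported as undefined, so it may help to run Tidy with its -numeric option so that it writes numeric entities instead.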


BobSiemens

Are you using one of these flags? You need to:

-asxml, -asxhtml
(Sorry, probably just one of these is needed: -asxhtml.)

TheRealLoki (ASKER)

Yes, I have tried those.
I am actually using this very page as a test, i.e.
https://www.experts-exchange.com/Programming/Languages/Pascal/Delphi/Q__23490524.html

I can get IE to show the tidied result, but I cannot do it with Delphi code.
ASKER CERTIFIED SOLUTION
Eddie Shipman

TheRealLoki (ASKER)

Sadly, due to other constraints, I need to use MS XML (IXMLDOMDocument).

Eddie Shipman

The UILess Parser uses MSHTML to parse the document. Why on earth do you need to parse HTML files with an MSXML parser? *MOST* HTML is not well formed, and you will get TONS of errors trying to parse it using MSXML, primarily because tons of pages do not even have a DECL.

TheRealLoki (ASKER)

Hi Eddie,
Simple answer: XML is not right for parsing HTML. The points are yours.

Long boring bits: I have tried to convince the client that XML is not the way to go for HTML, and they are begrudgingly accepting that, so I will award you the points as soon as I get a chance to try out the UILess parser.
I hope you don't mind waiting a little bit longer for your points :-)

Eddie Shipman

Nah, but if you'd post a sample and tell me what you want, I can write up a solution for you.
I wrote a couple of "wrapper" functions in UILess to get all anchors and all images, so it won't be difficult.

How are things going with your testing? Do you need more help with the UILess parser?

TheRealLoki (ASKER)

Yes please.
For my testing, I did a "View Page Source" on this question and saved the HTML to a file on my hard drive. I then ran the UILess demo, which shows a long list of "Anchors",
e.g.
file:///S:/contactUs.jsp?department=6#onlineCustomerService
file:///S:/
file:///S:/
file:///S:/findAnswers.jsp
...
...
file:///S:/recordAnswerRating.jsp?qid=23490524&aid=21817495&rel=1&token=08267e1b6e5dd2e2c70b45f7221bd590&redirectURL=%2FWeb_Development%2FWeb_Languages-Standards%2FXML%2FQ_23490524.html
file:///S:/M_1198981.html
file:///S:/temp/HTML%20Tidy/splitPoints.jsp?qid=23490524
file:///S:/temp/HTML%20Tidy/acceptAnswer.jsp?aid=21817495#selectGrade
file:///S:/M_1198981.html
...
...
etc
and a list of 7 images
file:///S:/timer/timer1.gif
file:///S:/timer/timer2.gif
file:///S:/timer/timer3.gif
file:///S:/timer/timer4.gif
file:///S:/timer/timer5.gif
http://metrics.experts-exchange.com/b/ss/eexchangeprod/1/H.7--NS/0
file:///S:/timer/timer6.gif

although I am sure there should be more than that...

What I'm struggling with, though, is what to do once I have determined a "section" of the page that I want,
e.g. your first comment:
*********
EddieShipman:
You really should be using the UILess Parser from the EmbeddedWB package. You can write your own
function to return specific tags.

Get it here: http://www.torry.net/vcl/internet/browsers/EmbeddedWBD2005Version14.61.zip
and read this post: https://www.experts-exchange.com/questions/21254855/HTML-parser.html
*********

How can I isolate it and then get the text and the links inside?


Eddie Shipman

The reason you are getting anchors like this: file:///S:/timer/timer1.gif
is that the page uses relative links and, since you have the file on your hard drive,
the parser turns them into absolute links to your local drive.
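
To illustrate the point about relative links, here is a minimal sketch that drives MSHTML directly rather than through the UILess wrapper (whose API is not shown in this thread). It relies on the getAttribute flag value 2, which asks for the attribute text as written in the source instead of the resolved URL. ParseHTML, Run and the sample markup are hypothetical, and the sketch assumes the document is fully parsed once close returns.

```delphi
program RawAnchors;
{$APPTYPE CONSOLE}

uses
  SysUtils, ActiveX, ComObj, Variants, MSHTML;

// Parses an HTML string with MSHTML without showing a browser window.
function ParseHTML(const HTML: string): IHTMLDocument2;
var
  V: OleVariant;
begin
  Result := CoHTMLDocument.Create as IHTMLDocument2;
  V := VarArrayCreate([0, 0], varVariant);
  V[0] := HTML;
  Result.write(PSafeArray(TVarData(V).VArray));
  Result.close;
end;

procedure Run;
var
  Doc: IHTMLDocument2;
  Anchors: IHTMLElementCollection;
  Elem: IHTMLElement;
  I: Integer;
begin
  // Hypothetical sample markup with relative links
  Doc := ParseHTML(
    '<html><body>' +
    '<a href="findAnswers.jsp">Find answers</a>' +
    '<a href="/M_1198981.html">Member</a>' +
    '</body></html>');
  Anchors := Doc.links;  // all <a href> and <area> elements
  for I := 0 to Anchors.length - 1 do
  begin
    Elem := Anchors.item(I, 0) as IHTMLElement;
    // Flag 2: return the value as written in the source, not resolved
    Writeln(string(Elem.getAttribute('href', 2)));
  end;
end;

begin
  CoInitialize(nil);
  try
    Run;  // interfaces are released when Run returns, before CoUninitialize
  finally
    CoUninitialize;
  end;
end.
```

The raw values can then be resolved against the page's original URL rather than against the local file path.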

I don't understand what you mean by "section" or how you want to parse it.
If I'm not mistaken, the UILess Parser has an OnTag event that you can use to capture any tag you want, such as DIVs. You then work out whether you are in the right "section" and process the anchors there.

Help me understand what it is you are after.

TheRealLoki (ASKER)

"Section" is abstract.
In most cases it will be a <TABLE> </TABLE> block, and in other cases it will be a <P> </P> block.
In this example (this web page) it is the largest <DIV> block that includes the ID "21946033".

Eddie Shipman

Well, that is entirely up to you to figure out. There is essentially no way to "section off" portions of a page and parse only those unless you know exactly what you are looking for.

You still haven't explained exactly what you are looking for.

TheRealLoki (ASKER)

It's going to be an abstract parser where the user makes up some rules (using a GUI) to get a block of text from the web page.
I want to break the page up into hierarchical tags,
e.g.
<head>
    sometext
</head>
<body>
    <div>
        <p>this is some text</p>
    </div>
    <div>
        <p>this is more text</p>
    </div>
</body>

The user will make rules such as "<body> : <div>[2] : <P>", i.e. to get the text "this is more text".
I have the GUI framework working, and it handles XML, CSV, Excel etc. fine. I am just trying to get it to work with HTML now.
My simple test is to get the "text" portion of your first answer on this page,
i.e. "You really should be using the UILess Parser..."

Once I know how to iterate the tags and get the text portion, I will be set and can code the rest myself.
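
A rule like "<body> : <div>[2] : <P>" can be sketched as a walk over direct children in the MSHTML DOM. This is only an illustration under assumptions, not the solution posted in this thread: ParseHTML, NthChild and Run are hypothetical helpers, the rule is hard-coded rather than parsed from a string, and the sample markup is adapted from the example above (with a title element inside head).

```delphi
program PathRule;
{$APPTYPE CONSOLE}

uses
  SysUtils, ActiveX, ComObj, Variants, MSHTML;

// Parses an HTML string with MSHTML without showing a browser window.
function ParseHTML(const HTML: string): IHTMLDocument2;
var
  V: OleVariant;
begin
  Result := CoHTMLDocument.Create as IHTMLDocument2;
  V := VarArrayCreate([0, 0], varVariant);
  V[0] := HTML;
  Result.write(PSafeArray(TVarData(V).VArray));
  Result.close;
end;

// Returns the Nth (1-based) direct child of Parent with the given tag name,
// or nil when there is no such child.
function NthChild(const Parent: IHTMLElement; const Tag: string;
  N: Integer): IHTMLElement;
var
  Kids: IHTMLElementCollection;
  Kid: IHTMLElement;
  I, Seen: Integer;
begin
  Result := nil;
  if Parent = nil then
    Exit;
  Kids := Parent.children as IHTMLElementCollection;
  Seen := 0;
  for I := 0 to Kids.length - 1 do
  begin
    Kid := Kids.item(I, 0) as IHTMLElement;
    if SameText(Kid.tagName, Tag) then
    begin
      Inc(Seen);
      if Seen = N then
      begin
        Result := Kid;
        Exit;
      end;
    end;
  end;
end;

procedure Run;
var
  Doc: IHTMLDocument2;
  Target: IHTMLElement;
begin
  Doc := ParseHTML(
    '<html><head><title>sometext</title></head><body>' +
    '<div><p>this is some text</p></div>' +
    '<div><p>this is more text</p></div>' +
    '</body></html>');
  // Rule "<body> : <div>[2] : <P>": second DIV under body, first P inside it
  Target := NthChild(NthChild(Doc.body, 'DIV', 2), 'P', 1);
  if Target <> nil then
    Writeln(Target.innerText)
  else
    Writeln('No element matches the rule');
end;

begin
  CoInitialize(nil);
  try
    Run;  // interfaces are released when Run returns, before CoUninitialize
  finally
    CoUninitialize;
  end;
end.
```

Links inside the selected element could be collected the same way, by iterating its children (or its all collection) and keeping the elements whose tagName is A.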

Eddie Shipman

Well, the problem is going to take some coding; the UILess Parser isn't really going to do that for you.
However, if the user KNOWS the information in the HTML they are trying to parse, it may be much easier to just use the DOM and get to the elements in question directly.

I will write up a way to do it with MSHTML, using a rule setup to get the text of my first reply on this URL, and post the code later.
SOLUTION