TheRealLoki
asked on
HTML to XML with HTML Tidy but cannot parse the XML after
I am trying to extract tags from HTML files,
specifically the <a> and <p> elements.
I noticed several similar questions in the group, which recommended using HTML Tidy:
http://tidy.sourceforge.net/
However, when I parse the file, all I see is a
NODE_DOCUMENT_TYPE node with the name "html".
FWIW, Internet Explorer shows the XML file fine.
Am I just forgetting to do something really obvious?
I use Delphi, but the code should be fairly similar in other languages:
Work_DOMDocument := nil;
OleCheck(CoCreateInstance(Class_DOMDocument40, nil, CLSCTX_ALL, IXMLDOMDocument, Work_DOMDocument));
if not Work_DOMDocument.load('test.xml') then
  ShowMessage('Error loading DOMDocument')
else
  DisplayXMLStructure(Work_DOMDocument); // simple routine that walks the nodes
(sorry, probably just one of these flags) -asxhtml
ASKER
Yes, I have tried those.
I am actually using this very page as a test, i.e.
https://www.experts-exchange.com/Programming/Languages/Pascal/Delphi/Q__23490524.html
and I can get IE to show the "tidy'd" result, but I cannot do it with Delphi code.
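For readers hitting the same symptom, here is a minimal sketch of settings worth ruling out. It is a guess at the cause, not a confirmed diagnosis of this case. It assumes MSXML 4 is installed under the ProgID 'Msxml2.DOMDocument.4.0', uses late binding, and the procedure name and the namespace prefix "x" are made up for illustration. MSXML loads asynchronously by default and may try to fetch the DTD named in the DOCTYPE, and Tidy's XHTML output normally places every element in the XHTML namespace, so an unprefixed XPath such as //a matches nothing.

```delphi
uses
  ComObj, Variants, Dialogs;

// Hypothetical helper: loads a Tidy-produced XHTML file and lists <a> and <p>.
procedure ListAnchorsAndParagraphs(const FileName: string);
var
  Doc, Nodes: OleVariant;
  Loaded: Boolean;
  I: Integer;
begin
  Doc := CreateOleObject('Msxml2.DOMDocument.4.0'); // assumed ProgID
  Doc.async := False;            // wait for the whole file before walking it
  Doc.validateOnParse := False;  // do not validate against the XHTML DTD
  Doc.resolveExternals := False; // do not fetch the DTD named in the DOCTYPE
  Loaded := Doc.load(FileName);
  if not Loaded then
  begin
    // parseError explains why the load failed instead of a generic message
    ShowMessage('Load failed: ' + VarToStr(Doc.parseError.reason));
    Exit;
  end;
  // Bind an arbitrary prefix to the XHTML namespace so XPath can match
  Doc.setProperty('SelectionLanguage', 'XPath');
  Doc.setProperty('SelectionNamespaces',
    'xmlns:x="http://www.w3.org/1999/xhtml"');
  Nodes := Doc.selectNodes('//x:a | //x:p');
  for I := 0 to Nodes.length - 1 do
    ShowMessage(VarToStr(Nodes.item[I].nodeName) + ': ' +
      VarToStr(Nodes.item[I].text));
end;
```

With external resolution turned off, named entities such as &nbsp; are undefined and can make the load fail, so it may help to have Tidy emit numeric entities (its numeric-entities option).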
ASKER
Sadly, due to other constraints, I need to use MSXML (IXMLDOMDocument).
The UILess Parser uses MSHTML to parse the document. Why on earth do you need to parse HTML files with an MSXML parser? *MOST* HTML is not well formed, and you will get TONS of errors trying to parse it using MSXML, primarily because much of it does not even have a DECL.
ASKER
Hi Eddie,
Simple answer: XML is not right for parsing HTML. Points are yours.
Long boring bits: I have tried to convince the client that XML is not the way to go for HTML, and they are begrudgingly accepting, so I will award you the points as soon as I get a chance to try out the UILess parser.
I hope you don't mind waiting a little bit longer for your points :-)
Nah, but if you'd post a sample and tell me what you want, I can write up a solution for you.
I wrote a couple "wrapper" functions in UILess to get all anchors and all images so it won't be difficult.
How are things going on your testing? Do you need more help with the UILess parser?
ASKER
Yes please.
For my testing, I have just done a "View Page Source" on this question and saved the HTML to a file on my hard drive. I am then running the UILess demo and seeing a long list of "Anchors",
e.g.
file:///S:/contactUs.jsp?department=6#onlineCustomerService
file:///S:/
file:///S:/
file:///S:/findAnswers.jsp
...
...
file:///S:/recordAnswerRating.jsp?qid=23490524&aid=21817495&rel=1&token=08267e1b6e5dd2e2c70b45f7221bd590&redirectURL=%2FWeb_Development%2FWeb_Languages-Standards%2FXML%2FQ_23490524.html
file:///S:/M_1198981.html
file:///S:/temp/HTML%20Tidy/splitPoints.jsp?qid=23490524
file:///S:/temp/HTML%20Tidy/acceptAnswer.jsp?aid=21817495#selectGrade
file:///S:/M_1198981.html
...
...
etc
and a list of 7 images
file:///S:/timer/timer1.gif
file:///S:/timer/timer2.gif
file:///S:/timer/timer3.gif
file:///S:/timer/timer4.gif
file:///S:/timer/timer5.gif
http://metrics.experts-exchange.com/b/ss/eexchangeprod/1/H.7--NS/0
file:///S:/timer/timer6.gif
although I am sure there should be more than that...
What I'm struggling with, though, is what to do once I determine a "section" of the page I want,
e.g. your first comment:
*********
EddieShipman:
You really should be using the UILess Parser from the EmbeddedWB package. You can write your own
function to return specific tags.
Get it here: http://www.torry.net/vcl/internet/browsers/EmbeddedWBD2005Version14.61.zip
and read this post: https://www.experts-exchange.com/questions/21254855/HTML-parser.html
*********
How can I isolate it and then get the text and the links inside?
The reason you are getting anchors like this: file:///S:/timer/timer1.gif
is that the page uses relative links, and since you have the file on your hard drive,
the parser resolves them into absolute links on your local drive.
I don't understand what you mean by "section" and how you want to parse it.
If I'm not mistaken, the UILess Parser has an OnTag event that you can use to capture any tag you want, like DIVs, then you figure out if you are in the right "section" and process the anchors there.
Help me understand what it is you are desiring.
ASKER
"Section" is abstract.
In most cases it will be a <TABLE> </TABLE> block, and in other cases it will be a <P> </P> block.
In this example (this web page) it is the largest <DIV> block that includes the ID "21946033".
Well, that is entirely up to you to figure out. There is essentially no way to "section off" portions of a page and parse only those unless you know exactly what you are looking for.
You still haven't explained exactly what you are looking for.
ASKER
It's going to be an abstract parser where the user will make up some rules (using a GUI) to get a block of text from the web page.
I want to break the page up into hierarchical tags,
e.g.
<head>
sometext
</head>
<body>
<div>
<p>this is some text</p>
</div>
<div>
<p>this is more text</p>
</div>
</body>
The user will make up a rule for what they want to get, e.g. "<body> : <div>[2] : <P>", i.e. the text "this is more text".
I have the GUI framework working fine, and it works with XML, CSV, Excel etc. It's just a matter of getting it to work with HTML now.
My simple test is to try to get the "text" portion of your first answer on this page,
i.e. "You really should be using the UILess Parser..."
Once I know how to iterate the tags and get the text portion, I will be set and can code the rest myself.
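A rule such as "<body> : <div>[2] : <P>" can be read as "take the Nth matching child at each level". Here is a minimal sketch of that walk over an MSHTML element tree, using the interfaces from Delphi's MSHTML unit. The function names NthChild and TextOfSecondDivParagraph are made up for illustration, and the sketch assumes the document is already loaded into an IHTMLDocument2 and that the rule's indexes are 1-based.

```delphi
uses
  SysUtils, MSHTML;

// Hypothetical helper: returns the Index-th (1-based) direct child of Parent
// whose tag name matches TagName, or nil if there is no such child.
function NthChild(const Parent: IHTMLElement; const TagName: string;
  Index: Integer): IHTMLElement;
var
  Kids: IHTMLElementCollection;
  Kid: IHTMLElement;
  I, Seen: Integer;
begin
  Result := nil;
  if Parent = nil then
    Exit;
  Kids := Parent.children as IHTMLElementCollection;
  Seen := 0;
  for I := 0 to Kids.length - 1 do
  begin
    Kid := Kids.item(I, 0) as IHTMLElement;
    // MSHTML reports tag names in upper case, so compare case-insensitively
    if SameText(Kid.tagName, TagName) then
    begin
      Inc(Seen);
      if Seen = Index then
      begin
        Result := Kid;
        Exit;
      end;
    end;
  end;
end;

// Evaluates the example rule "<body> : <div>[2] : <P>" step by step.
function TextOfSecondDivParagraph(const Doc: IHTMLDocument2): string;
var
  E: IHTMLElement;
begin
  Result := '';
  E := NthChild(Doc.body, 'div', 2); // second <div> directly under <body>
  E := NthChild(E, 'p', 1);          // first <p> directly under that <div>
  if E <> nil then
    Result := E.innerText;
end;
```

A general version would split the rule string on ":" and loop over the steps instead of hard-coding them. Note that MSHTML repairs malformed markup while parsing, so the tree it exposes may not match the source text tag for tag.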
Well, the problem is going to take some coding; the UILess Parser isn't really going to do that for you.
However, if the user KNOWS the information in the HTML they are trying to parse, it may be way easier to just use the DOM and get to the elements in question directly.
I will write up a way to do it with MSHTML, using a rule setup to get the text of my first reply on this URL, and post the code later.
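The code that followed is not part of this copy of the thread. As a stand-in, here is a minimal sketch of the "use the DOM directly" idea with MSHTML. It is not the poster's solution. It assumes the wanted block carries an id attribute (the asker mentions "21946033", which may or may not be the exact id), and the names LoadHtml and ExtractSection are invented for illustration.

```delphi
uses
  ActiveX, ComObj, Variants, Classes, MSHTML;

// Parses an HTML string with MSHTML, without a visible browser control.
function LoadHtml(const Html: string): IHTMLDocument2;
var
  V: OleVariant;
begin
  Result := CreateComObject(CLASS_HTMLDocument) as IHTMLDocument2;
  // IHTMLDocument2.write expects a SAFEARRAY of variants
  V := VarArrayCreate([0, 0], varVariant);
  V[0] := Html;
  Result.write(PSafeArray(TVarData(V).VArray));
  Result.close;
end;

// Collects the text and the link targets inside the element with the given id.
procedure ExtractSection(const Doc: IHTMLDocument2; const Id: string;
  out Text: string; Links: TStrings);
var
  Section: IHTMLElement;
  Anchors: IHTMLElementCollection;
  I: Integer;
begin
  Text := '';
  Section := (Doc as IHTMLDocument3).getElementById(Id);
  if Section = nil then
    Exit; // no element with that id in this document
  Text := Section.innerText;
  Anchors := (Section as IHTMLElement2).getElementsByTagName('a');
  for I := 0 to Anchors.length - 1 do
    Links.Add((Anchors.item(I, 0) as IHTMLAnchorElement).href);
end;
```

A call might look like ExtractSection(LoadHtml(PageSource), '21946033', Body, LinkList), where PageSource, Body and LinkList are the caller's own variables. The href property returns resolved URLs, which is why relative links came out as file:/// paths earlier in the thread; reading the attribute with getAttribute('href', 2) is documented to return the value as written in the source, if the original relative form is wanted.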
-asxml, -asxhtml