Solved

HTML to XML with HTML Tidy but cannot parse the XML after

Posted on 2008-06-16
Last Modified: 2013-11-18
I am trying to extract the tags from HTML files,
specifically the <a> and <p> sections.
I noticed several similar questions in the group, which recommended using HTML Tidy:
http://tidy.sourceforge.net/

However, when I parse the file, all I see is
NODE_DOCUMENT_TYPE with the name "html".
FWIW, Internet Explorer shows the XML file fine.
Am I just forgetting to do something really obvious?

I use Delphi, but the code should be fairly similar to other languages

Work_DOMDocument := nil;
OleCheck(CoCreateInstance(Class_DOMDocument40, nil, CLSCTX_ALL, IXMLDOMDocument, Work_DOMDocument));
if not Work_DOMDocument.load('test.xml') then
  ShowMessage('Error loading DOMDocument')
else
  DisplayXMLStructure(Work_DOMDocument); // simple routine that walks the nodes
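
The DisplayXMLStructure routine itself is not shown in the question. For reference, a minimal sketch of a recursive walker follows. The unit name, the procedure name and the output format are hypothetical, and it assumes the stock msxml unit (Winapi.msxml in newer Delphi versions). Because it starts at the document node and recurses into every child list, a DOCTYPE node and the root element both appear as siblings under the document rather than the walk stopping at the first child.

```delphi
unit XmlWalkSketch; // hypothetical unit name, for illustration only

interface

uses
  Classes, msxml;

// Appends one indented line per node to Output, depth first.
procedure CollectNodes(const Node: IXMLDOMNode; Depth: Integer;
  Output: TStrings);

implementation

procedure CollectNodes(const Node: IXMLDOMNode; Depth: Integer;
  Output: TStrings);
var
  i: Integer;
  Kids: IXMLDOMNodeList;
begin
  if not Assigned(Node) then
    Exit;
  // Two spaces of indentation per level, then the node name.
  Output.Add(StringOfChar(' ', Depth * 2) + Node.nodeName);
  Kids := Node.childNodes;
  if Assigned(Kids) then
    for i := 0 to Kids.length - 1 do
      CollectNodes(Kids.item[i], Depth + 1, Output);
end;

end.
```

A call such as CollectNodes(Work_DOMDocument, 0, Memo1.Lines) would fill a memo (Memo1 is assumed here) with the node tree, which makes it easy to see whether anything follows the DOCTYPE node.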

Question by:TheRealLoki
16 Comments
 
LVL 10

Expert Comment

by:BobSiemens
ID: 21817495
Are you using these flags? (You need to.)

-asxml, -asxhtml
0
 
LVL 10

Expert Comment

by:BobSiemens
ID: 21817501
(sorry, probably just one of these flags) -asxhtml
0
 
LVL 17

Author Comment

by:TheRealLoki
ID: 21895668
Yes, I have tried those.
I am actually using this very page as a test, i.e.
http://www.experts-exchange.com/Programming/Languages/Pascal/Delphi/Q__23490524.html

I can get IE to show the "tidy'd" result, but I cannot do it with Delphi code.
0
 
LVL 26

Accepted Solution

by:
EddieShipman earned 500 total points
ID: 21946033
You really should be using the UILess Parser from the EmbeddedWB package. You can write your own
function to return specific tags.

Get it here: http://www.torry.net/vcl/internet/browsers/EmbeddedWBD2005Version14.61.zip
and read this post: http://www.experts-exchange.com/Programming/Languages/Pascal/Delphi/Q_21254855.html
0
 
LVL 17

Author Comment

by:TheRealLoki
ID: 22046968
Sadly, due to other constraints, I need to use MS XML (IXMLDOMDocument).
0
 
LVL 26

Expert Comment

by:EddieShipman
ID: 22047885
The UILess Parser uses MSHTML to parse the document. Why on earth do you need to parse HTML files with an MSXML parser? *MOST* HTML is not well formed, and you will get TONS of errors trying to parse it using MSXML, primarily because many pages do not even have a DECL.
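
For anyone who does have to stay with MSXML, the parse errors mentioned above can at least be surfaced instead of a bare "Error loading" message. The following is a minimal sketch, not a definitive fix: the unit and function names are hypothetical, it assumes MSXML 4 is installed (the 'Msxml2.DOMDocument.4.0' ProgID) and the stock ComObj and msxml units. Turning off validateOnParse and resolveExternals is an assumption made so that the load does not depend on fetching the XHTML DTD. With the DTD skipped, named entities such as &nbsp; may then be reported as undefined, which is the kind of message parseError would show.

```delphi
unit XmlLoadSketch; // hypothetical unit name, for illustration only

interface

uses
  SysUtils, ComObj, msxml;

// Loads FileName into a new DOM document. Returns False and fills
// ErrorText with the parser's reason and position when the load fails.
function TryLoadXml(const FileName: string; out Doc: IXMLDOMDocument;
  out ErrorText: string): Boolean;

implementation

function TryLoadXml(const FileName: string; out Doc: IXMLDOMDocument;
  out ErrorText: string): Boolean;
var
  Err: IXMLDOMParseError;
begin
  ErrorText := '';
  Doc := CreateOleObject('Msxml2.DOMDocument.4.0') as IXMLDOMDocument;
  Doc.async := False;            // load synchronously
  Doc.validateOnParse := False;  // assumption: well-formedness check only
  Doc.resolveExternals := False; // assumption: do not fetch the DTD
  Result := Doc.load(FileName);
  if not Result then
  begin
    Err := Doc.parseError;
    ErrorText := Format('%s (line %d, column %d)',
      [Trim(string(Err.reason)), Err.line, Err.linepos]);
  end;
end;

end.
```

Calling TryLoadXml('test.xml', Doc, Msg) and showing Msg on failure gives the line and column of the first problem in the tidied file.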

0
 
LVL 17

Author Comment

by:TheRealLoki
ID: 22125495
Hi Eddie,
Simple answer: XML is not right for parsing HTML. Points are yours.

Long boring bits: I have tried to convince the client that XML is not the way to go for HTML, and they are begrudgingly accepting, so I will award you the points as soon as I get a chance to try out the UILess parser.
I hope you don't mind waiting a little bit longer for your points :-)

0
 
LVL 26

Expert Comment

by:EddieShipman
ID: 22125814
Nah, but if you'd post a sample and tell me what you want, I can write up a solution for you.
I wrote a couple "wrapper" functions in UILess to get all anchors and all images so it won't be difficult.
0

 
LVL 26

Expert Comment

by:EddieShipman
ID: 22221628
How are things going with your testing? Do you need more help with the UILess parser?
0
 
LVL 17

Author Comment

by:TheRealLoki
ID: 22226746
Yes please.
For my testing, I have just done a "View Page Source" on this question and saved the HTML to a file on my hard drive. I am then running the UILess demo and seeing a long list of "Anchors",
e.g.
file:///S:/contactUs.jsp?department=6#onlineCustomerService
file:///S:/
file:///S:/
file:///S:/findAnswers.jsp
...
...
file:///S:/recordAnswerRating.jsp?qid=23490524&aid=21817495&rel=1&token=08267e1b6e5dd2e2c70b45f7221bd590&redirectURL=%2FWeb_Development%2FWeb_Languages-Standards%2FXML%2FQ_23490524.html
file:///S:/M_1198981.html
file:///S:/temp/HTML%20Tidy/splitPoints.jsp?qid=23490524
file:///S:/temp/HTML%20Tidy/acceptAnswer.jsp?aid=21817495#selectGrade
file:///S:/M_1198981.html
...
...
etc
and a list of 7 images
file:///S:/timer/timer1.gif
file:///S:/timer/timer2.gif
file:///S:/timer/timer3.gif
file:///S:/timer/timer4.gif
file:///S:/timer/timer5.gif
http://metrics.experts-exchange.com/b/ss/eexchangeprod/1/H.7--NS/0
file:///S:/timer/timer6.gif

although I am sure there should be more than that...

What I'm struggling with, though, is what to do once I determine a "section" of the page I want,
e.g. your first comment
*********
EddieShipman:
You really should be using the UILess Parser from the EmbeddedWB package. You can write your own
function to return specific tags.

Get it here: http://www.torry.net/vcl/internet/browsers/EmbeddedWBD2005Version14.61.zip
and read this post: http://www.experts-exchange.com/Programming/Languages/Pascal/Delphi/Q_21254855.html
*********

How can I isolate it and then get the text and the links inside?


0
 
LVL 26

Expert Comment

by:EddieShipman
ID: 22227170
The reason you are getting anchors like this: file:///S:/timer/timer1.gif
is that the page uses relative links and, since you have the file on your hard drive,
the parser resolves them to absolute links on your local drive.

I don't understand what you mean by "section" and how you want to parse it.
If I'm not mistaken, the UILess Parser has an OnTag event that you can use to capture any tag you want, like DIVs. You then figure out whether you are in the right "section" and process the anchors there.

Help me understand what it is you want.
0
 
LVL 17

Author Comment

by:TheRealLoki
ID: 22228036
"section" is abstract
In most cases it will be a <TABLE> </TABLE> block, and in other cases it will be a <P> </P> block.
In this example (this web page) it is the largest <DIV> block that includes the ID "21946033".
0
 
LVL 26

Expert Comment

by:EddieShipman
ID: 22229218
Well, that is entirely up to you to figure out. There is essentially no way to "section off" portions of a page and parse only those unless you know exactly what you are looking for.

You still haven't explained exactly what you are looking for.
0
 
LVL 17

Author Comment

by:TheRealLoki
ID: 22234946
It's going to be an abstract parser where the user will make up some rules (using a GUI) to get a block of text from the web page.
I want to break the page up into hierarchical tags,
e.g.
<head>
    sometext
</head>
<body>
    <div>
        <p>this is some text</p>
    </div>
    <div>
        <p>this is more text</p>
    </div>
</body>

The user will make rules such as "<body> : <div>[2] : <P>", i.e. the text "this is more text".
I have the GUI framework working, and it works fine with XML, CSV, Excel, etc. It's just a matter of getting it to work with HTML now.
My simple test is to try to get the "text" portion of your first answer on this page,
i.e. "You really should be using the UILess Parser..."

Once I know how to iterate the tags and get the text portion, I will be set and can code the rest myself.
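
A rule like "<body> : <div>[2] : <P>" can be resolved one step at a time against the MSHTML element tree. The sketch below is only one possible approach under stated assumptions: the unit and function names are hypothetical, indexes are taken to be 1-based as in the rule above, only direct children are considered at each step, and the document is assumed to be already loaded (for example in a TWebBrowser).

```delphi
unit HtmlPathSketch; // hypothetical unit name, for illustration only

interface

uses
  SysUtils, MSHTML;

// Returns the Index-th (1-based) direct child of Parent whose tag name
// matches TagName, or nil when there is no such child.
function NthChildByTag(const Parent: IHTMLElement; const TagName: string;
  Index: Integer): IHTMLElement;

// Example rule "<body> : <div>[2] : <p>"; returns '' when nothing matches.
function TextOfSecondDivParagraph(const Doc: IHTMLDocument2): string;

implementation

function NthChildByTag(const Parent: IHTMLElement; const TagName: string;
  Index: Integer): IHTMLElement;
var
  Children: IHTMLElementCollection;
  Child: IHTMLElement;
  i, Seen: Integer;
begin
  Result := nil;
  if not Assigned(Parent) then
    Exit;
  Children := Parent.children as IHTMLElementCollection;
  Seen := 0;
  for i := 0 to Children.length - 1 do
  begin
    Child := Children.item(i, 0) as IHTMLElement;
    // MSHTML reports HTML tag names in upper case; SameText ignores case.
    if Assigned(Child) and SameText(Child.tagName, TagName) then
    begin
      Inc(Seen);
      if Seen = Index then
      begin
        Result := Child;
        Exit;
      end;
    end;
  end;
end;

function TextOfSecondDivParagraph(const Doc: IHTMLDocument2): string;
var
  Node: IHTMLElement;
begin
  Result := '';
  Node := NthChildByTag(Doc.body, 'DIV', 2); // second <div> under <body>
  Node := NthChildByTag(Node, 'P', 1);       // first <p> inside that <div>
  if Assigned(Node) then
    Result := Node.innerText;
end;

end.
```

A user-defined rule would then just be a list of (tag, index) pairs applied in order, with a nil result at any step meaning the rule did not match the page.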
0
 
LVL 26

Expert Comment

by:EddieShipman
ID: 22235689
Well, the problem is going to take some coding; the UILess Parser isn't really going to do that for you.
However, if the user KNOWS the information in the HTML they are trying to parse, it may be way easier to just use the DOM and get to the elements in question directly.

I will work out a way to do it with MSHTML, using a rule setup to get the text of my first reply on this URL, and post the code later.
0
 
LVL 26

Assisted Solution

by:EddieShipman
EddieShipman earned 500 total points
ID: 22235970
Whoa... I ran into some EXTREMELY strange things in working on this.

I am loading the page into TWebBrowser from the URL above. The HTML source is not the same unless I log in within TWebBrowser.

So I have included a button to parse AFTER the login. This code gets all the DIVs with my answers.
For some reason that I don't have time to figure out, it pulls each of them twice.
unit Unit1;

interface

uses
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,
  Dialogs, OleCtrls, SHDocVw, MSHTML, StdCtrls;

type
  TForm1 = class(TForm)
    WebBrowser1: TWebBrowser;
    Button1: TButton;
    Button2: TButton;
    procedure WebBrowser1DocumentComplete(Sender: TObject;
      const pDisp: IDispatch; var URL: OleVariant);
    procedure Button1Click(Sender: TObject);
    procedure Button2Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.WebBrowser1DocumentComplete(Sender: TObject;
  const pDisp: IDispatch; var URL: OleVariant);
begin
  // Allow parsing only once a document has finished loading
  Button2.Enabled := True;
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
  WebBrowser1.Navigate('http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/XML/Q_23490524.html');
end;

Const Username = 'EddieShipman';

procedure TForm1.Button2Click(Sender: TObject);
var
  iDoc: IHtmlDocument2;
  i, c, q: integer;
  iColl: IHTMLElementCollection;
  iAnswererChildren: IHTMLElementCollection;
  iAnswererChild: IHTMLElement;
  iAnswerer: IHTMLElement;
  iInfoBody: IHTMLElement;
  iInfoBodyChildren: IHTMLElementCollection;
  iAnswerBodyQuotedsChildren: IHTMLElementCollection;
  iAnswerBodyQuoted: IHTMLElement;
  iABQParent: IHTMLElement;
  iRichText: IHTMLElement;
  Dispatch: IDispatch;
begin
  IDoc := WebBrowser1.Document as IHtmlDocument2;
  if Assigned(IDoc) then
  begin
    // Collect every DIV on the page and look for the "answerer" ones
    iColl := iDoc.all.tags('DIV') as IHTMLElementCollection;
    if Assigned(IColl) then
    begin
      for i := 0 to iColl.length-1 do
      begin
        iAnswerer := iColl.item(i, 0) as IHTMLElement;
        if Assigned(iAnswerer) then
        begin
          if iAnswerer.className = 'answerer' then
          begin
            iAnswererChildren := iAnswerer.children as IHTMLElementCollection;
            if iAnswererChildren.length > 0 then
            begin
              for c := 0 to iAnswererChildren.length-1 do
              begin
                iAnswererChild := iAnswererChildren.item(c, 0) as IHTMLElement;
                if Assigned(iAnswererChild) then
                begin
                  if iAnswererChild.tagName = 'A' then
                  begin
                    if Pos(Username, iAnswererChild.innerText ) > 0 then
                    begin
                      // if this is the one we want the answer from
                      // get the infoColHeader parent so we can get the rich text div
                      iInfoBody := iAnswerer.parentElement;
                      if Assigned(iInfoBody) then
                      begin
                        iInfoBodyChildren := iInfoBody.children as IHTMLElementCollection;
                        // Now get the third child (remember, starts at 0)
                        iAnswerBodyQuoted := iInfoBodyChildren.item(2, 0) as IHTMLElement;
                        // Guard: item() yields nil when there are fewer than three children
                        if Assigned(iAnswerBodyQuoted) then
                        begin
                          iAnswerBodyQuotedsChildren := iAnswerBodyQuoted.children as IHTMLElementCollection;
                          for q := 0 to iAnswerBodyQuotedsChildren.length-1 do
                          begin
                            // Index by q. The original used item(0,0), which re-read the
                            // first child on every pass (a possible cause of duplicate output).
                            iRichText := iAnswerBodyQuotedsChildren.item(q, 0) as IHTMLElement;
                            if Assigned(iRichText) then
                            begin
                              if iRichText.className = 'richText' then
                                ShowMessage(iRichText.innerText);
                            end;
                          end;
                        end;
                      end;
                    end;
                  end;
                end;
              end;
            end;
          end;
        end;
      end;
    end;
  end;
end;

end.
