Solved

HTML to XML with HTML Tidy but cannot parse the XML after

Posted on 2008-06-16
Last Modified: 2013-11-18
I am trying to extract tags from HTML files, specifically the <a> and <p> sections.
I noticed several similar questions in the group, which recommended using HTML Tidy:
http://tidy.sourceforge.net/

However, when I parse the file, all I see is
NODE_DOCUMENT_TYPE with the name "html".
FWIW, Internet Explorer shows the XML file fine.
Am I just forgetting to do something really obvious?

I use Delphi, but the code should be fairly similar in other languages.

Work_DOMDocument := nil;
OleCheck(CoCreateInstance(Class_DOMDocument40, nil, CLSCTX_ALL, IXMLDOMDocument, Work_DOMDocument));
if not Work_DOMDocument.load('test.xml') then
  ShowMessage('Error loading DOMDocument')
else
  DisplayXMLStructure(Work_DOMDocument); // simple routine that walks the nodes
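
One way to narrow down a load like the one above is to ask MSXML why it stopped and to walk the tree from documentElement rather than from the document node. The following is only a sketch under assumptions: it uses late binding through the 'Msxml2.DOMDocument.4.0' ProgID instead of the typed interface, WalkNode is a hypothetical stand-in for DisplayXMLStructure, and turning off validation and external DTD resolution is a guess at what helps with Tidy's XHTML output, not a confirmed fix.

```pascal
program LoadTidyXml;
{$APPTYPE CONSOLE}
uses
  SysUtils, Variants, ActiveX, ComObj;

// Hypothetical walker: prints each node's type and name, indented by depth.
procedure WalkNode(const Node: OleVariant; Depth: Integer);
var
  i, Count: Integer;
  TypeName, NodeName: string;
  Children: OleVariant;
begin
  TypeName := Node.nodeTypeString;
  NodeName := Node.nodeName;
  WriteLn(StringOfChar(' ', Depth * 2), TypeName, ': ', NodeName);
  Children := Node.childNodes;
  Count := Children.length;
  for i := 0 to Count - 1 do
    WalkNode(Children.item[i], Depth + 1);
end;

procedure Run;
var
  Doc, Root: OleVariant;
  Loaded: Boolean;
  Reason: string;
  LineNo, LinePos: Integer;
begin
  Doc := CreateOleObject('Msxml2.DOMDocument.4.0');
  Doc.async := False;            // load synchronously
  Doc.validateOnParse := False;  // assumption: only check well-formedness
  Doc.resolveExternals := False; // assumption: do not fetch the XHTML DTD
  Loaded := Doc.load('test.xml');
  if not Loaded then
  begin
    // parseError reports what the parser rejected and where
    Reason := Doc.parseError.reason;
    LineNo := Doc.parseError.line;
    LinePos := Doc.parseError.linepos;
    WriteLn(Format('Load failed at %d:%d: %s', [LineNo, LinePos, Reason]));
    Exit;
  end;
  Root := Doc.documentElement;
  if VarIsClear(Root) then
    WriteLn('Document has no root element')
  else
    WalkNode(Root, 0);
end;

begin
  CoInitialize(nil);
  try
    Run; // Doc is local to Run, so it is released before CoUninitialize
  finally
    CoUninitialize;
  end;
end.
```

If the load fails on a named entity such as &nbsp; once the DTD is no longer fetched, asking Tidy for numeric entities (its -n option) may avoid that.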

Question by:TheRealLoki
16 Comments
 
LVL 10

Expert Comment

by:BobSiemens
ID: 21817495
Are you using these flags? (You need to.)

-asxml, -asxhtml
0
 
LVL 10

Expert Comment

by:BobSiemens
ID: 21817501
(sorry, probably just one of these flags) -asxhtml
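
For reference, a minimal sketch of running Tidy from Delphi before handing the result to the XML parser. It assumes tidy.exe is on the PATH; RunTidy and the file names are hypothetical. The options used are -asxml (which the thread does not settle on over -asxhtml), -n (numeric entities), -q (quiet) and -o (output file).

```pascal
program TidyToXml;
{$APPTYPE CONSOLE}
uses
  Windows, SysUtils;

// Hypothetical helper: runs tidy.exe, waits for it, and returns its exit code.
function RunTidy(const InFile, OutFile: string): DWORD;
var
  Cmd: string;
  StartInfo: TStartupInfo;
  ProcInfo: TProcessInformation;
begin
  Cmd := Format('tidy.exe -asxml -n -q -o "%s" "%s"', [OutFile, InFile]);
  UniqueString(Cmd); // CreateProcess may modify the command-line buffer
  FillChar(StartInfo, SizeOf(StartInfo), 0);
  StartInfo.cb := SizeOf(StartInfo);
  StartInfo.dwFlags := STARTF_USESHOWWINDOW;
  StartInfo.wShowWindow := SW_HIDE;
  if not CreateProcess(nil, PChar(Cmd), nil, nil, False, 0, nil, nil,
    StartInfo, ProcInfo) then
    RaiseLastOSError;
  try
    WaitForSingleObject(ProcInfo.hProcess, INFINITE);
    GetExitCodeProcess(ProcInfo.hProcess, Result);
  finally
    CloseHandle(ProcInfo.hThread);
    CloseHandle(ProcInfo.hProcess);
  end;
end;

var
  Code: DWORD;
begin
  Code := RunTidy('test.html', 'test.xml');
  // Tidy exits with 0 for success, 1 for warnings and 2 for errors
  WriteLn('tidy exit code: ', Code);
end.
```

An exit code of 2 means Tidy found errors it could not repair, in which case it normally writes no output file, so it is worth checking before calling load.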
0
 
LVL 17

Author Comment

by:TheRealLoki
ID: 21895668
Yes, I have tried those.
I am actually using this very page as a test, i.e.
http://www.experts-exchange.com/Programming/Languages/Pascal/Delphi/Q__23490524.html

and I can get IE to show the "tidy'd" result, but I cannot do it with Delphi code.
0
 
LVL 26

Accepted Solution

by:
EddieShipman earned 2000 total points
ID: 21946033
You really should be using the UILess Parser from the EmbeddedWB package. You can write your own
function to return specific tags.

Get it here: http://www.torry.net/vcl/internet/browsers/EmbeddedWBD2005Version14.61.zip
and read this post: http://www.experts-exchange.com/Programming/Languages/Pascal/Delphi/Q_21254855.html
0
 
LVL 17

Author Comment

by:TheRealLoki
ID: 22046968
Sadly, due to other constraints, I need to use MS XML (IXMLDOMDocument).
0
 
LVL 26

Expert Comment

by:EddieShipman
ID: 22047885
The UILess Parser uses MSHTML to parse the document. Why on earth do you need to parse HTML files with an MSXML parser? *MOST* HTML is not well formed, and you will get TONS of errors trying to parse it using MSXML, primarily because tons of pages do not even have a DECL.

0
 
LVL 17

Author Comment

by:TheRealLoki
ID: 22125495
Hi Eddie,
Simple answer: XML is not right for parsing HTML. Points are yours.

Long boring bits: I have tried to convince the client that XML is not the way to go for HTML, and they are begrudgingly accepting that, so I will award you the points as soon as I get a chance to try out the UILess parser.
I hope you don't mind waiting a little bit longer for your points :-)

0
 
LVL 26

Expert Comment

by:EddieShipman
ID: 22125814
Nah, but if you'd post a sample and tell me what you want, I can write up a solution for you.
I wrote a couple "wrapper" functions in UILess to get all anchors and all images so it won't be difficult.
0
 
LVL 26

Expert Comment

by:EddieShipman
ID: 22221628
How are things going on your testing? Do you need more help with the UILess parser?
0
 
LVL 17

Author Comment

by:TheRealLoki
ID: 22226746
Yes please.
For my testing, I did a "View Page Source" on this question and saved the HTML to a file on my hard drive. I am then running the UILess demo and seeing a long list of "Anchors",
e.g.
file:///S:/contactUs.jsp?department=6#onlineCustomerService
file:///S:/
file:///S:/
file:///S:/findAnswers.jsp
...
...
file:///S:/recordAnswerRating.jsp?qid=23490524&aid=21817495&rel=1&token=08267e1b6e5dd2e2c70b45f7221bd590&redirectURL=%2FWeb_Development%2FWeb_Languages-Standards%2FXML%2FQ_23490524.html
file:///S:/M_1198981.html
file:///S:/temp/HTML%20Tidy/splitPoints.jsp?qid=23490524
file:///S:/temp/HTML%20Tidy/acceptAnswer.jsp?aid=21817495#selectGrade
file:///S:/M_1198981.html
...
...
etc
and a list of 7 images
file:///S:/timer/timer1.gif
file:///S:/timer/timer2.gif
file:///S:/timer/timer3.gif
file:///S:/timer/timer4.gif
file:///S:/timer/timer5.gif
http://metrics.experts-exchange.com/b/ss/eexchangeprod/1/H.7--NS/0
file:///S:/timer/timer6.gif

although I am sure there should be more than that...

What I'm struggling with, though, is what to do once I determine a "section" of the page I want,
e.g. your first comment
*********
EddieShipman:
You really should be using the UILess Parser from the EmbeddedWB package. You can write your own
function to return specific tags.

Get it here: http://www.torry.net/vcl/internet/browsers/EmbeddedWBD2005Version14.61.zip
and read this post: http://www.experts-exchange.com/Programming/Languages/Pascal/Delphi/Q_21254855.html
*********

How can I isolate it and then get the text and the links inside ?


0
 
LVL 26

Expert Comment

by:EddieShipman
ID: 22227170
You are getting anchors like file:///S:/timer/timer1.gif because the page uses relative links. Since the file was loaded from your hard drive, the parser resolves them against that location and turns them into absolute links to your local drive.

I don't understand what you mean by "section" or how you want to parse it.
If I'm not mistaken, the UILess Parser has an OnTag event that you can use to capture any tag you want, such as DIVs; you then figure out whether you are in the right "section" and process the anchors there.

Help me understand what you are after.
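
If the file:///S:/ anchors need to point back at the original site, one option is to swap the local prefix for the site root after parsing. This is a sketch only: RebaseLocalUrl and its parameters are hypothetical names, and the site root is assumed to be http://www.experts-exchange.com/.

```pascal
program RebaseDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper: if Url starts with LocalRoot, swap that prefix
// for SiteRoot; any other URL is returned unchanged.
function RebaseLocalUrl(const Url, LocalRoot, SiteRoot: string): string;
begin
  if SameText(Copy(Url, 1, Length(LocalRoot)), LocalRoot) then
    Result := SiteRoot + Copy(Url, Length(LocalRoot) + 1, MaxInt)
  else
    Result := Url; // already points at a real host, leave untouched
end;

begin
  // A link that was resolved against the local drive
  WriteLn(RebaseLocalUrl('file:///S:/findAnswers.jsp',
    'file:///S:/', 'http://www.experts-exchange.com/'));
  // A link that was absolute in the page and needs no change
  WriteLn(RebaseLocalUrl(
    'http://metrics.experts-exchange.com/b/ss/eexchangeprod/1/H.7--NS/0',
    'file:///S:/', 'http://www.experts-exchange.com/'));
end.
```

Links such as file:///S:/temp/HTML%20Tidy/splitPoints.jsp were resolved against the saved file's own folder, so they need a different prefix and cannot be mapped back reliably this way. Adding a base element with the original URL to the saved file before parsing may be the cleaner route.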
0
 
LVL 17

Author Comment

by:TheRealLoki
ID: 22228036
"section" is abstract
in most cases it will be a <TABLE> </TABLE> block and in other cases it will be a <P> </P> block
in this example (this web page) it is the largest <DIV> block that includes the ID "21946033"
0
 
LVL 26

Expert Comment

by:EddieShipman
ID: 22229218
Well, that is entirely up to you to figure out. There is essentially no way to "section off" portions of a page and parse only those unless you know exactly what you are looking for.

You still haven't explained exactly what you are looking for.
0
 
LVL 17

Author Comment

by:TheRealLoki
ID: 22234946
It's going to be an abstract parser where the user will make up some rules (using a GUI) to get a block of text from the web page.
I want to break the page up into hierarchical tags,
e.g.
<head>
    sometext
</head>
<body>
    <div>
        <p>this is some text</p>
    </div>
    <div>
        <p>this is more text</p>
    </div>
</body>

The user will make rules such as "<body> : <div>[2] : <P>", which in this case selects the text "this is more text".
I have the GUI framework working, and it handles XML, CSV, Excel etc. fine. It's just a matter of getting it to work with HTML now.
My simple test is to try to get the "text" portion of your first answer on this page,
i.e. "You really should be using the UILess Parser..."

Once I know how to iterate the tags and get the text portion, I will be set and can code the rest myself.
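
A rule like "<body> : <div>[2] : <P>" can be resolved with MSHTML by stepping through direct children, one tag at a time. The unit below is a sketch rather than a tested implementation: HtmlPath and NthChildByTag are hypothetical names, the index is treated as 1-based to match the rule syntax above, and it relies only on the children, item, length and tagName members of the MSHTML interfaces.

```pascal
unit HtmlPath;

interface

uses
  SysUtils, MSHTML;

// Returns the Index-th (1-based) direct child of Parent whose tag name
// matches Tag, or nil when there is no such child.
function NthChildByTag(const Parent: IHTMLElement; const Tag: string;
  Index: Integer): IHTMLElement;

implementation

function NthChildByTag(const Parent: IHTMLElement; const Tag: string;
  Index: Integer): IHTMLElement;
var
  Kids: IHTMLElementCollection;
  Child: IHTMLElement;
  i, Seen: Integer;
begin
  Result := nil;
  if not Assigned(Parent) then
    Exit; // lets calls be chained without checking each step
  Kids := Parent.children as IHTMLElementCollection;
  Seen := 0;
  for i := 0 to Kids.length - 1 do
  begin
    Child := Kids.item(i, 0) as IHTMLElement;
    // SameText ignores case, so 'div' and 'DIV' both match
    if Assigned(Child) and SameText(Child.tagName, Tag) then
    begin
      Inc(Seen);
      if Seen = Index then
      begin
        Result := Child;
        Exit;
      end;
    end;
  end;
end;

end.
```

Given an IHTMLDocument2 named Doc, the rule above would then read NthChildByTag(NthChildByTag(Doc.body, 'DIV', 2), 'P', 1), and innerText on a non-nil result gives the text portion.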
0
 
LVL 26

Expert Comment

by:EddieShipman
ID: 22235689
Well, the problem is going to take some coding; the UILess Parser isn't really going to do that for you.
However, if the user KNOWS the information in the HTML they are trying to parse, it may be way easier to just use the DOM and get to the elements in question directly.

I will work out a way to do it with MSHTML, using a rule setup to get the text of my first reply on this URL, and post the code later.
0
 
LVL 26

Assisted Solution

by:EddieShipman
EddieShipman earned 2000 total points
ID: 22235970
Whoa... I ran into some EXTREMELY strange things while working on this.

I am loading the page into TWebBrowser from the URL above. The HTML source is not the same unless I log in inside TWebBrowser.

So I have included a button to parse AFTER the login. This code gets all the DIVs with my answers.
For some reason that I don't have time to figure out, it pulls each of them twice.

unit Unit1;
 
interface
 
uses
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,
  Dialogs, OleCtrls, SHDocVw, MSHTML, StdCtrls;
 
type
  TForm1 = class(TForm)
    WebBrowser1: TWebBrowser;
    Button1: TButton;
    Button2: TButton;
    procedure WebBrowser1DocumentComplete(Sender: TObject;
      const pDisp: IDispatch; var URL: OleVariant);
    procedure Button1Click(Sender: TObject);
    procedure Button2Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;
 
var
  Form1: TForm1;
 
implementation
 
{$R *.dfm}
 
procedure TForm1.WebBrowser1DocumentComplete(Sender: TObject;
  const pDisp: IDispatch; var URL: OleVariant);
begin
  Button2.Enabled := True;
end;
 
procedure TForm1.Button1Click(Sender: TObject);
begin
  WebBrowser1.Navigate('http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/XML/Q_23490524.html');
end;
 
Const Username = 'EddieShipman';
 
procedure TForm1.Button2Click(Sender: TObject);
var
  iDoc: IHtmlDocument2;
  i, c, q: integer;
  iColl: IHTMLElementCollection;
  iAnswererChildren: IHTMLElementCollection;
  iAnswererChild: IHTMLElement;
  iAnswerer: IHTMLElement;
  iInfoBody: IHTMLElement;
  iInfoBodyChildren: IHTMLElementCollection;
  iAnswerBodyQuotedsChildren: IHTMLElementCollection;
  iAnswerBodyQuoted: IHTMLElement;
  iRichText: IHTMLElement;
begin
  IDoc := WebBrowser1.Document as IHtmlDocument2;
  if Assigned(IDoc) then
  begin
    iColl := iDoc.all.tags('DIV') as IHTMLElementCollection;
    if Assigned(IColl) then
    begin
      for i := 0 to iColl.length-1 do
      begin
        iAnswerer := iColl.item(i, 0) as IHTMLElement;
        if Assigned(iAnswerer) then
        begin
          if iAnswerer.className = 'answerer' then
          begin
            iAnswererChildren := iAnswerer.children as IHTMLElementCollection;
            if iAnswererChildren.length > 0 then
            begin
              for c := 0 to iAnswererChildren.length-1 do
              begin
                iAnswererChild := iAnswererChildren.item(c, 0) as IHTMLElement;
                if Assigned(iAnswererChild) then
                begin
                  if iAnswererChild.tagName = 'A' then
                  begin
                    if Pos(Username, iAnswererChild.innerText ) > 0 then
                    begin
                      // if this is the one we want the answer from
                      // get the infoColHeader parent so we can get the rich text div
                      iInfoBody := iAnswerer.parentElement;
                      if Assigned(iInfoBody) then
                      begin
                        iInfoBodyChildren := iInfoBody.children as IHTMLElementCollection;
                        // Now get the third child (remember, starts at 0)
                        iAnswerBodyQuoted := iInfoBodyChildren.item(2, 0) as IHTMLElement;
                        // Skip blocks that have fewer than three children
                        if not Assigned(iAnswerBodyQuoted) then
                          Continue;
                        iAnswerBodyQuotedsChildren := iAnswerBodyQuoted.children as IHTMLElementCollection;
                        for q := 0 to iAnswerBodyQuotedsChildren.length-1 do
                        begin
                          // Index with q; the original read item(0, 0) on every pass,
                          // which may explain each answer showing up more than once
                          iRichText := iAnswerBodyQuotedsChildren.item(q, 0) as IHTMLElement;
                          if Assigned(iRichText) then
                          begin
                            if iRichText.className = 'richText' then
                              ShowMessage(iRichText.innerText);
                          end;
                        end;
                      end;
                    end;
                  end;
                end;
              end;
            end;
          end;
        end;
      end;
    end;
  end;
end;
 
end.
