Solved

Delphi: How to parse an HTML page with XPath?

Posted on 2013-10-23
Last Modified: 2013-10-30
Hi there!

What I need is a working example that loads an HTML web page (e.g. with TIdHttp) and applies an XPath expression to retrieve a result.

Use for example: http://de.wikipedia.org/wiki/Physik
XPath expression: //*[@id="toc"]/ul/li[2]/ul/li[3]/ul/li[1]/a/span[2]

The expression should return: Mathematische Physik

Regards,
Dirk.
Question by:dirkil2

Author Comment

by:dirkil2
ID: 39594201
Did you try that, especially loading an HTML page into a TXMLDocument?

I tried many things that didn't work. That's why I need a working sample, not just a link to a web page that has something about XML and XPath on it.

For example, this fails:
var
  lXmlDoc: IXMLDOMDocument;
  lPage: String; // Web site content
begin
  lXmlDoc := CoDOMDocument.Create;
  lXmlDoc.load(lPage);
LVL 28

Accepted Solution

by:
Sinisa Vuk earned 2000 total points
ID: 39596816
Sorry, I was not aware of the external link agreement until now. The links I posted were meant as "fast" help, since some users complain about slow answers. I am reluctant to post complete working examples, because the asker tends to copy and paste without understanding what is behind the code.

Still, here is a working example:

uses SysUtils, ActiveX, MSXML2_TLB; // SysUtils provides Trim and Exception
...
function ParseHTM4String(sTextToParse, sXPath: String; var sResponse: String;
  var sError: String): Boolean;
var
  XMLDomDoc: IXMLDOMDocument2;
  XMLDomNode: IXMLDOMNode;
begin
  Result := False;
  sResponse := '';
  sError := '';

  XMLDomDoc := CoDOMDocument30.Create; 
  try
    try
      XMLDomDoc.setProperty('ProhibitDTD', 'False');

      XMLDomDoc.async := False;
      XMLDomDoc.PreserveWhitespace := True;
      XMLDomDoc.ResolveExternals := True;
      XMLDomDoc.ValidateOnParse := False;

      if XMLDomDoc.loadXML(sTextToParse) then
      begin
        XMLDomDoc.setProperty('SelectionLanguage', 'XPath');
        XMLDomDoc.setProperty('SelectionNamespaces', 'xmlns:x=''http://www.w3.org/1999/xhtml''');

        XMLDomNode := XMLDomDoc.selectSingleNode(sXPath);
        if XMLDomNode<>nil then
        begin
          sResponse := XMLDomNode.text;
          Result := True;
        end;
      end
      else
        sError := Trim(XMLDomDoc.parseError.reason);
    except
      on E: Exception do
        sError := E.Message; // report the failure instead of silently swallowing it
    end;
  finally
    XMLDomDoc := nil;
  end;
end;

function GetHTMLInfo(sURI, sXPath: String; var sResponse: String; var sError: String): Boolean;
const
 COMPLETED = 4;
 OK = 200;
var
  XMLHTTPRequest  : IXMLHTTPRequest;
begin
  Result := False;
  sResponse := '';
  sError := '';

  XMLHTTPRequest := CoXMLHTTP.Create;
  try
    XMLHTTPRequest.open('GET', sURI, False, EmptyParam, EmptyParam);
    XMLHTTPRequest.send(EmptyParam);
    if (XMLHTTPRequest.readyState = COMPLETED) and (XMLHTTPRequest.status = OK) then
    begin
      Result := ParseHTM4String(XMLHTTPRequest.responseText, sXPath, sResponse, sError);
    end
    else
      sError := Trim(XMLHTTPRequest.statusText);
  finally
    XMLHTTPRequest := nil;
  end;
end;

initialization
  CoInitialize(nil);

finalization
  CoUninitialize;

The example uses the Windows MSXML2 interface, which you can get here:
MSXML2_TLB.pas
It uses XMLHTTPRequest to fetch the page (instead of TIdHttp) and passes the response, together with the XPath search string, to my ParseHTM4String function.
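
Since the question asked about TIdHttp, the download step could also be done with Indy. The following is only a minimal sketch, assuming Indy 10 is installed; the function name FetchPage is made up for illustration and is not part of the accepted solution. Plain TIdHTTP cannot follow a redirect to https without an SSL IOHandler, so this covers http pages only.

```delphi
program FetchPageDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, IdHTTP;

// Hypothetical helper: downloads sURI and returns the page source in sHTM.
// On failure it returns False and puts the exception message into sError.
function FetchPage(const sURI: String; out sHTM, sError: String): Boolean;
var
  Http: TIdHTTP;
begin
  Result := False;
  sHTM := '';
  sError := '';
  Http := TIdHTTP.Create(nil);
  try
    try
      Http.HandleRedirects := True; // follow plain http redirects
      sHTM := Http.Get(sURI);
      Result := True;
    except
      on E: Exception do
        sError := E.Message; // network errors and HTTP error codes land here
    end;
  finally
    Http.Free;
  end;
end;

var
  sPage, sErr: String;
begin
  if FetchPage('http://de.wikipedia.org/wiki/Physik', sPage, sErr) then
    Writeln('Fetched ', Length(sPage), ' characters')
  else
    Writeln('Download failed: ', sErr);
end.
```

The returned sHTM can then be passed to ParseHTM4String in place of XMLHTTPRequest.responseText.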

Usage:
GetHTMLInfo('http://de.wikipedia.org/wiki/Physik',
    Edit1.Text, sResult, sError);
// sResult now holds the text of the matched node; sError holds the parse or HTTP error, if any


The line
XMLDomDoc.setProperty('SelectionLanguage', 'XPath');

is needed because you will be using XPath queries; MSXML 3.0 otherwise defaults to the older XSLPattern selection language.
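
The SelectionNamespaces property set in ParseHTM4String only has an effect when the XPath expression uses the x: prefix. A minimal sketch of the difference, assuming the same MSXML2_TLB import unit; the function name SelectText and the sample fragment are made up for illustration:

```delphi
program NamespaceXPathDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, ActiveX, MSXML2_TLB;

const
  // Hypothetical XHTML fragment with a default namespace on the root element.
  SAMPLE =
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>' +
    '<div id="toc"><span>Mathematische Physik</span></div>' +
    '</body></html>';

// Returns the text of the first node matching sXPath, or '' if nothing
// matches or the input is not well-formed XML.
function SelectText(const sXml, sXPath: String): String;
var
  Doc: IXMLDOMDocument2;
  Node: IXMLDOMNode;
begin
  Result := '';
  Doc := CoDOMDocument30.Create;
  Doc.async := False;
  if not Doc.loadXML(sXml) then
    Exit;
  Doc.setProperty('SelectionLanguage', 'XPath');
  // Bind the prefix x to the XHTML namespace for use in XPath expressions.
  Doc.setProperty('SelectionNamespaces',
    'xmlns:x=''http://www.w3.org/1999/xhtml''');
  Node := Doc.selectSingleNode(sXPath);
  if Node <> nil then
    Result := Node.text;
end;

begin
  CoInitialize(nil);
  try
    // Unprefixed names do not match elements that live in the XHTML namespace.
    Writeln('[', SelectText(SAMPLE, '//div[@id="toc"]/span'), ']');
    // The same path with the x: prefix does match them.
    Writeln('[', SelectText(SAMPLE, '//x:div[@id="toc"]/x:span'), ']');
  finally
    CoUninitialize;
  end;
end.
```

For pages that declare no namespace on the html element, unprefixed paths such as the one in the question work as they are.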

Author Comment

by:dirkil2
ID: 39597827
@sinisav

Excellent stuff! Thank you very much.

Unfortunately, it does not work with a different web page. Your code gives the error "The character '>' was expected". I checked the web page with an HTML validator and it validates fine. Do you have an idea what is going wrong?

I attached my program so you can run it straight away.

Regards,
Dirk.
XPath.zip
LVL 28

Expert Comment

by:Sinisa Vuk
ID: 39600175
So far I have had no success either.
I removed the DTD check by deleting everything before the <html tag (this also speeds up parsing):
sHTM := XMLHTTPRequest.responseText;
Delete(sHTM, 1, Pos('<html', sHTM)-1);

Result := ParseHTM4String(sHTM, sXPath, sResponse, sError);

Now I get:
End tag 'head' does not match the start tag 'link'.

Maybe something is wrong with this site (the one set in the XPath.zip source)....
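
An error of this kind is what an XML parser reports for markup that is acceptable as HTML but not well-formed XML: HTML allows void elements such as <link> without a closing tag, while loadXML requires every element to be closed. A minimal sketch to reproduce this, assuming the same MSXML2_TLB import unit as the accepted solution; the function name IsWellFormedXml and both sample strings are made up for illustration:

```delphi
program WellFormedDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, ActiveX, MSXML2_TLB;

const
  // Hypothetical samples: the same head section, once with <link/> closed
  // the XML way and once with <link> left open as plain HTML permits.
  CLOSED_LINK =
    '<html><head><link rel="stylesheet" href="a.css"/></head><body/></html>';
  OPEN_LINK =
    '<html><head><link rel="stylesheet" href="a.css"></head><body></body></html>';

// Returns True if MSXML accepts sText as well-formed XML; otherwise
// returns False and puts the parser's explanation into sReason.
function IsWellFormedXml(const sText: String; out sReason: String): Boolean;
var
  Doc: IXMLDOMDocument2;
begin
  sReason := '';
  Doc := CoDOMDocument30.Create;
  Doc.async := False;
  Result := Doc.loadXML(sText);
  if not Result then
    sReason := Trim(Doc.parseError.reason);
end;

var
  sWhy: String;
begin
  CoInitialize(nil);
  try
    if IsWellFormedXml(CLOSED_LINK, sWhy) then
      Writeln('closed <link/>: parsed')
    else
      Writeln('closed <link/>: ', sWhy);

    if IsWellFormedXml(OPEN_LINK, sWhy) then
      Writeln('open <link>: parsed')
    else
      Writeln('open <link>: ', sWhy);
  finally
    CoUninitialize;
  end;
end.
```

A page that passes an HTML validator can therefore still be rejected by loadXML; it would first have to be converted to well-formed XHTML, or parsed with an HTML-aware parser.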

Author Comment

by:dirkil2
ID: 39612779
I have now written my program in C# and it works like a charm. So I suppose there is nothing wrong with the site; rather, there seems to be a bug in the XPath parser.

This is unfortunate for me, since I'd prefer to have this program written in Delphi.

But anyway, this site was not part of my question, and your program solved it. Therefore, thank you very much for your effort.

Author Closing Comment

by:dirkil2
ID: 39612782
Thank you very much.
Question has a verified solution.
