Solved

Delphi: How to parse an HTML page with XPath?

Posted on 2013-10-23
Last Modified: 2013-10-30
Hi there!

What I need is a working example that loads an HTML web page (e.g. with TIdHttp) and applies an XPath expression to retrieve a result.

Use for example: http://de.wikipedia.org/wiki/Physik
XPath expression: //*[@id="toc"]/ul/li[2]/ul/li[3]/ul/li[1]/a/span[2]

The expression should return: Mathematische Physik

Regards,
Dirk.
Question by:dirkil2

Author Comment

by:dirkil2
ID: 39594201
Did you try that? In particular, did you try loading an HTML page into a TXMLDocument?

I tried so many things that didn't work. That's why I need a working sample, not just a link to a web page that has something about XML and XPath on it.

For example, this fails:
var
  lXmlDoc: IXMLDOMDocument;
  lPage: String; // Web site content
begin
  lXmlDoc := CoDOMDocument.Create;
  lXmlDoc.load(lPage);
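
A likely reason the snippet above fails: in MSXML, load expects a URL or file name, whereas loadXML parses a string that already holds the markup. A minimal sketch, assuming the MSXML2_TLB import unit; LoadMarkupFromString is a hypothetical helper name. Note that loadXML still rejects pages that are not well-formed XML.

```delphi
uses ActiveX, MSXML2_TLB;

// Hypothetical helper: returns True if sMarkup parsed as XML.
// sMarkup is the document text itself, not a URL or file name.
function LoadMarkupFromString(const sMarkup: String; out sError: String): Boolean;
var
  lXmlDoc: IXMLDOMDocument;
begin
  lXmlDoc := CoDOMDocument.Create;
  lXmlDoc.async := False;              // parse synchronously
  Result := lXmlDoc.loadXML(sMarkup);  // loadXML takes markup; load takes a URL or file name
  if Result then
    sError := ''
  else
    sError := lXmlDoc.parseError.reason;
end;
```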

Accepted Solution

by:
Sinisa Vuk earned 500 total points
ID: 39596816
Sorry, I wasn't aware of the external link agreement until now. The links I posted here were meant as "fast" help, since some users complain about slowness. I don't like to encourage handing out a working example: the asker will copy and paste it without understanding what is behind it.

Still, I made a working example:

uses SysUtils, Variants, ActiveX, MSXML2_TLB; // SysUtils: Trim; Variants: EmptyParam
...
function ParseHTM4String(sTextToParse, sXPath: String; var sResponse: String;
  var sError: String): Boolean;
var
  XMLDomDoc: IXMLDOMDocument2;
  XMLDomNode: IXMLDOMNode;
begin
  Result := False;
  sResponse := '';
  sError := '';

  XMLDomDoc := CoDOMDocument30.Create; 
  try
    try
      XMLDomDoc.setProperty('ProhibitDTD', 'False');

      XMLDomDoc.async := False;
      XMLDomDoc.PreserveWhitespace := True;
      XMLDomDoc.ResolveExternals := True;
      XMLDomDoc.ValidateOnParse := False;

      if XMLDomDoc.loadXML(sTextToParse) then
      begin
        XMLDomDoc.setProperty('SelectionLanguage', 'XPath');
        XMLDomDoc.setProperty('SelectionNamespaces', 'xmlns:x=''http://www.w3.org/1999/xhtml''');

        XMLDomNode := XMLDomDoc.selectSingleNode(sXPath);
        if XMLDomNode<>nil then
        begin
          sResponse := XMLDomNode.text;
          Result := True;
        end;
      end
      else
        sError := Trim(XMLDomDoc.parseError.reason);
    except
      // Report the failure instead of silently swallowing it
      on E: Exception do
        sError := E.Message;
    end;
  finally
    XMLDomDoc := nil;
  end;
end;

function GetHTMLInfo(sURI, sXPath: String; var sResponse: String; var sError: String): Boolean;
const
 COMPLETED = 4;
 OK = 200;
var
  XMLHTTPRequest  : IXMLHTTPRequest;
begin
  Result := False;
  sResponse := '';
  sError := '';

  XMLHTTPRequest := CoXMLHTTP.Create;
  try
    XMLHTTPRequest.open('GET', sURI, False, EmptyParam, EmptyParam);
    XMLHTTPRequest.send(EmptyParam);
    if (XMLHTTPRequest.readyState = COMPLETED) and (XMLHTTPRequest.status = OK) then
    begin
      Result := ParseHTM4String(XMLHTTPRequest.responseText, sXPath, sResponse, sError);
    end
    else
      sError := Trim(XMLHTTPRequest.statusText);
  finally
    XMLHTTPRequest := nil;
  end;
end;

initialization
  CoInitialize(nil);

finalization
  CoUninitialize;


The example uses the Windows MSXML2 interface, which you can get here:
MSXML2_TLB.pas
It uses XMLHTTPRequest to fetch the HTTP page (instead of TIdHttp) and passes the response, together with the XPath search string, to my ParseHTM4String function.

Usage:
GetHTMLInfo('http://de.wikipedia.org/wiki/Physik',
    Edit1.Text, sResult, sError);
// sResult holds the result


The line
XMLDomDoc.setProperty('SelectionLanguage', 'XPath');
is there because you will be using the XPath search capability; MSXML 3.0 otherwise defaults to the older XSLPattern selection language.
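
One thing to watch with the SelectionNamespaces line in ParseHTM4String: if a page declares the XHTML default namespace on its html element, unprefixed XPath steps such as //p look for elements in no namespace and will not match, so each step needs the x: prefix bound above. A rough sketch, assuming ParseHTM4String is in scope; DemoNamespacePrefix and the cXhtml test string are hypothetical.

```delphi
// Assumes ParseHTM4String from the example above is in scope.
procedure DemoNamespacePrefix;
const
  // Hypothetical XHTML input that declares the XHTML default namespace
  cXhtml = '<html xmlns="http://www.w3.org/1999/xhtml">' +
           '<body><p>Hi</p></body></html>';
var
  sResponse, sError: String;
begin
  // Unprefixed step: looks for <p> in no namespace, so no node should match
  if not ParseHTM4String(cXhtml, '//p', sResponse, sError) then
    Writeln('no match for //p');
  // Prefixed step: x is bound to the XHTML namespace via SelectionNamespaces
  if ParseHTM4String(cXhtml, '//x:p', sResponse, sError) then
    Writeln('//x:p -> ', sResponse);
end;
```

For pages without an xmlns declaration, such as the one in the question, the unprefixed expression is the one to use.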

Author Comment

by:dirkil2
ID: 39597827
@sinisav

Excellent stuff! Thank you very much.

Unfortunately, it does not work with a different web page. Your code gives the error "The character '>' was expected". I checked the web page with an HTML validator and it validates OK. Do you have any idea what is going wrong?

I attached my program so you can run it straight away.

Regards,
Dirk.
XPath.zip

Expert Comment

by:Sinisa Vuk
ID: 39600175
So far I have had no success either.
I removed the DTD check by deleting everything before the <html tag (this also speeds up parsing):
sHTM := XMLHTTPRequest.responseText;
Delete(sHTM, 1, Pos('<html', sHTM)-1);

Result := ParseHTM4String(sHTM, sXPath, sResponse, sError);


Now I get:
End tag 'head' does not match the start tag 'link'.

Maybe something is wrong with this site (the one set in the XPath.zip source)....
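
A likely explanation for this message: MSXML is an XML parser, and HTML pages often leave void elements such as <link> unclosed, which valid HTML allows but well-formed XML does not. The sketch below tries to reproduce the error with two hypothetical test strings; it assumes the same MSXML2_TLB import unit, and TryParse is an illustrative helper, not part of the original program.

```delphi
program VoidTagDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, ActiveX, MSXML2_TLB; // MSXML2_TLB: the import unit from the accepted solution

// Parses sMarkup as XML and prints either 'OK' or the parser's error reason.
procedure TryParse(const sMarkup: String);
var
  Doc: IXMLDOMDocument2;
begin
  Doc := CoDOMDocument30.Create;
  Doc.async := False;
  if Doc.loadXML(sMarkup) then
    Writeln('OK')
  else
    Writeln(Trim(Doc.parseError.reason));
end;

begin
  CoInitialize(nil);
  try
    // HTML-style void element: <link> is never closed, so the XML parser
    // is expected to reject the document when it reaches </head>.
    TryParse('<html><head><link rel="stylesheet" href="a.css"></head><body/></html>');
    // XHTML-style self-closing element: well-formed XML.
    TryParse('<html><head><link rel="stylesheet" href="a.css"/></head><body/></html>');
  finally
    CoUninitialize;
  end;
end.
```

If the first call fails and the second succeeds, the page has to be converted to well-formed XML (or parsed with an HTML-tolerant parser) before XPath can be applied through MSXML.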

Author Comment

by:dirkil2
ID: 39612779
I have now written my program in C#, and it works like a charm. So I suppose there is nothing wrong with the site; rather, there is a bug in the XPath parser.

This is unfortunate for me since I'd prefer to have this program written in Delphi.

But anyway, this site was not part of my question and your program solved it. Therefore, thank you very much for your effort.

Author Closing Comment

by:dirkil2
ID: 39612782
Thank you very much.