Solved

Delphi: How to parse an HTML page with XPath?

Posted on 2013-10-23
Last Modified: 2013-10-30
Hi there!

What I need is a working example that loads an HTML web page (e.g. with TIdHttp) and applies an XPath expression to retrieve a result.

Use for example: http://de.wikipedia.org/wiki/Physik
XPath expression: //*[@id="toc"]/ul/li[2]/ul/li[3]/ul/li[1]/a/span[2]

The expression should return: Mathematische Physik

Regards,
Dirk.
Question by:dirkil2
Author Comment

by:dirkil2
ID: 39594201
Did you try that? In particular, did you load an HTML page into a TXMLDocument?

I tried many things that didn't work. That's why I need a working sample, not just a link to a web page that has something about XML and XPath on it.

For example, this fails:
var
  lXmlDoc: IXMLDOMDocument;
  lPage: String; // Web site content
begin
  lXmlDoc := CoDOMDocument.Create;
  lXmlDoc.load(lPage);
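
One likely reason the snippet above fails is that load() expects a location (file path or URL), while loadXML() parses markup held in a string. A minimal sketch of the difference, assuming MSXML 3.0 is installed and using late binding through ComObj rather than the thread's type library; PageText is a hypothetical stand-in for the downloaded page:

```pascal
program LoadVersusLoadXml;
{$APPTYPE CONSOLE}
uses
  SysUtils, Variants, ActiveX, ComObj;

const
  // Hypothetical well-formed markup standing in for the page content
  PageText = '<html><body><p>Hello</p></body></html>';

var
  Doc: OleVariant;
  Ok: Boolean;
  Info: string;
begin
  CoInitialize(nil);
  try
    Doc := CreateOleObject('Msxml2.DOMDocument.3.0');
    Doc.async := False;

    // load() interprets its argument as a location, not as markup
    Ok := Doc.load(PageText);
    if not Ok then
    begin
      Info := Doc.parseError.reason;
      Writeln('load failed: ', Trim(Info));
    end;

    // loadXML() parses the string itself
    Ok := Doc.loadXML(PageText);
    if Ok then
    begin
      Info := Doc.documentElement.nodeName;
      Writeln('loadXML succeeded, root element: ', Info);
    end;
  finally
    Doc := Unassigned; // release the COM object before CoUninitialize
    CoUninitialize;
  end;
end.
```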
LVL 25

Accepted Solution

by:
Sinisa Vuk earned 500 total points
ID: 39596816
Sorry, I wasn't aware of the external link agreement until now. The links I posted here were meant as "fast" help, since some users complain about slowness. I don't like to encourage handing out a working example, because the asker will just copy and paste it without understanding what is behind it.

Still, here is a working example:

uses ActiveX, MSXML2_TLB;
...
function ParseHTM4String(sTextToParse, sXPath: String; var sResponse: String;
  var sError: String): Boolean;
var
  XMLDomDoc: IXMLDOMDocument2;
  XMLDomNode: IXMLDOMNode;
begin
  Result := False;
  sResponse := '';
  sError := '';

  XMLDomDoc := CoDOMDocument30.Create; 
  try
    try
      XMLDomDoc.setProperty('ProhibitDTD', 'False');

      XMLDomDoc.async := False;
      XMLDomDoc.PreserveWhitespace := True;
      XMLDomDoc.ResolveExternals := True;
      XMLDomDoc.ValidateOnParse := False;

      if XMLDomDoc.loadXML(sTextToParse) then
      begin
        XMLDomDoc.setProperty('SelectionLanguage', 'XPath');
        XMLDomDoc.setProperty('SelectionNamespaces', 'xmlns:x=''http://www.w3.org/1999/xhtml''');

        XMLDomNode := XMLDomDoc.selectSingleNode(sXPath);
        if XMLDomNode<>nil then
        begin
          sResponse := XMLDomNode.text;
          Result := True;
        end;
      end
      else
        sError := Trim(XMLDomDoc.parseError.reason);
    except
      on E: Exception do
        sError := E.Message; // report the exception instead of silently swallowing it
    end;
  finally
    XMLDomDoc := nil;
  end;
end;

function GetHTMLInfo(sURI, sXPath: String; var sResponse: String; var sError: String): Boolean;
const
 COMPLETED = 4;
 OK = 200;
var
  XMLHTTPRequest  : IXMLHTTPRequest;
begin
  Result := False;
  sResponse := '';
  sError := '';

  XMLHTTPRequest := CoXMLHTTP.Create;
  try
    XMLHTTPRequest.open('GET', sURI, False, EmptyParam, EmptyParam);
    XMLHTTPRequest.send(EmptyParam);
    if (XMLHTTPRequest.readyState = COMPLETED) and (XMLHTTPRequest.status = OK) then
    begin
      Result := ParseHTM4String(XMLHTTPRequest.responseText, sXPath, sResponse, sError);
    end
    else
      sError := Trim(XMLHTTPRequest.statusText);
  finally
    XMLHTTPRequest := nil;
  end;
end;

initialization
  CoInitialize(nil);

finalization
  CoUninitialize;



The example uses the Windows MSXML2 interface, which you can get here:
MSXML2_TLB.pas
It uses XMLHTTPRequest to fetch the HTTP page (instead of TIdHttp) and passes the response, together with the XPath search string, to my ParseHTM4String function.

Usage:
GetHTMLInfo('http://de.wikipedia.org/wiki/Physik',
    Edit1.Text, sResult, sError);
// sResult now holds the result



The line
XMLDomDoc.setProperty('SelectionLanguage', 'XPath');
is added because you will use the XPath search capability.
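
The SelectionNamespaces line in the example matters when the page declares the XHTML namespace: in XPath 1.0, unprefixed element names match only elements in no namespace, so the expression then needs the x: prefix on each step. A minimal sketch, assuming MSXML 3.0 is installed and using late binding through ComObj rather than MSXML2_TLB; the inline SampleXhtml document and the shortened XPath are hypothetical:

```pascal
program XPathNamespaceSketch;
{$APPTYPE CONSOLE}
uses
  SysUtils, Variants, ActiveX, ComObj;

const
  // Hypothetical XHTML fragment with a default namespace
  SampleXhtml =
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>' +
    '<div id="toc"><ul><li><a><span>Mathematische Physik</span></a></li></ul></div>' +
    '</body></html>';

var
  Doc, Node: OleVariant;
  Ok: Boolean;
  Text: string;
begin
  CoInitialize(nil);
  try
    Doc := CreateOleObject('Msxml2.DOMDocument.3.0');
    Doc.async := False;
    Ok := Doc.loadXML(SampleXhtml);
    if Ok then
    begin
      Doc.setProperty('SelectionLanguage', 'XPath');
      // Bind the prefix x to the XHTML namespace for use in XPath
      Doc.setProperty('SelectionNamespaces',
        'xmlns:x="http://www.w3.org/1999/xhtml"');

      // Every element step carries the x: prefix; attributes do not
      Node := Doc.selectSingleNode('//x:div[@id="toc"]/x:ul/x:li/x:a/x:span');
      if VarIsClear(Node) then
        Writeln('no match')
      else
      begin
        Text := Node.text;
        Writeln(Text);
      end;
    end
    else
    begin
      Text := Doc.parseError.reason;
      Writeln('parse error: ', Trim(Text));
    end;
  finally
    Node := Unassigned; // release COM references before CoUninitialize
    Doc := Unassigned;
    CoUninitialize;
  end;
end.
```

For a page without an xmlns declaration on its root element, the unprefixed form of the expression applies instead.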

Author Comment

by:dirkil2
ID: 39597827
@sinisav

Excellent stuff! Thank you very much.

Unfortunately, it does not work with a different web page. Your code gives the error "The character '>' was expected". I checked the web page with an HTML validator and it validates OK. Do you have an idea what is going wrong?

I attached my program so you can run it straight away.

Regards,
Dirk.
XPath.zip
LVL 25

Expert Comment

by:Sinisa Vuk
ID: 39600175
So far I have had no success either.
I removed the DTD check by removing the first line of the HTML (this also speeds up parsing):
sHTM := XMLHTTPRequest.responseText;
Delete(sHTM, 1, Pos('<html', sHTM)-1);

Result := ParseHTM4String(sHTM, sXPath, sResponse, sError);



Now I get:
End tag 'head' does not match the start tag 'link'.

Maybe something is wrong with this site (the one set in the XPath.zip source).
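
An "end tag does not match the start tag" message is typical of an XML parser reading HTML in which elements such as link, meta, or br are left unclosed; HTML permits that, XML does not. A minimal sketch that feeds both forms to the parser, assuming MSXML 3.0 is installed and using late binding through ComObj; TryParse and the two markup strings are hypothetical:

```pascal
program WellFormedCheck;
{$APPTYPE CONSOLE}
uses
  SysUtils, Variants, ActiveX, ComObj;

// Parse Markup with the given MSXML document and report the outcome
procedure TryParse(const Doc: OleVariant; const Markup: string);
var
  Ok: Boolean;
  Reason: string;
begin
  Ok := Doc.loadXML(Markup);
  if Ok then
    Writeln('parsed:   ', Markup)
  else
  begin
    Reason := Doc.parseError.reason;
    Writeln('rejected: ', Markup);
    Writeln('  reason: ', Trim(Reason));
  end;
end;

var
  Doc: OleVariant;
begin
  CoInitialize(nil);
  try
    Doc := CreateOleObject('Msxml2.DOMDocument.3.0');
    Doc.async := False;

    // Valid HTML, but not well-formed XML: <link> is never closed
    TryParse(Doc, '<html><head><link rel="stylesheet" href="a.css"></head></html>');

    // Same content with the element self-closed, as XML requires
    TryParse(Doc, '<html><head><link rel="stylesheet" href="a.css"/></head></html>');
  finally
    Doc := Unassigned; // release the COM object before CoUninitialize
    CoUninitialize;
  end;
end.
```

If the first call is rejected and the second is accepted, the page likely needs an HTML-tolerant parser or a cleanup step before MSXML can apply XPath to it.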

Author Comment

by:dirkil2
ID: 39612779
I wrote my program now in C# and it works like a charm. So I suppose there is nothing wrong on the site; there is rather a bug in the XPath parser.

This is unfortunate for me since I'd prefer to have this program written in Delphi.

But anyway, this site was not part of my question and your program solved it. Therefore, thank you very much for your effort.

Author Closing Comment

by:dirkil2
ID: 39612782
Thank you very much.