  • Status: Solved

Delphi: How to parse an HTML page with XPath?

Hi there!

What I need is a working example that loads an HTML web page (e.g. with TIdHttp) and applies an XPath expression to it to retrieve a result.

Use for example: http://de.wikipedia.org/wiki/Physik
XPath expression: //*[@id="toc"]/ul/li[2]/ul/li[3]/ul/li[1]/a/span[2]

The expression should return: Mathematische Physik

Regards,
Dirk.
Asked by: dirkil2
 
dirkil2 (Author) commented:
Did you try that? In particular, did you try loading an HTML page into a TXMLDocument?

I tried many things that didn't work. That's why I need a working sample, not just a link to a web page that has something about XML and XPath on it.

For example that fails:
var
  lXmlDoc: IXMLDOMDocument;
  lPage: String; // Web site content
begin
  lXmlDoc := CoDOMDocument.Create;
  lXmlDoc.load(lPage);
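
One likely reason the snippet above fails: `load` expects a location (URL, file path or stream), whereas `loadXML` parses markup that is passed in as a string. A minimal sketch, assuming `lPage` holds the page source, the markup is well-formed XML, and COM is already initialized; `TryLoadXmlString` is a hypothetical helper name, not part of MSXML:

```delphi
uses SysUtils, ActiveX, MSXML2_TLB;

// Hypothetical helper: parses markup held in a string.
// Assumes CoInitialize has already been called on the calling thread.
function TryLoadXmlString(const aMarkup: WideString; out aError: string): Boolean;
var
  lXmlDoc: IXMLDOMDocument;
begin
  aError := '';
  lXmlDoc := CoDOMDocument.Create;
  lXmlDoc.async := False;
  // loadXML takes the markup itself; load would treat its argument as a location.
  Result := lXmlDoc.loadXML(aMarkup);
  if not Result then
    aError := Trim(lXmlDoc.parseError.reason);
end;
```

Even with `loadXML`, a page that is valid HTML but not well-formed XML will still be rejected, and `aError` then carries the parser's reason.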
 
Sinisa Vuk commented:
Sorry, I was not aware of the external link agreement until now. The links I posted were meant as quick help, since some users complain about slow answers. I also prefer not to hand out a working example: the asker will copy and paste it without understanding what is behind it.

Still, here is a working example:

uses SysUtils, Variants, ActiveX, MSXML2_TLB; // SysUtils for Trim, Variants for EmptyParam
...
function ParseHTM4String(sTextToParse, sXPath: String; var sResponse: String;
  var sError: String): Boolean;
var
  XMLDomDoc: IXMLDOMDocument2;
  XMLDomNode: IXMLDOMNode;
begin
  Result := False;
  sResponse := '';
  sError := '';

  XMLDomDoc := CoDOMDocument30.Create; 
  try
    try
      XMLDomDoc.setProperty('ProhibitDTD', 'False');

      XMLDomDoc.async := False;
      XMLDomDoc.PreserveWhitespace := True;
      XMLDomDoc.ResolveExternals := True;
      XMLDomDoc.ValidateOnParse := False;

      if XMLDomDoc.loadXML(sTextToParse) then
      begin
        XMLDomDoc.setProperty('SelectionLanguage', 'XPath');
        XMLDomDoc.setProperty('SelectionNamespaces', 'xmlns:x=''http://www.w3.org/1999/xhtml''');

        XMLDomNode := XMLDomDoc.selectSingleNode(sXPath);
        if XMLDomNode<>nil then
        begin
          sResponse := XMLDomNode.text;
          Result := True;
        end;
      end
      else
        sError := Trim(XMLDomDoc.parseError.reason);
    except
      on E: Exception do
        sError := E.Message; // report the exception instead of silently swallowing it
    end;
  finally
    XMLDomDoc := nil;
  end;
end;

function GetHTMLInfo(sURI, sXPath: String; var sResponse: String; var sError: String): Boolean;
const
 COMPLETED = 4;
 OK = 200;
var
  XMLHTTPRequest  : IXMLHTTPRequest;
begin
  Result := False;
  sResponse := '';
  sError := '';

  XMLHTTPRequest := CoXMLHTTP.Create;
  try
    XMLHTTPRequest.open('GET', sURI, False, EmptyParam, EmptyParam);
    XMLHTTPRequest.send(EmptyParam);
    if (XMLHTTPRequest.readyState = COMPLETED) and (XMLHTTPRequest.status = OK) then
    begin
      Result := ParseHTM4String(XMLHTTPRequest.responseText, sXPath, sResponse, sError);
    end
    else
      sError := Trim(XMLHTTPRequest.statusText);
  finally
    XMLHTTPRequest := nil;
  end;
end;

initialization
  CoInitialize(nil);

finalization
  CoUninitialize;


The example uses the Windows MSXML2 interface; the import unit is attached here:
MSXML2_TLB.pas
It uses XMLHTTPRequest to fetch the HTTP page (instead of TIdHttp) and passes the response, together with the XPath search string, to my ParseHTM4String function.

Usage:
GetHTMLInfo('http://de.wikipedia.org/wiki/Physik',
    Edit1.Text, sResult, sError);
// sResult now holds the text of the matched node; sError holds the failure reason, if any


The line
XMLDomDoc.setProperty('SelectionLanguage', 'XPath');
is added because the search uses XPath (MSXML 3.0 defaults to the older XSLPattern syntax).
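
A related note on the SelectionNamespaces line in the code above: if a page declares the XHTML default namespace, unprefixed element names in an XPath expression do not match its elements, so the expression has to use the registered prefix (`x:` here). A minimal sketch with a made-up one-paragraph document, assuming COM is already initialized; `FirstParagraphText` and the sample markup are hypothetical:

```delphi
uses ActiveX, MSXML2_TLB;

// Hypothetical demo: returns the text of the first <p> in a tiny XHTML
// sample, or '' if nothing matches.
function FirstParagraphText: string;
const
  cSample =
    '<html xmlns="http://www.w3.org/1999/xhtml">' +
    '<body><p>Mathematische Physik</p></body></html>';
var
  lDoc: IXMLDOMDocument2;
  lNode: IXMLDOMNode;
begin
  Result := '';
  lDoc := CoDOMDocument30.Create;
  lDoc.async := False;
  if lDoc.loadXML(cSample) then
  begin
    lDoc.setProperty('SelectionLanguage', 'XPath');
    lDoc.setProperty('SelectionNamespaces',
      'xmlns:x=''http://www.w3.org/1999/xhtml''');
    // '//p' would select nothing here, because <p> lives in the XHTML
    // namespace; the x: prefix maps to that namespace.
    lNode := lDoc.selectSingleNode('//x:p');
    if lNode <> nil then
      Result := lNode.text;
  end;
end;
```

Wildcard steps such as `//*[@id="toc"]` are not affected, since `*` matches elements in any namespace.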
 
dirkil2 (Author) commented:
@sinisav

Excellent stuff! Thank you very much.

Unfortunately, it does not work with a different web page. Your code gives the error "The character '>' was expected". I checked the web page with an HTML validator and it validates OK. Do you have any idea what is going wrong?

I attached my program so you can run it straight away.

Regards,
Dirk.
XPath.zip
 
Sinisa Vuk commented:
I have had no success so far either.
I removed the DTD check by deleting everything before the <html tag (this also speeds up parsing):
sHTM := XMLHTTPRequest.responseText;
Delete(sHTM, 1, Pos('<html', sHTM)-1);

Result := ParseHTM4String(sHTM, sXPath, sResponse, sError);


Now I get:
End tag 'head' does not match the start tag 'link'.

Maybe something is wrong with this site (the one set in the XPath.zip source).
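
The message points at a `<link>` element that is never closed. HTML allows that, but MSXML is a strict XML parser and expects `<link ... />`. A naive sketch of a pre-processing step, assuming a Delphi version that ships System.RegularExpressions (XE2 or later); `CloseVoidTags` is a hypothetical name and the tag list is only illustrative:

```delphi
uses System.RegularExpressions;

// Hypothetical helper: rewrites <link ...>, <meta ...>, <br> and similar
// tags as self-closing tags so that an XML parser accepts them.
// Tags that already end in "/>" are left untouched.
// Limitation: an attribute value containing '>' will confuse the pattern.
function CloseVoidTags(const aHtml: string): string;
const
  cPattern = '<(link|meta|br|hr|img|input)\b([^>]*?)(?<!/)>';
begin
  Result := TRegEx.Replace(aHtml, cPattern, '<$1$2/>', [roIgnoreCase]);
end;
```

It could be called as `sHTM := CloseVoidTags(sHTM);` before ParseHTM4String. This is not a general fix: real-world HTML can still fail for other reasons, such as undeclared entities like `&nbsp;` or unquoted attribute values.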
 
dirkil2 (Author) commented:
I wrote my program now in C# and it works like a charm. So I suppose there is nothing wrong on the site; there is rather a bug in the XPath parser.

This is unfortunate for me since I'd prefer to have this program written in Delphi.

But anyway, this site was not part of my question, and your program solved it. Therefore, thank you very much for your effort.
 
dirkil2 (Author) commented:
Thank you very much.
Question has a verified solution.