How to extract all the hyperlinks on this web page

mmalik15
On this web page, http://www.scie-socialcareonline.org.uk/topics.asp?guid=64f07a36-85f2-4aac-a862-61b9116190ad, clicking "Expand all" in the list of browse topics reveals the subtopics. How can we extract all the hyperlinks with titles like "adoption", "access to birth records", etc.?

kaufmed
Glanced up at my screen and thought I had coded the Matrix... Turns out, I just fell asleep on the keyboard.
Most Valuable Expert 2011
Top Expert 2015
Commented:
The subsections are displayed by simply changing the display style from none to block, and the links exist in the source HTML (i.e. they are not pulled via AJAX). For this reason you should be able to just select all the links within that section.

If you're still using Html Agility Pack, then you could do:

doc.DocumentNode.SelectNodes("//span[@class='branch']//a[not(starts-with(@href, 'javascript:'))]")
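For context, here is a minimal sketch of how that call might sit in a complete program. The sample markup, the class name TopicLinks, and the Extract helper are all made up for illustration; the real page's markup may differ. It assumes the HtmlAgilityPack NuGet package is referenced. One thing worth guarding against is that SelectNodes returns null, not an empty collection, when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class TopicLinks
{
    // XPath taken from the answer above.
    public const string XPath =
        "//span[@class='branch']//a[not(starts-with(@href, 'javascript:'))]";

    // Hypothetical stand-in for the page markup. The hidden section is
    // still in the source, so it is selectable without clicking "Expand all".
    public const string SampleHtml = @"
<div>
  <a href=""javascript:expandAll()"">Expand all</a>
  <span class=""branch"" style=""display:none"">
    <a href=""javascript:toggle(1)"">+</a>
    <a href=""topics.asp?guid=1"">Adoption</a>
    <a href=""topics.asp?guid=2"">Access to birth records</a>
  </span>
</div>";

    // Returns (title, href) pairs for every matching link.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<KeyValuePair<string, string>>();

        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(XPath);
        if (nodes == null)
            return result;

        foreach (HtmlNode a in nodes)
        {
            // Decode entities such as &amp; in the link text.
            string title = HtmlEntity.DeEntitize(a.InnerText).Trim();
            string href = a.GetAttributeValue("href", "");
            result.Add(new KeyValuePair<string, string>(title, href));
        }
        return result;
    }

    public static void Main()
    {
        foreach (var link in Extract(SampleHtml))
            Console.WriteLine(link.Key + " -> " + link.Value);
    }
}
```

To run it against the live page you would load the document from the URL instead (for example with HtmlWeb.Load) and pass the same XPath.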


Author

Commented:
Many thanks again, kaufmed.

How can I exclude the RSS link in the XPath? Apart from that, it's working fine.

Also, could you kindly recommend an XPath tool for extracting information from the HTML DOM, or tell me the best approach to writing XPath for the HTML DOM?
kaufmed

Commented:
Oh, sorry. I meant to exclude that as well:

doc.DocumentNode.SelectNodes("//span[@class='branch']//a[not(starts-with(@href, 'javascript:')) and not(starts-with(@href, 'rss/'))]")
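If more prefixes turn up later, the predicate can be assembled rather than hand-edited. This is only a sketch: LinkXPath.Build is a hypothetical helper, not part of Html Agility Pack, and it assumes the prefixes contain no single quotes (XPath 1.0 has no escape for a quote inside a quoted literal).

```csharp
using System;
using System.Linq;

public static class LinkXPath
{
    // Hypothetical helper: emits one not(starts-with(@href, '...')) clause
    // per excluded prefix and joins the clauses with "and".
    public static string Build(params string[] excludedPrefixes)
    {
        if (excludedPrefixes.Any(p => p.Contains("'")))
            throw new ArgumentException("Prefixes must not contain single quotes.");

        if (excludedPrefixes.Length == 0)
            return "//span[@class='branch']//a";

        string predicate = string.Join(" and ",
            excludedPrefixes.Select(p => "not(starts-with(@href, '" + p + "'))"));

        return "//span[@class='branch']//a[" + predicate + "]";
    }

    public static void Main()
    {
        // Builds the same expression as the one-liner above.
        Console.WriteLine(Build("javascript:", "rss/"));
    }
}
```

The result of Build("javascript:", "rss/") can be passed straight to doc.DocumentNode.SelectNodes.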


Author

Commented:
Brilliant, kaufmed. It's working perfectly.

I use Altova to test XPath on XML documents, but I wonder if there is a similar tool for testing XPath against the HTML DOM.
kaufmed

Commented:
I don't know of any. Most of the frameworks people use today to build HTML produce well-formed markup (similar to XML), and XHTML in particular is HTML written as XML. As such, you should be able to use Altova on any page whose markup is well-formed XML. Strictly speaking, HTML is not a subset of XML (it grew out of SGML, and HTML5 allows things like unclosed <br> tags and unquoted attributes that an XML parser rejects), so this will not work on every page. Unless you are dealing with someone who hand-codes their web page, you should usually be OK using Altova.
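As a rough way to check whether a page is well-formed enough for XML tooling, you can try the same XPath with the XML classes built into .NET. This is a sketch only: XmlXPathCheck and TryMatch are hypothetical names and both sample strings are invented. Real XHTML usually declares a default namespace, in which case the element names in the XPath would need a prefix and a namespace manager.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XmlXPathCheck
{
    const string XPath =
        "//span[@class='branch']//a[not(starts-with(@href, 'javascript:'))" +
        " and not(starts-with(@href, 'rss/'))]";

    // Returns the matched href values, or null when the markup
    // is not well-formed XML and so cannot be parsed at all.
    public static string[] TryMatch(string markup)
    {
        try
        {
            XDocument doc = XDocument.Parse(markup);
            return doc.XPathSelectElements(XPath)
                      .Select(a => (string)a.Attribute("href"))
                      .ToArray();
        }
        catch (XmlException)
        {
            return null;
        }
    }

    public static void Main()
    {
        // Invented well-formed sample (no namespace declared).
        string wellFormed =
            "<div><span class=\"branch\">" +
            "<a href=\"topics.asp?guid=1\">Adoption</a>" +
            "<a href=\"rss/feed.xml\">RSS</a>" +
            "</span></div>";

        // Invented sloppy sample: unquoted attribute and unclosed <br>.
        string sloppy =
            "<div><span class=branch><a href=\"x\">Adoption</a><br></span></div>";

        string[] hrefs = TryMatch(wellFormed);
        Console.WriteLine(hrefs == null ? "not well-formed" : string.Join(", ", hrefs));
        Console.WriteLine(TryMatch(sloppy) == null ? "not well-formed" : "parsed");
    }
}
```

If the XML parser rejects the page, that is a sign an XML-only tool will struggle with it too, and an HTML-tolerant parser is the safer choice.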
kaufmed

Commented:
P.S.

One of the reasons HTML Agility Pack (HAP) is so popular is that the team sought to make a library that could handle malformed HTML as well as one can. HAP takes some liberties in repairing the source HTML so that you can run XPath against the loaded document.
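A small sketch of what that looks like in practice, using deliberately sloppy markup that I made up (unquoted attributes, unclosed li and span tags). MalformedDemo and Titles are hypothetical names, and the HtmlAgilityPack NuGet package is assumed. HAP records what it had to work around in doc.ParseErrors, and the XPath query still runs.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class MalformedDemo
{
    // Invented markup with unquoted attributes and unclosed <li>/<span> tags.
    public const string SloppyHtml =
        "<ul>" +
        "<li><span class=branch><a href=topics.asp?guid=1>Adoption</a>" +
        "<li><span class=branch><a href='topics.asp?guid=2'>Access to birth records</a>" +
        "</ul>";

    // Loads the markup, reports any parse errors, and returns the link titles.
    public static List<string> Titles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ParseErrors lists the problems HAP noticed while loading.
        foreach (HtmlParseError error in doc.ParseErrors)
            Console.WriteLine(error.Code + " at line " + error.Line + ": " + error.Reason);

        var titles = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//span[@class='branch']//a");
        if (nodes == null) // null means no matches
            return titles;

        foreach (HtmlNode a in nodes)
            titles.Add(a.InnerText.Trim());
        return titles;
    }

    public static void Main()
    {
        foreach (string title in Titles(SloppyHtml))
            Console.WriteLine(title);
    }
}
```

How HAP nests the repaired elements can vary between versions, so it is safer to rely on loose paths such as //span//a than on exact parent/child positions.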

Author

Commented:
Thanks, kaufmed. It's worth having an EE membership because of the presence of people like you!
