  • Status: Solved
  • Priority: Medium
  • Security: Public

Handling relative URIs in rewritten URLs

Hello, I'm writing a simple web spider for whole-web crawling. I'm using the System.Uri class to resolve relative URIs found on web pages. Everything is fine with standard URIs; however, some servers use a URL rewriting scheme which the Uri class fails to interpret. Example:

base url: http://www.wosp.org.pl/fundacja/index.php/11/2
relative url: index.php/11/2/0

All browsers resolve that relative URL to: http://www.wosp.org.pl/fundacja/index.php/11/2/0

But the Uri class resolves it to http://www.wosp.org.pl/fundacja/index.php/11/index.php/11/2/0

Unfortunately, that URL does not generate a 404 error, so I'm getting more and more URLs in my database like http://www.wosp.org.pl/fundacja/index.php/11/index.php/11/2/index.php/11/2/index.php/11/2/

I've also tested the Java URI class, which behaves in the same way.
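
For reference, the Java behaviour can be reproduced with a minimal sketch like the one below. It assumes java.net.URI is the class meant above; the class name RelativeResolveDemo and the helper resolve are made up for illustration only.

```java
import java.net.URI;

// Hypothetical demo class: resolves a relative reference against a base URL
// with java.net.URI, using the two URLs from the example above.
final class RelativeResolveDemo {

    // The last segment of the base path ("2") is dropped before the relative
    // path is appended, which is standard relative-reference resolution.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.wosp.org.pl/fundacja/index.php/11/2";
        String relative = "index.php/11/2/0";
        // prints http://www.wosp.org.pl/fundacja/index.php/11/index.php/11/2/0
        System.out.println(resolve(base, relative));
    }
}
```

The doubled path comes from treating index.php/11/ as a directory: only the final segment of the base path is replaced by the relative reference.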

So is there a bug in the Uri class, or is that relative URL invalid? And if it is invalid, why does it work in all of the browsers?

I would also welcome any suggestions on how to deal with this problem.

Asked by: phervers

SteveH_UK commented:
The issue is not so much with the Uri class as with your spider.

If you look at the source of the page http://www.wosp.org.pl/fundacja/index.php/11/2, you will see that there is a base href tag in the page header:

<base href="http://www.wosp.org.pl/fundacja/">

This tells the browser to resolve relative URLs against the given address instead of the page's own URL. Your spider needs to handle these tags.
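
A minimal sketch of how a spider might honour the tag is shown below, written in Java since the question mentions the Java URI class (the same idea carries over to System.Uri). The class and method names are hypothetical, and the regular expression is a simplifying assumption that only handles a quoted href value; a real crawler would more likely read the tag from a proper HTML parser.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: picks the effective base URL for a page before
// resolving the links found on it.
final class BaseHrefResolver {

    // Assumption: the href value is quoted. Matches <base ... href="...">.
    private static final Pattern BASE_HREF = Pattern.compile(
            "<base\\b[^>]*\\bhref\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the <base href> target if the page declares one,
    // otherwise the URL the page was fetched from.
    static URI effectiveBase(URI pageUri, String html) {
        Matcher m = BASE_HREF.matcher(html);
        if (m.find()) {
            String href = m.group(1).trim();
            if (!href.isEmpty()) {
                // The base href may itself be relative to the page URL.
                return pageUri.resolve(href);
            }
        }
        return pageUri;
    }

    // Resolves a link from the page against the effective base.
    static URI resolveLink(URI pageUri, String html, String link) {
        return effectiveBase(pageUri, html).resolve(link);
    }
}
```

With the page URL and base tag from this thread, resolveLink returns http://www.wosp.org.pl/fundacja/index.php/11/2/0 for the link index.php/11/2/0, which matches what the browsers do. Without a base tag it falls back to the page URL.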

phervers (author) commented:
Great, that solves my problem. Somehow I wasn't aware of the <base> tag. It caused a lot of confusion, so I guess I won't use it in the future either :)

SteveH_UK commented:
Glad to help :)