Regex String Needed to Trap Certain URLs

I am using a program to create for me a sitemap.xml file.  It's awesome at crawling my site, unfortunately I do not want it to index certain pages.  Fortunately, the program allows me to filter out pages using Regular Expressions.  I hope someone could help me write this regex string to plug into the program.  

Following is an example of an aspx page that I do not want indexed.

http://www.companysite.com/ca/anaheim/6008-e.-calle-cedro/4641217/?sorigin=hb

For the record, the URL is a profile page for a Real Estate listing.  We only list properties in California, so /ca/ is considered static text.  
http://www.companysite.com/ca/{city}/{address}/{propertyID}/{variable}.  

Based on the above URL, I do not want to crawl any /ca/{city}/{address} pages.   But I am okay with it crawling other city pages such as /ca/{city}/housingmarkettrends.  

So in laymen terms, below is what I figure is the pattern that I need to trap.  For ease of reading I have broken down each piece of the URL string in its own row below:

{any string of chars, including special chars: hyphens, periods, etc. that ends with a forward slash}
{string of chars that begin with a digit (zero thru nine) and ends with a forward slash}
{string of chars that only contain digits (zero thru nine) and ends with a forward slash}

I look forward to working with someone on this.  Thanks.  

Robert
PAEWINSAsked:
Who is Participating?
 
Terry WoodsIT GuruCommented:
This might do the trick:

http://www.companysite.com/ca/[^?&/]+/\d[^?&/]*/\d+/

Open in new window


Let me know how it goes.
0
 
PAEWINSAuthor Commented:
The regex created appears to work using an Expression Tester (http://www.regular-expressions.info/javascriptexample.html).  

Below are the testing variables.  

URL:   http://www.companysite.com/ca/corona/1518-beacon-ridge-way/4579453/?sorigin=hb 

REGEX:   http://www.companysite.com/ca/[^?&/]+/\d[^?&/]*/\d+/

However, when I supplied it in the program that called for it, it seems to not acknowledge it.  Could it be written for a certain platform.

The program I am using is Gmapper.  It is an XML sitemap generator.  http://www.g-mapper.co.uk/download/index.aspx  

Thank you anyway.  I will need to research this further.
0
 
PAEWINSAuthor Commented:
Terry,

I am trying to troubleshoot my issue and wrap my head around this regex string you provided.  Can you break this down for me?  I provided a URL with 5 forward slashes, and the regex you supplied has 7.  

Are the two forward slashed inside the brackets metacharacters and part of the regex set of commands?  

Why does the expression end with a forward slash?  

/ca/[^?&/]+/\d[^?&/]*/\d+/

Thanks.
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.