Solved

Regex String Needed to Trap Certain URLs

Posted on 2013-02-04
Last Modified: 2013-02-05
I am using a program to create a sitemap.xml file for me. It is great at crawling my site, but I do not want it to index certain pages. Fortunately, the program lets me filter out pages using regular expressions. I hope someone can help me write the regex to plug into the program.

The following is an example of an ASPX page that I do not want indexed.

http://www.companysite.com/ca/anaheim/6008-e.-calle-cedro/4641217/?sorigin=hb

For the record, the URL is a profile page for a real estate listing. We only list properties in California, so /ca/ can be treated as static text. The URLs follow this structure:
http://www.companysite.com/ca/{city}/{address}/{propertyID}/{variable}

Based on the above URL, I do not want to crawl any /ca/{city}/{address} pages, but I am okay with it crawling other city pages such as /ca/{city}/housingmarkettrends.

So, in layman's terms, below is the pattern that I think I need to trap. For ease of reading, I have put each piece of the URL string on its own row:

{any string of chars, including special chars (hyphens, periods, etc.), that ends with a forward slash}
{string of chars that begins with a digit (zero through nine) and ends with a forward slash}
{string of chars that contains only digits (zero through nine) and ends with a forward slash}

I look forward to working with someone on this.  Thanks.  

Robert
Question by:PAEWINS

Accepted Solution

by:
Terry Woods earned 500 total points
ID: 38853070
This might do the trick:

http://www.companysite.com/ca/[^?&/]+/\d[^?&/]*/\d+/


Let me know how it goes.
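A quick way to check the pattern against both kinds of URL is a short Python sketch like the one below. The listing URL is the one from the question; the city-page URL is a hypothetical example built from the /ca/{city}/housingmarkettrends shape, and the helper name is_listing is made up for illustration. It uses Python's re module, which may not behave exactly like the regex engine inside the sitemap program.

```python
import re

# The pattern from the answer above, copied verbatim.
PATTERN = re.compile(r"http://www.companysite.com/ca/[^?&/]+/\d[^?&/]*/\d+/")

# Listing URL taken from the question.
listing = "http://www.companysite.com/ca/anaheim/6008-e.-calle-cedro/4641217/?sorigin=hb"
# Hypothetical city page that should stay in the sitemap.
city_page = "http://www.companysite.com/ca/anaheim/housingmarkettrends"


def is_listing(url: str) -> bool:
    """Return True when the URL contains /ca/{city}/{address}/{propertyID}/."""
    return PATTERN.search(url) is not None


print(is_listing(listing))    # True: the listing would be filtered out
print(is_listing(city_page))  # False: "housingmarkettrends" does not start with a digit
```

The city page is left alone because the segment after the city must start with a digit, which is what separates street addresses from named pages in these examples.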

Author Closing Comment

by:PAEWINS
ID: 38857586
The regex appears to work when tested in an expression tester (http://www.regular-expressions.info/javascriptexample.html).

Below are the test inputs.

URL:   http://www.companysite.com/ca/corona/1518-beacon-ridge-way/4579453/?sorigin=hb 

REGEX:   http://www.companysite.com/ca/[^?&/]+/\d[^?&/]*/\d+/

However, when I supplied it to the program that asked for it, the program did not seem to honor it. Could the regex have been written for a particular platform?

The program I am using is Gmapper.  It is an XML sitemap generator.  http://www.g-mapper.co.uk/download/index.aspx  

Thank you anyway.  I will need to research this further.
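One possible cause, and this is only a guess because nothing in this thread says how Gmapper applies its filters, is that some tools require the expression to match the entire URL instead of a substring. The Python sketch below shows the difference using the test URL above. The whole_url variant (dots escaped, trailing .* added to cover the query string) is a hypothetical adjustment, not something confirmed to work in Gmapper.

```python
import re

url = "http://www.companysite.com/ca/corona/1518-beacon-ridge-way/4579453/?sorigin=hb"

# The pattern exactly as tested above.
original = r"http://www.companysite.com/ca/[^?&/]+/\d[^?&/]*/\d+/"

# Hypothetical variant: literal dots escaped, and .* appended so the
# pattern can also cover anything after the property ID segment.
whole_url = r"http://www\.companysite\.com/ca/[^?&/]+/\d[^?&/]*/\d+/.*"

substring_hit = re.search(original, url) is not None        # match anywhere in the URL
whole_hit = re.fullmatch(original, url) is not None         # must consume the whole URL
whole_hit_variant = re.fullmatch(whole_url, url) is not None

print(substring_hit)      # True
print(whole_hit)          # False: "?sorigin=hb" is left over
print(whole_hit_variant)  # True
```

If the program does whole-string matching, the original pattern would pass a substring tester and still be ignored by the crawler, which would fit the behavior described here.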

Author Comment

by:PAEWINS
ID: 38857678
Terry,

I am trying to troubleshoot my issue and wrap my head around the regex string you provided. Can you break it down for me? I provided a URL with 5 forward slashes, and the regex you supplied has 7.

Are the two forward slashes inside the brackets metacharacters that are part of the regex syntax?

Why does the expression end with a forward slash?  

/ca/[^?&/]+/\d[^?&/]*/\d+/

Thanks.
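The slash count can be checked with a small Python sketch. In standard regex syntax, [^?&/] is a negated character class meaning "any single character except ?, & or /", so a slash inside the brackets is a member of that set and not a path separator. The trailing slash requires the digits-only segment to be closed by a slash, as in the third piece of the pattern described in the question. The annotated layout below uses Python's re.VERBOSE flag purely for readability; the name PATH and the sample path are chosen for illustration.

```python
import re

# The same path fragment, spread out with one comment per piece.
PATH = re.compile(r"""
    /ca/         # literal text /ca/ (two literal slashes)
    [^?&/]+      # city: one or more chars other than ?, & or /
    /            # literal slash
    \d[^?&/]*    # address: a digit, then any chars other than ?, & or /
    /            # literal slash
    \d+          # property ID: digits only
    /            # literal slash closing the property ID segment
""", re.VERBOSE)

fragment = r"/ca/[^?&/]+/\d[^?&/]*/\d+/"
path = "/ca/anaheim/6008-e.-calle-cedro/4641217/"

total_slashes = fragment.count("/")
# Count the slashes that sit inside [...] character classes.
in_brackets = sum(c.count("/") for c in re.findall(r"\[[^\]]*\]", fragment))

print(total_slashes, in_brackets, path.count("/"))  # 7 2 5
print(PATH.fullmatch(path) is not None)             # True
# Without the final slash the pattern does not match at all.
print(PATH.search("/ca/anaheim/6008-e.-calle-cedro/4641217") is not None)  # False
```

Seven slashes in the expression minus the two inside the brackets leaves five literal slashes, the same number as in the path portion of the URL.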