Regular Expression problem - certain values break .NET

I have a regular expression that I am using to validate Urls.  I got this regular expression somewhere when my previous one didn't work.

However, when certain Urls are entered, submitting the page kills the entire web server.  I tested on other sites, like this regular expression tester:
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
and it dies too.

Here is a Url that works:
http://www.amazon.com/Lois-Clark-Adventures-Superman-Complete/dp/B000HWZ4B6/

and a Url that dies:
http://www.amazon.com/Lois-Clark-Adventures-Superman-Complete/dp/B000HWZ4B6/ref=pd_bbs_sr_1?ie=UTF8&s=dvd&qid=1198029833&sr=8-1


Notice the only difference is the ref part at the end of the amazon link.

Here is the regular expression being used in .NET validators:
^(http(s?):\/\/|~/|/)?([a-zA-Z]{1}([\w-]+.)+([\w]{2,5}))(:[\d]{1,5})?/?(\w+.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?$

The website throws a JavaScript error in the web resource file (the .NET-generated file). If you hit "stop script" the JavaScript stops, but the web server keeps maxing out the CPU until the request times out.

The issue is that a user of the site will most likely copy a Url from another location (Amazon being a big place) and we don't want the user to have to know to remove the part after the last slash (where Amazon tracks how you hit that specific product).

Any help fixing this one, or even a new regular expression, would be fine.  It needs to validate any valid Url for websites and images, http or https (ftp, etc. is not needed).

Thanks
mrichmonAsked:

masterpassCommented:
try using this
^(http|https)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*$

mrichmonAuthor Commented:
That one doesn't even match correctly.  If you are going to post an alternate pattern, at least test it against the samples I provided before posting.

Besides getting a working pattern I am also interested in why this one breaks .NET rather than just not matching.
masterpassCommented:
"If you are going to post an alternate pattern at least test it against the samples I provided before posting." ----> I tested it against the sample you gave and it MATCHED !!!

My test bed : http://regexlib.com/RETester.aspx

If I key in your regex, I get no response from the site.

"If you are going to post an alternate pattern at least test it against the samples I provided before posting" ----> asking whether I had tested it would have been better !!!

mrichmonAuthor Commented:
Okay, you tested it on a site that doesn't properly replicate .NET's handling of regular expressions (I have tested there). That is why I posted a test site that replicates how .NET evaluates regular expressions and reproduces the behavior I am getting in my application.

I ran your expression through another tester and you are correct it did work for that specific example on another test site, but not other valid URLs.  I have found many that work for that one case above but die on others.  The above is the most robust I have seen so far, with the exception of the amazon referrer at the end.

Really I would like to fix the above one since it handles everything I have thrown at it so far except this amazon referrer at the end.

I am willing to ignore IP address based Urls for now, but need it to match all standard http and https urls.  

Here is another test that most url regular expressions cannot match:
http://example.com/blah_blah_(wikipedia)_and_more_(parens)_eh
mrichmonAuthor Commented:
Okay I was able to narrow down that the - at the very end of the Amazon url referral part is what is causing the issue - but I don't know why that is crashing this (and some other) regular expressions in .NET as opposed to simply not matching...

So this matches:
http://www.amazon.com/Lois-Clark-Adventures-Superman-Complete/dp/B000HWZ4B6/ref=pd_bbs_sr_1?ie=UTF8&s=dvd&qid=1198029833&sr=8
and so does this:
http://www.amazon.com/Lois-Clark-Adventures-Superman-Complete/dp/B000HWZ4B6/ref=pd_bbs_sr_1?ie=UTF8&s=dvd&qid=1198029833&sr=81

while this kills the server:
http://www.amazon.com/Lois-Clark-Adventures-Superman-Complete/dp/B000HWZ4B6/ref=pd_bbs_sr_1?ie=UTF8&s=dvd&qid=1198029833&sr=8-1
mrichmonAuthor Commented:
I narrowed it down even more. The issue occurs only when the dash is the second-to-last character.
tsellsCommented:
Why don't you try this one....yours seems a bit too rigid.  

(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*


public static Regex regex = new Regex(
      "(?<Protocol>\\w+):\\/\\/(?<Domain>[\\w@][\\w.:@]+)\\/?[\\w\\."+
      "?=%&=\\-@/$,]*",
    RegexOptions.IgnoreCase
    | RegexOptions.IgnorePatternWhitespace    
    );

tsellsCommented:
Proof it works....
using System;
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            
           
            Regex regex = new Regex(
                    "(?<Protocol>\\w+):\\/\\/(?<Domain>[\\w@][\\w.:@]+)\\/?[\\w\\." +
                    "?=%&=\\-@/$,]*",
                    RegexOptions.IgnoreCase
                    | RegexOptions.IgnorePatternWhitespace
                    );


            string url1 = "http://www.amazon.com/Lois-Clark-Adventures-Superman-Complete/dp/B000HWZ4B6/";
            string url2 =
                @"http://www.amazon.com/Lois-Clark-Adventures-Superman-Complete/dp/B000HWZ4B6/ref=pd_bbs_sr_1?ie=UTF8&s=dvd&qid=1198029833&sr=8-1";


            bool isMatch1 = regex.IsMatch(url1);
            bool isMatch2 = regex.IsMatch(url2);

            Console.WriteLine(isMatch1);
            Console.WriteLine(isMatch2);
            Console.ReadLine();
        }
    }
}

mrichmonAuthor Commented:
tsells,

When I try that one I get the same result as mine - it crashes the web server.  Also it crashes the server for the second example from a sample wikipedia regular expression, which mine didn't....

Maybe it is different for a windows form which is where you tested it?


tsellsCommented:
I tested it in a console application.   It sounds like maybe the input is masking characters or something - what kind of errors are you getting?  "Crashing the server" is about as vague as you can get.  Do you have any log files or anything?  Event Logs, etc?  I am assuming you are running IIS.  How about posting some code, etc.  
mrichmonAuthor Commented:
I realize that, but have nothing more really.  It just locks and the service maxes out the CPU on the server until it times out.  No errors provided except the generic timeout message.  The server is basically barely responsive until the timeout occurs.

It is basically acting like it would if you coded an endless loop.  However, it is not our specific server nor code - we have tested on others and the exact same thing happens.  The only thing in common is the regular expression(s) and the input value(s) that cause the issue.

Since .NET can generate javascript for validators, we tried that too.  The same thing kills the javascript validation.  The browser hangs until it reports back "Script not responding" and allows you to stop the script.

As I mentioned above I narrowed it down to only occur when the - character is the second to last character.  Otherwise these symptoms do not occur.

That is what is so strange.  If it simply did not match then i could handle updating the regular expression, but I can't figure out why some of these regular expressions die like this only on certain input strings.
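The symptoms described here (CPU pegged until the request times out, and only for certain inputs) are consistent with runaway regex backtracking rather than a literal endless loop. One way to confirm that, and to protect the server while investigating, is a match timeout. The sketch below assumes .NET 4.5 or later, where `Regex.IsMatch` has an overload taking a `TimeSpan`. The `SafeRegex` class, the `TryIsMatch` helper and the two-second limit are hypothetical choices, not part of the original application, and this would live in server-side validation code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SafeRegex
{
    // Returns true if the match attempt finished within the timeout;
    // 'matched' then holds the result. Returns false if the regex
    // engine abandoned the attempt because the timeout elapsed.
    public static bool TryIsMatch(string pattern, string input, TimeSpan timeout, out bool matched)
    {
        try
        {
            matched = Regex.IsMatch(input, pattern, RegexOptions.None, timeout);
            return true;
        }
        catch (RegexMatchTimeoutException)
        {
            matched = false;
            return false;
        }
    }
}

public static class SafeRegexDemo
{
    public static void Main()
    {
        // The pattern and URL are the ones posted in the question.
        const string pattern = @"^(http(s?):\/\/|~/|/)?([a-zA-Z]{1}([\w-]+.)+([\w]{2,5}))(:[\d]{1,5})?/?(\w+.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?$";
        const string url = "http://www.amazon.com/Lois-Clark-Adventures-Superman-Complete/dp/B000HWZ4B6/ref=pd_bbs_sr_1?ie=UTF8&s=dvd&qid=1198029833&sr=8-1";

        bool matched;
        bool finished = SafeRegex.TryIsMatch(pattern, url, TimeSpan.FromSeconds(2), out matched);
        Console.WriteLine(finished ? "Finished, match = " + matched : "Gave up after the timeout");
    }
}
```

If the attempt is abandoned at the timeout for the problem URL but finishes quickly for the others, that points at the pattern rather than the server.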
käµfm³d 👽Commented:
I think you have over-complicated your RegEx pattern. For example, part of your pattern

    [a-zA-Z]{1}([\w-]+.)+

states to look for a single alpha character (of any case) followed by one or more of the entire sequence: one or more word characters or dashes, followed by any character. This just feels a little redundant to me. If you could shed a bit more light on what you are trying to achieve, a better regex could be developed.

The reason I am asking for more clarity is that at the surface, it appears to me that all you are trying to do is trim the end of a URL string for a particularly-formatted  string (the "ref" stuff in this case). I would think this might be better handled by searching for this extraordinary circumstance, rather than try to validate the "norm". Please correct me where I am misinterpreting your requirement :)
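The fragment quoted above is also a likely source of the hang. Because the `.` is unescaped, it can match the same characters as `[\w-]`, and the whole group is repeated with `+`, so a backtracking engine has many different ways to split the same text when the overall match eventually fails. A minimal sketch of that effect follows. The pattern is a simplified stand-in rather than the original expression, `BacktrackingDemo` and `Measure` are made-up names, and the timings will depend on the machine and .NET version, so none are listed here.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class BacktrackingDemo
{
    // Simplified from the quoted fragment. The input below never contains
    // '!', so the match must fail, and a backtracking engine may try every
    // way of splitting the letters between [\w-]+ and '.' before giving up.
    private const string Pattern = @"^([\w-]+.)+!$";

    public static string Measure(int length)
    {
        string input = new string('a', length);
        // The timeout keeps the demo bounded (assumes .NET 4.5 or later).
        var regex = new Regex(Pattern, RegexOptions.None, TimeSpan.FromSeconds(2));
        var watch = Stopwatch.StartNew();
        try
        {
            bool matched = regex.IsMatch(input);
            return length + " chars: match=" + matched + " in " + watch.ElapsedMilliseconds + " ms";
        }
        catch (RegexMatchTimeoutException)
        {
            return length + " chars: gave up after " + watch.ElapsedMilliseconds + " ms";
        }
    }

    public static void Main()
    {
        // Watch how the elapsed time changes as the input grows.
        for (int length = 10; length <= 40; length += 5)
        {
            Console.WriteLine(Measure(length));
        }
    }
}
```

Escaping the dot (`\.`) removes most of the ambiguity, since a literal dot cannot also be consumed by `[\w-]+`.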
tsellsCommented:
I would also be curious what the actual string being validated is.  Are you going by what you typed into the text box, or have you examined the actual string that reaches the validation routine (by debugging or some output command)?
mrichmonAuthor Commented:
I am not trying to trim the end of a url or detect a specific pattern.  I am trying to validate that what a user types (or more accurately copy/pastes) into a textbox is a valid url.  It could be as simple as http://www.experts-exchange.com/  or it could be a complex url from a major online site.  All I care is - is the Url entered valid?  I don't care if the url is currently working - just is it a valid url.

The regular expression I posted here that I am using I got from somewhere - I don't know the exact location anymore, but it has had the least issues of all I have found.

Here are the requirements:
Validate a http or https url that is entered into a textbox.
Most common Urls submitted are those copied by a user from Amazon, Buy.com, etc or other online sites.

Most URL regular expressions I have tried have issues on one or more of the following urls:

Amazon Image Urls which look like:
http://ecx.images-amazon.com/images/I/51ei0Vp72jL._SL500_SS75_.jpg

Amazon Urls with ref at the end like:
http://www.amazon.com/Lois-Clark-Adventures-Superman-Complete/dp/B000HWZ4B6/ref=pd_bbs_sr_1?ie=UTF8&s=dvd&qid=1198029833&sr=8-1

Wikipedia type urls with parenthesis in them like:
http://example.com/blah_blah_(wikipedia)_and_more_(parens)_eh

The one I am using works for almost everything I have thrown at it except the Amazon Urls with the ref at the end.  If it were just a matter of it not matching, I could easily change or modify the pattern - I understand and can write regular expressions well enough to do that.  But what I don't understand is the crash instead of a "no match" result.

I am willing to switch to another if it can handle all the cases this one can AND the ones this one can't.  If I switch to one that handles the problem case, but fails elsewhere I am no better off.

I am currently trying to adapt one I found that handles the problem case to also handle some of the other cases it failed on since at least it returns 0 matches instead of a crash, but it is slow going.

I was hoping I could get somewhere faster by posting here....


tsells,
It is having issues on the string being typed in.  I don't know why you would think what is being validated is different, since JavaScript, my application, and the test sites all return the same results....
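For the requirements listed above (http or https only, accept whatever a user pastes from Amazon, Wikipedia and similar sites), one alternative to a hand-written pattern is to let the framework parse the string. This is a different technique from the regular expressions discussed in this thread, sketched under the assumption that server-side validation alone is acceptable, since it does not produce client-side JavaScript. `UrlCheck` and `IsHttpUrl` are hypothetical names. Note that `Uri.TryCreate` is fairly lenient, so it may accept some strings a stricter pattern would reject.

```csharp
using System;

public static class UrlCheck
{
    // True when the input parses as an absolute URI whose scheme is http or https.
    public static bool IsHttpUrl(string input)
    {
        Uri uri;
        return Uri.TryCreate(input, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }

    public static void Main()
    {
        // Sample URLs taken from this thread.
        string[] samples =
        {
            "http://ecx.images-amazon.com/images/I/51ei0Vp72jL._SL500_SS75_.jpg",
            "http://www.amazon.com/Lois-Clark-Adventures-Superman-Complete/dp/B000HWZ4B6/ref=pd_bbs_sr_1?ie=UTF8&s=dvd&qid=1198029833&sr=8-1",
            "http://example.com/blah_blah_(wikipedia)_and_more_(parens)_eh"
        };

        foreach (string sample in samples)
        {
            Console.WriteLine(IsHttpUrl(sample) + "  " + sample);
        }
    }
}
```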
mrichmonAuthor Commented:
Well, I ended up tinkering and writing something that seems to work for all of the above cases as well as IP.

Here is what I ended up with:
^(http(s?)\:\/\/)?(([a-zA-Z]{1}([\w-]+\.)+([\w]{2,5}))|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))(:[\d]{1,5})?((/?[\w-\.]+/)+|/?)([\w=_\.()~]+)?((\?[\w-]+=[\w-\.~]+)?(&[\w-]+=[\w-\.~]+)*)?(#([A-Za-z][\w:.-]*)?)?$
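A quick way to check a candidate pattern such as the one above is to run it against every URL raised in this thread, with a timeout so that a bad case reports itself instead of hanging. The following is a minimal sketch, assuming .NET 4.5 or later for the timeout overload. `UrlPatternCheck` and `Check` are made-up names, and the results are not listed here because they should be confirmed by running it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlPatternCheck
{
    // The pattern posted above, copied verbatim.
    private const string Pattern = @"^(http(s?)\:\/\/)?(([a-zA-Z]{1}([\w-]+\.)+([\w]{2,5}))|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))(:[\d]{1,5})?((/?[\w-\.]+/)+|/?)([\w=_\.()~]+)?((\?[\w-]+=[\w-\.~]+)?(&[\w-]+=[\w-\.~]+)*)?(#([A-Za-z][\w:.-]*)?)?$";

    // Two seconds is an arbitrary limit chosen for this sketch.
    private static readonly Regex Validator =
        new Regex(Pattern, RegexOptions.None, TimeSpan.FromSeconds(2));

    // Returns "match", "no match", or "timeout" for a single URL.
    public static string Check(string url)
    {
        try
        {
            return Validator.IsMatch(url) ? "match" : "no match";
        }
        catch (RegexMatchTimeoutException)
        {
            return "timeout";
        }
    }

    public static void Main()
    {
        // Every sample URL mentioned in the thread.
        string[] samples =
        {
            "http://www.experts-exchange.com/",
            "http://www.amazon.com/Lois-Clark-Adventures-Superman-Complete/dp/B000HWZ4B6/",
            "http://www.amazon.com/Lois-Clark-Adventures-Superman-Complete/dp/B000HWZ4B6/ref=pd_bbs_sr_1?ie=UTF8&s=dvd&qid=1198029833&sr=8-1",
            "http://ecx.images-amazon.com/images/I/51ei0Vp72jL._SL500_SS75_.jpg",
            "http://example.com/blah_blah_(wikipedia)_and_more_(parens)_eh"
        };

        foreach (string sample in samples)
        {
            Console.WriteLine(Check(sample) + "  " + sample);
        }
    }
}
```

Adding each newly reported failure to the `samples` array makes it easier to see whether a later tweak to the pattern breaks a case that used to pass.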