nbcit

URL Parsing Regular Expression

I need a regular expression to return a match in the following conditions:

Between the last slash (/) of a URL and either the end of the string (with no GET query string) or the ? (with a GET query string), return a match if the contents DO NOT include a period (.).  The expression should not match if there is nothing between the last slash (/) and the end of the string or the ? mark.

Basically I am trying to determine whether the URL looks like it is directed at a folder (instead of a page) but is missing the trailing folder slash (/).  I will not be using the returned matches, but merely seeing whether there were matches or not.  I believe the regular expression engine I am using is based on the Perl syntax.

The incoming URL could be any of the following patterns (expected result in parentheses):
/  (no match)
/?var1=1&var2=2   (no match)
/folder  (match)
/page.asp  (no match)
/page.asp?var1=1&var2=2  (no match)
/folder1/folder2  (match)
/folder1/folder2?var1=1&var2=2  (match)
/folder1/page.asp  (no match)
/folder1/page.asp?var1=1&var2=2  (no match)

Thanks everyone!
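
For reference, the rule described above can also be written without a regular expression, which gives a baseline to check candidate patterns against. This is only a minimal sketch in Python, not the engine from the question, and the helper name `looks_like_folder` is made up for illustration:

```python
def looks_like_folder(url: str) -> bool:
    """Return True when the text after the last '/' of the path is
    non-empty and contains no '.' (hypothetical helper, not from the thread)."""
    path = url.split("?", 1)[0]             # drop the query string, if any
    last_segment = path.rsplit("/", 1)[-1]  # text after the last slash
    return last_segment != "" and "." not in last_segment


# Expected results copied from the list in the question.
EXPECTED = {
    "/": False,
    "/?var1=1&var2=2": False,
    "/folder": True,
    "/page.asp": False,
    "/page.asp?var1=1&var2=2": False,
    "/folder1/folder2": True,
    "/folder1/folder2?var1=1&var2=2": True,
    "/folder1/page.asp": False,
    "/folder1/page.asp?var1=1&var2=2": False,
}

for url, expected in EXPECTED.items():
    result = looks_like_folder(url)
    print(f"{url:35} {'match' if result else 'no match'}")
    assert result == expected
```

Splitting on the first ? before looking for the last slash keeps any slash or period inside the query string from affecting the result.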
Richard Quadling

Try ...

^/(?!.*\.)(?!\?).*$

Options: case insensitive; ^ and $ match at line breaks

Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
Match the character / literally «/»
Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!.*\.)»
   Match any single character that is not a line break character «.*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
   Match the character . literally «\.»
Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!\?)»
   Match the character ? literally «\?»
Match any single character that is not a line break character «.*»
   Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Assert position at the end of a line (at the end of the string or before a line break character) «$»


Created with RegexBuddy
nbcit

ASKER

RQuadling,

That looks like it might work, but I suppose it would be possible for folder1 to have a period in it, which your regex would not handle properly.  That is why I said the check should apply only after the last slash in the URL.  Here are several additional examples that show this, and how it SHOULD react:

/fol.der.1/folder2  (match)
/fol.der.1/folder2?var1=1&var2=2  (match)
/fol.der.1/page.asp  (no match)
/fol.der.1/page.asp?var1=1&var2=2  (no match)

Thanks!
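
A quick way to see the effect described here is to run the first pattern against a folder name that contains a period. The sketch below uses Python's `re` module as a stand-in for the actual engine (an assumption), with the options RegexBuddy listed; the function name `first_pattern_matches` is hypothetical. The lookahead `(?!.*\.)` is evaluated right after the leading slash, so it scans the whole rest of the string and a period anywhere rejects the URL:

```python
import re

# First pattern from the thread, with the listed options
# (case insensitive; ^ and $ match at line breaks).
FIRST = re.compile(r"^/(?!.*\.)(?!\?).*$", re.IGNORECASE | re.MULTILINE)


def first_pattern_matches(url: str) -> bool:
    """True if the first pattern finds a match in url."""
    return FIRST.search(url) is not None


for url in ["/folder1/folder2",
            "/fol.der.1/folder2",
            "/fol.der.1/folder2?var1=1&var2=2"]:
    print(url, first_pattern_matches(url))
```

Under these assumptions only the first URL matches; the two with a period in the folder name are rejected even though the last segment has none.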
Keep them coming and I can adapt it until you are happy.


nbcit

ASKER

RQuadling,

So right now I see two distinct problems with the regex you provided.
First, periods in the path preceding the last slash (/) are also considered.  For example:
/fol.der.1/page.asp  (shows a match, but should not match)
/fol.der.1/page.asp?var1=1&var2=2  (shows a match, but should not match)

Also, your regex still matches when there are no characters between the last slash (/) and the end or question mark (?):
/ (shows a match, but should not match)
/folder1/ (shows a match, but should not match)

If you could poke around with those examples we should be good to go.  Thanks for your help!


^/(?:[^?]+/[^./]++|[^/?.]+)$

Options: case insensitive; ^ and $ match at line breaks

Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
Match the character / literally «/»
Match the regular expression below «(?:[^?]+/[^./]++|[^/?.]+)»
   Match either the regular expression below (attempting the next alternative only if this one fails) «[^?]+/[^./]++»
      Match any character that is NOT a ? «[^?]+»
         Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      Match the character / literally «/»
      Match a single character NOT present in the list ./ «[^./]++»
         Between one and unlimited times, as many times as possible, without giving back (possessive) «++»
   Or match regular expression number 2 below (the entire group fails if this one fails to match) «[^/?.]+»
      Match a single character NOT present in the list /?. «[^/?.]+»
         Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Assert position at the end of a line (at the end of the string or before a line break character) «$»


Created with RegexBuddy
Not matched

/
/?var1=1&var2=2
/page.asp
/page.asp?var1=1&var2=2
/folder1/page.asp
/folder1/fol.der.2/page.asp
/folder1/page.asp?var1=1&var2=2
/fol.der.1/page.asp
/fol.der.1/page.asp?var1=1&var2=2



Matched

/folder
/folder1/folder2/folder3
/folder1/folder2?var1=1&var2=2
/fol.der.1/folder2
/fol.der.1/folder2?var1=1&var2=2
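
The two lists above can be re-checked mechanically. The following is a sketch in Python rather than the engine used in the thread (an assumption). The possessive `++` is written as a plain `+` so the code also runs on Python versions before 3.11; for single-line URLs like these that does not change which strings match, because the group is followed directly by `$`. The name `second_pattern_matches` is made up for illustration.

```python
import re

# Second pattern from the thread, with "++" relaxed to "+" (see note above).
SECOND = re.compile(r"^/(?:[^?]+/[^./]+|[^/?.]+)$", re.IGNORECASE | re.MULTILINE)

NOT_MATCHED = [
    "/",
    "/?var1=1&var2=2",
    "/page.asp",
    "/page.asp?var1=1&var2=2",
    "/folder1/page.asp",
    "/folder1/fol.der.2/page.asp",
    "/folder1/page.asp?var1=1&var2=2",
    "/fol.der.1/page.asp",
    "/fol.der.1/page.asp?var1=1&var2=2",
]

MATCHED = [
    "/folder",
    "/folder1/folder2/folder3",
    "/folder1/folder2?var1=1&var2=2",
    "/fol.der.1/folder2",
    "/fol.der.1/folder2?var1=1&var2=2",
]


def second_pattern_matches(url: str) -> bool:
    """True if the second pattern finds a match in url."""
    return SECOND.search(url) is not None


assert all(second_pattern_matches(u) for u in MATCHED)
assert not any(second_pattern_matches(u) for u in NOT_MATCHED)
print("both lists behave as posted")
```

One case that appears in neither list is a single-level folder with a query string, such as /folder?var1=1. Under the same assumptions this pattern does not match it, because the second alternative excludes ? and must run all the way to the end of the line.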
nbcit

ASKER

RQuadling,

I am having trouble testing the last expression you supplied.  Are you sure it is in Perl-compatible format?
I have tried to test it in both nregexp (http://www.nregex.com/nregex/default.aspx), and in PHP using preg_match (I noticed you are a PHP Sage).  Both worked fine with the first expression you supplied, but both fail with the second one.

nregexp shows 'undefined', which often happens when you supply an incomplete or incompatible expression, and PHP preg_match throws an "Unknown modifier '?'" error.

Any ideas?
Hmmm

The snippet below outputs ...

/folder
/folder1/folder2
/folder1/folder2?var1=1&var2=2
/fol.der.1/folder2
/fol.der.1/folder2?var1=1&var2=2



<?php
$s_Test = <<< END_TEST
/
/?var1=1&var2=2
/page.asp
/page.asp?var1=1&var2=2
/folder1/page.asp
/folder1/page.asp?var1=1&var2=2
/fol.der.1/page.asp
/fol.der.1/page.asp?var1=1&var2=2
 
/folder
/folder1/folder2
/folder1/folder2?var1=1&var2=2
/fol.der.1/folder2
/fol.der.1/folder2?var1=1&var2=2
 
END_TEST;
 
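// Test each line separately. % is used as the pattern delimiter so that
// the / characters inside the regex do not need escaping.
foreach(explode(PHP_EOL, $s_Test) as $s_Line)
	{
	if (1 == preg_match('%^/(?:[^?]+/[^./]++|[^/?.]+)$%im', $s_Line, $a_Match))
		{
		// Print only the lines that look like a folder without a trailing slash.
		echo $s_Line, PHP_EOL;
		}
	}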


nbcit

ASKER

OK, I was just doing it wrong.  Nice job, it looks like that solves the problem completely.  I have a question relating to this.  I could post it as another question if you think that is appropriate, but it is 100% about making this particular regexp even better for me.  

I know I said I didn't need the matches returned, but now I think that that could really be useful.  Can grouping somehow be used to return everything from the beginning of the string to the question mark (?), if it exists (or the end of the string if it does not)?
nbcit

ASKER

Oh, I mean everything PRIOR TO the question mark, not including it.
So, just the path and not the query?

Hmm.

Probably, but I would do 2 passes. Collect the matches and then simply split each result on the ?

The code below outputs ...

/folder vs /folder
/folder1/folder2 vs /folder1/folder2
/folder1/folder2 vs /folder1/folder2?var1=1&var2=2
/fol.der.1/folder2 vs /fol.der.1/folder2
/fol.der.1/folder2 vs /fol.der.1/folder2?var1=1&var2=2


<?php
$s_Test = <<< END_TEST
/
/?var1=1&var2=2
/page.asp
/page.asp?var1=1&var2=2
/folder1/page.asp
/folder1/page.asp?var1=1&var2=2
/fol.der.1/page.asp
/fol.der.1/page.asp?var1=1&var2=2
 
/folder
/folder1/folder2
/folder1/folder2?var1=1&var2=2
/fol.der.1/folder2
/fol.der.1/folder2?var1=1&var2=2
 
END_TEST;
 
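foreach(explode(PHP_EOL, $s_Test) as $s_Line)
	{
	// Pass 1: keep only the lines the regex accepts.
	if (1 == preg_match('%^/(?:[^?]+/[^./]++|[^/?.]+)$%im', $s_Line, $a_Match))
		{
		// Pass 2: the path is everything before the first ?.
		list($s_Path) = explode('?', $s_Line);
		echo $s_Path, ' vs ', $s_Line, PHP_EOL;
		}
	}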


nbcit

ASKER

I'm actually not using PHP for the final project, I was just testing it there because it made it easy to run all the tests at once.  This is actually going into an IIS URL Rewrite ISAPI module configuration file.  I have only the embedded commands the module provides to work with, and splitting the output is not possible.

Would it be difficult to change the expression to output that somehow as one of the matches?
nbcit

ASKER

What would be amazing (the coolest) would be to have everything before the ? as one match, then everything after the ? as another match, but following the same rules as before.

That way when the module rewrites the URL, it can insert a / between the two.
Try this ...

Outputs ...

Array
(
    [0] => /folder
    [1] => /folder
    [2] =>
)
Array
(
    [0] => /folder1/folder2
    [1] => /folder1/folder2
    [2] =>
)
Array
(
    [0] => /folder1/folder2?var1=1&var2=2
    [1] => /folder1/folder2
    [2] => var1=1&var2=2
)
Array
(
    [0] => /fol.der.1/folder2
    [1] => /fol.der.1/folder2
    [2] =>
)
Array
(
    [0] => /fol.der.1/folder2?var1=1&var2=2
    [1] => /fol.der.1/folder2
    [2] => var1=1&var2=2
)

So $1 is the path and $2 is the query with no ?
<?php
$s_Test = <<< END_TEST
/
/?var1=1&var2=2
/page.asp
/page.asp?var1=1&var2=2
/folder1/page.asp
/folder1/page.asp?var1=1&var2=2
/fol.der.1/page.asp
/fol.der.1/page.asp?var1=1&var2=2
 
/folder
/folder1/folder2
/folder1/folder2?var1=1&var2=2
/fol.der.1/folder2
/fol.der.1/folder2?var1=1&var2=2
 
END_TEST;
 
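foreach(explode(PHP_EOL, $s_Test) as $s_Line)
	{
	// The lookahead applies the same accept/reject rules as before;
	// group 1 then captures the path and group 2 the query (without the ?).
	if (1 == preg_match('%(?=^/[^?]+/[^./]++$|^/[^/?.]+$)([^?]+)\??(.*)%im', $s_Line, $a_Match))
		{
		print_r($a_Match);
		}
	}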


I'm off now. Hope this all works out for you. May be able to drop in again in around 10 hours time.
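
To illustrate how the two captures could be recombined with a / between them, here is a rough sketch in Python. It is not the rewrite module's syntax; the function name `add_trailing_slash` is hypothetical, and the possessive `++` is relaxed to `+` so the code runs on Python versions before 3.11 (for single-line URLs the matches are the same).

```python
import re
from typing import Optional

# Capturing pattern from the post above, with "++" written as "+".
CAPTURE = re.compile(
    r"(?=^/[^?]+/[^./]+$|^/[^/?.]+$)([^?]+)\??(.*)",
    re.IGNORECASE | re.MULTILINE,
)


def add_trailing_slash(url: str) -> Optional[str]:
    """Return url with a / inserted after the path, or None when the
    pattern does not match (hypothetical helper for illustration)."""
    m = CAPTURE.search(url)
    if m is None:
        return None
    path, query = m.group(1), m.group(2)  # query is "" when there is no ?
    return path + "/" + ("?" + query if query else "")


for url in ["/folder", "/folder1/folder2?var1=1&var2=2", "/page.asp"]:
    print(url, "->", add_trailing_slash(url))
```

With this pattern the second group always takes part in the match, so it is an empty string rather than missing when the URL has no query.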
# following raw regex should work:

(?:/[^?.]+(?:\?.*)?)$
ahoffmann, that does match, but doesn't extract the path and query.

Nice and short regex though.

This does do the extract though ...

(?:(/[^?.]+)(?:\?(.*))?)$

I missed the required back reference (as described in the question). Thanks for correcting it, RQuadling.
NP.

Using my previous code and your amended regex, the output is ...

Array
(
    [0] => /folder
    [1] => /folder
)
Array
(
    [0] => /folder1/folder2
    [1] => /folder1/folder2
)
Array
(
    [0] => /folder1/folder2?var1=1&var2=2
    [1] => /folder1/folder2
    [2] => var1=1&var2=2
)
Array
(
    [0] => /folder2
    [1] => /folder2
)
Array
(
    [0] => /folder2?var1=1&var2=2
    [1] => /folder2
    [2] => var1=1&var2=2
)

Note that sometimes there is no [2]. Not sure on the effect in a mod_rewrite rule for this.
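
Two details of that output can be reproduced with a small sketch, here in Python as a stand-in engine (an assumption; the function name `extract` is made up). First, the pattern has no ^ anchor, so the search may begin at a later slash, which is why the fol.der.1 URLs report only /folder2. Second, the query group is optional, so it does not take part in the match when there is no ?; Python reports that as None, while PHP simply leaves [2] out of the array.

```python
import re

# Shorter pattern from the thread, with the path and query captures.
SHORT = re.compile(r"(?:(/[^?.]+)(?:\?(.*))?)$", re.IGNORECASE | re.MULTILINE)


def extract(url):
    """Return (path, query) from the first match, or None if no match.
    query is None when the optional query group did not participate."""
    m = SHORT.search(url)
    if m is None:
        return None
    return m.group(1), m.group(2)


for url in ["/folder1/folder2",
            "/fol.der.1/folder2?var1=1&var2=2",
            "/fol.der.1/page.asp"]:
    print(url, "->", extract(url))
```

How a particular rewrite module treats a group that did not participate is not something this sketch can show; it only demonstrates the regex behaviour.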
> Not sure on the effect in a mod_rewrite rule ..
Assuming that you tested with PHP's so-called PCRE, it's generally a bad idea to compare PHP regex against other regex engines (mod_rewrite, for example).
I doubt that PHP could be considered reliable here.

For mod_security an unresolved back reference is simply the empty string (as expected by humans :)
"So-called? "

I use RegexBuddy which has a whole host of tools for comparing different regex engines. For PHP, it has both PCRE and EREG, as well as C#, Delphi (.NET and Win32), Java, Javascript, MySQL, Oracle, PCRE (C lib based), Perl, PHP (Preg/Ereg), PostgreSQL, PowerShell, Python, R Language, REALBasic, Ruby, TCL, VBScript, VB6, VB.NET, wxWidgets, XML Schema and XPath.

So, a WIDE range of engines. The code I used WAS PHP Preg - Ereg won't work for this.

Below are some other code snippets generated by RegexBuddy.

Ignoring the DB-specific searches, the regex is acceptable in all but Ereg. As such, I would have high confidence in the regex working in mod_rewrite.

A few of the engines require more escaping of some symbols and RegexBuddy handles this for me.
string resultString = null;
try {
	Regex regexObj = new Regex(@"(?:(/[^?.]+)(?:\?(.*))?)$", RegexOptions.IgnoreCase | RegexOptions.Multiline);
	resultString = regexObj.Match(subjectString).Groups[1].Value;
} catch (ArgumentException ex) {
	// Syntax error in the regular expression
}
===========
var
	RegexObj: Regex;
	ResultString: string;
 
RegexObj := nil;
ResultString := '';
try
	RegexObj := Regex.Create('(?:(/[^?.]+)(?:\?(.*))?)$', RegexOptions.IgnoreCase or RegexOptions.Multiline);
	ResultString := RegexObj.Match(SubjectString).Groups[1].Value;
except
	on E: ArgumentException do begin
		// Syntax error in the regular expression
	end;
end;
 
============
var
	Regex: TPerlRegEx;
	ResultString: string;
 
Regex := TPerlRegEx.Create(nil);
Regex.RegEx := '(?:(/[^?.]+)(?:\?(.*))?)$';
Regex.Options := [preCaseless, preMultiLine];
Regex.Subject := SubjectString;
if Regex.Match then begin
	if Regex.SubExpressionCount >= 1 then begin
		ResultString := Regex.SubExpressions[1];
	end
	else begin
		ResultString := '';
	end;
end
else begin
	ResultString := '';
end;
=============
String ResultString = null;
try {
	Pattern regex = Pattern.compile("(?:(/[^?.]+)(?:\\?(.*))?)$", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.MULTILINE);
	Matcher regexMatcher = regex.matcher(subjectString);
	if (regexMatcher.find()) {
		ResultString = regexMatcher.group(1);
	} 
} catch (PatternSyntaxException ex) {
	// Syntax error in the regular expression
}
 
===========
var rx_Url = /(?:(\/[^?.]+)(?:\?(.*))?)$/im;
var match = rx_Url.exec(subject);
if (match != null) {
	result = match[1];
} else {
	result = "";
}
 
===========
if ($subject =~ m!(?:(/[^?.]+)(?:\?(.*))?)$!im) {
	# Successful match
} else {
	# Match attempt failed
}
===========
# Your regular expression could not be converted to the flavor required by this language:
# A POSIX Extended RE cannot match the start and the end of a line with ^ and $
 
# Because of this, the code snippet below will not work as you intended, if at all.
 
if (eregi('((/[^?.]+)(\?(.*))?)$', $subject)) {
	# Successful match
} else {
	# Match attempt failed
}
===========
$regex = [regex] '(?im)(?:(/[^?.]+)(?:\?(.*))?)$'
$result = $regex.Match($subject).Groups[1].Value;
===========
wxString resultString;
wxRegEx regexObj(_T("(?pw)(?:(/[^?.]+)(?:\\?(.*))?)$"), wxRE_ADVANCED + wxRE_ICASE);
if (regexObj.Matches(subjectString)) {
  resultString = regexObj.GetMatch(subjectString, 1);
}
==========
 
Dim ResultString As String
Try
	Dim RegexObj As New Regex("(?:(/[^?.]+)(?:\?(.*))?)$", RegexOptions.IgnoreCase Or RegexOptions.Multiline)
	ResultString = RegexObj.Match(SubjectString).Groups(1).Value
Catch ex As ArgumentException
	'Syntax error in the regular expression
End Try
 
===============
 
Dim ResultString As String
Dim myMatches As MatchCollection
Dim myMatch As Match
Dim myRegExp As RegExp
Set myRegExp = New RegExp
myRegExp.IgnoreCase = True
myRegExp.MultiLine = True
myRegExp.Pattern = "(?:(/[^?.]+)(?:\?(.*))?)$"
Set myMatches = myRegExp.Execute(SubjectString)
If myMatches.Count >= 1 Then
	Set myMatch = myMatches(0)
	If myMatch.SubMatches.Count >= 1 Then
		' SubMatches is zero-based, so index 0 is capture group 1 (the path)
		ResultString = myMatch.SubMatches(1-1)
	Else
		ResultString = ""
	End If
Else
	ResultString = ""
End If
==================
 
Dim myRegExp, ResultString, myMatches, myMatch
Set myRegExp = New RegExp
myRegExp.IgnoreCase = True
myRegExp.MultiLine = True
myRegExp.Pattern = "(?:(/[^?.]+)(?:\?(.*))?)$"
Set myMatches = myRegExp.Execute(SubjectString)
If myMatches.Count >= 1 Then
	Set myMatch = myMatches(0)
	If myMatch.SubMatches.Count >= 1 Then
		' SubMatches is zero-based, so index 0 is capture group 1 (the path)
		ResultString = myMatch.SubMatches(1-1)
	Else
		ResultString = ""
	End If
Else
	ResultString = ""
End If
===============
regexp = /(?:(\/[^?.]+)(?:\?(.*))?)$/i
match = regexp.match(subject)
if match
	match = match[1]
else
	match = ""
end
============
Dim myRegEx As RegEx
Dim myMatch As RegExMatch
Dim ResultString As String
myRegEx = New RegEx
myRegEx.SearchPattern = "(?:(/[^?.]+)(?:\?(.*))?)$"
myMatch = myRegEx.Search(SubjectString)
If myMatch <> Nil Then
	ResultString = myMatch.SubExpressionString(1)
Else
	ResultString = ""
End If
===================
reobj = re.compile(r"(?:(/[^?.]+)(?:\?(.*))?)$", re.IGNORECASE | re.MULTILINE)
match = reobj.search(subject)
if match:
	result = match.group(1)
else:
	result = ""
======================
etc.


nbcit

ASKER

I finally got back in the office and am trying to get this settled.  The regex I am using now is the one you all supplied:  

However, it seems that with this regex, this matches, but should not (due to the slash after the 123):
/123/?1=2

This is working correctly, as it should match and does properly:
/123?1=2

Any ideas?
nbcit

ASKER

I meant to put in the regex I am using, it is:  (?:(/[^?.]+)(?:\?(.*))?)$'

Thanks!
nbcit

ASKER

I wish you could edit posts here.... The ' was extra.  The correct regex I am using is:

(?:(/[^?.]+)(?:\?(.*))?)$

Sorry! ;-)
ASKER CERTIFIED SOLUTION
Richard Quadling

(The text of this answer is not included in this copy of the thread.)
nbcit

ASKER

I ran the PHP test that you provided, and see that PHP does not match that string, but for some reason the engine I am using (a Perl-compatible PCRE engine in the Ionics ISAPI Rewrite module) reacts otherwise.  Here is an extract from the debug log:

Wed Mar 04 09:01:14 -  1712 - EvalCondition: checking '/123/?1=2' against pattern '(?:(/[^?.]+)(?:\?(.*))?)$'
Wed Mar 04 09:01:14 -  1712 - EvalCondition: match result: 3 (match)
Wed Mar 04 09:01:14 -  1712 - EvalCondition: returning TRUE
Wed Mar 04 09:01:14 -  1712 - EvalCondition: returning TRUE
Wed Mar 04 09:01:14 -  1712 - EvalConditionList: rule 4, TRUE, Rule will apply

The rule runs properly (true) against '/123?1=2', but should not return a match (false) for '/123/?1=2'.  
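
The logged result is what the pattern in the log produces on its own: the class [^?.] also accepts /, so the first group can swallow the trailing slash in /123/. The sketch below shows this in Python (used as a stand-in engine, which is an assumption). It also includes one hypothetical tightening that requires the last character before the ? or the end of the string to be something other than a slash. That variant is an illustration only and is not claimed to be the accepted answer, whose text is not shown in this thread; the names `LOGGED`, `VARIANT` and `groups` are made up.

```python
import re

# Pattern exactly as it appears in the debug log.
LOGGED = re.compile(r"(?:(/[^?.]+)(?:\?(.*))?)$")

# Hypothetical variant (NOT the accepted answer): the path must end in a
# character that is not '/', '?' or '.'.
VARIANT = re.compile(r"(/[^?.]*[^/?.])(?:\?(.*))?$")


def groups(pattern, url):
    """Return (group 1, group 2) for the first match, or None if no match."""
    m = pattern.search(url)
    return None if m is None else (m.group(1), m.group(2))


for url in ["/123?1=2", "/123/?1=2"]:
    print(url, "logged:", groups(LOGGED, url), "variant:", groups(VARIANT, url))
```

Under these assumptions the logged pattern matches both URLs, while the variant matches /123?1=2 and rejects /123/?1=2. Like the logged pattern, the variant is unanchored at the start, so it inherits the same behaviour for folder names containing periods.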
SOLUTION
(The text of this answer is not included in this copy of the thread.)
nbcit

ASKER

Sorry, I hadn't realized you made any changes to the regex today.

It works great!  You're a wizard!  Thank you so much, this has been driving me crazy!
I would give 2 million points if I could, but 500 will have to do.  Thanks again!
The result is a nice example of collaboration, isn't it?
Yep. This should be a split. The regex you've got is mainly ahoffmann's. I just added the capture and [^/].