Questions about a RegEx used to analyze URLs

curiouswebster
Question about a RegEx:

 @"[&|?](" + "myDomain.com" + ")=(.*?[^&]+)?";

what do these require or prevent before the domain?

[&|?]

and what does this require or prevent after the domain?

(.*?[^&]+)?

Thanks.
Dr. Klahn, Principal Software Engineer
Commented:
[&|?] is an unusual regex for parsing URLs.  What it matches is "Any single character from the three-character set &|?"

http://regexstorm.net/tester
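
A minimal C# sketch of that point, assuming a made-up parameter name `x`: inside square brackets the pipe is just a literal character, so `[&|?]` accepts `|` as well as `&` and `?`. The class and method names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // Inside [...] the pipe is a literal character, not alternation.
    public static bool StartsParam(string s) => Regex.IsMatch(s, @"^[&|?]x=");

    public static void Main()
    {
        foreach (var s in new[] { "&x=1", "?x=1", "|x=1", "#x=1" })
        {
            // Prints True for the first three inputs and False for "#x=1".
            Console.WriteLine($"{s} -> {StartsParam(s)}");
        }
    }
}
```

If the intent was only "& or ?", the class `[&?]` would express that without also accepting the pipe.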
curiouswebster, Software Engineer (Author)
Commented:
Ah, I was tied up thinking it was

& OR ?

The person who wrote this was focused on the query string parameters.

I have seen & AND ? with query string params, but do not recall seeing the | sign being used with query string params.
curiouswebster, Software Engineer (Author)
Commented:
Then comes a capture group:

(.*?[^&]+)?

It looks like any number of characters, NOT containing a &.

Am I seeing that right?

And what does the trailing ? mean?

And the ? after .* means "lazy", but I am not sure what that means.
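
As a small illustration of the two question marks asked about above, here is a hedged C# sketch. The input string and the simplified patterns are made up for the demo and are not the pattern from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyDemo
{
    // Returns the text of the first match, or "" when nothing matches.
    public static string First(string input, string pattern) =>
        Regex.Match(input, pattern).Value;

    public static void Main()
    {
        const string input = "a=1&b=2&c=3";

        // Greedy: .* grabs as much as it can, then backs off to the LAST '&'.
        Console.WriteLine(First(input, @"a=.*&"));   // a=1&b=2&

        // Lazy: .*? grabs as little as it can, so it stops at the FIRST '&'.
        Console.WriteLine(First(input, @"a=.*?&"));  // a=1&

        // A trailing ? after a group makes the whole group optional.
        Console.WriteLine(Regex.IsMatch("a=", @"^a=([^&]+)?$")); // True
    }
}
```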
Lead SaaS Infrastructure Engineer
Commented:
 @"[&|?](" + "myDomain.com" + ")=(.*?[^&]+)?";

essentially becomes this regex:

@[&|?](myDomain.com)=(.*?[^&]+)?

This means it matches:
First: "@"
Then: "&", or "|", or "?"
Then: "myDomain.com" (captured as Group 1; the "." is unescaped, so it matches any character, not just a dot)
Then: "="
Lastly: the following group is matched as a whole once or zero times (captured as Group 2)
Part 1: Any character except a newline, any number of times, matching as few times as possible (I can't see how the ? is useful on a .* here)
Part 2: Any character except "&", at least once and up to an unlimited number of times (this will match newlines and carriage returns!)


So here are some example matches and their capture groups:
String: "@&myDomain.com=hello&then"
Matched: "@&myDomain.com=hello"
Group1: "myDomain.com"
Group2: "hello"

String: "@?myDomain.com=There&now"
Matched: "@?myDomain.com=There"
Group1: "myDomain.com"
Group2: "There"

String: "@&myDomain.com=There

is more match here

also more match here

oops&now this is unmatched"
Matched: "@&myDomain.com=There

is more match here

also more match here

oops"
Group1: "myDomain.com"
Group2: "There

is more match here

also more match here

oops"

Here is a great utility I recently found to help with complex regexes if you get stuck. I have saved the example regexes from this post at the link below, which shows additional information and what the matches are :)

https://regex101.com/r/3nQKTA/1


Here is what it looks like on that website. Thanks to the syntax highlighting and the breakdown of every part of the pattern, it gives better explanations than I initially put forward, so take a look.

[Two screenshots of the regex101 tester]
Ben Personick (Previously QCubed), Lead SaaS Infrastructure Engineer
Commented:
Oh neat!

So I didn't realize that the @ is actually part of the C# string literal syntax (it marks a verbatim string) rather than part of the pattern. The site I use lets you re-create the example regex matching in a set of languages including C#, and when I did that I saw that the @ belongs to how C# handles the string.

So it only changes my answer slightly, as @ is NOT matched.
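
A minimal C# sketch of that correction, reusing the concatenation from the question: the `@` prefix only marks a verbatim string literal, so it never ends up in the pattern text. The class name and the sample input are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimDemo
{
    // Same concatenation as in the question; the @ belongs to the C# literal.
    public static string Pattern() =>
        @"[&|?](" + "myDomain.com" + ")=(.*?[^&]+)?";

    public static void Main()
    {
        // The resulting pattern text has no leading '@'.
        Console.WriteLine(Pattern()); // [&|?](myDomain.com)=(.*?[^&]+)?

        // So the input does not need an '@' before the separator either.
        Console.WriteLine(Regex.Match("?myDomain.com=hello", Pattern()).Value); // ?myDomain.com=hello

        // A verbatim literal only changes how backslashes are written in source.
        Console.WriteLine(@"\d" == "\\d"); // True
    }
}
```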

The website link still works, as I updated the regex: https://regex101.com/r/3nQKTA/4

Here is the example code the site creates for C#, with the example test strings I had used:

using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        string pattern = @"[&|?](myDomain.com)=(.*?[^&]+)?";
        string input = @"
something|myDomain.com=&&&&&&& &

@&myDomain.com=hello&then

@?myDomain.com=There&now

@&myDomain.com=There

is more match here

also more match here

oops&now this is unmatched

";
        
        foreach (Match m in Regex.Matches(input, pattern))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}

Updated Regex as shown on the regex101 website
Also, I forgot to mention that I realized why the .*? is there: it lets Group 2 capture "&" characters that come immediately after the = sign. The lazy .*? consumes those leading ampersands, as many as there are, until some other character appears; from there [^&]+ takes over and matches everything up to the next ampersand, which is not captured.
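
The behaviour of Group 2 can be checked with a short C# sketch. The pattern is the one from the question; the input strings and the class name are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // The pattern from the question, unchanged (note the unescaped dot).
    public const string Pattern = @"[&|?](myDomain.com)=(.*?[^&]+)?";

    // Returns what Group 2 captured, or "" if the optional group did not take part.
    public static string Value(string input) =>
        Regex.Match(input, Pattern).Groups[2].Value;

    public static void Main()
    {
        // Ordinary value: .*? matches nothing and [^&]+ takes "hello".
        Console.WriteLine(Value("?myDomain.com=hello&then")); // hello

        // Leading ampersands: .*? expands over "&&&" until [^&]+ can match "x".
        Console.WriteLine(Value("|myDomain.com=&&&x&y"));     // &&&x

        // Nothing after '=': the trailing ? lets the match succeed without Group 2.
        Console.WriteLine(Value("?myDomain.com=").Length);    // 0

        // The unescaped '.' accepts any character in place of the dot.
        Console.WriteLine(Regex.IsMatch("?myDomainXcom=1", Pattern)); // True
    }
}
```

If only a literal dot should be accepted, writing the domain as `myDomain\.com` (or passing it through `Regex.Escape`) would close that last gap.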
curiouswebster, Software Engineer (Author)
Commented:
Awesome tool! I am still trying to get my arms around it, but this is THE BEST RegEx site I have seen!

What's the best Flavor for me to use, given my target platform is C#?
Ben Personick (Previously QCubed), Lead SaaS Infrastructure Engineer
Commented:
Right?  I stumbled on this site a couple months ago from someone else's Q on Regex somewhere and just remembered it after trying to parse your regex in my head.

Now..I know your Q was "What does this Regex Do"

but, perhaps a better Question to answer would be for me to ask you "What do you WANT the regex to match?"
curiouswebster, Software Engineer (Author)
Commented:
Well, funny you should ask.

I had this super RegEx working to enforce the domain was on a white list:

            string testRegEx = @"^https?:\/\/(" + whitelistedRedirects + ")[^.].*\\/?((goto|returnurl)=https?:\/\/(" + whitelistedRedirects + ")[:|\\/].*)?";

but it enforced that sub-domains must also be white-listed. The whitelist was to look thusly:

whitelistedRedirects = "mydomain.org|sso.mydomain.org";

But I wanted to have a version that mandated only that "mydomain.org" was in the whitelist, when it was part of the ReturnURL. (is this risky? Or does it add no value to force ALL domains to be in the whitelist?)

Another developer on the team came up with that other one I posted up top, but I did not understand it like the above one, since mine was created via multiple posts on EE, and I actually understand it (for the most part)

I feel better being more expressive, to make the RegEx more readable. For example, if goto or returnurl is always in a return URL, then it helps me to see it there. Brevity is confusing when reading both hieroglyphics AND RegEx.

Plus, I have never gotten the other guy's to return True, which normally means I am dead in the water. Mine return true, when expected, so I can take baby steps to bring it to the next level of functionality.

I am fine updating my latest RegEx, but it needs to no longer have the requirement that sub-domains be listed on the whitelist.

It seems the following "https?://" needs to be replaced with a wildcard of any number of characters which could make up a sub-domain.


Also, I added "[^.].*"

to prevent a hacker from making my domain into a sub-domain on HIS domain, thusly

mydomain.org.EVILSITE.COM

and having my RegEx think it was a success.
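
A hedged C# sketch of the `[^.]` guard described above, reduced to the host part only. `Build` is a hypothetical helper, not code from this thread; it also runs each whitelist entry through `Regex.Escape`, which is an addition of this sketch so that the dots in the host names match only literal dots.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WhitelistDemo
{
    // Hypothetical helper: builds a host check from a list of allowed hosts.
    public static Regex Build(params string[] hosts)
    {
        // Regex.Escape turns "." into "\." so it only matches a literal dot.
        string alternation = string.Join("|", hosts.Select(h => Regex.Escape(h)));

        // [^.] rejects a dot directly after the host, as in the pattern above.
        return new Regex(@"^https?://(" + alternation + @")[^.]", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Regex r = Build("mydomain.org", "sso.mydomain.org");

        Console.WriteLine(r.IsMatch("https://mydomain.org/home"));          // True
        Console.WriteLine(r.IsMatch("https://mydomain.org.EVILSITE.COM/")); // False

        // [^.] must consume one character, so a bare host does not match.
        Console.WriteLine(r.IsMatch("https://mydomain.org"));               // False

        // Any non-dot character passes the guard, including a hyphen.
        Console.WriteLine(r.IsMatch("https://mydomain.org-evil.com/"));     // True
    }
}
```

The last case suggests that `[^.]` on its own may not be enough: requiring the host to be followed by a delimiter such as `/`, `:`, `?`, `#` or the end of the string would be a stricter check.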
Ben Personick (Previously QCubed), Lead SaaS Infrastructure Engineer
Commented:
How about this:

^(https?:\/\/(.*\.mydomain.org|mydomain.org)[^.\r\n]*\\/?)((goto|returnurl)=\1)?

curiouswebster, Software Engineer (Author)
Commented:
That still returns false...But we are getting snow and I gotta head out until Tuesday AM. I can leave this issue open....

until then.

Cheers.
Ben Personick (Previously QCubed), Lead SaaS Infrastructure Engineer
Commented:
Yeah it won't be the same return URL, so how about this:

^HTTPS?:\/\/(.*\.mydomain\.org|mydomain\.org)[^.\r\n]*(goto|return)=HTTPS?:\/\/(.*\.mydomain\.org|mydomain\.org)[^.\r\n]*

Here it is as C# code:
string pattern = @"^HTTPS?:\/\/(.*\.mydomain\.org|mydomain\.org)[^.\r\n]*(goto|return)=HTTPS?:\/\/(.*\.mydomain\.org|mydomain\.org)[^.\r\n]*";

https://regex101.com/r/PSfEXK/1
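
A hedged C# sketch for trying that pattern out. Two things here are assumptions and not part of the comment above: `RegexOptions.IgnoreCase` is added because the pattern spells HTTPS in upper case, and `HostOnly` is a hypothetical, stricter host check included for comparison. The test URLs are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RedirectCheckDemo
{
    // The pattern as posted, plus IgnoreCase so lower-case "https" matches too.
    static readonly Regex Posted = new Regex(
        @"^HTTPS?:\/\/(.*\.mydomain\.org|mydomain\.org)[^.\r\n]*(goto|return)=HTTPS?:\/\/(.*\.mydomain\.org|mydomain\.org)[^.\r\n]*",
        RegexOptions.IgnoreCase);

    // Hypothetical stricter check: only dot-separated labels may precede the
    // domain, and the host must end right after "mydomain.org" (optional port).
    static readonly Regex HostOnly = new Regex(
        @"^https?://([a-z0-9-]+\.)*mydomain\.org(?=(?::\d+)?(?:[/?#]|$))",
        RegexOptions.IgnoreCase);

    public static bool PostedAccepts(string url) => Posted.IsMatch(url);
    public static bool HostAllowed(string url) => HostOnly.IsMatch(url);

    public static void Main()
    {
        string good = "https://sso.mydomain.org/login?goto=https://mydomain.org/home";
        string odd  = "https://evil.example/x.mydomain.org/login?goto=https://sso.mydomain.org/home";

        Console.WriteLine(PostedAccepts(good)); // True
        // ".*" can run past the host into the path, so this is accepted as well.
        Console.WriteLine(PostedAccepts(odd));  // True

        Console.WriteLine(HostAllowed(good));   // True
        Console.WriteLine(HostAllowed(odd));    // False
        Console.WriteLine(HostAllowed("https://mydomain.org.evilsite.com/")); // False
    }
}
```

The second result suggests that `.*\.mydomain\.org` may be too permissive for a redirect whitelist, since `.` also matches `/`. Restricting the sub-domain part to host-name characters, or parsing the URL with `System.Uri` and comparing its `Host` property, would be safer starting points; both are offered here as options to evaluate, not as a finished check.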
curiouswebster, Software Engineer (Author)
Commented:
thanks
Ben Personick (Previously QCubed), Lead SaaS Infrastructure Engineer
Commented:
Glad to help :)
