Questions about a RegEx used to analyze URLs

Question about a RegEx:

 @"[&|?](" + "myDomain.com" + ")=(.*?[^&]+)?";

what do these require or prevent before the domain?

[&|?]

and what does this require or prevent after the domain?

(.*?[^&]+)?

Thanks.
newbieweb (Sr. Software Engineer) asked:

Dr. Klahn (Principal Software Engineer) commented:
[&|?] is an unusual regex for parsing URLs.  What it matches is "Any single character from the three-character set &|?"

http://regexstorm.net/tester
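
A minimal C# sketch of the point above: inside square brackets, | and ? are ordinary characters, so the class matches exactly one of &, | or ?. The class and method names are made up for illustration, and the ^...$ anchors are added only so each test string has to match as a whole.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // Inside [...], '|' is not alternation and '?' is not a quantifier.
    // The anchors are an addition so the whole input must be one character from the set.
    public static bool MatchesSet(string input) => Regex.IsMatch(input, "^[&|?]$");

    public static void Main()
    {
        foreach (string s in new[] { "&", "|", "?", "&?", "x" })
        {
            // Prints True for "&", "|" and "?", and False for "&?" and "x".
            Console.WriteLine("'{0}' -> {1}", s, MatchesSet(s));
        }
    }
}
```

If the intent was "& or ?" only, [&?] would be the narrower class.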

newbieweb (Sr. Software Engineer, Author) commented:
Ah, I was tied up thinking it was

& OR ?

The person who wrote this was focused on the query string parameters.

I have seen & AND ? with query string params, but do not recall seeing the | sign being used with query string params.

newbieweb (Sr. Software Engineer, Author) commented:
then, comes a capture set:

(.*?[^&]+)?

It looks like any number of characters, NOT containing a &

am I seeing that right?

And what does the trailing ? mean?

and the ? after .* means "lazy" but I am not sure what that means.

Ben Personick (Previously QCubed), Lead Network Engineer, commented:
 @"[&|?](" + "myDomain.com" + ")=(.*?[^&]+)?";


essentially becomes this regex:

@[&|?](myDomain.com)=(.*?[^&]+)?


this means it matches:
First: "@"
Then: "&", or "|", or "?"
Then: "myDomain.com" ( and Captured as Group1; the "." is unescaped, so it matches any character, not only a literal dot )
Then:"="
Lastly: The following set is selected entirely once or zero times ( and Captured as Group 2 )
Part 1: Any character except a newline (in .NET, "." does match a carriage return), any number of times, matching as few times as possible (can't see how this would be useful on an .* to have the ?)
Part 2: Any Character except "&" at least once up to an unlimited number of times (Will match Newline/carriage return!)


So here are some example matches and their capture groups:
String: "@&myDomain.com=hello&then"
Matched: "@&myDomain.com=hello"
Group1: "myDomain.com"
Group2: "hello"


String: "@?myDomain.com=There&now"
Matched: "@?myDomain.com=There"
Group1: "myDomain.com"
Group2: "There"


String: "@&myDomain.com=There

is more match here

also more match here

oops&now this is unmatched"
Matched: "@&myDomain.com=There

is more match here

also more match here

oops"
Group1: "myDomain.com"
Group2: "There

is more match here

also more match here

oops"


Here is a great utility I recently found to help with complex regexes if you get stuck. I have saved the example regexes from this post at the link below, which shows additional information and what the matches are :)

https://regex101.com/r/3nQKTA/1


Here is what it looks like on that website. Thanks to the syntax highlighting and the breakdown of every part of the expression, it gives some better explanations than I initially put forward, so take a look.

[Screenshots: 2018-02-22-15_26_28-Clipboard.png, 2018-02-22-15_17_11-Online-regex-tes.png]

Ben Personick (Previously QCubed), Lead Network Engineer, commented:
Oh neat!

So I didn't realize that the @ is actually part of the C# string literal (it marks a verbatim string) rather than part of the pattern. The site I use lets you re-create the example regex matching in a set of languages, including C#, and when I did that I saw the @ is part of how C# handles the regex string.

So it only changes my answer slightly, as @ is NOT matched.
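
A small C# sketch of that correction, offered as an illustration rather than anything from the original code base: the @ prefix makes a verbatim string literal, so backslashes need no doubling and the @ itself never reaches the regex engine. The class name and the sample input "page?myDomain.com=abc&x=1" are invented for the demo.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimStringDemo
{
    // Same pattern as in the question. The leading @ is C# syntax, not pattern text.
    public const string Pattern = @"[&|?](myDomain.com)=(.*?[^&]+)?";

    public static void Main()
    {
        // A verbatim literal and a regular literal with doubled backslashes are the same string.
        Console.WriteLine(@"\d+" == "\\d+");   // True

        // The first character of the pattern is '[', not '@'.
        Console.WriteLine(Pattern[0]);          // [

        // No '@' is needed in the input for the pattern to match.
        Match m = Regex.Match("page?myDomain.com=abc&x=1", Pattern);
        Console.WriteLine(m.Value);             // ?myDomain.com=abc
    }
}
```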

The Website Link still works as I updated the regex.  https://regex101.com/r/3nQKTA/4

Here is the example code the site created for C#, so you can see the example test strings I had used:

using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        string pattern = @"[&|?](myDomain.com)=(.*?[^&]+)?";
        string input = @"
something|myDomain.com=&&&&&&& &

@&myDomain.com=hello&then

@?myDomain.com=There&now

@&myDomain.com=There

is more match here

also more match here

oops&now this is unmatched

";
        
        foreach (Match m in Regex.Matches(input, pattern))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}


Updated Regex as shown on the regex101 website
Also, I forgot to mention that I realized why the .*? is there: it lets Group 2 capture "&" characters that come immediately after the = sign. Because it is lazy, the .*? consumes nothing unless the value starts with "&", and then only that leading run of ampersands. As soon as any other character appears, [^&]+ takes over and the match stops at the next ampersand.
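
To see that behaviour in isolation, here is a short C# sketch that prints only Group 2 for a few inputs. The helper name Group2 and the third input are made up for the demo; the pattern is the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyPrefixDemo
{
    public const string Pattern = @"[&|?](myDomain.com)=(.*?[^&]+)?";

    // Returns the text captured by Group 2, or an empty string when it did not participate.
    public static string Group2(string input) => Regex.Match(input, Pattern).Groups[2].Value;

    public static void Main()
    {
        // Value does not start with '&': the lazy .*? consumes nothing.
        Console.WriteLine("[{0}]", Group2("?myDomain.com=hello&then"));          // [hello]

        // Value starts with a run of '&': .*? consumes the run, [^&]+ takes the space.
        Console.WriteLine("[{0}]", Group2("something|myDomain.com=&&&&&&& &"));  // [&&&&&&& ]

        // Empty value followed by another parameter: the capture runs into that parameter.
        Console.WriteLine("[{0}]", Group2("?myDomain.com=&next=1"));             // [&next=1]
    }
}
```

The last case may or may not be what the original author wanted; it is worth checking against the real query strings before relying on Group 2.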

newbieweb (Sr. Software Engineer, Author) commented:
Awesome tool! I am still trying to get my arms around it, but this is THE BEST RegEx site I have seen!

What's the best Flavor for me to use, given my target platform is C#?

Ben Personick (Previously QCubed), Lead Network Engineer, commented:
Right?  I stumbled on this site a couple months ago from someone else's Q on Regex somewhere and just remembered it after trying to parse your regex in my head.

Now..I know your Q was "What does this Regex Do"

but, perhaps a better Question to answer would be for me to ask you "What do you WANT the regex to match?"

newbieweb (Sr. Software Engineer, Author) commented:
Well, funny you should ask.

I had this super RegEx working to enforce the domain was on a white list:

            string testRegEx = @"^https?:\/\/(" + whitelistedRedirects + ")[^.].*\\/?((goto|returnurl)=https?:\/\/(" + whitelistedRedirects + ")[:|\\/].*)?";

but it enforced that sub-domains must also be white-listed. The whitelist was to look thusly:

whitelistedRedirects = "mydomain.org|sso.mydomain.org";

But I wanted to have a version that mandated only that "mydomain.org" was in the whitelist, when it was part of the ReturnURL. (is this risky? Or does it add no value to force ALL domains to be in the whitelist?)

Another developer on the team came up with that other one I posted up top, but I did not understand it like the above one, since mine was created via multiple posts on EE, and I actually understand it (for the most part)

I feel better being more expressive, to make the RegEx more readable. For example, if goto or returnurl is always in a return URL, then it helps me to see it there. Brevity is confusing when reading both hieroglyphics AND RegEx.

Plus, I have never gotten the other guy's to return True, which normally means I am dead in the water. Mine returns true when expected, so I can take baby steps to bring it to the next level of functionality.

I am fine updating my latest RegEx, but it needs to no longer have the requirement that sub-domains be listed on the whitelist.

It seems the following "https?://" needs to be replaced with a wildcard of any number of characters which could make up a sub-domain.


Also, I added "[^.].*"

to prevent a hacker from making my domain into a sub-domain on HIS domain, thusly

mydomain.org.EVILSITE.COM

and having my RegEx think it was a success.
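
The "[^.]" guard described above can be tried out with a simplified C# sketch. This is only the front part of the pattern (scheme, whitelisted host, one following character), with the dots in the whitelist escaped; the class name, the method names and the stricter look-ahead variant are assumptions added for comparison and do not come from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HostGuardDemo
{
    // Hypothetical whitelist in the same shape as the post, with the dots escaped.
    private const string Whitelist = @"mydomain\.org|sso\.mydomain\.org";

    // Simplified guard: the character right after the host must not be a dot.
    public static bool PassesDotGuard(string url) =>
        Regex.IsMatch(url, @"^https?://(" + Whitelist + @")[^.]", RegexOptions.IgnoreCase);

    // Stricter variant (an assumption): the host must be followed by ':', '/', '?', '#' or the end of input.
    public static bool PassesBoundaryGuard(string url) =>
        Regex.IsMatch(url, @"^https?://(" + Whitelist + @")(?=[:/?#]|$)", RegexOptions.IgnoreCase);

    public static void Main()
    {
        string[] samples =
        {
            "https://mydomain.org/home",
            "https://mydomain.org.EVILSITE.COM/",
            "https://mydomain.orgEVIL.COM/",
            "https://mydomain.org"
        };
        foreach (string url in samples)
        {
            Console.WriteLine("{0}: dot guard = {1}, boundary guard = {2}",
                url, PassesDotGuard(url), PassesBoundaryGuard(url));
        }
    }
}
```

In this sketch the dot guard rejects mydomain.org.EVILSITE.COM, but it accepts mydomain.orgEVIL.COM (the "E" is not a dot) and rejects a bare https://mydomain.org with nothing after the host. The look-ahead variant handles both of those cases.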

Ben Personick (Previously QCubed), Lead Network Engineer, commented:
How about this:

^(https?:\/\/(.*\.mydomain.org|mydomain.org)[^.\r\n]*\\/?)((goto|returnurl)=\1)?


newbieweb (Sr. Software Engineer, Author) commented:
That still returns false...But we are getting snow and I gotta head out until Tuesday AM. I can leave this issue open....

until then.

Cheers.

Ben Personick (Previously QCubed), Lead Network Engineer, commented:
Yeah it won't be the same return URL, so how about this:

^HTTPS?:\/\/(.*\.mydomain\.org|mydomain\.org)[^.\r\n]*(goto|return)=HTTPS?:\/\/(.*\.mydomain\.org|mydomain\.org)[^.\r\n]*


here as C# Code:
string pattern = @"^HTTPS?:\/\/(.*\.mydomain\.org|mydomain\.org)[^.\r\n]*(goto|return)=HTTPS?:\/\/(.*\.mydomain\.org|mydomain\.org)[^.\r\n]*";


https://regex101.com/r/PSfEXK/1
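
A hedged C# sketch for trying that pattern against a few URLs. RegexOptions.IgnoreCase is an assumption here, needed because the pattern spells HTTPS in upper case; the class name, method name and sample URLs are invented for the demo.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RedirectPatternDemo
{
    // Pattern copied from the post above; IgnoreCase is an assumption so lower-case "https" matches.
    private static readonly Regex Redirect = new Regex(
        @"^HTTPS?:\/\/(.*\.mydomain\.org|mydomain\.org)[^.\r\n]*(goto|return)=HTTPS?:\/\/(.*\.mydomain\.org|mydomain\.org)[^.\r\n]*",
        RegexOptions.IgnoreCase);

    public static bool IsAllowed(string url) => Redirect.IsMatch(url);

    public static void Main()
    {
        string[] samples =
        {
            // Sub-domains on both sides, parameter named "return": matches.
            "https://sso.mydomain.org/login?return=https://www.mydomain.org/home",
            // Whitelisted name used as a sub-domain of another site: no match.
            "https://mydomain.org.evilsite.com/?return=https://mydomain.org/",
            // Parameter named "returnurl": no match, because "return" must be followed directly by "=".
            "https://sso.mydomain.org/login?returnurl=https://www.mydomain.org/home",
            // ".mydomain.org" appearing in the path of another host: matches, since .* can cross "/".
            "https://evil.com/x.mydomain.org/?return=https://mydomain.org/"
        };
        foreach (string url in samples)
        {
            Console.WriteLine("{0}: {1}", IsAllowed(url), url);
        }
    }
}
```

The last two samples suggest the pattern may need tightening before it is used as a security check: it does not accept returnurl=, and the leading .* is not limited to the host part of the URL.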

newbieweb (Sr. Software Engineer, Author) commented:
thanks

Ben Personick (Previously QCubed), Lead Network Engineer, commented:
Glad to help :)