Regex help/clarity needed

Hi folks,

I am working on a sign-up page (Perl back end) that takes a URL as one of the required fields during the registration step. When the form is submitted, the Perl script uses the following regex to determine whether the user submitted only their domain or the full URL with http or https included at the beginning.

my $url_is_good = $vars->{web_site_url} =~ m|https?://[^.]+\..+|;

The problem we're having is that it sometimes will store the submitted URL as http://http://www.domain.com.

I am still trying to wrap my head around regex and could use the eyes of an expert to tell me if the above code contains something that is not correct. The way I am reading it is that it will only check if there is an "https://" and include it, but the rest seems a little confusing because of the [^.] character class part.

Thanks folks.

~A~
Richard Davis, Senior Web Developer, Asked:

Fairlight2cx Commented:
Anchor your http and exclude a couple more characters:

m|^https?://[^.:/]+\..+|

The anchor (caret) means that the pattern can -only- match at the beginning of the line.  Additionally, they can't double-iterate http:// if you ban : and / via the negation set.
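To see the difference between the two patterns, here is a minimal sketch that runs both against a few inputs. The sample URLs and the variable names `$original` and `$anchored` are made up for illustration; only the two patterns come from this thread.

```perl
use strict;
use warnings;

# The unanchored pattern from the question and the anchored one suggested above.
my $original = qr|https?://[^.]+\..+|;
my $anchored = qr|^https?://[^.:/]+\..+|;

# Illustrative inputs only; none of these come from real data.
my @samples = (
    'https://www.domain.com',
    'www.domain.com',
    'http://http://www.domain.com',
);

for my $url (@samples) {
    printf "%-30s original=%-8s anchored=%s\n",
        $url,
        ($url =~ $original ? 'match' : 'no match'),
        ($url =~ $anchored ? 'match' : 'no match');
}
```

The doubled `http://http://www.domain.com` passes the original pattern because `[^.]+` happily consumes `http://www`. The anchored pattern rejects it, since `:` and `/` are no longer allowed before the first dot.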
Richard Davis, Senior Web Developer (Author) Commented:
Thanks Fairlight2cx,

The logic is such that we want to determine whether the user supplied the http:// or https:// prefix, and if they did, then flag the $url_is_good var as not good, basically.

Does what you just said accomplish that? Also, what substitution would I use to put it into a string if http:// or https:// were omitted at submission time?
ozo Commented:
How would one know whether it was  http:// that was omitted, or https:// that was omitted?
Richard Davis, Senior Web Developer (Author) Commented:
Ozo, because the https? part of the regex states that http has to be there and the s? is only detected if it's present.

~A~
ozo Commented:
If http:// or https:// were omitted at submission time, then neither would be present.
Or do I misunderstand what you are asking with "what substitution would I use to put it into a string"?
Richard Davis, Senior Web Developer (Author) Commented:
Okay, basically the desired result is that I need to read a variable's value. The value will either be just www.domain.tld or begin with http:// or https:// followed by the domain (e.g. "http://www.domain.tld").

If the string is missing the http:// or the https:// then I need to prepend it to the string so that the final string will, at the very least, have "http://" preceding the domain string.

Hope that helped clarify it better, Ozo.

~A~
ozo Commented:
# So if they are missing, assume it was http://?
$string =~ s#^(?:http(s)?://)?#http$1://#;

Richard Davis, Senior Web Developer (Author) Commented:
Okay. A few questions just so I understand your thinking here, if you don't mind.

1) What is the first ? doing since the only thing preceding it is the opening parentheses?
2) The 3rd ? is allowing for 0 or more occurrences of the contents of the parentheses, yes?
3) The $1 is confusing me. I know it's meant as a placeholder var for data, but its placement between the http and the :// is what's throwing me.

Thanks, Ozo.

~A~
Fairlight2cx Commented:
The ?: is notation that indicates that the parentheses should not create a back-reference like $1, $2, etc.  In other words, use the "or" grouping but don't waste resources creating backrefs you won't use.

The ()'s he has around the 's' and before the '?' are creating a backref to the 's' in https, if one is there, or an empty string if not.

Almost: the third '?' matches zero or one (not zero or more) of the preceding group.

The $1 is a back-reference to that 's' paren match.  If one was provided, it will be tacked in; if not, it will be empty.

man perlre

Look up back-references for the full mechanics.
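A small sketch of how the capturing and non-capturing groups behave may help here. The helper name `captured_s` and the sample URLs are hypothetical; the pattern is the match half of Ozo's substitution. One detail worth noting: when the `(s)?` group takes no part in the match, Perl leaves `$1` undefined rather than setting it to an empty string, and it merely interpolates as empty.

```perl
use strict;
use warnings;

# Hypothetical helper: reports what the (s)? group captured.
# The outer (?: ... ) group is non-capturing, so (s) is group number 1.
sub captured_s {
    my ($url) = @_;
    return 'no match' unless $url =~ m{^(?:http(s)?://)};
    # $1 is undef when the optional group did not take part in the match.
    return defined $1 ? $1 : 'undef';
}

print captured_s('https://www.domain.tld'), "\n";   # s
print captured_s('http://www.domain.tld'),  "\n";   # undef
print captured_s('www.domain.tld'),         "\n";   # no match
```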
Richard Davis, Senior Web Developer (Author) Commented:
Fairlight2cx,

Thank you for that very concise explanation. I am attempting to make as much sense as I can of regex since my new job seems to be seated very firmly in an all Perl environment.

That was very helpful.

So, would Ozo's regex solution then provide me the conditional insertion/substitution in the event that the http:// or https:// were absent from the string being tested?

~A~
Fairlight2cx Commented:
Yes, it would.  Let me break it down for you atomically, so you see and understand fully what's going on:

s#^(?:http(s)?://)?#http$1://#

s#
Start the substitution's matching section

^
Anchor the pattern matching to the beginning of the string.

(?:
Start a grouped expression, but create -no- back-references.  This is both for efficiency and to avoid needlessly confusing oneself by creating back-references you won't need alongside the ones you will need.

http
Match that literally.

(s)?
Match zero or one instances of 's'.  Create a back-reference in $1 (since it's our first paren set that doesn't say "don't make a back-reference") that we can use later, containing the 's' or not, depending on its presence.

://
Match that literally.

)?
Close the grouped expression, and the entire grouped expression may be present zero or one times--only at the beginning of the string (from the ^ earlier).

#
Start the substitution section.

http
Literally put 'http'  there.

$1
Put the 's' there if it was found and matched, otherwise put nothing there.

://
Literally put '://' there.

#
End the substitution.

Since the entire group in the pattern is a zero-or-one match, the scheme can be present or absent.  If it's there, it is preserved accurately, including the distinction between http and https.  If it isn't there at all, a substitution is still made: the ^ anchor matches at the beginning of the string, the whole next segment is optional, and so the pattern matches the empty position at the start of the string and 'http://' is inserted there.  So if neither 'http://' nor 'https://' was present, 'http://' will be provided, because the back-reference $1 is never populated (the group it belongs to took no part in the match) and so contributes nothing.

The short answer is, "Yes, Ozo's expression will work if you're defaulting to http rather than https if neither is specified."
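As a rough sketch, the substitution can be wrapped in a small function and tried on the three cases discussed in this thread. The name `add_scheme` is made up for illustration. This variant uses the `/e` flag with `$1 // ''`, which is my own adjustment and not part of Ozo's original: under `use warnings`, I would expect the plain `http$1://` replacement to emit an uninitialized-value warning whenever the `s` is absent. It assumes Perl 5.10 or later for the `//` operator.

```perl
use strict;
use warnings;

# Hypothetical helper: prepend http:// when no scheme is present and
# leave an existing http:// or https:// untouched.
sub add_scheme {
    my ($url) = @_;
    # Same pattern as above. The replacement is evaluated as code (/e) so
    # that an undefined $1 can be turned into '' explicitly.
    $url =~ s{^(?:http(s)?://)?}{'http' . ($1 // '') . '://'}e;
    return $url;
}

print add_scheme('www.domain.tld'),         "\n";   # http://www.domain.tld
print add_scheme('http://www.domain.tld'),  "\n";   # http://www.domain.tld
print add_scheme('https://www.domain.tld'), "\n";   # https://www.domain.tld
```

Note that this only normalizes a missing scheme. A value that already arrives as `http://http://www.domain.tld` is left unchanged, so it still makes sense to validate the input separately.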

Hope this helps.
ozo Commented:
see
perldoc perlre
and the   Regexp Quote-Like Operators  section of
perldoc perlop
ozo Commented:
perl -MYAPE::Regex::Explain -e "print YAPE::Regex::Explain->new(qr#^(?:http(s)?://)?#)->explain"
The regular expression:

(?-imsx:^(?:http(s)?://)?)

matches as follows:
 
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    http                     'http'
----------------------------------------------------------------------
    (                        group and capture to \1 (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      s                        's'
----------------------------------------------------------------------
    )?                       end of \1 (NOTE: because you're using a
                             quantifier on this capture, only the
                             LAST repetition of the captured pattern
                             will be stored in \1)
----------------------------------------------------------------------
    ://                      '://'
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
Fairlight2cx Commented:
Neat as hell!  I didn't know there was a module that could do an explanation like that.  Pretty sweet side-benefit I got out of this discussion.  Thanks, Ozo!
Richard Davis, Senior Web Developer (Author) Commented:
Wow!...I think between the two of you, this had to be one of the MOST comprehensive answers I have ever had on EE yet.

Thank you both for an impeccable job. It's a pity I can't award 500 points to each of you as you both most definitely deserve it.

Many many thanks to you both and kudos on the Regex clarity!

~A~
Richard Davis, Senior Web Developer (Author) Commented:
This was a superb example of why EE is so successful! Outstanding job, to the both of you! :)