Regex help/clarity needed

Richard Davis
Last Modified: 2012-05-07
Hi folks,

I am working on a sign-up page (Perl back-end) that takes a URL as one of the required fields during the registration step. When submitted, the Perl script uses the following regex to determine whether the user submitted their domain only or the full URL with http or https included at the beginning.

my $url_is_good = $vars->{web_site_url} =~ m|https?://[^.]+\..+|;

The problem we're having is that it sometimes will store the submitted URL as http://http://www.domain.com.

I am still trying to wrap my head around regex and could use the eyes of an expert to tell me if the above code contains something that is not correct. The way I am reading it is that it will only check if there is an "https://" and include it, but the rest seems a little confusing because of the [^.] character class part.

Thanks folks.

~A~
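
To make the `[^.]` part less mysterious, here is a minimal sketch that takes the pattern from the question, adds capture groups (an addition for illustration only), and prints what each piece grabbed. The sample inputs are hypothetical.

```perl
use strict;
use warnings;

# Same pattern as in the question, with capture groups added (for
# illustration only) so we can see what each part matched.
my $pattern = qr{
    https?     # "http", optionally followed by "s"
    ://        # the literal "://"
    ([^.]+)    # one or more characters that are not a dot
    \.         # a literal dot
    (.+)       # one or more of any character
}x;

# Hypothetical sample inputs.
for my $input ('http://www.domain.com', 'http://http://www.domain.com', 'www.domain.com') {
    if (my ($before_dot, $after_dot) = $input =~ $pattern) {
        print "$input: [^.]+ matched '$before_dot', .+ matched '$after_dot'\n";
    }
    else {
        print "$input: no match\n";
    }
}
```

Because `[^.]+` accepts anything except a dot, it will happily swallow a second `http://` in front of `www`, so the doubled URL still counts as a match.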

Anchor your http and exclude a couple more characters:

m|^https?://[^.:/]+\..+|

The anchor (caret) means that the pattern can -only- match at the beginning of the string (or at the beginning of each line if the /m modifier is used).  Additionally, a doubled http:// can no longer match if you ban : and / via the negated character class.
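
As a quick check of the difference, the sketch below runs the original pattern and the anchored one against a few hypothetical inputs. The variable names are made up for the example.

```perl
use strict;
use warnings;

my $unanchored = qr{https?://[^.]+\..+};      # pattern from the question
my $anchored   = qr{^https?://[^.:/]+\..+};   # anchored pattern suggested above

# Hypothetical sample inputs.
my @samples = (
    'http://www.domain.com',
    'https://www.domain.com',
    'http://http://www.domain.com',
    'www.domain.com',
);

for my $url (@samples) {
    printf "%-30s unanchored=%-3s anchored=%s\n",
        $url,
        ($url =~ $unanchored ? 'yes' : 'no'),
        ($url =~ $anchored   ? 'yes' : 'no');
}
```

Only the doubled `http://http://` input should differ between the two columns: the original pattern accepts it, the anchored one rejects it.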

Richard Davis (Senior Web Developer, Author) commented:
Thanks Fairlight2cx,

The logic is such that we want to determine if the user did not supply it and if they did, then slate the $url_is_good var as not good basically.

Does what you just said accomplish that? Also, what substitution would I use to put it into a string if http:// or https:// were omitted at submission time?

ozo commented:
How would one know whether it was  http:// that was omitted, or https:// that was omitted?

Richard Davis (Author) commented:
Ozo, because the https? part of the regex states that http has to be there and the s? is only detected if it's present.

~A~

ozo commented:
If http:// or https:// were omitted at submission time, then neither would be present.
Or do I misunderstand what you are asking with "what substitution would I use to put it into a string"?

Richard Davis (Author) commented:
Okay, basically the desired result is that I need to read a variable's value. The value will either be just www.domain.tld or begin with http:// or https:// followed by the domain (e.g. "http://www.domain.tld").

If the string is missing the http:// or the https://, then I need to prepend it so that the final string will, at the very least, have "http://" preceding the domain.

Hope that helped clarify it better, Ozo.

~A~
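
One straightforward way to get that behaviour is to test for the scheme and prepend only when it is absent. The following is a minimal sketch, not code from the original script; `add_scheme` is a hypothetical helper name, and defaulting to plain `http://` is an assumption taken from the description above.

```perl
use strict;
use warnings;

# add_scheme() is a hypothetical helper name, not from the original script.
sub add_scheme {
    my ($url) = @_;
    # Leave the value alone if it already starts with http:// or https://
    # (case-insensitive); otherwise default to plain http://.
    return $url =~ m{^https?://}i ? $url : "http://$url";
}

# Hypothetical sample inputs.
print add_scheme($_) . "\n" for qw(
    www.domain.tld
    http://www.domain.tld
    https://www.domain.tld
);
```

Because the test is anchored with `^`, a value that already carries a scheme is returned unchanged, so running it twice cannot produce `http://http://`.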

ozo commented:
[Solution text not available in this copy of the thread.]

Richard Davis (Author) commented:
Okay. A few questions just so I understand your thinking here, if you don't mind.

1) What is the first ? doing since the only thing preceding it is the opening parenthesis?
2) The 3rd ? is allowing for 0 or more occurrences of the contents of the parentheses, yes?
3) The $1 is confusing me. I know it's meant as a placeholder var for data, but its position between the http and the :// is what's throwing me.

Thanks, Ozo.

~A~
The ?: is notation that indicates that the parentheses should not create a back-reference like $1, $2, etc.  In other words, use the parentheses for grouping but don't waste resources creating backrefs you won't use.

The ()'s he has around the 's' and before the '?' are creating a backref to the 's' in https, if one is there.  If not, $1 is left undefined, which interpolates as an empty string.

Almost: the third '?' matches zero or one of the preceding group, not zero or more (that would be '*').

The $1 is a back-reference to that 's' paren match.  If one was provided, it will be tacked in; if not, it will be empty.

man perlre

Look up back-references for the full mechanics.
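
ozo's substitution itself is not shown in the thread. Judging from the pattern ozo later feeds to YAPE::Regex::Explain and from the questions about $1, it was probably close to the sketch below; treat this as a reconstruction under that assumption, not as ozo's actual code. The /e modifier and the `// ''` fallback (Perl 5.10 or later) are additions here, so that an absent 's' does not trigger an "uninitialized value" warning under `use warnings`.

```perl
use strict;
use warnings;

# Hypothetical sample inputs.
for my $url ('www.domain.tld', 'http://www.domain.tld', 'https://www.domain.tld') {
    # Match an optional leading http:// or https://, capturing only the 's'.
    # Replace whatever matched (possibly nothing) with http + captured 's' + ://
    (my $fixed = $url) =~ s{^(?:http(s)?://)?}{ 'http' . ($1 // '') . '://' }e;
    print "$url becomes $fixed\n";
}
```

When no scheme is present the outer group matches the empty string at the start, so `http://` is simply inserted; when a scheme is present it is rewritten to itself, keeping the 's' if there was one.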

Richard Davis (Author) commented:
Fairlight2cx,

Thank you for that very concise explanation. I am attempting to make as much sense as I can of regex since my new job seems to be seated very firmly in an all Perl environment.

That was very helpful.

So, would Ozo's regex solution then provide me the condition insertion/substitution in the event that the http:// or https:// were absent from the string being tested?

~A~

ozo commented:
See perldoc perlre and the "Regexp Quote-Like Operators" section of perldoc perlop.

ozo commented:
perl -MYAPE::Regex::Explain -e "print YAPE::Regex::Explain->new(qr#^(?:http(s)?://)?#)->explain"
The regular expression:

(?-imsx:^(?:http(s)?://)?)

matches as follows:
 
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    http                     'http'
----------------------------------------------------------------------
    (                        group and capture to \1 (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      s                        's'
----------------------------------------------------------------------
    )?                       end of \1 (NOTE: because you're using a
                             quantifier on this capture, only the
                             LAST repetition of the captured pattern
                             will be stored in \1)
----------------------------------------------------------------------
    ://                      '://'
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
Neat as hell!  I didn't know there was a module that could do an explanation like that.  Pretty sweet side-benefit I got out of this discussion.  Thanks, Ozo!

Richard Davis (Author) commented:
Wow!...I think between the two of you, this had to be one of the MOST comprehensive answers I have ever had on EE yet.

Thank you both for an impeccable job. It's a pity I can't award 500 points to each of you as you both most definitely deserve it.

Many many thanks to you both and kudos on the Regex clarity!

~A~

Richard Davis (Author) commented:
This was a superb example of why EE is so successful! Outstanding job, to the both of you! :)