Richard Davis

Regex help/clarity needed

Hi folks,

I am working on a sign-up page (Perl back end) that takes a URL as one of the required fields during registration. When the form is submitted, the Perl script uses the following regex to determine whether the user submitted only their domain or the full URL with http or https at the beginning.

my $url_is_good = $vars->{web_site_url} =~ m|https?://[^.]+\..+|;

The problem we're having is that it sometimes will store the submitted URL as http://http://www.domain.com.

I am still trying to wrap my head around regex and could use the eyes of an expert to tell me if the above code contains something that is not correct. The way I am reading it is that it will only check if there is an "https://" and include it, but the rest seems a little confusing because of the [^.] character class part.

Thanks folks.

~A~
Fairlight2cx

Anchor your http and exclude a couple more characters:

m|^https?://[^.:/]+\..+|

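The anchor (caret) means that the pattern can -only- match at the beginning of the string. Additionally, a doubled http:// can't get through if you ban : and / via the negated character class, because the text between the :// and the first dot may no longer contain either character.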
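To see what the anchor and the wider negated class change, here is a small sketch that runs both patterns over a few inputs. The sample URLs and the helper names matches_original and matches_anchored are made up for illustration; they are not part of the asker's script.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two patterns from this thread.
sub matches_original { return $_[0] =~ m|https?://[^.]+\..+|    ? 1 : 0 }
sub matches_anchored { return $_[0] =~ m|^https?://[^.:/]+\..+| ? 1 : 0 }

# Sample inputs, invented for illustration.
for my $url ('http://www.domain.com',
             'www.domain.com',
             'http://http://www.domain.com',
             'see http://www.domain.com') {
    printf "%-30s original=%d anchored=%d\n",
        $url, matches_original($url), matches_anchored($url);
}
```

The unanchored original accepts the doubled prefix and the value with leading text; the anchored version rejects both, so neither would be treated as a good URL.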
Richard Davis

ASKER

Thanks Fairlight2cx,

The logic is that we want to detect whether the user left off the http:// or https://, and if they did leave it off, mark the $url_is_good var as not good, basically.

Does what you just said accomplish that? Also, what substitution would I use to put it into a string if http:// or https:// were omitted at submission time?
How would one know whether it was http:// that was omitted, or https:// that was omitted?
Ozo, because the https? part of the regex states that http has to be there and the s? is only detected if it's present.

~A~
If http:// or https:// were omitted at submission time, then neither would be present.
Or do I misunderstand what you are asking about which substitution to use to put it into a string?
Okay, basically the desired result is that I need to read a variable's value. The value will either be just www.domain.tld or begin with http:// or https:// followed by the domain (e.g. "http://www.domain.tld").

If the string is missing the http:// or the https://, then I need to prepend it so that the final string will, at the very least, have "http://" preceding the domain.

Hope that helped clarify it better, Ozo.

~A~
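The requirement described above can be sketched in a few lines. This is only one way to do it, under stated assumptions: the helper name ensure_scheme is hypothetical, the /i flag (accepting HTTP:// as well) is an addition not discussed in the thread, and plain http:// is used as the default because the intended scheme cannot be known once it has been omitted.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the URL with a scheme guaranteed in front.
sub ensure_scheme {
    my ($url) = @_;
    # Leave the value alone if it already starts with http:// or https://.
    return $url if $url =~ m|^https?://|i;
    # Otherwise default to plain http://, since the intended scheme is unknown.
    return "http://$url";
}

print ensure_scheme('www.domain.tld'), "\n";
print ensure_scheme('https://www.domain.tld'), "\n";
```

Because the check is anchored with ^, a value that already carries a scheme is never prefixed a second time.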
ASKER CERTIFIED SOLUTION
ozo

[Solution text is members-only and is not reproduced here.]
Okay. A few questions just so I understand your thinking here, if you don't mind.

1) What is the first ? doing, since the only thing preceding it is the opening parenthesis?
2) The 3rd ? is allowing for 0 or more occurrences of the contents of the parentheses, yes?
3) The $1 is confusing me. I know it's meant as a placeholder var for data, but its position between the http and the :// is what's throwing me.

Thanks, Ozo.

~A~
The ?: is notation that indicates that the parentheses should not create a back-reference like $1, $2, etc.  In other words, use the "or" grouping but don't waste resources creating backrefs you won't use.

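The ()'s he has around the 's' and before the '?' are creating a backref to the 's' in https, if one is there. If there is no 's', $1 is left undefined, which interpolates as an empty string.

Not quite: the third '?' matches zero or one occurrence of the preceding group, not zero or more (that would be '*').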

The $1 is a back-reference to that 's' paren match.  If one was provided, it will be tacked in; if not, it will be empty.

man perlre

Look up back-references for the full mechanics.
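Here is a rough sketch of how the grouping and $1 behave in a substitution built around the pattern ^(?:http(s)?://)? discussed in this thread. The members-only solution is not visible here, so this is not necessarily what was posted; the helper name normalize_scheme and the /e replacement with a defined check are assumptions, added so the code stays quiet under "use warnings" when $1 is undefined.

```perl
use strict;
use warnings;

# Hypothetical helper built around the pattern ^(?:http(s)?://)?
sub normalize_scheme {
    my ($url) = @_;
    # (?:...)? groups the whole optional prefix without capturing it.
    # (s)? captures the optional 's'; $1 is undef when there was no 's'
    # or no prefix at all, so the /e replacement falls back to ''.
    $url =~ s{^(?:http(s)?://)?}{ 'http' . (defined $1 ? $1 : '') . '://' }e;
    return $url;
}

print normalize_scheme($_), "\n"
    for 'www.domain.tld', 'http://www.domain.tld', 'https://www.domain.tld';
```

Since every part of the pattern after ^ is optional, it always matches at the start of the string: an existing prefix is rewritten to itself, and a missing one is replaced by http://.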
Fairlight2cx,

Thank you for that very concise explanation. I am attempting to make as much sense as I can of regex since my new job seems to be seated very firmly in an all Perl environment.

That was very helpful.

So, would Ozo's regex solution then give me the conditional insertion/substitution in the event that the http:// or https:// is absent from the string being tested?

~A~
SOLUTION
[Solution text is members-only and is not reproduced here.]
See
perldoc perlre
and the "Regexp Quote-Like Operators" section of
perldoc perlop
perl -MYAPE::Regex::Explain -e "print YAPE::Regex::Explain->new(qr#^(?:http(s)?://)?#)->explain"
The regular expression:

(?-imsx:^(?:http(s)?://)?)

matches as follows:
 
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    http                     'http'
----------------------------------------------------------------------
    (                        group and capture to \1 (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      s                        's'
----------------------------------------------------------------------
    )?                       end of \1 (NOTE: because you're using a
                             quantifier on this capture, only the
                             LAST repetition of the captured pattern
                             will be stored in \1)
----------------------------------------------------------------------
    ://                      '://'
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
Neat as hell!  I didn't know there was a module that could do an explanation like that.  Pretty sweet side-benefit I got out of this discussion.  Thanks, Ozo!
Wow!...I think between the two of you, this had to be one of the MOST comprehensive answers I have ever had on EE yet.

Thank you both for an impeccable job. It's a pity I can't award 500 points to each of you as you both most definitely deserve it.

Many many thanks to you both and kudos on the Regex clarity!

~A~
This was a superb example of why EE is so successful! Outstanding job, to the both of you! :)