Solved

Regex help/clarity needed

Posted on 2009-07-03
Last Modified: 2012-05-07
Hi folks,

I am working on a sign-up page (Perl back-end) that takes a URL as one of the required fields during the registration step. When the form is submitted, the Perl script uses the following regex to determine whether the user submitted only their domain or the full URL with http or https included at the beginning.

my $url_is_good = $vars->{web_site_url} =~ m|https?://[^.]+\..+|;

The problem we're having is that it sometimes will store the submitted URL as http://http://www.domain.com.

I am still trying to wrap my head around regex and could use the eyes of an expert to tell me if the above code contains something that is not correct. The way I am reading it is that it will only check if there is an "https://" and include it, but the rest seems a little confusing because of the [^.] character class part.

Thanks folks.

~A~
Question by:adrian_brooks
 
LVL 7

Expert Comment

by:Fairlight2cx
Anchor your http and exclude a couple more characters:

m|^https?://[^.:/]+\..+|

The anchor (caret) means that the pattern can -only- match at the beginning of the string.  Additionally, users can't double up http:// if you ban : and / via the negated character class.
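
To make the difference concrete, here is a minimal sketch that runs the original pattern and the anchored pattern against a few inputs. The sample URLs and the variable names `$loose` and `$strict` are made up for illustration; they are not from the original script.

```perl
use strict;
use warnings;

# Original pattern from the question (unanchored) and the anchored,
# stricter pattern suggested above.
my $loose  = qr|https?://[^.]+\..+|;
my $strict = qr|^https?://[^.:/]+\..+|;

# Illustrative inputs only.
my @samples = (
    'http://www.domain.com',
    'https://www.domain.com',
    'http://http://www.domain.com',
    'www.domain.com',
);

for my $url (@samples) {
    printf "%-30s loose=%-8s strict=%s\n",
        $url,
        ($url =~ $loose  ? 'match' : 'no match'),
        ($url =~ $strict ? 'match' : 'no match');
}
```

The doubled `http://http://www.domain.com` passes the original pattern because `[^.]+` happily consumes the second `http://www`, while the anchored pattern rejects it since `:` and `/` are no longer allowed before the first dot.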
 
LVL 12

Author Comment

by:adrian_brooks
Thanks Fairlight2cx,

The logic is such that we want to determine if the user did not supply it and if they did, then slate the $url_is_good var as not good basically.

Does what you just said accomplish that? Also, what substitution would I use to put it into a string if http:// or https:// were omitted at submission time?
 
LVL 84

Expert Comment

by:ozo
How would one know whether it was  http:// that was omitted, or https:// that was omitted?
 
LVL 12

Author Comment

by:adrian_brooks
Ozo, because the https? part of the regex states that http has to be there and the s? is only detected if it's present.

~A~
 
LVL 84

Expert Comment

by:ozo
If http:// or https:// were omitted at submission time, then neither would be present.
Or do I misunderstand what you mean by asking what substitution you would use to put it into a string?
 
LVL 12

Author Comment

by:adrian_brooks
Okay, basically the desired result is that I need to read a variable's value. The value will either be just www.domain.tld or begin with http:// or https:// followed by the domain (e.g. "http://www.domain.tld").

If the string is missing the http:// or the https:// then I need to prepend it to the string so that the final string will, at the very least, have "http://" preceding the domain string.

Hope that helped clarify it better, Ozo.

~A~
 
LVL 84

Accepted Solution

by:
ozo earned 250 total points
# So if they are missing, assume it was http:// ?
$string =~ s#^(?:http(s)?://)?#http$1://#;
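
As a rough sketch of how this substitution could be wrapped for reuse, the following puts it in a small helper. The name `add_scheme` is hypothetical, and the `no warnings 'uninitialized'` line is an assumption about a script running under `use warnings`: `$1` is undefined whenever no 's' was captured, and interpolating it would otherwise emit an "uninitialized value" warning.

```perl
use strict;
use warnings;

# Hypothetical helper: return the URL with a scheme, defaulting to http://.
sub add_scheme {
    my ($url) = @_;

    # $1 is undef when the 's' (or the whole scheme) is absent; it
    # interpolates as an empty string, so silence the warning locally.
    no warnings 'uninitialized';
    $url =~ s#^(?:http(s)?://)?#http$1://#;
    return $url;
}

print add_scheme($_), "\n"
    for qw(www.domain.tld http://www.domain.tld https://www.domain.tld);
# prints:
#   http://www.domain.tld
#   http://www.domain.tld
#   https://www.domain.tld
```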

 
LVL 12

Author Comment

by:adrian_brooks
Okay. A few questions just so I understand your thinking here, if you don't mind.

1) What is the first ? doing since the only thing preceding it is the opening parentheses?
2) The 3rd ? is allowing for 0 or more occurrences of the contents of the parentheses, yes?
3) The $1 is confusing me. I know it's meant as a placeholder var for data, but its position between the http and the :// is what's throwing me.

Thanks, Ozo.

~A~
 
LVL 7

Expert Comment

by:Fairlight2cx
The ?: is notation that indicates that the parentheses should not create a back-reference like $1, $2, etc.  In other words, use the "or" grouping but don't waste resources creating backrefs you won't use.

The ()'s he has around the 's' and before the '?' are creating a backref to the 's' in https, if one is there, or an empty string if not.

Almost: the third '?' matches zero or one of the preceding group, not zero or more (that would be '*').

The $1 is a back-reference to that 's' paren match.  If one was provided, it will be tacked in; if not, it will be empty.

man perlre

Look up back-references for the full mechanics.
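
A small sketch of the capturing versus non-capturing distinction described above. The sample URL and the array names `@with` and `@without` are illustrative only. In list context a match returns its captures, which makes the numbering easy to see.

```perl
use strict;
use warnings;

my $url = 'https://www.domain.tld';

# Both groups capture: the outer group becomes $1, the 's' becomes $2.
my @with = $url =~ m#^(http(s)?://)#;

# The outer group is (?:...), so it captures nothing and the 's' is $1.
my @without = $url =~ m#^(?:http(s)?://)#;

print "capturing outer group:     @with\n";     # https:// s
print "non-capturing outer group: @without\n";  # s
```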
 
LVL 12

Author Comment

by:adrian_brooks
Fairlight2cx,

Thank you for that very concise explanation. I am attempting to make as much sense as I can of regex since my new job seems to be seated very firmly in an all Perl environment.

That was very helpful.

So, would Ozo's regex solution then give me the conditional insertion/substitution in the event that the http:// or https:// were absent from the string being tested?

~A~
 
LVL 7

Assisted Solution

by:Fairlight2cx
Fairlight2cx earned 250 total points
Yes, it would.  Let me break it down for you atomically, so you see and understand fully what's going on:

s#^(?:http(s)?://)?#http$1://#

s#
Start the substitution's matching section

^
Anchor the pattern matching to the beginning of the string.

(?:
Start a grouped expression, but create -no- back-references.  This is both for efficiency and to avoid confusing yourself with back-references you won't need alongside the ones you will.

http
Match that literally.

(s)?
Match zero or one instances of 's'.  Create a back-reference in $1 (since it's our first paren set that doesn't say "don't make a back-reference") that we can use later, containing the 's' or not, depending on its presence.

://
Match that literally.

)?
Close the grouped expression, and the entire grouped expression may be present zero or one times--only at the beginning of the string (from the ^ earlier).

#
End the matching section and start the replacement section.

http
Literally put 'http'  there.

$1
Put the 's' there if it was found and matched, otherwise put nothing there.

://
Literally put '://' there.

#
End the substitution.

Since the entire group in the pattern is optional (zero or one), the scheme can be there or not.  If it is there, it is preserved accurately, including the difference between http and https.  If it is not there at all, a substitution is still made: the ^ anchor matches at the beginning of the string and everything after it is optional, so the pattern matches the empty string at position zero and 'http://' is inserted there.  So if neither 'http://' nor 'https://' was present, 'http://' will be provided, because $1 was never populated (its group did not take part in the match) and therefore interpolates as an empty string.

The short answer is, "Yes, Ozo's expression will work if you're defaulting to http rather than https if neither is specified."
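
If the capture trick feels too clever, a plainer alternative (not what Ozo posted, just a sketch under the same "default to http" assumption) is to test for a scheme and prepend one only when it is missing. The helper name `ensure_scheme` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: prepend http:// only when no scheme is present.
sub ensure_scheme {
    my ($url) = @_;
    return $url =~ m#^https?://# ? $url : "http://$url";
}

print ensure_scheme($_), "\n"
    for qw(www.domain.tld http://www.domain.tld https://www.domain.tld);
```

This version never touches $1, so it stays quiet under `use warnings`, at the cost of being two steps (test, then prepend) instead of a single substitution.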

Hope this helps.
 
LVL 84

Expert Comment

by:ozo
see
perldoc perlre
and the   Regexp Quote-Like Operators  section of
perldoc perlop
 
LVL 84

Expert Comment

by:ozo
perl -MYAPE::Regex::Explain -e "print YAPE::Regex::Explain->new(qr#^(?:http(s)?://)?#)->explain"
The regular expression:

(?-imsx:^(?:http(s)?://)?)

matches as follows:
 
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    http                     'http'
----------------------------------------------------------------------
    (                        group and capture to \1 (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      s                        's'
----------------------------------------------------------------------
    )?                       end of \1 (NOTE: because you're using a
                             quantifier on this capture, only the
                             LAST repetition of the captured pattern
                             will be stored in \1)
----------------------------------------------------------------------
    ://                      '://'
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
 
LVL 7

Expert Comment

by:Fairlight2cx
Neat as hell!  I didn't know there was a module that could do an explanation like that.  Pretty sweet side-benefit I got out of this discussion.  Thanks, Ozo!
 
LVL 12

Author Comment

by:adrian_brooks
Wow!...I think between the two of you, this had to be one of the MOST comprehensive answers I have ever had on EE yet.

Thank you both for an impeccable job. It's a pity I can't award 500 points to each of you as you both most definitely deserve it.

Many many thanks to you both and kudos on the Regex clarity!

~A~
 
LVL 12

Author Closing Comment

by:adrian_brooks
This was a superb example of why EE is so successful! Outstanding job, to the both of you! :)