Splitting a URL string


As a novice in regular expressions, I am struggling to find a short (and easy?) way to split any given URL into its subparts.

As far as I know, a URL can consist of the following parts:

protocol (e.g. http, https, ...) (required)
host (e.g. www.host.org) (required)
port (e.g. 80) (optional)
username (user) (optional)
password (pass) (optional)
webpage (e.g. /home/index.html) (optional)
data (e.g. val1=test&val2=name) (optional)

Most of what I find on the Internet explains how to get the data (and how to split it), but not how to get all the rest as well. So, what Perl regex or code will return all of these parts to my program?

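use Regexp::Common qw /URI/;
# With -keep: $1 whole URI, $2 scheme, $3 host, $4 port, $5 and $6 path plus query
# (with and without leading slash), $7 path without query, $8 query. Matches against $_.
my($uri,$scheme,$host,$port,undef,undef,$path,$query) = /$RE{URI}{HTTP}{-keep}/;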
Marc_Engrie (Author) commented:
It looks like that could solve my problem.
However, I have two more questions:

I guess this statement takes $_ as input? If so, what if the URL has to come from a variable, e.g. $url_string?

I am trying your code with ActivePerl (on Windows), but Perl complains that it cannot find the Regexp module. I know how to use ppm to install an extra module, but I cannot locate this one. Do you happen to know the name of the module to install?

Thx in advance

$url_string  = 'http://search.cpan.org/~abigail/Regexp-Common-2.120/lib/Regexp/Common/URI.pm';
$url_string =~ /$RE{URI}{HTTP}{-keep}/;
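
The question also lists a username and password, which the HTTP pattern above does not hand back as separate captures. As a rough sketch of one alternative (not the Regexp::Common approach), the generic URI regex from RFC 3986, Appendix B can be applied with core Perl only, followed by a second match that splits the authority part. The helper name split_url, the hash keys, and the sample URL are assumptions made up for illustration; the sketch does no validation.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split a URL with core Perl only.
# The first pattern is the generic URI regex from RFC 3986, Appendix B,
# written with non-capturing groups so only the five parts are returned.
sub split_url {
    my ($url) = @_;
    my ($scheme, $authority, $path, $query, $fragment) =
        $url =~ m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$}s
        or return;

    my %part = (
        scheme   => $scheme,
        path     => $path,
        query    => $query,
        fragment => $fragment,
    );

    # The authority is [userinfo@]host[:port]; userinfo and port are optional.
    if (defined $authority) {
        my ($userinfo, $host, $port) =
            $authority =~ m{^(?:(.*)\@)?(.*?)(?::(\d*))?$}s;
        @part{qw(host port)} = ($host, $port);
        # userinfo is user[:password]; a missing password stays undef.
        @part{qw(user password)} = split /:/, $userinfo, 2 if defined $userinfo;
    }
    return \%part;
}

# Usage with a made-up URL that contains every part from the question.
my $parts = split_url('http://user:pass@www.host.org:8080/home/index.html?val1=test&val2=name');
for my $name (qw(scheme user password host port path query)) {
    printf "%-8s %s\n", $name, defined $parts->{$name} ? $parts->{$name} : '(none)';
}
```

Parts that are absent from the URL come back as undef, so the caller can tell "no port" apart from an empty one.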

ppm install Regexp-Common
Marc_Engrie (Author) commented:
Got the module -> Thx

Still one more issue

#my $url_string  = 'http://search.cpan.org/~abigail/Regexp-Common-2.120/lib/Regexp/Common/URI.pm?test=1';
my $url_string  = 'http://search.cpan.org:80/~abigail/Regexp-Common-2.120/lib/Regexp/Common/URI.pm?test=1';
$url_string =~ /$RE{URI}{HTTP}{-keep}/;
printf("url: %s\n,scheme: %s\n,host2: %s\n,port2: %s\n,path: %s\n\n",$1,$2,$3,$4,$5);

Running the above works.
But running it with the commented-out line instead gives an "uninitialized value" warning in the printf, because there is no port in the URL. Is there a trick to prevent or catch the uninitialized value?

(sorry for this probably basic Perl question :-( )
printf("url: %s\n,scheme: %s\n,host2: %s\n,port2: %s\n,path: %s\n\n",$1,$2,$3,$4||'',$5);
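
One caveat with a || '' default is that it also replaces a defined but false capture such as "0". A possible variation, sketched below, is to test with defined and apply the default to every capture at once. The pattern here is a simplified stand-in written for this example (an assumption, not the Regexp::Common one), and captures_or_empty is a hypothetical helper name. On Perl 5.10 or later the // operator does the same per value.

```perl
use strict;
use warnings;

# Simplified stand-in pattern (an assumption for illustration, not the
# Regexp::Common one): scheme, host, optional port, rest of the URL.
my $simple_http = qr{^(https?)://([^/:?#]+)(?::(\d+))?(.*)$};

# Hypothetical helper: return the captures with every undef replaced by ''.
sub captures_or_empty {
    my ($url) = @_;
    my @captures = $url =~ $simple_http or return;
    return map { defined $_ ? $_ : '' } @captures;
}

# The two URLs from the thread: one without a port, one with.
for my $url_string (
    'http://search.cpan.org/~abigail/Regexp-Common-2.120/lib/Regexp/Common/URI.pm?test=1',
    'http://search.cpan.org:80/~abigail/Regexp-Common-2.120/lib/Regexp/Common/URI.pm?test=1',
) {
    my ($scheme, $host, $port, $path) = captures_or_empty($url_string);
    printf "scheme: %s\nhost: %s\nport: %s\npath: %s\n\n", $scheme, $host, $port, $path;
}
```

Because the default is applied before the values reach printf, neither URL should trigger the warning.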

Marc_Engrie (Author) commented:
from here on I can walk along again :-)

Thx a lot for helping me out and 'teaching' me extra tricks in Perl.

Have a great weekend