Perl regex help

Posted on 2011-03-06
Last Modified: 2012-05-11
Question by:fac66
LVL 16

Expert Comment

by:sjklein42
ID: 35049917
while ( <> )
{
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $refCount{$1}++;
    }
}

foreach $domain (keys(%refCount))
{
    push @lines, sprintf("%6d", $refCount{$domain}) . "\t" . $domain . "\n";
}

print sort { $b <=> $a } @lines;

 
LVL 16

Expert Comment

by:sjklein42
ID: 35049995
This version only shows the domain (not subdomain) and includes N/A line:

while ( <> )
{
    s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }
}

foreach $domain (keys(%refCount))
{
    push @lines, sprintf("%6d", $refCount{$domain}) . "\t" . $domain . "\n";
}

print sort { $b <=> $a } @lines;
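
As a rough illustration of the domain-only step above, here is a minimal sketch that isolates the two regexes in a helper. The name `base_domain` is hypothetical and not part of the posted code. It keeps only the last two dot-separated labels of the host, so it will not handle suffixes such as `co.uk` correctly.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): reduce a URL to the last two
# labels of its host, e.g. "blizzard.ist.una.edu" -> "una.edu".
sub base_domain {
    my ($url) = @_;

    # Capture the host: everything after "http://" up to the next slash.
    # If the line is not an http URL, return nothing.
    my ($host) = $url =~ m{^http://([^/]+)} or return;

    # Keep only the last two dot-separated labels when there are two.
    my ($base) = $host =~ /([^.]+\.[^.]+)$/;

    # A host with no dot (e.g. "localhost") is returned unchanged.
    return defined $base ? $base : $host;
}

print base_domain('http://blizzard.ist.una.edu/~lschaller/a5/'), "\n";   # prints "una.edu"
```

Checking the result of the second match explicitly avoids relying on `$1` still holding the value from the first match when the second one fails.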

 
LVL 16

Expert Comment

by:sjklein42
ID: 35050032
With percentages and header.

print join("\t", ' Hits ', '%-Age', 'Resource') . "\n";
print join("\t", '------', '-----', '--------') . "\n";

while ( <> )
{
    s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }

    $totRefCount++;
}

foreach $domain (keys(%refCount))
{
    $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}

print sort { $b <=> $a } @lines;

 

Author Comment

by:fac66
ID: 35052159
Not following you.
I got a string with the following data:

http://my.una.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
N/A
http://blizzard.ist.una.edu/~lschaller/a5/
N/A
http://blizzard.ist.una.edu/~jdfitzpatrick/a4/otherpage.html
http://blizzard.ist.una.edu/~handersen/a5/index.html
N/A
http://blizzard.ist.una.edu/~bmmurray/a5/

Should I begin like this?
Or should I copy into an array?
while ( $string )
{
    s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;

 
LVL 16

Expert Comment

by:sjklein42
ID: 35052260
Ok. I thought it was in a file.

If the input is in one big long string called $string, with embedded newlines between the lines,

replace this line:

while ( <> )

with this

foreach $_ (split(/[\r\n]+/, $string))
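
A small sketch of what that split does, assuming the records really are separated by newlines. The sample data is illustrative, and `$string` stands in for whatever the variable is called in your script.

```perl
use strict;
use warnings;

# Illustrative sample: three records with mixed Windows and Unix line endings.
my $string = "http://my.una.edu/webapps/\r\nN/A\nhttp://blizzard.ist.una.edu/~lschaller/a5/\n";

# Split on runs of CR/LF so either kind of line ending works.
# Trailing empty fields are dropped by split automatically.
my @records = split /[\r\n]+/, $string;

print scalar(@records), " records\n";   # prints "3 records"
print "[$_]\n" for @records;
```

If the string starts with a newline, split returns an empty first element, so the loop may need to skip empty records.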

 

Author Comment

by:fac66
ID: 35052673
This is how I have it..
This is what it prints:

1  100.00  -http://blizzard.ist.una.edu/~jdabestani/a5/http://blizzard.ist.una.edu/~fackermann/a5/index.html-http://my.una.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147595_1&mini=Yhttp://blizzard.ist.una.edu/~jrauscher/a7/http://blizzard.ist.una.edu/~dtwilliams/a5/index.htmlhttp://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.una.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+a

Did I configure it correctly?
my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
 foreach $_ (split(/[\r\n]+/, $string7))
{
   s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }

        $totRefCount++;
}

foreach $domain (keys(%refCount))
{
        $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}
print sort {$b cmp $a} @lines;

 
LVL 16

Expert Comment

by:sjklein42
ID: 35052933
It looks like there is a '-' at the beginning of the URL.  This version should handle it.

There are newline characters between the records in $string7, right?

my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
 foreach $_ (split(/[\r\n]+/, $string7))
{
   s/[\r\n]//g;
    if ( /http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }

        $totRefCount++;
}

foreach $domain (keys(%refCount))
{
        $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}
print sort {$b cmp $a} @lines;

 
LVL 16

Expert Comment

by:sjklein42
ID: 35053174
Just so we're on the same page, this is how I think you're calling it.  Maybe I'm wrong about the input:


$string7 = q{
http://blizzard.ist.una.edu/~dtwilliams/a5/index.html
http://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.unomaha.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+automatically%29&doctype=Inline&group=0&accept=text%2Fhtml%2Capplication%2Fxhtml%2Bxml%2Capplication%2Fxml%3Bq%3D0.9%2C*%2F*%3Bq%3D0.8&accept-language=en-us%2Cen%3Bq%3D0.5&accept-charset=ISO-8859-1%2Cutf-8%3Bq%3D0.7%2C*%3Bq%3D0.7&user-agent=W3C_Validator%2F1.2
N/A
http://blizzard.ist.una.edu/~fackermann/a5/index.html
http://blizzard.ist.una.edu/~asatterfield/projectammon/art.html
http://blizzard.ist.una.edu/~jhperez/a5/index.html
N/A
N/A
N/A
};


my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
 foreach $_ (split(/[\r\n]+/, $string7))
{
    s/[\r\n]//g;
    if ( /http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
        $totRefCount++;
    }
    elsif ( $_ ne '' )
    {
        $refCount{$_}++;
        $totRefCount++;
    }
}

foreach $domain (keys(%refCount))
{
    $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}

print sort {$b cmp $a} @lines;


and the output:
c:\temp>perl foob.pl
     4  44.44   una.edu
     4  44.44   N/A
     1  11.11   w3.org

 

Author Comment

by:fac66
ID: 35053192
Thanks for your help.

Getting real close but it prints only 1 line.

Hits   %-Age   Resource
------  -----   --------
     1  100.00  una.edu

Also need to account for the N/A
For example:
 Hits  %-age   Resource
  ----  -----   --------
    56  55.45   N/A
    44  43.56   una.edu
     1   0.99   w3.org
 
LVL 16

Expert Comment

by:sjklein42
ID: 35053265
Something is different about the input.

How are the lines separated?

Please add this to the code and then show me the output it generates.

print "\n------------\n$string7\n---------------\n";

 

Author Comment

by:fac66
ID: 35053284
This is the output:



------------
-http://blizzard.ist.unomaha.edu/~jdabestani/a5/http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html-http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147595_1&mini=Yhttp://blizzard.ist.unomaha.edu/~jrauscher/a7/http://blizzard.ist.unomaha.edu/~dtwilliams/a5/index.htmlhttp://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.unomaha.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+automatically%29&doctype=Inline&group=0&accept=text%2Fhtml%2Capplication%2Fxhtml%2Bxml%2Capplication%2Fxml%3Bq%3D0.9%2C*%2F*%3Bq%3D0.8&accept-language=en-us%2Cen%3Bq%3D0.5&accept-charset=ISO-8859-1%2Cutf-8%3Bq%3D0.7%2C*%3Bq%3D0.7&user-agent=W3C_Validator%2F1.2-http://blizzard.ist.unomaha.edu/~fackermann/a5/index.htmlhttp://blizzard.ist.unomaha.edu/~asatterfield/projectammon/art.htmlhttp://blizzard.ist.unomaha.edu/~jhperez/a5/index.html---------http://blizzard.ist.unomaha.edu/~ppickett/project/research.htmlhttp://blizzard.ist.unomaha.edu/~jhperez/a4/index.html---http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y-http://blizzard.ist.unomaha.edu/~lschaller/a5/-http://blizzard.ist.unomaha.edu/~jdfitzpatrick/a4/otherpage.htmlhttp://blizzard.ist.unomaha.edu/~handersen/a5/index.html-http://blizzard.ist.unomaha.edu/~bmmurray/a5/---http://blizzard.ist.unomaha.edu/~scarmody/a5/otherpage.htmlhttp://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.htmlhttp://blizzard.ist.unomaha.edu/1300-2-css/http://blizzard.ist.unomaha.edu/~gcnielsen/a4/index.htmlhttp://blizzard.ist.unomaha.edu/~gfosmer/a5/-http://myuno.unomaha.edu/webapps/blackboard/content/listContentEditable.jsp?content_id=_2151076_1&course_id=_147587_1-http://blizzard.ist.unomaha.edu/~ksebastian/a5/-http://blizzard.ist.unomaha.edu/~rbeasley/a4/otherpage.html--http://blizzard.ist.unomaha.edu/~drgulick/a5/otherpage.html-http://blizzard.ist.unomaha.edu/~jrauscher/a7/http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Yhttp://bl
izzard.ist.unomaha.edu/~gcnielsen/a3/index.htmlhttp://blizzard.ist.unomaha.edu/~jrauscher/a4/otherpage.htmlhttp://blizzard.ist.unomaha.edu/~adubey/project/album.html--http://blizzard.ist.unomaha.edu/~jblackmore/---http://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.html---http://blizzard.ist.unomaha.edu/~jwfitzpatrick/a4/otherpage.htmlhttp://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Yhttp://blizzard.ist.unomaha.edu/~jwfitzpatrick/brew/index.html---http://blizzard.ist.unomaha.edu/~jdabestani/a4/http://blizzard.ist.unomaha.edu/~jdabestani/a4/http://blizzard.ist.unomaha.edu/~lschaller/a5/http://blizzard.ist.unomaha.edu/images/style.css---http://blizzard.ist.unomaha.edu/~handersen/a3/--http://blizzard.ist.unomaha.edu/1300-1-xhtml/---------http://blizzard.ist.unomaha.edu/~lschaller/a4/index.html-http://blizzard.ist.unomaha.edu/~dpinkerton/a3/http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y-http://blizzard.ist.unomaha.edu/~dpollreis/a4/otherpage.html--
---------------
 

Author Comment

by:fac66
ID: 35053289
If I do a:

print "$string7\n";

Here are the results:


N/A
http://blizzard.ist.unomaha.edu/~jdabestani/a5/
http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html
N/A
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147595_1&mini=Y
http://blizzard.ist.unomaha.edu/~jrauscher/a7/
http://blizzard.ist.unomaha.edu/~dtwilliams/a5/index.html
http://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.unomaha.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+automatically%29&doctype=Inline&group=0&accept=text%2Fhtml%2Capplication%2Fxhtml%2Bxml%2Capplication%2Fxml%3Bq%3D0.9%2C*%2F*%3Bq%3D0.8&accept-language=en-us%2Cen%3Bq%3D0.5&accept-charset=ISO-8859-1%2Cutf-8%3Bq%3D0.7%2C*%3Bq%3D0.7&user-agent=W3C_Validator%2F1.2
N/A
http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html
http://blizzard.ist.unomaha.edu/~asatterfield/projectammon/art.html
http://blizzard.ist.unomaha.edu/~jhperez/a5/index.html
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~ppickett/project/research.html
http://blizzard.ist.unomaha.edu/~jhperez/a4/index.html
N/A
N/A
N/A
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
N/A
http://blizzard.ist.unomaha.edu/~lschaller/a5/
N/A
http://blizzard.ist.unomaha.edu/~jdfitzpatrick/a4/otherpage.html
http://blizzard.ist.unomaha.edu/~handersen/a5/index.html
N/A
http://blizzard.ist.unomaha.edu/~bmmurray/a5/
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~scarmody/a5/otherpage.html
http://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.html
http://blizzard.ist.unomaha.edu/1300-2-css/
http://blizzard.ist.unomaha.edu/~gcnielsen/a4/index.html
http://blizzard.ist.unomaha.edu/~gfosmer/a5/
N/A
http://myuno.unomaha.edu/webapps/blackboard/content/listContentEditable.jsp?content_id=_2151076_1&course_id=_147587_1
N/A
http://blizzard.ist.unomaha.edu/~ksebastian/a5/
N/A
http://blizzard.ist.unomaha.edu/~rbeasley/a4/otherpage.html
N/A
N/A
http://blizzard.ist.unomaha.edu/~drgulick/a5/otherpage.html
N/A
http://blizzard.ist.unomaha.edu/~jrauscher/a7/
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
http://blizzard.ist.unomaha.edu/~gcnielsen/a3/index.html
http://blizzard.ist.unomaha.edu/~jrauscher/a4/otherpage.html
http://blizzard.ist.unomaha.edu/~adubey/project/album.html
N/A
N/A
http://blizzard.ist.unomaha.edu/~jblackmore/
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.html
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~jwfitzpatrick/a4/otherpage.html
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
http://blizzard.ist.unomaha.edu/~jwfitzpatrick/brew/index.html
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~jdabestani/a4/
http://blizzard.ist.unomaha.edu/~jdabestani/a4/
http://blizzard.ist.unomaha.edu/~lschaller/a5/
http://blizzard.ist.unomaha.edu/images/style.css
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~handersen/a3/
N/A
N/A
http://blizzard.ist.unomaha.edu/1300-1-xhtml/
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~lschaller/a4/index.html
N/A
http://blizzard.ist.unomaha.edu/~dpinkerton/a3/
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
N/A
http://blizzard.ist.unomaha.edu/~dpollreis/a4/otherpage.html
N/A
N/A
blizzard.ist.unomaha.edu
 
LVL 16

Expert Comment

by:sjklein42
ID: 35053297
Aha.  It looks like there aren't any separators between the lines at all.

The fix should be done where you are constructing $string7.

You should put a newline  "\n" at the end of each line.  

Or maybe the newlines were there and you are stripping them out?

If this is confusing, please post the code where you make the value for $string7
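
To show why the separator matters, here is a minimal sketch. It assumes the string is built by joining a list, which is only a guess about your code, and the list name `@ref` and its contents are made up for the example.

```perl
use strict;
use warnings;

# Illustrative stand-in for a list of referrer records.
my @ref = ('N/A', 'http://blizzard.ist.unomaha.edu/~jdabestani/a5/', 'N/A');

my $glued     = join('',   @ref);   # no separator: the records run together
my $separated = join("\n", @ref);   # newline between records

# Splitting on newlines only recovers the records if the newlines are there.
my @from_glued     = split /[\r\n]+/, $glued;
my @from_separated = split /[\r\n]+/, $separated;

print scalar(@from_glued), " vs ", scalar(@from_separated), "\n";   # prints "1 vs 3"
```

With no separator the whole string comes back as a single record, which would explain a report with one line at 100%.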
 
LVL 16

Expert Comment

by:sjklein42
ID: 35053307
Your second output posting looks different than the first one.   The second one has separate lines.  On the first one they were all strung together.

The second one looks good.  Does the program work with that input?
 
LVL 16

Expert Comment

by:sjklein42
ID: 35053319
When I run your second set of data through it I get this output:

perl test.pl test.dat
    56  55.45   N/A
    44  43.56   unomaha.edu
     1  0.99    w3.org

 

Author Comment

by:fac66
ID: 35053326

The below code is where I created $string7.
I had to replace all the empty lines with N/A

Here is @ref pulled from an Apache log:

http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html
http://blizzard.ist.unomaha.edu/~asatterfield/projectammon/art.html
http://blizzard.ist.unomaha.edu/~jhperez/a5/index.html
-
-
-
-
-
-
-
-
-
http://blizzard.ist.unomaha.edu/~ppickett/project/research.html
http://blizzard.ist.unomaha.edu/~jhperez/a4/index.html
-
-
-
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
-
http://blizzard.ist.unomaha.edu/~lschaller/a5/
-
http://blizzard.ist.unomaha.edu/~jdfitzpatrick/a4/otherpage.html
http://blizzard.ist.unomaha.edu/~handersen/a5/index.html

The code below replaced every blank line with N/A:
Does that help?
my $string7 = join('',@ref);
foreach $string7 (@ref)
{
if ( ! ( $string7 =~ /^http\:\/\// ) ) { $string7 = 'N/A'; }
 }

 

Author Comment

by:fac66
ID: 35053333
How did you get this?

perl test.pl test.dat
    56  55.45   N/A
    44  43.56   unomaha.edu
     1  0.99    w3.org
 
LVL 16

Accepted Solution

by:sjklein42
ID: 35053348
This is much easier - just loop through @ref directly - no need for $string7.

my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
foreach $_ (@ref)
{
    s/[\r\n]//g;
    if ( /http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
        $totRefCount++;
    }
    elsif ( $_ ne '' )
    {
        $refCount{$_}++;
        $totRefCount++;
    }
}

foreach $domain (keys(%refCount))
{
    $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}

print sort {$b cmp $a} @lines; 
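
For anyone who wants the same logic under `use strict` and `use warnings`, here is one possible sketch. The sub name `referrer_report` and the sample referrers are made up for illustration. It sorts on the counts themselves instead of on the formatted lines, which is a change from the code above, not a description of it.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the loop above: tally referrers by base domain
# and return report lines sorted by hit count, highest first.
sub referrer_report {
    my @refs = @_;
    my (%count, $total);

    for my $ref (@refs) {
        # Work on a copy so the caller's list is not modified.
        (my $line = $ref) =~ s/[\r\n]//g;
        next if $line eq '';

        # Non-URL records (such as "-" or "N/A") are counted under their own text.
        my $key = $line;
        if ( $line =~ m{http://([^/]+)} ) {
            my $host = $1;
            # Keep the last two host labels, or the whole host if it has no dot.
            $key = $host =~ /([^.]+\.[^.]+)$/ ? $1 : $host;
        }

        $count{$key}++;
        $total++;
    }

    # Sort numerically by count, breaking ties by name, then format each line.
    return map { sprintf("%6d\t%.2f\t%s\n", $count{$_}, $count{$_} / $total * 100, $_) }
           sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
}

print referrer_report(
    'http://blizzard.ist.unomaha.edu/~jhperez/a5/index.html',
    '-',
    'http://validator.w3.org/check',
    'http://myuno.unomaha.edu/webapps/',
);
```

Sorting the formatted lines with `cmp` works as long as the counts fit the six-character padding. Sorting on the numbers avoids depending on that.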

 

Author Comment

by:fac66
ID: 35053381

Excellent!!
Thank you very much sir!
