Solved

Perl regex help

Posted on 2011-03-06
Last Modified: 2012-05-11
Question by:fac66
 
LVL 16

Expert Comment

by:sjklein42
ID: 35049917
while ( <> )
{
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $refCount{$1}++;
    }
}

foreach $domain (keys(%refCount))
{
    push @lines, sprintf("%6d", $refCount{$domain}) . "\t" . $domain . "\n";
}

# The counts are right-aligned to a fixed width, so a descending string sort orders by count.
print sort { $b cmp $a } @lines;

 
LVL 16

Expert Comment

by:sjklein42
ID: 35049995
This version shows only the domain (not the subdomain) and also counts the N/A lines:

while ( <> )
{
    s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }
}

foreach $domain (keys(%refCount))
{
    push @lines, sprintf("%6d", $refCount{$domain}) . "\t" . $domain . "\n";
}

# The counts are right-aligned to a fixed width, so a descending string sort orders by count.
print sort { $b cmp $a } @lines;
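To see what the second match does in isolation, here is a minimal sketch. The helper name `base_domain` is made up for illustration and is not part of the posted code. Keeping only the last two labels is a simplification: it would mis-group hosts under suffixes such as `co.uk`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): reduce a referrer URL to the
# last two labels of its host name, or return undef if it is not an http URL.
sub base_domain {
    my ($url) = @_;
    return undef unless $url =~ m{^http://([^/]+)};
    my $host = $1;
    # Keep "label.label" at the end of the host; fall back to the whole host
    # (for example "localhost") when there is no dot to split on.
    return $host =~ /([^.]+\.[^.]+)$/ ? $1 : $host;
}

print base_domain('http://blizzard.ist.una.edu/~lschaller/a5/'), "\n";  # una.edu
print base_domain('http://validator.w3.org/check?uri=x'), "\n";         # w3.org
```

Using `m{...}` as the delimiter avoids having to escape every slash in the pattern.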

 
LVL 16

Expert Comment

by:sjklein42
ID: 35050032
With percentages and a header:

print join("\t", ' Hits ', '%-Age', 'Resource') . "\n";
print join("\t", '------', '-----', '--------') . "\n";

while ( <> )
{
    s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }

    $totRefCount++;
}

foreach $domain (keys(%refCount))
{
    $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}

# The counts are right-aligned to a fixed width, so a descending string sort orders by count.
print sort { $b cmp $a } @lines;

 

Author Comment

by:fac66
ID: 35052159
I'm not following you.
I have a string with the following data:

http://my.una.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
N/A
http://blizzard.ist.una.edu/~lschaller/a5/
N/A
http://blizzard.ist.una.edu/~jdfitzpatrick/a4/otherpage.html
http://blizzard.ist.una.edu/~handersen/a5/index.html
N/A
http://blizzard.ist.una.edu/~bmmurray/a5/

Should I begin like this?
Or should I copy into an array?
while ( $string )
{
    s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;

 
LVL 16

Expert Comment

by:sjklein42
ID: 35052260
Ok. I thought it was in a file.

If the input is in one big long string called $string, with embedded newlines between the lines,

replace this line:

while ( <> )

with this:

foreach $_ (split(/[\r\n]+/, $string))
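As a small illustration of that replacement, the sketch below splits a string into records and loops over them. The sample string is made up for illustration. Note that `split` drops trailing empty fields but keeps a leading one, so a string that starts with a newline yields an empty first record.

```perl
use strict;
use warnings;

# Sample input made up for illustration: one string with embedded newlines,
# using both Unix (\n) and Windows (\r\n) line endings.
my $string = "http://my.una.edu/a\nN/A\r\nhttp://blizzard.ist.una.edu/b\n";

# Splitting on runs of CR/LF yields one element per record. The trailing
# newline does not create an empty final element.
my @records = split /[\r\n]+/, $string;

for my $record (@records) {
    print "record: [$record]\n";
}
```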

 

Author Comment

by:fac66
ID: 35052673
This is how I have it.
This is what it prints:

1  100.00  -http://blizzard.ist.una.edu/~jdabestani/a5/http://blizzard.ist.una.edu/~fackermann/a5/index.html-http://my.una.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147595_1&mini=Yhttp://blizzard.ist.una.edu/~jrauscher/a7/http://blizzard.ist.una.edu/~dtwilliams/a5/index.htmlhttp://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.una.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+a

Did I configure it correctly?
my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
 foreach $_ (split(/[\r\n]+/, $string7))
{
   s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }

        $totRefCount++;
}

foreach $domain (keys(%refCount))
{
        $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}
print sort {$b cmp $a} @lines;

 
LVL 16

Expert Comment

by:sjklein42
ID: 35052933
It looks like there is a '-' at the beginning of the URL.  This version should handle it.

There are newline characters between the records in $string7, right?

my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
 foreach $_ (split(/[\r\n]+/, $string7))
{
   s/[\r\n]//g;
    if ( /http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }

        $totRefCount++;
}

foreach $domain (keys(%refCount))
{
        $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}
print sort {$b cmp $a} @lines;
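The only change from the previous version is that the `^` anchor is gone from the URL pattern. A minimal sketch of the difference, using one sample line with a leading '-':

```perl
use strict;
use warnings;

my $line = '-http://blizzard.ist.una.edu/~jdabestani/a5/';

# Anchored: the URL must start at the first character, so the leading "-"
# prevents a match.
my $anchored = $line =~ m{^http://([^/]+)} ? $1 : 'no match';

# Unanchored: the URL may start anywhere in the line.
my $unanchored = $line =~ m{http://([^/]+)} ? $1 : 'no match';

print "anchored:   $anchored\n";    # no match
print "unanchored: $unanchored\n";  # blizzard.ist.una.edu
```

The unanchored pattern captures only the first host it finds, so several URLs run together on one line would still be counted as a single hit.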

 
LVL 16

Expert Comment

by:sjklein42
ID: 35053174
Just so we're on the same page, this is how I think you're calling it.  Maybe I'm wrong about the input:


$string7 = q{
http://blizzard.ist.una.edu/~dtwilliams/a5/index.html
http://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.unomaha.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+automatically%29&doctype=Inline&group=0&accept=text%2Fhtml%2Capplication%2Fxhtml%2Bxml%2Capplication%2Fxml%3Bq%3D0.9%2C*%2F*%3Bq%3D0.8&accept-language=en-us%2Cen%3Bq%3D0.5&accept-charset=ISO-8859-1%2Cutf-8%3Bq%3D0.7%2C*%3Bq%3D0.7&user-agent=W3C_Validator%2F1.2
N/A
http://blizzard.ist.una.edu/~fackermann/a5/index.html
http://blizzard.ist.una.edu/~asatterfield/projectammon/art.html
http://blizzard.ist.una.edu/~jhperez/a5/index.html
N/A
N/A
N/A
};


my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
 foreach $_ (split(/[\r\n]+/, $string7))
{
    s/[\r\n]//g;
    if ( /http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
        $totRefCount++;
    }
    elsif ( $_ ne '' )
    {
        $refCount{$_}++;
        $totRefCount++;
    }
}

foreach $domain (keys(%refCount))
{
    $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}

print sort {$b cmp $a} @lines;




and the output:
c:\temp>perl foob.pl
     4  44.44   una.edu
     4  44.44   N/A
     1  11.11   w3.org

 

Author Comment

by:fac66
ID: 35053192
Thanks for your help.

Getting really close, but it prints only one line.

Hits   %-Age   Resource
------  -----   --------
     1  100.00  una.edu

I also need to account for the N/A lines.
For example:
 Hits  %-age   Resource
  ----  -----   --------
    56  55.45   N/A
    44  43.56   una.edu
     1   0.99   w3.org

 
LVL 16

Expert Comment

by:sjklein42
ID: 35053265
Something is different about the input.

How are the lines separated?

Please add this to the code and then show me the output it generates.

print "\n------------\n$string7\n---------------\n";
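A plain `print` cannot distinguish a missing separator from one the terminal swallowed. One alternative, sketched here with the core Data::Dumper module, is to dump the string with control characters shown as escapes. The sample value is made up for illustration.

```perl
use strict;
use warnings;
use Data::Dumper;

# Show control characters as escapes (\n, \r, \t) instead of raw bytes.
$Data::Dumper::Useqq = 1;
# Omit the "$VAR1 = " prefix so only the quoted string is printed.
$Data::Dumper::Terse = 1;

# Sample value made up for illustration.
my $string7 = "-http://blizzard.ist.una.edu/a\r\nN/A\n";

my $dump = Dumper($string7);
print $dump;
```

If the dump shows no `\n` or `\r` between the records, there is nothing for `split` to split on.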

 

Author Comment

by:fac66
ID: 35053284
This is the output:



------------
-http://blizzard.ist.unomaha.edu/~jdabestani/a5/http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html-http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147595_1&mini=Yhttp://blizzard.ist.unomaha.edu/~jrauscher/a7/http://blizzard.ist.unomaha.edu/~dtwilliams/a5/index.htmlhttp://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.unomaha.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+automatically%29&doctype=Inline&group=0&accept=text%2Fhtml%2Capplication%2Fxhtml%2Bxml%2Capplication%2Fxml%3Bq%3D0.9%2C*%2F*%3Bq%3D0.8&accept-language=en-us%2Cen%3Bq%3D0.5&accept-charset=ISO-8859-1%2Cutf-8%3Bq%3D0.7%2C*%3Bq%3D0.7&user-agent=W3C_Validator%2F1.2-http://blizzard.ist.unomaha.edu/~fackermann/a5/index.htmlhttp://blizzard.ist.unomaha.edu/~asatterfield/projectammon/art.htmlhttp://blizzard.ist.unomaha.edu/~jhperez/a5/index.html---------http://blizzard.ist.unomaha.edu/~ppickett/project/research.htmlhttp://blizzard.ist.unomaha.edu/~jhperez/a4/index.html---http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y-http://blizzard.ist.unomaha.edu/~lschaller/a5/-http://blizzard.ist.unomaha.edu/~jdfitzpatrick/a4/otherpage.htmlhttp://blizzard.ist.unomaha.edu/~handersen/a5/index.html-http://blizzard.ist.unomaha.edu/~bmmurray/a5/---http://blizzard.ist.unomaha.edu/~scarmody/a5/otherpage.htmlhttp://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.htmlhttp://blizzard.ist.unomaha.edu/1300-2-css/http://blizzard.ist.unomaha.edu/~gcnielsen/a4/index.htmlhttp://blizzard.ist.unomaha.edu/~gfosmer/a5/-http://myuno.unomaha.edu/webapps/blackboard/content/listContentEditable.jsp?content_id=_2151076_1&course_id=_147587_1-http://blizzard.ist.unomaha.edu/~ksebastian/a5/-http://blizzard.ist.unomaha.edu/~rbeasley/a4/otherpage.html--http://blizzard.ist.unomaha.edu/~drgulick/a5/otherpage.html-http://blizzard.ist.unomaha.edu/~jrauscher/a7/http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Yhttp://bl
izzard.ist.unomaha.edu/~gcnielsen/a3/index.htmlhttp://blizzard.ist.unomaha.edu/~jrauscher/a4/otherpage.htmlhttp://blizzard.ist.unomaha.edu/~adubey/project/album.html--http://blizzard.ist.unomaha.edu/~jblackmore/---http://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.html---http://blizzard.ist.unomaha.edu/~jwfitzpatrick/a4/otherpage.htmlhttp://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Yhttp://blizzard.ist.unomaha.edu/~jwfitzpatrick/brew/index.html---http://blizzard.ist.unomaha.edu/~jdabestani/a4/http://blizzard.ist.unomaha.edu/~jdabestani/a4/http://blizzard.ist.unomaha.edu/~lschaller/a5/http://blizzard.ist.unomaha.edu/images/style.css---http://blizzard.ist.unomaha.edu/~handersen/a3/--http://blizzard.ist.unomaha.edu/1300-1-xhtml/---------http://blizzard.ist.unomaha.edu/~lschaller/a4/index.html-http://blizzard.ist.unomaha.edu/~dpinkerton/a3/http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y-http://blizzard.ist.unomaha.edu/~dpollreis/a4/otherpage.html--
---------------
 

Author Comment

by:fac66
ID: 35053289
If I do a:

print "$string7\n";

Here are the results:


N/A
http://blizzard.ist.unomaha.edu/~jdabestani/a5/
http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html
N/A
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147595_1&mini=Y
http://blizzard.ist.unomaha.edu/~jrauscher/a7/
http://blizzard.ist.unomaha.edu/~dtwilliams/a5/index.html
http://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.unomaha.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+automatically%29&doctype=Inline&group=0&accept=text%2Fhtml%2Capplication%2Fxhtml%2Bxml%2Capplication%2Fxml%3Bq%3D0.9%2C*%2F*%3Bq%3D0.8&accept-language=en-us%2Cen%3Bq%3D0.5&accept-charset=ISO-8859-1%2Cutf-8%3Bq%3D0.7%2C*%3Bq%3D0.7&user-agent=W3C_Validator%2F1.2
N/A
http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html
http://blizzard.ist.unomaha.edu/~asatterfield/projectammon/art.html
http://blizzard.ist.unomaha.edu/~jhperez/a5/index.html
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~ppickett/project/research.html
http://blizzard.ist.unomaha.edu/~jhperez/a4/index.html
N/A
N/A
N/A
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
N/A
http://blizzard.ist.unomaha.edu/~lschaller/a5/
N/A
http://blizzard.ist.unomaha.edu/~jdfitzpatrick/a4/otherpage.html
http://blizzard.ist.unomaha.edu/~handersen/a5/index.html
N/A
http://blizzard.ist.unomaha.edu/~bmmurray/a5/
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~scarmody/a5/otherpage.html
http://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.html
http://blizzard.ist.unomaha.edu/1300-2-css/
http://blizzard.ist.unomaha.edu/~gcnielsen/a4/index.html
http://blizzard.ist.unomaha.edu/~gfosmer/a5/
N/A
http://myuno.unomaha.edu/webapps/blackboard/content/listContentEditable.jsp?content_id=_2151076_1&course_id=_147587_1
N/A
http://blizzard.ist.unomaha.edu/~ksebastian/a5/
N/A
http://blizzard.ist.unomaha.edu/~rbeasley/a4/otherpage.html
N/A
N/A
http://blizzard.ist.unomaha.edu/~drgulick/a5/otherpage.html
N/A
http://blizzard.ist.unomaha.edu/~jrauscher/a7/
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
http://blizzard.ist.unomaha.edu/~gcnielsen/a3/index.html
http://blizzard.ist.unomaha.edu/~jrauscher/a4/otherpage.html
http://blizzard.ist.unomaha.edu/~adubey/project/album.html
N/A
N/A
http://blizzard.ist.unomaha.edu/~jblackmore/
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.html
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~jwfitzpatrick/a4/otherpage.html
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
http://blizzard.ist.unomaha.edu/~jwfitzpatrick/brew/index.html
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~jdabestani/a4/
http://blizzard.ist.unomaha.edu/~jdabestani/a4/
http://blizzard.ist.unomaha.edu/~lschaller/a5/
http://blizzard.ist.unomaha.edu/images/style.css
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~handersen/a3/
N/A
N/A
http://blizzard.ist.unomaha.edu/1300-1-xhtml/
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~lschaller/a4/index.html
N/A
http://blizzard.ist.unomaha.edu/~dpinkerton/a3/
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
N/A
http://blizzard.ist.unomaha.edu/~dpollreis/a4/otherpage.html
N/A
N/A
blizzard.ist.unomaha.edu
 
LVL 16

Expert Comment

by:sjklein42
ID: 35053297
Aha.  It looks like there aren't any separators between the lines at all.

The fix should be done where you are constructing $string7.

You should put a newline  "\n" at the end of each line.  

Or maybe the newlines were there and you are stripping them out?

If this is confusing, please post the code where you make the value for $string7
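One way the separators can go missing is building the string by joining a list with an empty separator. The sketch below assumes the records start out in an array; the array contents are made up for illustration. It shows why that would leave the counting loop with a single record.

```perl
use strict;
use warnings;

# Sample referrer list made up for illustration ("-" means no referrer).
my @ref = ('http://blizzard.ist.una.edu/a', '-', 'http://my.una.edu/b');

my $glued     = join '',   @ref;   # records run together, nothing to split on
my $separated = join "\n", @ref;   # one record per line

my @from_glued     = split /[\r\n]+/, $glued;
my @from_separated = split /[\r\n]+/, $separated;

# prints "glued: 1 record(s), separated: 3 record(s)"
printf "glued: %d record(s), separated: %d record(s)\n",
    scalar @from_glued, scalar @from_separated;
```

With only one record in the loop, the report can only ever contain one line at 100%.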
 
LVL 16

Expert Comment

by:sjklein42
ID: 35053307
Your second output posting looks different than the first one.   The second one has separate lines.  On the first one they were all strung together.

The second one looks good.  Does the program work with that input?
 
LVL 16

Expert Comment

by:sjklein42
ID: 35053319
When I run your second set of data through it I get this output:

perl test.pl test.dat
    56  55.45   N/A
    44  43.56   unomaha.edu
     1  0.99    w3.org

 

Author Comment

by:fac66
ID: 35053326

The below code is where I created $string7.
I had to replace all the empty lines with N/A

Here is @ref, pulled from an Apache log:

http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html
http://blizzard.ist.unomaha.edu/~asatterfield/projectammon/art.html
http://blizzard.ist.unomaha.edu/~jhperez/a5/index.html
-
-
-
-
-
-
-
-
-
http://blizzard.ist.unomaha.edu/~ppickett/project/research.html
http://blizzard.ist.unomaha.edu/~jhperez/a4/index.html
-
-
-
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
-
http://blizzard.ist.unomaha.edu/~lschaller/a5/
-
http://blizzard.ist.unomaha.edu/~jdfitzpatrick/a4/otherpage.html
http://blizzard.ist.unomaha.edu/~handersen/a5/index.html

The code below replaces every blank line with N/A.
Does that help?
my $string7 = join('',@ref);
foreach $string7 (@ref)
{
if ( ! ( $string7 =~ /^http\:\/\// ) ) { $string7 = 'N/A'; }
 }

 

Author Comment

by:fac66
ID: 35053333
How did you get this?

perl test.pl test.dat
    56  55.45   N/A
    44  43.56   unomaha.edu
     1  0.99    w3.org
 
LVL 16

Accepted Solution

by:
sjklein42 earned 500 total points
ID: 35053348
This is much easier - just loop through @ref directly - no need for $string7.

my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
foreach $_ (@ref)
{
    s/[\r\n]//g;
    if ( /http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
        $totRefCount++;
    }
    elsif ( $_ ne '' )
    {
        $refCount{$_}++;
        $totRefCount++;
    }
}

foreach $domain (keys(%refCount))
{
    $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}

print sort {$b cmp $a} @lines; 
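For comparison, here is a variant of the same loop written under `use strict` and `use warnings`. It is a sketch, not the posted solution, and it differs in three ways: it maps anything that is not an http URL (such as '-') to N/A inside the loop, it sorts numerically on the counts rather than on the formatted lines, and it pads the percentage with `%6.2f` so that column lines up. The sample data and the variable names `%count`, `$total` and `@report` are made up for illustration.

```perl
use strict;
use warnings;

# Sample data made up for illustration; "-" marks a request with no referrer.
my @ref = (
    'http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html',
    '-',
    'http://validator.w3.org/check?uri=x',
    'http://myuno.unomaha.edu/webapps/blackboard/',
    '-',
);

my (%count, $total);
for my $entry (@ref) {
    (my $line = $entry) =~ s/[\r\n]//g;   # work on a copy, leave @ref intact
    next if $line eq '';
    my $key = 'N/A';                      # anything that is not an http URL
    if ($line =~ m{^http://([^/]+)}) {
        my $host = $1;
        # Keep the last two labels of the host, or the whole host if no dot.
        $key = $host =~ /([^.]+\.[^.]+)$/ ? $1 : $host;
    }
    $count{$key}++;
    $total++;
}

my @report;
# Sort numerically by count (descending), then by name so ties are stable.
for my $key (sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count) {
    push @report, sprintf "%6d\t%6.2f\t%s\n",
        $count{$key}, 100 * $count{$key} / $total, $key;
}
print @report;
```

Sorting on the hash values avoids relying on the `%6d` padding, which stops ordering correctly once a count needs more than six digits.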

 

Author Comment

by:fac66
ID: 35053381

Excellent!!
Thank you very much sir!
