Solved

Perl regex help

Posted on 2011-03-06
Last Modified: 2012-05-11
Question by:fac66
 
LVL 16

Expert Comment

by:sjklein42
ID: 35049917
while ( <> )
{
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $refCount{$1}++;
    }
}

foreach $domain (keys(%refCount))
{
    push @lines, sprintf("%6d", $refCount{$domain}) . "\t" . $domain . "\n";
}

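# $a and $b need their sigils; the counts are right-aligned by sprintf,
# so a descending string sort puts the highest hit count first.
print sort {$b cmp $a} @lines;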

 
LVL 16

Expert Comment

by:sjklein42
ID: 35049995
This version only shows the domain (not subdomain) and includes N/A line:

while ( <> )
{
    s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }
}

foreach $domain (keys(%refCount))
{
    push @lines, sprintf("%6d", $refCount{$domain}) . "\t" . $domain . "\n";
}

print sort {$b cmp $a} @lines;
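
To see the domain-only step in isolation, here is a minimal sketch. The helper name `registered_domain` is hypothetical (it is not part of the script above), and the "last two labels" rule is only an approximation: it would misreport hosts under suffixes such as `.co.uk`.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a URL to the last two labels of its host name.
# Returns undef when the string does not start with http://.
sub registered_domain {
    my ($url) = @_;
    return undef unless $url =~ m{^http://([^/]+)};
    my $host = $1;
    # Keep the last two dot-separated labels; fall back to the whole host
    # when it has no dot at all (for example "localhost").
    return $host =~ /([^.]+\.[^.]+)$/ ? $1 : $host;
}

for my $url ('http://blizzard.ist.una.edu/~lschaller/a5/',
             'http://validator.w3.org/check',
             'N/A') {
    my $domain = registered_domain($url);
    print defined $domain ? "$domain\n" : "(no URL) $url\n";
}
```

With the three sample strings this should print `una.edu`, `w3.org`, and a "(no URL)" line for `N/A`.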

 
LVL 16

Expert Comment

by:sjklein42
ID: 35050032
With percentages and header.

print join("\t", ' Hits ', '%-Age', 'Resource') . "\n";
print join("\t", '------', '-----', '--------') . "\n";

while ( <> )
{
    s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }

    $totRefCount++;
}

foreach $domain (keys(%refCount))
{
    $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}

print sort {$b cmp $a} @lines;


 

Author Comment

by:fac66
ID: 35052159
Not following you.
I got a string with the following data:

http://my.una.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
N/A
http://blizzard.ist.una.edu/~lschaller/a5/
N/A
http://blizzard.ist.una.edu/~jdfitzpatrick/a4/otherpage.html
http://blizzard.ist.una.edu/~handersen/a5/index.html
N/A
http://blizzard.ist.una.edu/~bmmurray/a5/

Should I begin like this?
Or should I copy into an array?
while ( $string )
{
    s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;

 
LVL 16

Expert Comment

by:sjklein42
ID: 35052260
Ok. I thought it was in a file.

If the input is in one big long string called $string, with embedded newlines between the lines,

replace this line:

while ( <> )

with this

foreach $_ (split(/[\r\n]+/, $string))
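
As a rough illustration of what that split does, the sketch below breaks a sample string into records. The helper name `split_lines` and the sample data are assumptions made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: break one multi-line string into its non-empty lines.
sub split_lines {
    my ($text) = @_;
    # [\r\n]+ treats \n, \r\n, and runs of blank lines as a single separator.
    # grep drops the empty leading field that split returns when the string
    # starts with a line ending.
    return grep { length } split /[\r\n]+/, $text;
}

my $string = "http://my.una.edu/webapps/\r\nN/A\nhttp://blizzard.ist.una.edu/~lschaller/a5/\n";
my @records = split_lines($string);
print scalar(@records), " records\n";
print "[$_]\n" for @records;
```

If the string contains no line endings at all, the split returns the whole string as a single record, which is worth checking for when the counts look wrong.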

 

Author Comment

by:fac66
ID: 35052673
This is how I have it..
This is what it prints:

1  100.00  -http://blizzard.ist.una.edu/~jdabestani/a5/http://blizzard.ist.una.edu/~fackermann/a5/index.html-http://my.una.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147595_1&mini=Yhttp://blizzard.ist.una.edu/~jrauscher/a7/http://blizzard.ist.una.edu/~dtwilliams/a5/index.htmlhttp://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.una.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+a

Did I configure it correctly?
my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
 foreach $_ (split(/[\r\n]+/, $string7))
{
   s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }

        $totRefCount++;
}

foreach $domain (keys(%refCount))
{
        $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}
print sort {$b cmp $a} @lines;

 
LVL 16

Expert Comment

by:sjklein42
ID: 35052933
It looks like there is a '-' at the beginning of the URL.  This version drops the ^ anchor from the pattern, so it should handle it.

There are newline characters between the records in $string7, right?

my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
 foreach $_ (split(/[\r\n]+/, $string7))
{
   s/[\r\n]//g;
    if ( /http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }

        $totRefCount++;
}

foreach $domain (keys(%refCount))
{
        $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}
print sort {$b cmp $a} @lines;
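
The only change from the previous version is the missing `^` in the pattern. A small sketch of the difference, using a hypothetical helper `host_of` and a sample record with a stray leading '-':

```perl
use strict;
use warnings;

# Hypothetical helper: extract the host name from a record, tolerating any
# leading characters (such as a stray '-') before "http://".
sub host_of {
    my ($record) = @_;
    return $record =~ m{http://([^/]+)} ? $1 : undef;
}

my $record = '-http://blizzard.ist.una.edu/~jdabestani/a5/';

# Anchored: fails, because the record starts with '-' rather than 'http'.
print "anchored:   ", ($record =~ m{^http://([^/]+)} ? $1 : 'no match'), "\n";

# Unanchored: finds the first http:// anywhere in the record.
my $host = host_of($record);
print "unanchored: ", (defined $host ? $host : 'no match'), "\n";
```

Note that the unanchored pattern only captures the first URL in a record, so if several URLs are run together on one line, all but the first are ignored.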

 
LVL 16

Expert Comment

by:sjklein42
ID: 35053174
Just so we're on the same page, this is how I think you're calling it.  Maybe I'm wrong about the input:


$string7 = q{
http://blizzard.ist.una.edu/~dtwilliams/a5/index.html
http://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.unomaha.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+automatically%29&doctype=Inline&group=0&accept=text%2Fhtml%2Capplication%2Fxhtml%2Bxml%2Capplication%2Fxml%3Bq%3D0.9%2C*%2F*%3Bq%3D0.8&accept-language=en-us%2Cen%3Bq%3D0.5&accept-charset=ISO-8859-1%2Cutf-8%3Bq%3D0.7%2C*%3Bq%3D0.7&user-agent=W3C_Validator%2F1.2
N/A
http://blizzard.ist.una.edu/~fackermann/a5/index.html
http://blizzard.ist.una.edu/~asatterfield/projectammon/art.html
http://blizzard.ist.una.edu/~jhperez/a5/index.html
N/A
N/A
N/A
};


my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
 foreach $_ (split(/[\r\n]+/, $string7))
{
    s/[\r\n]//g;
    if ( /http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
        $totRefCount++;
    }
    elsif ( $_ ne '' )
    {
        $refCount{$_}++;
        $totRefCount++;
    }
}

foreach $domain (keys(%refCount))
{
    $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}

print sort {$b cmp $a} @lines;




and the output:
c:\temp>perl foob.pl
     4  44.44   una.edu
     4  44.44   N/A
     1  11.11   w3.org

 

Author Comment

by:fac66
ID: 35053192
Thanks for your help.

Getting real close but it prints only 1 line.

Hits   %-Age   Resource
------  -----   --------
     1  100.00  una.edu

Also need to account for the N/A
For example:
 Hits  %-age   Resource
  ----  -----   --------
    56  55.45   N/A
    44  43.56   una.edu
     1   0.99   w3.org
 
LVL 16

Expert Comment

by:sjklein42
ID: 35053265
Something is different about the input.

How are the lines separated?

Please add this to the code and then show me the output it generates.

print "\n------------\n$string7\n---------------\n";

 

Author Comment

by:fac66
ID: 35053284
This is the output:



------------
-http://blizzard.ist.unomaha.edu/~jdabestani/a5/http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html-http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147595_1&mini=Yhttp://blizzard.ist.unomaha.edu/~jrauscher/a7/http://blizzard.ist.unomaha.edu/~dtwilliams/a5/index.htmlhttp://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.unomaha.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+automatically%29&doctype=Inline&group=0&accept=text%2Fhtml%2Capplication%2Fxhtml%2Bxml%2Capplication%2Fxml%3Bq%3D0.9%2C*%2F*%3Bq%3D0.8&accept-language=en-us%2Cen%3Bq%3D0.5&accept-charset=ISO-8859-1%2Cutf-8%3Bq%3D0.7%2C*%3Bq%3D0.7&user-agent=W3C_Validator%2F1.2-http://blizzard.ist.unomaha.edu/~fackermann/a5/index.htmlhttp://blizzard.ist.unomaha.edu/~asatterfield/projectammon/art.htmlhttp://blizzard.ist.unomaha.edu/~jhperez/a5/index.html---------http://blizzard.ist.unomaha.edu/~ppickett/project/research.htmlhttp://blizzard.ist.unomaha.edu/~jhperez/a4/index.html---http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y-http://blizzard.ist.unomaha.edu/~lschaller/a5/-http://blizzard.ist.unomaha.edu/~jdfitzpatrick/a4/otherpage.htmlhttp://blizzard.ist.unomaha.edu/~handersen/a5/index.html-http://blizzard.ist.unomaha.edu/~bmmurray/a5/---http://blizzard.ist.unomaha.edu/~scarmody/a5/otherpage.htmlhttp://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.htmlhttp://blizzard.ist.unomaha.edu/1300-2-css/http://blizzard.ist.unomaha.edu/~gcnielsen/a4/index.htmlhttp://blizzard.ist.unomaha.edu/~gfosmer/a5/-http://myuno.unomaha.edu/webapps/blackboard/content/listContentEditable.jsp?content_id=_2151076_1&course_id=_147587_1-http://blizzard.ist.unomaha.edu/~ksebastian/a5/-http://blizzard.ist.unomaha.edu/~rbeasley/a4/otherpage.html--http://blizzard.ist.unomaha.edu/~drgulick/a5/otherpage.html-http://blizzard.ist.unomaha.edu/~jrauscher/a7/http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Yhttp://bl
izzard.ist.unomaha.edu/~gcnielsen/a3/index.htmlhttp://blizzard.ist.unomaha.edu/~jrauscher/a4/otherpage.htmlhttp://blizzard.ist.unomaha.edu/~adubey/project/album.html--http://blizzard.ist.unomaha.edu/~jblackmore/---http://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.html---http://blizzard.ist.unomaha.edu/~jwfitzpatrick/a4/otherpage.htmlhttp://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Yhttp://blizzard.ist.unomaha.edu/~jwfitzpatrick/brew/index.html---http://blizzard.ist.unomaha.edu/~jdabestani/a4/http://blizzard.ist.unomaha.edu/~jdabestani/a4/http://blizzard.ist.unomaha.edu/~lschaller/a5/http://blizzard.ist.unomaha.edu/images/style.css---http://blizzard.ist.unomaha.edu/~handersen/a3/--http://blizzard.ist.unomaha.edu/1300-1-xhtml/---------http://blizzard.ist.unomaha.edu/~lschaller/a4/index.html-http://blizzard.ist.unomaha.edu/~dpinkerton/a3/http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y-http://blizzard.ist.unomaha.edu/~dpollreis/a4/otherpage.html--
---------------
 

Author Comment

by:fac66
ID: 35053289
If I do a:

print "$string7\n";

Here are the results:


N/A
http://blizzard.ist.unomaha.edu/~jdabestani/a5/
http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html
N/A
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147595_1&mini=Y
http://blizzard.ist.unomaha.edu/~jrauscher/a7/
http://blizzard.ist.unomaha.edu/~dtwilliams/a5/index.html
http://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.unomaha.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+automatically%29&doctype=Inline&group=0&accept=text%2Fhtml%2Capplication%2Fxhtml%2Bxml%2Capplication%2Fxml%3Bq%3D0.9%2C*%2F*%3Bq%3D0.8&accept-language=en-us%2Cen%3Bq%3D0.5&accept-charset=ISO-8859-1%2Cutf-8%3Bq%3D0.7%2C*%3Bq%3D0.7&user-agent=W3C_Validator%2F1.2
N/A
http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html
http://blizzard.ist.unomaha.edu/~asatterfield/projectammon/art.html
http://blizzard.ist.unomaha.edu/~jhperez/a5/index.html
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~ppickett/project/research.html
http://blizzard.ist.unomaha.edu/~jhperez/a4/index.html
N/A
N/A
N/A
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
N/A
http://blizzard.ist.unomaha.edu/~lschaller/a5/
N/A
http://blizzard.ist.unomaha.edu/~jdfitzpatrick/a4/otherpage.html
http://blizzard.ist.unomaha.edu/~handersen/a5/index.html
N/A
http://blizzard.ist.unomaha.edu/~bmmurray/a5/
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~scarmody/a5/otherpage.html
http://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.html
http://blizzard.ist.unomaha.edu/1300-2-css/
http://blizzard.ist.unomaha.edu/~gcnielsen/a4/index.html
http://blizzard.ist.unomaha.edu/~gfosmer/a5/
N/A
http://myuno.unomaha.edu/webapps/blackboard/content/listContentEditable.jsp?content_id=_2151076_1&course_id=_147587_1
N/A
http://blizzard.ist.unomaha.edu/~ksebastian/a5/
N/A
http://blizzard.ist.unomaha.edu/~rbeasley/a4/otherpage.html
N/A
N/A
http://blizzard.ist.unomaha.edu/~drgulick/a5/otherpage.html
N/A
http://blizzard.ist.unomaha.edu/~jrauscher/a7/
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
http://blizzard.ist.unomaha.edu/~gcnielsen/a3/index.html
http://blizzard.ist.unomaha.edu/~jrauscher/a4/otherpage.html
http://blizzard.ist.unomaha.edu/~adubey/project/album.html
N/A
N/A
http://blizzard.ist.unomaha.edu/~jblackmore/
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.html
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~jwfitzpatrick/a4/otherpage.html
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
http://blizzard.ist.unomaha.edu/~jwfitzpatrick/brew/index.html
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~jdabestani/a4/
http://blizzard.ist.unomaha.edu/~jdabestani/a4/
http://blizzard.ist.unomaha.edu/~lschaller/a5/
http://blizzard.ist.unomaha.edu/images/style.css
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~handersen/a3/
N/A
N/A
http://blizzard.ist.unomaha.edu/1300-1-xhtml/
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~lschaller/a4/index.html
N/A
http://blizzard.ist.unomaha.edu/~dpinkerton/a3/
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
N/A
http://blizzard.ist.unomaha.edu/~dpollreis/a4/otherpage.html
N/A
N/A
blizzard.ist.unomaha.edu
 
LVL 16

Expert Comment

by:sjklein42
ID: 35053297
Aha.  It looks like there aren't any separators between the lines at all.

The fix should be done where you are constructing $string7.

You should put a newline  "\n" at the end of each line.  

Or maybe the newlines were there and you are stripping them out?

If this is confusing, please post the code where you make the value for $string7
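
For example, if the entries are held in a list before being joined, something along these lines would keep them on separate lines. This is only a sketch: the helper name `build_referrer_string` and the sample list are assumptions, since the code that builds $string7 has not been posted yet.

```perl
use strict;
use warnings;

# Hypothetical helper: build one newline-separated string from a list of
# referrer entries, replacing anything that is not an http:// URL with 'N/A'.
sub build_referrer_string {
    my @clean;
    for my $entry (@_) {
        (my $copy = $entry) =~ s/[\r\n]+\z//;    # drop any trailing line ending
        push @clean, ($copy =~ m{^http://} ? $copy : 'N/A');
    }
    # Joining with "\n" (not '') is what keeps the records separate.
    return join("\n", @clean) . "\n";
}

my @entries = ('http://blizzard.ist.unomaha.edu/~jhperez/a5/index.html', '-', '-');
print build_referrer_string(@entries);
```

Joining with an empty string instead would produce one long run of text, and the split on `[\r\n]+` would then see a single record.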
 
LVL 16

Expert Comment

by:sjklein42
ID: 35053307
Your second output posting looks different than the first one.   The second one has separate lines.  On the first one they were all strung together.

The second one looks good.  Does the program work with that input?
 
LVL 16

Expert Comment

by:sjklein42
ID: 35053319
When I run your second set of data through it I get this output:

perl test.pl test.dat
    56  55.45   N/A
    44  43.56   unomaha.edu
     1  0.99    w3.org

 

Author Comment

by:fac66
ID: 35053326

The below code is where I created $string7.
I had to replace all the empty lines with N/A

Here is @ref pulled from an Apache log:

http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html
http://blizzard.ist.unomaha.edu/~asatterfield/projectammon/art.html
http://blizzard.ist.unomaha.edu/~jhperez/a5/index.html
-
-
-
-
-
-
-
-
-
http://blizzard.ist.unomaha.edu/~ppickett/project/research.html
http://blizzard.ist.unomaha.edu/~jhperez/a4/index.html
-
-
-
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
-
http://blizzard.ist.unomaha.edu/~lschaller/a5/
-
http://blizzard.ist.unomaha.edu/~jdfitzpatrick/a4/otherpage.html
http://blizzard.ist.unomaha.edu/~handersen/a5/index.html

The code below replaced every blank line with N/A:
Does that help?
my $string7 = join('',@ref);
foreach $string7 (@ref)
{
if ( ! ( $string7 =~ /^http\:\/\// ) ) { $string7 = 'N/A'; }
 }

 

Author Comment

by:fac66
ID: 35053333
How did you get this?

perl test.pl test.dat
    56  55.45   N/A
    44  43.56   unomaha.edu
     1  0.99    w3.org
 
LVL 16

Accepted Solution

by:
sjklein42 earned 500 total points
ID: 35053348
This is much easier - just loop through @ref directly - no need for $string7.

my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
foreach $_ (@ref)
{
    s/[\r\n]//g;
    if ( /http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
        $totRefCount++;
    }
    elsif ( $_ ne '' )
    {
        $refCount{$_}++;
        $totRefCount++;
    }
}

foreach $domain (keys(%refCount))
{
    $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}

print sort {$b cmp $a} @lines; 
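
The string sort on the formatted lines works as long as every count fits the six-column `%6d` field. As an alternative sketch (not what the code above does), the tally can be returned as rows and sorted numerically on the counts themselves. The sub name `tally_referrers` and the sample list are hypothetical; the matching rules are the same as in the loop above, so non-URL entries are counted under their own text.

```perl
use strict;
use warnings;

# Hypothetical helper: tally referrers by two-label domain and return
# [label, hits, percent] rows, most hits first.
sub tally_referrers {
    my (%count, $total);
    for my $entry (@_) {
        (my $record = $entry) =~ s/[\r\n]//g;   # work on a copy, strip line endings
        next if $record eq '';
        my $key = $record;                      # non-URL entries count as themselves
        if ($record =~ m{http://([^/]+)}) {
            my $host = $1;
            $key = $host =~ /([^.]+\.[^.]+)$/ ? $1 : $host;
        }
        $count{$key}++;
        $total++;
    }
    # Sort by hit count (numeric, descending), then by label for stable ties.
    return map  { [ $_, $count{$_}, 100 * $count{$_} / $total ] }
           sort { $count{$b} <=> $count{$a} or $a cmp $b }
           keys %count;
}

my @sample = ('http://blizzard.ist.unomaha.edu/~jhperez/a5/index.html', '-', '-',
              'http://validator.w3.org/check');
printf "%6d\t%.2f\t%s\n", @{$_}[1, 2, 0] for tally_referrers(@sample);
```

Working on a copy of each entry also leaves the caller's array unmodified, whereas `s/[\r\n]//g` inside `foreach $_ (@ref)` edits @ref in place.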

 

Author Comment

by:fac66
ID: 35053381

Excellent!!
Thank you very much sir!