Perl regex help

Asked by: fac66

sjklein42Commented:
while ( <> )
{
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $refCount{$1}++;
    }
}

foreach $domain (keys(%refCount))
{
    push @lines, sprintf("%6d", $refCount{$domain}) . "\t" . $domain . "\n";
}

print sort { $b cmp $a } @lines;   # the %6d padding makes a descending string sort order by hit count

 
sjklein42Commented:
This version only shows the domain (not subdomain) and includes N/A line:

while ( <> )
{
    s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }
}

foreach $domain (keys(%refCount))
{
    push @lines, sprintf("%6d", $refCount{$domain}) . "\t" . $domain . "\n";
}

print sort { $b cmp $a } @lines;   # the %6d padding makes a descending string sort order by hit count
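
To make the "domain, not subdomain" step easier to follow, here is a minimal sketch of just that reduction. The helper name base_domain and the example.co.uk URL are made up for illustration. It also shows a limit of the "last two labels" rule: country-code suffixes such as co.uk are not treated specially.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a URL to the last two labels of its host name.
sub base_domain {
    my ($url) = @_;
    # Capture everything between "http://" and the next slash.
    my ($host) = $url =~ m{^http://([^/]+)} or return;
    # Keep the last two dot-separated labels; fall back to the whole host.
    my ($base) = $host =~ /([^.]+\.[^.]+)$/;
    return defined $base ? $base : $host;
}

print base_domain('http://blizzard.ist.una.edu/~lschaller/a5/'), "\n";   # una.edu
print base_domain('http://www.example.co.uk/index.html'), "\n";          # co.uk
```

If the logs only contain hosts like the ones in this thread, the simple rule is probably enough.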

 
sjklein42Commented:
With percentages and header.

print join("\t", ' Hits ', '%-Age', 'Resource') . "\n";
print join("\t", '------', '-----', '--------') . "\n";

while ( <> )
{
    s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }

	$totRefCount++;
}

foreach $domain (keys(%refCount))
{
	$pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}

print sort { $b cmp $a } @lines;   # the %6d padding makes a descending string sort order by hit count


 
fac66Author Commented:
Not following you.
I got a string with the following data:

http://my.una.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
N/A
http://blizzard.ist.una.edu/~lschaller/a5/
N/A
http://blizzard.ist.una.edu/~jdfitzpatrick/a4/otherpage.html
http://blizzard.ist.una.edu/~handersen/a5/index.html
N/A
http://blizzard.ist.una.edu/~bmmurray/a5/

Should I begin like this?
Or should I copy into an array?
while ( $string )
{
    s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;

 
sjklein42Commented:
Ok. I thought it was in a file.

If the input is in one big long string called $string, with embedded newlines between the lines,

replace this line:

while ( <> )

with this

foreach $_ (split(/[\r\n]+/, $string))
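
As a rough standalone illustration of that replacement, the sketch below splits a string into records and loops over them. The sample value of $string is invented for the example; splitting on /[\r\n]+/ handles both Unix and Windows line endings.

```perl
use strict;
use warnings;

# Hypothetical sample: three records separated by line endings.
my $string = "http://my.una.edu/webapps/\nN/A\r\nhttp://blizzard.ist.una.edu/~lschaller/a5/\n";

# Split on runs of CR/LF; trailing empty fields are dropped by split.
my @records = split /[\r\n]+/, $string;

foreach my $record (@records) {
    print "[$record]\n";
}
```

Note that this only works if the string really contains newline characters between the records. A string that begins with a newline would also yield one empty leading record.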

 
fac66Author Commented:
This is how I have it..
This is what it prints:

1  100.00  -http://blizzard.ist.una.edu/~jdabestani/a5/http://blizzard.ist.una.edu/~fackermann/a5/index.html-http://my.una.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147595_1&mini=Yhttp://blizzard.ist.una.edu/~jrauscher/a7/http://blizzard.ist.una.edu/~dtwilliams/a5/index.htmlhttp://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.una.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+a

Did I configure it correctly?
my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
 foreach $_ (split(/[\r\n]+/, $string7))
{
   s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }

        $totRefCount++;
}

foreach $domain (keys(%refCount))
{
        $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}
print sort {$b cmp $a} @lines;

 
sjklein42Commented:
It looks like there is a '-' at the beginning of the URL.  This version should handle it.

There are newline characters between the records in $string7, right?

my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
 foreach $_ (split(/[\r\n]+/, $string7))
{
   s/[\r\n]//g;
    if ( /http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }

        $totRefCount++;
}

foreach $domain (keys(%refCount))
{
        $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}
print sort {$b cmp $a} @lines;

 
sjklein42Commented:
Just so we're on the same page, this is how I think you're calling it.  Maybe I'm wrong about the input:


$string7 = q{
http://blizzard.ist.una.edu/~dtwilliams/a5/index.html
http://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.unomaha.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+automatically%29&doctype=Inline&group=0&accept=text%2Fhtml%2Capplication%2Fxhtml%2Bxml%2Capplication%2Fxml%3Bq%3D0.9%2C*%2F*%3Bq%3D0.8&accept-language=en-us%2Cen%3Bq%3D0.5&accept-charset=ISO-8859-1%2Cutf-8%3Bq%3D0.7%2C*%3Bq%3D0.7&user-agent=W3C_Validator%2F1.2
N/A
http://blizzard.ist.una.edu/~fackermann/a5/index.html
http://blizzard.ist.una.edu/~asatterfield/projectammon/art.html
http://blizzard.ist.una.edu/~jhperez/a5/index.html
N/A
N/A
N/A
};


my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
 foreach $_ (split(/[\r\n]+/, $string7))
{
    s/[\r\n]//g;
    if ( /http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
		$totRefCount++;
    }
    elsif ( $_ ne '' )
    {
        $refCount{$_}++;
		$totRefCount++;
    }
}

foreach $domain (keys(%refCount))
{
    $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}

print sort {$b cmp $a} @lines;


and the output:
c:\temp>perl foob.pl
     4  44.44   una.edu
     4  44.44   N/A
     1  11.11   w3.org

 
fac66Author Commented:
Thanks for your help.

Getting real close but it prints only 1 line.

Hits   %-Age   Resource
------  -----   --------
     1  100.00  una.edu

Also need to account for the N/A
For example:
 Hits  %-age   Resource
  ----  -----   --------
    56  55.45   N/A
    44  43.56   una.edu
     1   0.99   w3.org
 
sjklein42Commented:
Something is different about the input.

How are the lines separated?

Please add this to the code and then show me the output it generates.

print "\n------------\n$string7\n---------------\n";

 
fac66Author Commented:
This is the output:



------------
-http://blizzard.ist.unomaha.edu/~jdabestani/a5/http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html-http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147595_1&mini=Yhttp://blizzard.ist.unomaha.edu/~jrauscher/a7/http://blizzard.ist.unomaha.edu/~dtwilliams/a5/index.htmlhttp://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.unomaha.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+automatically%29&doctype=Inline&group=0&accept=text%2Fhtml%2Capplication%2Fxhtml%2Bxml%2Capplication%2Fxml%3Bq%3D0.9%2C*%2F*%3Bq%3D0.8&accept-language=en-us%2Cen%3Bq%3D0.5&accept-charset=ISO-8859-1%2Cutf-8%3Bq%3D0.7%2C*%3Bq%3D0.7&user-agent=W3C_Validator%2F1.2-http://blizzard.ist.unomaha.edu/~fackermann/a5/index.htmlhttp://blizzard.ist.unomaha.edu/~asatterfield/projectammon/art.htmlhttp://blizzard.ist.unomaha.edu/~jhperez/a5/index.html---------http://blizzard.ist.unomaha.edu/~ppickett/project/research.htmlhttp://blizzard.ist.unomaha.edu/~jhperez/a4/index.html---http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y-http://blizzard.ist.unomaha.edu/~lschaller/a5/-http://blizzard.ist.unomaha.edu/~jdfitzpatrick/a4/otherpage.htmlhttp://blizzard.ist.unomaha.edu/~handersen/a5/index.html-http://blizzard.ist.unomaha.edu/~bmmurray/a5/---http://blizzard.ist.unomaha.edu/~scarmody/a5/otherpage.htmlhttp://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.htmlhttp://blizzard.ist.unomaha.edu/1300-2-css/http://blizzard.ist.unomaha.edu/~gcnielsen/a4/index.htmlhttp://blizzard.ist.unomaha.edu/~gfosmer/a5/-http://myuno.unomaha.edu/webapps/blackboard/content/listContentEditable.jsp?content_id=_2151076_1&course_id=_147587_1-http://blizzard.ist.unomaha.edu/~ksebastian/a5/-http://blizzard.ist.unomaha.edu/~rbeasley/a4/otherpage.html--http://blizzard.ist.unomaha.edu/~drgulick/a5/otherpage.html-http://blizzard.ist.unomaha.edu/~jrauscher/a7/http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Yhttp://bl
izzard.ist.unomaha.edu/~gcnielsen/a3/index.htmlhttp://blizzard.ist.unomaha.edu/~jrauscher/a4/otherpage.htmlhttp://blizzard.ist.unomaha.edu/~adubey/project/album.html--http://blizzard.ist.unomaha.edu/~jblackmore/---http://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.html---http://blizzard.ist.unomaha.edu/~jwfitzpatrick/a4/otherpage.htmlhttp://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Yhttp://blizzard.ist.unomaha.edu/~jwfitzpatrick/brew/index.html---http://blizzard.ist.unomaha.edu/~jdabestani/a4/http://blizzard.ist.unomaha.edu/~jdabestani/a4/http://blizzard.ist.unomaha.edu/~lschaller/a5/http://blizzard.ist.unomaha.edu/images/style.css---http://blizzard.ist.unomaha.edu/~handersen/a3/--http://blizzard.ist.unomaha.edu/1300-1-xhtml/---------http://blizzard.ist.unomaha.edu/~lschaller/a4/index.html-http://blizzard.ist.unomaha.edu/~dpinkerton/a3/http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y-http://blizzard.ist.unomaha.edu/~dpollreis/a4/otherpage.html--
---------------
 
fac66Author Commented:
If I do a:

print "$string7\n";

Here are the results:


N/A
http://blizzard.ist.unomaha.edu/~jdabestani/a5/
http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html
N/A
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147595_1&mini=Y
http://blizzard.ist.unomaha.edu/~jrauscher/a7/
http://blizzard.ist.unomaha.edu/~dtwilliams/a5/index.html
http://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.unomaha.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+automatically%29&doctype=Inline&group=0&accept=text%2Fhtml%2Capplication%2Fxhtml%2Bxml%2Capplication%2Fxml%3Bq%3D0.9%2C*%2F*%3Bq%3D0.8&accept-language=en-us%2Cen%3Bq%3D0.5&accept-charset=ISO-8859-1%2Cutf-8%3Bq%3D0.7%2C*%3Bq%3D0.7&user-agent=W3C_Validator%2F1.2
N/A
http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html
http://blizzard.ist.unomaha.edu/~asatterfield/projectammon/art.html
http://blizzard.ist.unomaha.edu/~jhperez/a5/index.html
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~ppickett/project/research.html
http://blizzard.ist.unomaha.edu/~jhperez/a4/index.html
N/A
N/A
N/A
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
N/A
http://blizzard.ist.unomaha.edu/~lschaller/a5/
N/A
http://blizzard.ist.unomaha.edu/~jdfitzpatrick/a4/otherpage.html
http://blizzard.ist.unomaha.edu/~handersen/a5/index.html
N/A
http://blizzard.ist.unomaha.edu/~bmmurray/a5/
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~scarmody/a5/otherpage.html
http://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.html
http://blizzard.ist.unomaha.edu/1300-2-css/
http://blizzard.ist.unomaha.edu/~gcnielsen/a4/index.html
http://blizzard.ist.unomaha.edu/~gfosmer/a5/
N/A
http://myuno.unomaha.edu/webapps/blackboard/content/listContentEditable.jsp?content_id=_2151076_1&course_id=_147587_1
N/A
http://blizzard.ist.unomaha.edu/~ksebastian/a5/
N/A
http://blizzard.ist.unomaha.edu/~rbeasley/a4/otherpage.html
N/A
N/A
http://blizzard.ist.unomaha.edu/~drgulick/a5/otherpage.html
N/A
http://blizzard.ist.unomaha.edu/~jrauscher/a7/
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
http://blizzard.ist.unomaha.edu/~gcnielsen/a3/index.html
http://blizzard.ist.unomaha.edu/~jrauscher/a4/otherpage.html
http://blizzard.ist.unomaha.edu/~adubey/project/album.html
N/A
N/A
http://blizzard.ist.unomaha.edu/~jblackmore/
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.html
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~jwfitzpatrick/a4/otherpage.html
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
http://blizzard.ist.unomaha.edu/~jwfitzpatrick/brew/index.html
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~jdabestani/a4/
http://blizzard.ist.unomaha.edu/~jdabestani/a4/
http://blizzard.ist.unomaha.edu/~lschaller/a5/
http://blizzard.ist.unomaha.edu/images/style.css
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~handersen/a3/
N/A
N/A
http://blizzard.ist.unomaha.edu/1300-1-xhtml/
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~lschaller/a4/index.html
N/A
http://blizzard.ist.unomaha.edu/~dpinkerton/a3/
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
N/A
http://blizzard.ist.unomaha.edu/~dpollreis/a4/otherpage.html
N/A
N/A
blizzard.ist.unomaha.edu
 
sjklein42Commented:
Aha.  It looks like there aren't any separators between the lines at all.

The fix should be done where you are constructing $string7.

You should put a newline  "\n" at the end of each line.  

Or maybe the newlines were there and you are stripping them out?

If this is confusing, please post the code where you make the value for $string7
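
To illustrate why the separator matters, here is a small sketch comparing a string built with no separator against one built with "\n". The array name @records and its contents are stand-ins, since the real construction of $string7 is not shown at this point.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the list the string is built from.
my @records = ('http://blizzard.ist.una.edu/a/', '-', 'http://my.una.edu/b/');

my $glued   = join('',   @records);   # no separator: everything runs together
my $by_line = join("\n", @records);   # newline between records

my @from_glued   = split /[\r\n]+/, $glued;
my @from_by_line = split /[\r\n]+/, $by_line;

# Prints: glued: 1 record(s), by line: 3 record(s)
printf "glued: %d record(s), by line: %d record(s)\n",
    scalar @from_glued, scalar @from_by_line;
```

With the glued string, the counting loop sees a single record, which would match the "1  100.00" output reported earlier.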
 
sjklein42Commented:
Your second output posting looks different than the first one.   The second one has separate lines.  On the first one they were all strung together.

The second one looks good.  Does the program work with that input?
 
sjklein42Commented:
When I run your second set of data through it I get this output:

perl test.pl test.dat
    56  55.45   N/A
    44  43.56   unomaha.edu
     1  0.99    w3.org

 
fac66Author Commented:

The below code is where I created $string7.
I had to replace all the empty lines with N/A

Here is @ref pulled from an Apache log:

http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html
http://blizzard.ist.unomaha.edu/~asatterfield/projectammon/art.html
http://blizzard.ist.unomaha.edu/~jhperez/a5/index.html
-
-
-
-
-
-
-
-
-
http://blizzard.ist.unomaha.edu/~ppickett/project/research.html
http://blizzard.ist.unomaha.edu/~jhperez/a4/index.html
-
-
-
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
-
http://blizzard.ist.unomaha.edu/~lschaller/a5/
-
http://blizzard.ist.unomaha.edu/~jdfitzpatrick/a4/otherpage.html
http://blizzard.ist.unomaha.edu/~handersen/a5/index.html

The code below replaced every blank line with N/A:
Does that help?
my $string7 = join('',@ref);
foreach $string7 (@ref)
{
if ( ! ( $string7 =~ /^http\:\/\// ) ) { $string7 = 'N/A'; }
 }
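
A possible explanation for the two different printouts, offered as a sketch rather than a diagnosis: a foreach loop variable is an alias for each array element, so assigning to it rewrites @ref in place, while a string joined earlier with '' keeps the old values and has no separators. The sample data and the name $entry below are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample of referrer values, with "-" for a missing referrer.
my @ref = ('http://blizzard.ist.unomaha.edu/~lschaller/a5/', '-', '-');

# The loop variable aliases each element, so assigning to it changes @ref.
foreach my $entry (@ref) {
    $entry = 'N/A' unless $entry =~ m{^http://};
}

# Prints the URL followed by two N/A lines.
print join("\n", @ref), "\n";
```

So after a loop like this, @ref itself already holds the cleaned-up list, and it can be processed directly.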

 
fac66Author Commented:
How did you get this?

perl test.pl test.dat
    56  55.45   N/A
    44  43.56   unomaha.edu
     1  0.99    w3.org
 
sjklein42Commented:
This is much easier - just loop through @ref directly - no need for $string7.

my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
foreach $_ (@ref)
{
    s/[\r\n]//g;
    if ( /http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
		$totRefCount++;
    }
    elsif ( $_ ne '' )
    {
        $refCount{$_}++;
		$totRefCount++;
    }
}

foreach $domain (keys(%refCount))
{
    $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}

print sort {$b cmp $a} @lines; 
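
For anyone who wants the same logic under use strict and use warnings, here is one way it might be restructured. This is only a sketch: the sub name tally_referrers and the sample input are invented for the example, and it sorts on the counts themselves instead of on the formatted lines.

```perl
use strict;
use warnings;

# Hypothetical helper: tally referrers by base domain; returns (\%count, $total).
sub tally_referrers {
    my @ref = @_;
    my (%count, $total);
    foreach my $entry (@ref) {
        (my $line = $entry) =~ s/[\r\n]//g;   # copy, then strip line endings
        next if $line eq '';
        my $key = $line;                      # non-URL lines (e.g. N/A) count as-is
        if ($line =~ m{http://([^/]+)}) {
            my $host = $1;
            # Last two labels of the host, or the whole host if it has no dot.
            $key = $host =~ /([^.]+\.[^.]+)$/ ? $1 : $host;
        }
        $count{$key}++;
        $total++;
    }
    return (\%count, $total || 0);
}

my ($count, $total) = tally_referrers(
    'http://blizzard.ist.una.edu/~lschaller/a5/',
    'N/A',
    'http://validator.w3.org/check',
    'N/A',
);

# Sort by count, highest first, breaking ties by name.
foreach my $key (sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count) {
    printf "%6d\t%6.2f\t%s\n", $count->{$key}, $count->{$key} / $total * 100, $key;
}
```

Sorting numerically on the hash values avoids relying on the %6d padding, which only keeps a string sort in the right order while the counts stay within six digits.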

 
fac66Author Commented:

Excellent!!
Thank you very much sir!