Solved

Perl regex help

Posted on 2011-03-06
Last Modified: 2012-05-11
Question by:fac66
19 Comments
 
LVL 16

Expert Comment

by:sjklein42
ID: 35049917
while ( <> )
{
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $refCount{$1}++;
    }
}

foreach $domain (keys(%refCount))
{
    push @lines, sprintf("%6d", $refCount{$domain}) . "\t" . $domain . "\n";
}

print sort {$b cmp $a} @lines;

 
LVL 16

Expert Comment

by:sjklein42
ID: 35049995
This version only shows the domain (not subdomain) and includes N/A line:

while ( <> )
{
    s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }
}

foreach $domain (keys(%refCount))
{
    push @lines, sprintf("%6d", $refCount{$domain}) . "\t" . $domain . "\n";
}

print sort {$b cmp $a} @lines;
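
As a minimal sketch of the two-step match used above: the first regex captures the host part of the URL, and the second keeps only its last two dot-separated labels. The helper name `base_domain` is hypothetical, and the two-label rule is a simplification (it would reduce a host such as `example.co.uk` to `co.uk`).

```perl
use strict;
use warnings;

# Hypothetical helper: return the last two labels of the host in a URL,
# or undef when the line is not an http:// URL (for example 'N/A').
sub base_domain {
    my ($url) = @_;
    return undef unless $url =~ m{^http://([^/]+)};
    my $host = $1;
    # Keep only "name.tld"; fall back to the whole host if it has no dot.
    return $host =~ /([^.]+\.[^.]+)$/ ? $1 : $host;
}

print base_domain('http://blizzard.ist.una.edu/~lschaller/a5/'), "\n";   # una.edu
```

Using `m{...}` as the delimiter avoids having to escape every slash in the pattern.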

 
LVL 16

Expert Comment

by:sjklein42
ID: 35050032
With percentages and header.

print join("\t", ' Hits ', '%-Age', 'Resource') . "\n";
print join("\t", '------', '-----', '--------') . "\n";

while ( <> )
{
    s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }

    $totRefCount++;
}

foreach $domain (keys(%refCount))
{
    $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}

print sort {$b cmp $a} @lines;

 

Author Comment

by:fac66
ID: 35052159
Not following you.
I got a string with the following data:

http://my.una.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
N/A
http://blizzard.ist.una.edu/~lschaller/a5/
N/A
http://blizzard.ist.una.edu/~jdfitzpatrick/a4/otherpage.html
http://blizzard.ist.una.edu/~handersen/a5/index.html
N/A
http://blizzard.ist.una.edu/~bmmurray/a5/

Should I begin like this?
Or should I copy into an array?
while ( $string )
{
    s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;

 
LVL 16

Expert Comment

by:sjklein42
ID: 35052260
Ok. I thought it was in a file.

If the input is in one big long string called $string, with embedded newlines between the lines,

replace this line:

while ( <> )

with this

foreach $_ (split(/[\r\n]+/, $string))
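
A small sketch of that substitution, assuming the data really does sit in one scalar with newline separators. The sample data and the sub name `split_records` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: three records separated by line endings (one is CRLF).
my $string = "http://my.una.edu/a\r\nN/A\nhttp://blizzard.ist.una.edu/b\n";

# Split on runs of CR/LF; grep drops the empty field that a leading
# newline would otherwise produce.
sub split_records {
    my ($text) = @_;
    return grep { length } split /[\r\n]+/, $text;
}

foreach my $line (split_records($string)) {
    print "[$line]\n";
}
```

If the string contains no line endings at all, `split` returns it as a single record, which would show up as one hit at 100%.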

 

Author Comment

by:fac66
ID: 35052673
This is how I have it..
This is what it prints:

1  100.00  -http://blizzard.ist.una.edu/~jdabestani/a5/http://blizzard.ist.una.edu/~fackermann/a5/index.html-http://my.una.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147595_1&mini=Yhttp://blizzard.ist.una.edu/~jrauscher/a7/http://blizzard.ist.una.edu/~dtwilliams/a5/index.htmlhttp://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.una.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+a

Did I configure it correctly?
my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
 foreach $_ (split(/[\r\n]+/, $string7))
{
   s/[\r\n]//g;
    if ( /^http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }

        $totRefCount++;
}

foreach $domain (keys(%refCount))
{
        $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}
print sort {$b cmp $a} @lines;

 
LVL 16

Expert Comment

by:sjklein42
ID: 35052933
It looks like there is a '-' at the beginning of the URL.  This version should handle it.

There are newline characters between the records in $string7, right?

my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
 foreach $_ (split(/[\r\n]+/, $string7))
{
   s/[\r\n]//g;
    if ( /http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
    }
    else
    {
        $refCount{$_}++;
    }

        $totRefCount++;
}

foreach $domain (keys(%refCount))
{
        $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}
print sort {$b cmp $a} @lines;

 
LVL 16

Expert Comment

by:sjklein42
ID: 35053174
Just so we're on the same page, this is how I think you're calling it.  Maybe I'm wrong about the input:


$string7 = q{
http://blizzard.ist.una.edu/~dtwilliams/a5/index.html
http://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.unomaha.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+automatically%29&doctype=Inline&group=0&accept=text%2Fhtml%2Capplication%2Fxhtml%2Bxml%2Capplication%2Fxml%3Bq%3D0.9%2C*%2F*%3Bq%3D0.8&accept-language=en-us%2Cen%3Bq%3D0.5&accept-charset=ISO-8859-1%2Cutf-8%3Bq%3D0.7%2C*%3Bq%3D0.7&user-agent=W3C_Validator%2F1.2
N/A
http://blizzard.ist.una.edu/~fackermann/a5/index.html
http://blizzard.ist.una.edu/~asatterfield/projectammon/art.html
http://blizzard.ist.una.edu/~jhperez/a5/index.html
N/A
N/A
N/A
};


my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
 foreach $_ (split(/[\r\n]+/, $string7))
{
    s/[\r\n]//g;
    if ( /http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
        $totRefCount++;
    }
    elsif ( $_ ne '' )
    {
        $refCount{$_}++;
        $totRefCount++;
    }
}

foreach $domain (keys(%refCount))
{
    $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}

print sort {$b cmp $a} @lines;




and the output:
c:\temp>perl foob.pl
     4  44.44   una.edu
     4  44.44   N/A
     1  11.11   w3.org

 

Author Comment

by:fac66
ID: 35053192
Thanks for your help.

Getting real close but it prints only 1 line.

Hits   %-Age   Resource
------  -----   --------
     1  100.00  una.edu

Also need to account for the N/A
For example:
 Hits  %-age   Resource
  ----  -----   --------
    56  55.45   N/A
    44  43.56   una.edu
     1   0.99   w3.org

 
LVL 16

Expert Comment

by:sjklein42
ID: 35053265
Something is different about the input.

How are the lines separated?

Please add this to the code and then show me the output it generates.

print "\n------------\n$string7\n---------------\n";

 

Author Comment

by:fac66
ID: 35053284
This is the output:



------------
-http://blizzard.ist.unomaha.edu/~jdabestani/a5/http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html-http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147595_1&mini=Yhttp://blizzard.ist.unomaha.edu/~jrauscher/a7/http://blizzard.ist.unomaha.edu/~dtwilliams/a5/index.htmlhttp://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.unomaha.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+automatically%29&doctype=Inline&group=0&accept=text%2Fhtml%2Capplication%2Fxhtml%2Bxml%2Capplication%2Fxml%3Bq%3D0.9%2C*%2F*%3Bq%3D0.8&accept-language=en-us%2Cen%3Bq%3D0.5&accept-charset=ISO-8859-1%2Cutf-8%3Bq%3D0.7%2C*%3Bq%3D0.7&user-agent=W3C_Validator%2F1.2-http://blizzard.ist.unomaha.edu/~fackermann/a5/index.htmlhttp://blizzard.ist.unomaha.edu/~asatterfield/projectammon/art.htmlhttp://blizzard.ist.unomaha.edu/~jhperez/a5/index.html---------http://blizzard.ist.unomaha.edu/~ppickett/project/research.htmlhttp://blizzard.ist.unomaha.edu/~jhperez/a4/index.html---http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y-http://blizzard.ist.unomaha.edu/~lschaller/a5/-http://blizzard.ist.unomaha.edu/~jdfitzpatrick/a4/otherpage.htmlhttp://blizzard.ist.unomaha.edu/~handersen/a5/index.html-http://blizzard.ist.unomaha.edu/~bmmurray/a5/---http://blizzard.ist.unomaha.edu/~scarmody/a5/otherpage.htmlhttp://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.htmlhttp://blizzard.ist.unomaha.edu/1300-2-css/http://blizzard.ist.unomaha.edu/~gcnielsen/a4/index.htmlhttp://blizzard.ist.unomaha.edu/~gfosmer/a5/-http://myuno.unomaha.edu/webapps/blackboard/content/listContentEditable.jsp?content_id=_2151076_1&course_id=_147587_1-http://blizzard.ist.unomaha.edu/~ksebastian/a5/-http://blizzard.ist.unomaha.edu/~rbeasley/a4/otherpage.html--http://blizzard.ist.unomaha.edu/~drgulick/a5/otherpage.html-http://blizzard.ist.unomaha.edu/~jrauscher/a7/http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Yhttp://bl
izzard.ist.unomaha.edu/~gcnielsen/a3/index.htmlhttp://blizzard.ist.unomaha.edu/~jrauscher/a4/otherpage.htmlhttp://blizzard.ist.unomaha.edu/~adubey/project/album.html--http://blizzard.ist.unomaha.edu/~jblackmore/---http://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.html---http://blizzard.ist.unomaha.edu/~jwfitzpatrick/a4/otherpage.htmlhttp://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Yhttp://blizzard.ist.unomaha.edu/~jwfitzpatrick/brew/index.html---http://blizzard.ist.unomaha.edu/~jdabestani/a4/http://blizzard.ist.unomaha.edu/~jdabestani/a4/http://blizzard.ist.unomaha.edu/~lschaller/a5/http://blizzard.ist.unomaha.edu/images/style.css---http://blizzard.ist.unomaha.edu/~handersen/a3/--http://blizzard.ist.unomaha.edu/1300-1-xhtml/---------http://blizzard.ist.unomaha.edu/~lschaller/a4/index.html-http://blizzard.ist.unomaha.edu/~dpinkerton/a3/http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y-http://blizzard.ist.unomaha.edu/~dpollreis/a4/otherpage.html--
---------------
 

Author Comment

by:fac66
ID: 35053289
If I do a:

print "$string7\n";

Here are the results:


N/A
http://blizzard.ist.unomaha.edu/~jdabestani/a5/
http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html
N/A
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147595_1&mini=Y
http://blizzard.ist.unomaha.edu/~jrauscher/a7/
http://blizzard.ist.unomaha.edu/~dtwilliams/a5/index.html
http://validator.w3.org/check?uri=http%3A%2F%2Fblizzard.ist.unomaha.edu%2F%7Ekholtz%2Fa5%2Fotherpage.html&charset=%28detect+automatically%29&doctype=Inline&group=0&accept=text%2Fhtml%2Capplication%2Fxhtml%2Bxml%2Capplication%2Fxml%3Bq%3D0.9%2C*%2F*%3Bq%3D0.8&accept-language=en-us%2Cen%3Bq%3D0.5&accept-charset=ISO-8859-1%2Cutf-8%3Bq%3D0.7%2C*%3Bq%3D0.7&user-agent=W3C_Validator%2F1.2
N/A
http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html
http://blizzard.ist.unomaha.edu/~asatterfield/projectammon/art.html
http://blizzard.ist.unomaha.edu/~jhperez/a5/index.html
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~ppickett/project/research.html
http://blizzard.ist.unomaha.edu/~jhperez/a4/index.html
N/A
N/A
N/A
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
N/A
http://blizzard.ist.unomaha.edu/~lschaller/a5/
N/A
http://blizzard.ist.unomaha.edu/~jdfitzpatrick/a4/otherpage.html
http://blizzard.ist.unomaha.edu/~handersen/a5/index.html
N/A
http://blizzard.ist.unomaha.edu/~bmmurray/a5/
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~scarmody/a5/otherpage.html
http://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.html
http://blizzard.ist.unomaha.edu/1300-2-css/
http://blizzard.ist.unomaha.edu/~gcnielsen/a4/index.html
http://blizzard.ist.unomaha.edu/~gfosmer/a5/
N/A
http://myuno.unomaha.edu/webapps/blackboard/content/listContentEditable.jsp?content_id=_2151076_1&course_id=_147587_1
N/A
http://blizzard.ist.unomaha.edu/~ksebastian/a5/
N/A
http://blizzard.ist.unomaha.edu/~rbeasley/a4/otherpage.html
N/A
N/A
http://blizzard.ist.unomaha.edu/~drgulick/a5/otherpage.html
N/A
http://blizzard.ist.unomaha.edu/~jrauscher/a7/
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
http://blizzard.ist.unomaha.edu/~gcnielsen/a3/index.html
http://blizzard.ist.unomaha.edu/~jrauscher/a4/otherpage.html
http://blizzard.ist.unomaha.edu/~adubey/project/album.html
N/A
N/A
http://blizzard.ist.unomaha.edu/~jblackmore/
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~gfosmer/a4/otherpage.html
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~jwfitzpatrick/a4/otherpage.html
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
http://blizzard.ist.unomaha.edu/~jwfitzpatrick/brew/index.html
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~jdabestani/a4/
http://blizzard.ist.unomaha.edu/~jdabestani/a4/
http://blizzard.ist.unomaha.edu/~lschaller/a5/
http://blizzard.ist.unomaha.edu/images/style.css
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~handersen/a3/
N/A
N/A
http://blizzard.ist.unomaha.edu/1300-1-xhtml/
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
http://blizzard.ist.unomaha.edu/~lschaller/a4/index.html
N/A
http://blizzard.ist.unomaha.edu/~dpinkerton/a3/
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
N/A
http://blizzard.ist.unomaha.edu/~dpollreis/a4/otherpage.html
N/A
N/A
blizzard.ist.unomaha.edu
 
LVL 16

Expert Comment

by:sjklein42
ID: 35053297
Aha.  It looks like there aren't any separators between the lines at all.

The fix should be done where you are constructing $string7.

You should put a newline  "\n" at the end of each line.  

Or maybe the newlines were there and you are stripping them out?

If this is confusing, please post the code where you make the value for $string7
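
For instance, assuming the lines are held in an array before being combined (called `@records` here, a hypothetical name), appending "\n" to each element keeps the records separable later. The sub name `records_to_string` is also made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical input: one record per array element, no trailing newlines.
my @records = ('http://blizzard.ist.unomaha.edu/~lschaller/a5/', 'N/A', 'N/A');

# Build one string with a newline after every record. Joining with ''
# and no added newline would run all the records together.
sub records_to_string {
    my @items = @_;
    return join('', map { "$_\n" } @items);
}

my $string7 = records_to_string(@records);
print $string7;
```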
 
LVL 16

Expert Comment

by:sjklein42
ID: 35053307
Your second output posting looks different than the first one.   The second one has separate lines.  On the first one they were all strung together.

The second one looks good.  Does the program work with that input?
 
LVL 16

Expert Comment

by:sjklein42
ID: 35053319
When I run your second set of data through it I get this output:

perl test.pl test.dat
    56  55.45   N/A
    44  43.56   unomaha.edu
     1  0.99    w3.org

 

Author Comment

by:fac66
ID: 35053326

The below code is where I created $string7.
I had to replace all the empty lines with N/A

Here is @ref pulled from an Apache log:

http://blizzard.ist.unomaha.edu/~fackermann/a5/index.html
http://blizzard.ist.unomaha.edu/~asatterfield/projectammon/art.html
http://blizzard.ist.unomaha.edu/~jhperez/a5/index.html
-
-
-
-
-
-
-
-
-
http://blizzard.ist.unomaha.edu/~ppickett/project/research.html
http://blizzard.ist.unomaha.edu/~jhperez/a4/index.html
-
-
-
http://myuno.unomaha.edu/webapps/blackboard/content/courseMenu.jsp?course_id=_147588_1&mini=Y
-
http://blizzard.ist.unomaha.edu/~lschaller/a5/
-
http://blizzard.ist.unomaha.edu/~jdfitzpatrick/a4/otherpage.html
http://blizzard.ist.unomaha.edu/~handersen/a5/index.html

The code below replaced every blank line with N/A:
Does that help?
my $string7 = join('',@ref);
foreach $string7 (@ref)
{
if ( ! ( $string7 =~ /^http\:\/\// ) ) { $string7 = 'N/A'; }
 }

 

Author Comment

by:fac66
ID: 35053333
How did you get this?

perl test.pl test.dat
    56  55.45   N/A
    44  43.56   unomaha.edu
     1  0.99    w3.org
 
LVL 16

Accepted Solution

by:
sjklein42 earned 500 total points
ID: 35053348
This is much easier - just loop through @ref directly - no need for $string7.

my $domain;
my (%refCount,$totRefCount);
my $pct;
my @lines = ();
foreach $_ (@ref)
{
    s/[\r\n]//g;
    if ( /http\:\/\/([^\/]+)/ )
    {
        $domain = $1;
        $domain =~ /([^\.]+\.[^\.]+)$/;
        $refCount{$1}++;
        $totRefCount++;
    }
    elsif ( $_ ne '' )
    {
        $refCount{$_}++;
        $totRefCount++;
    }
}

foreach $domain (keys(%refCount))
{
    $pct = sprintf("%3.2f", $refCount{$domain} / $totRefCount * 100);
    push @lines, join("\t", sprintf("%6d", $refCount{$domain}), $pct, $domain) . "\n";
}

print sort {$b cmp $a} @lines; 
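
A possible tidier variant of the same idea, sketched under `use strict` and `use warnings`: it keeps the counts in the hash and sorts by count numerically instead of sorting the formatted strings. The sub name `tally_referrers` and the sample `@ref` contents are assumptions for illustration; counting `-` as `N/A` follows what the asker described wanting, not the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: count referrers by base domain. Anything that is
# not an http:// URL (such as '-') is counted under 'N/A'.
sub tally_referrers {
    my @refs = @_;
    my %count;
    my $total = 0;
    foreach my $line (@refs) {
        (my $ref = $line) =~ s/[\r\n]//g;    # copy, then strip line endings
        next if $ref eq '';                  # ignore empty records
        my $key = 'N/A';
        if ($ref =~ m{^http://([^/]+)}) {
            my $host = $1;
            $key = $host =~ /([^.]+\.[^.]+)$/ ? $1 : $host;
        }
        $count{$key}++;
        $total++;
    }
    return (\%count, $total);
}

# Made-up sample data standing in for the real @ref.
my @ref = ('http://blizzard.ist.unomaha.edu/~lschaller/a5/', '-', '-',
           'http://validator.w3.org/check?uri=x');
my ($count, $total) = tally_referrers(@ref);

# Highest count first; ties broken alphabetically for a stable report.
foreach my $key (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    printf "%6d\t%.2f\t%s\n", $count->{$key}, $count->{$key} / $total * 100, $key;
}
```

Sorting on the hash values avoids relying on the fixed-width `%6d` padding to make a string comparison behave like a numeric one.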

 

Author Comment

by:fac66
ID: 35053381

Excellent!!
Thank you very much sir!
