Solved

How to parse a file with Perl

Posted on 2008-10-06
Last Modified: 2012-08-13
Hi,
  I am trying to parse a message file and extract the messages that meet certain criteria.  I am looking for certain types of messages that are logged in the file, for example the ones with the code CA, DC, OP, etc.  The code is located right after the ORC segment name (see the example below).  I only want to pull out a message when the two-letter code is found between the pipes, usually on the ORC line.  It should match the |CA| in ORC and not treat the CA in CAMPING as a match.

2nd part:  How can I pull out a message that falls between two Ignored: lines if the line after Ignored: is *** Message Ignored, Incorrect Order Type?  I want everything that follows the first Ignored: up to the next Ignored:.  My code can pull out info, but it also matches the CA in CAMPING.  I am also counting each occurrence, and the count should include only the |CA| instances, so that part does not work correctly either.

Ignored:
*** Message Ignored, Incorrect Order Type
MSH|^~\&|RP|GJH|ALL|ASD|20080922001707||RDE^O01|20080922001707008697|P|2.3|||||||||
PID|||1234567^^^^^||Person||number|L||||||||||||||||||||||||||||||
PV1||H|B12^B1266^A^K||||100022^James^James,|||||||||||I|947258034|||||||||||||||||||||||||||||||||
ORC|CA|In63708710|947258034-24-1|20315933|||1^BID&0800,2000^INDEF^200809220010^^R^^11111110^||200809220017|1965435^FFR^SHN|1965435^FFR^SHN|100022^James^James,|||||||196u86735^FFR^SHN
RXO|5132^Product^SEQNO||||||||||||||||
RXE|1^BID&0800,2000^INDEF^200809220010^^R^^11111110^|8246300^CAMPING|50||mg|TAP|^|||1|EACH||100022||||||||VEND|||||||||F|24240000^Description|||||||||||||M| ^
RXG|1||1^20080922001000
RXG|2||1^20080922080000

Ignored:
*** Message Ignored, Incorrect Order Type:
MSH|^~\&|RX|GJH|ALL_..........
Next message in similar structure as above....


Any help would be really appreciated.
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;

# Copy every line that contains CA as a whole word to resultsCA.txt
open(INFILE,  "rxp.ign")        or die "Can't open rxp.ign: $!";
open(OUTFILE, ">resultsCA.txt") or die "Can't open resultsCA.txt: $!";

while (<INFILE>) {
    if (/\bCA\b/ig) {
        print OUTFILE $_;
    }
}

close OUTFILE;
close INFILE;

# Count the occurrences of CA in a second file
my $file = "rxp2.ign";
my $cnt  = 0;

open(HNDL, $file) || die "wrong filename";
while (my $line = <HNDL>) {
    while ($line =~ /\bCA\b/ig) {
        ++$cnt;
    }
}
close HNDL;

print "Number of instances of 'CA's' found: $cnt\n\n";


Question by:simadownnow

Expert Comment

by:ahoffmann
awk '/^Ignored:/{if(f==1){print x};f==0;}/^ORC\|(CA|DC|OP)\|/{f=1}{x=sprintf("%s\n%s",x,$0)}END{if(f==1){print x}}'  you-file

Author Comment

by:simadownnow
I've never used the awk command before because I'm new to Perl.  Could you explain a little better what this is doing?  Do I need to declare any variables, awk?  Also when you write you-file, does that mean my file name that I want to parse?  rxp.ign?  in quotes or anything?  Does this also count the instances of each type of message CA, DC, OP?  

Expert Comment

by:ahoffmann
> .. explain a little
collect every line that is read
set a flag when an ORC line with one of the codes (CA, DC, OP) is found
when the string Ignored: is found at the beginning of a line, print the collected lines if the flag is set, then clear the flag
do the same once more after reading the file, as the last message has no Ignored: line after it

> Do I need to declare any variables,
no, as all variables are 0 (integer) or '' (empty string) by default

> does that mean my file name that I want to parse?
yes

> in quotes or anything?
depends on your shell (e.g. without quotes if the filename does not contain meta characters)

> Does this also count the instances of each type of message CA, DC, OP?
no
to do that, use something like:

awk '/^Ignored:/{if(f==1){print x};f=0;x=""}/^ORC\|(CA|DC|OP)\|/{f=1}{x=sprintf("%s\n%s",x,$0);if(/\|CA\|/){c++};if(/\|DC\|/){d++};if(/\|OP\|/){o++}}END{if(f==1){print x};print "CA: ",c;print "DC: ",d;print "OP: ",o}'  rxp.ign

(note that my first post contains an error: f==0 must be f=0)

---
that's quick&dirty with awk; if you need more text processing, it's probably better to start with perl right away

Author Comment

by:simadownnow
This log file is created on a Windows box, so I am using ActivePerl for Windows XP, which is what my workstation runs.
Thanks for the quick reply and explanation.  I don't know how to insert this in my perl code and execute the awk command.  I've tried but I get errors.  Should there be a BEGIN statement to go with the END that is in the code?  

Expert Comment

by:ahoffmann
the same as perl code (a quick&dirty conversion from awk):

perl -ane 'm/^Ignored:/&&do{if($f==1){print $x};$f=0;};m/^ORC\|(CA|DC|OP)\|/&&do{$f=1};{$x.=$_;if(/\|CA\|/){$c++};if(/\|DC\|/){$d++};if(/\|OP\|/){$o++}}END{if($f==1){print $x};print "CA: ",$c;print ", DC: ",$d;print ", OP: ",$o;}' rxp.ign

Author Comment

by:simadownnow
Hoffman,
  What is the -ane in the command line?  Usually when I run a perl script I run it from the command line, i.e. perl whatever.pl.  Have you got this to parse the example I put up?  You can duplicate the message back to back to increase the message instances.  I can't get this to run, and I am probably executing it wrong.  Man I am a layman with PERL, I need to pick up a book.  Sorry to keep asking, this must seem mundane to you..

Expert Comment

by:ahoffmann
> Have you got this to parse the example I put up?
simply put everything between the single quotes ' into your .pl file and execute it

> I can't get this to run ..
are you on unreliable systems like windoze? bad luck, you have to use a file for the script or fiddle around with M$'s strange handling of any kind of quotes.
Get any reliable shell and it works as posted. or use a script file. Sorry, I'm not responsible for stupid systems :)

> Man I am a layman with PERL,
we fix that ;-)

> What is the -ane ...
man perl
man perlrun

-a  autosplit mode (awk-like: each line is split into the @F array)
-n  loop over the input lines without printing them
-e  execute these commands

or more detailed (shamelessly stolen from perl's man-pages):

  -a   turns on autosplit mode when used with a -n or -p.  An implicit split command to the @F array is done as
       the first thing inside the implicit while loop produced by the -n or -p.

  -e commandline
       may be used to enter one line of program.  If -e is given, Perl will not look for a filename in the
       argument list.  Multiple -e commands may be given to build up a multi-line script.  Make sure to use
       semicolons where you would in a normal program.

  -n   causes Perl to assume the following loop around your program, which makes it iterate over filename
       arguments somewhat like sed -n or awk:

         LINE:
           while (<>) {
               ...             # your program goes here
           }

       Note that the lines are not printed by default.  See -p to have lines printed.  If a file named by an
       argument cannot be opened for some reason, Perl warns you about it and moves on to the next file.

       Here is an efficient way to delete all files older than a week:

           find . -mtime +7 -print | perl -nle unlink

       This is faster than using the -exec switch of find because you don't have to start a process on every
       filename found.  It does suffer from the bug of mishandling newlines in pathnames, which you can fix if
       you follow the example under -0.

       "BEGIN" and "END" blocks may be used to capture control before or after the implicit program loop, just
       as in awk.
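
To tie these switches back to the log in the question: a rough sketch of what -n and -a do when written out by hand, splitting on the pipe instead of on whitespace. Comparing whole fields is one way to make sure only the code right after ORC counts, so the CA inside CAMPING is never considered. The helper name orc_code and the shortened sample lines are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the order code (second field) of an ORC line, or undef for
# any other line. orc_code is a hypothetical helper for this sketch.
sub orc_code {
    my ($line) = @_;
    chomp $line;
    my @F = split /\|/, $line;    # roughly what -a fills in when splitting on |
    return unless @F > 1 && $F[0] eq 'ORC';
    return $F[1];
}

# Explicit version of the loop that -n wraps around a one-liner.
# The sample lines are shortened from the log in the question.
my @sample = (
    "ORC|CA|In63708710|947258034-24-1|20315933\n",
    "RXE|1^BID|8246300^CAMPING|50||mg\n",
);
for my $line (@sample) {
    my $code = orc_code($line);
    print "order code: $code\n" if defined $code;
}
```

Because the comparison is on complete fields, a line only qualifies when its first field is exactly ORC; what to do with the returned code (count it, filter on it) is left to the caller.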

Author Comment

by:simadownnow
I am still confused about how to get this to run.  I am using Windows XP, and that is what I need to run the script on.  Unfortunately I cannot use a more reliable shell.  I want to run the command against a file; I don't want to have to copy and paste data between quotes.  So I would like to run a command line such as perl extract.pl and have the script execute the commands from within extract.pl, which will then open the file with the data and parse it.  This is how I usually get commands to run.

So am I supposed to put  'm/^Ignored:/&&do{if($f==1){print $x};$f=0;};m/^ORC\|(CA|DC|OP)\|/&&do{$f=1};{$x.=$_;if(/\|CA\|/){$c++};if(/\|DC\|/){$d++};if(/\|OP\|/){$o++}}END{if($f==1){print $x};print "CA: ",$c;print ", DC: ",$d;print ", OP: ",$o;}' rxp.ign  in extract.pl so that it can run?


Expert Comment

by:ahoffmann
> .. would like to run a cmd line  such as perl (extract.pl)  ..
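simply write the code between the single quotes in your file (extract.pl) and run it like
  perl -n extract.pl rxp.ign
(the -n switch is still needed on the command line; it supplies the loop over the input lines that the one-liner relies on)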
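
If you would rather have a plain script that needs no switches, something along these lines could go into extract.pl. It is only a sketch of the same flag-and-collect idea, under the assumption that a message runs from one Ignored: line up to the next; unlike the one-liner it clears the collected text at every Ignored: line, and it counts only codes found on the ORC line. The sub name extract_messages is made up here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect the messages whose ORC line carries one of the wanted codes
# and count them per code. A message is assumed to run from one
# "Ignored:" line up to (not including) the next one.
sub extract_messages {
    my ($lines, @codes) = @_;
    my $wanted = join '|', map { quotemeta($_) } @codes;
    my (%count, @found);
    my ($buffer, $code) = ('', undef);

    # Keep the buffered message if its ORC line matched, then reset.
    my $flush = sub {
        if (defined $code) {
            push @found, $buffer;
            $count{$code}++;
        }
        ($buffer, $code) = ('', undef);
    };

    for my $line (@$lines) {
        $flush->() if $line =~ /^Ignored:/;
        $code = $1 if $line =~ /^ORC\|($wanted)\|/;
        $buffer .= $line;
    }
    $flush->();    # the last message has no "Ignored:" after it
    return (\@found, \%count);
}

# Read the file named on the command line: perl extract.pl rxp.ign
if (@ARGV) {
    open(my $in, '<', $ARGV[0]) or die "Can't open $ARGV[0]: $!";
    my @lines = <$in>;
    close $in;
    my ($found, $count) = extract_messages(\@lines, qw(CA DC OP));
    print @$found;
    print "$_: ", ($count->{$_} || 0), "\n" for qw(CA DC OP);
}
```

Anchoring the pattern at ^ORC\| means a CA somewhere else in the message (in a drug name, for instance) neither selects the message nor inflates the count.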

Author Comment

by:simadownnow
I did and I got this

C:\Logs>perl extract.pl rxp.ign
Useless use of a constant in void context at extract.pl line 6 (#1)
    (W void) You did something without a side effect in a context that does
    nothing with the return value, such as a statement that doesn't return a
    value from a block, or the left side of a scalar comma operator.  Very
    often this points not to stupidity on your part, but a failure of Perl
    to parse your program the way you thought it would.  For example, you'd
    get this if you mixed up your C precedence with Python precedence and
    said

        $one, $two = 1, 2;

    when you meant to say

        ($one, $two) = (1, 2);

    Another common error is to use ordinary parentheses to construct a list
    reference when you should be using square or curly brackets, for
    example, if you say

        $array = (1,2);

    when you should have said

        $array = [1,2];

    The square brackets explicitly turn a list value into a scalar value,
    while parentheses do not.  So when a parenthesized list is evaluated in
    a scalar context, the comma is treated like C's comma operator, which
    throws away the left argument, which is not what you want.  See
    perlref for more on this.

    This warning will not be issued for numerical constants equal to 0 or 1
    since they are often used in statements like

        1 while sub_with_side_effects();

    String constants that would normally evaluate to 0 or 1 are warned

Assisted Solution

by:ozo
ozo earned 20 total points
on a DOS shell command line, you would have to change the quotes:
perl -ane  "m/^Ignored:/&&do{if($f==1){print $x};$f=0;};m/^ORC\|(CA|DC|OP)\|/&&do{$f=1};{$x.=$_;if(/\|CA\|/){$c++};if(/\|DC\|/){$d++};if(/\|OP\|/){$o++}}END{if($f==1){print $x};print 'CA: ',$c;print ', DC: ',$d;print ', OP: ',$o;}" rxp.ign

Expert Comment

by:ahoffmann
> I did and I got this
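use the following as the first line of your script (the BEGIN block runs only once, even when -n loops the script over the input lines)

our ($c, $d, $o, $f, $x); BEGIN { $c = $d = $o = $f = 0; $x = ''; }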

Author Comment

by:simadownnow
Okay, I ran the exact command that ozo posted and added the declarations that ahoffmann put up.  It works for counting the instances, but I want to be able to extract each of those types of messages once the locator field is found, as I mentioned in the question, and put them in a file for each type, so CA.txt, DC.txt.  If a CA occurs I want the entire message, from the end of the first Ignored: to the beginning of the next Ignored:, if that makes it easy enough.  I also don't want it to pull out CA if it finds it inside a word within the message, like "CALL", which I believe this is doing, just like the script I wrote and pasted.  Hopefully we can make the searching and matching a little stricter.  Almost there....

Author Comment

by:simadownnow
What if I only wanted the second line of each message, and possibly a count of each, i.e.:
*** Message Ignored, Incorrect Order Type:
*** Message Ignored, Multi-component order not supported.
*** Message Ignored, Incorrect Order Type: OP
*** Message Ignored, Incorrect Order Type: NW
*** Message Ignored, Incorrect Order Type: DC

# of OP messages = $OP

So would it be     if(/\|***Message Ignored,Incorrect Order Type: OP\|/){$OP++};
I think the *** creates issues because * is a wildcard; I am not sure how to use it as a literal search character in Perl.

I would like it to capture anything that appears after the *** on the line after Ignored: and before the MSH on the next line.  If these could also be totalled, that would be great as well.  This may help me change the script to what I need in other cases.

Expert Comment

by:ahoffmann
escape each * with a backslash, and match the spaces as they appear in your log:

if(/^\*\*\* Message Ignored, Incorrect Order Type: OP/){$OP++};
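
To total every kind of reason line without writing one if per type, a hash keyed on the text after the *** is one possibility. This is a sketch only: count_reasons is a made-up name, and the sample lines are shortened from the ones posted in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count each distinct "*** ..." reason line. The literal asterisks are
# escaped; \s* tolerates a missing space after them and trailing
# whitespace (including a stray carriage return) at the end.
sub count_reasons {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        next unless $line =~ /^\*\*\*\s*(.+?)\s*$/;
        $count{$1}++;
    }
    return %count;
}

# Sample lines shortened from the thread.
my %count = count_reasons(
    "Ignored:\n",
    "*** Message Ignored, Incorrect Order Type: OP\n",
    "MSH|^~\\&|RP|GJH|ALL|ASD\n",
    "Ignored:\n",
    "*** Message Ignored, Multi-component order not supported.\n",
    "Ignored:\n",
    "*** Message Ignored, Incorrect Order Type: OP\n",
);
print "$count{$_}  $_\n" for sort keys %count;
```

The captured text is whatever follows the asterisks, so new reason types show up in the totals without any change to the script.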

Author Comment

by:simadownnow
Okay, I am going to work this in to see if it helps.  So let's say this line matches, so the IF statement is TRUE and it increments OP by one: can I put the message that follows into an OP file and append all other OP messages when they match that statement, so from the MSH to the end of that message or the next Ignored:?

Expert Comment

by:ahoffmann
> .. can I put the message that follows into a OP file ..
assuming you opened that file at the beginning, something like:

 open(OP, ">>path/to/OP-file")||die"ERROR: cannot open OP-file: $!";

then you can write to that file like:
   print OP $_; # assuming that $_ contains your current line

Author Comment

by:simadownnow
But how does it know where to start the message and end the message for output into the OP file after it matches the current line I am searching on?

Accepted Solution

by:ahoffmann
ahoffmann earned 230 total points
> .. how does it know where to start the message and end the message for output
it does not know; it prints just one line (the one which contains the match)
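
To write the whole message instead of a single line, one option is to read the log in one go, cut it at every Ignored: line, and append each piece to a file named after its ORC code. The following is a sketch under assumptions: the names split_messages, write_by_code and split.pl are invented, the output files are named CODE.txt as asked for earlier in the thread, and any two capital letters after ORC| are treated as a code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split the log text into messages; each one starts at an "Ignored:" line.
sub split_messages {
    my ($text) = @_;
    return grep { /^Ignored:/ } split /^(?=Ignored:)/m, $text;
}

# Append each message to "<dir>/<code>.txt", where <code> is the
# two-letter field after ORC. Messages without an ORC line are skipped.
# Returns how many messages were written per code.
sub write_by_code {
    my ($dir, @messages) = @_;
    my %written;
    for my $msg (@messages) {
        next unless $msg =~ /^ORC\|([A-Z]{2})\|/m;
        my $code = $1;
        open(my $out, '>>', "$dir/$code.txt")
            or die "ERROR: cannot open $dir/$code.txt: $!";
        print {$out} $msg;
        close $out;
        $written{$code}++;
    }
    return %written;
}

# Usage: perl split.pl rxp.ign   (split.pl is a hypothetical script name)
if (@ARGV) {
    open(my $in, '<', $ARGV[0]) or die "Can't open $ARGV[0]: $!";
    my $text = do { local $/; <$in> };    # slurp the whole file
    close $in;
    my %written = write_by_code('.', split_messages($text));
    print "$_: $written{$_}\n" for sort keys %written;
}
```

Since the files are opened in append mode, delete old CA.txt, DC.txt and so on before a fresh run, or the messages pile up across runs.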

Author Comment

by:simadownnow
thanks for the help!