[Okta Webinar] Learn how to a build a cloud-first strategyRegister Now

x
?
Solved

How to fix this parser bug in my PERL code?

Posted on 2011-10-09
23
Medium Priority
?
290 Views
Last Modified: 2012-05-12
This is the code that I use:

sub submitFileParser ($) {
	my $filename = shift;
	my @paragraphs;
#	local($/) = '';
	open( FILE, "< $filename" ) or die "Can't open $filename : $!";
	@paragraphs = <FILE>;
	close FILE;
	return read_paragraphs (@paragraphs);
}

sub read_paragraphs (@) {
	# read lines as parameters
	my @rippedParagraphs = @_;
	my @submits = ();
	# Storage for all sections
	# Temporary storages for single section of each type
	my (@Files, @CR, @RR, @CS, @Options, @Mailsent, @file_info);
	# Flags for file traversal logic
	my ($opt_flag, $file_flag, $mail_sent_to_flag);

	my $submitFileExist = 0;
	#read the file
	for ( @rippedParagraphs ) {
	    s/\r//g;
		if (/^USER=(\S+)\,/) {
			#obtain the login from USER=
			$geckLogin = $1;
		}
		if (/^\s*Submit\s+file\s*$/) {
			# We record the accumulated data:
			push(
				@submits,
				{
					"Options"              => [@Options],
					"FileInfo"             => [@file_info],
					"GeckLogin" 		   => $geckLogin,
					"SubmitFileExistFlag"     => $submitFileExist,
					"Mail sent to"         => [@Mailsent],
				}
			) if @file_info;
			@Options = @Mailsent = @file_info = ();
			$submitFileExist = 1;
			next;
		}
		
		if ($submitFileExist == 1) {
			if (/^\s*\=+\s*$/) {
				$submitFileExist++;
				$mail_sent_to_flag = 0;
			} 
			next;
		}
		if ($submitFileExist == 2) {
			if ($mail_sent_to_flag) {
				push(@Mailsent, $_);
				next;
			}
			#if (m|^\#\s*Sandbox\s+location\s*\:\s*\S*/sandbox/(.*?)/|) {
			#	# Match the login name in the submit file - if it has not
			#	# already been done
			#	$geckLogin ||= $1;
			#}
			# If we encounter a comment or empty string
			if (/^\#/ || !/\S/) {
				# we haven't encountered an option to start doing anything
				next unless $opt_flag || $file_flag;
				# If we're done with options, let's start reading file sections
				if ($opt_flag == 1) {
					$opt_flag = 0;
					$file_flag = 1;
				}
				elsif ($opt_flag > 1) {
					# Addresses the empty line within Options:
					$opt_flag--;
				}
				next;
			}
			if (/^Options/) {
				# We start reading options
				$opt_flag = 2;
				next;
			}
			if (/^Mail sent to/) {
				$mail_sent_to_flag = 1;
				push(@Mailsent, $_);
				next;
			}
			# Matching beginning of the line to determine the type of the string
			# and placing it in temporary storage
			/^R(R|elated\sRecords):\s*(.*\n)/ && push(@RR, $2) && goto CHECK;
			/^C(R|ode\sReviewer):\s*(.*\n)/ && push(@CR, $2) && goto CHECK;
			/^C(S|omments):\s*(.*\n)/ && push(@CS, $2) && goto CHECK;

			# General text is either files or options info, depending on the
			# value of the option flag
			$opt_flag ? push(@Options, $_) : push(@Files, $_);
			CHECK:
			if (@RR && @CR && @CS) {
				push(
					@file_info,
					{
						"Files"                => [@Files],
						"Comments"             => [@CS],
						"RelatedRecords"       => [@RR],
						"CodeReviewers"        => [@CR],
						"SubmitFileExistFlag"  => $submitFileExist,
					},
				);
				@Files = @CS = @RR = @CR = ();
			}
        }
	}
	
	if ($submitFileExist == 0) {
			my @noSubmitFileSubmits = ();
			$submitFileExist = 0; # two-line grammar didn't hold
			my $parsedData = parseWithoutSubmitFile(@rippedParagraphs);
			#submit file does not exist flag
			$submitFileExist = 0;
			push(
				@noSubmitFileSubmits,
				{
				"GeckLogin" 		   => $geckLogin,
				"ParsedData"		   => $parsedData,
				"SubmitFileExistFlag"  => $submitFileExist,
				"Cluster"              => $parsedData->{t},
				"JobID"                => $parsedData->{dollar_},
				"gLogFilesOption"      => exists $parsedData->{GLOGFILES},
				"gLogSbcheckOption"    => exists $parsedData->{GLOGSBCHECK},
				}
			) if $parsedData;
				return \@noSubmitFileSubmits;
		}
	push(
		@submits,
		{
			"Options"              => [@Options],
			"Mail sent to"         => [@Mailsent],
			"FileInfo"             => [@file_info],
			"GeckLogin" 		   => $geckLogin,
		}
	) if @file_info;
	return \@submits;
}


sub parseWithoutSubmitFile (@) {
	my $arg_flag = 0;
	my $parsedData = {};
	my $current_option = '';
		while (my $line = shift @_) {
		if ($arg_flag == 1) {
		if ($line =~ /^Currently (\$\_=.*)/) {
		local $_;
		eval "$1;";
		$parsedData->{dollar_} = $_;
		$arg_flag = 0;
		}
		elsif ($line =~ /^\s+\-(.*)/) {
		$current_option = $1;
		$parsedData->{$current_option} = undef;
		next;
		}
		elsif ($current_option && $line =~ /^\s+(.*)/) {
		$parsedData->{$current_option} = $1;
		$current_option = undef;
		}
		}
		else {
		if ($line =~ /^Original arguments:/) {
		$arg_flag = 1;
		next;
		}
		}
	}
return $parsedData;
}

Open in new window



This is the text file that I parse:

USER=testman, HOST=testman-deb6-64, ARCH=glnxa64
Revisions: /st/hub/share/apps/bat//share/mmit: 07/26-09:48:58; csubmitItem.pm: 2011/07/26-09:48:56
Original arguments:
        -t
        Atk
        -F
        20110914.submit
Currently $_='154551'

        main:/st/hub/share/apps/bat/bat2.15.17/share/../lib/csubmitCache.pm:44 called main::submissionHistory
        main:/st/hub/share/apps/bat/bat2.15.17/share/submit:3871 called main::CreateCacheFile

Current directory ($PWD) = /st/devel/sandbox/testman/Aslrtw
                Submit file
        ===========================
# Component        : Coder
# Sandbox location : /st/devel/sandbox/testman/Atk
# Submission for   : 2000
#
# Description:
#   Unlocking making changes
#
# Documentation impact:
#   None
#
# QE items:
#   None
#
# Type of change:
#   Unlocking making changes
#

# submit file for use with msubmit.  To use run the command
#      submit -F 24.submit
#   or use C-c C-c from emacs to run this command.
# "<a href='http://www-sandbox/testman/Atk/glnxa64'>/sandbox/testman/Atk_ests/glnxa64</a>"
# "No need for sbruntests: Interactive Tests Update"
Options:

-CJ "<a href='http://www-sandbox/testman/Atk_tests/glnxa64'>/sandbox/testman/Atk_tests/glnxa64</a>"
-nowrap
-subject "Unlocking making changes"
-KEYWORD1
-KEYWORD2

st/ert/variants/variants4.c
CR: testman2
RR: 123456
CS: Unlocking before making changes
and adding this line for this case

Mail sent to:
    st.devel.submit: Unlocking making changes
    Files:
    st/ert/variants/variants5.c

	
				Submit file
        ===========================
# Component        : Coder
# Sandbox location : /st/devel/sandbox/testman/Atk
# Submission for   : 2000
#
# Description:
#   Unlocking making changes
#
# Documentation impact:
#   None
#
# QE items:
#   None
#
# Type of change:
#   Unlocking making changes
#

# submit file for use with msubmit.  To use run the command
#      submit -F 14.submit
#   or use C-c C-c from emacs to run this command.
# "<a href='http://www-sandbox/testman/Atk_tests/glnxa64'>/sandbox/testman/Atk_tests/glnxa64</a>"
# "No need for sbruntests: Interactive Tests Update"
Options:

-CJ "<a href='http://www-sandbox/testman/Atk_tests/glnxa64'>/sandbox/testman/Atk_tests/glnxa64</a>"
-nowrap
-subject "Unlocking making changes"
-KEYWORD1
-KEYWORD2

st/ert/variants/variants5.c
st/ert/variants/variants6.c
CS: Unlocking before making changes
and adding this line for this case
CR: testman2
RR: 333333, 444444

st/ert/variants/variants7.c
CR: testman2
Comments: Unlocking before making changes
and adding this line for this case
RR: 555555, 666666

Mail sent to:
    st.devel.submit: Unlocking making changes
    Files:
    st/ert/variants/variants5.c

Open in new window



The problem is; this code only detects the first line of CS  (or Comments) line.

The expected output is:

Unlocking before making changes
and adding this line for this case

Open in new window


How can I change this code to detect any number of lines belongs of CS or Comments tag?
0
Comment
Question by:Tolgar
  • 14
  • 9
23 Comments
 

Author Comment

by:Tolgar
ID: 36940530
I think I have an idea but I don't know how to code it correctly:

If we look at line 92:

/^C(S|omments):\s*(.*\n)/ && push(@CS, $2) && goto CHECK;

Open in new window


Each match is upto the new line character.

If we change the new line character for this CS line with the following list then the problem can be solved:

(RR: or Related Records:) or (CR: or Code Reviewer:) or (Mail sent to:)

So I propose something like this:

/^C(S|omments):\s*(.{*(RR: or Related Records:) or (CR: or Code Reviewer:) or (Mail sent to:)})/ && push(@CS, $2) && goto CHECK;

Open in new window



How can I code this in Perl?

Thanks,
0
 
LVL 9

Expert Comment

by:parparov
ID: 36942742
This will wreck everything since you will lose RRs and CRs.
You asked a misleading question in the other thread, because it solves a partial problem on expense of creating another one.

As I said, we need to expand our flag set, install one indicating parsing a CS entry and proceed according to it.
0
 

Author Comment

by:Tolgar
ID: 36942908
@parparov:
Why do we lose RRs and CRs? I think it is because, we parse it line by line. Am I right?

On the other hand, are you gonna be able to expand the flag set?

Thanks,
0
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
LVL 9

Expert Comment

by:parparov
ID: 36943086
Because you're pushing into @CS once you match either RR, CS, or CR without distinguishing what you actually matched.
0
 

Author Comment

by:Tolgar
ID: 36943127
What if we push something into CS that comes after CS line, if this line is not a CR: (or Code Reviewer:) or RR: (Related Records: ) or Mail sent to: line

Does this work?

What do you think?

Thanks,
0
 

Author Comment

by:Tolgar
ID: 36944367
@parparov: I wonder, if you will be able to work on it. I would appreciate if you can let me know.

Thanks,
0
 
LVL 9

Expert Comment

by:parparov
ID: 36944528
Still thinking
0
 

Author Comment

by:Tolgar
ID: 36944561
ok. Thanks.

I will wait for your reply.
0
 
LVL 9

Expert Comment

by:parparov
ID: 36944631
There is a problem with the grammar.

Consider a multiline CS statement at the end of CS/CR/RR block.
Consider an absent Mail sent to block
Now, how would we distinguish a continuing CS statement from another file entry within the same Submit File upper block.
Like this:
file/path1
CR: cr1 cr1
RR: rr1 rr1
CS: cs1
cs1
cs1
file/path2
file/path3
CR: cr1 cr2
RR: rr2 rr2
CS:cs2
cs/in/form/of/file/path/to/make/life/harder
...

Open in new window

0
 

Author Comment

by:Tolgar
ID: 36944689
file/path1
CR: cr1 cr1
RR: rr1 rr1
CS: cs1
cs1
cs1

file/path2
file/path3
CR: cr1 cr2
RR: rr2 rr2
CS:cs2
cs2

There is always a single blank line (with only one new line character.) between these blocks.

This is a line with only one new line character. (But new line character should cover both unix and windows new line)


Can we detect this an indicator?


Thanks,


0
 
LVL 9

Expert Comment

by:parparov
ID: 36944703
But a CS can contain a blank line, can't it?
It can worst case even emulate a whole entry?
0
 

Author Comment

by:Tolgar
ID: 36944778
I didn't really understand your point in here. Why does it emulate the whole entry in the worst case?

On the other hand, we can assume that the CS line will always contain some information.




0
 
LVL 9

Expert Comment

by:parparov
ID: 36944839
CS: this
comment

will make us headache
because after this blank line

file/path1
is actually part of the comment
and only here comment ends.

file/path2

Open in new window

0
 

Author Comment

by:Tolgar
ID: 36945202
Why is file/path1 part of the comment?

It comes before the comment.

Can't we push anything we find after CS line until a single new line line if they don't start with RR or CR?
If the line starts with RR or CR or only contains a single new line character, then we stop.

Does it sound reasonable?



0
 
LVL 9

Expert Comment

by:parparov
ID: 36945220
Why?
Because the comment is "FREE TEXT" so it can contain anything, even string file/path1

Or, you are saying comments CANNOT contain a blank line? That would simplify things.
0
 

Author Comment

by:Tolgar
ID: 36945291
Absolutely... Comments cannot contain blank lines.

How this helps...
0
 
LVL 9

Accepted Solution

by:
parparov earned 2000 total points
ID: 36945461
this helps indeed. Regard this code.
#!/usr/bin/perl

use strict;
use warnings;

our @HEADERS = ("GeckLogin", "Options", "FileInfo", "Mail sent to");
our @SUBHEADERS = ("Files", "Comments", "RelatedRecords", "CodeReviewers",);
# a prototype for convenience)
sub print_data1 ($);
sub print_data2 ($);
sub submitFileParser($);

my $data = submitFileParser(shift @ARGV);
my $geckLogin;
use Data::Dumper;
# A look at the data
#print Dumper $data;

# Examples of accessing data
print_data1($data);
print "++++++++++++++++++++\n";
print_data2($data);
print "++++++++++++++++++++\n";

my @rr = @{$data->[0]{FileInfo}[0]{RelatedRecords}};
#print Dumper \@rr;

sub print_data1 ($) {
	my $data = shift;

	for my $submit (@{$data}) {
		for my $header (@HEADERS) {
			print "$header:\n";
			if ($header eq 'GeckLogin') {
				print "$submit->{$header}\n";
			}
			elsif ($header eq 'FileInfo') {
				for my $subitem (@{$submit->{$header}}) {
					for my $subheader (@SUBHEADERS) {
						print "\t$subheader:\n";
						print @{$subitem->{$subheader}};
					}
				}
			}
			else {
				print @{$submit->{$header}};
			}
			print "\n";
		}
		print "\n";
	}
}

sub print_data2 ($) {
	my $data = shift;

	for my $header (@HEADERS) {
		if ($header eq 'GeckLogin') {
			print "GeckLogin: $data->[0]{GeckLogin}\n";
			next;
		}
		elsif ($header eq 'FileInfo') {
			for my $i (1..@{$data}) {
				for my $subheader (@SUBHEADERS) {
					print "\t$subheader:\n";
					for my $j (1..@{$data->[$i-1]{$header}}) {
						print "From submit file $i item $j\n";
						print @{$data->[$i-1]{$header}[$j-1]{$subheader}};
						print "\n";
					}
				}
			}
			next;
		}
		print "$header:\n";
		for my $i (1..@{$data}) {
			print "From submit file $i\n";
			print @{$data->[$i-1]{$header}};
			print "\n";
		}
		print "\n========\n";
	}
}

sub submitFileParser ($) {
	my $filename = shift;
	my @paragraphs;
#	local($/) = '';
	open( FILE, "< $filename" ) or die "Can't open $filename : $!";
	@paragraphs = <FILE>;
	close FILE;
	return read_paragraphs (@paragraphs);
}

sub read_paragraphs (@) {
	# read lines as parameters
	my @rippedParagraphs = @_;
	my @submits = ();
	# Storage for all sections
	# Temporary storages for single section of each type
	my (@Files, @CR, @RR, @CS, @Options, @Mailsent, @file_info);
	# Flags for file traversal logic
	my ($opt_flag, $file_flag, $mail_sent_to_flag, $cs_flag);

	my $submit_file = 0;
	#read the file
	for ( @rippedParagraphs ) {
		if (/^USER=(\S+)\,/) {
			#obtain the login from USER=
			$geckLogin = $1;
		}
		if (/^\s*Submit\s+file\s*$/) {
			# We record the accumulated data:
			push(
				@submits,
				{
					"Options"              => [@Options],
					"FileInfo"             => [@file_info],
					"GeckLogin" 		   => $geckLogin,
					"Mail sent to"         => [@Mailsent],
				}
			) if @file_info;
			@Options = @Mailsent = @file_info = ();
			$submit_file = 1;
			next;
		}
		if ($submit_file == 1) {
			if (/^\s*\=+\s*$/) {
				$submit_file++;
				$mail_sent_to_flag = 0;
			} else {
				$submit_file = 0; # two-line grammar didn't hold
			}
			next;
		}
		if ($submit_file == 2) {
			if ($mail_sent_to_flag) {
				push(@Mailsent, $_);
				next;
			}
			if (m|^\#\s*Sandbox\s+location\s*\:\s*\S*/sandbox/(.*?)/|) {
				# Match the login name in the submit file - if it has not
				# already been done
				$geckLogin ||= $1;
			}
			# If we encounter a comment or empty string
			if (/^\#/ || !/\S/) {
				if (! /\S/ && $cs_flag) {
					$cs_flag = 0;
					goto CHECK;
				}
				# we haven't encountered an option to start doing anything
				next unless $opt_flag || $file_flag;
				# If we're done with options, let's start reading file sections
				if ($opt_flag == 1) {
					$opt_flag = 0;
					$file_flag = 1;
				}
				elsif ($opt_flag > 1) {
					# Addresses the empty line within Options:
					$opt_flag--;
				}
				next;
			}
			if (/^Options/) {
				# We start reading options
				$opt_flag = 2;
				next;
			}
			if (/^Mail sent to/) {
				$mail_sent_to_flag = 1;
				push(@Mailsent, $_);
				$cs_flag = 0;
				next;
			}
			# Matching beginning of the line to determine the type of the string
			# and placing it in temporary storage
			if (/^R(R|elated\sRecords):\s*(.*\n)/) {
				$cs_flag = 0;
				push(@RR, $2);
				goto CHECK;
			}
			elsif (/^C(R|ode\sReviewer):\s*(.*\n)/) {
				$cs_flag = 0;
				push(@CR, $2);
				goto CHECK;
			}
			elsif (/^C(S|omments):\s*(.*\n)/) {
				$cs_flag = 1;
				push(@CS, $2);
				next;
			}
			# General text is either files or options info, depending on the
			# value of the option flag
			$cs_flag ? push(@CS, $_) : $opt_flag ? push(@Options, $_) : push(@Files, $_);
			CHECK:
			if (!$cs_flag && (@RR && @CR && @CS)) {
				push(
					@file_info,
					{
						"Files"                => [@Files],
						"Comments"             => [@CS],
						"RelatedRecords"       => [@RR],
						"CodeReviewers"        => [@CR],
					},
				);
				@Files = @CS = @RR = @CR = ();
			}
        }
	}
	push(
		@submits,
		{
			"Options"              => [@Options],
			"Mail sent to"         => [@Mailsent],
			"FileInfo"             => [@file_info],
			"GeckLogin" 		   => $geckLogin,
		}
	) if @file_info;
	return \@submits;
}

Open in new window

0
 

Author Comment

by:Tolgar
ID: 36945616
Perfect !!!! It worked..


0
 

Author Comment

by:Tolgar
ID: 36957212
@parparov: While I was testing the code, I realized that if I create the the text file in in unix the code does not work or vice versa. I checked the line endings. In one case it ends with LF and the other ends with CRLF.

How can I cover both cases in this code?


Thanks,






0
 

Author Comment

by:Tolgar
ID: 36957258
Note: In one way, I need to do this conversion:
  <CR><LF>   -->   <LF>
  <LF><CR>   -->   <LF>
  <CR>           -->   <LF>

Open in new window


Thanks,
0
 

Author Comment

by:Tolgar
ID: 36957319
Note: 2: I found something like this :

my $rippedText = scalar(grep(s/\015+$// | s/\015/\n/g, @r));

Open in new window


but how am I gonna apply it to the entire file in my code?

Thanks,
0
 
LVL 9

Expert Comment

by:parparov
ID: 36957831
You can do a 'dos2unix' on the input file before you run your perl script. But I guess this deserves a separate question. :)
0
 

Author Comment

by:Tolgar
ID: 36958275
@parparov: I created a new question. Can you please reply ASAP?

ID: 27393665

Thanks,
0

Featured Post

Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

I've just discovered very important differences between Windows an Unix formats in Perl,at least 5.xx.. MOST IMPORTANT: Use Unix file format while saving Your script. otherwise it will have ^M s or smth likely weird in the EOL, Then DO NOT use m…
I have been pestered over the years to produce and distribute regular data extracts, and often the request have explicitly requested the data be emailed as an Excel attachement; specifically Excel, as it appears: CSV files confuse (no Red or Green h…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
Six Sigma Control Plans
Suggested Courses

872 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question