Tolgar
asked on
How to parse a text file group by group in Perl?
This is a follow-up question for ID: 27285438
This is the final code that I use. It works very well except for the Files section:
my $cache_data = submitFileParser($textFile);
sub submitFileParser ($) {
    my $filename = shift;
    my @paragraphs;
    # local($/) = '';
    # Three-argument open with a lexical filehandle
    open( my $fh, '<', $filename ) or die "Can't open $filename : $!";
    @paragraphs = <$fh>;
    close $fh;
    return read_paragraphs (@paragraphs);
}
sub read_paragraphs (@) {
    # read lines as parameters
    my @rippedParagraphs = @_;
    my $submitFileExist = 0;
    my $submit_file = 0;
    my $noSubmitFileFlag = 0;
    #Decide if the file includes submit file or not
    for ( @rippedParagraphs ) {
        #obtain the login from USER=
        if (/^USER=(\S+)\,/) {
            $geckLogin = $1;
        }
        if (/^\s*Submit\s+file\s*$/) {
            $submitFileExist = 1;
        }
        if ($submitFileExist == 1) {
            if (/^\s*\=+\s*$/) {
                $submitFileExist = 2;
            }
        }
    }
    if ($submitFileExist == 2){
        my @submits = ();
        # Storage for all sections
        # Temporary storages for single section of each type
        my (@Files, @CR, @RR, @CS, @Options, @Mailsent);
        # Flags for file traversal logic
        my ($opt_flag, $file_flag, $mail_sent_to_flag);
        $submit_file = 0;
        $noSubmitFileFlag = 0;
        #read the file
        for ( @rippedParagraphs ) {
            if (/^\s*Submit\s+file\s*$/) {
                # We record the accumulated data:
                push(
                    @submits,
                    {
                        "Options" => [@Options],
                        "Files" => [@Files],
                        "Comments" => [@CS],
                        "RelatedRecords" => [@RR],
                        "CodeReviewers" => [@CR],
                        "GeckLogin" => $geckLogin,
                        "NoSubmitFileFlag" => $noSubmitFileFlag,
                        "Mail sent to" => [@Mailsent],
                    }
                ) if @Files;
                @Options = @Files = @CR = @CS = @RR = ();
                $submit_file = 1;
                next;
            }
            if ($submit_file == 1) {
                if (/^\s*\=+\s*$/) {
                    $submit_file++;
                    $mail_sent_to_flag = 0;
                } else {
                    $submit_file = 0; # two-line grammar didn't hold
                }
                next;
            }
            if ($submit_file == 2) {
                if ($mail_sent_to_flag) {
                    push(@Mailsent, $_);
                    next;
                }
                if (m|^\#\s*Sandbox\s+location\s*\:\s*\S*/sandbox/(.*?)/|) {
                    # Match the login name in the submit file - if it has not
                    # already been done
                    $geckLogin ||= $1;
                }
                # If we encounter a comment or empty string
                if (/^\#/ || !/\S/) {
                    # we haven't encountered an option to start doing anything
                    next unless $opt_flag || $file_flag;
                    # If we're done with options, let's start reading file sections
                    if ($opt_flag == 1) {
                        $opt_flag = 0;
                        $file_flag = 1;
                    }
                    elsif ($opt_flag > 1) {
                        # Addresses the empty line within Options:
                        $opt_flag--;
                    }
                    next;
                }
                if (/^Options/) {
                    # We start reading options
                    $opt_flag = 2;
                    next;
                }
                if (/^Mail sent to/) {
                    $mail_sent_to_flag = 1;
                    push(@Mailsent, $_);
                    next;
                }
                # Matching beginning of the line to determine the type of the string
                # and placing it in temporary storage
                /^R(R|elated\sRecords):\s*(.*\n)/ && push(@RR, $2) && next;
                /^C(R|ode\sReviewer):\s*(.*\n)/ && push(@CR, $2) && next;
                /^C(S|omments):\s*(.*\n)/ && push(@CS, $2) && next;
                # General text is either files or options info, depending on the
                # value of the option flag
                $opt_flag ? push(@Options, $_) : push(@Files, $_);
            }
        }
        push(
            @submits,
            {
                "Options" => [@Options],
                "Files" => [@Files],
                "Comments" => [@CS],
                "RelatedRecords" => [@RR],
                "CodeReviewers" => [@CR],
                "Mail sent to" => [@Mailsent],
                "GeckLogin" => $geckLogin,
                "NoSubmitFileFlag" => $noSubmitFileFlag,
            }
        ) if @Files;
        return \@submits;
    }
    else{
        my @noSubmitFileSubmits = ();
        $submit_file = 0; # two-line grammar didn't hold
        my $parsedData = parseWithoutSubmitFile(@rippedParagraphs);
        #submit file does not exist flag
        $noSubmitFileFlag = 1;
        push(
            @noSubmitFileSubmits,
            {
                "GeckLogin" => $geckLogin,
                "ParsedData" => $parsedData,
                "NoSubmitFileFlag" => $noSubmitFileFlag,
                "Cluster" => $parsedData->{t},
                "JobID" => $parsedData->{dollar_},
                "gLogFilesOption" => exists $parsedData->{GLOGFILES},
                "gLogSbcheckOption" => exists $parsedData->{GLOGSBCHECK},
            }
        ) if $parsedData;
        return \@noSubmitFileSubmits;
    }
}
# we parse token differently if user makes the submission without submit file
sub parseWithoutSubmitFile (@) {
    my $arg_flag = 0;
    my $parsedData = {};
    my $current_option = '';
    while (my $line = shift @_) {
        if ($arg_flag == 1) {
            if ($line =~ /^Currently (\$\_=.*)/) {
                local $_;
                eval "$1;";
                $parsedData->{dollar_} = $_;
                $arg_flag = 0;
            }
            elsif ($line =~ /^\s+\-(.*)/) {
                $current_option = $1;
                $parsedData->{$current_option} = undef;
                next;
            }
            elsif ($current_option && $line =~ /^\s+(.*)/) {
                $parsedData->{$current_option} = $1;
                $current_option = undef;
            }
        }
        else {
            if ($line =~ /^Original arguments:/) {
                $arg_flag = 1;
                next;
            }
        }
    }
    return $parsedData;
}
This is the file that I parse:
USER=testman, HOST=testman-deb6-64, ARCH=glnxa64
Revisions: /st/hub/share/apps/bat//share/mmit: 07/26-09:48:58; csubmitItem.pm: 2011/07/26-09:48:56
Original arguments:
-t
Atk
-F
20110914.submit
Currently $_='154551'
main:/st/hub/share/apps/bat/bat2.15.17/share/../lib/csubmitCache.pm:44 called main::submissionHistory
main:/st/hub/share/apps/bat/bat2.15.17/share/submit:3871 called main::CreateCacheFile
Current directory ($PWD) = /st/devel/sandbox/testman/Aslrtw
Submit file
===========================
# Component : Coder
# Sandbox location : /st/devel/sandbox/testman/Atk
# Submission for : 2000
#
# Description:
# Unlocking making changes
#
# Documentation impact:
# None
#
# QE items:
# None
#
# Type of change:
# Unlocking making changes
#
# submit file for use with msubmit. To use run the command
# submit -F 24.submit
# or use C-c C-c from emacs to run this command.
# "<a href='http://www-sandbox/testman/Atk/glnxa64'>/sandbox/testman/Atk_ests/glnxa64</a>"
# "No need for sbruntests: Interactive Tests Update"
Options:
-CJ "<a href='http://www-sandbox/testman/Atk_tests/glnxa64'>/sandbox/testman/Atk_tests/glnxa64</a>"
-nowrap
-subject "Unlocking making changes"
-KEYWORD1
-KEYWORD2
st/ert/variants/variants4.c
CR: testman2
RR: 123456
CS: Unlocking before making changes
Mail sent to:
st.devel.submit: Unlocking making changes
Files:
st/ert/variants/variants5.c
Submit file
===========================
# Component : Coder
# Sandbox location : /st/devel/sandbox/testman/Atk
# Submission for : 2000
#
# Description:
# Unlocking making changes
#
# Documentation impact:
# None
#
# QE items:
# None
#
# Type of change:
# Unlocking making changes
#
# submit file for use with msubmit. To use run the command
# submit -F 14.submit
# or use C-c C-c from emacs to run this command.
# "<a href='http://www-sandbox/testman/Atk_tests/glnxa64'>/sandbox/testman/Atk_tests/glnxa64</a>"
# "No need for sbruntests: Interactive Tests Update"
Options:
-CJ "<a href='http://www-sandbox/testman/Atk_tests/glnxa64'>/sandbox/testman/Atk_tests/glnxa64</a>"
-nowrap
-subject "Unlocking making changes"
-KEYWORD1
-KEYWORD2
st/ert/variants/variants5.c
st/ert/variants/variants6.c
CR: testman2
RR: 333333, 444444
CS: Unlocking before making changes
st/ert/variants/variants7.c
CR: testman2
RR: 555555, 666666
CS: Unlocking before making changes
Mail sent to:
st.devel.submit: Unlocking making changes
Files:
st/ert/variants/variants5.c
When I debug this code and do the following, I get the list of numbers for each group separately.
x @{$cache_data->[1]{RelatedRecords}};
The result is here. And this is what I expected.
0 "333333, 444444\cM\cJ"
1 "555555, 666666\cM\cJ"
However, when I do the following:
x @{$cache_data->[1]{Files}}
The result is:
0 "st/ert/variants/variants5.c\cM\cJ"
1 "st/ert/variants/variants6.c\cM\cJ"
2 "st/ert/variants/variants7.c\cM\cJ"
However, I expect to get the following
0 "st/ert/variants/variants5.c\cM\cJ"
"st/ert/variants/variants6.c\cM\cJ"
1 "st/ert/variants/variants7.c\cM\cJ"
Because when you look at the text file that I parse,
these two are in one group:
"st/ert/variants/variants5.c\cM\cJ"
"st/ert/variants/variants6.c\cM\cJ"
and this one is in another group.
"st/ert/variants/variants7.c\cM\cJ"
What is the problem with this code and how can I fix this issue?
If I cannot fix it the code will be entirely useless.
Can you please help me?
Thanks,
ASKER
You are right, CS does not have to be at the end.
But once the code has seen all three of the following, the current group is complete:
CR (or Code Reviewer),
CS (or Comments),
RR (or Related Records).
We can then continue with the next group.
Does this help a bit?
Thanks,
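The rule described above can be sketched roughly as follows. This is only a minimal illustration, not the accepted solution: group_file_section is a hypothetical helper name, and it assumes it is handed only the lines of the file section of one submit file (no Options, no "Mail sent to:" lines). A group is closed as soon as CR, RR and CS have each been seen at least once; trailing files with no markers after them are simply dropped here.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original code): split the file section
# of one submit file into groups. A group is closed as soon as CR, RR and
# CS have all been seen at least once.
sub group_file_section {
    my @lines = @_;
    my @groups;
    my %cur = (Files => [], CodeReviewers => [], RelatedRecords => [], Comments => []);

    for my $line (@lines) {
        (my $text = $line) =~ s/\r?\n\z//;    # drop LF or CRLF line ending
        next unless $text =~ /\S/;            # skip blank lines

        if    ($text =~ /^R(?:R|elated\sRecords):\s*(.*)/) { push @{ $cur{RelatedRecords} }, $1 }
        elsif ($text =~ /^C(?:R|ode\sReviewer):\s*(.*)/)   { push @{ $cur{CodeReviewers} }, $1 }
        elsif ($text =~ /^C(?:S|omments):\s*(.*)/)         { push @{ $cur{Comments} }, $1 }
        else                                               { push @{ $cur{Files} }, $text }

        # All three markers seen: store the group and start a fresh one.
        if (@{ $cur{RelatedRecords} } && @{ $cur{CodeReviewers} } && @{ $cur{Comments} }) {
            push @groups, { %cur };
            %cur = (Files => [], CodeReviewers => [], RelatedRecords => [], Comments => []);
        }
    }
    return \@groups;
}

# Sample input taken from the second submit file in the question.
my $groups = group_file_section(
    "st/ert/variants/variants5.c\r\n",
    "st/ert/variants/variants6.c\r\n",
    "CR: testman2\r\n",
    "RR: 333333, 444444\r\n",
    "CS: Unlocking before making changes\r\n",
    "st/ert/variants/variants7.c\r\n",
    "CR: testman2\r\n",
    "RR: 555555, 666666\r\n",
    "CS: Unlocking before making changes\r\n",
);
printf "%d: %s\n", $_, join(", ", @{ $groups->[$_]{Files} }) for 0 .. $#$groups;
```

With this input the first group holds variants5.c and variants6.c and the second holds variants7.c, which is the nesting asked for in the question.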
Probably... Gotta think about it. Will post an update on Monday.
ASKER
I see. Would it be possible to do it tomorrow?
I am planning to complete the rest of it on Sunday.
Thanks
I can't promise... my weekends are usually loaded with family matters.
ASKER
I understand. I would be pleased if you can do your best.
Thanks,
ASKER
Hi,
I wonder if there is any progress.
Thanks,
If we discard the 'Mail sent to:' field, would that be ok?
ASKER
Absolutely. I don't even need it. As far as I remember, we included "Mail sent to" only in order to separate it from the other fields.
Thanks,
ASKER
@parparov: for some reason this line now does not catch the submit file flag:
if (/^\s*Submit\s+file\s*$/) {
Do you have any idea?
When I print @rippedParagraphs I can see the entire file, including the Submit file flag. But when I print it, it only returns \c\\.
I am pretty sure that it was working last night.
Here is my slightly modified code:
sub read_paragraphs (@) {
    # read lines as parameters
    my @rippedParagraphs = @_;
    my @submits = ();
    # Storage for all sections
    # Temporary storages for single section of each type
    my (@Files, @CR, @RR, @CS, @Options, @Mailsent, @file_info);
    # Flags for file traversal logic
    my ($opt_flag, $file_flag, $mail_sent_to_flag);
    my $submit_file = 0;
    my $noSubmitFileFlag = 0;
    #read the file
    for ( @rippedParagraphs ) {
        if (/^USER=(\S+)\,/) {
            #obtain the login from USER=
            $geckLogin = $1;
        }
        if (/^\s*Submit\s+file\s*$/) {
            # We record the accumulated data:
            push(
                @submits,
                {
                    "Options" => [@Options],
                    "FileInfo" => [@file_info],
                    "GeckLogin" => $geckLogin,
                    "NoSubmitFileFlag" => $noSubmitFileFlag,
                    "Mail sent to" => [@Mailsent],
                }
            ) if @file_info;
            @Options = @Mailsent = @file_info = ();
            $submit_file = 1;
            next;
        }
        else {
            my @noSubmitFileSubmits = ();
            $submit_file = 0; # two-line grammar didn't hold
            my $parsedData = parseWithoutSubmitFile(@rippedParagraphs);
            #submit file does not exist flag
            $noSubmitFileFlag = 1;
            push(
                @noSubmitFileSubmits,
                {
                    "GeckLogin" => $geckLogin,
                    "ParsedData" => $parsedData,
                    "NoSubmitFileFlag" => $noSubmitFileFlag,
                    "Cluster" => $parsedData->{t},
                    "JobID" => $parsedData->{dollar_},
                    "gLogFilesOption" => exists $parsedData->{GLOGFILES},
                    "gLogSbcheckOption" => exists $parsedData->{GLOGSBCHECK},
                }
            ) if $parsedData;
            return \@noSubmitFileSubmits;
        }
        if ($submit_file == 1) {
            if (/^\s*\=+\s*$/) {
                $submit_file++;
                $mail_sent_to_flag = 0;
            }
            next;
        }
        if ($submit_file == 2) {
            if ($mail_sent_to_flag) {
                push(@Mailsent, $_);
                next;
            }
            #if (m|^\#\s*Sandbox\s+location\s*\:\s*\S*/sandbox/(.*?)/|) {
            #    # Match the login name in the submit file - if it has not
            #    # already been done
            #    $geckLogin ||= $1;
            #}
            # If we encounter a comment or empty string
            if (/^\#/ || !/\S/) {
                # we haven't encountered an option to start doing anything
                next unless $opt_flag || $file_flag;
                # If we're done with options, let's start reading file sections
                if ($opt_flag == 1) {
                    $opt_flag = 0;
                    $file_flag = 1;
                }
                elsif ($opt_flag > 1) {
                    # Addresses the empty line within Options:
                    $opt_flag--;
                }
                next;
            }
            if (/^Options/) {
                # We start reading options
                $opt_flag = 2;
                next;
            }
            if (/^Mail sent to/) {
                $mail_sent_to_flag = 1;
                push(@Mailsent, $_);
                next;
            }
            # Matching beginning of the line to determine the type of the string
            # and placing it in temporary storage
            /^R(R|elated\sRecords):\s*(.*\n)/ && push(@RR, $2) && goto CHECK;
            /^C(R|ode\sReviewer):\s*(.*\n)/ && push(@CR, $2) && goto CHECK;
            /^C(S|omments):\s*(.*\n)/ && push(@CS, $2) && goto CHECK;
            # General text is either files or options info, depending on the
            # value of the option flag
            $opt_flag ? push(@Options, $_) : push(@Files, $_);
            CHECK:
            if (@RR && @CR && @CS) {
                push(
                    @file_info,
                    {
                        "Files" => [@Files],
                        "Comments" => [@CS],
                        "RelatedRecords" => [@RR],
                        "CodeReviewers" => [@CR],
                        "NoSubmitFileFlag" => $noSubmitFileFlag,
                    },
                );
                @Files = @CS = @RR = @CR = ();
            }
        }
    }
    push(
        @submits,
        {
            "Options" => [@Options],
            "Mail sent to" => [@Mailsent],
            "FileInfo" => [@file_info],
            "GeckLogin" => $geckLogin,
        }
    ) if @file_info;
    return \@submits;
}
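One thing worth noting in the code just above: the else branch is attached to the per-line "Submit file" test inside the loop, so it runs on the first line that is not a "Submit file" header and returns from the function right there. A possible way around this, sketched below, is to decide once before the loop whether the input contains a submit file at all, as the earlier version did. has_submit_file is a hypothetical helper name, and requiring the "=" line to follow the header directly is an assumption based on the "two-line grammar" comment.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original code): true if the lines
# contain a "Submit file" header immediately followed by a line made up
# of '=' characters.
sub has_submit_file {
    my @lines = @_;
    for my $i (0 .. $#lines - 1) {
        return 1 if $lines[$i]     =~ /^\s*Submit\s+file\s*$/
                 && $lines[$i + 1] =~ /^\s*=+\s*$/;
    }
    return 0;
}

# Made-up sample inputs with CRLF line endings.
my @with    = ("USER=testman, HOST=x\r\n", "Submit file\r\n", "=====\r\n");
my @without = ("USER=testman, HOST=x\r\n", "Original arguments:\r\n");

print has_submit_file(@with)    ? "submit file\n" : "no submit file\n";
print has_submit_file(@without) ? "submit file\n" : "no submit file\n";
```

read_paragraphs could then call parseWithoutSubmitFile only when this check fails, and otherwise run the per-line loop without the else branch.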
ASKER
@parparov: I think i found the problem. Please wait.
ASKER
@parparov: ok I fixed it.
But I found a limitation that comes from my initial definition of the problem.
If the comments section has more than one line, the second line is ignored.
So if a comment spans more than one line, all of its lines must end up in one element of the Comments array of the related group.
Can we fix this problem?
If this is a major change, I can create another question.
Thanks,
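For illustration only, here is a rough sketch of keeping a multi-line comment in a single array element. The thread does not say what a continuation line looks like, so this assumes that it starts with whitespace; collect_comments is a hypothetical helper name. Note also that with the "group is complete once CR, RR and CS are seen" rule, the group would have to stay open until the comment has really ended.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original code).
# ASSUMPTION: a continuation line of a comment starts with whitespace;
# the thread does not specify the real format.
sub collect_comments {
    my @comments;
    my $in_comment = 0;
    for my $line (@_) {
        (my $text = $line) =~ s/\r?\n\z//;    # drop LF or CRLF line ending
        if ($text =~ /^C(?:S|omments):\s*(.*)/) {
            push @comments, $1;               # first line starts a new element
            $in_comment = 1;
        }
        elsif ($in_comment && $text =~ /^\s+(\S.*)/) {
            $comments[-1] .= "\n$1";          # append to the same element
        }
        else {
            $in_comment = 0;                  # any other line ends the comment
        }
    }
    return @comments;
}

# Made-up sample input.
my @comments = collect_comments(
    "CS: Unlocking before making changes\r\n",
    "    second line of the same comment\r\n",
    "st/ert/variants/variants7.c\r\n",
);
print scalar(@comments), " comment element(s)\n";
```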
ASKER
@parparov: I created another question for this change.
ID: 27382844
ASKER
@parparov:
When I run this line in the code:
@Related_Records = @{$cache_data->[$i]{FileInfo}->[$j]{RelatedRecords}};
$RelatedRecordList = (join("\n", @Related_Records))."\n";
I get this:
DB<5> x $RelatedRecordList
0 "123456\cM\cJ\cJ"
Is there any way to get rid of these \cM\cJ\cJ characters (line endings, I guess) in general?
They appear after every value.
Thanks,
ASKER
@parparov: Any idea about the last two posts?
Thanks,
The \cJ characters are "\n"s: one is the newline captured from the input line, the other is the "\n" you append after the join.
The \cM characters are "\r"s. You can get rid of them, if you want, by adding
s/\r//g;
right after for ( @rippedParagraphs ) {
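As a small illustration of the above, the sketch below strips both the "\r" and the trailing "\n" from values that have already been captured. The sample values are made up. The existing patterns in read_paragraphs capture (.*\n), so removing the "\n" from the raw input lines before matching would stop them from matching; that is why the line ending is stripped from the stored values here rather than from the input.

```perl
use strict;
use warnings;

# Made-up sample values, as they would look after being captured from a
# file with Windows (CRLF) line endings.
my @related_records = ("123456\r\n", "333333, 444444\r\n");

for (@related_records) {
    s/\r?\n\z//;    # drop a trailing LF or CRLF; $_ aliases the array element
}

# Joining now adds exactly one "\n" between values and one at the end.
my $related_record_list = join("\n", @related_records) . "\n";
print $related_record_list;
```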
How should we determine that one section of submitted files has ended and a new one has started?
You told me that the CS marker is by no means a mandatory end of that section.