How do I break down information read from a text file into arrays in Perl?

Posted on 2011-09-19
Last Modified: 2012-05-12
Follow up question for ID: 27312349

As I explained in the last comment of the referenced question, I would like to break out the following per entry:

Options
Files
Comments
Related_Records
Code_Reviewers

The Geck_Login will be the same for all entries because we capture it at the beginning of the file.

If we look at the text file again:

USER=testman, HOST=testman-deb6-64, ARCH=glnxa64
Revisions: /st/hub/share/apps/bat//share/mmit: 07/26-09:48:58; csubmitItem.pm: 2011/07/26-09:48:56
Original arguments:
        -t
        Atk
        -F
        20110914.submit
Currently $_='154551'

        main:/st/hub/share/apps/bat/bat2.15.17/share/../lib/csubmitCache.pm:44 called main::submissionHistory
        main:/st/hub/share/apps/bat/bat2.15.17/share/submit:3871 called main::CreateCacheFile

Current directory ($PWD) = /st/devel/sandbox/testman/Aslrtw
                Submit file
        ===========================
# Component        : Coder
# Sandbox location : /st/devel/sandbox/testman/Atk
# Submission for   : 2000
#
# Description:
#   Unlocking making changes
#
# Documentation impact:
#   None
#
# QE items:
#   None
#
# Type of change:
#   Unlocking making changes
#

# submit file for use with msubmit.  To use run the command
#      submit -F 24.submit
#   or use C-c C-c from emacs to run this command.
# "<a href='http://www-sandbox/testman/Atk/glnxa64'>/sandbox/testman/Atk_ests/glnxa64</a>"
# "No need for sbruntests: Interactive Tests Update"
Options:

-CJ "<a href='http://www-sandbox/testman/Atk_tests/glnxa64'>/sandbox/testman/Atk_tests/glnxa64</a>"
-nowrap
-subject "locking making changes"
-KEYWORD1
-KEYWORD2

st/ert/variants/variants5.c
CR: testman2
RR: 987654
CS: locking before making changes

Mail sent to:
    st.devel.submit: Unlocking making changes
    Files:
    st/ert/variants/variants5.c

	
				Submit file
        ===========================
# Component        : Coder
# Sandbox location : /st/devel/sandbox/testman/Atk
# Submission for   : 2000
#
# Description:
#   Unlocking making changes
#
# Documentation impact:
#   None
#
# QE items:
#   None
#
# Type of change:
#   Unlocking making changes
#

# submit file for use with msubmit.  To use run the command
#      submit -F 14.submit
#   or use C-c C-c from emacs to run this command.
# "<a href='http://www-sandbox/testman/Atk_tests/glnxa64'>/sandbox/testman/Atk_tests/glnxa64</a>"
# "No need for sbruntests: Interactive Tests Update"
Options:

-CJ "<a href='http://www-sandbox/testman/Atk_tests/glnxa64'>/sandbox/testman/Atk_tests/glnxa64</a>"
-nowrap
-subject "Unlocking making changes"
-KEYWORD1
-KEYWORD2

st/ert/variants/variants6.c
CR: testman3
RR: 123456
CS: Unlocking before making changes

Mail sent to:
    st.devel.submit: Unlocking making changes
    Files:
    st/ert/variants/variants5.c



I would like to have the following in arrays (From Submit File 1 and From Submit File 2 texts are only for clarification. We don't actually need them.):
@Options:
From Submit File 1:
-CJ "<a href='http://www-sandbox/testman/Atk_tests/glnxa64'>/sandbox/testman/Atk_tests/glnxa64</a>"
-nowrap
-subject "Unlocking making changes"
-KEYWORD1
-KEYWORD2

From Submit File 2:
-CJ "<a href='http://www-sandbox/testman/Atk_tests/glnxa64'>/sandbox/testman/Atk_tests/glnxa64</a>"
-nowrap
-subject "Unlocking making changes"
-KEYWORD1
-KEYWORD2


@Files:
From Submit File 1:
st/ert/variants/variants5.c

From Submit File 2:
st/ert/variants/variants6.c

@Comments:
From submit File 1:
locking before making changes

From submit file 2:
Unlocking before making changes

@Related_Records:
From submit File 1:
987654

From submit File 2:
123456

@Code_Reviewers:
From submit File 1:
testman2

From submit File 2:
testman3

Geck_login: testman




Once we have this output, I will pass it to other code and log the broken-down information separately into another file. That's why I need to know which information belongs to which submit file.

Note: There can be any number of Submit Files in one text file.

I would prefer to have a single array each for Options, Files, Comments, Related_Records and Code_Reviewers, and to manipulate the data inside these arrays for the different submit files.

Let's say:
The first element of Options should only include options from Submit File 1.
The second element of Options should only include options from Submit File 2.

The same goes for the list of files and the others.

What I mean is that we don't need to put every option, file, or other item in its own array element. The same group of information from the same submit file should be in the same array element. Then I can dump this information anywhere I want without causing confusion.

I hope this explains everything clearly.

Thanks,


Question by:Tolgar
22 Comments
 
LVL 9

Expert Comment

by:parparov
ID: 36563826
Allow me to clarify:

Is there one Files, Comments, Options, etc. entry per 'submit file' section?
Is the user read from USER= at the beginning, or from the sandbox location?
I ask because the sandbox location may, theoretically, yield different users.

0
 

Author Comment

by:Tolgar
ID: 36564334
Yes, there is one Files, Comments, Options, etc. entry per submit file section.

Reading the user from the beginning of the file is more reliable, but I would prefer to keep the sandbox location for now. If possible, please also read it from the beginning, but comment that part out for now. I guess both will be in the same section of the code.

thanks,

0
 
LVL 9

Expert Comment

by:parparov
ID: 36568995
Here is the reworked code.
The returned data structure has changed. Please study the examples of accessing the data.
#!/usr/bin/perl

use strict;
use warnings;

our @HEADERS = ("GeckLogin", "Options", "Files", "Comments", "RelatedRecords", "CodeReviewers");
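# Prototypes, declared up front for convenience
sub print_data1 ($);
sub print_data2 ($);
sub submitFileParser ($);
sub read_paragraphs (@);

# Shared with read_paragraphs(), so it is declared before the parser runs
my $geckLogin;
my $data = submitFileParser(shift @ARGV);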
use Data::Dumper;
# A look at the data
print Dumper $data;

# Examples of accessing data
print_data1($data);
print_data2($data);

sub print_data1 ($) {
	my $data = shift;

	for my $submit (@{$data}) {
		for my $header (@HEADERS) {
			print "$header:\n";
			if ($header eq 'GeckLogin') {
				print "$submit->{$header}\n";
			}
			else {
				print @{$submit->{$header}};
			}
			print "\n";
		}
		print "\n";
	}
}

sub print_data2 ($) {
	my $data = shift;

	for my $header (@HEADERS) {
		if ($header eq 'GeckLogin') {
			print "GeckLogin: $data->[0]{GeckLogin}\n";
			next;
		}
		print "$header:\n";
		for my $i (1..@{$data}) {
			print "From submit file $i\n";
			print @{$data->[$i-1]{$header}};
			print "\n";
		}
		print "\n";
	}
}

sub submitFileParser ($) {
	my $filename = shift;
	my @paragraphs;
#	local($/) = '';
	# Three-argument open with a lexical filehandle
	open( my $fh, '<', $filename ) or die "Can't open $filename: $!";
	@paragraphs = <$fh>;
	close $fh;
	return read_paragraphs (@paragraphs);
}

sub read_paragraphs (@) {
	# read lines as parameters
	my @rippedParagraphs = @_;
	my @submits = ();
	# Storage for all sections
	# Temporary storages for single section of each type
	my (@Files, @CR, @RR, @CS, @Options);
	# Flags for file traversal logic
	my ($opt_flag, $file_flag);

	my $submit_file = 0;
	#read the file
	for ( @rippedParagraphs ) {
		if (/^USER=(\S+)\,/) {
			#obtain the login from USER=
			$geckLogin = $1;
		}
		if (/^\s*Submit\s+file\s*$/) {
			$submit_file = 1;
			next;
		}
		if ($submit_file == 1) {
			if (/^\s*\=+\s*$/) {
				$submit_file++;
			} else {
				$submit_file = 0; # two-line grammar didn't hold
			}
			next;
		}
		if ($submit_file == 2) {
			if (m|^\#\s*Sandbox\s+location\s*\:\s*\S*/sandbox/(.*?)/|) {
				# Match the login name in the submit file - if it has not
				# already been done
				$geckLogin ||= $1;
			}
			# If we encounter a comment or empty string
			if (/^\#/ || !/\S/) {
				# we haven't encountered an option to start doing anything
				next unless $opt_flag || $file_flag;
				# If we're done with options, let's start reading file sections
				if ($opt_flag == 1) {
					$opt_flag = 0;
					$file_flag = 1;
				}
				elsif ($opt_flag > 1) {
					# Addresses the empty line within Options:
					$opt_flag--;
				}
				next;
			}
			if (/^Options/) {
				# We start reading options
				$opt_flag = 2;
				next;
			}
			# Matching beginning of the line to determine the type of the string
			# and placing it in temporary storage
			/^R(R|elated\sRecords):\s*(.*\n)/ && push(@RR, $2) && next;
			/^C(R|ode\sReviewer):\s*(.*\n)/ && push(@CR, $2) && next;
			/^C(S|omments):\s*(.*\n)/ && push(@CS, $2) &&
				# CS record is the last one, we commit after it
				push(
					@submits,
					{
						"Options"              => [@Options],
						"Files"                => [@Files],
						"Comments"             => [@CS],
						"RelatedRecords"       => [@RR],
						"CodeReviewers"        => [@CR],
						"GeckLogin" 		   => $geckLogin,
					}
				) &&
				((@Options = @Files = @CR = @CS = @RR = ()) || ($submit_file = 0) || 1)
			&& next;

			# General text is either files or options info, depending on the
			# value of the option flag
			$opt_flag ? push(@Options, $_) : push(@Files, $_);
		}
	}
	return \@submits;
}
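
For orientation, the reference returned by submitFileParser has roughly the shape below for the two-section sample input. This is a hand-written sketch rather than output from the script: the values are copied from the sample file, the Options lists are abbreviated, and `section_count` is a hypothetical helper that is not part of the code above.

```perl
use strict;
use warnings;

# Hand-built copy of the structure: one hash per "Submit file" section,
# each field holding the raw lines (trailing newline kept) for that section.
my $data = [
    {
        GeckLogin      => 'testman',
        Options        => [ "-nowrap\n", "-KEYWORD1\n", "-KEYWORD2\n" ],
        Files          => [ "st/ert/variants/variants5.c\n" ],
        Comments       => [ "locking before making changes\n" ],
        RelatedRecords => [ "987654\n" ],
        CodeReviewers  => [ "testman2\n" ],
    },
    {
        GeckLogin      => 'testman',
        Options        => [ "-nowrap\n", "-KEYWORD1\n", "-KEYWORD2\n" ],
        Files          => [ "st/ert/variants/variants6.c\n" ],
        Comments       => [ "Unlocking before making changes\n" ],
        RelatedRecords => [ "123456\n" ],
        CodeReviewers  => [ "testman3\n" ],
    },
];

# Hypothetical helper: number of submit sections in the structure.
sub section_count {
    my ($d) = @_;
    return scalar @{$d};
}

# First file of the second section
print $data->[1]{Files}[0];
```

Each array element is one submit file, so everything under `$data->[0]` belongs to the first section and everything under `$data->[1]` to the second.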


0
 

Author Comment

by:Tolgar
ID: 36569304
Hi,
Thank you for your prompt reply.

Can you please explain what you mean in these lines?

Line 127
Line 142
Line 144

Note: The order of CR, CS and RR can be anything in the text. You know that right?

Thanks,
0
 
LVL 9

Expert Comment

by:parparov
ID: 36569490
No, I assumed the CS: is the last section of a submit. Otherwise I don't see how to get rid of the trailing "Mail sent to:"

Hope this explains lines 127, 142 and 144
0
 

Author Comment

by:Tolgar
ID: 36569572
ok. let's put this question for a later discussion.

I have another question. When I debug the code, I did the following.

231:                            my @Options = @{$cache_data}[0]->{Options};
  DB<3> x @Options
  empty array
  DB<4> x @{$cache_data}[0]->{Options}
0  ARRAY(0x15961d0)
   0  "-CJ \"<a href='http://www-sandbox/testman/Atk_tests/glnxa64'>/sandbox/testman/Atk_tests/glnxa64</a>\"\cM\cJ"
   1  "-nowrap\cM\cJ"
   2  "-subject \"Unlocking making changes\"\cM\cJ"
   3  "-KEYWORD1\cM\cJ"
   4  "-KEYWORD2\cM\cJ"



@Options is empty in my first attempt. But when I print the right-hand side directly, it works. So how can I assign the right-hand side, which is an array, to a new array like @Options?
0
 
LVL 9

Expert Comment

by:parparov
ID: 36570136
You dereferenced it the wrong way.
You need:
my @Options = @{$cache_data->[0]{Options}};
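
The difference can be sketched as follows. `$cache_data->[0]{Options}` is a reference to an array, which matches the `ARRAY(0x15961d0)` the debugger printed; assigning a reference to an array stores that single reference, while wrapping the expression in `@{ ... }` copies the elements. The sample data here is a hypothetical stand-in for the parser's return value. As a side note, the Perl debugger shows a line before executing it, which may be why `x @Options` reported an empty array at line 231; that is only a guess from the transcript.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parser's return value.
my $cache_data = [ { Options => [ "-nowrap\n", "-KEYWORD1\n" ] } ];

# Assigning the reference itself gives a one-element array holding the ref.
my @holds_ref = ( $cache_data->[0]{Options} );

# Wrapping the expression in @{ ... } dereferences it into its elements.
my @options = @{ $cache_data->[0]{Options} };

print scalar(@holds_ref), "\n";    # element count: just the one reference
print ref( $holds_ref[0] ), "\n";  # type of the stored reference
print scalar(@options), "\n";      # element count: the option lines
```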


0
 

Author Comment

by:Tolgar
ID: 36570696
ok.

I have 2 questions:

1-
It worked for all of them except for Related_Records.

This returns the correct data:

@{$cache_data->[0]{RelatedRecords}}



but this one returns empty array:

my @Related_Records = @{$cache_data->[0]{RelatedRecords}}



What am I doing wrong? The only difference is, this data is an integer.

2- Why do I get \cM\cJ at the end of all array elements?

e.g.

 DB<9> x @Comments
0  "Unlocking before making changes\cM\cJ"




Thanks,



0
 

Author Comment

by:Tolgar
ID: 36571021
Hi,
For the question which I have asked in ID: 36569304:

Can we change the code in a way that, we don't make any assumption on which one (CR, RR or CS) will be the last field and then "Mail sent to " can be treated as another field like RR, CS or CR.

Then I can just ignore that one when I pass them to another code.

Can we do that?

Thanks,

0
 
LVL 9

Expert Comment

by:parparov
ID: 36575103
Yes, we can do that. I'll post updated code later.
\cM\cJ is how the debugger displays a carriage return followed by a newline.
0
 

Author Comment

by:Tolgar
ID: 36580502
Hi,
When I log this information to a text file, is \cM\cJ going to be seen or are they gonna be processed?

I am waiting for your updated code.

Thanks,
0
 

Author Comment

by:Tolgar
ID: 36587130
hi,
I wonder if you would be able to post the updated code by Sunday morning.

Thanks,

0
 
LVL 9

Accepted Solution

by:parparov
parparov earned 2000 total points
ID: 36588038
This code works with your example input, including the related records, which I test explicitly:
#!/usr/bin/perl

use strict;
use warnings;

our @HEADERS = ("GeckLogin", "Options", "Files", "Comments", "RelatedRecords", "CodeReviewers", "Mail sent to");
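# Prototypes, declared up front for convenience
sub print_data1 ($);
sub print_data2 ($);
sub submitFileParser($);
sub read_paragraphs (@);

# Shared with read_paragraphs(), so it is declared before the parser runs
my $geckLogin;
my $data = submitFileParser(shift @ARGV);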
use Data::Dumper;
# A look at the data
print Dumper $data;

# Examples of accessing data
print_data1($data);
print "++++++++++++++++++++\n";
print_data2($data);
print "++++++++++++++++++++\n";

my @rr = @{$data->[0]{RelatedRecords}};
print Dumper \@rr;
print Dumper $data->[0]{RelatedRecords};

sub print_data1 ($) {
	my $data = shift;

	for my $submit (@{$data}) {
		for my $header (@HEADERS) {
			print "$header:\n";
			if ($header eq 'GeckLogin') {
				print "$submit->{$header}\n";
			}
			else {
				print @{$submit->{$header}};
			}
			print "\n";
		}
		print "\n";
	}
}

sub print_data2 ($) {
	my $data = shift;

	for my $header (@HEADERS) {
		if ($header eq 'GeckLogin') {
			print "GeckLogin: $data->[0]{GeckLogin}\n";
			next;
		}
		print "$header:\n";
		for my $i (1..@{$data}) {
			print "From submit file $i\n";
			print @{$data->[$i-1]{$header}};
			print "\n";
		}
		print "\n";
	}
}

sub submitFileParser ($) {
	my $filename = shift;
	my @paragraphs;
#	local($/) = '';
	# Three-argument open with a lexical filehandle
	open( my $fh, '<', $filename ) or die "Can't open $filename: $!";
	@paragraphs = <$fh>;
	close $fh;
	return read_paragraphs (@paragraphs);
}

sub read_paragraphs (@) {
	# read lines as parameters
	my @rippedParagraphs = @_;
	my @submits = ();
	# Storage for all sections
	# Temporary storages for single section of each type
	my (@Files, @CR, @RR, @CS, @Options, @Mailsent);
	# Flags for file traversal logic
	my ($opt_flag, $file_flag, $mail_sent_to_flag);

	my $submit_file = 0;
	#read the file
	for ( @rippedParagraphs ) {
		if (/^USER=(\S+)\,/) {
			#obtain the login from USER=
			$geckLogin = $1;
		}
		if (/^\s*Submit\s+file\s*$/) {
			# We record the accumulated data:
			push(
				@submits,
				{
					"Options"              => [@Options],
					"Files"                => [@Files],
					"Comments"             => [@CS],
					"RelatedRecords"       => [@RR],
					"CodeReviewers"        => [@CR],
					"GeckLogin" 		   => $geckLogin,
					"Mail sent to"         => [@Mailsent],
				}
			) if @Files;
			# Reset every per-section buffer, including the "Mail sent to" lines
			@Options = @Files = @CR = @CS = @RR = @Mailsent = ();
			$submit_file = 1;
			next;
		}
		if ($submit_file == 1) {
			if (/^\s*\=+\s*$/) {
				$submit_file++;
				$mail_sent_to_flag = 0;
			} else {
				$submit_file = 0; # two-line grammar didn't hold
			}
			next;
		}
		if ($submit_file == 2) {
			if ($mail_sent_to_flag) {
				push(@Mailsent, $_);
				next;
			}
			if (m|^\#\s*Sandbox\s+location\s*\:\s*\S*/sandbox/(.*?)/|) {
				# Match the login name in the submit file - if it has not
				# already been done
				$geckLogin ||= $1;
			}
			# If we encounter a comment or empty string
			if (/^\#/ || !/\S/) {
				# we haven't encountered an option to start doing anything
				next unless $opt_flag || $file_flag;
				# If we're done with options, let's start reading file sections
				if ($opt_flag == 1) {
					$opt_flag = 0;
					$file_flag = 1;
				}
				elsif ($opt_flag > 1) {
					# Addresses the empty line within Options:
					$opt_flag--;
				}
				next;
			}
			if (/^Options/) {
				# We start reading options
				$opt_flag = 2;
				next;
			}
			if (/^Mail sent to/) {
				$mail_sent_to_flag = 1;
				push(@Mailsent, $_);
				next;
			}
			# Matching beginning of the line to determine the type of the string
			# and placing it in temporary storage
			/^R(R|elated\sRecords):\s*(.*\n)/ && push(@RR, $2) && next;
			/^C(R|ode\sReviewer):\s*(.*\n)/ && push(@CR, $2) && next;
			/^C(S|omments):\s*(.*\n)/ && push(@CS, $2) && next;

			# General text is either files or options info, depending on the
			# value of the option flag
			$opt_flag ? push(@Options, $_) : push(@Files, $_);
		}
	}
	push(
		@submits,
		{
			"Options"              => [@Options],
			"Files"                => [@Files],
			"Comments"             => [@CS],
			"RelatedRecords"       => [@RR],
			"CodeReviewers"        => [@CR],
			"Mail sent to"         => [@Mailsent],
			"GeckLogin" 		   => $geckLogin,
		}
	) if @Files;
	return \@submits;
}


0
 

Author Comment

by:Tolgar
ID: 36608299
Hi,
This works perfectly.

I remember a discussion about this before, but I couldn't find the answer in it. The discussion was about line endings in Windows and in Unix.

My question is:

Can this code parse text files that are created both in Unix and Windows? Because they will have different line endings.

Thanks,



0
 

Author Comment

by:Tolgar
ID: 36611629
Hi,
How can I get the length of $data in your code?

Because, for the length of it, I will loop through its contents.

Thanks,

0
 

Author Comment

by:Tolgar
ID: 36612839
Let me clarify the last question:

$data in our case has two parts. One is from the first submit file group and the second one is from the second submit file group.

So I should get "2" as result of this command.

Thanks,

0
 
LVL 9

Assisted Solution

by:parparov
parparov earned 2000 total points
ID: 36613161
The length of the data is
my $data_length = scalar @{$data};

which gives the number of elements in the list.
my $data_largest_index = $#{$data};

gives the last index ($data_length-1) in the list.
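
As a minimal sketch of using both values to loop over the submit sections (the `$data` literal is a hypothetical two-section result, trimmed to the Files field):

```perl
use strict;
use warnings;

# Hypothetical two-section result, trimmed to one field for brevity.
my $data = [
    { Files => [ "st/ert/variants/variants5.c\n" ] },
    { Files => [ "st/ert/variants/variants6.c\n" ] },
];

my $data_length = scalar @{$data};    # number of submit sections
my $last_index  = $#{$data};          # highest valid index

# Walk the sections by index; each Files entry keeps its trailing newline.
for my $i ( 0 .. $last_index ) {
    print "Submit file ", $i + 1, ": ", @{ $data->[$i]{Files} };
}
```

When the index itself is not needed, `for my $submit (@{$data}) { ... }` visits the same elements without computing the length first.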

This code preserves the line endings as they are, they do not affect the code.
The files on Windows usually have a carriage return ("\r" or ^M) at the end in addition to newline. You can get rid of these chars, for example, by using utility dos2unix (or add them by using unix2dos) in linux.
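
If converting the file beforehand is not convenient, the endings can also be removed inside Perl. The sketch below assumes a hypothetical helper, `strip_eol`, that is not part of the posted script; it removes one trailing "\n" or "\r\n", which covers the "\cM\cJ" the debugger showed.

```perl
use strict;
use warnings;

# Hypothetical helper: strip one trailing line ending, Unix or Windows style.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;    # optional carriage return, then the newline
    return $line;
}

my @raw   = ( "-nowrap\r\n", "-KEYWORD1\n", "no terminator" );
my @clean = map { strip_eol($_) } @raw;

print join( "|", @clean ), "\n";
```

Applied to the values in `$data` before logging, this would keep stray carriage returns out of the output file regardless of where the input was created.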
0
 

Author Closing Comment

by:Tolgar
ID: 36617551
perfect solution!!!
0
 

Author Comment

by:Tolgar
ID: 36626109
I have a follow up question:

ID:27331899


Thanks,
0
 

Author Comment

by:Tolgar
ID: 36818767
@parparov:

Can you please explain what this means? In particular, why do we say "if @Files;" at the end?

push(
		@submits,
		{
			"Options"              => [@Options],
			"Files"                => [@Files],
			"Comments"             => [@CS],
			"RelatedRecords"       => [@RR],
			"CodeReviewers"        => [@CR],
			"Mail sent to"         => [@Mailsent],
			"GeckLogin" 		   => $geckLogin,
			"NoSubmitFileFlag"     => $noSubmitFileFlag,
		}
	) if @Files;
	return \@submits;




Thanks,
0
 
LVL 9

Expert Comment

by:parparov
ID: 36894611
It means a record is pushed only if some actual files were encountered. Without the check, the code would unconditionally push empty arrays into the resulting data structure.
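
A small sketch of the same guard in isolation: an array used as a condition is true when it holds at least one element, so the trailing `if` skips the push for an empty section. In the posted parser this matters at the first "Submit file" header, where nothing has been collected yet. `commit_record` is a hypothetical helper written for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the guarded push: a record is stored only
# when at least one file line was collected.
sub commit_record {
    my ( $submits, $files ) = @_;
    # An array in boolean context is true when it has at least one element.
    push @{$submits}, { Files => [ @{$files} ] } if @{$files};
    return scalar @{$submits};
}

my @submits;
commit_record( \@submits, [] );                                   # skipped
commit_record( \@submits, ["st/ert/variants/variants5.c\n"] );    # stored

print scalar(@submits), "\n";
```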
0
 

Author Comment

by:Tolgar
ID: 36894885
Thanks for the clarification
0
