Solved

Concordance of text

Posted on 2013-12-12
Last Modified: 2013-12-16
Nearly two weeks ago I asked for, and received, a script that iterates over a set of lines in a stanza.
That question and answer can be found here at EE:
Perl s
Asked by: weaverpankaew2
Solved by: wilcoxon
http://www.experts-exchange.com/Programming/Misc/Q_28305352.html

Now I need some expanded features. I would prefer that wilcoxon continue, but this is by no means a requirement. If you can provide a solution I will gladly award the points.
The text is now keyed to the following map. Each section is unique and the text layout is specific. The output notation does not change; I only need access to an array that holds each grouping of words based on its origin in the text. For example, @_ is a collection of the book descriptions. Any input on how to better collect this data for later access is appreciated.
For now I need to change the attached script so that what it currently does for an individual stanza is done for all the text, while maintaining a record of its location for future access.


This is the Key to the text. The hierarchy is ordered, so the indexed location of each key is crucial; for example, each key may be preceded by and/or followed by a specific key, and this information will be important to me in the future.
Two exceptions exist: the @& notation (a remark) is an exception to the hierarchy because it can precede or follow any other notation, and the @* notation (the beginning of stanzas for the current Canto) can only be followed by the next @^ (the next Canto in the book) or the @# notation (the beginning of the next book).
Blank lines and CR/LFs do not need to be indexed with the element of the key. Stripping the text of these characters is fine: as long as the words are separated by spaces, the collection of the words in the key can be concatenated.

The KEY, the TEXT, the example OUTPUT, and the SCRIPT are below:

...


the KEY
___________________________________________________
=for Key to input-text Notation Marks
  @& - Disregard text remark. Heading material or notations to ourselves (a text comment)
  @# - Book Number
  @_ - Short description of the book (where @# holds the current value of Book Number)
  @$ - Proem (a short section of 9 line stanzas that open each Book)
  @^ - Canto Number
  @% - Argument (the four line summaries that begin each canto)
  @* - Beginning of Stanzas of current Canto (where @^ holds this current value of the Canto)
=cut Key to input-text Notation Marks
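
One way to collect the keyed sections for later access, sketched under assumptions: the marker characters come from the key above, but the names %SECTION_FOR, classify_line and %collected are hypothetical and do not appear in any script in this thread. Lines are assumed to be chomped already.

```perl
use strict;
use warnings;

# Hypothetical lookup table: marker character => section name (from the key above).
my %SECTION_FOR = (
    '&' => 'REMARK',
    '#' => 'BOOK',
    '_' => 'DESCRIPTION',
    '$' => 'PROEM',
    '^' => 'CANTO',
    '%' => 'ARGUMENT',
    '*' => 'STANZAS',
);

# Return (section name, remaining text) if the line opens with a marker,
# or (undef, the line unchanged) if it continues the current section.
sub classify_line {
    my ($line) = @_;
    if ($line =~ m{^\@([&#_\$%*^])(.*)$}) {
        return ($SECTION_FOR{$1}, $2);
    }
    return (undef, $line);
}

# Collect the text of each section, keyed by section name.
my %collected;
my $current;
for my $line ('@#II', '@_THE SECOND BOOKE', 'OF THE FAERIE QVEENE.', '@^i') {
    my ($section, $text) = classify_line($line);
    $current = $section if defined $section;
    push @{ $collected{$current} }, $text if defined $current && length $text;
}
print "$_: @{ $collected{$_} }\n" for sort keys %collected;
```

A hash of arrays keeps each grouping reachable by name, which may be easier to extend than seven separate arrays; that is a design suggestion, not something the thread settles.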




the TEXT (This is the structure of the text with the Keys inserted):
___________________________________________________
@&The Faerie Queene: Book II
@#II
 
@_THE SECOND BOOKE
OF THE FAERIE QVEENE.

Contayning
THE LEGEND OF SIR GVYON.

OR
OF TEMPERAUNCE.


@$Right well I wote most mighty Soueraine,
That all this famous antique history,
Of some th' aboundance of an idle braine
Will iudged be, and painted forgery,
Rather then matter of iust memory,
Sith none, that breatheth liuing aire, does know,
Where is that happy land of Faery,
Which I so much do vaunt, yet no where show,
But vouch antiquities, which no body can know.

But let that man with better sence aduize,
That of the world least part to vs is red:
And dayly how through hardy enterprize,
Many great Regions are discouered,
Which to late age were neuer mentioned.
Who euer heard of th' Indian Peru?
Or who in venturous vessell measured
The Amazon huge riuer now found trew?
Or fruitfullest Virginia who did euer vew?

Yet all these were, when no man did them know;
Yet haue from wisest ages hidden beene:
And later times things more vnknowne shall show.
Why then should witlesse man so much misweene
That nothing is, but that which he hath seene?
What if within the Moones faire shining spheare?
What if in euery other starre vnseene
Of other worldes he happily should heare?
He wonder would much more: yet such to some appeare.

Of Faerie lond yet if he more inquire,
By certaine signes here set in sundry place
He may it find; ne let him then admire,
But yield his sence to be too blunt and bace,
That no'te without an hound fine footing trace.
And thou, O fairest Princesse vnder sky,
In this faire mirrhour maist behold thy face,
And thine owne realmes in lond of Faery,
 And in this antique Image thy great auncestry.

The which O pardon me thus to enfold
In couert vele, and wrap in shadowes light,
That feeble eyes your glory may behold,
Which else could not endure those beames bright,
But would be dazled with exceeding light.
O pardon, and vouchsafe with patient eare
The braue aduentures of this Faery knight
The good Sir Guyon gratiously to heare,
In whom great rule of Temp'raunce goodly doth appeare.



@&Canto I
@^i

@%Guyon by Archimage abusd,
The Redcrosse knight awaytes,
Findes Mordant and Amauia slaine
With pleasures poisoned baytes.

@*That cunning Architect of cancred guile,
Whom Princes late displeasure left in bands,
For falsed letters and suborned wile,
Soone as the Redcrosse knight he vnderstands,
To beene departed out of Eden lands,
To serue againe his soueraine Elfin Queene,
His artes he moues, and out of caytiues hands
Himselfe he frees by secret meanes vnseene;
His shackles emptie left, him selfe escaped cleene.

And forth he fares full of malicious mind,
To worken mischiefe and auenging woe,
Where euer he that godly knight may find,
His onely hart sore, and his onely foe,
Sith Vna now he algates must forgoe,
Whom his victorious hands did earst restore
To natiue crowne and kingdome late ygoe:
Where she enioyes sure peace for euermore,
As weather-beaten ship arriu'd on happie shore.



The OUTPUT (an example):
___________________________________________________
fq.txt1, [today's date & time], [user], [computer_name]

@& - Remark Array populated! XX entries
@# - Book Array populated! XX entries
@_ - Description Array populated! XX entries
@$ - Proem Array populated! XX entries
@^ - Canto Array populated! XX entries
@% - Argument Array populated! XX entries
@* - Stanzas Array populated! XX entries

(later, maybe not this round of requests, I will want a command line interaction to "view" these arrays in a specific format, such as "view selection?: 1" (assuming they would be numbered selections on screen) would yield the list of remarks and their index such as Book 1, Canto 4, Stanza 27, "The text here is not legible, the 1596 version of the text was notated in this way... etc.")

FILENAME: fq.txt1
BOOK NUMBER: 1
CANTO COUNT: 1
STANZA COUNT: 2
TOTAL LINE COUNT: 18
TOTAL WORD COUNT: 127
UNIQUE WORD COUNT: 110

UNIQUE WORDS: Amazon, And, But, Faery, I, Many, Of, Or, Peru?, Rather, Regions, Right, Sith, Soueraine, That, The, Virginia, Where, Which, Who, Will, aduize, age, aire, all, an, and, antique, antiquities, are, be, better, body, braine, breatheth, can, dayly, did, discouered, do, does, enterprize, euer, famous, forgery, found, fruitfullest, great, happy, hardy, heard, history, how, huge, idle, in, is, iudged, iust, know, land, late, least, let, liuing, man, matter, measured, memory, mentioned, mighty, most, much, neuer, no, none, now, of, painted, part, red, riuer, sence, show, so, some, th'Indian, th'aboundance, that, the, then, this, through, to, trew?, vaunt, venturous, vessell, vew?, vouch, vs, well, were, where, which, who, with, world, wote, yet

REPEATED WORDS: But, I, Or, That, Which, euer, is, known, o, of, that, to, who

INDEX OF WORDS:
WORD | FILENAME | STANZA | LOCATION | LINE [; STANZA | LOCATION | LINE; etc...]
Amazon, fq.txt1, 2, 2, The Amazon huge riuer now found trew?
And, fq.txt1, 2, 1, And dayly how through hardy enterprize,
But, fq.txt1, 1, 1, But vouch antiquities, which no body can know.; 2, 1, But let that man with better sence aduize,
Faery, fq.txt1, 1, 7, Where is that happy land of Faery,
I, fq.txt1, 1, 3, Right well I wote most mighty Soueraine,; 1, 2, Which I so much do vaunt, yet no where show,
...




The SCRIPT, developed by wilcoxon here at EE, parses the Stanzas. Its output is almost perfect; we are still working out some formatting issues, but the script is included here as a reference for others:
___________________________________________________
use strict;
use warnings;
use File::Slurp qw(slurp);
my $fil = shift or die "Usage: $0 textfile\n";
my @lines = slurp($fil);
# my @lines = slurp($fil, chomp => 1);
my ($stanza, $word_cnt, %uwords);
# loop over all lines of the stanza
while (@lines) {
     # skip blank lines before the stanza; stop if only blank lines remain
     shift @lines while (@lines && $lines[0] =~ m{^\s*$});
     last unless @lines;
     # increase stanza count
     $stanza++;
     # loop over the lines in the 9-line stanza
     for my $ln (1..9) {
          my $line = shift @lines or die "ran out of lines mid-stanza";
          # split the line and remove punctuation - can add others to char class
          my @words = map { s{[.,;:]$}{}; $_ } split m{\s+}, $line;
          # loop over words
          for my $i (0..@words-1) {
               $word_cnt++;
               $uwords{$words[$i]} = [] unless exists($uwords{$words[$i]});
               push @{$uwords{$words[$i]}}, [$stanza, $line, $i+1];
               }
     }
}
# output
print "FILENAME: $fil\n",
    "STANZA COUNT: $stanza\n",
    "TOTAL LINE COUNT: ", $stanza*9, "\n",
    "TOTAL WORD COUNT: $word_cnt\n",
    "UNIQUE WORD COUNT: ", scalar(keys %uwords), "\n\n",
    "UNIQUE WORDS: ", join(', ', sort keys %uwords), "\n\n",
    # join the repeated words; without join they are printed run together
    "REPEATED WORDS: ", join(', ', grep { @{$uwords{$_}} > 1 } sort keys %uwords), "\n\n",
    "INDEX OF WORDS:\n";
foreach my $word (sort keys %uwords) {
    print "$word, $fil, ", join('; ', map { join ',', @$_ } @{$uwords{$word}}), "\n";
}
Question by:Todd Weaver

Author Comment

by:Todd Weaver
thank you.
Expert Comment

by:ozo
# I'm not sure I understood the output you wanted.
# Let me know what changes you want from this:
use strict;
use warnings;
my $fil = $ARGV[0] or die "Usage: $0 textfile\n";
my ($book, $stanza, $canto, $line, $book_cnt, $line_cnt, $word_cnt, $canto_cnt, $stanza_cnt, %uwords, @words);
# loop over all lines of the stanza
while( <> ){
    chomp;
    # # make sure there are no blank lines to start
    m{\S} or $stanza+=!!$stanza, next;
    m{^\@[&_]} and next;
    # increase stanza count
    s/^\@#(.*)// and $book=$1, ++$book_cnt;
    s/^\@\^(.*)// and $canto=$1, ++$canto_cnt;
    s/^\@\$// and $canto="Proem", $stanza=1;
    s/^\@[*\$]// and $stanza=1;
#    next unless $stanza;
    s/^(\@%|\s+)//;
    ++$line_cnt;
     # split the line and remove punctuation - can add others to char class
     $word_cnt += @words = split m{[\s.,;:?]+};
     # loop over words
     my $i=0;
     $line=$_;
     push @{$uwords{$_}}, "$_, $ARGV, Book:$book, Canto:$canto, Stanza:$stanza, Location:".++$i." $line" for @words;

}
# output
my @sort =  sort keys %uwords;
$"=", ";
print "FILENAME: $fil\n",
    "BOOK NUMBER: $book\n",
    "CANTO COUNT: $canto_cnt\n",
    "STANZA COUNT: $stanza\n",
    "TOTAL LINE COUNT: $line_cnt\n",
    "TOTAL WORD COUNT: $word_cnt\n",
    "UNIQUE WORD COUNT: ".(keys %uwords)."\n\n",
    "UNIQUE WORDS: @sort\n\n",
    "REPEATED WORDS: ", (join', ', grep{ @{$uwords{$_}}>1 } @sort), "\n\n",
    "INDEX OF WORDS:\n";
for( @uwords{@sort} ){
    print join('; ',@{$_}), "\n";
}

Author Comment

by:Todd Weaver
This seems to be a really excellent answer.  The output is not quite right.

please pay careful attention to "!!!Please see this remark" it indicates a minor change and a crucial element of the output.

(((NOTE: BOOK and CANTO counting will ultimately need to be in literary notation so this means in roman numerals, for example: 1,2,3,4,5... is I, II, III, IV, V,... for BOOKs, and i,ii,iii,iv,v,... for CANTOs
If this is confusing then just use numbers and I can look into how to change these into roman numerals later...)))
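
A minimal sketch of the Roman-numeral conversion mentioned in the note above, assuming positive integers below 4000. The name to_roman is a hypothetical helper and is not part of the scripts in this thread; lower-casing its result gives the canto style.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a positive integer (assumed below 4000)
# to an upper-case Roman numeral by repeated subtraction.
sub to_roman {
    my ($n) = @_;
    my @pairs = (
        [1000, 'M'], [900, 'CM'], [500, 'D'], [400, 'CD'],
        [100,  'C'], [90,  'XC'], [50,  'L'], [40,  'XL'],
        [10,   'X'], [9,   'IX'], [5,   'V'], [4,   'IV'], [1, 'I'],
    );
    my $roman = '';
    for my $pair (@pairs) {
        my ($value, $numeral) = @$pair;
        while ($n >= $value) {
            $roman .= $numeral;
            $n -= $value;
        }
    }
    return $roman;
}

# Books use upper case, cantos lower case.
print "Book ", to_roman(2), ", Canto ", lc(to_roman(4)), "\n";   # prints "Book II, Canto iv"
```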


An Explanation of the KEYS section of the OUTPUT:
The "keys" section (the first section of the OUTPUT) lists the arrays that were collected but not displayed individually, though their word content is included in the INDEX OF WORDS. Each array is filled with the words from its segment of text in the file. This first section is just a report of which keys were found and that the arrays were populated, for future improvements to the report or to satisfy new reports.


An Explanation of the INDEX OF WORDS section of the OUTPUT:

!!!Please see this remark.
I now see that some elements need to be joined to the output list in the index of words (I'm sorry for this change request). I see now that Book and Argument/Proem/Canto must be added and noted in the output, and that LINE and TEXT are different and are noted herein.

the elements of the indexed word OUTPUT:
  WORD - the word
  BOOK - the Book Number
  CANTO/ARGUMENT/DESCRIPTION/PROEM - a count for CANTO and PROEM; the word "ARGUMENT" or "DESCRIPTION" for Argument or Description (if the word is in the DESCRIPTION then the report for this word is finished at this point)
  STANZA - for CANTO, PROEM and ARGUMENT this is a count
  LINE - the line number of the CANTO, ARGUMENT or PROEM
  LOCATION - which word, the location, like 3 for third word
  TEXT - the text
For repeated words, like "I", a semicolon follows and then STANZA | LINE | LOCATION | TEXT again for each further occurrence.

An ARGUMENT has _one_ STANZA of 4 lines ONLY!
A CANTO has _many_ STANZAs and each stanza is 9 lines ONLY!

!!! please see this remark:
A good example of the INDEX OF WORDS output is for the words "but" and "the" :
WORD | BOOK | CANTO/ARGUMENT/DESCRIPTION/PROEM| STANZA | LINE| LOCATION | TEXT
But      I, PROEM, 1, 9, 1, But vouch antiquities, which no body can know.; I, PROEM, 2, 1, 1, But let that man with better sence aduize,

The      I, DESCRIPTION; I, ARGUMENT, 1, 2, 1, The Redcrosse knight awaytes,; I, PROEM, 2, 2, 3, That of the world least part to vs is red:; I, PROEM, 3, 6, 4, What if within the Moones faire shining spheare?; I, i, 1, 4, 3, Soone as the Redcrosse knight he vnderstands,;

1) If the word  is not in a CANTO and is instead in the PROEM then replace "i"  with the word PROEM.

2) If the word is not in a CANTO and is not in the PROEM, then it must be in the ARGUMENT or in the DESCRIPTION. The ARGUMENT notation would be the same as the CANTO notation, except the STANZA count would always and only be "1" because ARGUMENTS are only ever _one_ stanza.

3) If the word is in the description it only needs to be noted as DESCRIPTION: no STANZA, no LINE, no LOCATION, no TEXT. Example: WORD, I, DESCRIPTION (i.e.: Legend, Book 1, in the description)
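
The three rules above can be sketched as a small formatting routine. This is an illustration under assumptions, not the accepted solution: format_entry and the hash keys (book, section, stanza, line, location, text) are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical helper. $e->{section} is assumed to hold 'PROEM', 'ARGUMENT',
# 'DESCRIPTION', or a canto numeral such as 'i'.
sub format_entry {
    my ($e) = @_;
    # Rule 3: a DESCRIPTION entry stops after the section name.
    return join ', ', $e->{book}, 'DESCRIPTION' if $e->{section} eq 'DESCRIPTION';
    # Rule 2: an ARGUMENT is always a single stanza.
    my $stanza = $e->{section} eq 'ARGUMENT' ? 1 : $e->{stanza};
    # Rule 1 needs no special case: the section field already holds PROEM or the canto.
    return join ', ', $e->{book}, $e->{section}, $stanza,
        $e->{line}, $e->{location}, $e->{text};
}

my @the = (
    { book => 'I', section => 'DESCRIPTION' },
    { book => 'I', section => 'ARGUMENT', stanza => 1, line => 2, location => 1,
      text => 'The Redcrosse knight awaytes,' },
);
# Entries for one word are separated by semicolons, as in the example above.
print 'The      ', join('; ', map { format_entry($_) } @the), "\n";
```

Storing each occurrence as a hash rather than a pre-joined string leaves the layout decision to print time, which may make later report changes cheaper.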


I hope this is more clear. So far I have about 90 percent of what I want. If we can clean up the OUTPUT I think everything I have asked for is otherwise present in your script.


The updated OUTPUT:
___________________________________________________
fq.txt1 (the file read), [today's date & time], [user], [computer_name]

@& - Remark Array populated! XX entries (how many remarks)
@# - Book Array populated! XX entries (how many Books)
@_ - Description Array populated! XX entries (how many Descriptions)
@$ - Proem Array populated! XX entries (how many Proems)
@^ - Canto Array populated! XX entries (how many Cantos)
@% - Argument Array populated! XX entries (how many Arguments)
@* - Stanzas Array populated! XX entries (how many Stanzas)

FILENAME: fq.txt1
BOOK NUMBER (comment: count of items in @#): 1
CANTO COUNT (comment: count of items in @^): 1
STANZA COUNT (comment: count of items in @*): 2
TOTAL LINE COUNT (comment: count of items in @*): 18
TOTAL WORD COUNT (comment: count of items in @_, @$, @%, @*; only these 4 lists): 127
UNIQUE WORD COUNT (comment: count of File): 110

UNIQUE WORDS: Amazon, And, But, Faery, I, Many, Of, Or, Peru?, Rather, Regions, Right, Sith, Soueraine, That, The, Virginia, Where, Which, Who, Will, aduize, age, aire, all, an, and, antique, antiquities, are, be, better, body, braine, breatheth, can, dayly, did, discouered, do, does, enterprize, euer, famous, forgery, found, fruitfullest, great, happy, hardy, heard, history, how, huge, idle, in, is, iudged, iust, know, land, late, least, let, liuing, man, matter, measured, memory, mentioned, mighty, most, much, neuer, no, none, now, of, painted, part, red, riuer, sence, show, so, some, th'Indian, th'aboundance, that, the, then, this, through, to, trew?, vaunt, venturous, vessell, vew?, vouch, vs, well, were, where, which, who, with, world, wote, yet

REPEATED WORDS: But, I, Or, That, Which, euer, is, known, o, of, that, to, who

INDEX OF WORDS:
WORD | BOOK | CANTO/ARGUMENT/DESCRIPTION/PROEM| STANZA | LINE| LOCATION | TEXT
But      I, PROEM, 1, 9, 1, But vouch antiquities, which no body can know.; I, PROEM, 2, 1, 1, But let that man with better sence aduize,
...
The      I, DESCRIPTION; I, ARGUMENT, 1, 2, 1, The Redcrosse knight awaytes,; I, PROEM, 2, 2, 3, That of the world least part to vs is red:; I, PROEM, 3, 6, 4, What if within the Moones faire shining spheare?; I, i, 1, 4, 3, Soone as the Redcrosse knight he vnderstands,;
...
...
etc...

Author Comment

by:Todd Weaver
I apologize, I had not piped the output to a text file for a closer look before posting my response.

I had a closer look at this output and it is verbose and includes the elements of my report request.  I am looking into the print commands to discover if I can clean the report on my own.  I will post any questions that I have and any code that I find tricky or complicated.

Please, if you have time, my goal is to gain the output that I have posted above.

Thank you.
Accepted Solution

by:
ozo earned 500 total points
use strict;
use warnings;
@ARGV or die "Usage: $0 textfile\n";
my ($book, $stanza, $canto, $line, $line_cnt, $word_cnt, $CADP, %cnt, %uwords, @words);
# loop over all lines of the stanza
while( <> ){
    chomp;
    # # make sure there are no blank lines to start
    m{\S} or ++$stanza, next;
    m{^\@&} and ++$cnt{REMARK}, next;
    # increase stanza count
    s/^\@#(.*)// and $book=$1, ++$cnt{$CADP="BOOK"};
    s/^\@\^(.*)// and $canto=$1, ++$cnt{$CADP="CANTO"};
    s/^\@\$// and $stanza=1, ++$cnt{$CADP="PROEM"};
    s/^\@%// and ++$cnt{$CADP="ARGUMENT"};
    s/^\@_// and ++$cnt{$CADP="DESCRIPTION"};
    s/^\@[*\$]// and $stanza=1, $CADP=$canto;
    ++$line_cnt;
     # split the line and remove punctuation - can add others to char class
     $word_cnt += @words = split m{[\s.,;:?]+};
     # loop over words
     my $i=0;
     $line=$_;
     push @{$uwords{$_}}, "$_, $ARGV, $book, $CADP, $stanza, ".++$i.", $line" for @words;

}
# output
my @sort =  sort keys %uwords;
$"=", ";
print map{"$_ COUNT: $cnt{$_}\n"}qw(BOOK REMARK DESCRITION PROEM CANTO STANZA ARGUMENT);
print
    "TOTAL LINE COUNT: $line_cnt\n",
    "TOTAL WORD COUNT: $word_cnt\n",
    "UNIQUE WORD COUNT: ".(keys %uwords)."\n\n",
    "UNIQUE WORDS: @sort\n\n",
    "REPEATED WORDS: ", (join', ', grep{ @{$uwords{$_}}>1 } @sort), "\n\n",
    "INDEX OF WORDS:\n";
for( @uwords{@sort} ){
    print join('; ',@{$_}), "\n";
}

Author Comment

by:Todd Weaver
C:\Spenser>perl -w concordance.pl fq.txt > output.txt
Use of uninitialized value in concatenation (.) or string at concordance.pl line 30, <> line 7014.
Use of uninitialized value in concatenation (.) or string at concordance.pl line 30, <> line 7014.

C:\Spenser>

I believe the first time this error is reported it is for the typo "DESCRITION". When I corrected this to "DESCRIPTION" and ran the script again, it gave only one "uninitialized value in concatenation" error. Perhaps "STANZA" is not in the array yet.

print map{"$_ COUNT: $cnt{$_}\n"}qw(BOOK REMARK DESCRITION PROEM CANTO STANZA ARGUMENT);

Author Comment

by:Todd Weaver
The script is better than ever, and the output is getting very close to the results that I desire.

Some feedback: stanzas are 9 lines only, so the stanza count should bump up every 9 lines. Stanzas begin with only one key, "@*"; from there the count should move to a new stanza every 9 lines, therefore the resulting total stanza count should be evenly divisible by 9.
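
A sketch of deriving the stanza number and the line within the stanza purely from a running count, as the feedback above asks. stanza_position is a hypothetical helper; it assumes the caller counts only non-blank lines since the "@*" marker, starting at 1.

```perl
use strict;
use warnings;

# Hypothetical helper: $n is the 1-based count of non-blank lines since the
# @* marker; $per_stanza defaults to 9 (pass 4 for an ARGUMENT).
sub stanza_position {
    my ($n, $per_stanza) = @_;
    $per_stanza = 9 unless defined $per_stanza;
    my $stanza = int(($n - 1) / $per_stanza) + 1;   # 1 for lines 1..9, 2 for 10..18
    my $line   = ($n - 1) % $per_stanza + 1;        # always in 1..$per_stanza
    return ($stanza, $line);
}

for my $n (1, 9, 10, 18) {
    my ($stanza, $line) = stanza_position($n);
    print "canto line $n -> stanza $stanza, line $line\n";
}
```

Counting this way does not depend on blank lines separating stanzas, so extra or missing blank lines in the input would not shift the stanza number.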

Here is a smaller segment of the text that includes all the keys we are looking for and should produce a clean report.  Please can we work on this text so that we can get a working prototype.

Here is the text to work with. Please add hyphens and ampersands to the split (I get an error when I put these chars in the split; I don't know if I must add an escape slash to & or -, i.e. \& or \-):

Please use this text:

@&The Faerie Queene: Book II
@#II
 
@_THE SECOND BOOKE
OF THE FAERIE QVEENE.

Contayning
THE LEGEND OF SIR GVYON.

OR
OF TEMPERAUNCE.


@$Right well I wote most mighty Soueraine,
That all this famous antique history,
Of some th' aboundance of an idle braine
Will iudged be, and painted forgery,
Rather then matter of iust memory,
Sith none, that breatheth liuing aire, does know,
Where is that happy land of Faery,
Which I so much do vaunt, yet no where show,
But vouch antiquities, which no body can know.

But let that man with better sence aduize,
That of the world least part to vs is red:
And dayly how through hardy enterprize,
Many great Regions are discouered,
Which to late age were neuer mentioned.
Who euer heard of th' Indian Peru?
Or who in venturous vessell measured
The Amazon huge riuer now found trew?
Or fruitfullest Virginia who did euer vew?

Yet all these were, when no man did them know;
Yet haue from wisest ages hidden beene:
And later times things more vnknowne shall show.
Why then should witlesse man so much misweene
That nothing is, but that which he hath seene?
What if within the Moones faire shining spheare?
What if in euery other starre vnseene
Of other worldes he happily should heare?
He wonder would much more: yet such to some appeare.

Of Faerie lond yet if he more inquire,
By certaine signes here set in sundry place
He may it find; ne let him then admire,
But yield his sence to be too blunt and bace,
That no'te without an hound fine footing trace.
And thou, O fairest Princesse vnder sky,
In this faire mirrhour maist behold thy face,
And thine owne realmes in lond of Faery,
 And in this antique Image thy great auncestry.

The which O pardon me thus to enfold
In couert vele, and wrap in shadowes light,
That feeble eyes your glory may behold,
Which else could not endure those beames bright,
But would be dazled with exceeding light.
O pardon, and vouchsafe with patient eare
The braue aduentures of this Faery knight
The good Sir Guyon gratiously to heare,
In whom great rule of Temp'raunce goodly doth appeare.



@&Canto I
@^i

@%Guyon by Archimage abusd,
The Redcrosse knight awaytes,
Findes Mordant and Amauia slaine
With pleasures poisoned baytes.

@*That cunning Architect of cancred guile,
Whom Princes late displeasure left in bands,
For falsed letters and suborned wile,
Soone as the Redcrosse knight he vnderstands,
To beene departed out of Eden lands,
To serue againe his soueraine Elfin Queene,
His artes he moues, and out of caytiues hands
Himselfe he frees by secret meanes vnseene;
His shackles emptie left, him selfe escaped cleene.

And forth he fares full of malicious mind,
To worken mischiefe and auenging woe,
Where euer he that godly knight may find,
His onely hart sore, and his onely foe,
Sith Vna now he algates must forgoe,
Whom his victorious hands did earst restore
To natiue crowne and kingdome late ygoe:
Where she enioyes sure peace for euermore,
As weather-beaten ship arriu'd on happie shore.

Author Comment

by:Todd Weaver
It seems that this regex is greedily slurping up the newline character and placing it in the index so that the output contains the cr:
 
s/^\@#(.*)// and $book=$1, ++$cnt{$CADP="BOOK"};
s/^\@\^(.*)// and $canto=$1, ++$cnt{$CADP="CANTO"};

this output:
Amauia, II
, ARGUMENT, 9, 4, Findes Mordant and Amauia slaine

should look like this:
Amauia,
     II, ARGUMENT, 9, 4, Findes Mordant and Amauia slaine

further output change, please correct this:
And, II
, PROEM, 2, 1, And dayly how through hardy enterprize,
And, II
, PROEM, 3, 1, And later times things more vnknowne shall show.
And, II
, PROEM, 4, 1, And thou, O fairest Princesse vnder sky,
And, II
, PROEM, 4, 1, And thine owne realmes in lond of Faery,
And, II
, PROEM, 4, 2,  And in this antique Image thy great auncestry.
And, II
, i
, 2, 1, And forth he fares full of malicious mind,


To be this (for readability):
And,
     II, PROEM, 2, 1, And dayly how through hardy enterprize,
     II, PROEM, 3, 1, And later times things more vnknowne shall show.
     II, PROEM, 4, 1, And thou, O fairest Princesse vnder sky,
     II, PROEM, 4, 1, And thine owne realmes in lond of Faery,
     II, PROEM, 4, 2,  And in this antique Image thy great auncestry.
     II, i, 2, 1, And forth he fares full of malicious mind,

If this happens then I feel that we are 99% finished.  The final problem is that the initialization of a variable is placing a "whitespace character" in the index; it is the first entry in the "INDEX OF WORDS". Please have a look.

Thank you,

Todd Weaver
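
One possible cause of the stray first index entry, offered as an assumption rather than something confirmed in the thread: Perl's split returns an empty leading field when the line begins with a delimiter, and the sample text has one line with a leading space (" And in this antique Image..."). The sketch below shows the effect and a filter; words_of is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into words and drop empty fields.
sub words_of {
    my ($line) = @_;
    # grep { length } removes the empty leading field that split produces
    # when the line starts with a delimiter character.
    return grep { length } split m{[\s.,;:?]+}, $line;
}

my $line  = ' And in this antique Image thy great auncestry.';
my @raw   = split m{[\s.,;:?]+}, $line;   # first field is the empty string
my @words = words_of($line);
print scalar(@raw), " fields before filtering, ", scalar(@words), " after\n";
```

The empty field would also explain why "And" on that line is reported at location 2 instead of 1 in the output quoted above.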

Author Comment

by:Todd Weaver
Adding "\s" solved the extra CR at the end of the regex (I think):

    s/^\@#(.*)\s// and $book=$1, ++$cnt{$CADP="BOOK"};
    s/^\@\^(.*)\s// and $canto=$1, ++$cnt{$CADP="CANTO"};

output (now):
Amauia, II, ARGUMENT, 9, 4, Findes Mordant and Amauia slaine

Amazon, II, PROEM, 2, 2, The Amazon huge riuer now found trew?

And, II, PROEM, 2, 1, And dayly how through hardy enterprize,
And, II, PROEM, 3, 1, And later times things more vnknowne shall show.
And, II, PROEM, 4, 1, And thou, O fairest Princesse vnder sky,
And, II, PROEM, 4, 1, And thine owne realmes in lond of Faery,
And, II, PROEM, 4, 2,  And in this antique Image thy great auncestry.
And, II, i, 2, 1, And forth he fares full of malicious mind,

Archimage, II, ARGUMENT, 9, 3, Guyon by Archimage abusd,

Architect, II, i, 1, 3, That cunning Architect of cancred guile,

As, II, i, 2, 1, As weather-beaten ship arriu'd on happie shore.
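
A caution on the "\s" change above: because the pattern now requires a whitespace character after the captured text, a marker line with no trailing whitespace would not match at all, and the book or canto would silently not be set. A sketch of an alternative, assuming the stray character is a trailing carriage return (not confirmed in the thread), is to trim trailing whitespace first. parse_marker is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: trim trailing whitespace (including a stray carriage
# return), then capture whatever follows the given marker character.
sub parse_marker {
    my ($line, $marker) = @_;
    $line =~ s/\s+\z//;
    return $line =~ m{^\@\Q$marker\E\s*(.*)\z} ? $1 : undef;
}

# The same value comes back with or without trailing CR/LF characters.
for my $raw ("\@#II\r", "\@#II", "\@^i\r\n") {
    my $marker = substr $raw, 1, 1;
    my $value  = parse_marker($raw, $marker);
    print "marker $marker -> [$value]\n";
}
```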

Author Comment

by:Todd Weaver
this is the code that I believe is 99% finished, please see the comments where I have placed "# tw" (maybe four total) to see the required changes.

use strict;
use warnings;
@ARGV or die "Usage: $0 textfile\n";
my ($book, $stanza, $canto, $line, $line_cnt, $word_cnt, $CADP, %cnt, %uwords, @words);
# loop over all lines of the stanza
while( <> ){
    chomp;
    # # make sure there are no blank lines to start
    m{\S} or ++$stanza, next;
    m{^\@&} and ++$cnt{REMARK}, next;
    # increase stanza count
   
  # tw strip trailing newline char
  # s/^\@#(.*)// and $book=$1, ++$cnt{$CADP="BOOK"};
  # s/^\@\^(.*)// and $canto=$1, ++$cnt{$CADP="CANTO"};
    s/^\@#(.*)\s// and $book=$1, ++$cnt{$CADP="BOOK"};
    s/^\@\^(.*)\s// and $canto=$1, ++$cnt{$CADP="CANTO"};
   
    s/^\@\$// and $stanza=1, ++$cnt{$CADP="PROEM"};
    s/^\@%// and ++$cnt{$CADP="ARGUMENT"};
    s/^\@_// and ++$cnt{$CADP="DESCRIPTION"};
    s/^\@[*\$]// and $stanza=1, $CADP=$canto;
    ++$line_cnt;
     # split the line and remove punctuation - can add others to char class
   # tw added split on hyphen, ampersand
   # $word_cnt += @words = split m{[\s.,;:?]+};
     $word_cnt += @words = split m{[\s.,;:?&-]+};
     
     # loop over words
     my $i=0;
     $line=$_;
   # tw push $line_cnt into @uwords $line_cnt should always be 1-9 for stanza and proem, 1-4 for argument, and not the actual line count of the file as this information is not useful in the output concordance
   # push @{$uwords{$_}}, "$_, $ARGV, $book, $CADP, $stanza, ".++$i.", $line" for @words;
   # push @{$uwords{$_}}, "$_, $book, $CADP, $stanza, ".++$i.", $line" for @words;
     push @{$uwords{$_}}, "$_, $book, $CADP, $stanza, ".++$i.", $line_cnt, $line" for @words;

}
# output
my @sort =  sort keys %uwords;
$"=", ";
print map{"$_ COUNT: $cnt{$_}\n"}qw(BOOK REMARK DESCRIPTION PROEM CANTO STANZA ARGUMENT);
print
    "TOTAL LINE COUNT: $line_cnt\n",
    "TOTAL WORD COUNT: $word_cnt\n",
    "UNIQUE WORD COUNT: ".(keys %uwords)."\n\n",
   
  # tw commented for readability, later make this output a switch user choice
  # "UNIQUE WORDS: @sort\n\n",
  # "REPEATED WORDS: ", (join', ', grep{ @{$uwords{$_}}>1 } @sort), "\n\n",

    "INDEX OF WORDS:\nWORD | BOOK | C/A/D/P | STANZA | LINE | LOCATION | TEXT\n";
for( @uwords{@sort} ){
  # tw
  # print join('; ',@{$_}), "\n";
    print join('',@{$_}), "\n";
}
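
On the earlier question of whether & or - need a backslash in the split pattern: inside a bracketed character class the ampersand is an ordinary character, and a hyphen is literal when it is the first or last character of the class, which is how the script above writes it. A small sketch using a line from the sample text ($delims is a hypothetical variable name):

```perl
use strict;
use warnings;

# & needs no escape inside [...]; the hyphen is literal because it is last.
my $delims = qr{[\s.,;:?&-]+};

my @words = split $delims, "As weather-beaten ship arriu'd on happie shore.";
print join('|', @words), "\n";   # prints "As|weather|beaten|ship|arriu'd|on|happie|shore"
```

A hyphen placed between two other characters, as in [a-z], forms a range instead, which is a likely source of an error when it is added in the middle of the class.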

Author Closing Comment

by:Todd Weaver
I have appended this solution to a follow-up question.  Please feel free to provide a solution to my follow-up question.  Thank you for your help.

Thank you,

Todd Weaver