How do I change my nawk one-liners to produce child files with more descriptive filenames?

I have a pair of large XML files that I'm breaking apart with nawk, so that I can work more easily with the pieces that are actually relevant to my project. Both of these files consist of raw election results.

The code I have is doing what I want, but it's producing files that lack descriptive filenames, which makes it much more time-consuming for me to identify which of the child files correspond to the data I want to work with.

This is the source of my first XML file. This is the code that's splitting this file apart:

nawk ' {print > "result"(NR%1?i:i++)".txt"; }' i=1 PI.txt

nawk is splitting up the parent file every time it finds a new line.
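
A note on the expression in that one-liner: NR%1 is always 0, so the ternary always takes the i++ branch and every input line lands in its own file. A minimal equivalent sketch, assuming a POSIX-style awk is available as `awk` and using a made-up three-line stand-in for PI.txt:

```shell
# Work in a scratch directory with a tiny, made-up stand-in for PI.txt.
demo_a=$(mktemp -d)
cd "$demo_a" || exit 1
printf 'John 10\nJane 20\nJake 30\n' > PI.txt

# NR counts input lines from 1, so it can replace the i counter.
# Building the name in a variable first avoids ambiguous parsing after ">",
# and close() keeps the number of simultaneously open files small.
awk '{ out = "result" NR ".txt"; print > out; close(out) }' PI.txt

ls
```

The close() call matters on awk implementations that limit how many output files can be open at once.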

This is the source of my second XML file. This is the code that's splitting this file apart:

nawk -v RS="</?Results>" -v FS="<Result>" '{ for(N=1; N<=NF; N++) if($N ~ /<[/]/) print FS $N > "result"++C".xml" }' AllStateGeneral2014.xml

Here, nawk is splitting the parent file into children every time it finds a new Result.

Again, the first XML file is being split on a line-by-line basis; the second is being split apart wherever nawk finds a new "Result" element. In both cases, however, the resulting filenames look like this:

result1.xml result2.xml result3.xml

... and so on.

It would save a lot of time if the filenames were more descriptive, and looked like this:

result1-John.xml result2-Jane.xml result3-Jake.xml

In the case of the first file, it would be acceptable if only the first word of the line were incorporated into the filename.

In the case of the second XML file, it would be ideal if the text that appears between <Candidate> and </Candidate> could be part of the filename. How do I modify the one-liners above to get nawk to create more descriptive filenames?
Asked by: Kaya Seabloom

gheist commented:
nawk is not the best XML parser in the world. There are hundreds more, like those listed here:
http://www.maketecheasier.com/manipulate-html-and-xml-files-from-commnad-line/

MikeOM_DBA commented:
Here is an idea:
nawk 'BEGIN{split(FILENAME,fn,".")}{print > fn[1] (NR%1?i:i++)".txt"; }' i=1 PI.txt
nawk -v RS="</?Results>" -v FS="<Result>" 'BEGIN{split(FILENAME,fn,".")}{ for(N=1; N<=NF; N++) if($N ~ /<[/]/) print FS $N > fn[1] ++C".xml" }' AllStateGeneral2014.xml

Kaya Seabloom (author) commented:
gheist, you're right, nawk isn't the best XML parser around. However, one of my files is just a plain text file. The other is XML, but is not very complex. It's the same elements repeated over and over, containing different text.

Mike, the code you provided is processed by the shell without throwing any errors, but in both cases, all the child files have numerical filenames, just as before - e.g. 0.xml, 1.xml, 2.xml, etc. I'm not getting output that's different than what I had before.
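
One likely reason the names did not change: FILENAME is not reliably set while the BEGIN block runs (POSIX leaves it undefined there, and gawk leaves it empty), because no input file has been opened yet, so split() in BEGIN leaves fn[1] empty. A hedged sketch that derives the prefix inside the main rule instead; the sample data is made up and `prefix` is just an illustrative variable name:

```shell
demo_b=$(mktemp -d)
cd "$demo_b" || exit 1
printf 'John 10\nJane 20\n' > PI.txt

awk '
# FNR == 1 is the first line of each input file; FILENAME is set by then.
FNR == 1 { prefix = FILENAME; sub(/\.[^.]*$/, "", prefix) }
{
  out = prefix FNR ".txt"     # e.g. PI1.txt, PI2.txt
  print > out
  close(out)
}
' PI.txt
```

This only puts the parent file's base name into each child name. It does not yet use anything from the line itself.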

MikeOM_DBA commented:
Show me your script.

Kaya Seabloom (author) commented:
It's pretty simple:
http://pastebin.com/tt8hgH3a

The source files (the .txt file and the .xml file) are copied/cached via a cron job every so often, and then I'm just using nawk to split them up. The original nawk one-liners are, of course, in my original post above. The paste contains the modified one-liners from your answer.

If I run either of these modified one-liners right on the command line, I get the same result: hundreds of child files with numerical filenames.

Duncan Roe (Software Developer) commented:
I've started to look at your first example. Simply by not producing a result file for a blank line, you go down from 625 files to 492:
#!/bin/sh
awk '
/^[[:space:]]*$/{next}
{
  print > "result"(NR%1?i:i++)".txt"
}
' i=1 PI.txt

Duncan Roe (Software Developer) commented:
This script produces files whose names contain the first word of each line:
#!/bin/sh
awk '
BEGIN {skip_next=0;i=1}

# Skip blank lines and lines starting with braces
/^([[:space:]]*$|[{}])/{next}

# skip heading lines (starting with ^L) and following non-blank line
/^\f/{skip_next=1; next}

# The main show
{
  if (skip_next)
  {
    skip_next=0
    next
  }
  print > "result" i++ "-" $1 ".txt"
}
' PI.txt

This is how the directory looks
result1-Initiative.txt     result10-(Precincts.txt    result100-Legislative.txt
result101-(Precincts.txt   result102-Maralyn.txt      result103-Robert.txt
result104-Write-in.txt     result105-Legislative.txt  result106-(Precincts.txt
result107-Cindy.txt        result108-Write-in.txt     result109-Legislative.txt
result11-Yes.txt           result110-(Precincts.txt   result111-Ruth.txt
result112-Alvin.txt        result113-Write-in.txt     result114-Legislative.txt
result115-(Precincts.txt   result116-Karen.txt        result117-Martin.txt
result118-Write-in.txt     result119-Legislative.txt  result12-No.txt
result120-(Precincts.txt   result121-Tina.txt         result122-Michael.txt
result123-Write-in.txt     result124-Legislative.txt  result125-(Precincts.txt
result126-Mia.txt          result127-Jeanette.txt     result128-Write-in.txt
result129-Legislative.txt  result13-Advisory.txt      result130-(Precincts.txt
result131-Sharon.txt       result132-Write-in.txt     result133-Legislative.txt
result134-(Precincts.txt   result135-Eileen.txt       result136-Write-in.txt

Files with names like result134-(Precincts.txt are awkward to deal with, however: you need to escape the opening parenthesis, which is special to the shell. Stay tuned.
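
In the meantime, quoting the name is enough to make the shell treat the parenthesis literally. A small sketch, using one file name from the listing above, created empty here purely for the demonstration:

```shell
demo_c=$(mktemp -d)
cd "$demo_c" || exit 1
: > 'result134-(Precincts.txt'      # create an empty file with that name

ls -l 'result134-(Precincts.txt'    # single quotes
ls -l "result134-(Precincts.txt"    # double quotes work too
ls -l result134-\(Precincts.txt     # or a backslash escape
```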

Duncan Roe (Software Developer) commented:
This one removes parentheses. You could insert any other characters that you don't want between the square brackets in the gensub call. Note that gensub is a GNU awk (gawk) extension, so this needs awk to be gawk.
#!/bin/sh
awk '
BEGIN {skip_next=0;i=1}

# Skip blank lines and lines starting with braces
/^([[:space:]]*$|[{}])/{next}

# skip heading lines (starting with ^L) and following non-blank line
/^\f/{skip_next=1; next}

# The main show
{
  if (skip_next)
  {
    skip_next=0
    next
  }
  fnam=gensub("[()]","","g",$1)
  print > "result" i++ "-" fnam ".txt"
}
' PI.txt

The directory listing now looks like
result1-Initiative.txt     result10-Precincts.txt     result100-Legislative.txt
result101-Precincts.txt    result102-Maralyn.txt      result103-Robert.txt
result104-Write-in.txt     result105-Legislative.txt  result106-Precincts.txt
result107-Cindy.txt        result108-Write-in.txt     result109-Legislative.txt
result11-Yes.txt           result110-Precincts.txt    result111-Ruth.txt
result112-Alvin.txt        result113-Write-in.txt     result114-Legislative.txt
result115-Precincts.txt    result116-Karen.txt        result117-Martin.txt
result118-Write-in.txt     result119-Legislative.txt  result12-No.txt
result120-Precincts.txt    result121-Tina.txt         result122-Michael.txt
result123-Write-in.txt     result124-Legislative.txt  result125-Precincts.txt
result126-Mia.txt          result127-Jeanette.txt     result128-Write-in.txt

It annoys me that the directory listing isn't in numerical order. Have you found some way to get around that? My usual remedy is to have enough leading zeroes so the string sort is also numerical. Will give it one more try.

Duncan Roe (Software Developer) commented:
And here it is!
#!/bin/sh
awk '
BEGIN {skip_next=0;i=1}

# Skip blank lines and lines starting with braces
/^([[:space:]]*$|[{}])/{next}

# skip heading lines (starting with ^L) and following non-blank line
/^\f/{skip_next=1; next}

# The main show
{
  if (skip_next)
  {
    skip_next=0
    next
  }
  fnam=sprintf("result%03d-%s.txt",i++,gensub("[()]","","g",$1))
  print > fnam
}
' PI.txt

The directory listing now looks like
result001-Initiative.txt   result002-Precincts.txt    result003-Yes.txt
result004-No.txt           result005-Initiative.txt   result006-Precincts.txt
result007-Yes.txt          result008-No.txt           result009-Initiative.txt
result010-Precincts.txt    result011-Yes.txt          result012-No.txt
result013-Advisory.txt     result014-Precincts.txt    result015-Repealed.txt
result016-Maintained.txt   result017-Advisory.txt     result018-Precincts.txt
result019-Repealed.txt     result020-Maintained.txt   result021-US.txt
result022-Precincts.txt    result023-Suzan.txt        result024-Pedro.txt
result025-Write-in.txt     result026-US.txt           result027-Precincts.txt
result028-Jim.txt          result029-Craig.txt        result030-Write-in.txt
result031-US.txt           result032-Precincts.txt    result033-Dave.txt

If you have more than 999 results, change %03d to %04d, and so on.
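
Because gensub() is specific to GNU awk, the script above may fail under nawk or mawk. Here is a sketch of the same naming scheme using the standard gsub() instead. It runs against a made-up three-line sample rather than the real PI.txt, and the heading and brace filters are omitted for brevity:

```shell
demo_d=$(mktemp -d)
cd "$demo_d" || exit 1
printf 'Initiative Measure\n(Precincts reporting)\nYes 12345\n' > PI.txt

awk '
/^[ \t]*$/ { next }                 # skip blank lines
{
  word = $1
  gsub(/[()]/, "", word)            # strip parentheses from the first word
  fnam = sprintf("result%03d-%s.txt", ++i, word)
  print > fnam
  close(fnam)                       # avoid running out of open files
}
' PI.txt
```

On this sample the result should be result001-Initiative.txt, result002-Precincts.txt and result003-Yes.txt.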

Duncan Roe (Software Developer) commented:
To get back to having a 1-line awk command, put the awk script in a file, say do_PI.awk. Now your 1-liner is
 awk -f do_PI.awk PI.txt
do-PI.awk.txt

Duncan Roe (Software Developer) commented:
What do you want to do for part 2? It seems all the result files start with <Result><RaceName>, and most of them are pretty similar even after that, e.g.:
<Result><RaceName>State Measures - Initiative Measure No. 1351 Concerns
<Result><RaceName>Advisory Votes - Advisory Vote No. 9 (Engrossed Subst
<Result><RaceName>Legislative District 17 - State Representative Pos. 2
<Result><RaceName>Legislative District 18 - State Representative Pos. 1
<Result><RaceName>Legislative District 18 - State Representative Pos. 1
<Result><RaceName>Legislative District 18 - State Representative Pos. 2
<Result><RaceName>Legislative District 18 - State Representative Pos. 2
<Result><RaceName>Legislative District 19 - State Representative Pos. 1
<Result><RaceName>Legislative District 19 - State Representative Pos. 1
<Result><RaceName>Legislative District 19 - State Representative Pos. 2
<Result><RaceName>Legislative District 19 - State Representative Pos. 2
<Result><RaceName>Legislative District 20 - State Representative Pos. 1
<Result><RaceName>Congressional District 1 - U.S. Representative</RaceN
<Result><RaceName>Legislative District 20 - State Representative Pos. 1
<Result><RaceName>Legislative District 20 - State Representative Pos. 2
<Result><RaceName>Legislative District 20 - State Representative Pos. 2
<Result><RaceName>Legislative District 21 - State Senator</RaceName><Ca
<Result><RaceName>Legislative District 21 - State Senator</RaceName><Ca
<Result><RaceName>Legislative District 21 - State Representative Pos. 1
<Result><RaceName>Legislative District 21 - State Representative Pos. 1
<Result><RaceName>Legislative District 21 - State Representative Pos. 2
<Result><RaceName>Legislative District 21 - State Representative Pos. 2
<Result><RaceName>Legislative District 22 - State Representative Pos. 1
<Result><RaceName>Congressional District 1 - U.S. Representative</RaceN
<Result><RaceName>Legislative District 22 - State Representative Pos. 1
<Result><RaceName>Legislative District 22 - State Representative Pos. 2
<Result><RaceName>Legislative District 22 - State Representative Pos. 2

What do you want to do?

Kaya Seabloom (author) commented:
Hi Duncan... as I said in my original question:
In the case of the second XML file, it would be ideal if the text that appears between <Candidate> and </Candidate> could be part of the filename.

This makes sense because what goes in the Candidate element is always unique.

Duncan Roe (Software Developer) commented:
Apart from Yes, No and maybe a few others. I guess you've accepted my answer because you're happy to do the other one yourself? It's kind-of similar: you might like to replace the spaces in candidates' names with underscores for easier handling. You would use index and substr to get the names.
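
The index/substr approach might look roughly like the sketch below. It is not the asker's final code, which is not shown in the thread. The sample XML is made up to match the shape described above, and a regex RS is an extension beyond POSIX that gawk and mawk support. The counter stays in the name because values such as Yes and No repeat.

```shell
demo_e=$(mktemp -d)
cd "$demo_e" || exit 1
# Made-up sample in the shape described in the thread.
cat > sample.xml <<'EOF'
<Results>
<Result><RaceName>District 1</RaceName><Candidate>John Doe</Candidate><Votes>10</Votes></Result>
<Result><RaceName>District 1</RaceName><Candidate>Jane Roe</Candidate><Votes>20</Votes></Result>
</Results>
EOF

awk -v RS='</?Results>' -v FS='<Result>' '
{
  for (n = 1; n <= NF; n++) {
    if ($n !~ /<\//) continue               # skip fragments with no closing tag
    start = index($n, "<Candidate>")
    stop  = index($n, "</Candidate>")
    if (start == 0 || stop <= start) continue
    start += length("<Candidate>")
    name = substr($n, start, stop - start)  # text between the two tags
    gsub(/[^A-Za-z0-9]+/, "_", name)        # spaces and other odd characters become _
    out = sprintf("result%03d-%s.xml", ++c, name)
    print FS $n > out
    close(out)
  }
}
' sample.xml
```

On this sample the files should be result001-John_Doe.xml and result002-Jane_Roe.xml.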

Kaya Seabloom (author) commented:
Yeah, I was able to do the other one, and move ahead with my project. Thanks!