Kaya Seabloom

How do I change my nawk one-liners to produce child files with more descriptive filenames?

I have a pair of large/long XML files that I'm breaking apart with nawk, so that I can work more easily with the pieces that are actually relevant to my project. Both of these files consist of raw election results.

The code I have is doing what I want, but it's producing files that lack descriptive filenames, which makes it much more time consuming for me to identify which of the child files correspond to the data I want to work with.

This is the source of my first XML file. This is the code that's splitting this file apart:

nawk ' {print > "result"(NR%1?i:i++)".txt"; }' i=1 PI.txt


nawk is splitting up the parent file every time it finds a new line.
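Because NR%1 is always 0, the ternary in the one-liner above always takes the i++ branch, so line N of the input ends up in resultN.txt. A minimal equivalent sketch, assuming a POSIX awk; the helper name split_lines is made up for illustration, and the close() call is an addition, since some awk implementations limit how many output files can be open at once:

```shell
# split_lines FILE: write line N of FILE to resultN.txt in the current directory.
# (split_lines is a hypothetical helper name, not part of the original post.)
split_lines() {
  awk '{
    out = "result" NR ".txt"   # NR is the current line number: 1, 2, 3, ...
    print > out
    close(out)                 # added: avoids holding hundreds of files open
  }' "$1"
}
```

It would be called as split_lines PI.txt from the directory that should receive the child files.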

This is the source of my second XML file. This is the code that's splitting this file apart:

nawk -v RS="</?Results>" -v FS="<Result>" '{ for(N=1; N<=NF; N++) if($N ~ /<[/]/) print FS $N > "result"++C".xml" }' AllStateGeneral2014.xml


Here, nawk is splitting the parent file into children every time it finds a new Result.

Again, the first XML file is being split on a line-by-line basis; the second is being split apart wherever nawk finds a new "Result" element. In both cases, however, the resulting filenames look like this:

result1.xml result2.xml result3.xml

... and so on.

It would save a lot of time if the filenames were more descriptive, and looked like this:

result1-John.xml result2-Jane.xml result3-Jake.xml

In the case of the first file, it would be acceptable if only the first word of the line were incorporated into the filename.

In the case of the second XML file, it would be ideal if the text that appears between <Candidate> and </Candidate> could be part of the filename. How do I modify the one-liners above to get nawk to create more descriptive filenames?
gheist
Nawk is not the best XML parser in the world; there are hundreds of others, such as the ones listed here:
http://www.maketecheasier.com/manipulate-html-and-xml-files-from-commnad-line/
Here is an idea:
nawk 'BEGIN{split(FILENAME,fn,".")}{print > fn[1] (NR%1?i:i++)".txt"; }' i=1 PI.txt
nawk -v RS="</?Results>" -v FS="<Result>" 'BEGIN{split(FILENAME,fn,".")}{ for(N=1; N<=NF; N++) if($N ~ /<[/]/) print FS $N > fn[1] ++C".xml" }' AllStateGeneral2014.xml

Kaya Seabloom

ASKER

gheist, you're right, nawk isn't the best XML parser around. However, one of my files is just a plain text file. The other is XML, but is not very complex. It's the same elements repeated over and over, containing different text.

Mike, the code you provided is processed by the shell without throwing any errors, but in both cases, all the child files have numerical filenames, just as before - e.g. 0.xml, 1.xml, 2.xml, etc. I'm not getting output that's different than what I had before.
Show me your script.
It's pretty simple:
http://pastebin.com/tt8hgH3a

The source files (the .txt file and the .xml file) are copied/cached via a cron job every so often, and then I'm just using nawk to split them up. The original nawk one-liners are, of course, in my original post above. The paste contains the modified one-liners from your answer.

If I run either of these modified one-liners right on the command line, I get the same result: hundreds of child files with numerical filenames.
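One possible explanation, offered as a guess rather than a confirmed diagnosis: POSIX leaves FILENAME undefined inside a BEGIN action, so fn[1] is likely empty by the time the output names are built. In the second one-liner, fn[1] ++C can also be parsed as (fn[1]++) C, which would produce 0, 1, 2 and so on. A minimal sketch of the first one-liner that reads FILENAME on the first input record instead; the helper name split_with_prefix is hypothetical:

```shell
# split_with_prefix FILE: write line N of FILE to <base>N.txt, where <base> is
# the name of FILE up to its first dot (hypothetical helper, illustration only).
# Assumes FILE is given without a directory part that contains a dot.
split_with_prefix() {
  awk '
    FNR == 1 { split(FILENAME, fn, ".") }   # FILENAME is set once input is being read
    {
      out = fn[1] NR ".txt"                 # build the name first, then redirect to it
      print > out
      close(out)
    }
  ' "$1"
}
```

Called as split_with_prefix PI.txt, this names the children PI1.txt, PI2.txt and so on, which shows the prefix being picked up but is still not the per-line descriptive name asked for in the question.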
Duncan Roe
I've started to look at your first example. Simply by not producing a result file for a blank line, you go down to 492 files from 625:
#!/bin/sh
awk '
/^[[:space:]]*$/{next}
{
  print > "result"(NR%1?i:i++)".txt"
}
' i=1 PI.txt

This script produces files whose names contain the first word of each line:
#!/bin/sh
awk '
BEGIN {skip_next=0;i=1}

# Skip blank lines and lines starting with braces
/^([[:space:]]*$|[{}])/{next}

# skip heading lines (starting with ^L) and following non-blank line
/^\f/{skip_next=1; next}

# The main show
{
  if (skip_next)
  {
    skip_next=0
    next
  }
  print > ("result" i++ "-" $1 ".txt")
}
' PI.txt

This is how the directory looks
result1-Initiative.txt     result10-(Precincts.txt    result100-Legislative.txt
result101-(Precincts.txt   result102-Maralyn.txt      result103-Robert.txt
result104-Write-in.txt     result105-Legislative.txt  result106-(Precincts.txt
result107-Cindy.txt        result108-Write-in.txt     result109-Legislative.txt
result11-Yes.txt           result110-(Precincts.txt   result111-Ruth.txt
result112-Alvin.txt        result113-Write-in.txt     result114-Legislative.txt
result115-(Precincts.txt   result116-Karen.txt        result117-Martin.txt
result118-Write-in.txt     result119-Legislative.txt  result12-No.txt
result120-(Precincts.txt   result121-Tina.txt         result122-Michael.txt
result123-Write-in.txt     result124-Legislative.txt  result125-(Precincts.txt
result126-Mia.txt          result127-Jeanette.txt     result128-Write-in.txt
result129-Legislative.txt  result13-Advisory.txt      result130-(Precincts.txt
result131-Sharon.txt       result132-Write-in.txt     result133-Legislative.txt
result134-(Precincts.txt   result135-Eileen.txt       result136-Write-in.txt

Files with names like result134-(Precincts.txt are awkward to deal with, however: you need to escape the opening parenthesis, which is special to the shell. Stay tuned.
This one removes parentheses. You could insert any other characters that you don't want between the square brackets in the gensub call:
#!/bin/sh
awk '
BEGIN {skip_next=0;i=1}

# Skip blank lines and lines starting with braces
/^([[:space:]]*$|[{}])/{next}

# skip heading lines (starting with ^L) and following non-blank line
/^\f/{skip_next=1; next}

# The main show
{
  if (skip_next)
  {
    skip_next=0
    next
  }
  fnam=gensub("[()]","","g",$1)
  print > ("result" i++ "-" fnam ".txt")
}
' PI.txt

The directory listing now looks like
result1-Initiative.txt     result10-Precincts.txt     result100-Legislative.txt
result101-Precincts.txt    result102-Maralyn.txt      result103-Robert.txt
result104-Write-in.txt     result105-Legislative.txt  result106-Precincts.txt
result107-Cindy.txt        result108-Write-in.txt     result109-Legislative.txt
result11-Yes.txt           result110-Precincts.txt    result111-Ruth.txt
result112-Alvin.txt        result113-Write-in.txt     result114-Legislative.txt
result115-Precincts.txt    result116-Karen.txt        result117-Martin.txt
result118-Write-in.txt     result119-Legislative.txt  result12-No.txt
result120-Precincts.txt    result121-Tina.txt         result122-Michael.txt
result123-Write-in.txt     result124-Legislative.txt  result125-Precincts.txt
result126-Mia.txt          result127-Jeanette.txt     result128-Write-in.txt
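Note that gensub is a gawk extension, so the script above may not run under nawk or mawk. A rough sketch of the same parenthesis clean-up using gsub, which is part of POSIX awk; the helper name first_word_files is hypothetical, and the heading-skipping logic is left out to keep the example short:

```shell
# first_word_files FILE: write each non-blank line to resultN-WORD.txt,
# where WORD is the first field of the line with parentheses stripped.
# (Hypothetical helper; heading and brace handling from the post is omitted.)
first_word_files() {
  awk '
    NF == 0 { next }                 # skip blank lines
    {
      fnam = $1
      gsub(/[()]/, "", fnam)         # POSIX stand-in for gensub("[()]","","g",$1)
      out = sprintf("result%d-%s.txt", ++i, fnam)
      print > out
      close(out)
    }
  ' "$1"
}
```

As with the gensub version, any other unwanted characters can be added between the square brackets.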

It annoys me that the directory listing isn't in numerical order. Have you found some way to get around that? My usual remedy is to have enough leading zeroes so the string sort is also numerical. Will give it one more try
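The leading-zero remedy can be sketched with sprintf. The width of four digits is an assumption that fits a few hundred files, and padded_files is a hypothetical helper name:

```shell
# padded_files FILE: write each non-blank line to resultNNNN-WORD.txt, with the
# counter zero-padded to four digits so a plain string sort is also numerical.
padded_files() {
  awk '
    NF == 0 { next }                                # skip blank lines
    {
      out = sprintf("result%04d-%s.txt", ++i, $1)   # e.g. result0007-Cindy.txt
      print > out
      close(out)
    }
  ' "$1"
}
```

With GNU ls, ls -v is another way to list the unpadded names in numerical order.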
SOLUTION
Duncan Roe
What do you want to do for part 2? It seems all the result files start with <Result><RaceName>, and most of them are pretty similar even after that, e.g.:
<Result><RaceName>State Measures - Initiative Measure No. 1351 Concerns
<Result><RaceName>Advisory Votes - Advisory Vote No. 9 (Engrossed Subst
<Result><RaceName>Legislative District 17 - State Representative Pos. 2
<Result><RaceName>Legislative District 18 - State Representative Pos. 1
<Result><RaceName>Legislative District 18 - State Representative Pos. 1
<Result><RaceName>Legislative District 18 - State Representative Pos. 2
<Result><RaceName>Legislative District 18 - State Representative Pos. 2
<Result><RaceName>Legislative District 19 - State Representative Pos. 1
<Result><RaceName>Legislative District 19 - State Representative Pos. 1
<Result><RaceName>Legislative District 19 - State Representative Pos. 2
<Result><RaceName>Legislative District 19 - State Representative Pos. 2
<Result><RaceName>Legislative District 20 - State Representative Pos. 1
<Result><RaceName>Congressional District 1 - U.S. Representative</RaceN
<Result><RaceName>Legislative District 20 - State Representative Pos. 1
<Result><RaceName>Legislative District 20 - State Representative Pos. 2
<Result><RaceName>Legislative District 20 - State Representative Pos. 2
<Result><RaceName>Legislative District 21 - State Senator</RaceName><Ca
<Result><RaceName>Legislative District 21 - State Senator</RaceName><Ca
<Result><RaceName>Legislative District 21 - State Representative Pos. 1
<Result><RaceName>Legislative District 21 - State Representative Pos. 1
<Result><RaceName>Legislative District 21 - State Representative Pos. 2
<Result><RaceName>Legislative District 21 - State Representative Pos. 2
<Result><RaceName>Legislative District 22 - State Representative Pos. 1
<Result><RaceName>Congressional District 1 - U.S. Representative</RaceN
<Result><RaceName>Legislative District 22 - State Representative Pos. 1
<Result><RaceName>Legislative District 22 - State Representative Pos. 2
<Result><RaceName>Legislative District 22 - State Representative Pos. 2

What do you want to do?
Hi Duncan... as I said in my original question:
In the case of the second XML file, it would be ideal if the text that appears between <Candidate> and </Candidate> could be part of the filename.

This makes sense because what goes in the Candidate element is always unique.
Apart from Yes, No and maybe a few others. I guess you've accepted my answer because you're happy to do the other one yourself? It's kind-of similar: you might like to replace the spaces in candidates' names with underscores for easier handling. You would use index and substr to get the names.
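The index/substr suggestion might look something like the sketch below. It keeps the RS and FS settings from the original one-liner, which rely on the awk in use accepting a regular expression as RS. The helper names split_by_candidate and candidate, the "unknown" fallback, and the choice to turn every run of characters other than letters, digits, dot, underscore and hyphen into a single underscore are all assumptions made for illustration:

```shell
# split_by_candidate FILE: write each <Result> element to resultN-NAME.xml,
# where NAME is the text between <Candidate> and </Candidate>.
split_by_candidate() {
  awk -v RS="</?Results>" -v FS="<Result>" '
    # Return the text between <Candidate> and </Candidate>, or "unknown".
    function candidate(rec,    otag, ctag, s, e) {
      otag = "<Candidate>"; ctag = "</Candidate>"
      s = index(rec, otag)
      if (s == 0) return "unknown"
      rec = substr(rec, s + length(otag))    # drop everything up to the name
      e = index(rec, ctag)
      if (e == 0) return "unknown"
      return substr(rec, 1, e - 1)
    }
    {
      for (N = 1; N <= NF; N++) {
        if ($N !~ /<\//) continue            # skip fragments with no closing tag
        name = candidate($N)
        gsub(/[^A-Za-z0-9._-]+/, "_", name)  # spaces and shell-special characters become _
        out = sprintf("result%d-%s.xml", ++C, name)
        print FS $N > out
        close(out)
      }
    }
  ' "$1"
}
```

Called as split_by_candidate AllStateGeneral2014.xml, a Result whose Candidate element reads Jane Doe (a made-up name) would land in a file such as result1-Jane_Doe.xml. Keeping the counter in the name means repeated values such as Yes and No still get distinct files.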
Yeah, I was able to do the other one, and move ahead with my project. Thanks!