Kaya Seabloom
asked on
How do I change my nawk one-liners to produce child files with more descriptive filenames?
I have a pair of large XML files that I'm breaking apart with nawk, so that I can work more easily with the pieces that are actually relevant to my project. Both of these files consist of raw election results.
The code I have is doing what I want, but it's producing files that lack descriptive filenames, which makes it much more time consuming for me to identify which of the child files correspond to the data I want to work with.
This is the source of my first XML file. This is the code that's splitting this file apart:
nawk ' {print > "result"(NR%1?i:i++)".txt"; }' i=1 PI.txt
nawk is splitting up the parent file every time it finds a new line.
This is the source of my second XML file. This is the code that's splitting this file apart:
nawk -v RS="</?Results>" -v FS="<Result>" '{ for(N=1; N<=NF; N++) if($N ~ /<[/]/) print FS $N > "result"++C".xml" }' AllStateGeneral2014.xml
Here, nawk is splitting the parent file into children every time it finds a new Result.
Again, the first XML file is being split on a line-by-line basis; the second is being split apart wherever nawk finds a new "Result" element. In both cases, however, the resulting filenames look like this:
result1.xml result2.xml result3.xml
... and so on.
It would save a lot of time if the filenames were more descriptive, and looked like this:
result1-John.xml result2-Jane.xml result3-Jake.xml
In the case of the first file, it would be acceptable if only the first word of the line were incorporated into the filename.
In the case of the second XML file, it would be ideal if the text that appears between <Candidate> and </Candidate> could be part of the filename. How do I modify the one-liners above to get nawk to create more descriptive filenames?
Here is an idea:
nawk 'BEGIN{split(FILENAME,fn,".")}{print > fn[1] (NR%1?i:i++)".txt"; }' i=1 PI.txt
nawk -v RS="</?Results>" -v FS="<Result>" 'BEGIN{split(FILENAME,fn,".")}{ for(N=1; N<=NF; N++) if($N ~ /<[/]/) print FS $N > fn[1] ++C".xml" }' AllStateGeneral2014.xml
ASKER
gheist, you're right, nawk isn't the best XML parser around. However, one of my files is just a plain text file. The other is XML, but is not very complex. It's the same elements repeated over and over, containing different text.
Mike, the code you provided is processed by the shell without throwing any errors, but in both cases, all the child files have numerical filenames, just as before - e.g. 0.xml, 1.xml, 2.xml, etc. I'm not getting output that's different than what I had before.
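A likely reason for the purely numeric names, offered as a guess rather than a confirmed diagnosis: awk has not opened any input file while the BEGIN block runs, so FILENAME is not set yet at that point (POSIX leaves its value undefined in BEGIN, and common implementations give an empty string). The split then yields an empty prefix. The sketch below compares a split done in BEGIN with one done on the first record; demo.txt and its contents are made up for illustration.

```shell
#!/bin/sh
# Work in a scratch directory; demo.txt is a hypothetical stand-in for PI.txt.
tmpdir=$(mktemp -d) && cd "$tmpdir" || exit 1
printf 'alpha\nbeta\n' > demo.txt

# Prefix computed in BEGIN: FILENAME has not been set yet at this point.
in_begin=$(awk 'BEGIN { split(FILENAME, fn, "."); print "[" fn[1] "]" }' demo.txt)

# Prefix computed on the first record of the file: FILENAME is now "demo.txt".
in_main=$(awk 'FNR == 1 { split(FILENAME, fn, "."); print "[" fn[1] "]" }' demo.txt)

echo "prefix seen in BEGIN:        $in_begin"
echo "prefix seen on first record: $in_main"
```

Even with the split moved out of BEGIN, the prefix is only the parent file's name, so it would not by itself put the first word or the candidate into the child filenames.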
Show me your script.
ASKER
It's pretty simple:
http://pastebin.com/tt8hgH3a
The source files (the .txt file and the .xml file) are copied/cached via a cron job every so often, and then I'm just using nawk to split them up. The original nawk one-liners are, of course, in my original post above. The paste contains the modified one-liners from your answer.
If I run either of these modified one-liners right on the command line, I get the same result: hundreds of child files with numerical filenames.
I've started to look at your first example. Simply by not producing a result file for blank lines, you go down from 625 files to 492:
#!/bin/sh
awk '
/^[[:space:]]*$/ { next }
{
    print > ("result" (NR%1 ? i : i++) ".txt")
}
' i=1 PI.txt
This script produces files whose names contain the first word of each line:
#!/bin/sh
awk '
BEGIN { skip_next = 0; i = 1 }
# Skip blank lines and lines starting with braces
/^([[:space:]]*$|[{}])/ { next }
# Skip heading lines (starting with ^L) and the following non-blank line
/^\f/ { skip_next = 1; next }
# The main show
{
    if (skip_next) {
        skip_next = 0
        next
    }
    print > ("result" i++ "-" $1 ".txt")
}
' PI.txt
This is how the directory looks
result1-Initiative.txt result10-(Precincts.txt result100-Legislative.txt
result101-(Precincts.txt result102-Maralyn.txt result103-Robert.txt
result104-Write-in.txt result105-Legislative.txt result106-(Precincts.txt
result107-Cindy.txt result108-Write-in.txt result109-Legislative.txt
result11-Yes.txt result110-(Precincts.txt result111-Ruth.txt
result112-Alvin.txt result113-Write-in.txt result114-Legislative.txt
result115-(Precincts.txt result116-Karen.txt result117-Martin.txt
result118-Write-in.txt result119-Legislative.txt result12-No.txt
result120-(Precincts.txt result121-Tina.txt result122-Michael.txt
result123-Write-in.txt result124-Legislative.txt result125-(Precincts.txt
result126-Mia.txt result127-Jeanette.txt result128-Write-in.txt
result129-Legislative.txt result13-Advisory.txt result130-(Precincts.txt
result131-Sharon.txt result132-Write-in.txt result133-Legislative.txt
result134-(Precincts.txt result135-Eileen.txt result136-Write-in.txt
Files with names like result134-(Precincts.txt are awkward to deal with, however. You need to escape the opening parenthesis, which is special to the shell. Stay tuned.
This one removes parentheses. You can add any other characters you don't want between the square brackets in the gensub call:
#!/bin/sh
awk '
BEGIN { skip_next = 0; i = 1 }
# Skip blank lines and lines starting with braces
/^([[:space:]]*$|[{}])/ { next }
# Skip heading lines (starting with ^L) and the following non-blank line
/^\f/ { skip_next = 1; next }
# The main show
{
    if (skip_next) {
        skip_next = 0
        next
    }
    # gensub is a gawk extension; it returns the modified copy of $1
    fnam = gensub("[()]", "", "g", $1)
    print > ("result" i++ "-" fnam ".txt")
}
' PI.txt
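Since gensub is specific to gawk and the question is about nawk, here is a minimal sketch of the same parenthesis stripping using only gsub, which POSIX awk provides. The sample.txt file and its two lines are invented for the demonstration, and the header-skipping rules from the script above are left out to keep it short.

```shell
#!/bin/sh
# Work in a scratch directory; sample.txt is a hypothetical stand-in for PI.txt.
tmpdir=$(mktemp -d) && cd "$tmpdir" || exit 1
printf '(Precincts Counted)\nRuth Jones\n' > sample.txt

awk '
{
    fnam = $1
    gsub(/[()]/, "", fnam)          # strip parentheses from the copy, not from $1
    i++
    out = "result" i "-" fnam ".txt"
    print > out
    close(out)                      # avoid running out of open files on long inputs
}
' sample.txt

ls
```

Because gsub edits its target in place, the first word is copied into fnam first so that the line written to the child file keeps its original text.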
The directory listing now looks like
result1-Initiative.txt result10-Precincts.txt result100-Legislative.txt
result101-Precincts.txt result102-Maralyn.txt result103-Robert.txt
result104-Write-in.txt result105-Legislative.txt result106-Precincts.txt
result107-Cindy.txt result108-Write-in.txt result109-Legislative.txt
result11-Yes.txt result110-Precincts.txt result111-Ruth.txt
result112-Alvin.txt result113-Write-in.txt result114-Legislative.txt
result115-Precincts.txt result116-Karen.txt result117-Martin.txt
result118-Write-in.txt result119-Legislative.txt result12-No.txt
result120-Precincts.txt result121-Tina.txt result122-Michael.txt
result123-Write-in.txt result124-Legislative.txt result125-Precincts.txt
result126-Mia.txt result127-Jeanette.txt result128-Write-in.txt
It annoys me that the directory listing isn't in numerical order. Have you found some way to get around that? My usual remedy is to use enough leading zeroes that the string sort is also numerical. I will give it one more try.
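The leading-zero remedy can be sketched with sprintf, which awk supports with the usual printf formats. The width of three digits and the sample.txt contents are assumptions chosen for the example; a run that produces more than 999 files would need a wider field.

```shell
#!/bin/sh
# Work in a scratch directory; sample.txt is a hypothetical stand-in for PI.txt.
tmpdir=$(mktemp -d) && cd "$tmpdir" || exit 1
printf 'Initiative Measure\nYes 10\nNo 20\n' > sample.txt

awk '
{
    i++
    out = sprintf("result%03d-%s.txt", i, $1)   # %03d pads the counter to three digits
    print > out
    close(out)
}
' sample.txt

ls
```

With padded counters a plain ls already lists the files in numerical order. On systems with GNU ls, ls -v is an alternative that sorts embedded numbers naturally without renaming anything.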
What do you want to do for part 2? It seems all the result files start with <Result><RaceName>, and most of them are pretty similar even after that, e.g.:
<Result><RaceName>State Measures - Initiative Measure No. 1351 Concerns
<Result><RaceName>Advisory Votes - Advisory Vote No. 9 (Engrossed Subst
<Result><RaceName>Legislative District 17 - State Representative Pos. 2
<Result><RaceName>Legislative District 18 - State Representative Pos. 1
<Result><RaceName>Legislative District 18 - State Representative Pos. 1
<Result><RaceName>Legislative District 18 - State Representative Pos. 2
<Result><RaceName>Legislative District 18 - State Representative Pos. 2
<Result><RaceName>Legislative District 19 - State Representative Pos. 1
<Result><RaceName>Legislative District 19 - State Representative Pos. 1
<Result><RaceName>Legislative District 19 - State Representative Pos. 2
<Result><RaceName>Legislative District 19 - State Representative Pos. 2
<Result><RaceName>Legislative District 20 - State Representative Pos. 1
<Result><RaceName>Congressional District 1 - U.S. Representative</RaceN
<Result><RaceName>Legislative District 20 - State Representative Pos. 1
<Result><RaceName>Legislative District 20 - State Representative Pos. 2
<Result><RaceName>Legislative District 20 - State Representative Pos. 2
<Result><RaceName>Legislative District 21 - State Senator</RaceName><Ca
<Result><RaceName>Legislative District 21 - State Senator</RaceName><Ca
<Result><RaceName>Legislative District 21 - State Representative Pos. 1
<Result><RaceName>Legislative District 21 - State Representative Pos. 1
<Result><RaceName>Legislative District 21 - State Representative Pos. 2
<Result><RaceName>Legislative District 21 - State Representative Pos. 2
<Result><RaceName>Legislative District 22 - State Representative Pos. 1
<Result><RaceName>Congressional District 1 - U.S. Representative</RaceN
<Result><RaceName>Legislative District 22 - State Representative Pos. 1
<Result><RaceName>Legislative District 22 - State Representative Pos. 2
<Result><RaceName>Legislative District 22 - State Representative Pos. 2
What do you want to do?
ASKER
Hi Duncan... as I said in my original question:
In the case of the second XML file, it would be ideal if the text that appears between <Candidate> and </Candidate> could be part of the filename.
This makes sense because what goes in the Candidate element is always unique.
Apart from Yes, No and maybe a few others. I guess you've accepted my answer because you're happy to do the other one yourself? It's kind-of similar: you might like to replace the spaces in candidates' names with underscores for easier handling. You would use index and substr to get the names.
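The index and substr suggestion could look roughly like the sketch below, which keeps the RS and FS settings from the original one-liner and so relies on the same regex record separator support. The candidate function, the sample.xml file, and the Votes element are hypothetical; only Result, RaceName and Candidate are known from the thread.

```shell
#!/bin/sh
# Work in a scratch directory; sample.xml is a made-up miniature of the real file.
tmpdir=$(mktemp -d) && cd "$tmpdir" || exit 1
printf '%s\n' '<Results><Result><RaceName>Race A</RaceName><Candidate>Jane Doe</Candidate><Votes>10</Votes></Result><Result><RaceName>Race A</RaceName><Candidate>John Roe</Candidate><Votes>7</Votes></Result></Results>' > sample.xml

awk -v RS="</?Results>" -v FS="<Result>" '
{
    for (N = 1; N <= NF; N++)
        if ($N ~ /<\//) {                 # field contains a closing tag
            name = candidate($N)
            gsub(/ /, "_", name)          # spaces to underscores for easier handling
            C++
            out = "result" C "-" name ".xml"
            print FS $N > out
            close(out)
        }
}
# Return the text between <Candidate> and </Candidate>, or "unknown" if absent.
function candidate(s,    a, b) {
    a = index(s, "<Candidate>")
    b = index(s, "</Candidate>")
    if (a == 0 || b <= a) return "unknown"
    a += length("<Candidate>")
    return substr(s, a, b - a)
}
' sample.xml

ls
```

Keeping the counter in the name means that repeated values such as Yes and No still produce distinct files.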
ASKER
Yeah, I was able to do the other one, and move ahead with my project. Thanks!
There are hundreds more, like those listed here:
http://www.maketecheasier.com/manipulate-html-and-xml-files-from-commnad-line/