Solved

How do I change my nawk one-liners to produce child files with more descriptive filenames?

Posted on 2014-10-27
Last Modified: 2014-11-19
I have a pair of large XML files that I'm breaking apart with nawk, so that I can work more easily with the pieces that are actually relevant to my project. Both of these files consist of raw election results.

The code I have is doing what I want, but it's producing files that lack descriptive filenames, which makes it much more time consuming for me to identify which of the child files correspond to the data I want to work with.

This is the source of my first XML file. This is the code that's splitting this file apart:

nawk ' {print > "result"(NR%1?i:i++)".txt"; }' i=1 PI.txt

nawk is splitting up the parent file every time it finds a new line.
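
For reference, NR%1 is always 0, so the ternary in that one-liner always takes the i++ branch and every line lands in its own file. A minimal sketch of the same behaviour, assuming an invented sample file and a hypothetical demo_nr directory (neither is from the original post):

```sh
#!/bin/sh
# Invented sample data; demo_nr and sample.txt are illustrative names only.
mkdir -p demo_nr
printf 'John 10\nJane 20\nJake 30\n' > demo_nr/sample.txt

awk -v dir=demo_nr '
{
  # NR is the current line number, so each line gets its own file.
  out = dir "/result" NR ".txt"
  print > out
  close(out)   # avoids running out of open files on large inputs
}
' demo_nr/sample.txt
```

Closing each file after writing is optional for small inputs, but some awk implementations limit how many output files can be open at once.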

This is the source of my second XML file. This is the code that's splitting this file apart:

nawk -v RS="</?Results>" -v FS="<Result>" '{ for(N=1; N<=NF; N++) if($N ~ /<[/]/) print FS $N > "result"++C".xml" }' AllStateGeneral2014.xml

Here, nawk is splitting the parent file into children every time it finds a new Result.

Again, the first XML file is being split on a line-by-line basis; the second is being split apart wherever nawk finds a new "Result" element. In both cases, however, the resulting filenames look like this:

result1.xml result2.xml result3.xml

... and so on.

It would save a lot of time if the filenames were more descriptive, and looked like this:

result1-John.xml result2-Jane.xml result3-Jake.xml

In the case of the first file, it would be acceptable if only the first word of the line were incorporated into the filename.

In the case of the second XML file, it would be ideal if the text that appears between <Candidate> and </Candidate> could be part of the filename. How do I modify the one-liners above to get nawk to create more descriptive filenames?
Question by:Kaya Seabloom
14 Comments
 
LVL 61

Expert Comment

by:gheist
ID: 40408539
nawk is not the best XML parser in the world.
There are hundreds of others, such as those listed here:
http://www.maketecheasier.com/manipulate-html-and-xml-files-from-commnad-line/
 
LVL 29

Expert Comment

by:MikeOM_DBA
ID: 40408735
Here is an idea:
nawk 'BEGIN{split(FILENAME,fn,".")}{print > fn[1] (NR%1?i:i++)".txt"; }' i=1 PI.txt
nawk -v RS="</?Results>" -v FS="<Result>" 'BEGIN{split(FILENAME,fn,".")}{ for(N=1; N<=NF; N++) if($N ~ /<[/]/) print FS $N > fn[1] ++C".xml" }' AllStateGeneral2014.xml

 

Author Comment

by:Kaya Seabloom
ID: 40412310
gheist, you're right, nawk isn't the best XML parser around. However, one of my files is just a plain text file. The other is XML, but is not very complex. It's the same elements repeated over and over, containing different text.

Mike, the code you provided is processed by the shell without throwing any errors, but in both cases, all the child files have numerical filenames, just as before - e.g. 0.xml, 1.xml, 2.xml, etc. I'm not getting output that's different than what I had before.
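
A likely explanation, not confirmed anywhere in this thread: POSIX leaves FILENAME undefined inside a BEGIN action, and common awk implementations leave it empty there, which would leave fn[1] empty and the prefix missing. A minimal sketch of deriving the prefix once input has actually started, assuming an invented sample file and a hypothetical demo_fn directory:

```sh
#!/bin/sh
# Invented sample data; demo_fn and PI_demo.txt are illustrative names only.
mkdir -p demo_fn
printf 'John 10\nJane 20\n' > demo_fn/PI_demo.txt

awk -v dir=demo_fn '
# FILENAME is only reliable once the first record has been read,
# so derive the prefix on the first line of each input file.
FNR == 1 {
  n = split(FILENAME, parts, "/")   # drop any directory part
  split(parts[n], fn, ".")          # fn[1] is the name without extension
}
{
  out = sprintf("%s/%s-%d.txt", dir, fn[1], FNR)
  print > out
  close(out)
}
' demo_fn/PI_demo.txt
```

This only prefixes each child file with the parent file's name; it does not yet put the first word of the line into the filename.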
 
LVL 29

Expert Comment

by:MikeOM_DBA
ID: 40412436
show me your script.
 

Author Comment

by:Kaya Seabloom
ID: 40412560
It's pretty simple:
http://pastebin.com/tt8hgH3a

The source files (the .txt file and the .xml file) are copied/cached via a cron job every so often, and then I'm just using nawk to split them up. The original nawk one-liners are, of course, in my original post above. The paste contains the modified one-liners from your answer.

If I run either of these modified one-liners right on the command line, I get the same result: hundreds of child files with numerical filenames.
 
LVL 34

Expert Comment

by:Duncan Roe
ID: 40416893
I've started to look at your first example. Simply by not producing a result file for a blank line, you go down to 492 files from 625:
#!/bin/sh
awk '
/^[[:space:]]*$/{next}
{
  print > "result"(NR%1?i:i++)".txt"
}
' i=1 PI.txt

 
LVL 34

Expert Comment

by:Duncan Roe
ID: 40416909
This script produces files whose names contain the first word of each line:
#!/bin/sh
awk '
BEGIN {skip_next=0;i=1}

# Skip blank lines and lines starting with braces
/^([[:space:]]*$|[{}])/{next}

# skip heading lines (starting with ^L) and following non-blank line
/^\f/{skip_next=1; next}

# The main show
{
  if (skip_next)
  {
    skip_next=0
    next
  }
  print > "result" i++ "-" $1 ".txt"
}
' PI.txt

This is how the directory looks
result1-Initiative.txt     result10-(Precincts.txt    result100-Legislative.txt
result101-(Precincts.txt   result102-Maralyn.txt      result103-Robert.txt
result104-Write-in.txt     result105-Legislative.txt  result106-(Precincts.txt
result107-Cindy.txt        result108-Write-in.txt     result109-Legislative.txt
result11-Yes.txt           result110-(Precincts.txt   result111-Ruth.txt
result112-Alvin.txt        result113-Write-in.txt     result114-Legislative.txt
result115-(Precincts.txt   result116-Karen.txt        result117-Martin.txt
result118-Write-in.txt     result119-Legislative.txt  result12-No.txt
result120-(Precincts.txt   result121-Tina.txt         result122-Michael.txt
result123-Write-in.txt     result124-Legislative.txt  result125-(Precincts.txt
result126-Mia.txt          result127-Jeanette.txt     result128-Write-in.txt
result129-Legislative.txt  result13-Advisory.txt      result130-(Precincts.txt
result131-Sharon.txt       result132-Write-in.txt     result133-Legislative.txt
result134-(Precincts.txt   result135-Eileen.txt       result136-Write-in.txt

Files with names like result134-(Precincts.txt are awkward to deal with, however: the opening parenthesis is special to the shell, so you need to escape it. Stay tuned.
LVL 34

Expert Comment

by:Duncan Roe
ID: 40416927
This one removes parentheses. You could insert any other characters that you don't want between the square brackets in the gensub call
#!/bin/sh
awk '
BEGIN {skip_next=0;i=1}

# Skip blank lines and lines starting with braces
/^([[:space:]]*$|[{}])/{next}

# skip heading lines (starting with ^L) and following non-blank line
/^\f/{skip_next=1; next}

# The main show
{
  if (skip_next)
  {
    skip_next=0
    next
  }
  fnam=gensub("[()]","","g",$1)
  print > "result" i++ "-" fnam ".txt"
}
' PI.txt
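
Note that gensub is a gawk extension, so the script above needs awk to be gawk. A portable sketch of the same clean-up using POSIX gsub on a copy of $1 follows; the sample data and the demo_pi directory are invented for illustration, and the heading and brace handling from the script above is left out for brevity:

```sh
#!/bin/sh
# Invented sample data; demo_pi and PI_sample.txt are illustrative names only.
mkdir -p demo_pi
printf '(Precincts reporting) 10\n\nJohn Smith 42\n' > demo_pi/PI_sample.txt

awk -v dir=demo_pi '
/^[ \t]*$/ { next }              # skip blank lines
{
  fnam = $1                      # work on a copy so $0 stays untouched
  gsub(/[()]/, "", fnam)         # gsub edits its target in place
  out = sprintf("%s/result%d-%s.txt", dir, ++i, fnam)
  print > out
  close(out)
}
' demo_pi/PI_sample.txt
```

Any other unwanted characters can be added between the square brackets, just as with the gensub version.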

The directory listing now looks like
result1-Initiative.txt     result10-Precincts.txt     result100-Legislative.txt
result101-Precincts.txt    result102-Maralyn.txt      result103-Robert.txt
result104-Write-in.txt     result105-Legislative.txt  result106-Precincts.txt
result107-Cindy.txt        result108-Write-in.txt     result109-Legislative.txt
result11-Yes.txt           result110-Precincts.txt    result111-Ruth.txt
result112-Alvin.txt        result113-Write-in.txt     result114-Legislative.txt
result115-Precincts.txt    result116-Karen.txt        result117-Martin.txt
result118-Write-in.txt     result119-Legislative.txt  result12-No.txt
result120-Precincts.txt    result121-Tina.txt         result122-Michael.txt
result123-Write-in.txt     result124-Legislative.txt  result125-Precincts.txt
result126-Mia.txt          result127-Jeanette.txt     result128-Write-in.txt

It annoys me that the directory listing isn't in numerical order. Have you found some way to get around that? My usual remedy is to have enough leading zeroes so the string sort is also numerical. Will give it one more try
 
LVL 34

Assisted Solution

by:Duncan Roe
Duncan Roe earned 500 total points
ID: 40416941
And here it is!
#!/bin/sh
awk '
BEGIN {skip_next=0;i=1}

# Skip blank lines and lines starting with braces
/^([[:space:]]*$|[{}])/{next}

# skip heading lines (starting with ^L) and following non-blank line
/^\f/{skip_next=1; next}

# The main show
{
  if (skip_next)
  {
    skip_next=0
    next
  }
  fnam=sprintf("result%03d-%s.txt",i++,gensub("[()]","","g",$1))
  print > fnam
}
' PI.txt

The directory listing now looks like
result001-Initiative.txt   result002-Precincts.txt    result003-Yes.txt
result004-No.txt           result005-Initiative.txt   result006-Precincts.txt
result007-Yes.txt          result008-No.txt           result009-Initiative.txt
result010-Precincts.txt    result011-Yes.txt          result012-No.txt
result013-Advisory.txt     result014-Precincts.txt    result015-Repealed.txt
result016-Maintained.txt   result017-Advisory.txt     result018-Precincts.txt
result019-Repealed.txt     result020-Maintained.txt   result021-US.txt
result022-Precincts.txt    result023-Suzan.txt        result024-Pedro.txt
result025-Write-in.txt     result026-US.txt           result027-Precincts.txt
result028-Jim.txt          result029-Craig.txt        result030-Write-in.txt
result031-US.txt           result032-Precincts.txt    result033-Dave.txt

If you have more than 999 results, change %03d to %04d, and so on.
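
If the number of results isn't known in advance, the padding width can be worked out from a line count instead of being edited by hand. A rough sketch under stated assumptions: the sample file and the demo_pad directory are invented, and the count includes any lines the real script would skip, so it is only an upper bound:

```sh
#!/bin/sh
# Invented sample data; demo_pad and sample.txt are illustrative names only.
mkdir -p demo_pad
printf 'Alice 1\nBob 2\nCarol 3\n' > demo_pad/sample.txt

# First pass: count the records so the padding can be sized to fit.
total=$(awk 'END { print NR }' demo_pad/sample.txt)

awk -v dir=demo_pad -v total="$total" '
BEGIN {
  w = length(total "")             # digits needed for the largest number
  if (w < 3) w = 3                 # keep at least the %03d used above
  fmt = "%s/result%0" w "d-%s.txt" # e.g. "%s/result%03d-%s.txt"
}
{
  fnam = sprintf(fmt, dir, ++i, $1)
  print > fnam
  close(fnam)
}
' demo_pad/sample.txt
```

Building the format string by concatenation avoids relying on the `*` width specifier, which not every awk supports.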
 
LVL 34

Accepted Solution

by:
Duncan Roe earned 500 total points
ID: 40416943
To get back to having a 1-line awk command, put the awk script in a file, say do_PI.awk. Now your 1-liner is
 awk -f do_PI.awk PI.txt
do-PI.awk.txt
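
As a sketch of that workflow, the following writes a cut-down awk program to a file and then runs it with -f. The file name split.awk, the demo_f directory, and the sample data are all hypothetical, and the program is a simplified, gawk-free stand-in for the attached script rather than a copy of it:

```sh
#!/bin/sh
# Hypothetical names throughout: demo_f, split.awk, sample.txt.
mkdir -p demo_f

# Store the awk program in its own file (quoted heredoc, so $1 is not
# expanded by the shell).
cat > demo_f/split.awk <<'EOF'
# Skip blank lines, then write each line to a zero-padded, named file.
/^[ \t]*$/ { next }
{
  fnam = sprintf("%s/result%03d-%s.txt", dir, ++i, $1)
  print > fnam
  close(fnam)
}
EOF

printf 'Suzan 100\n\nPedro 90\n' > demo_f/sample.txt

# The one-liner: program from a file, output directory passed with -v.
awk -v dir=demo_f -f demo_f/split.awk demo_f/sample.txt
```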
 
LVL 34

Expert Comment

by:Duncan Roe
ID: 40416944
What do you want to do for part 2? It seems all the result files start with <Result><RaceName>, and most of them are pretty similar even after that, e.g.:
<Result><RaceName>State Measures - Initiative Measure No. 1351 Concerns
<Result><RaceName>Advisory Votes - Advisory Vote No. 9 (Engrossed Subst
<Result><RaceName>Legislative District 17 - State Representative Pos. 2
<Result><RaceName>Legislative District 18 - State Representative Pos. 1
<Result><RaceName>Legislative District 18 - State Representative Pos. 1
<Result><RaceName>Legislative District 18 - State Representative Pos. 2
<Result><RaceName>Legislative District 18 - State Representative Pos. 2
<Result><RaceName>Legislative District 19 - State Representative Pos. 1
<Result><RaceName>Legislative District 19 - State Representative Pos. 1
<Result><RaceName>Legislative District 19 - State Representative Pos. 2
<Result><RaceName>Legislative District 19 - State Representative Pos. 2
<Result><RaceName>Legislative District 20 - State Representative Pos. 1
<Result><RaceName>Congressional District 1 - U.S. Representative</RaceN
<Result><RaceName>Legislative District 20 - State Representative Pos. 1
<Result><RaceName>Legislative District 20 - State Representative Pos. 2
<Result><RaceName>Legislative District 20 - State Representative Pos. 2
<Result><RaceName>Legislative District 21 - State Senator</RaceName><Ca
<Result><RaceName>Legislative District 21 - State Senator</RaceName><Ca
<Result><RaceName>Legislative District 21 - State Representative Pos. 1
<Result><RaceName>Legislative District 21 - State Representative Pos. 1
<Result><RaceName>Legislative District 21 - State Representative Pos. 2
<Result><RaceName>Legislative District 21 - State Representative Pos. 2
<Result><RaceName>Legislative District 22 - State Representative Pos. 1
<Result><RaceName>Congressional District 1 - U.S. Representative</RaceN
<Result><RaceName>Legislative District 22 - State Representative Pos. 1
<Result><RaceName>Legislative District 22 - State Representative Pos. 2
<Result><RaceName>Legislative District 22 - State Representative Pos. 2

What do you want to do?
 

Author Comment

by:Kaya Seabloom
ID: 40417416
Hi Duncan... as I said in my original question:
In the case of the second XML file, it would be ideal if the text that appears between <Candidate> and </Candidate> could be part of the filename.

This makes sense because what goes in the Candidate element is always unique.
 
LVL 34

Expert Comment

by:Duncan Roe
ID: 40417749
Apart from Yes, No and maybe a few others. I guess you've accepted my answer because you're happy to do the other one yourself? It's kind-of similar: you might like to replace the spaces in candidates' names with underscores for easier handling. You would use index and substr to get the names.
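
The thread never shows the part 2 code, so the following is only a sketch of the index and substr approach under stated assumptions: the sample XML and the demo_xml directory are invented, the element layout is guessed from the excerpts above, and the helper name between is hypothetical. It slurps the file and cuts it up with index and substr rather than a regular-expression RS, which keeps it portable but will be slow on very large inputs. It also replaces every run of non-alphanumeric characters with an underscore, not just spaces.

```sh
#!/bin/sh
# Invented sample data; the <Results>/<Result>/<Candidate> layout is an
# assumption based on the excerpts quoted in this thread.
mkdir -p demo_xml
cat > demo_xml/sample.xml <<'EOF'
<Results><Result><RaceName>Race A</RaceName><Candidate>Jane Doe</Candidate><Votes>10</Votes></Result><Result><RaceName>Race A</RaceName><Candidate>John Q. Public</Candidate><Votes>7</Votes></Result></Results>
EOF

awk -v dir=demo_xml '
# Hypothetical helper: text between otag and ctag in s, or "" if missing.
function between(s, otag, ctag,    a, rest, b) {
  a = index(s, otag)
  if (a == 0) return ""
  rest = substr(s, a + length(otag))
  b = index(rest, ctag)
  if (b == 0) return ""
  return substr(rest, 1, b - 1)
}
{ buf = buf $0 "\n" }                  # slurp the whole file
END {
  while ((s = index(buf, "<Result>")) > 0) {
    buf = substr(buf, s)               # buf now starts at <Result>
    e = index(buf, "</Result>")
    if (e == 0) break                  # unterminated element: stop
    rec = substr(buf, 1, e - 1 + length("</Result>"))
    buf = substr(buf, e + length("</Result>"))
    name = between(rec, "<Candidate>", "</Candidate>")
    gsub(/[^A-Za-z0-9]+/, "_", name)   # spaces, dots, etc. become "_"
    fnam = sprintf("%s/result%03d-%s.xml", dir, ++n, name)
    print rec > fnam
    close(fnam)
  }
}
' demo_xml/sample.xml
```

Because the counter is part of the filename, repeated values such as Yes and No still end up in separate files.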
 

Author Comment

by:Kaya Seabloom
ID: 40453793
Yeah, I was able to do the other one, and move ahead with my project. Thanks!