awk - xml parsing

Posted on 2010-11-13
Last Modified: 2012-05-10
I want to print the name of the tag that follows <Transaction> in the attached file; that is, I want to print "AirAvailability_12". I did the following:

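awk -F'[<>]' '{                       # field separator assumed; it is not shown in the post
   if ( $2 == "Transaction") n=1;     # found the <Transaction> line
   if ( n == 2 ) {
      print $2;                       # tag name on the line after <Transaction>
      n=0;                            # assumed reset; the rest of the block is not shown
   }
   if ( n == 1) n++;                  # arm the print for the next line
}' s.txt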

This works for s.txt but not for s1.txt (both files attached).

Any thoughts? I would appreciate it if someone could point out what I am missing.
Question by: vignesh_prabhu

Author Comment

ID: 34127943
s.txt and s1.txt have the same contents. Only formatting differs.

Expert Comment

ID: 34127989
That's because in s1.txt there is no linefeed between <Transaction> and <AirAvailability_12>.

You'll have to do something like

awk '{FS="[<>]"; if($6 == "Transaction") print $8}' s1.txt

Author Comment

ID: 34128034
I am sorry but I do not see any output when I execute

awk '{FS="[<>]"; if($6 == "Transaction") print $8}' s1.txt

Accepted Solution

woolmilkporc earned 500 total points
ID: 34128338
I'm sorry, I overlooked that there are no linefeeds at all!

So please try

awk 'BEGIN {FS="[<>]"} {for (i=1;i<=NF;i++) if($i=="Transaction") print $(i+2)}'
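
The difference from the earlier attempt is where FS is set. In POSIX-conforming awks such as gawk and mawk, an FS assigned inside the action only takes effect from the next record, so the single line of s1.txt has already been split on whitespace by the time $6 is tested. Assigning FS in BEGIN applies it before the first record is read. A small sketch of the difference, using a made-up one-line sample (the real s1.txt is not reproduced in this thread, so the XML below is only an assumption about its shape):

```shell
# Hypothetical one-line sample; the real attachment is not shown in the thread.
xml='<Root><Transaction><AirAvailability_12><Flight>1</Flight></AirAvailability_12></Transaction></Root>'

# FS assigned inside the action: the only record was already split on whitespace.
in_action=$(printf '%s\n' "$xml" | awk '{FS="[<>]"; if ($6 == "Transaction") print $8}')

# FS assigned in BEGIN: the record is split on < and > before any field is read.
in_begin=$(printf '%s\n' "$xml" | awk 'BEGIN {FS="[<>]"} {for (i=1;i<=NF;i++) if ($i=="Transaction") print $(i+2)}')

echo "in action: [$in_action]"
echo "in BEGIN:  [$in_begin]"
```

The loop over all fields also removes the need to know in advance at which position Transaction appears. The $(i+2) skips the empty field that sits between the adjacent > and < characters.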

Expert Comment

ID: 34128727
To keep the awk script uniform, you should convert the single-line XML file s1.txt into the same layout as s.txt.
That is, create a Perl script that reformats the incoming XML file into a common format: you define which types of entries must be on a line by themselves, which entries keep their opening tag, value, and closing tag on one line, and so on.

Are there multiple processes that generate these XML files?

IMHO it is easier to make the input file have uniform layout versus trying to come up with a script that will match any variation.
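
The suggestion above is for a Perl script. As a rough sketch only, the same normalization step could also be done in awk, assuming the tags are directly adjacent (no whitespace between > and <). The sample XML is made up for illustration, since the attachments are not reproduced here:

```shell
# Hypothetical one-line sample; the real attachment is not shown in the thread.
xml='<Root><Transaction><AirAvailability_12><Flight>1</Flight></AirAvailability_12></Transaction></Root>'

# Break the single line at every "><" boundary so that each tag starts its own line.
normalized=$(printf '%s\n' "$xml" | awk '{ gsub(/></, ">\n<"); print }')

# A line-oriented script can then print the tag on the line that follows <Transaction>.
tag=$(printf '%s\n' "$normalized" | awk -F'[<>]' 'seen { print $2; seen = 0 } $2 == "Transaction" { seen = 1 }')

echo "$tag"
```

This keeps the extraction script simple, at the cost of an extra pass over the file. Input with whitespace or attributes between tags would need a more careful rule than the plain "><" match used here.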

Expert Comment

ID: 34129152
May I propose a different approach than awk?

xpath will deliver the first child element of Transaction (with all its children), so cut out the first element shown and you have the name of the element directly following Transaction. It does not matter how the XML is formatted, or how often Transaction occurs in the XML:

xpath file.xml "//Transaction/*[1]" 2>/dev/null | sed -e 's/^<\([^>]*\)>.*/\1/g'

"//Transaction/*[1]"  <- Xpath query to select the first element after all Transaction elements

2>/dev/null <- Xpath has some additional output on stderr, eliminate it

sed -e  's/^<\([^>]*\)>.*/\1/g' <- cut out the name of the first element to display it

xpath should be available or easily installable on any Unix system

Author Comment

ID: 34130842
arnold - Yes, the XML needs to be formatted for the original script to work. Unfortunately, the XML files always arrive in the s1.txt format. To be on the safe side, I have slightly modified woolmilkporc's code as shown below. This works for both s.txt and s1.txt.

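awk 'BEGIN { FS = "[<>]" }   # separator as in the accepted answer; assumed, not shown in the post
{
for (i=1;i<=NF;i++) {
 if ($i != "") {
  if ($i == "Transaction") { j = 1; continue }   # remember that the tag was seen
  if (j==1) {
  print $i                                       # first non-empty field after Transaction
  j = 0                                          # assumed reset; the closing lines are not shown
  }
 }
}
}' s1.txt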

Thanks woolmilkporc.

hatrix76 - Thanks for the suggestion. Unfortunately, xpath is not installed on our servers. I have asked the admin to install it. Until then I have to go with the awk variant.
