vignesh_prabhu

awk - xml parsing

I want to print the name of the tag that follows <Transaction> in the attached file, that is, "AirAvailability_12". I did the following:

awk '
BEGIN {
FS="[<>]";
n=0;
}
{
   if ( $2 == "Transaction") n=1;
   if ( n == 2 ) {
   print $2;
   exit;
   }
   if ( n == 1) n++;
}'

This works for s.txt but not for s1.txt (both files attached).

Any thoughts? I would appreciate it if someone could point out what I am missing.
s.txt
s1.txt
Shell Scripting, Scripting Languages, Linux


8/22/2022 - Mon
vignesh_prabhu

ASKER
s.txt and s1.txt have the same contents. Only formatting differs.
woolmilkporc

That's because in s1.txt there is no linefeed between <Transaction> and <AirAvailability_12>.

You'll have to do something like

awk '{FS="[<>]"; if($6 == "Transaction") print $8}' s1.txt
vignesh_prabhu

ASKER
I am sorry but I do not see any output when I execute

awk '{FS="[<>]"; if($6 == "Transaction") print $8}' s1.txt
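
A likely reason for the missing output: in POSIX-conforming awk implementations, an assignment to FS only takes effect from the next record onward. s1.txt is a single line, so its only record has already been split on whitespace by the time the action sets FS. The sketch below contrasts that with setting the separator before any input is read, using -F. The one-line sample is an assumption standing in for s1.txt (the attachment is not shown here), so the field numbers $6 and $8 only hold for input laid out like this.

```shell
# Hypothetical one-line input standing in for s1.txt (the real attachment is not shown).
sample='<?xml version="1.0"?><Root><Transaction><AirAvailability_12><Item>1</Item></AirAvailability_12></Transaction></Root>'

# FS assigned inside the action: too late for the record that is already being processed.
inside=$(printf '%s\n' "$sample" | awk '{FS="[<>]"; if ($6 == "Transaction") print $8}')

# FS set with -F before any input is read: the record is split on < and >.
before=$(printf '%s\n' "$sample" | awk -F'[<>]' '$6 == "Transaction" {print $8}')

printf 'FS inside action: [%s]\n' "$inside"
printf 'FS via -F:        [%s]\n' "$before"
```

Some older awk implementations split fields lazily and may behave differently for the first variant; the -F form (or FS in a BEGIN block) does not depend on that.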
ASKER CERTIFIED SOLUTION
woolmilkporc

arnold

To keep the awk script uniform, you should convert the single-line XML file s1.txt into a layout matching s.txt.
That is, create a Perl script that reformats each incoming XML file into a common layout: you define which types of entries must be on a line by themselves, which entries keep the opening tag, value, and closing tag on one line, and so on.

Are there multiple processes that generate these XML files?

IMHO it is easier to give the input files a uniform layout than to come up with a script that matches every variation.
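
As a rough illustration of the normalize-first idea, here is a minimal sketch that uses awk instead of Perl: it inserts a line break between adjacent tags and then feeds the result to the line-based script from the question. The sample input is an assumption standing in for s1.txt, and breaking on every "><" is only one possible layout rule.

```shell
# Hypothetical one-line input standing in for s1.txt.
sample='<?xml version="1.0"?><Root><Transaction><AirAvailability_12><Item>1</Item></AirAvailability_12></Transaction></Root>'

# Step 1: put a line break between adjacent tags so each tag starts its own line.
normalized=$(printf '%s\n' "$sample" | awk '{ gsub(/></, ">\n<"); print }')

# Step 2: run the line-based script from the question on the normalized text.
result=$(printf '%s\n' "$normalized" | awk '
BEGIN { FS = "[<>]"; n = 0 }
{
  if ($2 == "Transaction") n = 1
  if (n == 2) { print $2; exit }
  if (n == 1) n++
}')

printf '%s\n' "$result"
```

This keeps the extraction script unchanged and moves all layout handling into one small preprocessing step.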
Hatrix76

May I propose a different approach than awk?

xpath will deliver the first child element of Transaction (with all its children). Cut out the first tag shown and you have the name of the element directly following Transaction, and it does not matter how the XML is formatted or how often Transaction appears in the XML:

xpath file.xml "//Transaction/*[1]" 2>/dev/null | sed -e 's/^<\([^>]*\)>.*/\1/g'


explanation:
"//Transaction/*[1]"  <- XPath query to select the first child element of every Transaction element

2>/dev/null <- xpath writes some additional output to stderr, eliminate it

sed -e  's/^<\([^>]*\)>.*/\1/g' <- cut out the first tag name to display it


xpath should be available or easily installable on any unix
best
Ray
vignesh_prabhu

ASKER
arnold - Yes, the XML needs to be formatted so the script works. Unfortunately the XML files are always in the s1.txt format. To be on the safe side, I have slightly modified woolmilkporc's code as below. This works for both s.txt and s1.txt.

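awk '
BEGIN {
  FS = "[<>]"          # split each record on < and >
}
{
  for (i = 1; i <= NF; i++) {
    if ($i != "") {    # skip the empty fields between adjacent tags
      if ($i == "Transaction") {
        j = 1          # remember that <Transaction> was seen
        continue
      }
      if (j == 1) {    # first non-empty field after <Transaction>
        print $i
        exit
      }
    }
  }
}'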

Thanks woolmilkporc.
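
One caveat with the loop above: if the input were indented, the whitespace between <Transaction> and the next tag would be a non-empty field and would be printed instead of the tag name. A possible variant, not taken from this thread, is to split records on ">" so that each tag ends its own record regardless of line breaks or indentation. The helper name first_tag_after_transaction and both sample inputs are assumptions made for illustration.

```shell
# Print the name of the first element that opens after <Transaction>.
# first_tag_after_transaction is a hypothetical helper name, not from the thread.
first_tag_after_transaction() {
  awk '
  BEGIN { RS = ">" }                          # every record ends where a tag ends
  {
    tag = $0
    if (sub(/^[^<]*</, "", tag) == 0) next    # no "<" in this record: plain text only
    if (tag ~ "^[/?!]") next                  # skip closing tags, declarations, comments
    sub("[ \t\r\n/].*", "", tag)              # drop attributes and a self-closing slash
    if (seen) { print tag; exit }
    if (tag == "Transaction") seen = 1
  }'
}

# Hypothetical samples: one-line layout (like s1.txt) and indented multi-line layout.
sample='<?xml version="1.0"?><Root><Transaction><AirAvailability_12><Item>1</Item></AirAvailability_12></Transaction></Root>'
one_line=$(printf '%s\n' "$sample" | first_tag_after_transaction)
multi_line=$(printf '<Root>\n  <Transaction>\n    <AirAvailability_12>\n      <Item>1</Item>\n    </AirAvailability_12>\n  </Transaction>\n</Root>\n' | first_tag_after_transaction)

printf '%s\n%s\n' "$one_line" "$multi_line"
```

This is still text matching rather than XML parsing: a ">" inside a comment, CDATA section, or attribute value would confuse it, so a real XML tool remains the safer choice once one is available.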

hatrix76 - Thanks for the suggestion. Unfortunately I do not have xpath installed on our servers. I have asked the admin to install it. Until then I have to go with the awk variant.