parse xml with unix

I need to parse XML data, pull out specific tags, and put them into a file. The script will read through several files at a time, and I would like the output file to have the data on one line per file, separated by commas.

For example,
I need to pull out the values in <tag1> and <tag2> and put them in a file, separated by a comma.

<Begin>
      <ApplicationArea>
            <Sender>
                  <tag10>xxxxxxxxxxxx</tag10>
            </Sender>
            <CreationDateTime>2013-09-09T04:15:00</CreationDateTime>
            <UserArea>
                  <ApplicationAreaUserArea>
                        <tag1>142299051</tag1>
                        <tag2>142299051</tag2>
                  </ApplicationAreaUserArea>
            </UserArea>
      </ApplicationArea>
</Begin>

I am not sure if this is correct, but it does pull the data:

tag1="$(echo "cat /Begin/Sender/UserArea/ApplicationAreaUserArea/tag1/text()" | xmllint --nocdata --shell $file | sed '1d;$d')"
tag2="$(echo "cat /Begin/Sender/UserArea/ApplicationAreaUserArea/tag2/text()" | xmllint --nocdata --shell $file | sed '1d;$d')"
bjeAsked:

simon3270Commented:
That looks fine, after a tidy-up of the actual path to
    /Begin/ApplicationArea/UserArea/ApplicationAreaUserArea/tag1/text()
(ditto for tag2)

then just use
    echo $tag1,$tag2
to print the values on one line.

There are a number of ways to do this with other shell tools (grep or awk, for example), but they assume that each tag and its value sit together on a line of their own, as in your example. They would usually break with any other layout.
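
For reference, a minimal awk sketch of the line-oriented approach mentioned above. It splits each line on `<` and `>`, so it assumes every element sits on its own line and carries no attributes. The helper name `get_tag` and the sample data are made up for illustration.

```shell
# get_tag TAG FILE: print the text of each <TAG>value</TAG> found on one line.
# Hypothetical helper; assumes no attributes and one element per line.
get_tag() {
    awk -F'[<>]' -v tag="$1" '$2 == tag { print $3 }' "$2"
}

# Sample data (illustrative only)
sample=$(mktemp)
cat > "$sample" <<'EOF'
<Begin>
      <UserArea>
            <tag1>142299051</tag1>
            <tag2>142299052</tag2>
      </UserArea>
</Begin>
EOF

tag1=$(get_tag tag1 "$sample")
tag2=$(get_tag tag2 "$sample")
echo "$tag1,$tag2"    # prints 142299051,142299052
rm -f "$sample"
```

Because the comparison is an exact match on the tag name, `<tag10>` is not mistaken for `<tag1>`.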

bjeAuthor Commented:
Thanks.

Some of my tags carry extra information, for example:
<ApplicationArea xmlns="http://www.openapplications.org/oagis/9">

How do I use the command to pull the data in that case? When I ran it against a file with this information, no data was pulled.

Alternatively, could you provide an awk example? The information I need to pull is on separate lines. Can you pull the data by tag?

Thanks for the help.
simon3270Commented:
If the layout is generally as you've shown, you could have something like:

 tag1=$(grep '<tag1>' $file | sed 's/^.*<tag1>\([^<]*\).*$/\1/')

I don't have xmllint handy to try out your extra information, but I will give it a go tomorrow.
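
On the namespace question: a default `xmlns` puts the element and its children into a namespace, so an un-prefixed XPath step no longer matches them. One common workaround is to match on `local-name()` instead. The following is only a sketch, assuming an xmllint build that supports the `--xpath` option (older builds may not); the helper name `xml_tag` is made up, and it has not been run against the real files.

```shell
# xml_tag TAG FILE: print the string value of the first element whose local
# name is TAG, regardless of namespace. Hypothetical helper name.
xml_tag() {
    xmllint --xpath "string(//*[local-name()='$1'])" "$2"
}

# Sample data (illustrative only)
sample=$(mktemp)
cat > "$sample" <<'EOF'
<Begin>
  <ApplicationArea xmlns="http://www.openapplications.org/oagis/9">
    <UserArea>
      <tag1>142299051</tag1>
      <tag2>142299052</tag2>
    </UserArea>
  </ApplicationArea>
</Begin>
EOF

# Only run the demo if xmllint is installed
if command -v xmllint >/dev/null 2>&1; then
    tag1=$(xml_tag tag1 "$sample")
    tag2=$(xml_tag tag2 "$sample")
    echo "$tag1,$tag2"
fi
rm -f "$sample"
```

Unlike the grep/sed approach, this does not depend on how the file is laid out across lines.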

skullnobrainsCommented:
Would this lazy version do?

sed -ne 's,.*<tag[12]>\([0-9]*\)</tag[12]>,\1,p' $FILE | xargs -n 2 echo | tr \  ,


Or would this more evolved one do?

{ cat <<EOF
<Begin>
      <ApplicationArea>
            <Sender>
                  <tag10>xxxxxxxxxxxx</tag10>
            </Sender>
            <CreationDateTime>2013-09-09T04:15:00</CreationDateTime>
            <UserArea>
                  <ApplicationAreaUserArea>
                        <tag1>142299051</tag1>
                        <tag2>142299051</tag2>
                  </ApplicationAreaUserArea>
            </UserArea>
      </ApplicationArea>
</Begin><Begin>
      <ApplicationArea>
            <Sender>
                  <tag10>xxxxxxxxxxxx</tag10>
            </Sender>
            <CreationDateTime>2013-09-09T04:15:00</CreationDateTime>
            <UserArea>
                  <ApplicationAreaUserArea>
                        <tag1>142299051</tag1>
                        <tag2>142299051</tag2>
                  </ApplicationAreaUserArea>
            </UserArea>
      </ApplicationArea>
</Begin><Begin>
      <ApplicationArea>
            <Sender>
                  <tag10>xxxxxxxxxxxx</tag10>
            </Sender>
            <CreationDateTime>2013-09-09T04:15:00</CreationDateTime>
            <UserArea>
                  <ApplicationAreaUserArea>
                        <tag1>142299051</tag1>
                        <tag2>142299051</tag2>
                  </ApplicationAreaUserArea>
            </UserArea>
      </ApplicationArea>
</Begin><Begin>
      <ApplicationArea>
            <Sender>
                  <tag10>xxxxxxxxxxxx</tag10>
            </Sender>
            <CreationDateTime>2013-09-09T04:15:00</CreationDateTime>
            <UserArea>
                  <ApplicationAreaUserArea>
                        <tag1>142299057</tag1>
                        <tag2>142299059</tag2>
                  </ApplicationAreaUserArea>
            </UserArea>
      </ApplicationArea>
</Begin>
EOF
} \
| sed -ne '
	b begin
	
	:got1
	h
	d

	:got2
	H
	x
	s/\n/,/
	w /tmp/y
	d

	:begin
	s,.*<tag1>\([0-9]*\)</tag1>,\1,
	t got1
	s,.*<tag2>\([0-9]*\)</tag2>,\1,
	t got2
	d
'


It writes to /tmp/y; see the results below.

$ sh /tmp/x ; cat /tmp/y
142299051,142299051
142299051,142299051
142299051,142299051
142299057,142299059

bjeAuthor Commented:
In the command,

sed -ne 's,.*<tag[12]>\([0-9]*\)</tag[12]>,\1,p' $FILE | xargs -n 2 echo | tr \  ,


what does the [12] mean?

Thanks.
skullnobrainsCommented:
[ab] means either a or b

This parser relies on your XML files always having the exact structure you posted.

The one on top is much better, but it does not check that tag1 and tag2 have the same parent; tag1 has to precede tag2, and both have to be on single isolated lines. It is feasible to improve it further if required. In that case, post a sample source file as messy as it might get.
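
To make the `[12]` point concrete, here is a small sketch. A bracket expression matches exactly one character from the set, so `<tag[12]>` matches `<tag1>` and `<tag2>` but not `<tag3>` or `<tag10>`. The sample lines are invented for the demo.

```shell
# Feed four sample lines through the same kind of sed expression and keep
# only the values whose tag matches <tag[12]>.
matches=$(printf '%s\n' \
    '<tag1>111</tag1>' \
    '<tag2>222</tag2>' \
    '<tag3>333</tag3>' \
    '<tag10>444</tag10>' |
    sed -n 's,.*<tag[12]>\([0-9]*\)</tag[12]>,\1,p')

# Join the matching values with commas (paste -s joins all lines into one).
echo "$matches" | paste -sd, -    # prints 111,222
```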
bjeAuthor Commented:
Thanks for all the solutions,

This one works the best,
tag1=$(grep '<tag1>' $file | sed 's/^.*<tag1>\([^<]*\).*$/\1/')
tag2=$(grep '<tag2>' $file | sed 's/^.*<tag2>\([^<]*\).*$/\1/')
tag3=$(grep '<tag3>' $file | sed 's/^.*<tag3>\([^<]*\).*$/\1/')

It is pulling the information I need.

How do I put the information on one line in my output, separated by commas? I have tried different ways, but nothing works.

The output file should look like this (there are multiple files that the script will read through):

tag1, tag2, tag3
tag1, tag2, tag3
tag1, tag2, tag3


Thanks.
simon3270Commented:
# Write one comma-separated line per XML file in the current directory
for file in *.xml
do
    tag1=$(grep '<tag1>' "$file" | sed 's/^.*<tag1>\([^<]*\).*$/\1/')
    tag2=$(grep '<tag2>' "$file" | sed 's/^.*<tag2>\([^<]*\).*$/\1/')
    tag3=$(grep '<tag3>' "$file" | sed 's/^.*<tag3>\([^<]*\).*$/\1/')
    echo "${tag1},${tag2},${tag3}"
done > output_file
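
A variation on the loop above, offered as a sketch only: one awk call per file instead of three grep/sed pipelines. It makes the same assumption (each tag and its value on one line, no attributes) and keeps the first occurrence of each tag. The function name `tags_to_csv` and the sample files are made up for illustration.

```shell
# tags_to_csv FILE: print "tag1,tag2,tag3" for one file on a single line.
# Hypothetical helper; assumes <tagN>value</tagN> sits on one line, no attributes.
tags_to_csv() {
    awk -F'[<>]' '
        # Remember the first value seen for each of the three tags
        ($2 == "tag1" || $2 == "tag2" || $2 == "tag3") && !($2 in v) { v[$2] = $3 }
        END { print v["tag1"] "," v["tag2"] "," v["tag3"] }
    ' "$1"
}

# Two small sample files (illustrative only)
dir=$(mktemp -d)
printf '<a>\n  <tag1>1</tag1>\n  <tag2>2</tag2>\n  <tag3>3</tag3>\n</a>\n' > "$dir/one.xml"
printf '<a>\n  <tag1>4</tag1>\n  <tag2>5</tag2>\n  <tag3>6</tag3>\n</a>\n' > "$dir/two.xml"

for file in "$dir"/*.xml
do
    tags_to_csv "$file"
done > "$dir/output_file"

result=$(cat "$dir/output_file")
echo "$result"
rm -rf "$dir"
```

A tag that is missing from a file simply leaves an empty field on that file's line.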
