?
Solved

Extract data from XML file to plain text

Posted on 2005-03-23
15
Medium Priority
?
438 Views
Last Modified: 2013-12-26
Given the following file

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='SystemCheckSummary.xslt'?>
<SystemCheckSummary>
<Auto1><AutoCalRecord><Channel>1</Channel><Spot>1.8</Spot></AutoCalRecord><AutoCalRecord><Channel>1</Channel><Spot>5.0</Spot></AutoCalRecord>

<Auto2><AutoCalRecord><Channel>2</Channel><Spot>1.8</Spot></AutoCalRecord><AutoCalRecord><Channel>2</Channel><Spot>5.0</Spot></AutoCalRecord>

How can I extract the data so it looks like this using shell script?  The amount of data will vary.
I have never tried parsing an XML file, so I dont know where to start.

Type  Channel    Spot  
Auto1    1            1.8
Auto1    1            5.0
Auto2    2            1.8
Auto2    2            5.0

Thanx
0
Comment
Question by:RudePuppyDog
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 5
  • 4
  • 2
  • +3
15 Comments
 
LVL 48

Expert Comment

by:Tintin
ID: 13618909
The best way to parse XML is with an XML parser.

A shell script is definitely not suitable unless you have a very well defined and very specific set of XML data you are parsing.

A good place to start is http://expat.sourceforge.net/
0
 
LVL 84

Expert Comment

by:ozo
ID: 13619251
#if the file will always be in the exact format you've given, this could work
perl -ne 'BEGIN{print "Type\tChannel\tSpot\n"}if( /<(\w+)><AutoCalRecord>/ ){ $t=$1; print "$t\t$1\t$2\n" while m:<AutoCalRecord><Channel>(.*?)</Channel><Spot>(.*?)</Spot></AutoCalRecord>:g}' < XMLfile
0
 
LVL 16

Expert Comment

by:manav_mathur
ID: 13620701
If you are using perl, why not use XML::Parser??

Manav
0
Get MySQL database support online, now!

At Percona’s web store you can order your MySQL database support needs in minutes. No hassles, no fuss, just pick and click. Pay online with a credit card.

 
LVL 84

Expert Comment

by:ozo
ID: 13623431
For one thing, the file format given is not valid XML
0
 

Assisted Solution

by:garypen
garypen earned 100 total points
ID: 13628787
As previously noted, this isn't valid XML. However, given that it was valid XML, then the best way to convert XML data into any other format (text, html, alternative markup...) is by using an XSLT stylesheet.

I've not got time to write a full transformation for you, but you can use something like the following to get started.

-bash-3.00$ more SystemCheckSummary.xslt
<?xml version="1.0"?>
<xsl:stylesheet
        xmlns="http://www.w3.org/1999/xhtml"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        version="1.0"
>

<xsl:output method="text" indent="yes" encoding="iso-8859-15"/>
<xsl:strip-space elements="*"/>

<xsl:template match="Auto1">
Type1
</xsl:template>

<xsl:template match="Auto2">
Type2
</xsl:template>

</xsl:stylesheet>
-bash-3.00$

Use your favourite XSLT processor to process the text file and the associated stylesheet. Solaris 10 ships with the xsltproc (1) utility for this task.

-bash-3.00$ xsltproc in.xml

Type1

Type2
-bash-3.00$

You can see here that the two lines Type1 and Type2 are output. To get your desired result you need to write more templates and learn about stylesheet processing.

There was a clue in the XML you submitted. Since that XML already contained a reference to a stylesheet:

<?xml-stylesheet type='text/xsl' href='SystemCheckSummary.xslt'?>


This means, I guess, that someone has already done this work. Can't you just get the stylesheet from the person who supplied the XML?
0
 

Author Comment

by:RudePuppyDog
ID: 13631530
I do have the stylesheet, but this will be running on a UNIX system.  I need to extract the data from XML to put into a different format so that our application can use it.  I am currently working on using expat.

The solution by ozo (perl command) works if the data was in seperate lines (as formatted in my question).  The file actually has 3 lines (the 3rd line is very large).

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='SystemCheckSummary.xslt'?> <SystemCheckSummary><Auto1><AutoCalRecord><Channel>1</Channel><Spot>1.8</Spot></AutoCalRecord><AutoCalRecord><Channel>1</Channel><Spot>5.0</Spot></AutoCalRecord><Auto2><AutoCalRecord><Channel>2</Channel><Spot>1.8</Spot></AutoCalRecord><AutoCalRecord><Channel>2</Channel><Spot>5.0</Spot></AutoCalRecord></Auto1></SystemCheckSummary>

Thanx for your help.
0
 
LVL 84

Accepted Solution

by:
ozo earned 800 total points
ID: 13632909
perl -ne 'BEGIN{print "Type\tChannel\tSpot\n"}print $t=$1||$t,"\t$2\t$3\n" while m"(?:<(\w+)>)?<AutoCalRecord><Channel>(.*?)</Channel><Spot>(.*?)</Spot></AutoCalRecord>"g' < XMLfile
0
 

Expert Comment

by:garypen
ID: 13635857
I don't understand why the fact that it's running on a UNIX system is relevant. In the example I provided, I ran my stylesheet processor (xsltproc) on a UNIX system (Solaris).

So, if you already have the stylesheet and your are using a modern UNIX OS (e.g. Solaris 10), you could just use whichever stylesheet processing command is provided and thus have to do no work at all.



0
 
LVL 2

Assisted Solution

by:ALEx604
ALEx604 earned 100 total points
ID: 13638583
#!/bin/bash

file="NAME OF/PATH OF XML FILE"
grep "Auto[0-9]" $file > tmp.data
sed 's/</ /g' tmp.data > tmp.data.2
cat tmp.data.2 > tmp.data
sed 's/>/ /g' tmp.data > tmp.data.2
cat tmp.data.2 > tmp.data
sed 's/AutoCalRecord//g' tmp.data > tmp.data.2
cat tmp.data.2 > tmp.data
sed 's/Spot//g' tmp.data > tmp.data.2
cat tmp.data.2 > tmp.data

sed 's/Channel//g' tmp.data > tmp.data.2
cat tmp.data.2 > tmp.data

sed 's/\///g' tmp.data > tmp.data.2
cat tmp.data.2 > tmp.data

echo " Type  Channel  Spot  Channel  Spot" > output.txt
cat tmp.data >> output.txt
rm tmp.data
rm tmp.data.2

cat output.txt


0
 
LVL 2

Expert Comment

by:ALEx604
ID: 13638650
grep "Auto[0-9]" $file > tmp.data
should have been
grep "Auto[[0-9]]" $file > tmp.data

i always forget that. :p
0
 
LVL 48

Expert Comment

by:Tintin
ID: 13647739
ALEx604.

Did you realise you could get rid of all the UUOC's and temporary files by using pipelines?

If you really wanted a shell solution, you can do:

echo -e "Type\tChannel\tSpot"
grep "<Auto" file.xml |sed "s/^<\(Auto.\).*Channel>\([0-9]\).*Spot>\([0-9]\.[0-9]\).*/\1[TAB] \2[TAB]\3/"

Where [TAB] is Ctrl-V TAB
0
 
LVL 84

Expert Comment

by:ozo
ID: 13650093
grep "<Auto" file.xml |sed "s/^<\(Auto.\).*Channel>\([0-9]\).*Spot>\([0-9]\.[0-9]\).*/\1[TAB] \2[TAB]\3/"
does not produce the requested
Type  Channel    Spot  
Auto1    1            1.8
Auto1    1            5.0
Auto2    2            1.8
Auto2    2            5.0
nor does
 #!/bin/bash

file="NAME OF/PATH OF XML FILE"
grep "Auto[0-9]" $file > tmp.data
sed 's/</ /g' tmp.data > tmp.data.2
cat tmp.data.2 > tmp.data
sed 's/>/ /g' tmp.data > tmp.data.2
cat tmp.data.2 > tmp.data
sed 's/AutoCalRecord//g' tmp.data > tmp.data.2
cat tmp.data.2 > tmp.data
sed 's/Spot//g' tmp.data > tmp.data.2
cat tmp.data.2 > tmp.data

sed 's/Channel//g' tmp.data > tmp.data.2
cat tmp.data.2 > tmp.data

sed 's/\///g' tmp.data > tmp.data.2
cat tmp.data.2 > tmp.data

echo " Type  Channel  Spot  Channel  Spot" > output.txt
cat tmp.data >> output.txt
rm tmp.data
rm tmp.data.2

cat output.txt
0
 
LVL 48

Expert Comment

by:Tintin
ID: 13657404
ozo.

OK, the columns don't quite line up exactly the same, but the results are the same.
0
 
LVL 84

Expert Comment

by:ozo
ID: 13657766
with grep "<Auto" file.xml |sed "s/^<\(Auto.\).*Channel>\([0-9]\).*Spot>\([0-9]\.[0-9]\).*/\1[TAB] \2[TAB]\3/"
Auto1    1            1.8
and
Auto2    2            1.8
are missing
0
 
LVL 48

Expert Comment

by:Tintin
ID: 13657813
Ahh.  I see it now.  I wasn't looking carefully enough at the required output and the input.
0

Featured Post

Percona Live Europe 2017 | Sep 25 - 27, 2017

The Percona Live Open Source Database Conference Europe 2017 is the premier event for the diverse and active European open source database community, as well as businesses that develop and use open source database software.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Have you tried to learn about Unicode, UTF-8, and multibyte text encoding and all the articles are just too "academic" or too technical? This article aims to make the whole topic easy for just about anyone to understand.
In this post we will learn different types of Android Layout and some basics of an Android App.
This video will show you how to get GIT to work in Eclipse.   It will walk you through how to install the EGit plugin in eclipse and how to checkout an existing repository.
Michael from AdRem Software explains how to view the most utilized and worst performing nodes in your network, by accessing the Top Charts view in NetCrunch network monitor (https://www.adremsoft.com/). Top Charts is a view in which you can set seve…
Suggested Courses
Course of the Month13 days, 7 hours left to enroll

800 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question