Solved

Extraction of metadata from XML in SHELL SCRIPT

Posted on 2013-06-14
Last Modified: 2013-06-17
Greetings,
I am trying to extract data from XML files.
They all look like this:

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl" type="text/xsl"?>
<document xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd">
<!--subchilds go here-->
</document>

I need to get the following out from this XML and into Variables:
- the version number: 1.0
- the encoding: UTF-8
- the stylesheet href: http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl
- the document xmlns: urn:hl7-org:v3
- the document xmlns:xsi:  http://www.w3.org/2001/XMLSchema-instance
- the document xsi:schemaLocation: urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

From what I understand, my source XML must be checked against a business rules engine.  Unfortunately the rules engine is not part of the script and must be called separately.

To answer this, I am writing the six values to a text file.
I have all of the steps done, except the actual extraction from the xml.

How do I proceed?  I am using xmllint, but the Solaris version does not have the --xpath option, although I can do cat calls in xmllint.

I am open to other options.
Thanks.
Question by:Evan Cutler
14 Comments
 
LVL 12

Expert Comment

by:tel2
ID: 39249347
Hi arcee123,

Q1. Are you open to a Perl solution?
Q2. Please provide expected output in the format you want it, for the sample input you've provided.

Thanks.
tel2
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39249393
Hi Tel,
to be honest, the output is irrelevant. ...however I am restricted to the base install on Solaris 10.  I do know it has "a version" of Perl, but I don't know which. ...

What are you thinking?
 
LVL 12

Expert Comment

by:tel2
ID: 39249403
Hi arcee123,

> to be honest, the output is irrelevant
So what would you like the script to do then, if the output is irrelevant?

I was thinking of using Perl to generate any output you may require.  Having sample output up front, in the required format, often makes life much easier for everyone, as the programmer can make sure they have it right first time and don't get surprises later, and waste a lot of time sorting them out.  That may not be what happens in this case, but I'd want to guard against it by having sample output before I took this on.  And I'd also want to know whether you want all the output in a single file, or what.

I'm not sure whether I'll be the one to provide a solution yet.

Thanks.
tel2
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39249408
OK... I see where you're going with this.  The issue is that I was going to call an application using the values as parameters in the shell script. The application will use the values in the parameters to do its thing.
To that end, I had given no thought to output.   I hope this opens up some doors for you.

 Again thank you so much for this.
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39249409
If you have any ideas.   I am not above using the perl to output the values somewhere and I load them into the parameters using a file read in the shell script.
 
LVL 12

Expert Comment

by:tel2
ID: 39249418
Hi arcee,

> I am not above using the perl to output the values somewhere and I load them into the parameters using a file read in the shell script.

OK, so if the file was space separated, like this (this is a 3 record example):

1.0 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd
2.1 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd
1.1 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

That should work with something like:
    perl ... >filename
    cat filename | while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
    do
        ...
    done

Or even just:

    perl ... | while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
    ...as above

Q3. OK?

I note that the schema location value has a space in it, but SCHEMA will pick all the rest of the space-delimited fields up, since it's the last one in the read command.
Q4.  Should that be OK?

Q5. Is it possible that any of the other values could contain spaces or be blank?  If so, we should change the delimiter and do it some other way.
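
A minimal sketch of the two points above, using made-up records rather than real output. The first function shows `read` handing everything left over to the last variable; the second assumes a `|` delimiter (an assumption: any character that cannot appear in the values would do) so that a blank field stays in place. The function names are illustrative only.

```bash
#!/bin/bash

# Space-delimited: the last variable (SCHEMA) receives the rest of the line,
# so the embedded space in the schema location survives.
split_on_spaces() {
    while read -r VER ENC STYLE XMLNS XMLNSXSI SCHEMA
    do
        echo "SCHEMA=$SCHEMA"
    done
}

# Pipe-delimited: '|' is not whitespace, so an empty field stays empty
# instead of being skipped and shifting the later values left.
split_on_pipes() {
    while IFS='|' read -r VER ENC STYLE XMLNS XMLNSXSI SCHEMA
    do
        echo "VER=$VER ENC=[$ENC] SCHEMA=$SCHEMA"
    done
}

# Hypothetical records: the first is space-separated, the second has a blank encoding.
echo '1.0 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd' | split_on_spaces
echo '1.0||style|ns|xsi|urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd' | split_on_pipes
```

To feed the second form, the Perl print statement would have to emit the same delimiter, for example "$1|$2|$3|$4|$5|$6\n".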
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39249420
I do not know the answer to the last question. ... but for the rest of it, you are awesome. That'll work
 
LVL 12

Expert Comment

by:tel2
ID: 39249446
Hi arcee,

Try the following bash script.  Note that it assumes the input files are anything ending in '.xml' in the current directory, but that can be easily changed.

#!/bin/bash

# Print one space-separated record per file, but only when the pattern matches.
perl -0ne 'print "$1 $2 $3 $4 $5 $6\n" if /version="(.*?)".+encoding="(.*?)".+xml-stylesheet href="(.*?)".+xmlns="(.*?)".+xmlns:xsi="(.*?)".+xsi:schemaLocation="(.*?)"/s' *.xml >xml_summary.out

cat xml_summary.out |\
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
        echo "VER=$VER"
        echo "ENC=$ENC"
        echo "STYLE=$STYLE"
        echo "XMLNS=$XMLNS"
        echo "XMLNSXSI=$XMLNSXSI"
        echo "SCHEMA=$SCHEMA"
        echo
done


Here's some sample output that I get from it:
VER=1.0
ENC=UTF-8
STYLE=http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl
XMLNS=urn:hl7-org:v3
XMLNSXSI=http://www.w3.org/2001/XMLSchema-instance
SCHEMA=urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

VER=2.1
ENC=UTF-8
STYLE=http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl
XMLNS=urn:hl7-org:v3
XMLNSXSI=http://www.w3.org/2001/XMLSchema-instance
SCHEMA=urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

Is that what you're after?
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39249450
Yeah, something just like that.
Thank you so much. Let me try it out and get back to you.
 
LVL 12

Expert Comment

by:tel2
ID: 39249669
OK.  As you've probably realised, the "echo"s are just for demo purposes and can be removed.

And as you probably also know, this is slightly more concise (no cat & pipe):
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
    ...
done <xml_summary.out


Or forget the temporary file, as previously mentioned, like this:
perl -0ne 'print "$1 $2 $3 $4 $5 $6\n" if /version="(.*?)".+encoding="(.*?)".+xml-stylesheet href="(.*?)".+xmlns="(.*?)".+xmlns:xsi="(.*?)".+xsi:schemaLocation="(.*?)"/s' *.xml |\
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
    ...
done


But that makes it harder to troubleshoot if you have a problem, coz there's no file to examine.
 
LVL 12

Accepted Solution

by:
tel2 earned 2000 total points
ID: 39252006
Note that if there's any chance that any of your xml files could contain a nul character (ASCII 0), then change the:
    perl -0ne ...
to:
    perl -0777 -ne ...
I just prefer the former for brevity, since it is usually not a problem.
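
The difference can be seen with a small sketch. It builds a hypothetical test file with a NUL byte in the middle and counts how many records Perl reads under each switch; the function names are made up for illustration.

```bash
#!/bin/bash

# $. holds the number of the last record read, so printing it in an END
# block gives the total number of records Perl saw in the file.
count_records_0()    { perl -0ne      'END { print "$.\n" }' "$1"; }
count_records_0777() { perl -0777 -ne 'END { print "$.\n" }' "$1"; }

# Hypothetical sample file containing one NUL byte.
sample=$(mktemp)
printf '<a>one</a>\000<b>two</b>\n' > "$sample"

count_records_0 "$sample"      # NUL is the record separator, so the file splits in two
count_records_0777 "$sample"   # slurp mode: the whole file is a single record
rm -f "$sample"
```

With -0 the extraction regex would be applied to each piece separately, which is why a stray NUL could stop it from matching.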

And to make the regex a bit more readable, you could use Perl's 'x' modifier like this:
perl -0ne 'print "$1 $2 $3 $4 $5 $6\n" if /
        version="(.*?)".+
        encoding="(.*?)".+
        xml-stylesheet\ href="(.*?)".+
        xmlns="(.*?)".+
        xmlns:xsi="(.*?)".+
        xsi:schemaLocation="(.*?)"
        /sx' *.xml |\
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
        echo "VER=$VER"
        echo "ENC=$ENC"
        echo "STYLE=$STYLE"
        echo "XMLNS=$XMLNS"
        echo "XMLNSXSI=$XMLNSXSI"
        echo "SCHEMA=$SCHEMA"
        echo
done


Note how I had to escape the space after the word "stylesheet" since /x ignores whitespace by default.
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39252008
that is so awesome.
I sent the code up to see how it works.
Please bear with me,
first thing in the morning I will have an answer.
thank you sooo much.

Evan
 
LVL 9

Author Closing Comment

by:Evan Cutler
ID: 39254591
Genius.
Absolutely Genius...
first time it ran with no problems.

Thank you so much,
you have no idea how much hock you got me out of.

Thanks again.
 
LVL 12

Expert Comment

by:tel2
ID: 39254694
Glad to be of service, arcee.  Thanks for the points.

Now that I've achieved genius status, would this be a good time to break the (possibly) bad news?

If any of your xml files doesn't contain all of those attributes in the sequence they appear in your example, then that file will fail to match and will not be processed by the while loop.

To check this, I suggest you:
- Count your xml files
- Count the number of files processed by the while loop
- Investigate any differences in the above
If you need help doing the above, let me know, but I don't have any genius ideas on how to solve the problems if there are any.
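
A rough sketch of that check, assuming the same pattern as the extraction one-liner: it prints the name of every file the pattern does not match, using Perl's `$ARGV` (the name of the file currently being read). `list_unmatched` is a hypothetical name.

```bash
#!/bin/bash

# Print the name of each input file that the extraction pattern does NOT match.
# -0777 slurps each file whole, so the regex sees the full document at once.
list_unmatched() {
    perl -0777 -ne 'print "$ARGV\n" unless /version="(.*?)".+encoding="(.*?)".+xml-stylesheet href="(.*?)".+xmlns="(.*?)".+xmlns:xsi="(.*?)".+xsi:schemaLocation="(.*?)"/s' "$@"
}

# Usage (assumes the XML files are in the current directory):
#   ls *.xml | wc -l          # number of input files
#   wc -l < xml_summary.out   # number of records written
#   list_unmatched *.xml      # files whose attributes could not be extracted
```

Any file it lists is either missing one of the six attributes or has them in a different order from the sample.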

A pleasure doing business.  Call again.

TRS