Solved

Extraction of metadata from XML in SHELL SCRIPT

Posted on 2013-06-14
Last Modified: 2013-06-17
Greetings,
I am trying to extract data from XML files.
They all look like this:

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl" type="text/xsl"?>
<document xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd">
<!--subchilds go here-->
</document>

I need to get the following out from this XML and into Variables:
- the version number: 1.0
- the encoding: UTF-8
- the stylesheet href: http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl
- the document xmlns: urn:hl7-org:v3
- the document xmlns:xsi:  http://www.w3.org/2001/XMLSchema-instance
- the document xsi:schemaLocation: urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

From what I understand, my source XML must be checked against a business rules engine.  Unfortunately the rules engine is not part of the script and must be called separately.

To answer this, I am writing the six values to a text file.
I have all of the steps done, except the actual extraction from the xml.

How do I proceed?  I am using xmllint, but the Solaris version does not have the --xpath option, although I can do cat calls in xmllint.

I am open to other options.
Thanks.
Question by:Evan Cutler
14 Comments
 
LVL 11

Expert Comment

by:tel2
Hi arcee123,

Q1. Are you open to a Perl solution?
Q2. Please provide expected output in the format you want it, for the sample input you've provided.

Thanks.
tel2
 
LVL 9

Author Comment

by:Evan Cutler
Hi Tel,
to be honest, the output is irrelevant. ...however, I am restricted to the base install on Solaris 10.  I do know it has "a version" of Perl, but I don't know which. ...

What are you thinking?
 
LVL 11

Expert Comment

by:tel2
Hi arcee123,

> to be honest, the output is irrelevant
So what would you like the script to do then, if the output is irrelevant?

I was thinking of using Perl to generate any output you may require.  Having sample output up front, in the required format, often makes life much easier for everyone, as the programmer can make sure they have it right first time and don't get surprises later, and waste a lot of time sorting them.  That may not be what happens in this case, but I'd want to guard against it by having sample output before I took this on.  And I'd also want to know whether you want all the output in a single file, or what.

I'm not sure whether I'll be the one to provide a solution yet.

Thanks.
tel2
 
LVL 9

Author Comment

by:Evan Cutler
OK... I see where you're going with this.  The issue is that I was going to call an application using the values as parameters in the shell script. The application will use the values in the parameters to do its thing.
To that end, I had given no thought to output.  I hope this opens up some doors for you.

 Again thank you so much for this.
 
LVL 9

Author Comment

by:Evan Cutler
If you have any ideas: I am not above using the perl to output the values somewhere and I load them into the parameters using a file read in the shell script.
 
LVL 11

Expert Comment

by:tel2
Hi arcee,

> I am not above using the perl to output the values somewhere and I load them into the parameters using a file read in the shell script.

OK, so if the file was space separated, like this (this is a 3 record example):

1.0 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd
2.1 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd
1.1 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

That should work with something like:
    perl ... >filename
    cat filename | while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
    do
        ...
    done

Or even just:

    perl ... | while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
    ...as above

Q3. OK?

I note that the schema location value has a space in it, but SCHEMA will pick all the rest of the space-delimited fields up, since it's the last one in the read command.
Q4. Is that OK?
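
As a small illustration of the point about SCHEMA, the sketch below feeds one record in the proposed space-separated layout to read. The record is copied from the sample above; the here-document is only a convenient way to try this without a file.

```shell
# One record in the space-separated layout proposed above (values from the sample XML).
line='1.0 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd'

# Seven space-separated words but only six variable names:
# the last name (SCHEMA) receives everything that is left over.
read VER ENC STYLE XMLNS XMLNSXSI SCHEMA <<EOF
$line
EOF

echo "XMLNSXSI=$XMLNSXSI"
echo "SCHEMA=$SCHEMA"
```

SCHEMA should come out as the two-word value "urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd", which is why the field containing a space has to be last in the list.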

Q5. Is it possible that any of the other values could contain spaces or be blank?  If so, we should change the delimiter and do it some other way.
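
If blanks or embedded spaces do turn up, one possible "other way" is a non-whitespace delimiter. The sketch below assumes "|" never occurs in any value (an assumption, not something confirmed in this thread) and uses a made-up record with an empty STYLE field to show that the empty field survives.

```shell
# Hypothetical record: "|" as the delimiter, an empty STYLE field,
# and a space inside the SCHEMA value.
record='1.0|UTF-8||urn:hl7-org:v3|http://www.w3.org/2001/XMLSchema-instance|urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd'

# IFS is changed only for this one read; -r keeps backslashes literal.
IFS='|' read -r VER ENC STYLE XMLNS XMLNSXSI SCHEMA <<EOF
$record
EOF

echo "STYLE=[$STYLE]"
echo "XMLNS=[$XMLNS]"
echo "SCHEMA=[$SCHEMA]"
```

A whitespace delimiter such as a tab would not work for blank values, because the shell collapses runs of whitespace separators. The Perl side would have to emit the same delimiter, for example with print join("|", $1, $2, $3, $4, $5, $6), "\n".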
 
LVL 9

Author Comment

by:Evan Cutler
I do not know the answer to the last question. ... but for the rest of it, you are awesome. That'll work
 
LVL 11

Expert Comment

by:tel2
Hi arcee,

Try the following bash script.  Note that it assumes the input files are anything ending in '.xml' in the current directory, but that can be easily changed.

#!/bin/bash

perl -0ne '/version="(.*?)".+encoding="(.*?)".+xml-stylesheet href="(.*?)".+xmlns="(.*?)".+xmlns:xsi="(.*?)".+xsi:schemaLocation="(.*?)"/s && print "$1 $2 $3 $4 $5 $6\n"' *.xml >xml_summary.out

cat xml_summary.out |\
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
        echo "VER=$VER"
        echo "ENC=$ENC"
        echo "STYLE=$STYLE"
        echo "XMLNS=$XMLNS"
        echo "XMLNSXSI=$XMLNSXSI"
        echo "SCHEMA=$SCHEMA"
        echo
done


Here's some sample output that I get from it:
VER=1.0
ENC=UTF-8
STYLE=http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl
XMLNS=urn:hl7-org:v3
XMLNSXSI=http://www.w3.org/2001/XMLSchema-instance
SCHEMA=urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

VER=2.1
ENC=UTF-8
STYLE=http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl
XMLNS=urn:hl7-org:v3
XMLNSXSI=http://www.w3.org/2001/XMLSchema-instance
SCHEMA=urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

Is that what you're after?
 
LVL 9

Author Comment

by:Evan Cutler
Yeah, something just like that.
Thank you so much. Let me try it and get back to you.
 
LVL 11

Expert Comment

by:tel2
OK.  As you've probably realised, the "echo"s are just for demo purposes and can be removed.

And as you probably also know, this is slightly more concise (no cat & pipe):
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
    ...
done <xml_summary.out


Or forget the temporary file, as previously mentioned, like this:
perl -0ne '/version="(.*?)".+encoding="(.*?)".+xml-stylesheet href="(.*?)".+xmlns="(.*?)".+xmlns:xsi="(.*?)".+xsi:schemaLocation="(.*?)"/s && print "$1 $2 $3 $4 $5 $6\n"' *.xml |\
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
    ...
done


But that makes it harder to troubleshoot if you have a problem, coz there's no file to examine.
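
One further difference between the piped and the redirected form may matter if the loop has to leave values behind for later in the script. In bash with default settings, a loop on the receiving end of a pipe runs in a subshell, so variables it sets are gone once the loop ends. The sketch below uses a made-up two-record file as a stand-in for xml_summary.out; the counter names are illustrative only.

```shell
# Hypothetical two-record summary file, standing in for xml_summary.out.
summary=$(mktemp)
printf '%s\n' '1.0 UTF-8 a b c d e' '2.1 UTF-8 a b c d e' > "$summary"

# Piped form: in bash (default settings) the loop body runs in a subshell,
# so the increments are not visible after the loop.
piped=0
cat "$summary" | while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
    piped=$((piped + 1))
done

# Redirected form: the loop runs in the current shell and the count survives.
redirected=0
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
    redirected=$((redirected + 1))
done < "$summary"

echo "piped=$piped redirected=$redirected"
rm -f "$summary"
```

Under bash this should report piped=0 and redirected=2. Other shells, ksh for instance, run the last stage of a pipeline in the current shell and may report 2 for both.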
 
LVL 11

Accepted Solution

by:
tel2 earned 500 total points
Note that if there's any chance that any of your xml files could contain a nul character (ASCII 0), then change the:
    perl -0ne ...
to:
    perl -0777 -ne ...
I just prefer the former for brevity, since it is usually not a problem.

And to make the regex a bit more readable, you could use Perl's 'x' modifier like this:
perl -0ne '/
        version="(.*?)".+
        encoding="(.*?)".+
        xml-stylesheet\ href="(.*?)".+
        xmlns="(.*?)".+
        xmlns:xsi="(.*?)".+
        xsi:schemaLocation="(.*?)"
        /sx && print "$1 $2 $3 $4 $5 $6\n"' *.xml |\
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
        echo "VER=$VER"
        echo "ENC=$ENC"
        echo "STYLE=$STYLE"
        echo "XMLNS=$XMLNS"
        echo "XMLNSXSI=$XMLNSXSI"
        echo "SCHEMA=$SCHEMA"
        echo
done


Note how I had to escape the space after the word "stylesheet" since /x ignores whitespace by default.
 
LVL 9

Author Comment

by:Evan Cutler
that is so awesome.
I sent the code up to see how it works.
Please bear with me,
first thing in the morning I will have an answer.
thank you sooo much.

Evan
 
LVL 9

Author Closing Comment

by:Evan Cutler
Genius.
Absolutely Genius...
first time it ran with no problems.

Thank you so much,
you have no idea how much hock you got me out of.

Thanks again.
 
LVL 11

Expert Comment

by:tel2
Glad to be of service, arcee.  Thanks for the points.

Now that I've achieved genius status, would this be a good time to break the (possibly) bad news?

If any of your xml files doesn't contain all of those attributes in the sequence they appear in your example, then that file will fail to match and will not be processed by the while loop.

To check this, I suggest you:
- Count your xml files
- Count the number of files processed by the while loop
- Investigate any differences in the above
If you need help doing the above, let me know, but I don't have any genius ideas on how to solve the problems if there are any.

A pleasure doing business.  Call again.

TRS