  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 451

Extraction of metadata from XML in SHELL SCRIPT

Greetings,
I am trying to extract data from XML files.
They all look like this:

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl" type="text/xsl"?>
<document xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd">
<!--subchilds go here-->
</document>

I need to get the following out from this XML and into Variables:
- the version number: 1.0
- the encoding: UTF-8
- the stylesheet href: http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl
- the document xmlns: urn:hl7-org:v3
- the document xmlns:xsi:  http://www.w3.org/2001/XMLSchema-instance
- the document xsi:schemaLocation: urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

From what I understand, my source XML must be checked against a business rules engine.  Unfortunately the rules engine is not part of the script and must be called separately.

To answer this, I am writing the six values to a text file.
I have all of the steps done, except the actual extraction from the xml.

How do I proceed?  I am using xmllint, but the Solaris version does not have the --xpath option, although I can do cat calls in xmllint.

I am open to other options.
Thanks.
Evan Cutler
Asked:
1 Solution
 
tel2Commented:
Hi arcee123,

Q1. Are you open to a Perl solution?
Q2. Please provide expected output in the format you want it, for the sample input you've provided.

Thanks.
tel2
 
Evan CutlerVolunteer Chief Information OfficerAuthor Commented:
Hi Tel,
to be honest, the output is irrelevant. ...however I am restricted to the base install on Solaris 10.  I do know it has "a version" of perl, but I don't know which. ...

What are you thinking?
 
tel2Commented:
Hi arcee123,

> to be honest, the output is irrelevant
So what would you like the script to do then, if the output is irrelevant?

I was thinking of using Perl to generate any output you may require.  Having sample output up front, in the required format, often makes life much easier for everyone, as the programmer can make sure they have it right the first time and don't get surprises later, wasting a lot of time sorting them out.  That may not be what happens in this case, but I'd want to guard against it by having sample output before I took this on.  And I'd also want to know whether you want all the output in a single file, or what.

I'm not sure whether I'll be the one to provide a solution yet.

Thanks.
tel2
 
Evan CutlerVolunteer Chief Information OfficerAuthor Commented:
OK, I see where you're going with this.  The issue is that I was going to call an application, using the values as parameters, in the shell script. The application will use the values in the parameters to do its thing.
To that end, I had given no thought to output.  I hope this opens up some doors for you.

 Again thank you so much for this.
 
Evan CutlerVolunteer Chief Information OfficerAuthor Commented:
If you have any ideas, I am not above using the perl to output the values somewhere and I load them into the parameters using a file read in the shell script.
 
tel2Commented:
Hi arcee,

> I am not above using the perl to output the values somewhere and I load them into the parameters using a file read in the shell script.

OK, so if the file was space separated, like this (this is a 3 record example):

1.0 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd
2.1 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd
1.1 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

That should work with something like:
    perl ... >filename
    cat filename | while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
    do
        ...
    done

Or even just:

    perl ... | while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
    ...as above

Q3. OK?

I note that the schema location value has a space in it, but SCHEMA will pick all the rest of the space-delimited fields up, since it's the last one in the read command.
Q4. Is that OK?
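
A minimal sketch of the behaviour described above, assuming bash: the last variable named in read receives whatever is left on the line, embedded spaces included. The split_record function name is hypothetical and only used for illustration.

```shell
#!/bin/bash
# Split one space-separated record the same way the while-read loop does.
# The last variable named in `read` receives everything left on the line.
split_record() {
    local VER ENC STYLE XMLNS XMLNSXSI SCHEMA
    read -r VER ENC STYLE XMLNS XMLNSXSI SCHEMA <<EOF
$1
EOF
    echo "SCHEMA=$SCHEMA"
}

# The schema location holds two space-separated tokens; both end up in SCHEMA.
split_record "1.0 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd"
```

This only holds as long as the value containing spaces is the last field on the line.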

Q5. Is it possible that any of the other values could contain spaces or be blank?  If so, we should change the delimiter and do it some other way.
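
One possible way to change the delimiter, sketched under the assumption that none of the values contains a "|" character: have Perl join the fields with "|" (for example with join("|", $1, $2, $3, $4, $5, $6)) and set IFS for the read. Because "|" is not whitespace, bash keeps blank fields instead of collapsing them. The parse_record function and the example.com URLs below are hypothetical.

```shell
#!/bin/bash
# Parse one "|"-delimited record. A non-whitespace IFS character preserves
# empty fields, and spaces inside a value are left alone.
parse_record() {
    local VER ENC STYLE XMLNS XMLNSXSI SCHEMA
    IFS='|' read -r VER ENC STYLE XMLNS XMLNSXSI SCHEMA <<EOF
$1
EOF
    echo "VER=[$VER] ENC=[$ENC] SCHEMA=[$SCHEMA]"
}

# Hypothetical record with a blank encoding field and a space in the schema.
parse_record '1.0||http://example.com/spl.xsl|urn:hl7-org:v3|http://www.w3.org/2001/XMLSchema-instance|urn:hl7-org:v3 http://example.com/spl.xsd'
```

If "|" could ever appear in a value, some other character that cannot occur in the data would be needed instead.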
 
Evan CutlerVolunteer Chief Information OfficerAuthor Commented:
I do not know the answer to the last question. ... but for the rest of it, you are awesome. That'll work
 
tel2Commented:
Hi arcee,

Try the following bash script.  Note that it assumes the input files are anything ending in '.xml' in the current directory, but that can be easily changed.

#!/bin/bash

perl -0ne 'print "$1 $2 $3 $4 $5 $6\n" if /version="(.*?)".+encoding="(.*?)".+xml-stylesheet href="(.*?)".+xmlns="(.*?)".+xmlns:xsi="(.*?)".+xsi:schemaLocation="(.*?)"/s' *.xml >xml_summary.out

cat xml_summary.out |\
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
        echo "VER=$VER"
        echo "ENC=$ENC"
        echo "STYLE=$STYLE"
        echo "XMLNS=$XMLNS"
        echo "XMLNSXSI=$XMLNSXSI"
        echo "SCHEMA=$SCHEMA"
        echo
done


Here's some sample output that I get from it:
VER=1.0
ENC=UTF-8
STYLE=http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl
XMLNS=urn:hl7-org:v3
XMLNSXSI=http://www.w3.org/2001/XMLSchema-instance
SCHEMA=urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

VER=2.1
ENC=UTF-8
STYLE=http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl
XMLNS=urn:hl7-org:v3
XMLNSXSI=http://www.w3.org/2001/XMLSchema-instance
SCHEMA=urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

Is that what you're after?
 
Evan CutlerVolunteer Chief Information OfficerAuthor Commented:
Yeah, something just like that.
Thank you so much. Let me try it and get back to you.
 
tel2Commented:
OK.  As you've probably realised, the "echo"s are just for demo purposes and can be removed.

And as you probably also know, this is slightly more concise (no cat & pipe):
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
    ...
done <xml_summary.out


Or forget the temporary file, as previously mentioned, like this:
perl -0ne 'print "$1 $2 $3 $4 $5 $6\n" if /version="(.*?)".+encoding="(.*?)".+xml-stylesheet href="(.*?)".+xmlns="(.*?)".+xmlns:xsi="(.*?)".+xsi:schemaLocation="(.*?)"/s' *.xml |\
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
    ...
done


But that makes it harder to troubleshoot if you have a problem, coz there's no file to examine.
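
A possible middle ground, sketched here as an assumption rather than something proposed in the thread: put tee in the pipeline so the records still flow straight into the loop while a copy is kept for troubleshooting. In this sketch printf stands in for the perl command, and the summarize function name is hypothetical.

```shell
#!/bin/bash
# Pipe records into the loop and keep a copy of them in a file via tee.
# $1 is the file to write; the remaining arguments are sample records.
summarize() {
    local out=$1
    shift
    printf '%s\n' "$@" | tee "$out" |
    while read -r VER ENC STYLE XMLNS XMLNSXSI SCHEMA
    do
        echo "VER=$VER"
    done
}

# Two hypothetical records; the copy lands in a temp-directory file.
summarize "${TMPDIR:-/tmp}/xml_summary.out" \
    "1.0 UTF-8 style xmlns xsi schema" \
    "2.1 UTF-8 style xmlns xsi schema"
```

In the real script, the printf line would be replaced by the perl one-liner.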
 
tel2Commented:
Note that if there's any chance that any of your xml files could contain a nul character (ASCII 0), then change the:
    perl -0ne ...
to:
    perl -0777 -ne ...
I just prefer the former for brevity, since it is usually not a problem.

And to make the regex a bit more readable, you could use Perl's 'x' modifier like this:
perl -0ne 'print "$1 $2 $3 $4 $5 $6\n" if /
        version="(.*?)".+
        encoding="(.*?)".+
        xml-stylesheet\ href="(.*?)".+
        xmlns="(.*?)".+
        xmlns:xsi="(.*?)".+
        xsi:schemaLocation="(.*?)"
        /sx' *.xml |\
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
        echo "VER=$VER"
        echo "ENC=$ENC"
        echo "STYLE=$STYLE"
        echo "XMLNS=$XMLNS"
        echo "XMLNSXSI=$XMLNSXSI"
        echo "SCHEMA=$SCHEMA"
        echo
done


Note how I had to escape the space after the word "stylesheet" since /x ignores whitespace by default.
 
Evan CutlerVolunteer Chief Information OfficerAuthor Commented:
that is so awesome.
I sent the code up to see how it works.
Please bear with me,
first thing in the morning I will have an answer.
thank you sooo much.

Evan
 
Evan CutlerVolunteer Chief Information OfficerAuthor Commented:
Genius.
Absolutely Genius...
first time it ran with no problems.

Thank you so much,
you have no idea how much hock you got me out of.

Thanks again.
 
tel2Commented:
Glad to be of service, arcee.  Thanks for the points.

Now that I've achieved genius status, would this be a good time to break the (possibly) bad news?

If any of your xml files doesn't contain all of those attributes in the sequence they appear in your example, then that file will fail to match and will not be processed by the while loop.

To check this, I suggest you:
- Count your xml files
- Count the number of files processed by the while loop
- Investigate any differences in the above
If you need help doing the above, let me know, but I don't have any genius ideas on how to solve the problems if there are any.
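
The check above could be sketched roughly as follows, assuming bash and the same regex as the extraction script. It tests each file on its own, names the ones that do not match, and prints the two counts. The check_xml_dir function name is hypothetical.

```shell
#!/bin/bash
# Report which .xml files in a directory fail the extraction regex,
# then print "<matched> of <total> files matched".
check_xml_dir() {
    local dir=$1 total=0 matched=0 f
    for f in "$dir"/*.xml
    do
        [ -f "$f" ] || continue      # skip the unexpanded glob if no files
        total=$((total + 1))
        # perl exits 0 only if the whole-file regex matched.
        if perl -0777 -ne '$ok = 1 if /version="(.*?)".+encoding="(.*?)".+xml-stylesheet href="(.*?)".+xmlns="(.*?)".+xmlns:xsi="(.*?)".+xsi:schemaLocation="(.*?)"/s;
                           END { $? = $ok ? 0 : 1 }' "$f"
        then
            matched=$((matched + 1))
        else
            echo "NO MATCH: $f"
        fi
    done
    echo "$matched of $total files matched"
}

check_xml_dir .
```

Any file listed as NO MATCH is a candidate for having its attributes missing or in a different order.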

A pleasure doing business.  Call again.

TRS