Extraction of metadata from XML in SHELL SCRIPT

Posted on 2013-06-14
Last Modified: 2013-06-17
Greetings,
I am trying to extract data from XML files.
They all look like this:

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl" type="text/xsl"?>
<document xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd">
<!--subchilds go here-->
</document>

I need to get the following out from this XML and into Variables:
- the version number: 1.0
- the encoding: UTF-8
- the stylesheet href: http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl
- the document xmlns: urn:hl7-org:v3
- the document xmlns:xsi:  http://www.w3.org/2001/XMLSchema-instance
- the document xsi:schemaLocation: urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

From what I understand, my source XML must be checked against a business rules engine.  Unfortunately the rules engine is not part of the script and must be called separately.

To answer this, I am writing the six values to a text file.
I have all of the steps done, except the actual extraction from the xml.

How do I proceed?  I am using xmllint, but the Solaris version does not have the --xpath option. I can, however, do cat calls in xmllint.

I am open to other options.
Thanks.
Question by:Evan Cutler
14 Comments
 
LVL 12

Expert Comment

by:tel2
ID: 39249347
Hi arcee123,

Q1. Are you open to a Perl solution?
Q2. Please provide expected output in the format you want it, for the sample input you've provided.

Thanks.
tel2
0
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39249393
Hi Tel,
to be honest, the output is irrelevant... however, I am restricted to the base install on Solaris 10.  I do know it has "a version" of Perl, but I don't know which...

What are you thinking?
0
 
LVL 12

Expert Comment

by:tel2
ID: 39249403
Hi arceee123,

> to be honest, the output is irrelevant
So what would you like the script to do then, if the output is irrelevant?

I was thinking of using Perl to generate any output you may require.  Having sample output up front, in the required format, often makes life much easier for everyone, as the programmer can make sure they have it right first time and don't get surprises later, and waste a lot of time sorting them.  That may not be what happens in this case, but I'd want to guard against it by having sample output before I took this on.  And I'd also want to know whether you want all the output in a single file, or what.

I'm not sure whether I'll be the one to provide a solution yet.

Thanks.
tel2
0
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39249408
OK... I see where you're going with this.  The issue is that I was going to call an application using the values as parameters in the shell script. The application will use the values in the parameters to do its thing.
To that end, I had given no thought to output.  I hope this opens up some doors for you.

 Again thank you so much for this.
0
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39249409
If you have any ideas, I am not above using the perl to output the values somewhere and I load them into the parameters using a file read in the shell script.
0
 
LVL 12

Expert Comment

by:tel2
ID: 39249418
Hi arcee,

> I am not above using the perl to output the values somewhere and I load them into the parameters using a file read in the shell script.

OK, so if the file was space separated, like this (this is a 3 record example):

1.0 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd
2.1 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd
1.1 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

That should work with something like:
    perl ... >filename
    cat filename | while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
    do
        ...
    done

Or even just:

    perl ... | while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
    ...as above

Q3. OK?

I note that the schema location value has a space in it, but SCHEMA will pick all the rest of the space-delimited fields up, since it's the last one in the read command.
Q4. Should that be OK?

Q5. Is it possible that any of the other values could contain spaces or be blank?  If so, we should change the delimiter and do it some other way.
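To illustrate the delimiter concern in Q5, here is a minimal sketch, assuming '|' never appears in any of the values (that is an assumption, not something confirmed in this thread). The record below is made up: its second field is deliberately blank and its third contains a space. On the Perl side, the print would need to join the captures with '|' instead of spaces.

```shell
# One made-up record, fields separated by '|' (assumed delimiter).
# The second field is blank and the third contains a space.
record='1.0||http://example.org/style sheet.xsl|urn:hl7-org:v3'

# A non-whitespace IFS character preserves empty fields and embedded spaces.
# IFS is set only for this one read command.
IFS='|' read -r VER ENC STYLE XMLNS <<EOF
$record
EOF

echo "VER=$VER"
echo "ENC=$ENC"
echo "STYLE=$STYLE"
echo "XMLNS=$XMLNS"
```

With a space (or tab) delimiter the blank field would silently disappear and every later value would shift one variable to the left, which is why a non-whitespace delimiter is the safer choice if blanks are possible.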
0
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39249420
I do not know the answer to the last question. ... but for the rest of it, you are awesome. That'll work
0
 
LVL 12

Expert Comment

by:tel2
ID: 39249446
Hi arcee,

Try the following bash script.  Note that it assumes the input files are anything ending in '.xml' in the current directory, but that can be easily changed.

#!/bin/bash

perl -0ne '/version="(.*?)".+encoding="(.*?)".+xml-stylesheet href="(.*?)".+xmlns="(.*?)".+xmlns:xsi="(.*?)".+xsi:schemaLocation="(.*?)"/s;print "$1 $2 $3 $4 $5 $6\n"' *.xml >xml_summary.out

cat xml_summary.out |\
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
        echo "VER=$VER"
        echo "ENC=$ENC"
        echo "STYLE=$STYLE"
        echo "XMLNS=$XMLNS"
        echo "XMLNSXSI=$XMLNSXSI"
        echo "SCHEMA=$SCHEMA"
        echo
done


Here's some sample output that I get from it:
VER=1.0
ENC=UTF-8
STYLE=http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl
XMLNS=urn:hl7-org:v3
XMLNSXSI=http://www.w3.org/2001/XMLSchema-instance
SCHEMA=urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

VER=2.1
ENC=UTF-8
STYLE=http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl
XMLNS=urn:hl7-org:v3
XMLNSXSI=http://www.w3.org/2001/XMLSchema-instance
SCHEMA=urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

Is that what you're after?
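Since the values are ultimately meant to be passed to an application as parameters, the echo lines could be swapped for the real call. The following is only a sketch under stated assumptions: run_app is a hypothetical stand-in for the actual application, and the summary file is a temporary file holding one sample record in the space-separated layout shown above.

```shell
# Hypothetical stand-in for the real application; replace with the actual command.
run_app() {
    printf 'app called with %d arguments, version=%s\n' "$#" "$1"
}

# Temporary summary file with one sample record (space-separated layout).
summary=$(mktemp)
echo '1.0 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd' >"$summary"

calls=0
last_schema=
# Redirecting the file into the loop keeps the loop in the current shell,
# so the counter is still set after "done".
while read -r VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
    # Quote every value so the schema location (which contains a space)
    # arrives as a single parameter. </dev/null stops the command from
    # consuming the remaining records on standard input.
    run_app "$VER" "$ENC" "$STYLE" "$XMLNS" "$XMLNSXSI" "$SCHEMA" </dev/null
    calls=$((calls + 1))
    last_schema=$SCHEMA
done <"$summary"
rm -f "$summary"

echo "Processed $calls record(s)"
```

The /dev/null redirection is a precaution: a command that reads standard input inside the loop would otherwise swallow the records that the loop has not read yet.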
0
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39249450
Yeah, something just like that.
Thank you so much. Let me try it and get back to you.
0
 
LVL 12

Expert Comment

by:tel2
ID: 39249669
OK.  As you've probably realised, the "echo"s are just for demo purposes and can be removed.

And as you probably also know, this is slightly more concise (no cat & pipe):
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
    ...
done <xml_summary.out


Or forget the temporary file, as previously mentioned, like this:
perl -0ne '/version="(.*?)".+encoding="(.*?)".+xml-stylesheet href="(.*?)".+xmlns="(.*?)".+xmlns:xsi="(.*?)".+xsi:schemaLocation="(.*?)"/s;print "$1 $2 $3 $4 $5 $6\n"' *.xml |\
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
    ...
done


But that makes it harder to troubleshoot if you have a problem, coz there's no file to examine.
0
 
LVL 12

Accepted Solution

by:
tel2 earned 2000 total points
ID: 39252006
Note that if there's any chance that any of your xml files could contain a nul character (ASCII 0), then change the:
    perl -0ne ...
to:
    perl -0777 -ne ...
I just prefer the former for brevity, since it is usually not a problem.

And to make the regex a bit more readable, you could use Perl's 'x' modifier like this:
perl -0ne '/
        version="(.*?)".+
        encoding="(.*?)".+
        xml-stylesheet\ href="(.*?)".+
        xmlns="(.*?)".+
        xmlns:xsi="(.*?)".+
        xsi:schemaLocation="(.*?)"
        /sx;print "$1 $2 $3 $4 $5 $6\n"' *.xml |\
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
        echo "VER=$VER"
        echo "ENC=$ENC"
        echo "STYLE=$STYLE"
        echo "XMLNS=$XMLNS"
        echo "XMLNSXSI=$XMLNSXSI"
        echo "SCHEMA=$SCHEMA"
        echo
done


Note how I had to escape the space after the word "stylesheet" since /x ignores whitespace by default.
0
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39252008
that is so awesome.
I sent the code up to see how it works.
Please bear with me,
first thing in the morning I will have an answer.
thank you sooo much.

Evan
0
 
LVL 9

Author Closing Comment

by:Evan Cutler
ID: 39254591
Genius.
Absolutely Genius...
first time it ran with no problems.

Thank you so much,
you have no idea how much hock you got me out of.

Thanks again.
0
 
LVL 12

Expert Comment

by:tel2
ID: 39254694
Glad to be of service, arcee.  Thanks for the points.

Now that I've achieved genius status, would this be a good time to break the (possibly) bad news?

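If any of your xml files doesn't contain all of those attributes in the sequence they appear in your example, then the regex will fail to match for that file. Because the one-liners above print unconditionally, such a file still produces an output line, but with empty or unreliable values, so the while loop will not get that file's real attributes.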

To check this, I suggest you:
- Count your xml files
- Count the number of files processed by the while loop
- Investigate any differences in the above
If you need help doing the above, let me know, but I don't have any genius ideas on how to solve the problems if there are any.

A pleasure doing business.  Call again.

TRS
0
