Solved

Extraction of metadata from XML in SHELL SCRIPT

Posted on 2013-06-14
Last Modified: 2013-06-17
Greetings,
I am trying to extract data from XML files.
They all look like this:

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl" type="text/xsl"?>
<document xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd">
<!--subchilds go here-->
</document>

I need to get the following out from this XML and into Variables:
- the version number: 1.0
- the encoding: UTF-8
- the stylesheet href: http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl
- the document xmlns: urn:hl7-org:v3
- the document xmlns:xsi:  http://www.w3.org/2001/XMLSchema-instance
- the document xsi:schemaLocation: urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

From what I understand, my source XML must be checked against a business rules engine.  Unfortunately the rules engine is not part of the script and must be called upon separately.

To answer this, I am writing the six values to a text file.
I have all of the steps done, except the actual extraction from the xml.

How do I proceed?  I am using XMLLINT, but the Solaris version does not have the --xpath option, though I can do cat calls in XMLLINT.

I am open to other options.
Thanks.
Comment
Question by:Evan Cutler
14 Comments
 
LVL 12

Expert Comment

by:tel2
ID: 39249347
Hi arcee123,

Q1. Are you open to a Perl solution?
Q2. Please provide expected output in the format you want it, for the sample input you've provided.

Thanks.
tel2
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39249393
Hi Tel,
to be honest, the output is irrelevant. ...however I am restricted to base install on Solaris 10.  I do know it has "a version" of perl, but I don't know which. ...

What are you thinking?
 
LVL 12

Expert Comment

by:tel2
ID: 39249403
Hi arceee123,

> to be honest, the output is irrelevant
So what would you like the script to do then, if the output is irrelevant?

I was thinking of using Perl to generate any output you may require.  Having sample output up front, in the required format, often makes life much easier for everyone, as the programmer can make sure they have it right first time and don't get surprises later, and waste a lot of time sorting them.  That may not be what happens in this case, but I'd want to guard against it by having sample output before I took this on.  And I'd also want to know whether you want all the output in a single file, or what.

I'm not sure whether I'll be the one to provide a solution yet.

Thanks.
tel2
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39249408
Ok....I see where you're going with this.  The issue is I was going to call an application using the values as parameters in the shell script. The application will use the values in the parameters to do its thing.
To that end I had no thought to output.   I hope this opens up some doors for you.

 Again thank you so much for this.
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39249409
If you have any ideas.   I am not above using the perl to output the values somewhere and I load them into the parameters using a file read in the shell script.
 
LVL 12

Expert Comment

by:tel2
ID: 39249418
Hi arcee,

> I am not above using the perl to output the values somewhere and I load them into the parameters using a file read in the shell script.

OK, so if the file was space separated, like this (this is a 3 record example):

1.0 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd
2.1 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd
1.1 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

That should work with something like:
    perl ... >filename
    cat filename | while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
    do
        ...
    done

Or even just:

    perl ... | while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
    ...as above

Q3. OK?

I note that the schema location value has a space in it, but SCHEMA will pick all the rest of the space-delimited fields up, since it's the last one in the read command.
Q4. Should be OK?

Q5. Is it possible that any of the other values could contain spaces or be blank?  If so, we should change the delimiter and do it some other way.
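
As a rough illustration of the delimiter change raised in Q5, here is a minimal sketch that reads a "|"-separated record instead of a space-separated one. It assumes no value contains "|" or a newline, and the sample record is made up from the values in the question with the encoding deliberately left blank. Because "|" is not a whitespace character, read keeps the empty field instead of collapsing it, and spaces inside a value are left alone.

```shell
#!/bin/bash
# Sketch only: assumes no value contains "|" or a newline.
# Hypothetical sample record: the second field (encoding) is blank on purpose.
line='1.0||http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl|urn:hl7-org:v3|http://www.w3.org/2001/XMLSchema-instance|urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd'

# IFS is changed only for this one read command; -r keeps backslashes literal.
IFS='|' read -r VER ENC STYLE XMLNS XMLNSXSI SCHEMA <<< "$line"

echo "VER=$VER"
echo "ENC=$ENC"
echo "STYLE=$STYLE"
echo "XMLNS=$XMLNS"
echo "XMLNSXSI=$XMLNSXSI"
echo "SCHEMA=$SCHEMA"
```

On the Perl side this would mean printing the captured values joined with "|" rather than with spaces.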
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39249420
I do not know the answer to the last question. ... but for the rest of it, you are awesome. That'll work
 
LVL 12

Expert Comment

by:tel2
ID: 39249446
Hi arcee,

Try the following bash script.  Note that it assumes the input files are anything ending in '.xml' in the current directory, but that can be easily changed.

#!/bin/bash

perl -0ne '/version="(.*?)".+encoding="(.*?)".+xml-stylesheet href="(.*?)".+xmlns="(.*?)".+xmlns:xsi="(.*?)".+xsi:schemaLocation="(.*?)"/s;print "$1 $2 $3 $4 $5 $6\n"' *.xml >xml_summary.out

cat xml_summary.out |\
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
        echo "VER=$VER"
        echo "ENC=$ENC"
        echo "STYLE=$STYLE"
        echo "XMLNS=$XMLNS"
        echo "XMLNSXSI=$XMLNSXSI"
        echo "SCHEMA=$SCHEMA"
        echo
done


Here's some sample output that I get from it:
VER=1.0
ENC=UTF-8
STYLE=http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl
XMLNS=urn:hl7-org:v3
XMLNSXSI=http://www.w3.org/2001/XMLSchema-instance
SCHEMA=urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

VER=2.1
ENC=UTF-8
STYLE=http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl
XMLNS=urn:hl7-org:v3
XMLNSXSI=http://www.w3.org/2001/XMLSchema-instance
SCHEMA=urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd

Is that what you're after?
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39249450
Yeah, something just like that.
Thank you so much. Let me try it and get back to you.
 
LVL 12

Expert Comment

by:tel2
ID: 39249669
OK.  As you've probably realised, the "echo"s are just for demo purposes and can be removed.

And as you probably also know, this is slightly more concise (no cat & pipe):
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
    ...
done <xml_summary.out


Or forget the temporary file, as previously mentioned, like this:
perl -0ne '/version="(.*?)".+encoding="(.*?)".+xml-stylesheet href="(.*?)".+xmlns="(.*?)".+xmlns:xsi="(.*?)".+xsi:schemaLocation="(.*?)"/s;print "$1 $2 $3 $4 $5 $6\n"' *.xml |\
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
    ...
done


But that makes it harder to troubleshoot if you have a problem, coz there's no file to examine.
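
One possible middle ground, sketched here under assumptions: put tee in the pipeline so the records still land in xml_summary.out for troubleshooting while the loop reads them directly. The extract function is a hypothetical stand-in for the perl command so that the sketch runs on its own; its one record is the sample from earlier in the thread.

```shell
#!/bin/bash
# Hypothetical stand-in for the perl extraction step.
extract() {
    printf '%s\n' '1.0 UTF-8 http://www.accessdata.fda.gov/spl/stylesheet/spl.xsl urn:hl7-org:v3 http://www.w3.org/2001/XMLSchema-instance urn:hl7-org:v3 http://www.accessdata.fda.gov/spl/schema/spl.xsd'
}

# tee writes a copy of every record to xml_summary.out and passes it on to the loop.
extract | tee xml_summary.out |
while read -r VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
    echo "VER=$VER"
    echo "SCHEMA=$SCHEMA"
done
```

In real use, the call to extract would be replaced by the perl one-liner.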
 
LVL 12

Accepted Solution

by:
tel2 earned 500 total points
ID: 39252006
Note that if there's any chance that any of your xml files could contain a nul character (ASCII 0), then change the:
    perl -0ne ...
to:
    perl -0777 -ne ...
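I just prefer the former for brevity, since it is usually not a problem: -0 splits the input into records at nul characters, which XML files normally don't contain, whereas -0777 always reads each file whole.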

And to make the regex a bit more readable, you could use Perl's 'x' modifier like this:
perl -0ne '/
        version="(.*?)".+
        encoding="(.*?)".+
        xml-stylesheet\ href="(.*?)".+
        xmlns="(.*?)".+
        xmlns:xsi="(.*?)".+
        xsi:schemaLocation="(.*?)"
        /sx;print "$1 $2 $3 $4 $5 $6\n"' *.xml |\
while read VER ENC STYLE XMLNS XMLNSXSI SCHEMA
do
        echo "VER=$VER"
        echo "ENC=$ENC"
        echo "STYLE=$STYLE"
        echo "XMLNS=$XMLNS"
        echo "XMLNSXSI=$XMLNSXSI"
        echo "SCHEMA=$SCHEMA"
        echo
done


Note how I had to escape the space after the word "stylesheet" since /x ignores whitespace by default.
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39252008
that is so awesome.
I sent the code up to see how it works.
Please bear with me,
first thing in the morning I will have an answer.
thank you sooo much.

Evan
 
LVL 9

Author Closing Comment

by:Evan Cutler
ID: 39254591
Genius.
Absolutely Genius...
first time it ran with no problems.

Thank you so much,
you have no idea how much hock you got me out of.

Thanks again.
 
LVL 12

Expert Comment

by:tel2
ID: 39254694
Glad to be of service, arcee.  Thanks for the points.

Now that I've achieved genius status, would this be a good time to break the (possibly) bad news?

If any of your xml files doesn't contain all of those attributes in the sequence they appear in your example, then that file will fail to match and will not be processed by the while loop.

To check this, I suggest you:
- Count your xml files
- Count the number of files processed by the while loop
- Investigate any differences in the above
If you need help doing the above, let me know, but I don't have any genius ideas on how to solve the problems if there are any.

A pleasure doing business.  Call again.

TRS