fsyed

Need help to slightly modify an existing XML document using either XSLT or Java

Dear fellow Java/XML developers:

I have an XML file that I need to modify slightly, using either XSLT or Java, but I am not sure how to do this.  The current document structure is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<collection name="Name of Collection">
    <book number="1" title="Book Title">
        <chapter number="1:Chapter Title">
            <quote number="1.0001" reference="Book 1, Number 1">
                <narrator></narrator>
                <quotation>
                    <quotation-text></quotation-text>
                    <quotation-footnote></quotation-footnote>
                </quotation>
            </quote>
            ...
        </chapter>
        <chapter number="2:Chapter Title">
             <quote number="1.0005" reference="Book 1, Number 5">
                <narrator></narrator>
                <quotation>
                    <quotation-text></quotation-text>
                    <quotation-footnote></quotation-footnote>
                </quotation>
            </quote>
         ...
     </book>
     <book number="2" title="Book Title">
         <chapter number="1:Chapter Title">
             <quote number="2.0025" reference="Book 2, Number 25">
                <narrator></narrator>
                <quotation>
                    <quotation-text></quotation-text>
                    <quotation-footnote></quotation-footnote>
                </quotation>
            </quote>
      ....
</collection>

The changes I need to make are:

1.  add a "title" attribute to the <chapter> element by breaking up the current "number" attribute, such that:

 <chapter number="1:Chapter Title">

changes to:

<chapter number="1" title="Chapter Title"> (and have the colon removed in the process).

2.  add the value of the "number" attribute from the <chapter> element, to the "reference" attribute of the  <quote> element, and modify the "number" attribute of the <quote> element, such that:

<chapter number="1:Chapter Title">
<quote number="1.0001" reference="Book 1, Number 1">

changes to:

<quote number="1" reference="Book 1, Chapter 1, Number 1">

As of right now, the "number" attribute of the <quote> element is made up of the book number, followed by the quote number, separated by a period.  I want to eliminate the book number, the period, and all of the leading zeroes in front of the quote number, so that ONLY the quote number remains.  I hope this makes sense.

The root element of the XML document is <collection>.  <collection> contains several <book> elements, each of which contains several <chapter> elements, and each <chapter> element contains several <quote> elements.

I hope this question is not too complicated, and if it is, please let me know which part is confusing, and I will do my best to further clarify.

My sincerest thanks to all who reply.
SOLUTION
Gertone (Geert Bormans)
This solution is only available to members.
fsyed

ASKER

Dear Gertone:

Thanks (yet again!) for an amazing solution!  Unfortunately, there is one slight error in the output: the value of the "number" attribute in the <quote> element is the book number.  I need the book number, the period, and the leading zeroes removed.

The fault is actually mine when I look at the example I provided above.  Here is a clearer example:

<chapter number="3:Chapter Title">
<quote number="1.0002" reference="Book 1, Number 2">

changes to:

<quote number="2" reference="Book 1, Chapter 3, Number 2">

In the number attribute above, 1 refers to the book number, and 0002 refers to the quote number (which is what I need to keep, minus the zeroes).

I hope this helps, and thanks again for such a quick response.

Also, can you show me how to run the XSLT from java?

Thanks again.
Sincerely;
Fayyaz
fsyed

ASKER

As usual Gertone, your answers are truly outstanding, and are always delivered immediately.  I truly appreciate all the work you have done.  I was wondering if you could provide me a breakdown of the revised, complete XSLT sheet to explain to me what is happening?  This problem was tricky for me which is why I posted my problem, and as I suspected, your solution is quite involved.  You truly are a genius!

Thanks again for everything.
Sincerely;
Fayyaz
Here is an explanation. I will break it down into multiple posts, so that each explanation sits next to its code snippet.

If you only make small changes to an XML source document, you usually start with a so-called identity copy stylesheet.
That is a stylesheet with one template, as below (variants exist), that makes the output an identical copy of the input.

Each node receives the following treatment:
- xsl:copy copies the current node to the output, without its attributes or children. For a text() node this copies the text; for an element, the start and end tags; for a comment, the comment. Note that match="node()" covers element, text, comment and processing-instruction nodes; attributes are not matched by it, which is why they are copied explicitly inside the xsl:copy
- inside the xsl:copy you need to do something with the attributes and the children
  + all attributes are copied as is
  + all child nodes are pushed to the templates as well. Since there is only one template, the same copying occurs on the nested levels

I hope this makes the identity copy clear.
   <xsl:template match="node()">
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates select="node()"></xsl:apply-templates>
        </xsl:copy>
    </xsl:template>
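
Since the thread also asks how to run the XSLT from Java, here is a minimal sketch using the standard JAXP API (javax.xml.transform). It assumes the template above has been wrapped in an <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> root element and saved to a file. The class name RunXslt and the three file arguments are illustrative, not something from the original posts.

```java
import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class RunXslt {

    // Compiles the stylesheet file, applies it to the input file,
    // and writes the transformed document to the output file.
    static void transform(File stylesheet, File input, File output)
            throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer(new StreamSource(stylesheet));
        transformer.transform(new StreamSource(input), new StreamResult(output));
    }

    public static void main(String[] args) throws TransformerException {
        if (args.length != 3) {
            // Hypothetical command line; adjust to your own setup.
            System.err.println("usage: java RunXslt <stylesheet.xsl> <input.xml> <output.xml>");
            return;
        }
        transform(new File(args[0]), new File(args[1]), new File(args[2]));
    }
}
```

With only the identity template in the stylesheet, the output file should carry the same elements, attributes and text as the input, which is a quick way to check the setup before adding more specific templates.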


For matching nodes to a template, the most specific match pattern wins.
Although the identity template above also matches the chapter element node,
<xsl:template match="chapter">... is more specific for the chapter element
(in XSLT terms, the name pattern chapter has a higher default priority than node()).

So adding an extra template for a specific element to the identity transformation stylesheet
will still make the output identical to the input, except for this one specific element.

Here is what we do with chapter elements
- we copy all their attributes, except the @number
- we create an attribute number, with the value being a part of the original number attribute
- we create an attribute title, with the value being another part of the original @number
- and then we process the child nodes in exactly the same fashion as before
    <xsl:template match="chapter">
        <xsl:copy>
            <xsl:copy-of select="@*[not(name() = 'number')]"/>
            <xsl:attribute name="number">
                <xsl:value-of select="substring-before(@number, ':')"/>
            </xsl:attribute>
            <xsl:attribute name="title">
                <xsl:value-of select="substring-after(@number, ':')"/>
            </xsl:attribute>
            <xsl:apply-templates select="node()"></xsl:apply-templates>
        </xsl:copy>
    </xsl:template>
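
For comparison, since the question allowed either XSLT or Java, the same chapter rule can be sketched directly against the DOM. This is an alternative to the template above, not part of the posted solution. The class and method names are made up for illustration, and skipping chapters without a colon is an assumption (the XSLT above would emit two empty attributes in that case).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ChapterSplitSketch {

    // Mirrors the chapter template: number = text before the first colon,
    // title = text after it. Chapters without a colon are left untouched
    // (an assumption of this sketch).
    static void splitChapterNumbers(Document doc) {
        NodeList chapters = doc.getElementsByTagName("chapter");
        for (int i = 0; i < chapters.getLength(); i++) {
            Element chapter = (Element) chapters.item(i);
            String raw = chapter.getAttribute("number");
            int colon = raw.indexOf(':');
            if (colon < 0) {
                continue;
            }
            chapter.setAttribute("number", raw.substring(0, colon));
            chapter.setAttribute("title", raw.substring(colon + 1));
        }
    }

    // Parses an XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<book><chapter number=\"1:Chapter Title\"/></book>");
        splitChapterNumbers(doc);
        Element c = (Element) doc.getElementsByTagName("chapter").item(0);
        // prints: 1 | Chapter Title
        System.out.println(c.getAttribute("number") + " | " + c.getAttribute("title"));
    }
}
```

The XSLT route keeps the whole change declarative in one stylesheet; the DOM route may be easier to step through in a debugger if you are more at home in Java.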


The next template, for quote, is basically a variant of the chapter template. Nothing really new,
except that we need to look up the number attribute of an ancestor (the enclosing chapter).
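
The quote template itself is not shown in this post, so what follows is only a guess at its shape, written to match the requirement stated earlier in the thread (number="1.0002" inside chapter "3:Chapter Title" becomes number="2" and reference="Book 1, Chapter 3, Number 2"). The template body, the variable names chap, head and tail, and the class name are all hypothetical. The sketch assumes every reference attribute contains the text ", Number", and it embeds the stylesheet in a Java string so it can be run as is.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class QuoteTemplateSketch {

    // Identity template from the post plus a HYPOTHETICAL quote template.
    // XML attributes are single-quoted, so XPath string literals use
    // (escaped) double quotes.
    static final String XSLT = String.join("\n",
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>",
        "  <xsl:template match='node()'>",
        "    <xsl:copy>",
        "      <xsl:copy-of select='@*'/>",
        "      <xsl:apply-templates select='node()'/>",
        "    </xsl:copy>",
        "  </xsl:template>",
        "  <xsl:template match='quote'>",
        // chapter number: part of the ancestor's original @number before the colon
        "    <xsl:variable name='chap' select='substring-before(ancestor::chapter/@number, \":\")'/>",
        // split the old reference around ', Number' so the chapter can be spliced in
        "    <xsl:variable name='head' select='substring-before(@reference, \", Number\")'/>",
        "    <xsl:variable name='tail' select='substring-after(@reference, \", Number\")'/>",
        "    <xsl:copy>",
        "      <xsl:copy-of select='@*[not(name() = \"number\" or name() = \"reference\")]'/>",
        // number() turns '0002' into 2, which drops the leading zeroes
        "      <xsl:attribute name='number'>",
        "        <xsl:value-of select='number(substring-after(@number, \".\"))'/>",
        "      </xsl:attribute>",
        "      <xsl:attribute name='reference'>",
        "        <xsl:value-of select='concat($head, \", Chapter \", $chap, \", Number\", $tail)'/>",
        "      </xsl:attribute>",
        "      <xsl:apply-templates select='node()'/>",
        "    </xsl:copy>",
        "  </xsl:template>",
        "</xsl:stylesheet>");

    // Runs the embedded stylesheet over an XML string and returns the result.
    static String transform(String xml) throws TransformerException {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String in = "<chapter number=\"3:Chapter Title\">"
                  + "<quote number=\"1.0002\" reference=\"Book 1, Number 2\"/>"
                  + "</chapter>";
        System.out.println(transform(in));
    }
}
```

The chapter template is left out here to keep the sketch short. It can sit in the same stylesheet without conflict, because the quote template reads the chapter's number from the source tree, where it still has the "3:Chapter Title" form.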

Let me know if anything is still unclear at this point

cheers

Geert