• Status: Solved

Need to remove white space from text contained in values of XML document

Dear fellow XML/XSLT developers:

I have an XML document whose elements in some cases hold quite a bit of text.  Unfortunately, much of this text contains extraneous white space and blank lines, and this occurs throughout the document.  I would like a small program in Java or XSLT that goes through the entire XML document and removes the extra white space, so that only 2 spaces remain after each period, double quote, question mark, exclamation mark and colon, and a single space after each comma, semicolon and bracket (round, curly or square); i.e. standard spacing for punctuation in English.  My XML document looks like the following:
 

<?xml version="1.0" encoding="UTF-8"?>
<collection name="Collection Title">
    <book number="1" title="Book Title">
        <quote number="1" reference="Book 1, Number 1">
             <narrator>John Doe</narrator>
<quotation>blah             blah   blah             </quotation>
</quote>
       
        <quote number="2" reference="Book 1, Number 2">
<narrator>Jane Doe</narrator>
<quotation>blah blah             blah</quotation>
</quote>
...

</book>
</collection>

The XML document is structured such that <collection> is the parent element, which contains several <book> elements.  Each <book> element contains several <quote> elements.  The spacing problem exists within the <quotation> element ONLY (which sits inside the <quote> element).

I hope this is clear.  Please let me know if anything is confusing.  

Thanks in advance to all who reply.
 
Asked by: fsyed
 
Geert Bormans commented:
If you are happy with a simple normalize-space,
make an identity transform and normalize each text node.
Leading and trailing white space is then stripped, and every run of white space becomes a single space.

If that is not enough for you, I would still do that and use it as a basis.
After you have normalized the spaces of each text node, you can start replacing occurrences of
dot-space with dot-space-space.
That would of course best be done in XSLT2 (you can plug Saxon into a Java transform: www.saxonica.com)
Otherwise you will be up for some recursive processing in XSLT1 (I know which one I would choose :-)

So, let me know:
- do you need help with normalizing the spaces in the text nodes
- do you really need the double space after the periods
- can you use XSLT2

And I will show you some code
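
For readers who prefer the Java route the question mentions, the two steps described above (normalize, then widen dot-space) can be sketched on a plain string. This is only a minimal illustration, not the XSLT that follows in the thread; the class and method names are made up for the example, and it assumes the text of a <quotation> element has already been extracted.

```java
class SpaceNormalizer {
    // Approximates XPath normalize-space(): collapse runs of space, tab,
    // CR and LF into one space, then drop the single space left at either end.
    static String normalizeSpace(String s) {
        String collapsed = s.replaceAll("[ \\t\\r\\n]+", " ");
        return collapsed.replaceAll("^ | $", "");
    }

    // Second step: every dot-space becomes dot-space-space.
    // Only meaningful on text that has already been normalized.
    static String widenAfterPeriod(String normalized) {
        return normalized.replace(". ", ".  ");
    }

    public static void main(String[] args) {
        String raw = "blah             blah.   blah             ";
        System.out.println("[" + widenAfterPeriod(normalizeSpace(raw)) + "]");
    }
}
```

Doing the normalization first keeps the second step simple, because after it there is never more than one space to the right of a period.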
 
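Geert Bormans commented:
Here is an XSLT1 space normalizer, hope it works for you
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output indent="yes"/>
    <!-- Identity-style copy: reproduce every node together with its attributes -->
    <xsl:template match="node()">
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
    </xsl:template>
    <!-- Text nodes: strip leading/trailing white space and collapse runs to one space.
         text() and node() have the same default priority; on that conflict a processor
         either reports an error or uses the template that comes last, i.e. this one -->
    <xsl:template match="text()">
        <xsl:value-of select="normalize-space(.)"/>
    </xsl:template>
</xsl:stylesheet>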
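
As a rough illustration of how a stylesheet like this could be run from Java, the sketch below uses the standard javax.xml.transform API that ships with the JDK (which handles XSLT 1.0). The class name, the inline sample input and the idea of embedding the stylesheet as a string are assumptions made for the example; in practice both would more likely be files. The embedded copy uses the more specific */text() pattern so that the two templates do not conflict.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class RunNormalizer {
    // Copy of the XSLT 1.0 normalizer, with */text() as the text pattern.
    static final String XSLT =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
      + "<xsl:output indent='yes'/>"
      + "<xsl:template match='node()'>"
      + "<xsl:copy><xsl:copy-of select='@*'/><xsl:apply-templates select='node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='*/text()'>"
      + "<xsl:value-of select='normalize-space(.)'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the serialized result.
    static String transform(String xml) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped so callers do not have to declare a checked exception.
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<quote number='1'><narrator>John Doe</narrator>"
                   + "<quotation>blah             blah   blah             </quotation></quote>";
        System.out.println(transform(xml));
    }
}
```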

 
Geert Bormans commented:
Here is an XSLT2 variant that doubles the appropriate spaces again

Note that I changed the template for text() into */text(), which is more specific,
in order to avoid ambiguous template warnings.
I should have done that in the first place
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output indent="yes"/>
    <!-- Identity-style copy: reproduce every node together with its attributes -->
    <xsl:template match="node()">
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
    </xsl:template>
    <!-- Normalize first, then turn the single space after . " ? ! or : into two spaces -->
    <xsl:template match="*/text()">
        <xsl:variable name="regex"><xsl:text>([\."?!:])\s</xsl:text></xsl:variable>
        <xsl:value-of select="replace(normalize-space(.), $regex, '$1  ')"/>
    </xsl:template>
</xsl:stylesheet>
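
To see what the regular expression in this stylesheet does without setting up an XSLT2 processor, the same pattern can be tried with Java's own regex engine. This is only a sketch under the assumption that the input has already been space-normalized; the class and method names are invented for the example, and Java's regex dialect is not identical to the XPath one, although this particular pattern reads the same in both.

```java
class PunctuationSpacing {
    // Same idea as replace(normalize-space(.), $regex, '$1  '):
    // one white-space character after . " ? ! or : becomes two spaces.
    static String widen(String normalized) {
        return normalized.replaceAll("([.\"?!:])\\s", "$1  ");
    }

    public static void main(String[] args) {
        System.out.println(widen("Pi is 3.14. Really? Yes: it is."));
    }
}
```

Because the pattern requires white space right after the punctuation mark, a decimal point inside a number such as 3.14 is left alone.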


 
Geert Bormans commented:
An XSLT1 alternative is a lot more involved, so I will only post some suggestions in that direction if you really need it
 
fsyed (author) commented:
Dear Gertone:

Thanks once again for your prompt reply.  

In answer to your questions above:

So, let me know:
- do you need help with normalizing the spaces in the text nodes
Yes, and your first XSLT code fixed this.  Thanks.  This part is done.

- do you really need the double space after the periods
Yes, I do.  

- can you use XSLT2
I have never used XSLT2, so I am not sure how to use it.  Do I use it the same way as XSLT?  If you can show me how to use XSLT2, I would be more than happy to use an XSLT2 version, if it's much easier to implement.

Thanks again for your help.
Sincerely;
Fayyaz
 
fsyed (author) commented:
Since the spaces have all been normalized, at this point I would need a program to add 2 spaces after every period (.), exclamation mark (!), and question mark (?).  This should be it.

Thanks again for all of your help.
Sincerely;
Fayyaz
 
Geert Bormans commented:
Hi Fayyaz,

I am not a Java developer, so it is hard for me to give you full details on that,
but basically, I think you need to put the Saxon jar on the classpath
and explicitly tell the transformer factory to use Saxon instead of the processor built into Java.

You can download Saxon here
http://sourceforge.net/projects/saxon/
The docs are here
http://www.saxonica.com/documentation/index/intro.html

I know there are some Java examples in the resource package
saxon-resources9-n.zip
which can be downloaded from SourceForge too

I hope that is enough to get you started with XSLT2
I highly recommend you do so

Geert
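
One possible way to "explicitly tell the transformer factory to use Saxon" is sketched below. It assumes a Saxon 9 jar is on the classpath and relies on the factory class name net.sf.saxon.TransformerFactoryImpl; the helper class and its fallback to the JDK's built-in processor are illustrative choices, not something prescribed in this thread.

```java
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerFactoryConfigurationError;

class SaxonFactory {
    // Factory class provided by the Saxon jar (assumed to be on the classpath).
    static final String SAXON_FACTORY = "net.sf.saxon.TransformerFactoryImpl";

    // Returns a Saxon-backed factory when available, otherwise the JDK default.
    static TransformerFactory create() {
        try {
            // Names the implementation directly instead of relying on lookup order.
            return TransformerFactory.newInstance(SAXON_FACTORY, null);
        } catch (TransformerFactoryConfigurationError e) {
            // Saxon is not on the classpath: fall back to the built-in
            // processor, which only understands XSLT 1.0.
            return TransformerFactory.newInstance();
        }
    }

    public static void main(String[] args) {
        // Prints the implementation class actually in use.
        System.out.println(create().getClass().getName());
    }
}
```

Printing the class name is a quick way to confirm which processor was picked up before running a version="2.0" stylesheet through it.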
 
Geert Bormans commented:
I am not sure I agree completely with your latest statement.

I don't think we can simply add two spaces after every dot.
Maybe you have numbers in there, such as 3.14
You don't want them to become 3.  14

I think that after normalisation there will be exactly one space after every dot that denotes the end of a sentence.
So the logic will likely be to turn ". " into ".  "
(dot-space becomes dot-space-space, as I suggested before)

In any case, this requires only a simple regular expression in XSLT2 (I actually implemented this logic in one of my follow-ups)
In XSLT1 it would require some more complex recursive processing

Somewhat more complex string operations like these are a very good reason to move to XSLT2.
You need to invest a little time the first time you set up Saxon,
but it pays back the cost 100 times afterwards.
If you control the environment, as you seem to do, there is no excuse for not migrating to XSLT2

If you still think that the logic you suggested is better (replace every dot with dot-space-space) I will show you how to do that in XSLT2

have fun,

Geert
 
fsyed (author) commented:
You raised a very good point.  Is there a way to modify the rules such that no spaces are added if a number follows a period, and two spaces if a character does?  I think such a rule would work.

Thanks very much for bringing this up.
Sincerely;
Fayyaz
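
The rule proposed here (two spaces when a letter follows the punctuation mark, nothing added when a digit follows) could be sketched in Java with a lookahead. This is a hypothetical illustration with invented names, not code from the thread, and it has an obvious weakness: abbreviations such as "e.g." would also be widened.

```java
class LetterAwareSpacing {
    // After . ! or ?, replace any existing white space with exactly two
    // spaces, but only when the next character is a letter. A digit after
    // the mark (as in 3.14) fails the lookahead, so nothing changes there.
    static String widenBeforeLetters(String text) {
        return text.replaceAll("([.!?])\\s*(?=\\p{L})", "$1  ");
    }

    public static void main(String[] args) {
        System.out.println(widenBeforeLetters("Hello.World 3.14 ok. Next"));
    }
}
```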
 
Geert Bormans commented:
Yep, I can easily do that, but don't you think the space is already there?
Anyway, this would make the logic more complex, which is even more of a reason to move to XSLT2.
So before I change the XSLT, let us try to get Saxon 9 to work
and then try the XSLT in comment ID 24231410 first, to see whether that already gives you what you need
 
fsyed (author) commented:
Thanks very much, yet again.  I was able to configure the XSLT2 processor, run it on the output from the XSLT1 version, and get the desired output.  I now feel a bit more empowered, with some exposure to XSLT2, and I got my XML document exactly how I wanted!  Thanks so much for consistently providing high quality solutions, so quickly.  Full points!
 
Geert Bormans commented:
Welcome