Solved

XSLT - Applying an XSLT against an HTML document to produce XML output

Posted on 2012-04-01
Last Modified: 2012-04-14
Hi

Is there a recommended approach to applying an XSLT transformation to an HTML document?

I would like to take HTML input and produce XML output.

I am currently using the Saxon XSLT processor in a Java app.

Is Saxon the way to go, or are there better approaches I could take?
Question by:Molko
10 Comments
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 37793049
In my opinion, Saxon is about the best choice you could make.

Transforming HTML into XML usually means adding nested grouping to a flat list of headings, so that
H1 H2 H1
becomes
section{H1 section{H2 ...}} section{H1 ...}
The XSLT 2 grouping facilities come in very handy for that.
So yes, I recommend using Saxon and XSLT 2.

Note that XSLT requires well-formed XML as its input.
If your source is HTML rather than XHTML,
you will need to run TagSoup or HTML Tidy over it to get well-formed XML before you can send it to XSLT.
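
To see why the cleanup step matters, here is a minimal sketch that uses only the JDK's built-in XML parser to test whether a piece of markup is well-formed. The class name WellFormedCheck and the sample strings are made up for illustration; it does not use TagSoup or HTML Tidy, it only shows the kind of input an XSLT processor would reject.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Saxon or any library mentioned in the thread.
public class WellFormedCheck {

    /** Returns true if the markup parses as well-formed XML with the JDK parser. */
    public static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations surface as SAX parse exceptions.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical HTML: <br> is never closed, so this is not well-formed XML.
        System.out.println(isWellFormed("<p>line one<br>line two</p>"));
        // XHTML style: every element is closed.
        System.out.println(isWellFormed("<p>line one<br/>line two</p>"));
    }
}
```

Anything that fails this check needs to go through an HTML parser such as TagSoup or HTML Tidy first.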
 

Author Comment

by:Molko
ID: 37793128
Hi

Yes, the source is HTML and not XHTML.

Could you provide me a simple example of the XSLT2 grouping ? Very much appreciated.

Thanks
 
LVL 47

Expert Comment

by:for_yan
ID: 37793659
Perhaps this would be an example of XSLT 2 grouping:
http://stackoverflow.com/questions/2177927/grouping-several-groups-in-xslt-2
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 37793786
Well, that is a good example (and simple to grasp) of one-level grouping.
It gets really complex and harder to swallow if you want to do this for up to six levels.
(I once did it in XSLT 1 and it required multiple steps; I never managed to get it working properly in a single step for anything other than straightforward examples.
So although it is complex, my statement about XSLT 2 holds.)
I have a stylesheet that I always use, but I did not write it myself, so I would rather pass you the reference than the stylesheet, to give the author credit.
(You could google for "nesting html with XSLT2 grouping" or something like that on the XSL-List (Mulberry Technologies).)
I will try to find the reference later tonight.
 
LVL 47

Expert Comment

by:for_yan
ID: 37793849
Perhaps you have already done this, but anyway:
I took the code from here:
http://blog.msbbc.co.uk/2007/06/simple-saxon-java-example.html
downloaded saxonb9-1-0-8j.zip from here:
http://sourceforge.net/projects/saxon/files/Saxon-B/9.1.0.8/saxonb9-1-0-8j.zip/download
unpacked it, and placed saxon9.jar on the classpath.

I used the input files from the link above:
http://stackoverflow.com/questions/2177927/grouping-several-groups-in-xslt-2

It worked exactly as stated there.

This is the code:
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;

public class SimpleSaxon {

    // Applies the stylesheet file xslID to the XML file sourceID and writes the result to System.out.
    // TransformerConfigurationException is a subclass of TransformerException, so one throws clause covers both.
    public static void myTransformer(String sourceID, String xslID) throws TransformerException {

        // Create a transformer factory instance (Saxon, because of the system property set in main).
        TransformerFactory tfactory = TransformerFactory.newInstance();

        // Create a transformer for the stylesheet.
        Transformer transformer = tfactory.newTransformer(new StreamSource(new File(xslID)));

        // Transform the source XML to System.out.
        transformer.transform(new StreamSource(new File(sourceID)),
                new StreamResult(System.out));
    }

    public static void main(String[] args) {

        // Tell JAXP to use Saxon's TransformerFactoryImpl.
        // The JDK's built-in XSLT processor only supports XSLT 1.0.
        System.setProperty("javax.xml.transform.TransformerFactory",
                "net.sf.saxon.TransformerFactoryImpl");

        String foo_xml = "input.xml"; // input XML
        String foo_xsl = "input.xsl"; // input XSLT stylesheet

        try {
            myTransformer(foo_xml, foo_xsl);
        } catch (Exception ex) {
            handleException(ex);
        }
    }

    // Prints the exception and its stack trace.
    private static void handleException(Exception ex) {
        System.out.println("EXCEPTION: " + ex);
        ex.printStackTrace();
    }
}



input.xml


<article>
  <h1>A section title here</h1>
  <p>A paragraph.</p>
  <p>Another paragraph.</p>
  <bl>Bulleted list item.</bl>
  <bl>Another bulleted list item.</bl>
  <h1>Another section title</h1>
  <p>Yet another paragraph.</p>
</article>



input.xsl:

<xsl:stylesheet
  version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes"/>

  <xsl:template match="article">
    <xsl:copy>
      <xsl:for-each-group select="*" group-starting-with="h1">
        <sec>
          <xsl:copy-of select="."/>
          <xsl:for-each-group select="current-group() except ." group-adjacent="boolean(self::bl)">
            <xsl:choose>
              <xsl:when test="current-grouping-key()">
                <list>
                  <xsl:apply-templates select="current-group()"/>
                </list>
              </xsl:when>
              <xsl:otherwise>
                <xsl:copy-of select="current-group()"/>
              </xsl:otherwise>
            </xsl:choose>
          </xsl:for-each-group>
        </sec>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="bl">
    <list-item>
      <xsl:apply-templates/>
    </list-item>
  </xsl:template>

</xsl:stylesheet>




Output:

<?xml version="1.0" encoding="UTF-8"?>
<article>
   <sec>
      <h1>A section title here</h1>
      <p>A paragraph.</p>
      <p>Another paragraph.</p>
      <list>
         <list-item>Bulleted list item.</list-item>
         <list-item>Another bulleted list item.</list-item>
      </list>
   </sec>
   <sec>
      <h1>Another section title</h1>
      <p>Yet another paragraph.</p>
   </sec>
</article>

 
LVL 60

Expert Comment

by:Geert Bormans
ID: 37793857
My friend, what I meant was more than one level.
It is easy with one level:
h1 - h1 - h1
and it is easy with predictable levels (see Michael Kay's excellent reference book):
h1 - h2 - h2 - h1 - h2 - h3 - h2 - h3 - h1

It is somewhat harder with unpredictable levels (as in most HTML out there):
h1 - h4 - h2 - h1 - h4 - h3 - h2 - h1 - h4 - h3

Please read my comments more carefully :-)
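
To make the "unpredictable levels" problem concrete, here is a rough sketch in plain Java rather than XSLT 2. It is not the stylesheet discussed in this thread. It assumes one possible nesting rule: a heading closes every open section at the same or a deeper level, and otherwise nests under the nearest shallower one, even if levels are skipped. The class HeadingNester and the empty h1 to h6 placeholder elements are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration of nesting a flat heading sequence; not from the thread.
public class HeadingNester {

    /** Turns a flat list of heading levels (1 to 6) into nested section markup. */
    public static String nest(List<Integer> levels) {
        StringBuilder out = new StringBuilder();
        Deque<Integer> open = new ArrayDeque<>(); // levels of currently open sections, innermost first
        for (int level : levels) {
            // Assumed rule: close every open section at the same or a deeper level.
            while (!open.isEmpty() && open.peek() >= level) {
                out.append("</section>");
                open.pop();
            }
            // Open a new section for this heading, nested in whatever is still open.
            out.append("<section><h").append(level).append("/>");
            open.push(level);
        }
        // Close whatever is still open at the end of the input.
        while (!open.isEmpty()) {
            out.append("</section>");
            open.pop();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The unpredictable sequence from the comment above.
        System.out.println(nest(Arrays.asList(1, 4, 2, 1, 4, 3, 2, 1, 4, 3)));
    }
}
```

With this rule an h4 directly after an h1 becomes a child of that h1, and a following h2 becomes its sibling. Whether that is the structure you want depends on the document, which is part of what makes the unpredictable case harder.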
 
LVL 60

Accepted Solution

by:
Geert Bormans earned 2000 total points
ID: 37798378
Have a look here if you are using XSLT 2:
http://www.dpawson.co.uk/xsl/rev2/html.html
David Carlisle developed an XSLT 2 HTML cleanup that you can use as a first step.
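
Running a cleanup stylesheet "as a first step" means chaining two transformations. The sketch below shows one way to do that with the standard JAXP API, passing the intermediate result as a DOM tree. The two tiny XSLT 1.0 stylesheets are placeholders invented for this example (they are not David Carlisle's cleanup), chosen so the sketch runs on the JDK's built-in processor. With Saxon on the classpath the same pattern should apply to XSLT 2 stylesheets.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical two-step pipeline; class and stylesheet names are illustrative only.
public class TwoStepPipeline {

    // Placeholder "cleanup": lower-cases element names (attributes are dropped).
    public static final String LOWER_CASE_NAMES =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='*'>"
      + "<xsl:element name=\"{translate(local-name(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')}\">"
      + "<xsl:apply-templates/>"
      + "</xsl:element>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Placeholder "main" step: renames p elements to para.
    public static final String RENAME_P =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:template match='p'><para><xsl:apply-templates/></para></xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies firstXsl to inputXml, then secondXsl to that result, and returns the final text. */
    public static String run(String inputXml, String firstXsl, String secondXsl) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();

            // Step 1: write the output into a DOM tree instead of a file or stream.
            DOMResult intermediate = new DOMResult();
            factory.newTransformer(new StreamSource(new StringReader(firstXsl)))
                   .transform(new StreamSource(new StringReader(inputXml)), intermediate);

            // Step 2: use that DOM tree as the input of the second stylesheet.
            StringWriter out = new StringWriter();
            factory.newTransformer(new StreamSource(new StringReader(secondXsl)))
                   .transform(new DOMSource(intermediate.getNode()), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run("<P>hi</P>", LOWER_CASE_NAMES, RENAME_P));
    }
}
```

Keeping the steps as separate stylesheets makes it possible to test the cleanup and the grouping independently.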
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 37799574
And here is the code I use for nesting flat structures:

http://stackoverflow.com/questions/2108348/xslt-deepening-content-structure

The answer from Martin Honnen is what you are looking for.
It is somewhat advanced stuff, so you will need to take some time to digest and adapt it.

Cheers

Geert
 

Author Closing Comment

by:Molko
ID: 37846083
Thanks
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 37846201
welcome