• Status: Solved

"Piping" streams in Java

I have a Lucene-based application that indexes e-mails. The content handler for HTML e-mail has gone through various incarnations, but I'm settling on feeding the dirty HTML message parts into org.ccil.cowan.tagsoup.Parser with a home-grown implementation of the org.xml.sax.ContentHandler interface, which essentially ignores everything but characters, i.e. it simply strips the tags. My ContentHandler wraps a java.io.Writer, which writes plain text into a temporary file. When it has finished processing the file, I close the writer and then open the temporary file to get a java.io.Reader, which Lucene accepts as a constructor parameter for generating an org.apache.lucene.document.Field. By using TagSoup's SAX approach for HTML tag stripping and then feeding a java.io.Reader to Lucene's Field constructor, I'm in good shape with respect to heap usage.
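
For reference, a minimal sketch of the kind of tag-stripping handler described above. It is not the original code: the names TextOnlyHandlerDemo, TextOnlyHandler and stripTags are made up for illustration, the JDK's own SAX parser stands in for TagSoup (so the input has to be well-formed), and a StringWriter stands in for the Writer on the temporary file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; all names here are illustrative only.
class TextOnlyHandlerDemo {

    /** Ignores every SAX event except characters(), which it copies to a Writer. */
    static class TextOnlyHandler extends DefaultHandler {
        private final Writer out;

        TextOnlyHandler(Writer out) {
            this.out = out;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                out.write(ch, start, length);
            } catch (IOException e) {
                // SAX callbacks cannot throw IOException directly, so wrap it.
                throw new SAXException(e);
            }
        }
    }

    /** Parses well-formed markup and returns only its character data. */
    static String stripTags(String markup) {
        // Assumption: a StringWriter replaces the temporary-file Writer for the demo.
        StringWriter text = new StringWriter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new TextOnlyHandler(text));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello, <b>world</b>!</p>")); // prints "Hello, world!"
    }
}
```

Because the handler only depends on java.io.Writer, the destination (temporary file, in-memory buffer, or pipe) can be swapped without touching the SAX side.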

However, writing to a temporary file with a Writer and then immediately reading all the content back with a Reader isn't very elegant. The Lucene interface demands that I present it with a class that extends java.io.Reader. Do you reckon there's a practicable way to extend java.io.Reader to read from an org.xml.sax.ContentHandler, i.e. "piping" the SAX characters events into the Reader so that the plain text can be written directly into a Lucene Field?

Asked by: rstaveley

CEHJ commented:
This works for me

package xml.sax;

import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.PrintWriter;
import java.net.URL;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class BasicSaxPiped {
      static PrintWriter out;

      public static void main(String[] args) {
            // Create a handler to handle the SAX events generated during parsing
            MyHandler handler = new MyHandler();

            /*
             * Parse the file using the handler
             * and pipe the output from the handler
             * into a Reader
             */
            
            final PipedWriter pw = new PipedWriter();
            out = new PrintWriter(pw, true);

            new Thread() {
                  public void run() {
                        try {
                              PipedReader in = new PipedReader(pw);
                              int buf = -1;
                              while ((buf = in.read()) > -1) {
                                    System.out.print((char) buf);
                              }
                              in.close();
                        } catch (IOException e) {
                              // This is to trap a bug - mine or theirs, I don't know ...
                              if (e.getMessage() == null || e.getMessage().indexOf("Write end dead") < 0) {
                                    e.printStackTrace();
                              }
                        }
                  }
            }.start();

            // Parses an XML file using a SAX parser.
            parseXmlFile(
                        "file:/C:/Documents and Settings/Charles/workspace/WorthKeeping/xml/sax/infilename.xml",
                        handler, false);

      }

      // If validating is true, the content is validated against the DTD
      // specified in the file.
      public static void parseXmlFile(String url, DefaultHandler handler,
                  boolean validating) {
            try {
                  // Create a SAX parser factory
                  SAXParserFactory factory = SAXParserFactory.newInstance();
                  factory.setValidating(validating);

                  // Create the parser and parse the file
                  factory.newSAXParser().parse(new URL(url).openStream(), handler);
            } catch (SAXException e) {
                  e.printStackTrace();
            } catch (ParserConfigurationException e) {
                  e.printStackTrace();
            } catch (IOException e) {
                  e.printStackTrace();
            }
      }

      // DefaultHandler contains no-op implementations for all SAX events.
      // This class should override methods to capture the events of interest.

      static class MyHandler extends DefaultHandler {

            public void startElement(String namespaceURI, String localName,
                        String qName, Attributes atts) throws SAXException {
            }

            public void endElement(String namespaceURI, String localName, String qName)
                        throws SAXException {
            }

            public void characters(char[] ch, int start, int length)
                        throws SAXException {
                  // Write the characters straight through, without an intermediate String
                  out.write(ch, start, length);
            }
      }
}
 
Mayank S (Associate Director - Product Engineering) commented:
You could try StringWriter/ StringReader if you don't want to create temporary files.
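
A rough sketch of what that suggestion amounts to, using only java.io. The class and method names (StringBufferedDemo, bufferThenRead, drain) are hypothetical, and the chunks array merely stands in for the text a SAX handler would emit. Note that the whole text sits on the heap before any of it can be read, which matters for large message parts.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;

// Hypothetical demo class; all names here are illustrative only.
class StringBufferedDemo {

    /** Collects all the text in memory first, then exposes it as a Reader. */
    static Reader bufferThenRead(String... chunks) {
        StringWriter buffer = new StringWriter();
        for (String chunk : chunks) {
            buffer.write(chunk); // stands in for the handler's characters() calls
        }
        // toString() materialises the complete text: everything is on the heap here.
        return new StringReader(buffer.toString());
    }

    /** Reads a Reader to the end and returns its content. */
    static String drain(Reader reader) {
        StringBuilder text = new StringBuilder();
        try (Reader r = reader) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(drain(bufferThenRead("Hello, ", "world!"))); // prints "Hello, world!"
    }
}
```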
 
rstaveley (author) commented:
Perfect! Another one I owe you, CEHJ :-)

[mayankeagle, I need to be a good citizen with respect to heap usage in my container. The PipedReader is what I needed - I just hadn't figured out that I needed a separate thread for reading from the stream as and when data became available from the SAX events.]
 
CEHJ commented:
:-)

If you find out about that 'bug' let me know here ;-)
 
rstaveley (author) commented:
Does main() complete before run() has run its course? My guess is that you need to join the PipedReader thread to wait for it to complete. I'll experiment to see.

 
rstaveley (author) commented:
It works nicely for me if I close the writer to get the EOF and then use join to wait for the PipedReader thread to complete.

i.e.
--------8<--------
          out = new PrintWriter(pw, true);
          Thread t = new Thread() {
              /* ... */
          };
          t.start();

          // Parses an XML file using a SAX parser.
          parseXmlFile(
                    //"file:/C:/Documents and Settings/Charles/workspace/WorthKeeping/xml/sax/infilename.xml",
                    "file:/"+filename,
                    handler, false);
          try {
               out.close();  // Close the output stream so that the PipedReader sees EOF
               t.join();     // Wait for the PipedReader thread to finish before exiting main
          } catch (InterruptedException e) {
               e.printStackTrace();
          }
--------8<--------
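
Putting the pieces of this thread together, a consolidated sketch might look like the following. It is not CEHJ's or rstaveley's code. Here the parser runs on the background thread and the caller is handed the Reader, on the assumption that the consumer (Lucene, in the original question) reads on the calling thread. The names PipedSaxDemo, textReader and drain are made up for illustration, and the JDK's SAX parser stands in for TagSoup.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; all names here are illustrative only.
class PipedSaxDemo {

    /** Starts parsing on a background thread and returns a Reader over the text. */
    static Reader textReader(final String markup) {
        final PipedWriter writer = new PipedWriter();
        final PipedReader reader;
        try {
            reader = new PipedReader(writer); // connect both ends before any write
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        final DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                try {
                    writer.write(ch, start, length); // blocks while the pipe buffer is full
                } catch (IOException e) {
                    throw new SAXException(e);
                }
            }
        };

        Thread producer = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(new InputSource(new StringReader(markup)), handler);
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                try {
                    writer.close(); // lets the reading side see end of stream
                } catch (IOException ignored) {
                    // nothing useful to do if close itself fails
                }
            }
        }, "sax-producer");
        producer.start();
        return reader;
    }

    /** Reads a Reader to the end and returns its content. */
    static String drain(Reader reader) {
        StringBuilder text = new StringBuilder();
        try (Reader r = reader) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(drain(textReader("<p>Hello, <b>world</b>!</p>"))); // prints "Hello, world!"
    }
}
```

Closing the writer in a finally block means the reader reaches end of stream even if parsing fails. Without the close, the reading side can fail with an IOException once the writing thread has exited, which appears to be what the "Write end dead" message in the first listing was reporting.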
 
CEHJ commented:
Thanks rstaveley - well done