
"Piping" streams in Java

I have a Lucene-based application that indexes e-mails. The content handler for HTML e-mail has gone through various incarnations now, but I'm settling on putting the dirty HTML message parts into org.ccil.cowan.tagsoup.Parser, using a home-grown implementation of the org.xml.sax.ContentHandler interface, which essentially ignores everything but characters, i.e. it simply strips the tags. My ContentHandler wraps a java.io.Writer, which writes plain text into a temporary file. When it has finished processing the file, I close the writer and then open the temporary file to get a java.io.Reader, which Lucene accepts as a constructor parameter for generating an org.apache.lucene.document.Field. By using TagSoup's SAX approach for HTML tag stripping and then feeding a java.io.Reader to Lucene's Field constructor, I'm in good shape with respect to heap usage.

However, it isn't very elegant to use a Writer to write to a temporary file and then a Reader to read all the content back from it immediately afterwards. The Lucene interface demands that I present it with a class which extends java.io.Reader. Do you reckon there's a practicable way to extend java.io.Reader to read from an org.xml.sax.ContentHandler, i.e. "piping" the SAX characters events into the Reader so that the plain text can be written directly into a Lucene Field?

ASKER CERTIFIED SOLUTION

CEHJ


Mayank S

You could try StringWriter/StringReader if you don't want to create temporary files.
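
For reference, a minimal sketch of what that suggestion might look like. The class and method names (InMemorySketch, viaMemory, readAll) are made up for illustration, and a plain String stands in for the text a ContentHandler would write. Note that the whole text sits on the heap before the Reader is handed out, which matters if heap usage is a concern.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;

class InMemorySketch {

    /** Buffers the text in a StringWriter, then exposes it as a Reader. */
    static Reader viaMemory(String text) {
        StringWriter sw = new StringWriter();
        sw.write(text); // stand-in for the ContentHandler's characters() writes
        // toString() copies the complete buffer, so all of the text is in memory here.
        return new StringReader(sw.toString());
    }

    /** Helper for the sketch: reads a Reader to the end and returns its content. */
    static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        r.close();
        return sb.toString();
    }
}
```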

rstaveley

ASKER

Perfect! Another one I owe you, CEHJ :-)

[mayankeagle, I need to be a good citizen with respect to heap usage in my container. The PipedReader is what I needed - I just hadn't figured out that I needed a separate thread for reading from the stream as and when data became available from the SAX events.]
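
A rough sketch of the PipedWriter/PipedReader idea, under some assumptions: the JDK's own SAX parser stands in for TagSoup (so the input must be well-formed), and the names SaxPipeSketch, TextOnlyHandler, textReader and drain are hypothetical. In this variant the parser runs on the extra thread and the caller reads from the returned Reader, which is one possible arrangement when the consumer, such as an indexer, wants to do the reading itself; putting the reader on the extra thread instead works too.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxPipeSketch {

    /** Hypothetical handler: forwards character data to a Writer and ignores tags. */
    static class TextOnlyHandler extends DefaultHandler {
        private final Writer out;

        TextOnlyHandler(Writer out) {
            this.out = out;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                out.write(ch, start, length);
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }
    }

    /** Returns a Reader that yields the markup's text content while it is being parsed. */
    static Reader textReader(final String markup) throws IOException {
        final PipedWriter pw = new PipedWriter();
        PipedReader pr = new PipedReader(pw);
        Thread producer = new Thread(() -> {
            // try-with-resources closes the writer, which is what signals EOF to the reader.
            try (Writer w = pw) {
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(new InputSource(new StringReader(markup)), new TextOnlyHandler(w));
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        producer.start();
        return pr;
    }

    /** Helper for the sketch: reads until EOF, as a consumer of the Reader would. */
    static String drain(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        r.close();
        return sb.toString();
    }
}
```

The pipe's internal buffer is small and fixed, so the writer blocks when it is full and only a bounded amount of text is held in memory at any time.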

CEHJ

:-)

If you find out about that 'bug' let me know here ;-)

rstaveley

ASKER

main() completes before run() has completed its course? My guess is you need to join the PipedReader thread to wait for it to complete. I'll experiment to see.

rstaveley

ASKER

It works nicely for me if I close the writer to get the EOF and then use join to wait for the PipedReader thread to complete.

i.e.
--------8<--------
out = new PrintWriter(pw, true);
Thread t = new Thread() {
    /* ... */
};
t.start();

// Parses an XML file using a SAX parser.
parseXmlFile(
        //"file:/C:/Documents and Settings/Charles/workspace/WorthKeeping/xml/sax/infilename.xml",
        "file:/" + filename,
        handler, false);
try {
    out.close(); // Close the output stream to get the PipedReader to see EOF
    t.join();    // Wait for the PipedReader to finish before exiting main
} catch (InterruptedException e) {
    e.printStackTrace();
}
--------8<--------
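
For anyone who wants to try the close-then-join pattern in isolation, here is a self-contained sketch. It is not the code from this thread: the class name CloseAndJoinSketch and the method pump are invented, and plain strings stand in for the text produced by the SAX events. The reader runs on its own thread, and the writing thread closes the writer and then joins.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.PrintWriter;

class CloseAndJoinSketch {

    /** Writes the chunks into a pipe and returns what the reader thread received. */
    static String pump(String... chunks) throws IOException, InterruptedException {
        PipedWriter pw = new PipedWriter();
        final PipedReader pr = new PipedReader(pw);
        final StringBuilder received = new StringBuilder();

        // Reader thread: consumes data as it becomes available, until EOF.
        Thread t = new Thread(() -> {
            char[] buf = new char[256];
            int n;
            try {
                while ((n = pr.read(buf)) != -1) {
                    received.append(buf, 0, n);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        t.start();

        PrintWriter out = new PrintWriter(pw, true);
        for (String chunk : chunks) {
            out.print(chunk); // stand-in for text arriving from SAX characters events
        }
        out.close(); // closing the writer is what lets the reader see EOF
        t.join();    // wait for the reader thread before using what it collected
        return received.toString();
    }
}
```

Without the close() call the reader loop would keep waiting for more data, and without join() the calling thread could carry on before the reader has consumed everything.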

CEHJ

Thanks rstaveley - well done
