"Piping" streams in Java
Posted on 2006-05-13
I have a Lucene-based application that indexes e-mails. The content handler for HTML e-mail has gone through various incarnations, but I'm settling on feeding the dirty HTML message parts into org.ccil.cowan.tagsoup.Parser, with a home-grown implementation of the org.xml.sax.ContentHandler interface that ignores everything except characters events, i.e. it simply strips the tags. My ContentHandler wraps a java.io.Writer, which writes the plain text into a temporary file. When parsing has finished, I close the writer and then open the temporary file to get a java.io.Reader, which Lucene accepts as a constructor parameter for generating an org.apache.lucene.document.Field. Because TagSoup's SAX approach strips the tags as a stream and Lucene's Field constructor takes a java.io.Reader, the whole message never has to be held in memory, so I'm in good shape with respect to heap usage.
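For reference, here is a minimal sketch of that kind of tag-stripping handler. The class name TextOnlyHandler and the strip helper are made up for illustration, and the JDK's built-in SAX parser stands in for TagSoup so the example runs without extra jars (which means it only copes with well-formed markup, unlike TagSoup). In the real application the Writer would point at the temporary file rather than a StringWriter.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: forwards character data to a Writer and ignores all tags.
class TextOnlyHandler extends DefaultHandler {
    private final Writer out;

    TextOnlyHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            out.write(ch, start, length);
        } catch (IOException e) {
            // SAX callbacks cannot throw IOException, so wrap it
            throw new SAXException(e);
        }
    }

    // Demo helper (assumption: JDK parser instead of TagSoup, in-memory sink instead of a file).
    static String strip(String wellFormedMarkup) throws Exception {
        StringWriter sink = new StringWriter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(wellFormedMarkup)), new TextOnlyHandler(sink));
        return sink.toString();
    }
}
```

Every element, attribute and processing instruction falls through to DefaultHandler's empty implementations, so only the text survives.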
However, it isn't very elegant to use a Writer to write to a temporary file and then a Reader to read all the content back from it immediately afterwards. The Lucene interface demands that I present it with a class which extends java.io.Reader. Do you reckon there's a practicable way to extend java.io.Reader to read from an org.xml.sax.ContentHandler, i.e. "piping" the SAX characters events into the Reader so that the plain text can be written directly into a Lucene Field?
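One candidate I can think of is the JDK's own java.io.PipedWriter and java.io.PipedReader pair: the SAX parse runs in a second thread and writes characters events into the PipedWriter, while the consumer reads from the connected PipedReader. What follows is only a sketch under assumptions, not a tested solution for my application: SaxTextPipe and open are hypothetical names, the JDK SAX parser again stands in for TagSoup, and error handling is reduced to closing the pipe.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: exposes the text content of a SAX parse as a Reader.
class SaxTextPipe {
    static Reader open(final InputSource markup) throws IOException {
        final PipedWriter sink = new PipedWriter();
        // Connect both ends before the producer thread starts writing.
        PipedReader source = new PipedReader(sink);

        final DefaultHandler textOnly = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                try {
                    // Blocks while the pipe's buffer is full, so memory use stays bounded.
                    sink.write(ch, start, length);
                } catch (IOException e) {
                    // E.g. the consumer closed the Reader early; abort the parse.
                    throw new SAXException(e);
                }
            }
        };

        Thread producer = new Thread(() -> {
            try {
                // Assumption: JDK parser here; the TagSoup parser would take its place.
                SAXParserFactory.newInstance().newSAXParser().parse(markup, textOnly);
            } catch (Exception e) {
                // Sketch only: a failed parse simply truncates the text the consumer sees.
            } finally {
                try {
                    sink.close(); // signals end of stream to the reading side
                } catch (IOException ignored) {
                    // nothing useful to do here
                }
            }
        }, "sax-text-pipe");
        producer.setDaemon(true);
        producer.start();
        return source;
    }
}
```

The Reader returned by open would be what gets handed to Lucene in place of the temporary file's Reader. The price is an extra thread per message, and a parse error is invisible to the consumer unless the producer records it somewhere, so whether this is really more elegant than the temporary file is debatable.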