rstaveley asked:

"Piping" streams in Java

I have a Lucene-based application that indexes e-mails. The content handler for HTML e-mail has gone through various incarnations now, but I'm settling on putting the dirty HTML message parts into org.ccil.cowan.tagsoup.Parser, using a home-grown implementation of the org.xml.sax.ContentHandler interface, which essentially ignores everything but characters - i.e. it simply strips the tags. My ContentHandler wraps a java.io.Writer, which writes plain text into a temporary file. When it has finished processing the file, I close the writer and then open the temporary file to get a java.io.Reader, which Lucene accepts as a constructor parameter for generating an org.apache.lucene.document.Field. By using TagSoup's SAX approach for HTML tag stripping and then feeding a java.io.Reader to Lucene's Field constructor, I'm in good shape with respect to heap usage.

However, it isn't very elegant to use a Writer to write to a temporary file and then a Reader to read all the content back from it immediately afterwards. The Lucene interface demands that I present it with a class which extends java.io.Reader. Do you reckon there's a practicable way to extend java.io.Reader to read from an org.xml.sax.ContentHandler, i.e. "piping" the SAX characters events into the Reader so that the plain text can be written directly into a Lucene Field?
ASKER CERTIFIED SOLUTION
CEHJ
[This solution is only available to Experts Exchange members.]

You could try StringWriter/StringReader if you don't want to create temporary files.
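
For illustration, the StringWriter/StringReader idea might look something like the sketch below. It is only a sketch under assumptions: the class and method names (StringBufferSketch, TextOnlyHandler, stripTags, readAll) are made up for the example, and the JDK's built-in SAX parser stands in for TagSoup so that the code has no external dependencies, which means the input here must be well-formed. Note that the whole plain text sits on the heap before anything reads it.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StringBufferSketch {

    /** Hypothetical handler: forwards character events to a Writer and ignores all tags. */
    static class TextOnlyHandler extends DefaultHandler {
        private final Writer out;

        TextOnlyHandler(Writer out) {
            this.out = out;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                out.write(ch, start, length);
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }
    }

    /** Parses well-formed markup and returns a Reader over the tag-free text. */
    static Reader stripTags(String markup) throws Exception {
        StringWriter buffer = new StringWriter(); // all of the text accumulates in memory
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), new TextOnlyHandler(buffer));
        return new StringReader(buffer.toString());
    }

    /** Drains a Reader into a String (helper for the demo only). */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Reader r = stripTags("<html><body><p>Hello, <b>world</b></p></body></html>");
        System.out.println(readAll(r));
    }
}
```

The Reader returned by stripTags is what would be handed to the consumer. It avoids the temporary file, at the cost of buffering the complete message text.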
rstaveley (ASKER):

Perfect! Another one I owe you, CEHJ :-)

[mayankeagle, I need to be a good citizen with respect to heap usage in my container. The PipedReader is what I needed - I just hadn't figured out that I needed a separate thread for reading from the stream as and when data became available from the SAX events.]
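
To make the separate-thread point concrete, here is one possible arrangement, sketched under assumptions rather than taken from the thread: the SAX parse runs on a background thread and writes into a PipedWriter, while the caller gets the matching PipedReader to pass to whatever wants a java.io.Reader. The names PipedSaxSketch, textReader and readAll are hypothetical, and the JDK SAX parser stands in for TagSoup so that the example is self-contained.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class PipedSaxSketch {

    /** Starts a parser thread and returns the read end of the pipe. */
    static Reader textReader(final String markup) throws IOException {
        final PipedWriter writer = new PipedWriter();
        PipedReader reader = new PipedReader(writer);

        Thread producer = new Thread(() -> {
            try {
                DefaultHandler handler = new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) throws SAXException {
                        try {
                            // Blocks while the pipe's fixed-size buffer is full,
                            // so only a small window of text is held at once.
                            writer.write(ch, start, length);
                        } catch (IOException e) {
                            throw new SAXException(e);
                        }
                    }
                };
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(new InputSource(new StringReader(markup)), handler);
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                try {
                    writer.close(); // lets the reading side see end of stream
                } catch (IOException ignored) {
                    // nothing useful to do if closing the pipe fails
                }
            }
        }, "sax-producer");
        producer.setDaemon(true);
        producer.start();
        return reader;
    }

    /** Drains a Reader into a String (stands in for the real consumer). */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll(textReader("<p>Hello, <b>world</b></p>")));
    }
}
```

The two ends of a pipe must be used from different threads: with a single thread, the writer would block as soon as the pipe's buffer filled up and nothing would ever drain it.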
:-)

If you find out about that 'bug' let me know here ;-)
Does main() complete before run() has completed its course? My guess is that you need to join the PipedReader thread to wait for it to complete. I'll experiment to see.

It works nicely for me if I close the writer to get the EOF and then use join to wait for the PipedReader thread to complete.

i.e.
--------8<--------
    out = new PrintWriter(pw, true);
    Thread t = new Thread() {
        /* ... */
    };
    t.start();

    // Parses an XML file using a SAX parser.
    parseXmlFile(
            //"file:/C:/Documents and Settings/Charles/workspace/WorthKeeping/xml/sax/infilename.xml",
            "file:/" + filename,
            handler, false);
    try {
        out.close();    // Close the output stream to get the PipedReader to see EOF
        t.join();       // Wait for the PipedReader to finish before exiting main
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
--------8<--------
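
A self-contained variant of the close-then-join sequence above might look like the following. This is a sketch under assumptions, not the code from the thread: the elided thread body is replaced by a simple loop that collects what it reads, the class and method names (CloseThenJoinSketch, parseAndCollect) are hypothetical, and the JDK SAX parser is used on an in-memory string instead of parseXmlFile.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.PrintWriter;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class CloseThenJoinSketch {

    /** Parses on the calling thread while a second thread drains the pipe. */
    static String parseAndCollect(String markup) throws Exception {
        PipedWriter pw = new PipedWriter();
        final PipedReader pr = new PipedReader(pw);
        final PrintWriter out = new PrintWriter(pw, true);
        final StringBuilder collected = new StringBuilder();

        Thread t = new Thread() {
            @Override
            public void run() {
                try {
                    char[] buf = new char[256];
                    int n;
                    // read() returns -1 only after the writing side has been closed
                    while ((n = pr.read(buf)) != -1) {
                        collected.append(buf, 0, n);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        };
        t.start();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.write(ch, start, length); // PrintWriter does not throw IOException
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), handler);
        } finally {
            out.close(); // closes the PipedWriter too, so the reader thread sees EOF
        }
        t.join();        // wait for the reader thread to finish draining the pipe
        return collected.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseAndCollect("<p>Hello, <b>world</b></p>"));
    }
}
```

The order matters: joining before closing would wait forever, because the reader thread only leaves its loop once the writer has been closed.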
Thanks rstaveley - well done