rstaveley
asked on
BufferedReader with a restricted buffer size
I'm occasionally getting exceptionally long text files with no line terminators. This is causing problems in my application in the following code, which is designed to strip UUEncoded blocks from plain text messages:
--------8<--------
File temp; // Temporary file for collecting stripped text
boolean skip_uue = false;
boolean doneFirst = false;
try {
    // Create temp file.
    temp = File.createTempFile("PlainTextHandler",".txt");
    // Delete temp file when program exits.
    temp.deleteOnExit();
    // Write to temp file
    BufferedWriter writer = new BufferedWriter(new FileWriter(temp));
    //System.out.println(new TimeStamp().toString()+getClass().getName()+": Opening BufferedReader");
    // Use a BufferedReader for the input stream
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(is)
    );
    String line = null;
    int line_number = 0;
    int uue_line_number = 0;
    while ((line = reader.readLine()) != null) {
        ++line_number;
        if (skip_uue) {
            ++uue_line_number;
            if (line.length() > 2 && "end".equals(line.substring(0,3))) {
                // Show how many UUEncoded lines we've skipped
                System.out.println(new TimeStamp().toString()+getClass().getName()+": Skipped "+uue_line_number+" lines of UUEncoded text");
                skip_uue = false;
            }
            continue;
        }
        else if (line.length() > 5 && "begin".equals(line.substring(0,5))) {
            // Look for a UUEncoded block
            if (line.matches("^begin\\s\\d{3}\\s.+$")) {
                skip_uue = true;
                uue_line_number = 1;
                continue;
            }
        }
        // Subsequent lines need white space
        if (doneFirst)
            writer.newLine(); // Give Lucene some white space to separate the tokens
        else
            doneFirst = true; // We have at least one line
        writer.write(line); // Write the non-UUE data to the temporary file
    }
    reader.close();
    writer.close();
    // Show how many UUEncoded lines we've skipped
    if (skip_uue)
        System.out.println(new TimeStamp().toString()+getClass().getName()+": Skipped "+uue_line_number+" lines of UUEncoded text");
}
catch (IOException e) {
    //System.out.println(new TimeStamp().toString()+getClass().getName()+": IOException "+e.toString());
    throw new StandardDocumentHandlerException("Cannot read the text document",e);
}
catch (Exception e) {
    //System.out.println(new TimeStamp().toString()+getClass().getName()+": Exception "+e.toString());
    throw new StandardDocumentHandlerException("Exception caught in PlainTextHandler",e);
}
// ... the plain text in the temp file is then passed to Lucene, before being deleted
--------8<--------
The trouble with the code above is that readLine() may load String line with an unacceptably large string. That makes this thread a bad citizen in my multithreaded application: it uses up too much of the heap and causes another thread to barf with an out-of-memory error when it temporarily needs heap space.
My question is this:
Can I use the BufferedReader constructor that specifies a buffer size to limit the maximum length of string read from the reader - i.e. http://java.sun.com/j2se/1.5.0/docs/api/java/io/BufferedReader.html#BufferedReader%28java.io.Reader,%20int%29 ? If so, is the stream still readable after reading a partial line? [I can live with having the long line broken up in such a way that tokens are broken up, because it is a special case.]
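For reference, a quick way to check the buffer-size question. As far as the documented behaviour goes, the size passed to the two-argument BufferedReader constructor only sets the internal character buffer; readLine() still keeps accumulating characters until it sees a line terminator or the end of the stream. The sketch below reads a 10,000-character line with no terminator through a 16-character buffer and prints the length of the string that comes back. The class and method names (BufferSizeDemo, firstLineLength) are made up for illustration and are not part of the code above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class BufferSizeDemo {
    // Hypothetical helper: reads the first line of `text` through a
    // BufferedReader created with the given internal buffer size and
    // returns the length of that line (-1 if there is no line at all).
    static int firstLineLength(String text, int bufferSize) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader(text), bufferSize);
        try {
            String line = reader.readLine();
            return line == null ? -1 : line.length();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Build one long "line" with no terminator, like the problem files.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append('x');
        }
        // The buffer holds only 16 chars, but readLine() returns the whole line.
        System.out.println(firstLineLength(sb.toString(), 16));
    }
}
```

If the printed length matches the full line length, the buffer size is not acting as a cap on what readLine() returns, so a small buffer alone would not protect the heap.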
ASKER
I guess I need to implement my own line reader, then. Thanks for the quick response, CEHJ.
:-)
if (line.length() > MAX_LINE_LENGTH) {
line = line.substring(0, MAX_LINE_LENGTH);
}
Otherwise you'd have to do your own line reading or override BufferedReader.readLine().
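A rough sketch of the "do your own line reading" route, in case it helps. Truncating with substring() only trims the string after readLine() has already pulled the whole line into memory, so the idea here is to stop accumulating once a limit is reached and hand back the rest on later calls. BoundedLineReader, maxLineLength and readAll are hypothetical names invented for this sketch; it is one possible approach under those assumptions, not the accepted solution from this thread.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical line reader that never returns more than maxLineLength chars.
// Wrap the source in a BufferedReader first so the per-char read() is cheap.
public class BoundedLineReader {
    private final Reader in;
    private final int maxLineLength;
    private boolean skipLineFeed = false;   // last terminator seen was '\r'
    private boolean afterFullChunk = false; // last return was cut at the limit

    public BoundedLineReader(Reader in, int maxLineLength) {
        if (maxLineLength < 1) {
            throw new IllegalArgumentException("maxLineLength must be >= 1");
        }
        this.in = in;
        this.maxLineLength = maxLineLength;
    }

    // Returns the next line, or the next chunk of an over-long line,
    // without its terminator; returns null at end of stream.
    public String readLine() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (skipLineFeed) {
                skipLineFeed = false;
                if (c == '\n') continue;          // second half of "\r\n"
            }
            if (afterFullChunk) {
                afterFullChunk = false;
                // A terminator right after a full chunk ends that same line,
                // so swallow it instead of reporting a spurious empty line.
                if (c == '\n') continue;
                if (c == '\r') { skipLineFeed = true; continue; }
            }
            if (c == '\n') return sb.toString();
            if (c == '\r') { skipLineFeed = true; return sb.toString(); }
            sb.append((char) c);
            if (sb.length() == maxLineLength) {
                afterFullChunk = true;            // limit hit: return a partial line
                return sb.toString();
            }
        }
        return sb.length() > 0 ? sb.toString() : null;
    }

    // Convenience helper for trying it out on an in-memory string.
    static List<String> readAll(String text, int maxLineLength) throws IOException {
        BoundedLineReader reader = new BoundedLineReader(new StringReader(text), maxLineLength);
        List<String> lines = new ArrayList<String>();
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        for (String line : readAll("abcdefgh\nxy", 3)) {
            System.out.println(line);
        }
    }
}
```

In the loop from the question, this would stand in for reader.readLine(), with the stream still readable after each partial line. A long line gets split at arbitrary points, so tokens can be broken across chunks, which the asker said is acceptable for this special case. The "begin"/"end" checks would also see chunks rather than whole lines if a UUEncoded line ever exceeded the limit.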