Avatar of allelopath
allelopath

asked on 

Apache POI

I have a memory issue using Apache POI. When creating a workbook (org.apache.poi.ss.usermodel.Workbook) like so:
Workbook workbook = WorkbookFactory.create( inputStream );

Open in new window


It takes much memory. For example, executing that line on an excel file that is 73mb causes memory usage to jump by about 1gb (according to the Windows Task Manager - PF Usage)

Two question
1. Why so much memory?
2. Is there a way to read an create/read row-by-row or column-by-column?

Java

Avatar of undefined
Last Comment
Mick Barry
SOLUTION
Avatar of for_yan
for_yan
Flag of United States of America image

Blurred text
THIS SOLUTION IS ONLY AVAILABLE TO MEMBERS.
View this solution by signing up for a free trial.
Members can start a 7-Day free trial and enjoy unlimited access to the platform.
See Pricing Options
Start Free Trial
Avatar of for_yan
for_yan
Flag of United States of America image

How many rows are there in your table?
Do you know if you are usiing HSSF or XSSF and how many rows you have in this table?
It looks like if there are less than 65535 - you can use less memory with HSSF
Avatar of allelopath
allelopath

ASKER

I cannot depend on the file having fewer than 65535 rows.
Avatar of for_yan
for_yan
Flag of United States of America image


Perhpas read this from:
http://apache-poi.1045710.n5.nabble.com/I-want-to-help-reduce-memory-footprint-td3354296.html


On Sun, 23 Jan 2011, Joseph D. Wagner wrote:
> I am particularly interested in reducing the memory footprint around
> spreadsheets.  Unfortunately, I must confess that I'm having trouble
> pealing back all the layers and sifting through all the dependencies.

Which kind - HSSF (.xls) or XSSF (.xlsx)?

> Is there a developer who is familiar with POI's memory consumption who could
> point me in the right direction?

You'll probably need to run a profiler against the code for your type of
spreadsheet, and see what bubbles up.

If you're using XSSF, then you probably want to use the File constructor
rather than the InputStream one. For HSSF, since Christmas there's been an
experimental POIFS replacement which is more memory efficient, and has a
File constructor too.

> One specific question I have is where CTCell was defined.  Is it in a
> library that's outside of POI?

It's auto-generated by XMLBeans based on the Microsoft XSDs. If you do a
svn checkout, then you can use ant to generate the sources for you through
XMLBeans. Any optimisations there are likely to need to happen in the
XMLBeans sourcecode though, to get it to generate a more efficient version

> Also, I know POI relies on XmlBeans.  I was wondering if you were
> familiar with any issues there, such as "if you cut z in xmlbeans, it
> will save us a lot of memory."

I'm not aware of anything, sorry. XmlBeans is another apache project, so
you can ask there, but they usually seem to say "It depends on your XML"
so you'd probably need to do some profiling first...
Avatar of for_yan
for_yan
Flag of United States of America image


http://stackoverflow.com/questions/4763624/jexcelapi-vs-apache-poi-which-is-better

Please, pay attention at the comment in the answer  there

Apache-POI also offers an event based API which can be used to drastically reduce the memory footprint when reading but using it is more complicated than just reading the file directly
Avatar of for_yan
for_yan
Flag of United States of America image

So they are all mentioning this POIFS system which should allow to read Excel with less memory consumptuiion

http://poi.apache.org/poifs/how-to.html

though I didn't find concrete examples reading Excel files yet
ASKER CERTIFIED SOLUTION
Avatar of for_yan
for_yan
Flag of United States of America image

Blurred text
THIS SOLUTION IS ONLY AVAILABLE TO MEMBERS.
View this solution by signing up for a free trial.
Members can start a 7-Day free trial and enjoy unlimited access to the platform.
Avatar of allelopath
allelopath

ASKER

Looking at tika.I have code as below. Drawing an equivalent to Sax XML parsing, I'll looking for methods the are called by events and where to override those methods. E.g. in XML parsing I would have:

@Override
public void startElement(String uri, String localName, String rawName, Attributes attributes)	throws SAXException {
  .........................
}

Open in new window


but I am not finding the equivlanet for tika.


final AutoDetectParser myLibraryParser = new AutoDetectParser();
ContentHandler contentHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
try {
	myLibraryParser.parse(inputStream, contentHandler, metadata, parseContext);
}
catch (IOException e) {
	e.printStackTrace();
}
catch (SAXException e) {
	e.printStackTrace();
}
catch (TikaException e) {
	e.printStackTrace();
}

Open in new window

Avatar of for_yan
for_yan
Flag of United States of America image

Avatar of for_yan
for_yan
Flag of United States of America image

If the last one does not use as much memory as your initial POI attempts - maybe it would be a good solution
Avatar of allelopath
allelopath

ASKER

Those last 2 examples both use HSSF, which I am trying to leave behind because it apparently is the thing taking up so much memory.
Avatar of allelopath
allelopath

ASKER

I think I need to use this:
org.apache.tika.parser.microsoft.ExcelExtractor;
SOLUTION
Avatar of Mick Barry
Mick Barry
Flag of Australia image

Blurred text
THIS SOLUTION IS ONLY AVAILABLE TO MEMBERS.
View this solution by signing up for a free trial.
Members can start a 7-Day free trial and enjoy unlimited access to the platform.
Java
Java

Java is a platform-independent, object-oriented programming language and run-time environment, designed to have as few implementation dependencies as possible such that developers can write one set of code across all platforms using libraries. Most devices will not run Java natively, and require a run-time component to be installed in order to execute a Java program.

102K
Questions
--
Followers
--
Top Experts
Get a personalized solution from industry experts
Ask the experts
Read over 600 more reviews

TRUSTED BY

IBM logoIntel logoMicrosoft logoUbisoft logoSAP logo
Qualcomm logoCitrix Systems logoWorkday logoErnst & Young logo
High performer badgeUsers love us badge
LinkedIn logoFacebook logoX logoInstagram logoTikTok logoYouTube logo