
Java and Converting PDF files to XML Documents

dogsareit
I need to programmatically convert PDF files to XML so that I can extract data and insert it into a database.
I have researched this and seen many examples.
My environment: I am developing on localhost with Java 13.0.1 installed, and I have added the Java bin path to the PATH environment variable (and rebooted afterwards). I have both inetpub and WampServer installed (listening on different ports) and have successfully compiled Java classes (beginner examples) on my computer.
I found this code (listed below) at: https://stackoverflow.com/questions/16936013/java-code-for-pdf-to-xml-conversion.

I am not very skilled at Java. I compiled the class at the command line with javac c:\wamp\www\PDFConvert\ConvertPDFToXML.java and received errors (36 of them!). The errors concern the first 3 lines after the public class declaration: static StreamResult streamResult; static TransformerHandler handler; static AttributesImpl atts;
The errors are "cannot find static streamResult steamResult"; "cannot find static streamResult TransformerHandler"; "cannot find static streamResult AttributesImpl", repeated each time one of the above 3 appears in the code.
So I decided to add the following code to the top of the file:
import java.util.stream;
import javax.xml.transform.sax;
import org.xml.sax.helpers;



That just produced the same type of errors for those lines.
I have attached a screenshot of the errors.
Could someone be so kind as to help and educate me on what I am doing wrong? I don't know what it is.
Below is the complete code, including what I inserted (first 3 lines).

import java.util.stream;
import javax.xml.transform.sax;
import org.xml.sax.helpers;
// FROM:  https://stackoverflow.com/questions/16936013/java-code-for-pdf-to-xml-conversion


public class ConvertPDFToXML {
            static StreamResult streamResult;
            static TransformerHandler handler;
            static AttributesImpl atts;

            public static void main(String[] args) throws IOException {

                    try {
                            Document document = new Document();
                            document.open();
                            PdfReader reader = new PdfReader("C:\\PaymodeRCCL.pdf");
                            PdfDictionary page = reader.getPageN(1);
                            PRIndirectReference objectReference = (PRIndirectReference) page
                                            .get(PdfName.CONTENTS);
                            PRStream stream = (PRStream) PdfReader
                                            .getPdfObject(objectReference);
                            byte[] streamBytes = PdfReader.getStreamBytes(stream);
                            PRTokeniser tokenizer = new PRTokeniser(streamBytes);

                            StringBuffer strbufe = new StringBuffer();
                            while (tokenizer.nextToken()) {
                                    if (tokenizer.getTokenType() == PRTokeniser.TK_STRING) {
                                            strbufe.append(tokenizer.getStringValue());
                                    }
                            }
                            String test = strbufe.toString();
                            streamResult = new StreamResult("data.xml");
                            initXML();
                            process(test);
                            closeXML();
                            document.add(new Paragraph(".."));
                            document.close();
                    } catch (Exception e) {
                    }
            }

            public static void initXML() throws ParserConfigurationException,
                            TransformerConfigurationException, SAXException {
                    SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory
                                    .newInstance();

                    handler = tf.newTransformerHandler();
                    Transformer serializer = handler.getTransformer();
                    serializer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
                    serializer.setOutputProperty(
                                    "{http://xml.apache.org/xslt}indent-amount", "4");
                    serializer.setOutputProperty(OutputKeys.INDENT, "yes");
                    handler.setResult(streamResult);
                    handler.startDocument();
                    atts = new AttributesImpl();
                    handler.startElement("", "", "data", atts);
            }

            public static void process(String s) throws SAXException {
                    String[] elements = s.split("\\|");
                    atts.clear();
                    handler.startElement("", "", "Message", atts);
                    handler.characters(elements[0].toCharArray(), 0, elements[0].length());
                    handler.endElement("", "", "Message");
            }

            public static void closeXML() throws SAXException {
                    handler.endElement("", "", "data");
                    handler.endDocument();
            }
    }



Screenshot of errors: error messages when compiling the Java class

It looks like you may be missing some handlers

See if this tutorial helps - https://www.roseindia.net/tutorial/java/xml/pdftoXML.html
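
For what it's worth, the three import lines added in the question name packages rather than classes, and javac does not accept that: each import has to name a class (or end in .*). Also, StreamResult is in javax.xml.transform.stream, not java.util.stream. Below is a minimal sketch of just the XML-writing half of the posted code, using only classes that ship with the JDK (no PDF library needed). The class name XmlWriterSketch and the method toXml are made up for illustration, and it writes to a string instead of data.xml so the result is easy to inspect.

```java
import java.io.StringWriter;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical class name; only the imports and JAXP calls mirror the posted code.
public class XmlWriterSketch {

    // Wraps the given text in <data><Message>...</Message></data> and returns the XML.
    static String toXml(String text)
            throws TransformerConfigurationException, SAXException {
        StringWriter out = new StringWriter();

        SAXTransformerFactory tf =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = tf.newTransformerHandler();

        Transformer serializer = handler.getTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
        serializer.setOutputProperty(OutputKeys.INDENT, "yes");

        // The result must be set before startDocument() is called.
        handler.setResult(new StreamResult(out));

        AttributesImpl atts = new AttributesImpl(); // no attributes in this sketch
        handler.startDocument();
        handler.startElement("", "data", "data", atts);
        handler.startElement("", "Message", "Message", atts);
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement("", "Message", "Message");
        handler.endElement("", "data", "data");
        handler.endDocument();

        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml("hello"));
    }
}
```

If this compiles and runs on its own, the remaining errors in ConvertPDFToXML should come from the PDF classes (PdfReader, PRTokeniser and so on), which are not part of the JDK and need their own imports plus the library jar.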

Author

Commented:
Thank you, that helped a great deal! I knew I had to be missing some import statements! It cut my errors down to 18! I think I also need to install the jar from the following link. I have never installed a jar before, here we go again...

http://www.java2s.com/Code/Jar/i/Downloaditextpdf541sourcesjar.htm


Author

Commented:
I was mistaken about which jar I need to install. When I look at the handlers, I use the "lowagie" jar. I found it and thought I had installed it properly, but I am
receiving errors on it. I am going to close this question and open one on installing jars.
Thank you for responding.
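
On "installing" a jar: as far as I know there is no install step as such. The jar only has to be on the classpath when compiling and when running, for example through the -cp option of javac and java. Here is a small sketch for checking whether a given class is actually visible on the classpath. The class name ClasspathCheck and the method isOnClasspath are hypothetical, and com.lowagie.text.pdf.PdfReader is only the class name I would expect to find in the lowagie jar, so treat it as an assumption.

```java
// Hypothetical helper: reports whether a class can be found on the current classpath.
public class ClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // 'false' means: locate the class but do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default class name is an assumption about what the lowagie jar contains.
        String name = args.length > 0 ? args[0] : "com.lowagie.text.pdf.PdfReader";
        System.out.println(name
                + (isOnClasspath(name) ? " found" : " NOT found on the classpath"));
    }
}
```

Running it once without the jar and once with the jar's location passed to -cp should show whether the jar is being picked up at all, which separates a classpath problem from a wrong-jar problem.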

Author

Commented:
Thank you for responding - you solved a lot of my problems - I just need to figure out why installing the "lowagie" jar doesn't seem right.

I'm glad the link helped

Author

Commented:
I am not sure which I need now ...