Marthaj asked:

Java and Converting PDF files to XML Documents

I need to convert PDF files to XML programmatically so that I can extract the data and insert it into a database.
I have researched this and seen many examples.
My environment: I am developing on localhost with Java 13.0.1 installed, and I have added the Java bin directory to the PATH environment variable (and rebooted afterwards). I have both inetpub and WampServer installed (listening on different ports) and have successfully compiled Java classes (beginner examples) on my computer.
I found the code listed below at: https://stackoverflow.com/questions/16936013/java-code-for-pdf-to-xml-conversion.

I am not very skilled at Java. I compiled the class at the command line with javac c:\wamp\www\PDFConvert\ConvertPDFToXML.java and received errors (36 of them!). The errors concern the first 3 lines after the public class declaration: static StreamResult streamResult; static TransformerHandler handler; static AttributesImpl atts;
The errors are "cannot find static streamResult steamResult"; "cannot find static streamResult TransformerHandler"; "cannot find static streamResult AttributesImpl" for each place the above 3 appear in the code.
So I decided to add the following lines to the top of the code:
import java.util.stream;
import javax.xml.transform.sax;
import org.xml.sax.helpers;



That just produced the same type of errors for those lines.
I have attached a screenshot of the errors.
Could someone be so kind as to help and educate me on what I am doing wrong?
Below is the complete code, including what I inserted (the first 3 lines).

import java.util.stream;
import javax.xml.transform.sax;
import org.xml.sax.helpers;
// FROM:  https://stackoverflow.com/questions/16936013/java-code-for-pdf-to-xml-conversion


public class ConvertPDFToXML {
            static StreamResult streamResult;
            static TransformerHandler handler;
            static AttributesImpl atts;

            public static void main(String[] args) throws IOException {

                    try {
                            Document document = new Document();
                            document.open();
                            PdfReader reader = new PdfReader("C:\\PaymodeRCCL.pdf");
                            PdfDictionary page = reader.getPageN(1);
                            PRIndirectReference objectReference = (PRIndirectReference) page
                                            .get(PdfName.CONTENTS);
                            PRStream stream = (PRStream) PdfReader
                                            .getPdfObject(objectReference);
                            byte[] streamBytes = PdfReader.getStreamBytes(stream);
                            PRTokeniser tokenizer = new PRTokeniser(streamBytes);

                            StringBuffer strbufe = new StringBuffer();
                            while (tokenizer.nextToken()) {
                                    if (tokenizer.getTokenType() == PRTokeniser.TK_STRING) {
                                            strbufe.append(tokenizer.getStringValue());
                                    }
                            }
                            String test = strbufe.toString();
                            streamResult = new StreamResult("data.xml");
                            initXML();
                            process(test);
                            closeXML();
                            document.add(new Paragraph(".."));
                            document.close();
                    } catch (Exception e) {
                            // Report the failure instead of silently swallowing it
                            e.printStackTrace();
                    }
            }

            public static void initXML() throws ParserConfigurationException,
                            TransformerConfigurationException, SAXException {
                    SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory
                                    .newInstance();

                    handler = tf.newTransformerHandler();
                    Transformer serializer = handler.getTransformer();
                    serializer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
                    serializer.setOutputProperty(
                                    "{http://xml.apache.org/xslt}indent-amount", "4");
                    serializer.setOutputProperty(OutputKeys.INDENT, "yes");
                    handler.setResult(streamResult);
                    handler.startDocument();
                    atts = new AttributesImpl();
                    handler.startElement("", "", "data", atts);
            }

            public static void process(String s) throws SAXException {
                    String[] elements = s.split("\\|");
                    atts.clear();
                    handler.startElement("", "", "Message", atts);
                    handler.characters(elements[0].toCharArray(), 0, elements[0].length());
                    handler.endElement("", "", "Message");
            }

            public static void closeXML() throws SAXException {
                    handler.endElement("", "", "data");
                    handler.endDocument();
            }
}



[Screenshot of errors]

ASKER CERTIFIED SOLUTION
kenfcamp

[Solution text not included: it is available to Experts Exchange members only.]

Marthaj (ASKER)

Thank you, that helped a great deal! I knew I had to be missing some import statements! It cut my errors down to 18. I think I also need to install the jar from the following link. I have never installed a jar before, so here we go again...

http://www.java2s.com/Code/Jar/i/Downloaditextpdf541sourcesjar.htm
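
On the import statements mentioned above: a Java import has to name a class (or end in .*), so a bare package name such as javax.xml.transform.sax does not compile, and StreamResult lives in javax.xml.transform.stream rather than java.util.stream. The following is a minimal sketch, using only classes that ship with the JDK, of the imports that the three static fields (StreamResult, TransformerHandler, AttributesImpl) rely on, together with the same startElement/characters/endElement calls as the code in the question. The class name ImportSketch and the helper toXml are made up for illustration, and the sketch writes to a string rather than to data.xml.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

public class ImportSketch {

    // Hypothetical helper: wraps one text value in <data><Message>...</Message></data>.
    static String toXml(String message)
            throws TransformerConfigurationException, SAXException {
        StringWriter out = new StringWriter();

        SAXTransformerFactory tf =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = tf.newTransformerHandler();

        // Leave out the <?xml ...?> header so the result is just the fragment.
        Transformer serializer = handler.getTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        handler.setResult(new StreamResult(out));
        AttributesImpl atts = new AttributesImpl();

        handler.startDocument();
        handler.startElement("", "", "data", atts);
        handler.startElement("", "", "Message", atts);
        handler.characters(message.toCharArray(), 0, message.length());
        handler.endElement("", "", "Message");
        handler.endElement("", "", "data");
        handler.endDocument();

        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml("hello"));
    }
}
```

This covers only the XML half of the program. The PDF classes (Document, PdfReader, PRTokeniser and so on) are not part of the JDK and need their own imports plus a jar on the classpath.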


Marthaj (ASKER)

I was mistaken about which jar I need to install. When I look at the handlers, I use the "lowagie" jar. I found it and thought I had installed it properly, but I am receiving errors on it. I am going to close this question and open one on installing jars.
Thank you for responding.
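
A jar is not installed in the usual sense. It only has to be on the classpath when javac and java run. One way to see whether that worked, and which iText flavour is actually visible, is to try loading a class by name. The sketch below assumes the two likely candidates are com.lowagie.text.pdf.PdfReader (the older "lowagie" iText) and com.itextpdf.text.pdf.PdfReader (iText 5); the class name ClasspathCheck and the helper isOnClasspath are hypothetical.

```java
public class ClasspathCheck {

    // Returns true when the named class can be found on the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            // 'false' = locate the class without running its static initialisers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed candidates: the same PdfReader class in the two iText package layouts.
        String[] candidates = {
            "com.lowagie.text.pdf.PdfReader",
            "com.itextpdf.text.pdf.PdfReader"
        };
        for (String name : candidates) {
            System.out.println(name + " -> "
                    + (isOnClasspath(name) ? "found" : "NOT found"));
        }
    }
}
```

Compile it with javac ClasspathCheck.java, then run it with the jar added through the -cp option, for example java -cp ".;C:\path\to\your.jar" ClasspathCheck on Windows (the jar path here is a placeholder). Whichever name reports "found" indicates the package prefix the import statements should use, and the same -cp value is needed when compiling ConvertPDFToXML.java.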
Marthaj (ASKER)

Thank you for responding. You solved a lot of my problems; I just need to figure out why installing the "lowagie" jar doesn't seem right.
I'm glad the link helped
Marthaj (ASKER)

I am not sure which one I need now...