how to get a blank page from pdf files?

We have requirement to find if there is any blank/empty pages in a PDF files. Actually there are 4 million PDF files which needs to be validated for above condition and also there will be 10k-12k pages in a PDF. Hence need a script to automate this work.

Thanks in Advance!

Note:-
OS:- Windows
abdul hameedAsked:
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

x
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Dan CraciunIT ConsultantCommented:
Using iText, here is code taken from http://www.rgagnon.com/javadetails/java-detect-and-remove-blank-page-in-pdf.html.

The code below deletes the blank pages, you will probably just need to add the name of the file and the page number in some form of log.

import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.io.RandomAccessSourceFactory;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.RandomAccessFileOrArray;

public class RemoveBlankPageFromPDF {

    // value where we can consider that this is a blank image
    // can be much higher or lower depending of what is considered as a blank page
    public static final int BLANK_THRESHOLD = 160;

    public static void removeBlankPdfPages(String source, String destination)
        throws IOException, DocumentException
    {
        PdfReader r = null;
        RandomAccessSourceFactory rasf = null;
        RandomAccessFileOrArray raf = null;
        Document document = null;
        PdfCopy writer = null;

        try {
            r = new PdfReader(source);
            // deprecated
            //    RandomAccessFileOrArray raf
            //           = new RandomAccessFileOrArray(pdfSourceFile);
            // itext 5.4.1
            rasf = new RandomAccessSourceFactory();
            raf = new RandomAccessFileOrArray(rasf.createBestSource(source));
            document = new Document(r.getPageSizeWithRotation(1));
            writer = new PdfCopy(document, new FileOutputStream(destination));
            document.open();
            PdfImportedPage page = null;

            for (int i=1; i<=r.getNumberOfPages(); i++) {
                // first check, examine the resource dictionary for /Font or
                // /XObject keys.  If either are present -> not blank.
                PdfDictionary pageDict = r.getPageN(i);
                PdfDictionary resDict = (PdfDictionary) pageDict.get( PdfName.RESOURCES );
                boolean noFontsOrImages = true;
                if (resDict != null) {
                  noFontsOrImages = resDict.get( PdfName.FONT ) == null &&
                                    resDict.get( PdfName.XOBJECT ) == null;
                }
                System.out.println(i + " noFontsOrImages " + noFontsOrImages);

                if (!noFontsOrImages) {
                    byte bContent [] = r.getPageContent(i,raf);
                    ByteArrayOutputStream bs = new ByteArrayOutputStream();
                    bs.write(bContent);
                    System.out.println
                      (i + bs.size() + " > BLANK_THRESHOLD " +  (bs.size() > BLANK_THRESHOLD));
                    if (bs.size() > BLANK_THRESHOLD) {
                        page = writer.getImportedPage(r, i);
                        writer.addPage(page);
                    }
                }
            }
        }
        finally {
            if (document != null) document.close();
            if (writer != null) writer.close();
            if (raf != null) raf.close();
            if (r != null) r.close();
        }
    }

    public static void main (String ... args) throws Exception {
        removeBlankPdfPages
            ("C://temp//documentwithblank.pdf", "C://temp//documentwithnoblank.pdf");
    }
}

Open in new window


HTH,
Dan

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
abdul hameedAuthor Commented:
is there any other way to do the same?
Joe Winograd, Fellow&MVEDeveloperCommented:
What is your definition of a "blank/empty" page? For example, attached are five, one-page PDFs, as follows:

(1) created from a Word file with nothing in it

(2) created from a Word file with some tabs and spaces in it

(3) created from a Word file with a footer that has a page number in it, but nothing else

(4) created by a scanner that scanned at 300 DPI in black&white — it is visually "blank/empty"

(5) created by a scanner that scanned at 200 DPI in color — it is visually "blank/empty"

Which of these do you consider "blank/empty"? Regards, Joe
word-with-nothing.pdf
word-with-tabs-and-spaces.pdf
word-with-page-num-in-footer.pdf
scanned-image-300dpi-bw.pdf
scanned-image-200dpi-color.pdf
Joe Winograd, Fellow&MVEDeveloperCommented:
Dan's code provides a solution based on size being the criterion for a blank/empty page, while Joe's post discusses important issues regarding the very definition of a blank/empty page. Most of the credit to Dan, some to Joe.
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Adobe Acrobat

From novice to tech pro — start learning today.