How to find blank pages in PDF files?

abdul hameed
We have a requirement to find any blank/empty pages in PDF files. There are 4 million PDF files to validate for this condition, and a single PDF can have 10k-12k pages, so we need a script to automate the work.

Thanks in Advance!

Note: the OS is Windows.
Commented:
Here is code that uses iText, taken from http://www.rgagnon.com/javadetails/java-detect-and-remove-blank-page-in-pdf.html.

The code below deletes the blank pages. For your case, you will probably just need to write the name of the file and the page number to some form of log instead.

import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.io.RandomAccessSourceFactory;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.RandomAccessFileOrArray;

public class RemoveBlankPageFromPDF {

    // content-stream size (in bytes) at or below which a page is treated as blank;
    // raise or lower it depending on what you consider a blank page
    public static final int BLANK_THRESHOLD = 160;

    public static void removeBlankPdfPages(String source, String destination)
        throws IOException, DocumentException
    {
        PdfReader r = null;
        RandomAccessSourceFactory rasf = null;
        RandomAccessFileOrArray raf = null;
        Document document = null;
        PdfCopy writer = null;

        try {
            r = new PdfReader(source);
            // deprecated
            //    RandomAccessFileOrArray raf
            //           = new RandomAccessFileOrArray(pdfSourceFile);
            // itext 5.4.1
            rasf = new RandomAccessSourceFactory();
            raf = new RandomAccessFileOrArray(rasf.createBestSource(source));
            document = new Document(r.getPageSizeWithRotation(1));
            writer = new PdfCopy(document, new FileOutputStream(destination));
            document.open();
            PdfImportedPage page = null;

            for (int i=1; i<=r.getNumberOfPages(); i++) {
                // first check, examine the resource dictionary for /Font or
                // /XObject keys.  If either are present -> not blank.
                PdfDictionary pageDict = r.getPageN(i);
                // getAsDict also resolves an indirect reference to the resources
                PdfDictionary resDict = pageDict.getAsDict( PdfName.RESOURCES );
                boolean noFontsOrImages = true;
                if (resDict != null) {
                  noFontsOrImages = resDict.get( PdfName.FONT ) == null &&
                                    resDict.get( PdfName.XOBJECT ) == null;
                }
                System.out.println(i + " noFontsOrImages " + noFontsOrImages);

                if (!noFontsOrImages) {
                    byte bContent [] = r.getPageContent(i,raf);
                    ByteArrayOutputStream bs = new ByteArrayOutputStream();
                    bs.write(bContent);
                    // the " " keeps i and the size from being added together as ints
                    System.out.println
                      (i + " " + bs.size() + " > BLANK_THRESHOLD " +  (bs.size() > BLANK_THRESHOLD));
                    if (bs.size() > BLANK_THRESHOLD) {
                        page = writer.getImportedPage(r, i);
                        writer.addPage(page);
                    }
                }
            }
        }
        finally {
            if (document != null) document.close();
            if (writer != null) writer.close();
            if (raf != null) raf.close();
            if (r != null) r.close();
        }
    }

    public static void main (String ... args) throws Exception {
        removeBlankPdfPages
            ("C://temp//documentwithblank.pdf", "C://temp//documentwithnoblank.pdf");
    }
}
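
As a rough sketch of the logging variant suggested above (not part of the original code), the batch side could look like the following. BlankPageLogger, BlankPageFinder, and the "file,page" log format are hypothetical names chosen for illustration. The finder is where the iText checks from the code above would go, and isBlank simply restates its rule: a page is treated as blank if it has no fonts or images, or if its content stream is no longer than the threshold. Only the Java standard library is used.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

final class BlankPageLogger {

    /** Hypothetical hook: returns the 1-based numbers of the blank pages in one PDF. */
    interface BlankPageFinder {
        List<Integer> findBlankPages(Path pdf) throws IOException;
    }

    /** Same rule as the code above: no fonts/images, or a very short content stream. */
    static boolean isBlank(boolean hasFontsOrImages, int contentLength, int threshold) {
        return !hasFontsOrImages || contentLength <= threshold;
    }

    /**
     * Walks root, runs the finder on every *.pdf file, and writes one
     * "file,page" line per blank page to log. Returns the number of pages logged.
     */
    static int logBlankPages(Path root, Path log, BlankPageFinder finder) throws IOException {
        int count = 0;
        try (Stream<Path> walk = Files.walk(root);
             PrintWriter out = new PrintWriter(
                     Files.newBufferedWriter(log, StandardCharsets.UTF_8))) {
            // Iterate lazily so millions of paths are never held in memory at once.
            Iterator<Path> pdfs = walk
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .iterator();
            while (pdfs.hasNext()) {
                Path pdf = pdfs.next();
                try {
                    for (int page : finder.findBlankPages(pdf)) {
                        out.println(pdf + "," + page);
                        count++;
                    }
                } catch (IOException | RuntimeException e) {
                    // One unreadable file should not stop a very long run.
                    out.println(pdf + ",ERROR " + e.getMessage());
                }
            }
        }
        return count;
    }
}
```

With 4 million files it would probably make sense to run several such walks in parallel over separate subfolders; that is left out of the sketch.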


HTH,
Dan

Author

Commented:
Is there any other way to do the same?
Joe Winograd, Developer
Fellow 2017
Most Valuable Expert 2018
Commented:
What is your definition of a "blank/empty" page? For example, attached are five one-page PDFs, as follows:

(1) created from a Word file with nothing in it

(2) created from a Word file with some tabs and spaces in it

(3) created from a Word file with a footer that has a page number in it, but nothing else

(4) created by a scanner that scanned at 300 DPI in black and white; it is visually "blank/empty"

(5) created by a scanner that scanned at 200 DPI in color; it is visually "blank/empty"

Which of these do you consider "blank/empty"? Regards, Joe
word-with-nothing.pdf
word-with-tabs-and-spaces.pdf
word-with-page-num-in-footer.pdf
scanned-image-300dpi-bw.pdf
scanned-image-200dpi-color.pdf
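
For scanned pages such as (4) and (5), the page content is an image, so checks on the resource dictionary or content size cannot say whether the page looks blank. One possible approach, sketched here under assumptions, is to render each page to an image with whatever PDF library is in use (that step is not shown) and measure how many pixels are dark. BlankImageCheck, darkLevel, and maxDarkFraction are hypothetical names, and the threshold values would need tuning against real scans, since scanner noise and specks vary.

```java
import java.awt.image.BufferedImage;

final class BlankImageCheck {

    /** Fraction of pixels (0.0 to 1.0) whose brightness is below darkLevel (0-255). */
    static double darkFraction(BufferedImage img, int darkLevel) {
        int w = img.getWidth();
        int h = img.getHeight();
        long dark = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of perceived brightness (luma).
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                if (luma < darkLevel) {
                    dark++;
                }
            }
        }
        return (double) dark / ((long) w * h);
    }

    /** A page image "looks blank" if at most maxDarkFraction of its pixels are dark. */
    static boolean looksBlank(BufferedImage img, int darkLevel, double maxDarkFraction) {
        return darkFraction(img, darkLevel) <= maxDarkFraction;
    }
}
```

A call such as BlankImageCheck.looksBlank(pageImage, 128, 0.005) would then tolerate up to 0.5% dark pixels; both numbers are illustrative starting points, not recommended values.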
Joe Winograd, Developer

Commented:
Dan's code provides a solution based on size being the criterion for a blank/empty page, while Joe's post discusses important issues regarding the very definition of a blank/empty page. Most of the credit to Dan, some to Joe.
