PDF File Shrinker

Posted on 2016-08-01
Medium Priority
Last Modified: 2016-08-22
Have a client with a lot of PDF files that were scanned at too high a resolution, so they are huge. Has anybody seen a utility you could feed a file list and it would process those files to a lower resolution, perhaps even from color to black and white?
I'm thinking ideally with a preview of the compressed file before it overwrites the source file?
Or a similar tool?
There are some thousands of these PDF files in various directories.
Being able to give it a directory to process, and include sub-directories would be helpful.
Free is good but a nice solution for a couple hundred $$ might work.
Question by:AnthonyMCSE
LVL 15

Assisted Solution

WalkaboutTigger earned 400 total points (awarded by participants)
ID: 41738184
So it sounds as if the PDFs are image-type PDFs without a lot of metadata - not OCR'd, in other words.

Were it me, I would look for a package that can do two things: convert PDFs to JPEGs, and convert JPEGs back to PDFs. The tool should have a scripting interface, so I could either pass parameters on the command line to perform a conversion or have it scan a directory and convert all files within that directory.
I would want the PDF-to-JPG process to allow me to set the DPI and color depth of the resulting JPG file.  This could also be done on the JPG-to-PDF side of the conversion.

This project on SourceForge looks promising for the batch conversion from PDF to JPG.

Assisted Solution

by:Eric C
Eric C earned 400 total points (awarded by participants)
ID: 41738217
If you have the Pro (maybe even Standard) edition of Acrobat, then you get another great program called Distiller.

Launch Distiller, configure the settings with the resolution, quality, etc. Then save it as a preset. Now you can literally drag a folder or a bunch of PDFs onto Distiller. It will "re-distill" the PDFs into smaller PDFs. It can even convert from color to black and white.
LVL 56

Assisted Solution

by:Joe Winograd, EE MVE 2015&2016
Joe Winograd, EE MVE 2015&2016 earned 1200 total points (awarded by participants)
ID: 41738349
Hi Anthony:

First, a couple of comments on previous posts:

> without a lot of metadata - not OCR'd, in other words

To be clear, metadata is different from the content created by OCR. Metadata fields contain data like Title, Author, Subject, etc. They can exist in an image-only PDF, i.e., one that has not been OCRed. When a PDF is OCRed, text is created in the contents of the PDF — unrelated to the metadata of the PDF.

> maybe even Standard

Yes, Acrobat Standard comes with Distiller, but the Watched Folder feature of Distiller comes only with Acrobat Pro.

Now to your comments.

> a nice solution for a couple hundred $$ might work

Eric's suggestion is a good one at that price point. You can probably find a copy of X (10) or XI (11) Std for that (a one-time, perpetual license fee). Even better, if this is a one-off job that you can complete in a month (and from your description it sounds as if it is), then you can subscribe to Acrobat Std DC for a month for $23 — even a year of Std would cost only $156 (just $13/mo when you commit to a year).

> Free is good

If free interests you, take a look at GraphicsMagick. This EE article discusses how to get it and explains the various editions. It also shows how a simple, 4-line batch file can recurse into all sub-directories of a source directory.
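The batch file from that article is not reproduced in this thread, so here is a minimal Python sketch of the same idea: collecting every PDF under a source directory, sub-directories included. The function name `find_pdfs` is made up for illustration, and the sketch only lists files; it does not call GraphicsMagick.

```python
from pathlib import Path


def find_pdfs(root):
    """Return every PDF under root, including sub-directories, sorted by path."""
    root = Path(root)
    # rglob("*") walks the whole tree; the suffix is compared in lower case
    # so that files named SCAN.PDF are not missed.
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling `find_pdfs` on the client's top-level folder would give the file list to feed to whatever conversion command the tests settle on.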

> preview of the compressed file before it overwrites the source file

That's not going to be practical for thousands of PDFs in many directories. I recommend trying various settings on several documents to see what will work well. Even then, I recommend making a complete backup of all the PDFs before letting it rip on all of them.
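One way to make that complete backup from a script is sketched below, using Python's standard `shutil.copytree`. The function name `backup_tree` and the choice to refuse an existing backup folder are assumptions for illustration, not something from this thread.

```python
import shutil
from pathlib import Path


def backup_tree(source_dir, backup_dir):
    """Copy source_dir to backup_dir, refusing to overwrite an existing backup."""
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    if backup_dir.exists():
        raise FileExistsError(f"backup target already exists: {backup_dir}")
    # copytree copies the whole directory tree; its default copy function
    # (shutil.copy2) also tries to preserve file timestamps.
    shutil.copytree(source_dir, backup_dir)
    return backup_dir
```

Refusing to write into an existing folder is deliberate: it avoids quietly mixing an old backup with a new one.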

Attached to this is a one-page, 24-bit color, 600 DPI, image-only PDF (which, btw is the first page of another EE article about GM). It is 4,941,468 bytes. I then ran these GraphicsMagick commands with the convert sub-command (similar to the one in this other EE article, but with a PDF output file and different options):

gm convert -density 150 input.pdf -colorspace RGB output150color.pdf
output file in bytes: 739,723 (85% size reduction)

gm convert -density 200 input.pdf -colorspace RGB output200color.pdf
output file in bytes: 1,314,281 (73% size reduction)

gm convert -density 200 input.pdf -colorspace GRAY output200gray.pdf
output file in bytes: 515,063 (90% size reduction)

gm convert -density 300 input.pdf -colorspace GRAY output300gray.pdf
output file in bytes: 1,056,147 (79% size reduction)

The first two convert to color with 150 and 200 DPI, while the next two convert to grayscale with 200 and 300 DPI. All four output files are attached. My suggestion is to run some tests like these to assess the size/quality trade-off. Then put it in a batch file that recurses into all sub-directories and let it rip — after making a backup. :)
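When comparing test runs, the size reduction can be computed rather than eyeballed. The sketch below uses the byte counts quoted above; the function and dictionary names are made up for illustration.

```python
def size_reduction_percent(original_bytes, compressed_bytes):
    """Return the size reduction as a whole-number percentage."""
    return round(100 * (1 - compressed_bytes / original_bytes))


# Byte counts taken from the test runs quoted in this post.
ORIGINAL_BYTES = 4_941_468
OUTPUT_BYTES = {
    "output150color.pdf": 739_723,
    "output200color.pdf": 1_314_281,
    "output200gray.pdf": 515_063,
    "output300gray.pdf": 1_056_147,
}


def reductions():
    """Map each output file name to its percentage reduction."""
    return {
        name: size_reduction_percent(ORIGINAL_BYTES, size)
        for name, size in OUTPUT_BYTES.items()
    }
```

For these four files, `reductions()` gives 85, 73, 90, and 79 percent, matching the figures listed above.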

Btw, all five files attached are image-only PDFs (no OCR), but all five have metadata that you can see via File>Properties. Regards, Joe

Author Comment

ID: 41739753
Joe, you have a remarkable very detailed answer.

Is there an option to replace the input.pdf with the output.pdf?

How would you handle the case of already "light" PDF files intermixed with the "heavy" PDF files? I don't want to lose any further detail in the "light" PDF files.

By "light" I mean files that are already gray scale and 200 or 300 dpi.  By heavy I mean 600 dpi color.

LVL 56

Accepted Solution

Joe Winograd, EE MVE 2015&2016 earned 1200 total points (awarded by participants)
ID: 41739963
> you have a remarkable very detailed answer

Thank you — I appreciate the compliment.

> Is there an option to replace the input.pdf with the output.pdf?

Yes. The sub-command is called mogrify (discussed in the first EE article mentioned above). Of course, since it replaces the input file with no warning, you need to be careful! Here's what one of the commands would look like:

gm mogrify -density 200 -colorspace GRAY input.pdf
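Because mogrify overwrites without warning, a more cautious alternative is to run convert into a temporary file and swap it in only if it came out smaller. The Python sketch below assumes that approach. It builds the `gm convert` call in the same form as the commands earlier in this thread; the function names and the temporary file name are hypothetical, and comparing file sizes is only a rough stand-in for telling "light" from "heavy" files, since it says nothing about quality.

```python
import os
import subprocess
from pathlib import Path


def build_convert_command(src, dst, density=200, colorspace="GRAY"):
    """Build a gm convert call in the same form as the commands in this thread."""
    return ["gm", "convert", "-density", str(density), str(src),
            "-colorspace", colorspace, str(dst)]


def shrink_in_place(pdf_path, density=200, colorspace="GRAY", run=subprocess.run):
    """Convert to a temporary file; replace the original only if it got smaller.

    Returns True if the original was replaced, False if it was left untouched.
    The run parameter exists so the external call can be swapped out in tests.
    """
    pdf_path = Path(pdf_path)
    # Hypothetical temp name; it keeps the .pdf extension so gm writes a PDF.
    tmp_path = pdf_path.with_name(pdf_path.stem + ".shrunk.tmp.pdf")
    run(build_convert_command(pdf_path, tmp_path, density, colorspace), check=True)
    if tmp_path.stat().st_size < pdf_path.stat().st_size:
        os.replace(tmp_path, pdf_path)  # swap the smaller file into place
        return True
    tmp_path.unlink()  # the result was not smaller, so keep the original
    return False
```

A file that is already grayscale at a modest resolution will often not shrink, in which case this sketch leaves it alone. That is not guaranteed, so the backup advice still applies.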

> By "light" I mean files that are already gray scale and 200 or 300 dpi. By heavy I mean 600 dpi color.

GM has an identify sub-command, which also has a -verbose option that gives a lot more info, but I don't know if it gives enough of the right info to do that. To try it, before doing the mogrify, your script/program would issue the identify and capture the output. The call is like this:

gm identify input.pdf

The output is like this for the five files mentioned in my previous post:

input.pdf PDF 612x792+0+0 DirectClass 8-bit 1.4Mi 0.000u 0:01
output150color.pdf PDF 1275x1650+0+0 DirectClass 8-bit 6.0Mi 0.000u 0:01
output200color.pdf PDF 1700x2200+0+0 DirectClass 8-bit 10.7Mi 0.000u 0:01
output200gray.pdf PDF 1700x2200+0+0 DirectClass 8-bit 10.7Mi 0.000u 0:01
output300gray.pdf PDF 2550x3300+0+0 DirectClass 8-bit 24.1Mi 0.000u 0:01
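To capture that output in a script, each line can be parsed for the file name and pixel dimensions. The Python sketch below does only that; the regular expression is written against the five sample lines above and may need adjusting for other output. The `implied_dpi` helper assumes an 8.5-inch-wide (US Letter) page, which is my assumption, not something identify reports.

```python
import re

# Matches lines such as:
#   input.pdf PDF 612x792+0+0 DirectClass 8-bit 1.4Mi 0.000u 0:01
IDENTIFY_LINE = re.compile(
    r"^(?P<name>.+?) PDF (?P<width>\d+)x(?P<height>\d+)\+\d+\+\d+ "
)


def parse_identify_line(line):
    """Return (file name, width in pixels, height in pixels) from one line."""
    match = IDENTIFY_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized identify output: {line!r}")
    return match["name"], int(match["width"]), int(match["height"])


def implied_dpi(width_px, page_width_inches=8.5):
    """Pixels per inch, ASSUMING the page is page_width_inches wide."""
    return width_px / page_width_inches
```

Under the Letter-size assumption, the four output files work out to 150, 200, 200, and 300 DPI, matching the -density values used to create them. The 600 DPI input works out to only 72, which I believe is just the default rendering density rather than the scan resolution. If so, the plain identify output probably cannot separate "light" files from "heavy" ones on its own.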

There are a gazillion other options in the program that can affect the output, such as -colors, -depth, -geometry, -quality, -resample, -resize, and lots more. You may want to spend a few hours (or days) experimenting, or you may just want to go with something that should work pretty well overall. For that, I'd recommend -density 150 and -colorspace RGB, written with a space rather than an equals sign, as in the commands above. Regards, Joe

Author Comment

ID: 41759745
Apologies for delay, would like to distribute points myself.
LVL 56

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41759774
> would like to distribute points myself

Fine by me. I was simply responding to a "Help resolve this question" email that said, "The following question you participated in has been inactive for 14 days" and "You can still help resolve it by choosing the comment(s) with the most merit and following the prompts to close the question". It's a new process that EE put in place to deal with the enormous number of abandoned questions. It's much better, of course, if askers close questions themselves. Regards, Joe

Author Closing Comment

ID: 41761498
Thanks everyone!  Good solutions, especially Joe!
LVL 56

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41761522
You're welcome, Anthony. And thanks to you for the compliment — I appreciate hearing it!
