Solved

PDF File Shrinker

Posted on 2016-08-01
10
150 Views
Last Modified: 2016-08-22
Have a client with a lot of PDF files that were scanned at too high a resolution, so they are huge. Has anybody seen a utility that you could feed a file list and that would process those files to a lower resolution, perhaps even converting from color to black and white?
I'm thinking ideally with a preview of the compressed file before it overwrites the source file?
Or some similar tool?
There are some thousands of these PDF files in various directories.
Being able to give it a directory to process, and include sub-directories would be helpful.
Free is good but a nice solution for a couple hundred $$ might work.
Question by:AnthonyMCSE
10 Comments
 
LVL 15

Assisted Solution

by:WalkaboutTigger
WalkaboutTigger earned 100 total points (awarded by participants)
ID: 41738184
So it sounds as if the PDFs are image-type PDFs without a lot of metadata - not OCR'd, in other words.

Were it me, I would look for a package which could do two things: convert PDFs to JPEGs and convert JPEGs to PDFs. This tool would have a scripting interface so I could pass parameters on the command line to perform this conversion, or it could scan a directory and perform the conversion on all files within that directory.
I would want the PDF-to-JPG process to allow me to set the DPI and color depth of the resulting JPG file.  This could also be done on the JPG-to-PDF side of the conversion.

This project on SourceForge looks promising for batch conversion from PDF to JPG.
1
 
LVL 5

Assisted Solution

by:Eric C
Eric C earned 100 total points (awarded by participants)
ID: 41738217
If you have the Pro (maybe even Standard) edition of Acrobat, then you get another great program called Distiller.

Launch Distiller, configure the settings with the resolution, quality, etc. Then save it as a preset. Now you can literally drag a folder or a bunch of PDFs onto Distiller. It will "re-distill" the PDFs into smaller PDFs. It can even convert from color to black and white.
1
 
LVL 54

Assisted Solution

by:Joe Winograd, EE MVE 2015&2016
Joe Winograd, EE MVE 2015&2016 earned 300 total points (awarded by participants)
ID: 41738349
Hi Anthony:

First, a couple of comments on previous posts:

> without a lot of metadata - not OCR'd, in other words

To be clear, metadata is different from the content created by OCR. Metadata fields contain data like Title, Author, Subject, etc. They can exist in an image-only PDF, i.e., one that has not been OCRed. When a PDF is OCRed, text is created in the contents of the PDF — unrelated to the metadata of the PDF.

> maybe even Standard

Yes, Acrobat Standard comes with Distiller, but the Watched Folder feature of Distiller comes only with Acrobat Pro.

Now to your comments.

> a nice solution for a couple hundred $$ might work

Eric's suggestion is a good one at that price point. You can probably find a copy of X (10) or XI (11) Std for that (a one-time, perpetual license fee). Even better, if this is a one-off job that you can complete in a month (and from your description it sounds as if it is), then you can subscribe to Acrobat Std DC for a month for $23 — even a year of Std would cost only $156 (just $13/mo when you commit to a year).

> Free is good

If free interests you, take a look at GraphicsMagick. This EE article discusses how to get it and explains the various editions. It also shows how a simple, 4-line batch file can recurse into all sub-directories of a source directory.

> preview of the compressed file before it overwrites the source file

That's not going to be practical for thousands of PDFs in many directories. I recommend trying various settings on several documents to see what will work well. Even then, I recommend making a complete backup of all the PDFs before letting it rip on all of them.
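
The backup step recommended above could be scripted along these lines. This is only a sketch: the function name `backup_pdfs` and both directory arguments are made up for illustration, and it assumes the backup location sits outside the tree being copied.

```python
import shutil
from pathlib import Path


def backup_pdfs(source_root: Path, backup_root: Path) -> int:
    """Copy every PDF under source_root into backup_root, keeping the
    same sub-directory layout. Returns the number of files copied.

    Assumption: backup_root is NOT inside source_root.
    """
    copied = 0
    for pdf in sorted(source_root.rglob("*")):
        # Match the extension case-insensitively (.pdf, .PDF, ...).
        if not pdf.is_file() or pdf.suffix.lower() != ".pdf":
            continue
        target = backup_root / pdf.relative_to(source_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(pdf, target)  # copy2 also preserves timestamps
        copied += 1
    return copied
```

A call such as `backup_pdfs(Path("scans"), Path("scans_backup"))` (both paths hypothetical) would mirror only the PDF files and leave everything else alone.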

Attached to this is a one-page, 24-bit color, 600 DPI, image-only PDF (which, btw, is the first page of another EE article about GM). It is 4,941,468 bytes. I then ran these GraphicsMagick commands with the convert sub-command (similar to the one in this other EE article, but with a PDF output file and different options):

gm convert -density 150 input.pdf -colorspace RGB output150color.pdf
output file in bytes: 739,723 (85% size reduction)

gm convert -density 200 input.pdf -colorspace RGB output200color.pdf
output file in bytes: 1,314,281 (73% size reduction)

gm convert -density 200 input.pdf -colorspace GRAY output200gray.pdf
output file in bytes: 515,063 (90% size reduction)

gm convert -density 300 input.pdf -colorspace GRAY output300gray.pdf
output file in bytes: 1,056,147 (79% size reduction)

The first two convert to color with 150 and 200 DPI, while the next two convert to grayscale with 200 and 300 DPI. All four output files are attached. My suggestion is to run some tests like these to assess the size/quality trade-off. Then put it in a batch file that recurses into all sub-directories and let it rip — after making a backup. :)
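
For anyone who wants to repeat this kind of test on their own sample file, here is a rough Python sketch that runs the same four settings and prints the size reduction for each. It assumes `gm` is on the PATH; the function names and the output naming scheme are illustrative only and do not come from the EE articles.

```python
import subprocess
from pathlib import Path

# The four settings tried in this post: (density in DPI, colorspace).
PRESETS = [(150, "RGB"), (200, "RGB"), (200, "GRAY"), (300, "GRAY")]


def build_convert_command(src, dst, density, colorspace):
    """Build the same 'gm convert' argument list as the commands above."""
    return ["gm", "convert", "-density", str(density), str(src),
            "-colorspace", colorspace, str(dst)]


def reduction_percent(original_bytes, new_bytes):
    """Size reduction as a percentage of the original size."""
    return 100.0 * (original_bytes - new_bytes) / original_bytes


def try_presets(sample: Path, out_dir: Path) -> None:
    """Run every preset on one sample PDF and print the size reduction."""
    out_dir.mkdir(parents=True, exist_ok=True)
    original = sample.stat().st_size
    for density, colorspace in PRESETS:
        dst = out_dir / f"{sample.stem}_{density}{colorspace.lower()}.pdf"
        subprocess.run(
            build_convert_command(sample, dst, density, colorspace),
            check=True,  # stop if gm reports an error
        )
        new = dst.stat().st_size
        pct = reduction_percent(original, new)
        print(f"{dst.name}: {new:,} bytes ({pct:.0f}% size reduction)")
```

The percentages quoted above follow from the same arithmetic, e.g. (4,941,468 - 739,723) / 4,941,468 is roughly 85%. The outputs go to a separate directory, so the sample file itself is never touched.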

Btw, all five files attached are image-only PDFs (no OCR), but all five have metadata that you can see via File>Properties. Regards, Joe
input.pdf
output150color.pdf
output200color.pdf
output200gray.pdf
output300gray.pdf
2

 

Author Comment

by:AnthonyMCSE
ID: 41739753
Joe, you have a remarkable very detailed answer.

Is there an option to replace the input.pdf with the output.pdf?

How would you handle the case of already "light" PDF files intermixed with the "heavy" PDF files? Don't want to lose any further detail on the "light" PDF files.

By "light" I mean files that are already gray scale and 200 or 300 dpi.  By heavy I mean 600 dpi color.

Thanks.
0
 
LVL 54

Accepted Solution

by:Joe Winograd, EE MVE 2015&2016
Joe Winograd, EE MVE 2015&2016 earned 300 total points (awarded by participants)
ID: 41739963
> you have a remarkable very detailed answer

Thank you — I appreciate the compliment.

> Is there an option to replace the input.pdf with the output.pdf?

Yes. The sub-command is called mogrify (discussed in the first EE article mentioned above). Of course, since it replaces the input file with no warning, you need to be careful! Here's what one of the commands would look like:

gm mogrify -density 200 -colorspace GRAY input.pdf
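
Because mogrify overwrites with no warning, a more cautious alternative is to convert to a temporary file and swap it in only if it came out smaller. The sketch below is not what mogrify does internally; it uses `gm convert` with the same options, assumes `gm` is on the PATH, and the names `gm_convert`, `shrink_in_place` and `shrink_tree` are hypothetical.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def gm_convert(src: Path, dst: Path, density: int = 200,
               colorspace: str = "GRAY") -> None:
    """Run 'gm convert' with the same options as the mogrify command above."""
    subprocess.run(["gm", "convert", "-density", str(density), str(src),
                    "-colorspace", colorspace, str(dst)], check=True)


def shrink_in_place(pdf: Path, convert=gm_convert) -> bool:
    """Replace pdf with its converted copy only if the copy is smaller.
    Returns True if the file was replaced."""
    # Create the temporary file next to the original so that os.replace
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(suffix=".pdf", dir=pdf.parent)
    os.close(fd)
    tmp = Path(tmp_name)
    try:
        convert(pdf, tmp)
        if 0 < tmp.stat().st_size < pdf.stat().st_size:
            os.replace(tmp, pdf)
            return True
        return False
    finally:
        tmp.unlink(missing_ok=True)  # requires Python 3.8+


def shrink_tree(root: Path, convert=gm_convert) -> int:
    """Apply shrink_in_place to every PDF under root, sub-directories
    included. Returns how many files were replaced."""
    replaced = 0
    # sorted() takes a snapshot first, so temporary files created during
    # the run are never picked up as inputs.
    for pdf in sorted(root.rglob("*")):
        if pdf.is_file() and pdf.suffix.lower() == ".pdf":
            replaced += shrink_in_place(pdf, convert)
    return replaced
```

Leaving a file alone when the result is not smaller only guards against growing it. A file that is already low resolution can still get smaller and lose detail, so the backup advice still applies.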

> By "light" I mean files that are already gray scale and 200 or 300 dpi. By heavy I mean 600 dpi color.

GM has an identify sub-command, which also has a -verbose option that gives a lot more info, but I don't know if it gives enough of the right info to do that. To try it, before doing the mogrify, your script/program would issue the identify and capture the output. The call is like this:

gm identify input.pdf

The output is like this for the five files mentioned in my previous post:

input.pdf PDF 612x792+0+0 DirectClass 8-bit 1.4Mi 0.000u 0:01
output150color.pdf PDF 1275x1650+0+0 DirectClass 8-bit 6.0Mi 0.000u 0:01
output200color.pdf PDF 1700x2200+0+0 DirectClass 8-bit 10.7Mi 0.000u 0:01
output200gray.pdf PDF 1700x2200+0+0 DirectClass 8-bit 10.7Mi 0.000u 0:01
output300gray.pdf PDF 2550x3300+0+0 DirectClass 8-bit 24.1Mi 0.000u 0:01
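
Capturing and parsing that output from a script might look like the sketch below. The regular expression is written only against the sample lines shown here, so treat the line format as an assumption; the names `IdentifyInfo`, `parse_identify_line` and `identify` are made up for illustration, and `gm` is assumed to be on the PATH.

```python
import re
import subprocess
from typing import List, NamedTuple, Optional


class IdentifyInfo(NamedTuple):
    name: str
    width: int
    height: int
    image_class: str
    depth_bits: int


# Matches lines such as:
#   input.pdf PDF 612x792+0+0 DirectClass 8-bit 1.4Mi 0.000u 0:01
_LINE = re.compile(
    r"^(?P<name>.+?) PDF (?P<w>\d+)x(?P<h>\d+)\+\d+\+\d+ "
    r"(?P<cls>\w+) (?P<depth>\d+)-bit\b"
)


def parse_identify_line(line: str) -> Optional[IdentifyInfo]:
    """Parse one line of 'gm identify' output; None if it does not match."""
    m = _LINE.match(line.strip())
    if m is None:
        return None
    return IdentifyInfo(m["name"], int(m["w"]), int(m["h"]),
                        m["cls"], int(m["depth"]))


def identify(pdf_path: str) -> List[IdentifyInfo]:
    """Run 'gm identify' on a file and return the lines that parsed."""
    out = subprocess.run(["gm", "identify", pdf_path],
                         capture_output=True, text=True, check=True).stdout
    parsed = (parse_identify_line(line) for line in out.splitlines())
    return [info for info in parsed if info is not None]
```

Note what the sample output does and does not show: output200color.pdf and output200gray.pdf report identical values, and input.pdf reports 612x792 even though it is a 600 DPI scan. So these fields alone would not separate "light" files from "heavy" ones, which fits the caution above about identify.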

There are a gazillion other options in the program that can affect the output, such as -colors, -depth, -geometry, -quality, -resample, -resize, and lots more. You may want to spend a few hours (or days) experimenting, or you may just want to go with something that should work pretty well overall: I'd recommend -density 150 and -colorspace RGB. Regards, Joe
1
 

Author Comment

by:AnthonyMCSE
ID: 41759745
Apologies for delay, would like to distribute points myself.
0
 
LVL 54

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41759774
> would like to distribute points myself

Fine by me. I was simply responding to a "Help resolve this question" email that said, "The following question you participated in has been inactive for 14 days" and "You can still help resolve it by choosing the comment(s) with the most merit and following the prompts to close the question". It's a new process that EE put in place to deal with the enormous number of abandoned questions. It's much better, of course, if askers close questions themselves. Regards, Joe
0
 

Author Closing Comment

by:AnthonyMCSE
ID: 41761498
Thanks everyone!  Good solutions, especially Joe!
0
 
LVL 54

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41761522
You're welcome, Anthony. And thanks to you for the compliment — I appreciate hearing it!
0
