Solved

PDF File Shrinker

Posted on 2016-08-01
Last Modified: 2016-08-22
I have a client with a lot of PDF files that were scanned at too high a resolution, so they are huge. Has anybody seen a utility you could feed a file list to, which would process those files to a lower resolution, perhaps even convert them from color to black and white?
I'm thinking ideally with a preview of the compressed file before it overwrites the source file?
Or some similar tool?
There are some thousands of these PDF files in various directories.
Being able to give it a directory to process, and include sub-directories would be helpful.
Free is good but a nice solution for a couple hundred $$ might work.
Question by:AnthonyMCSE
10 Comments
 
LVL 15

Assisted Solution

by:WalkaboutTigger
WalkaboutTigger earned 100 total points (awarded by participants)
ID: 41738184
So it sounds as if the PDFs are image-type PDFs without a lot of metadata - not OCR'd, in other words.

Were it me, I would look for a package that could do two things: convert PDFs to JPEGs, and convert JPEGs back to PDFs. The tool should have a scripting interface, so I could either pass parameters on the command line to convert a single file or have it scan a directory and convert every file in it.
I would want the PDF-to-JPG process to allow me to set the DPI and color depth of the resulting JPG file.  This could also be done on the JPG-to-PDF side of the conversion.

This project on SourceForge looks promising for the batch conversion from PDF to JPG.
 
LVL 5

Assisted Solution

by:Eric C
Eric C earned 100 total points (awarded by participants)
ID: 41738217
If you have the Pro (maybe even Standard) edition of Acrobat, then you get another great program called Distiller.

Launch Distiller, configure the settings with the resolution, quality, etc. Then save it as a preset. Now you can literally drag a folder or a bunch of PDFs onto Distiller. It will "re-distill" the PDFs into smaller PDFs. It can even convert from color to black and white.
 
LVL 53

Assisted Solution

by:Joe Winograd, EE MVE
Joe Winograd, EE MVE earned 300 total points (awarded by participants)
ID: 41738349
Hi Anthony:

First, a couple of comments on previous posts:

> without a lot of metadata - not OCR'd, in other words

To be clear, metadata is different from the content created by OCR. Metadata fields contain data like Title, Author, Subject, etc. They can exist in an image-only PDF, i.e., one that has not been OCRed. When a PDF is OCRed, text is created in the contents of the PDF — unrelated to the metadata of the PDF.

> maybe even Standard

Yes, Acrobat Standard comes with Distiller, but the Watched Folder feature of Distiller comes only with Acrobat Pro.

Now to your comments.

> a nice solution for a couple hundred $$ might work

Eric's suggestion is a good one at that price point. You can probably find a copy of X (10) or XI (11) Std for that (a one-time, perpetual license fee). Even better, if this is a one-off job that you can complete in a month (and from your description it sounds as if it is), then you can subscribe to Acrobat Std DC for a month for $23 — even a year of Std would cost only $156 (just $13/mo when you commit to a year).

> Free is good

If free interests you, take a look at GraphicsMagick. This EE article discusses how to get it and explains the various editions. It also shows how a simple, 4-line batch file can recurse into all sub-directories of a source directory.
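The batch file from that article is not reproduced here, but the recursion idea can be sketched in Python as well. This is only a sketch under stated assumptions: the folder names `scans` and `scans_small` and the helper names `find_pdfs`, `plan_conversions`, and `run_conversions` are hypothetical, `gm` is assumed to be on the PATH, and the `gm convert` options are the ones used further down in this post. It writes the smaller files into a mirrored output tree rather than touching the originals.

```python
import subprocess
from pathlib import Path


def find_pdfs(root):
    """Return all PDF files under root, including sub-directories."""
    return sorted(p for p in Path(root).rglob("*")
                  if p.is_file() and p.suffix.lower() == ".pdf")


def plan_conversions(source_root, output_root, density=150, colorspace="RGB"):
    """Build one gm convert command per PDF, mirroring the folder layout."""
    source_root, output_root = Path(source_root), Path(output_root)
    commands = []
    for src in find_pdfs(source_root):
        # Same relative path, but under the (hypothetical) output folder.
        dst = output_root / src.relative_to(source_root)
        commands.append(["gm", "convert", "-density", str(density), str(src),
                         "-colorspace", colorspace, str(dst)])
    return commands


def run_conversions(commands):
    """Run each planned command, creating output folders as needed."""
    for cmd in commands:
        Path(cmd[-1]).parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(cmd, check=True)  # stops at the first gm failure
```

Calling `plan_conversions("scans", "scans_small")` and printing the result works as a dry run; passing the same list to `run_conversions` performs the conversions. Keeping the planning step separate makes it easy to review what will happen before anything is written.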

> preview of the compressed file before it overwrites the source file

That's not going to be practical for thousands of PDFs in many directories. I recommend trying various settings on several documents to see what will work well. Even then, I recommend making a complete backup of all the PDFs before letting it rip on all of them.

Attached to this is a one-page, 24-bit color, 600 DPI, image-only PDF (which, btw is the first page of another EE article about GM). It is 4,941,468 bytes. I then ran these GraphicsMagick commands with the convert sub-command (similar to the one in this other EE article, but with a PDF output file and different options):

gm convert -density 150 input.pdf -colorspace RGB output150color.pdf
output file in bytes: 739,723 (85% size reduction)

gm convert -density 200 input.pdf -colorspace RGB output200color.pdf
output file in bytes: 1,314,281 (73% size reduction)

gm convert -density 200 input.pdf -colorspace GRAY output200gray.pdf
output file in bytes: 515,063 (90% size reduction)

gm convert -density 300 input.pdf -colorspace GRAY output300gray.pdf
output file in bytes: 1,056,147 (79% size reduction)

The first two convert to color with 150 and 200 DPI, while the next two convert to grayscale with 200 and 300 DPI. All four output files are attached. My suggestion is to run some tests like these to assess the size/quality trade-off. Then put it in a batch file that recurses into all sub-directories and let it rip — after making a backup. :)
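For anyone repeating these tests on their own files, the percentages above come from simple arithmetic on the byte counts. A minimal Python sketch follows; the byte counts are the ones reported in this post, and the function name `percent_reduction` is just an illustrative choice.

```python
INPUT_BYTES = 4_941_468  # size of input.pdf as reported above

# Output sizes in bytes, as reported for the four test conversions.
OUTPUT_BYTES = {
    "output150color.pdf": 739_723,
    "output200color.pdf": 1_314_281,
    "output200gray.pdf": 515_063,
    "output300gray.pdf": 1_056_147,
}


def percent_reduction(original, reduced):
    """Return how much smaller `reduced` is than `original`, in percent."""
    return 100.0 * (original - reduced) / original


for name, size in OUTPUT_BYTES.items():
    print(f"{name}: {percent_reduction(INPUT_BYTES, size):.1f}% smaller")
```

Swapping in the sizes from your own test conversions gives a quick table to set beside a visual check of each output file.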

Btw, all five files attached are image-only PDFs (no OCR), but all five have metadata that you can see via File>Properties. Regards, Joe
input.pdf
output150color.pdf
output200color.pdf
output200gray.pdf
output300gray.pdf
 

Author Comment

by:AnthonyMCSE
ID: 41739753
Joe, you have a remarkable very detailed answer.

Is there an option to replace the input.pdf with the output.pdf?

How would you handle the case of already "light" PDF files intermixed with the "heavy" PDF files? I don't want to lose any further detail on the "light" PDF files.

By "light" I mean files that are already gray scale and 200 or 300 dpi.  By heavy I mean 600 dpi color.

Thanks.
 
LVL 53

Accepted Solution

by:Joe Winograd, EE MVE
Joe Winograd, EE MVE earned 300 total points (awarded by participants)
ID: 41739963
> you have a remarkable very detailed answer

Thank you — I appreciate the compliment.

> Is there an option to replace the input.pdf with the output.pdf?

Yes. The sub-command is called mogrify (discussed in the first EE article mentioned above). Of course, since it replaces the input file with no warning, you need to be careful! Here's what one of the commands would look like:

gm mogrify -density 200 -colorspace GRAY input.pdf
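Because mogrify overwrites without warning, a more cautious alternative is to convert to a temporary file and swap it in only if it actually came out smaller. The Python sketch below is one way to do that, not a GraphicsMagick feature: the names `gm_convert` and `replace_if_smaller` and the `.shrinking.pdf` temporary suffix are hypothetical, and `gm` is assumed to be on the PATH. It uses `gm convert` with the same options as the mogrify example.

```python
import os
import subprocess
from pathlib import Path


def gm_convert(src, dst, density=200, colorspace="GRAY"):
    """Run gm convert with the same options as the mogrify example."""
    subprocess.run(["gm", "convert", "-density", str(density), str(src),
                    "-colorspace", colorspace, str(dst)], check=True)


def replace_if_smaller(pdf_path, convert=gm_convert):
    """Convert to a temporary file and swap it in only when it is smaller.

    Returns True if the original was replaced, False if it was left alone.
    """
    pdf_path = Path(pdf_path)
    # Temporary name keeps the .pdf extension so gm writes a PDF.
    tmp_path = pdf_path.with_name(pdf_path.stem + ".shrinking.pdf")
    try:
        convert(pdf_path, tmp_path)
        if tmp_path.stat().st_size < pdf_path.stat().st_size:
            os.replace(tmp_path, pdf_path)  # overwrite the original
            return True
        return False
    finally:
        if tmp_path.exists():
            tmp_path.unlink()  # discard a rejected or failed result
```

This only guards against files that would grow or stay the same size. A file that is already 300 DPI grayscale would still shrink, and lose detail, at 200 DPI, so it does not by itself solve the "light" versus "heavy" question.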

> By "light" I mean files that are already gray scale and 200 or 300 dpi. By heavy I mean 600 dpi color.

GM has an identify sub-command, which also has a -verbose option that gives a lot more info, but I don't know if it gives enough of the right info to do that. To try it, before doing the mogrify, your script/program would issue the identify and capture the output. The call is like this:

gm identify input.pdf

The output is like this for the five files mentioned in my previous post:

input.pdf PDF 612x792+0+0 DirectClass 8-bit 1.4Mi 0.000u 0:01
output150color.pdf PDF 1275x1650+0+0 DirectClass 8-bit 6.0Mi 0.000u 0:01
output200color.pdf PDF 1700x2200+0+0 DirectClass 8-bit 10.7Mi 0.000u 0:01
output200gray.pdf PDF 1700x2200+0+0 DirectClass 8-bit 10.7Mi 0.000u 0:01
output300gray.pdf PDF 2550x3300+0+0 DirectClass 8-bit 24.1Mi 0.000u 0:01
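If a script is going to capture that output, the lines need to be parsed. Here is a minimal Python sketch that assumes the line layout matches the samples above (name, format, then WIDTHxHEIGHT+X+Y); the names `parse_identify_line` and `pixels_per_inch` are hypothetical, and the pixels-per-inch estimate assumes a Letter-width (8.5 inch) page, which may not hold for the client's scans.

```python
import re

# Matches the start of lines such as:
#   output150color.pdf PDF 1275x1650+0+0 DirectClass 8-bit 6.0Mi 0.000u 0:01
IDENTIFY_LINE = re.compile(
    r"^(?P<name>.+?) (?P<format>[A-Z0-9]+) "
    r"(?P<width>\d+)x(?P<height>\d+)\+\d+\+\d+ "
)


def parse_identify_line(line):
    """Return name, format, width and height from one line, or None."""
    match = IDENTIFY_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "name": match["name"],
        "format": match["format"],
        "width": int(match["width"]),
        "height": int(match["height"]),
    }


def pixels_per_inch(width_px, page_width_in=8.5):
    """Estimate pixels per inch, ASSUMING the page is page_width_in wide."""
    return width_px / page_width_in


SAMPLE = [
    "input.pdf PDF 612x792+0+0 DirectClass 8-bit 1.4Mi 0.000u 0:01",
    "output300gray.pdf PDF 2550x3300+0+0 DirectClass 8-bit 24.1Mi 0.000u 0:01",
]
for line in SAMPLE:
    info = parse_identify_line(line)
    print(info["name"], info["width"], pixels_per_inch(info["width"]))
```

Under the Letter-width assumption, the 612-pixel width reported for input.pdf works out to 72 per inch, not the 600 DPI it was scanned at. So the reported geometry alone does not appear to reveal the scan resolution of an original file, and it is worth testing on a few known files before relying on it to separate "light" from "heavy".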

There are a gazillion other options in the program that can affect the output, such as -colors, -depth, -geometry, -quality, -resample, -resize, and lots more. You may want to spend a few hours (or days) experimenting, or you may just want to go with something that should work pretty well overall. For that, I'd recommend -density 150 and -colorspace RGB. Regards, Joe
 

Author Comment

by:AnthonyMCSE
ID: 41759745
Apologies for delay, would like to distribute points myself.
 
LVL 53

Expert Comment

by:Joe Winograd, EE MVE
ID: 41759774
> would like to distribute points myself

Fine by me. I was simply responding to a "Help resolve this question" email that said, "The following question you participated in has been inactive for 14 days" and "You can still help resolve it by choosing the comment(s) with the most merit and following the prompts to close the question". It's a new process that EE put in place to deal with the enormous number of abandoned questions. It's much better, of course, if askers close questions themselves. Regards, Joe
 

Author Closing Comment

by:AnthonyMCSE
ID: 41761498
Thanks everyone!  Good solutions, especially Joe!
 
LVL 53

Expert Comment

by:Joe Winograd, EE MVE
ID: 41761522
You're welcome, Anthony. And thanks to you for the compliment — I appreciate hearing it!
