Adobe PDF File - Delete pages from a search word

Hi there,

Every week, I have an Adobe PDF file containing a lot of pages, from which I must delete the pages containing the same string.

So imagine this.  Any page containing the whole phrase "Delete Me" would have to be deleted from the PDF file.

My ideal solution would be a simple tool where I can set parameters to delete all pages containing "Delete me" (yes, case sensitive, but if not, it's no big deal!).

Thanks for your help,
Rene
ReneGe asked:

Randy Downs (Owner) commented:
Maybe this will help.

You can utilize the built-in Action that highlights search results and then loop over these results, deleting or extracting the pages they're located on. This can be done from the JS console or from a new toolbar icon.

Another option is to use the Advanced Find window to generate a CSV file with the results and then use that file for the same operation. I've developed a script that allows you to print or extract the matching pages from such a command, and it will not be a problem to also have them deleted, if you wish. See: Acrobat -- Print or Extract Pages from CSV Search Results
Joe Winograd (Fellow & MVE, Developer) commented:
Hi ReneGe,

This question interests me and I'm considering writing a program to do it. But first some questions on your exact requirements:

(1) Do you want to do this for just a single PDF file or an entire folder (and/or subfolders) of PDFs?

(2) It's dangerous to overwrite the input PDF (in case something goes haywire), so I recommend keeping the source file intact and creating a new file with the deleted pages removed. Do you want to rename the new file manually or have the program automatically assign a new name? If the latter, one idea is to add something like _WithoutDeletedPages to the end of the source file name in order to create the new file name. Another idea is to rename the source file by adding something like _WithDeletedPages to its name so that the new file may retain the name of the source file.

(3) Is the phrase that triggers page deletion always going to be Delete Me or would you like that to be a user-specified parameter?

That's it for starters. I may have some other questions when I begin to write and test the code. Regards, Joe
ReneGe (Author) commented:
Hi Joe,

I find it very cool that you're making a project of this question!!! :)

My current need is per file, which is fine because I can still loop it in a batch file if needed.

_WithDeletedPages  is cool!  Could also be renamed as follows: "%1_%yy%-%mm%%dd%_%hh%-%mim%-%ss%.pdf"

Actually, since I do batch files, it could be called as follows: cleanPDF.exe "FileName.pdf" "Content to look for to delete a page"

I think the search results should only be effective when the whole string is found.  For example "Content to look for to delete a page".

Cheers mate,
Rene

Joe Winograd (Fellow & MVE, Developer) commented:
Hi Rene,
Yes, the YYYYMMDD_hh.mm.ss trick is one of my favorites for naming files — and with any program that requires user input/execution, it is guaranteed to create a unique file name, since a user can't run it in less than a second.

Although you're clearly a whiz with batch files, many users aren't, so I'm thinking of a more self-contained solution that wouldn't require a knowledge of batch files (in the interest of hitting a broader audience). In that spirit, another option that comes to mind is Case Sensitive or Case Insensitive search — I can see a need for both. This is a fascinating question — thanks for asking! You got the brain juices flowing. :)
Cheers, Joe
ReneGe (Author) commented:
That's all good :)
Joe Winograd (Fellow & MVE, Developer) commented:
Hi Rene,
While working on the project tonight, something struck me. How is the delete-this-page phrase getting on the pages of the PDF? You seem to have some control over it, because you suggested that it could be "Delete Me" or "Content to look for to delete a page". So if you can control what the phrase is, you must be involved in getting it on the page, but I'm not understanding how.

Btw, just thought of a good use for this — deleting those annoying "This page intentionally left blank" pages. :)  Regards, Joe
ReneGe (Author) commented:
Hi Joe,

I must say this. You are a very cool and unique individual!!

I have found that PDF-generated reports often contain annoying, effectively empty pages that all carry the same sentence.

My current need:
I have a weekly 120-page report, about 90 pages of which contain the sentence "No record here". I must delete them before printing.

I currently must go through every one of them to delete these pages before printing on very tiny (.04 pound) 8.5x11 sheets of sliced, bleached, white, dried, pressed tree fiber compounds.

Cheers,
Rene
Joe Winograd (Fellow & MVE, Developer) commented:
Thank you for the kind words, Rene. I appreciate hearing them.

Please post a sample PDF page (extracted from the actual PDF file) with the "No record here" phrase. I want to make sure that my code handles it right (PDFs can be strange beasts). Would be good if you also post a page without that phrase (but be careful that it doesn't contain any private/sensitive info).

It's getting into the wee hours in my neck of the woods, so this will be my last post tonight. I'll check back into the thread first thing in the morning. Thanks, Joe

Update: I see that you viewed my Xpdf - Part 1 video Micro Tutorial. The solution I'm working on will use Part 3 in that series, which is about PDFtoText (Xpdf's utility to convert a PDF file to plain text). That's how my program will determine if the delete-this-page phrase is on a page.
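
For illustration only, here is a minimal Python sketch of that idea. It is not the actual program, whose source is not shown in this thread. It assumes the text came from pdftotext, which separates pages with a form feed character; the function name pages_with_phrase and its parameters are hypothetical.

```python
def pages_with_phrase(text, phrase, case_sensitive=False):
    """Return the 1-based numbers of the pages whose text contains phrase.

    Assumes text is pdftotext output, where a form feed ends each page.
    """
    pages = text.split("\f")
    # A form feed after the last page leaves one empty trailing chunk
    # that is not a real page, so drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    if not case_sensitive:
        phrase = phrase.lower()
    hits = []
    for number, page in enumerate(pages, start=1):
        haystack = page if case_sensitive else page.lower()
        if phrase in haystack:
            hits.append(number)
    return hits


# Two made-up pages: the first has the phrase, the second does not.
sample = "Employee A\nNo Punches\n\fEmployee B\n08:00 17:00\n\f"
print(pages_with_phrase(sample, "No Punches", case_sensitive=True))
```

A plain substring test like this only matches the whole phrase when it sits on one line of the extracted text; a phrase that wraps across lines would need whitespace normalization first.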
aikimark commented:
@Rene

Maybe we should be looking at the process that creates such a report, rather than trying to fix the PDF.
Joe Winograd (Fellow & MVE, Developer) commented:
Hi Rene,
I completed the program — a "quick and dirty" version, but with your batch coding skills, it should work well for you. I tested it on several PDFs here and it works perfectly, but before posting the code, I'd like to test it on your PDFs, so please post a sample when you get a chance.

I'm leaving my office now for several hours. Will check back into the thread as soon as I return. Regards, Joe
ReneGe (Author) commented:
@aikimark
The PDF file is created by an application whose source code I do not have access to.

@Joe
Wow, that's cool, great and wonderful!
How can I try your solution?
Please give me the command line arguments and switches (if any).

Thanks and cheers :)
Rene
Joe Winograd (Fellow & MVE, Developer) commented:
> Wow, that's cool, great and wonderful!

I'm glad you think so. :)

> How can I try your solution?

I'm writing an article about it for publication here at EE. I'll send you the link when I submit the article, probably during the weekend. In the meantime, please post one of your PDFs here (or at least some sample pages — some with the Delete Phrase, some without it). As I alluded to earlier, the PDF spec is very complex — not all PDFs are created equal. It's working perfectly on my PDFs, but I'd like to test it on your PDFs to be sure there are no issues.

> Please give me the command line arguments and switches (if any).

I took your suggestion with one addition — a parameter for a case sensitive ("cs") or case insensitive ("ci") search. So the command line is:

cleanPDF.exe FileName DeletePhrase SearchType

Example:

cleanPDF.exe c:\temp\test.pdf DeleteMe cs

Of course, if the file name or search phrase has spaces in it, enclose it in quotes, such as:

cleanPDF.exe "c:\folder has spaces\file has spaces.pdf" "delete me has spaces" ci

Regards, Joe
ReneGe (Author) commented:
Hi Joe,

Here is a "modified" PDF file for you.

The expected command line should be something like this:
cleanPDF.exe "C:\Temp\EE_JoeWinograd.pdf" "No Punches" cs

Cheers
Attachment: EE-JoeWinograd.pdf
Joe Winograd (Fellow & MVE, Developer) commented:
Hi Rene,

cleanPDF works perfectly on your PDF.

I changed one feature. The third parameter is now a string of one-character switches, instead of "ci" or "cs". I did this to have an easy way of specifying additional options in the future. I already added one — a troubleshooting (debugging) run. So here are the current switches:

s — case sensitive search (default is case insensitive)
t — troubleshooting run (default is normal run)

If any character that is not defined as a switch is specified in the third parameter, it will be ignored. The switches may be entered in any order.
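
As a rough sketch of how such a switch string could be interpreted (in Python rather than the AutoHotkey the program is written in, and with a hypothetical function name), treating the third parameter as a set of characters gives the "any order, unknown characters ignored" behavior for free. Lower-casing the input reflects the remark later in the thread that switches are case insensitive.

```python
def parse_switches(switches):
    """Interpret a string of one-character switches, in any order.

    Characters that are not defined as switches are simply ignored.
    """
    chars = set(switches.lower())
    return {
        "case_sensitive": "s" in chars,   # default: case insensitive search
        "troubleshooting": "t" in chars,  # default: normal run
    }


print(parse_switches("s"))
print(parse_switches("ts"))
print(parse_switches("xyz"))
```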

So the command line you'll run is now like this:

cleanPDF.exe "C:\Temp\EE_JoeWinograd.pdf" "No Punches" s

To facilitate its use in batch files, cleanPDF's operation is silent (except for fatal errors, such as input file or search phrase not specified). Because of this, it creates a logfile (in the same folder as the PDF file) so you can check its operation after it runs. The logfile name is:

cleanPDF_LogFile_YYYY-MM-DD_hh.mm.ss.txt

For example, here's the logfile from the run shown above:

cleanPDF started at 2015-08-15_11.56.41
Version 1.0 [Build 20150815.1015]
Input file: C:\temp\EE-JoeWinograd.pdf
Output file: C:\temp\EE-JoeWinograd_2015-08-15_11.56.41.pdf
Search phrase: No Punches
Search type: Case Sensitive
Troubleshooting run: No
cleanPDF finished at 2015-08-15_11.56.42
Number of pages kept: 1
Number of pages deleted: 1
Page numbers kept: 2
Page numbers deleted: 1
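
The output and logfile names in this log follow a simple pattern: the start time, formatted as YYYY-MM-DD_hh.mm.ss, is appended to the input file name or embedded in the logfile name. A minimal Python sketch of that naming scheme follows; the helper names are hypothetical and this is not the program's own code.

```python
from datetime import datetime
from pathlib import PureWindowsPath

STAMP_FORMAT = "%Y-%m-%d_%H.%M.%S"  # e.g. 2015-08-15_11.56.41


def output_pdf_path(input_pdf, started):
    """Input name plus a timestamp, in the same folder as the input."""
    src = PureWindowsPath(input_pdf)
    stamp = started.strftime(STAMP_FORMAT)
    return str(src.with_name(f"{src.stem}_{stamp}{src.suffix}"))


def logfile_path(input_pdf, started):
    """Version 1.0 logfile name, in the same folder as the input."""
    src = PureWindowsPath(input_pdf)
    stamp = started.strftime(STAMP_FORMAT)
    return str(src.with_name(f"cleanPDF_LogFile_{stamp}.txt"))


started = datetime(2015, 8, 15, 11, 56, 41)
print(output_pdf_path(r"C:\temp\EE-JoeWinograd.pdf", started))
print(logfile_path(r"C:\temp\EE-JoeWinograd.pdf", started))
```

Using one start time for both names keeps the output PDF and its logfile easy to pair up afterwards.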

If it's a troubleshooting run, there will be additional entries in the logfile, each preceded with an asterisk so that it's easy to find the debugging messages.

To make sure it works on a larger file, I duplicated your 2-page file five times to create a 10-page file. Here's the logfile from that run with troubleshooting turned on:

cleanPDF started at 2015-08-15_12.00.50
Version 1.0 [Build 20150815.1015]
Input file: C:\temp\EE-JoeWinograd - 10 pages.pdf
Output file: C:\temp\EE-JoeWinograd - 10 pages_2015-08-15_12.00.50.pdf
Search phrase: No Punches
Search type: Case Sensitive
Troubleshooting run: Yes
*Param1=C:\temp\ee-joewinograd - 10 pages.pdf
*Param2=No Punches
*Param3=st
*Did not find search phrase on page 2
*Did not find search phrase on page 4
*Did not find search phrase on page 6
*Did not find search phrase on page 8
*Did not find search phrase on page 10
cleanPDF finished at 2015-08-15_12.00.53
Number of pages kept: 5
Number of pages deleted: 5
Page numbers kept: 2 4 6 8 10
Page numbers deleted: 1 3 5 7 9

If all pages have the search phrase, meaning all pages would be deleted, it does not create a new PDF. For example, a troubleshooting run that looks for "a" (case insensitive) would generate a logfile like this:

cleanPDF started at 2015-08-15_12.37.50
Version 1.0 [Build 20150815.1015]
Input file: C:\temp\EE-JoeWinograd - 10 pages.pdf
Output file: C:\temp\EE-JoeWinograd - 10 pages_2015-08-15_12.37.50.pdf
Search phrase: a
Search type: Case Insensitive
Troubleshooting run: Yes
*Param1=C:\temp\ee-joewinograd - 10 pages.pdf
*Param2=a
*Param3=t
cleanPDF finished at 2015-08-15_12.37.52
Number of pages kept: All pages have search phrase - did not create new PDF
Number of pages deleted: All pages have search phrase - did not create new PDF
Page numbers kept: N/A
Page numbers deleted: N/A

Likewise, if no pages have the search phrase, meaning no pages would be deleted, it does not create a new PDF. For example, a troubleshooting run that looks for "abc" (case sensitive) would have a logfile like this:

cleanPDF started at 2015-08-15_12.44.11
Version 1.0 [Build 20150815.1015]
Input file: C:\temp\EE-JoeWinograd - 10 pages.pdf
Output file: C:\temp\EE-JoeWinograd - 10 pages_2015-08-15_12.44.11.pdf
Search phrase: abc
Search type: Case Sensitive
Troubleshooting run: Yes
*Param1=C:\temp\ee-joewinograd - 10 pages.pdf
*Param2=abc
*Param3=ts
*Did not find search phrase on page 1
*Did not find search phrase on page 2
*Did not find search phrase on page 3
*Did not find search phrase on page 4
*Did not find search phrase on page 5
*Did not find search phrase on page 6
*Did not find search phrase on page 7
*Did not find search phrase on page 8
*Did not find search phrase on page 9
*Did not find search phrase on page 10
cleanPDF finished at 2015-08-15_12.44.13
Number of pages kept: No pages have search phrase - did not create new PDF
Number of pages deleted: No pages have search phrase - did not create new PDF
Page numbers kept: N/A
Page numbers deleted: N/A
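
The three outcomes shown in these logs (some pages deleted, all pages matched, no pages matched) can be summarized in a small decision function. This is a hedged Python sketch with hypothetical names, not the program's actual logic.

```python
def plan_run(total_pages, pages_with_phrase):
    """Decide what a run should do, given the pages that contain the phrase."""
    delete = sorted(set(pages_with_phrase))
    delete_set = set(delete)
    keep = [page for page in range(1, total_pages + 1) if page not in delete_set]
    if not keep:
        # Every page matched: an output PDF would be empty, so skip it.
        return {"create_pdf": False, "note": "All pages have search phrase"}
    if not delete:
        # Nothing matched: the output would be a copy of the input, so skip it.
        return {"create_pdf": False, "note": "No pages have search phrase"}
    return {"create_pdf": True, "keep": keep, "delete": delete}


print(plan_run(10, [1, 3, 5, 7, 9]))
print(plan_run(10, range(1, 11)))
print(plan_run(10, []))
```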

I mentioned above that the program goes non-silent in the case of a fatal error. Here's one example:

[Image: error dialog reading "input file does not exist"]

I'm working on the article, but I'll send you a message via the Messaging System with a link to download the program now. Before publishing the article I'd like you to let me know if it works well on your weekly 120-page report. If it doesn't, I hope you'll work with me to troubleshoot it. Regards, Joe
ReneGe (Author) commented:
Here is the error message I get:

Error=1 trying to get number of pages in C:\BatchFiles\CleanPDF\EE_JoeWinograd.pdf
Most common cause of this is a secure PDF.

cleanPDF started at 2015-08-15_14.41.10
Version 1.0 [Build 20150815.1015]
Input file: C:\BatchFiles\CleanPDF\EE_JoeWinograd.pdf
Output file: C:\BatchFiles\CleanPDF\EE_JoeWinograd_2015-08-15_14.41.10.pdf
Search phrase: No Punches
Search type: Case Insensitive
Troubleshooting run: Yes
*Param1=C:\BatchFiles\CleanPDF\EE_JoeWinograd.pdf
*Param2=No Punches
*Param3=t
Joe Winograd (Fellow & MVE, Developer) commented:
Oops...my bad...I should have explained this. cleanPDF requires both Xpdf's PDFtoText.exe and PDF Labs' PDFtk.exe (and the DLL that it uses, libiconv2.dll). You already watched my EE video Micro Tutorial that explains how to download and install PDFtoText, so you should be good on that. For PDFtk, here's the download link:
https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/pdftk_server-2.02-win-setup.exe

Simply run that setup file.

Btw, neither of those really needs to be "installed". In fact, to keep things simpler in the program, it assumes that they will be in the same folder as cleanPDF itself. In other words, you must have these four files in the same folder:

cleanPDF.exe
PDFtoText.exe
PDFtk.exe
libiconv2.dll
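
How cleanPDF actually calls the two utilities is not shown in this thread. As an assumption-laden sketch, the Python below builds command lines that would do the two jobs: pdftotext's -f and -l options limit extraction to a page range, and pdftk's cat operation writes a new PDF from the listed pages. The function names are hypothetical, and the executable names assume both tools are on the PATH.

```python
def pdftotext_page_command(pdf_path, page, text_path, exe="pdftotext"):
    """Command that extracts the text of one page (-f first page, -l last page)."""
    return [exe, "-f", str(page), "-l", str(page), pdf_path, text_path]


def pdftk_keep_command(pdf_path, keep_pages, output_path, exe="pdftk"):
    """Command that writes a new PDF containing only keep_pages, in order."""
    if not keep_pages:
        raise ValueError("no pages to keep, so no output PDF should be created")
    return [exe, pdf_path, "cat", *map(str, keep_pages), "output", output_path]


print(pdftotext_page_command("report.pdf", 3, "page3.txt"))
print(pdftk_keep_command("report.pdf", [2, 4, 6, 8, 10], "report_clean.pdf"))
```

Each list can be passed to subprocess.run(command, check=True); building the command as a list avoids quoting problems with file names that contain spaces.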

I wrote the program in AutoHotkey, but you do not need to download and install it, because I compiled cleanPDF into an executable (cleanPDF.exe). However, if you're interested in learning about the language, this EE article should help:
AutoHotkey - Getting Started

Regards, Joe

Update: Based on your experience, I'll modify the program to look for PDFtoText.exe in "C:\Program Files (x86)\xpdf\" and "C:\Program Files\xpdf\" (the standard locations for it), and for PDFtk.exe (and libiconv2.dll) in "C:\Program Files (x86)\PDF Labs\PDFtk Server\bin\" and "C:\Program Files\PDF Labs\PDFtk Server\bin\" (the standard locations for it). I'll also look for them in the same folder where cleanPDF.exe is located. I'll throw the error only if it fails to find them in all three locations, and in that case, I'll display a more meaningful message, such as "PDFtoText.exe not found in standard locations - please install it". Regards, Joe
ReneGe (Author) commented:
Works perfectly :)

Ideas:
- What if I need to have it case sensitive and in troubleshooting mode?
- I think the output diag txt file should have the same name as the output file but with the txt extension.

Cheers
Joe Winograd (Fellow & MVE, Developer) commented:
> Works perfectly :)

Glad to hear it!

> What if I need to have it case sensitive and in troubleshooting mode?

You can already do that. Make the third parameter this:

st

Order doesn't matter, so it could also be this:

ts

> I think the output diag txt file should have the same name as the output file but with the txt extension.

Excellent idea! That way you'll know immediately, right from the file name, which input file the logfile is for. I'll put that in the next build, along with the fix for detecting that the required support files are missing. But I'm thinking that I should also include LogFile in the file name, just in case a future enhancement creates other TXT files besides the LogFile. Make sense? Regards, Joe
ReneGe (Author) commented:
Ideas:

I would not create any log file unless "t" is in the third argument, in which case it would write the complete troubleshooting log file.

This is just in case we run it in a batch file loop in order to remove pages of multiple pdf files.

Or "L" for log and "t" for troubleshoot and "e" for log errors only.

Cheers :)
Joe Winograd (Fellow & MVE, Developer) commented:
I like adding an "L" as the switch for creating a logfile (btw, switches are case insensitive). Default would be no logfile. "T" would imply "L" (i.e., no need to specify "L" when "T" is specified). I'm not so sure about adding the "E" for errors only, which is probably not needed, since errors give the dialog — although it could be useful for non-fatal errors. Thanks for the ideas! Cheers, Joe
Joe Winograd (Fellow & MVE, Developer) commented:
Rene,
I'm going offline now for a few hours — need to get some things done AFK/IRL. More later — already working on the new build. :)  Regards, Joe
Joe Winograd (Fellow & MVE, Developer) commented:
Release Notes for Version 1.1 [Build 20150815.2045]

o  Look for PDFtoText.exe in
   C:\Program Files (x86)\xpdf\
   C:\Program Files\xpdf\
   the same folder where cleanPDF.exe is
   (fatal error if not found in one of those three folders)

o  Look for PDFtk.exe in
   C:\Program Files (x86)\PDF Labs\PDFtk Server\bin\
   C:\Program Files\PDF Labs\PDFtk Server\bin\
   the same folder where cleanPDF.exe is
   (fatal error if not found in one of those three folders)

o  Look for libiconv2.dll in the same folder where PDFtk.exe is
   (fatal error if not found in that folder)

o  New option switch "L" to create logfile, so the default now is not to create logfile
   ("L" not needed if "T" present, i.e., "T" creates a logfile whether or not "L" present)

o  Logfile name changed to:
   <input file name>_YYYYMMDD_hh.mm.ss_cleanPDFlogfile.txt

o  New troubleshooting entry in logfile when search phrase is found on a page
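
The folder search described in these notes amounts to trying each candidate location in order and failing only when none has the file. A minimal Python sketch under that reading follows; find_tool and XPDF_DIRS are hypothetical names, and the error text is borrowed from the wording proposed earlier in the thread.

```python
import os
import tempfile

# Standard install folders for PDFtoText.exe, as listed in the release notes.
XPDF_DIRS = [r"C:\Program Files (x86)\xpdf", r"C:\Program Files\xpdf"]


def find_tool(exe_name, standard_dirs, program_dir):
    """Return the first existing path to exe_name, or raise if none is found.

    The standard folders are tried first, then the program's own folder.
    """
    for folder in [*standard_dirs, program_dir]:
        candidate = os.path.join(folder, exe_name)
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(
        f"{exe_name} not found in standard locations - please install it"
    )


# Demo: a temporary folder stands in for the folder holding cleanPDF.exe.
with tempfile.TemporaryDirectory() as program_dir:
    open(os.path.join(program_dir, "PDFtoText.exe"), "w").close()
    print(find_tool("PDFtoText.exe", XPDF_DIRS, program_dir))
```

The same function would cover PDFtk.exe with its own pair of standard folders; libiconv2.dll would then be checked in the folder of whichever PDFtk.exe was found.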

Will send you a message with instructions for new version. Regards, Joe
ReneGe (Author) commented:
Hi Joe,

Just tried it with: @ECHO OFF & "CleanPDF.exe" "%~1" "No Punches" s

Very nice :)

Going to bed now.

I'll try all options tomorrow morning (GMT-5).

Cheers and good night,
Rene
Joe Winograd (Fellow & MVE, Developer) commented:
Hi Rene,

Thanks for testing it in a batch file — I neglected to do that (all testing was in a command prompt). I'm very glad to hear it worked in the BAT.

I'm normally in Central Time, too, but right now in Eastern Time.

Looking forward to hearing how your testing goes tomorrow. Regards, Joe
Joe Winograd (Fellow & MVE, Developer) commented:
Hi Rene,

Would like your feedback on changing the icon for cleanPDF. As you may have noticed in v1.0 and v1.1, the program's icon is a red X:

[Image: cleanPDF-redX]

At first I thought it was a good choice because it signifies deletion. But then it occurred to me that it may scare users a bit, so now I'm thinking that an icon along the lines of cleaning, rather than deleting, may be a better way to go, such as this broom:

[Image: cleanPDF-broom]

What's your opinion? Opinions from anyone else following this thread would be welcome. Thanks, Joe
ReneGe (Author) commented:
Hi Joe,

I think the "X" better represents what this tool does.

That's because the broom may inspire a subjective interpretation of what it does.

It's just an opinion, nothing more!

Cheers mate,
Rene
Joe Winograd (Fellow & MVE, Developer) commented:
You're right — it does delete pages! Let's keep the red X. But maybe the name cleanPDF should be changed, since "clean" may also inspire a subjective interpretation of what it does. How about DeletePagesPDF.exe for the name?
ReneGe (Author) commented:
What about: AutomaticPageDeletePDF.exe
Joe Winograd (Fellow & MVE, Developer) commented:
I like it! Or maybe a shortened version:  AutoPageDeletePDF.exe
ReneGe (Author) commented:
Nice :)
ReneGe (Author) commented:
Thanks Joe :)

This was a fun project!

You rock!!

Cheers mate,
Rene
Joe Winograd (Fellow & MVE, Developer) commented:
Hi Rene,

I'm still writing the article (it's taking longer than I expected) and I'm working on a section now with examples. One of the examples is based on a post of yours:

AutoPageDeletePDF.exe "%~1" "No Punches" s

I'm not an expert in batch files and am wondering why you used:

"%~1"

As far as I know, the purpose of the tilde is to remove the quotes. But then you put the quotes around %~1. So isn't "%~1" the same as %1 — if not, how is it different?

On a separate note, has the program been working well for you? Thanks, Joe
Joe Winograd (Fellow & MVE, Developer) commented:
Rene,
Amazing timing! After no comments from either of us for six days, we were both writing posts at the same exact time. Anyway, thanks for the compliment — I appreciate it. I may be back to ask you more questions as I continue writing the article. Cheers, Joe
ReneGe (Author) commented:
You're always welcome!
Joe Winograd (Fellow & MVE, Developer) commented:
When you get a chance, please answer the questions in this post:
http:#a40942965
Thanks!
ReneGe (Author) commented:
With pleasure and very shortly!
Joe Winograd (Fellow & MVE, Developer) commented:
Rene,
Another question for you. I'd like to use real-world examples in the article and am wondering if "No record here" is really the delete phrase for your 120-page weekly report. Or is it "No Punches"? Or something else? Thanks again, Joe
ReneGe (Author) commented:
Hi Joe,

This PDF file is generated by a time and attendance system, and it also includes employees who did not punch in during the selected pay period.

I use the utility you created on this PDF file to delete all pages containing the phrase: "No Punches".

Cheers
Joe Winograd (Fellow & MVE, Developer) commented:
Hi Rene,
That's a great real-world use of it! Thanks for letting me know. Regards, Joe
ReneGe (Author) commented:
With pleasure :)

I'm in the middle of a few emergencies.  I'll update you later with the batch file tests (which will not take much time anyway).

Cheers mate!
Joe Winograd (Fellow & MVE, Developer) commented:
No rush, Rene. I appreciate your ongoing support.