ReneGe (ASKER):

Adobe PDF File - Delete pages from a search word

Hi there,

Every week, I have an Adobe PDF file containing a lot of pages, from which I must delete every page that contains a particular string.

So imagine this.  Any page containing the whole phrase "Delete Me" would have to be deleted from the PDF file.

My ideal solution would be something simple where I can set parameters to delete all pages containing "Delete me" (yep, case sensitive, but if not, it's not so dramatic!).

Thanks for your help,
Rene

Randy Downs:

Maybe this will help.

You can utilize the built-in Action that highlights search results and then loop over these results, deleting or extracting the pages they're located on. This can be done from the JS console or from a new toolbar icon.

Another option is to use the Advanced Find window to generate a CSV file with the results and then use that file for the same operation. I've developed a script that allows you to print or extract the matching pages from such a command, and it will not be a problem to also have them deleted, if you wish. See: Acrobat -- Print or Extract Pages from CSV Search Results
Hi ReneGe,

This question interests me and I'm considering writing a program to do it. But first some questions on your exact requirements:

(1) Do you want to do this for just a single PDF file or an entire folder (and/or subfolders) of PDFs?

(2) It's dangerous to overwrite the input PDF (in case something goes haywire), so I recommend keeping the source file intact and creating a new file with the deleted pages removed. Do you want to rename the new file manually or have the program automatically assign a new name? If the latter, one idea is to add something like _WithoutDeletedPages to the end of the source file name in order to create the new file name. Another idea is to rename the source file by adding something like _WithDeletedPages to its name so that the new file may retain the name of the source file.

(3) Is the phrase that triggers page deletion always going to be Delete Me or would you like that to be a user-specified parameter?

That's it for starters. I may have some other questions when I begin to write and test the code. Regards, Joe

ReneGe (ASKER):

Hi Joe,

I find it very cool that you're making this question a project!!! :)

My current need is per file; which is fine because I can still loop it in a batch file if needed.

_WithDeletedPages  is cool!  Could also be renamed as follows: "%1_%yy%-%mm%%dd%_%hh%-%mim%-%ss%.pdf"

Actually, since I do batch files, it could be called as follows: cleanPDF.exe "FileName.pdf" "Content to look for to delete a page"

I think the search results should only be effective when the whole string is found.  For example "Content to look for to delete a page".

Cheers mate,
Rene
Hi Rene,
Yes, the YYYYMMDD_hh.mm.ss trick is one of my favorites for naming files — and with any program that requires user input/execution, it is guaranteed to create a unique file name, since a user can't run it in less than a second.
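
As a rough illustration of the naming trick described above, here is a minimal Python sketch. It is not the code of the program discussed in this thread; the function name `timestamped_name` is hypothetical, and the stamp layout simply follows the YYYYMMDD_hh.mm.ss pattern mentioned in the post.

```python
from datetime import datetime
from pathlib import PureWindowsPath


def timestamped_name(source: str, now: datetime) -> str:
    """Return the source path with _YYYYMMDD_hh.mm.ss added before the extension.

    Hypothetical helper for illustration only.
    """
    path = PureWindowsPath(source)
    stamp = now.strftime("%Y%m%d_%H.%M.%S")
    # Keep the folder and extension; only the base name gains the stamp.
    return str(path.with_name(f"{path.stem}_{stamp}{path.suffix}"))


if __name__ == "__main__":
    print(timestamped_name(r"C:\temp\test.pdf", datetime.now()))
```

Because the stamp has one-second resolution, two runs started within the same second (for example, from a fast batch loop) would produce the same name, so a real tool might want to check whether the file already exists.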

Although you're clearly a whiz with batch files, many users aren't, so I'm thinking of a more self-contained solution that wouldn't require a knowledge of batch files (in the interest of hitting a broader audience). In that spirit, another option that comes to mind is Case Sensitive or Case Insensitive search — I can see a need for both. This is a fascinating question — thanks for asking! You got the brain juices flowing. :)
Cheers, Joe

ReneGe (ASKER):

That's all good :)
Hi Rene,
While working on the project tonight, something struck me. How is the delete-this-page phrase getting on the pages of the PDF? You seem to have some control over it, because you suggested that it could be "Delete Me" or "Content to look for to delete a page". So if you can control what the phrase is, you must be involved in getting it on the page, but I'm not understanding how.

Btw, just thought of a good use for this — deleting those annoying "This page intentionally left blank" pages. :)  Regards, Joe

ReneGe (ASKER):

Hi Joe,

I must say this. You are a very cool and unique individual!!

I have found that PDF-generated reports often contain annoying, effectively empty pages that all carry the same sentence.

My current need:
I have a weekly 120 page report, which contains around 90 pages containing the sentence "No record here".  I must delete them before printing them.

I currently must pass through every one of them to delete these pages before printing them on very tiny .04 pounds 8.5X11 of sliced bleached white dried pressed tree fiber compounds.

Cheers,
Rene
Thank you for the kind words, Rene. I appreciate hearing them.

Please post a sample PDF page (extracted from the actual PDF file) with the "No record here" phrase. I want to make sure that my code handles it right (PDFs can be strange beasts). Would be good if you also post a page without that phrase (but be careful that it doesn't contain any private/sensitive info).

It's getting into the wee hours in my neck of the woods, so this will be my last post tonight. I'll check back into the thread first thing in the morning. Thanks, Joe

Update: I see that you viewed my Xpdf - Part 1 video Micro Tutorial. The solution I'm working on will use Part 3 in that series, which is about PDFtoText (Xpdf's utility to convert a PDF file to plain text). That's how my program will determine if the delete-this-page phrase is on a page.
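
To make the page-by-page idea concrete, here is a minimal Python sketch of one way to check a single page with pdftotext. It is an assumption about the general approach, not the actual program: the helper names are hypothetical, and it only relies on pdftotext's `-f` (first page) and `-l` (last page) options and on `-` as the output name to write to stdout.

```python
import subprocess


def pdftotext_command(pdf_path: str, page: int, exe: str = "pdftotext") -> list:
    """Build the argument list that extracts exactly one page to stdout."""
    return [exe, "-f", str(page), "-l", str(page), pdf_path, "-"]


def page_has_phrase(page_text: str, phrase: str, case_sensitive: bool = False) -> bool:
    """Plain substring test on the extracted text of one page."""
    if case_sensitive:
        return phrase in page_text
    return phrase.casefold() in page_text.casefold()


def extract_page_text(pdf_path: str, page: int, exe: str = "pdftotext") -> str:
    """Run pdftotext for one page (requires pdftotext to be installed)."""
    result = subprocess.run(
        pdftotext_command(pdf_path, page, exe), capture_output=True, check=True
    )
    # The output encoding depends on the pdftotext build and its settings,
    # so undecodable bytes are replaced instead of raising an error.
    return result.stdout.decode("utf-8", errors="replace")
```

A simple substring test like this will miss a phrase that the text extraction splits across two lines, which is one reason testing on real sample pages matters.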
@Rene

Maybe we should be looking at the process that creates such a report, rather than trying to fix the PDF.
Hi Rene,
I completed the program — a "quick and dirty" version, but with your batch coding skills, it should work well for you. I tested it on several PDFs here and it works perfectly, but before posting the code, I'd like to test it on your PDFs, so please post a sample when you get a chance.

I'm leaving my office now for several hours. Will check back into the thread as soon as I return. Regards, Joe

ReneGe (ASKER):

@aikimark
The PDF file is created by an application in which I do not have any access to the source code.

@Joe
Wow, that's cool, great and wonderful!
How can I try your solution?
Please give me the command line arguments and switches (if any).

Thanks and cheers :)
Rene

> Wow, that's cool, great and wonderful!

I'm glad you think so. :)

> How can I try your solution?

I'm writing an article about it for publication here at EE. I'll send you the link when I submit the article, probably during the weekend. In the meantime, please post one of your PDFs here (or at least some sample pages — some with the Delete Phrase, some without it). As I alluded to earlier, the PDF spec is very complex — not all PDFs are created equal. It's working perfectly on my PDFs, but I'd like to test it on your PDFs to be sure there are no issues.

> Please give me the command line arguments and switches (if any).

I took your suggestion with one addition — a parameter for a case sensitive ("cs") or case insensitive ("ci") search. So the command line is:

cleanPDF.exe FileName DeletePhrase SearchType

Example:

cleanPDF.exe c:\temp\test.pdf DeleteMe cs

Of course, if the file name or search phrase has spaces in it, enclose it in quotes, such as:

cleanPDF.exe "c:\folder has spaces\file has spaces.pdf" "delete me has spaces" ci

Regards, Joe

ReneGe (ASKER):

Hi Joe,

Here is a "modified" PDF file for you.

The expected command line should be something like this:
cleanPDF.exe "C:\Temp\EE_JoeWinograd.pdf" "No Punches" cs

Cheers
EE-JoeWinograd.pdf
Hi Rene,

cleanPDF works perfectly on your PDF.

I changed one feature. The third parameter is now a string of one-character switches, instead of "ci" or "cs". I did this to have an easy way of specifying additional options in the future. I already added one — a troubleshooting (debugging) run. So here are the current switches:

s — case sensitive search (default is case insensitive)
t — troubleshooting run (default is normal run)

If any character that is not defined as a switch is specified in the third parameter, it will be ignored. The switches may be entered in any order.
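
The switch rules above (one character per option, any order, unknown characters ignored) can be sketched in a few lines of Python. This is only an illustration with a hypothetical function name, not the program's own code; lowercasing the input reflects the remark later in this thread that switches are case insensitive.

```python
def parse_switches(raw: str) -> dict:
    """Interpret a string of one-character switches.

    s = case sensitive search (default: case insensitive)
    t = troubleshooting run (default: normal run)
    Characters that are not defined as switches are ignored.
    """
    flags = set(raw.lower())
    return {
        "case_sensitive": "s" in flags,
        "troubleshooting": "t" in flags,
    }


if __name__ == "__main__":
    print(parse_switches("st"))
    print(parse_switches("ts"))
```

Treating the parameter as a set of characters is what makes the order irrelevant and leaves room for new switches later.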

So the command line you'll run is now like this:

cleanPDF.exe "C:\Temp\EE_JoeWinograd.pdf" "No Punches" s

To facilitate its use in batch files, cleanPDF's operation is silent (except for fatal errors, such as input file or search phrase not specified). Because of this, it creates a logfile (in the same folder as the PDF file) so you can check its operation after it runs. The logfile name is:

cleanPDF_LogFile_YYYY-MM-DD_hh.mm.ss.txt

For example, here's the logfile from the run shown above:

cleanPDF started at 2015-08-15_11.56.41
Version 1.0 [Build 20150815.1015]
Input file: C:\temp\EE-JoeWinograd.pdf
Output file: C:\temp\EE-JoeWinograd_2015-08-15_11.56.41.pdf
Search phrase: No Punches
Search type: Case Sensitive
Troubleshooting run: No
cleanPDF finished at 2015-08-15_11.56.42
Number of pages kept: 1
Number of pages deleted: 1
Page numbers kept: 2
Page numbers deleted: 1

If it's a troubleshooting run, there will be additional entries in the logfile, each preceded with an asterisk so that it's easy to find the debugging messages.

To make sure it works on a larger file, I duplicated your 2-page file five times to create a 10-page file. Here's the logfile from that run with troubleshooting turned on:

cleanPDF started at 2015-08-15_12.00.50
Version 1.0 [Build 20150815.1015]
Input file: C:\temp\EE-JoeWinograd - 10 pages.pdf
Output file: C:\temp\EE-JoeWinograd - 10 pages_2015-08-15_12.00.50.pdf
Search phrase: No Punches
Search type: Case Sensitive
Troubleshooting run: Yes
*Param1=C:\temp\ee-joewinograd - 10 pages.pdf
*Param2=No Punches
*Param3=st
*Did not find search phrase on page 2
*Did not find search phrase on page 4
*Did not find search phrase on page 6
*Did not find search phrase on page 8
*Did not find search phrase on page 10
cleanPDF finished at 2015-08-15_12.00.53
Number of pages kept: 5
Number of pages deleted: 5
Page numbers kept: 2 4 6 8 10
Page numbers deleted: 1 3 5 7 9

If all pages have the search phrase, meaning all pages would be deleted, it does not create a new PDF. For example, a troubleshooting run that looks for "a" (case insensitive) would generate a logfile like this:

cleanPDF started at 2015-08-15_12.37.50
Version 1.0 [Build 20150815.1015]
Input file: C:\temp\EE-JoeWinograd - 10 pages.pdf
Output file: C:\temp\EE-JoeWinograd - 10 pages_2015-08-15_12.37.50.pdf
Search phrase: a
Search type: Case Insensitive
Troubleshooting run: Yes
*Param1=C:\temp\ee-joewinograd - 10 pages.pdf
*Param2=a
*Param3=t
cleanPDF finished at 2015-08-15_12.37.52
Number of pages kept: All pages have search phrase - did not create new PDF
Number of pages deleted: All pages have search phrase - did not create new PDF
Page numbers kept: N/A
Page numbers deleted: N/A

Likewise, if no pages have the search phrase, meaning no pages would be deleted, it does not create a new PDF. For example, a troubleshooting run that looks for "abc" (case sensitive) would have a logfile like this:

cleanPDF started at 2015-08-15_12.44.11
Version 1.0 [Build 20150815.1015]
Input file: C:\temp\EE-JoeWinograd - 10 pages.pdf
Output file: C:\temp\EE-JoeWinograd - 10 pages_2015-08-15_12.44.11.pdf
Search phrase: abc
Search type: Case Sensitive
Troubleshooting run: Yes
*Param1=C:\temp\ee-joewinograd - 10 pages.pdf
*Param2=abc
*Param3=ts
*Did not find search phrase on page 1
*Did not find search phrase on page 2
*Did not find search phrase on page 3
*Did not find search phrase on page 4
*Did not find search phrase on page 5
*Did not find search phrase on page 6
*Did not find search phrase on page 7
*Did not find search phrase on page 8
*Did not find search phrase on page 9
*Did not find search phrase on page 10
cleanPDF finished at 2015-08-15_12.44.13
Number of pages kept: No pages have search phrase - did not create new PDF
Number of pages deleted: No pages have search phrase - did not create new PDF
Page numbers kept: N/A
Page numbers deleted: N/A
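
The three outcomes shown in the logfiles above (some pages deleted, all pages matching, no pages matching) come down to a small decision, sketched here in Python. The function names are hypothetical. How the program actually builds the new PDF is not shown in this thread; the second helper assumes PDFtk's `cat` operation, which copies the listed pages into a new file.

```python
def plan_output(total_pages: int, pages_to_delete: set) -> tuple:
    """Return (create_new_pdf, pages_kept, pages_deleted).

    A new PDF is only worth creating when at least one page is kept
    and at least one page is deleted.
    """
    deleted = sorted(p for p in pages_to_delete if 1 <= p <= total_pages)
    kept = [p for p in range(1, total_pages + 1) if p not in pages_to_delete]
    create = bool(kept) and bool(deleted)
    return create, kept, deleted


def pdftk_cat_command(src: str, dst: str, kept: list, exe: str = "pdftk") -> list:
    """Build a PDFtk argument list that copies only the kept pages to dst."""
    return [exe, src, "cat", *map(str, kept), "output", dst]


if __name__ == "__main__":
    create, kept, deleted = plan_output(10, {1, 3, 5, 7, 9})
    if create:
        print(pdftk_cat_command("in.pdf", "out.pdf", kept))
```

Writing to a new file instead of editing in place keeps the source PDF intact, in line with the earlier recommendation in this thread.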

I mentioned above that the program goes non-silent in the case of a fatal error. Here's one example:

[Image: example of a fatal-error message]

I'm working on the article, but I'll send you a message via the Messaging System with a link to download the program now. Before publishing the article I'd like you to let me know if it works well on your weekly 120-page report. If it doesn't, I hope you'll work with me to troubleshoot it. Regards, Joe

ReneGe (ASKER):

Here is the error message I get:

Error=1 trying to get number of pages in C:\BatchFiles\CleanPDF\EE_JoeWinograd.pdf
Most common cause of this is a secure PDF.

cleanPDF started at 2015-08-15_14.41.10
Version 1.0 [Build 20150815.1015]
Input file: C:\BatchFiles\CleanPDF\EE_JoeWinograd.pdf
Output file: C:\BatchFiles\CleanPDF\EE_JoeWinograd_2015-08-15_14.41.10.pdf
Search phrase: No Punches
Search type: Case Insensitive
Troubleshooting run: Yes
*Param1=C:\BatchFiles\CleanPDF\EE_JoeWinograd.pdf
*Param2=No Punches
*Param3=t

ASKER CERTIFIED SOLUTION
Joe Winograd

This solution is only available to members.

ReneGe (ASKER):

Works perfectly :)

Ideas:
-What if I need to have it in both case sensitive and troubleshooting mode?
-I think the output diag txt file should have the same name as the output file but with the txt extension.

Cheers
> Works perfectly :)

Glad to hear it!

> What if I need to have it in both case sensitive and troubleshooting mode?

You can already do that. Make the third parameter this:

st

Order doesn't matter, so it could also be this:

ts

> I think the output diag txt file should have the same name as the output file but with the txt extension.

Excellent idea! That way you'll know right away from the file name which input file the logfile is for. I'll put that in the next build, along with the fix for detecting that the required support files are missing. But I'm thinking that I should also include LogFile in the file name, just in case a future enhancement creates other TXT files besides the LogFile. Make sense? Regards, Joe

ReneGe (ASKER):

Ideas:

I would not write any log file unless "t" is in the third argument, in which case "t" would write the complete troubleshooting log file.

This is just in case we run it in a batch file loop in order to remove pages of multiple pdf files.

Or "L" for log and "t" for troubleshoot and "e" for log errors only.

Cheers :)
I like adding an "L" as the switch for creating a logfile (btw, switches are case insensitive). Default would be no logfile. "T" would imply "L" (i.e., no need to specify "L" when "T" is specified). I'm not so sure about adding the "E" for errors only, which is probably not needed, since errors give the dialog — although it could be useful for non-fatal errors. Thanks for the ideas! Cheers, Joe
Rene,
I'm going offline now for a few hours — need to get some things done AFK/IRL. More later — already working on the new build. :)  Regards, Joe
Release Notes for Version 1.1 [Build 20150815.2045]

o  Look for PDFtoText.exe in
      C:\Program Files (x86)\xpdf\
      C:\Program Files\xpdf\
      the same folder where cleanPDF.exe is
   (fatal error if not found in one of those three folders)

o  Look for PDFtk.exe in
      C:\Program Files (x86)\PDF Labs\PDFtk Server\bin\
      C:\Program Files\PDF Labs\PDFtk Server\bin\
      the same folder where cleanPDF.exe is
   (fatal error if not found in one of those three folders)

o  Look for libiconv2.dll in the same folder where PDFtk.exe is
   (fatal error if not found in that folder)

o  New option switch "L" to create logfile, so the default now is not to create logfile
   ("L" not needed if "T" present, i.e., "T" creates a logfile whether or not "L" present)

o  Logfile name changed to:
      <input file name>_YYYYMMDD_hh.mm.ss_cleanPDFlogfile.txt

o  New troubleshooting entry in logfile when search phrase is found on a page
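
The support-file lookup in these release notes can be sketched as follows in Python. This is an illustration, not the program's code: the function name is hypothetical, and the search order (the two install folders first, then the program's own folder) is an assumption, since the notes list the folders but do not state the order in which they are tried.

```python
import ntpath
import os

XPDF_DIRS = [r"C:\Program Files (x86)\xpdf", r"C:\Program Files\xpdf"]


def find_support_file(name, program_dir, install_dirs, exists=os.path.isfile):
    """Return the first path where the support file exists, or None.

    install_dirs are tried first, then program_dir (assumed order).
    A caller would treat None as a fatal error.
    """
    for folder in [*install_dirs, program_dir]:
        candidate = ntpath.join(folder, name)  # Windows-style join on any OS
        if exists(candidate):
            return candidate
    return None


if __name__ == "__main__":
    found = find_support_file("PDFtoText.exe", r"C:\BatchFiles\CleanPDF", XPDF_DIRS)
    print(found or "PDFtoText.exe not found in any of the three folders")
```

Passing the existence check as a parameter is only there so the lookup can be exercised without the real folders present.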

Will send you a message with instructions for new version. Regards, Joe

ReneGe (ASKER):

Hi Joe,

Just tried it with: @ECHO OFF & "CleanPDF.exe" "%~1" "No Punches" s

Very nice :)

Going to bed now.

I'll try all options tomorrow morning (GMT-5).

Cheers and good night,
Rene
Hi Rene,

Thanks for testing it in a batch file — I neglected to do that (all testing was in a command prompt). I'm very glad to hear it worked in the BAT.

I'm normally in Central Time, too, but right now in Eastern Time.

Looking forward to hearing how your testing goes tomorrow. Regards, Joe
Hi Rene,

Would like your feedback on changing the icon for cleanPDF. As you may have noticed in v1.0 and v1.1, the program's icon is a red X:

[Image: red X icon]

At first I thought it was a good choice because it signifies deletion. But then it occurred to me that it may scare users a bit, so now I'm thinking that an icon along the lines of cleaning, rather than deleting, may be a better way to go, such as this broom:

[Image: broom icon]

What's your opinion? Opinions from anyone else following this thread would be welcome. Thanks, Joe

ReneGe (ASKER):

Hi Joe,

I think the "X" better represents what this tool does.

This is because the broom may inspire a subjective interpretation of what it does.

It's just an opinion, nothing more!

Cheers mate,
Rene
You're right — it does delete pages! Let's keep the red X. But maybe the name cleanPDF should be changed, since "clean" may also inspire a subjective interpretation of what it does. How about DeletePagesPDF.exe for the name?

ReneGe (ASKER):

What about: AutomaticPageDeletePDF.exe
I like it! Or maybe a shortened version:  AutoPageDeletePDF.exe

ReneGe (ASKER):

Nice :)

ReneGe (ASKER):

Thanks Joe :)

This was a fun project!

You rock!!

Cheers mate,
Rene
Hi Rene,

I'm still writing the article (it's taking longer than I expected) and I'm working on a section now with examples. One of the examples is based on a post of yours:

AutoPageDeletePDF.exe "%~1" "No Punches" s

I'm not an expert in batch files and am wondering why you used:

"%~1"

As far as I know, the purpose of the tilde is to remove the quotes. But then you put the quotes around %~1. So isn't "%~1" the same as %1 — if not, how is it different?

On a separate note, has the program been working well for you? Thanks, Joe
Rene,
Amazing timing! After no comments from either of us for six days, we were both writing posts at the same exact time. Anyway, thanks for the compliment — I appreciate it. I may be back to ask you more questions as I continue writing the article. Cheers, Joe

ReneGe (ASKER):

You're always welcome!
When you get a chance, please answer the questions in this post:
http:#a40942965
Thanks!

ReneGe (ASKER):

With pleasure and very shortly!
Rene,
Another question for you. I'd like to use real-world examples in the article and am wondering if "No record here" is really the delete phrase for your 120-page weekly report. Or is it "No Punches"? Or something else? Thanks again, Joe

ReneGe (ASKER):

Hi Joe,

This PDF file is generated by a time attendance system, and it also includes employees who did not punch in the selected pay period.

I use the utility you created on this PDF file to delete all pages containing the phrase: "No Punches".

Cheers
Hi Rene,
That's a great real-world use of it! Thanks for letting me know. Regards, Joe

ReneGe (ASKER):

With pleasure :)

I'm in the middle of a few emergencies.  I'll update you later with the batch file tests (which will not take much time anyway).

Cheers mate!
No rush, Rene. I appreciate your ongoing support.