Extracting the P&ID reference and the drawing number from PDF isometric drawings into Excel

There are more than 8,000 PDF files containing piping isometric drawings of a process plant. Each isometric drawing contains a Process and Instrumentation Diagram (P&ID) reference, and each file has a unique drawing number. I want to extract this drawing number as well as the P&ID drawing number into an Excel file. There is no provision here to attach a sample file, but the text looks something like this:

P&ID: MD-513-1A00-EG-PR-PID-2035       ;        DWG: MD-513-1A00-ISO-CWR-AA22-0401-23

Can anyone please help? It would save me a lot of time. Thanks in advance.
Devendra Ram Asked:
I have no ready-to-use code at hand, but maybe some thoughts on the issue might help you anyhow (depends on your coding experience ...).

First, I don't know of a way to import a PDF directly into Excel (which in most cases would make no sense anyhow), not even its textual part, so that step needs a workaround.

I presume that the text you want to extract is textual (as opposed to graphical) content of the PDF file. I once had a bunch of PDF files to import data from into Excel, and I did it with a utility named "pdftotext" from the Xpdf package. The utility just runs over the PDF file and writes out all of its plain text.

So for extracting the text from the PDFs you'll just have to walk thru the files (there are numerous examples on the web for doing that with VBA) and call pdftotext with appropriate parameters. But I'd do that with a simple batch file ... nevertheless there are various examples on the web on how to call an external program from VBA.
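For anyone who would rather script this than use a batch file or VBA, here is a minimal Python sketch of the same idea. It assumes pdftotext is installed and on the PATH; the function names (`pdftotext_command`, `extract_text`, `extract_folder`) are made up for illustration. Passing `-` as the output file makes pdftotext write to standard output, so no intermediate text files are needed.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, exe="pdftotext"):
    """Build the command line; "-" as the output file sends the text to stdout."""
    return [exe, "-layout", str(pdf_path), "-"]


def extract_text(pdf_path, exe="pdftotext"):
    """Run pdftotext on one PDF and return its text, or "" if the tool reports an error."""
    result = subprocess.run(
        pdftotext_command(pdf_path, exe),
        capture_output=True,
        text=True,
        errors="replace",  # do not crash on odd bytes in the extracted text
    )
    return result.stdout if result.returncode == 0 else ""


def extract_folder(folder):
    """Return {file name: extracted text} for every *.pdf directly inside folder."""
    return {p.name: extract_text(p) for p in sorted(Path(folder).glob("*.pdf"))}
```

With 8,000+ files it is probably worth running this on a handful of drawings first, to check that the title block text really comes out as text and not as an image.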

In the batch file I'd use some command line filter program (I'd recommend GNU grep, see here for a Windows port) to filter out the relevant text.

With grep you can filter with regular expressions. For your text example, a regular expression built around the literal labels P&ID: and DWG: would select all lines with text matching your example. To beautify the results, you'd pipe them through GNU sed and strip everything except the IDs, separated with ; (numbered captures around the IDs ...). I'm kind of stuck at the regex syntax for that ... any help around?
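As a possible answer to the capture-group question, here is a small sketch using Python's `re` module instead of grep and sed. The pattern is an assumption based only on the one sample line in the question (label, ID without spaces, a semicolon, second label, second ID); `LINE_RE` and `parse_line` are illustrative names.

```python
import re

# Assumed line layout, taken from the sample in the question:
#   P&ID: <id>   ;   DWG: <id>
# Each ID is captured as a run of characters without whitespace or ";".
LINE_RE = re.compile(r"P&ID:\s*([^\s;]+)\s*;\s*DWG:\s*([^\s;]+)")


def parse_line(line):
    """Return (pid, dwg) if the line matches the assumed layout, else None."""
    match = LINE_RE.search(line)
    return match.groups() if match else None


sample = "P&ID: MD-513-1A00-EG-PR-PID-2035       ;        DWG: MD-513-1A00-ISO-CWR-AA22-0401-23"
print(parse_line(sample))
# prints ('MD-513-1A00-EG-PR-PID-2035', 'MD-513-1A00-ISO-CWR-AA22-0401-23')
```

If the real drawings put the two labels on separate lines, this single pattern will not match and each label would have to be searched for on its own.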

Pipe the result appending (>>) to a temp file with .csv extension.

When you're through with all files, the temp file contains two columns of neatly ordered data in CSV format, ready to load into Excel.
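The append-to-CSV step could look roughly like this in Python. It is only a sketch: `append_rows` is a hypothetical helper, and the `;` delimiter follows the suggestion above. Which separator Excel expects when opening a CSV depends on the regional list separator setting, so a comma may be the right choice on your machine.

```python
import csv


def append_rows(csv_path, rows, delimiter=";"):
    """Append (P&ID, DWG) pairs to csv_path, one row per pair (like >> in a batch file)."""
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter=delimiter).writerows(rows)
```

Opening the file in append mode means a second run adds to the existing rows, so delete the temp file before starting over.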

You even could call that batch file from an Excel macro .... ;-)
Rob Henson, Finance Analyst, Commented:
Are these in the filename or within the drawing itself?

I recall a DOS routine that can create a list of filenames from a specific directory and output to a text file. The text file can then be opened in Excel and, assuming there is a standard convention to the filename, the relevant parts can be extracted using formulae or text to columns.
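If the numbers do turn out to be in the file names, the listing step can also be scripted. Below is a rough Python equivalent of that directory listing; `list_pdf_names` and `write_name_list` are hypothetical names, and the sketch only looks at files directly inside one folder.

```python
from pathlib import Path


def list_pdf_names(folder):
    """Return the sorted names of all PDF files directly inside folder."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def write_name_list(folder, out_path):
    """Write one file name per line to out_path and return the list of names."""
    names = list_pdf_names(folder)
    Path(out_path).write_text("".join(name + "\n" for name in names), encoding="utf-8")
    return names
```

The resulting text file can be opened in Excel and split with Text to Columns, as described above.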

If not, and the required info is in the drawing itself, then the text will be difficult to extract if the PDF files were created as images.

Rob H
Joe Winograd, Fellow & MVE, Developer, Commented:
Hi Devendra,

I see that you joined EE today — welcome aboard!

Here's a design roadmap for you:

(1) Use Xpdf's PDFtoText command line utility to extract the text from each PDF. The EE 5-minute video Micro Tutorial, Xpdf - Command Line Utility for PDF Files - Part 1, introduces the Xpdf utilities, with download information. Another EE 5-minute video Micro Tutorial, Xpdf - Convert PDF Files to Plain Text Files - Part 3, specifically discusses PDFtoText.

(2) Write a program/script in whatever language you prefer that loops through all of the PDF files, calling PDFtoText for each one. I've done this in the AutoHotkey language, described in the EE article, AutoHotkey - Getting Started.

(3) Look for the string "P&ID:" (assuming, of course, that this string occurs only once in each PDF). Extract the string following that into a variable.

(4) Look for the string "DWG:" (again, assuming that this string occurs only once in each PDF). Extract the string following that into a variable.

(5) Create a line in a plain text file in CSV format (one for each PDF file) that has the file name, the "P&ID:" value, and the "DWG:" value. If you want to get fancier, use Excel Component Object Model (COM) calls to create an actual Excel file (XLS or XLSX) instead of a CSV file, putting the file name in column A, the "P&ID:" value in column B, and the "DWG:" value in column C (one row for each PDF).

(6) You didn't say if all 8,000+ PDF files are in the same folder or in subfolders. If the latter, then the program/script should recurse into subfolders.
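Steps (3) through (6) of the roadmap can be sketched in Python as follows (AutoHotkey or any other language would do just as well). This is an illustration under assumptions, not tested against real drawings: it assumes each ID contains no spaces and ends at whitespace or a semicolon, that each label occurs once per PDF, and the names `value_after`, `make_row`, and `find_pdfs` are invented here.

```python
import re
from pathlib import Path


def value_after(text, label):
    """Return the first token following label (e.g. "DWG:"), or "" if the label is absent."""
    match = re.search(re.escape(label) + r"\s*([^\s;]+)", text)
    return match.group(1) if match else ""


def make_row(file_name, text):
    """Build one CSV row: file name, P&ID value, DWG value (steps 3 to 5)."""
    return [file_name, value_after(text, "P&ID:"), value_after(text, "DWG:")]


def find_pdfs(root):
    """Return every PDF under root, recursing into subfolders (step 6)."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Each row from `make_row` can be written with Python's `csv` module, giving the file name in column A, the "P&ID:" value in column B, and the "DWG:" value in column C. An empty value in a row flags a PDF where a label was not found, which is worth checking by hand.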

As mentioned above, I would do this in the AutoHotkey language, but any programming/scripting language that can call command line tools will suffice. If you want to create an XLS/XLSX file instead of a CSV file, the language needs to have COM support, which AutoHotkey does, as discussed in a recent EE thread (note the links in the post at that thread).

Regards, Joe
Devendra Ram, Author, Commented:
I could not use the advice exactly as given; however, I found a workaround to figure out the numbering strategy. Thank you all.
Joe Winograd, Fellow & MVE, Developer, Commented:
You're welcome. And thanks to you for coming back to close the question. It's always better when the asker closes the question, rather than the participants.
Question has a verified solution.