Extracting the P&ID reference and the drawing number from PDF isometric drawings into Excel

Posted on 2016-11-07
Last Modified: 2016-11-22
There are more than 8,000 PDF files containing piping isometric drawings of a process plant. These isometric drawings contain a Process and Instrumentation Diagram (P&ID) reference. Each file has a unique drawing number. I want to extract this drawing number as well as the P&ID drawing numbers into an Excel file. There is no provision here to attach a sample file, but it looks something like this:

P&ID: MD-513-1A00-EG-PR-PID-2035       ;        DWG: MD-513-1A00-ISO-CWR-AA22-0401-23

Can anyone please help? It would save me a lot of time. Thanks in advance.
Question by:Devendra Ram
LVL 33

Assisted Solution

by:Rob Henson
Rob Henson earned 125 total points
ID: 41877000
Are these in the filename or within the drawing itself?

I recall a DOS routine that can create a list of filenames from a specific directory and output to a text file. The text file can then be opened in Excel and, assuming there is a standard convention to the filename, the relevant parts can be extracted using formulae or text to columns.
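As a rough sketch of that filename listing (the command-prompt equivalent would be something like `dir /b *.pdf > list.txt`), here is the same idea in Python. The function names and the output path are made up for illustration, and it assumes all PDFs sit directly in one folder.

```python
import os

def list_pdf_names(folder):
    """Return the PDF file names found directly in `folder`, sorted."""
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith(".pdf")
    )

def write_listing(folder, out_path):
    """Write one file name per line to `out_path` and return the count."""
    names = list_pdf_names(folder)
    with open(out_path, "w", encoding="utf-8") as out:
        for name in names:
            out.write(name + "\n")
    return len(names)
```

The resulting text file can be opened in Excel and split with text to columns, provided the drawing number really is part of the file name.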

If not, and the required info is in the drawing itself, then the text will be difficult to extract if the PDF files were created as images.

Rob H
LVL 14

Accepted Solution

frankhelk earned 250 total points
ID: 41877005
I have no ready-to-use code at hand, but maybe some thoughts on the issue might help you anyhow (depends on your coding experience ...).

First, I don't know of a way to directly import PDF into Excel (which would in most cases make no sense, anyhow). Not even the textual part ... that part needs to be circumvented.

I presume that the text you'd like to extract is textual (as opposed to graphical ...) content of the PDF file. I've had a case where I had a bunch of PDF files to import data from into Excel, and I did it with a utility named "pdftotext" from the Xpdf package. The utility just runs over the PDF file and spits out all the plain text.

So for extracting the text from the PDFs you'll just have to walk thru the files (there are numerous examples on the web for doing that with VBA) and call pdftotext with appropriate parameters. But I'd do that with a simple batch file ... nevertheless there are various examples on the web on how to call an external program from VBA.
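As a rough sketch of that loop, here in Python rather than VBA or a batch file: it assumes `pdftotext` is on the PATH, uses its `-layout` and `-enc UTF-8` options, and passes `-` as the output name so the text goes to stdout. The function names are invented for the example.

```python
import subprocess
from pathlib import Path

def pdftotext_command(pdf_path, exe="pdftotext"):
    """Build the command line; "-" as the output name sends text to stdout."""
    return [exe, "-layout", "-enc", "UTF-8", str(pdf_path), "-"]

def pdf_to_text(pdf_path, exe="pdftotext", run=subprocess.run):
    """Run pdftotext on one PDF and return the extracted text.

    `run` is a parameter only so the call can be swapped out when
    pdftotext is not installed (an assumption of this sketch).
    """
    result = run(pdftotext_command(pdf_path, exe), capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")

def texts_in_folder(folder, to_text=pdf_to_text):
    """Yield (file name, extracted text) for each *.pdf directly in `folder`.

    Note: the glob pattern is case-sensitive on some platforms.
    """
    for pdf in sorted(Path(folder).glob("*.pdf")):
        yield pdf.name, to_text(pdf)
```

If the PDFs are scanned images, the returned text will be empty or nearly so, which is a quick way to find out whether this approach applies at all.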

In the batch file I'd use some command line filter program (I'd recommend GNU grep, for which a Windows port is available) to filter out the relevant text.

With grep you'll be able to filter with regular expressions. For your text example, with a suitable regular expression you could filter out all lines with text matching your example. To beautify the results, you'd pipe 'em thru GNU sed and filter out everything except the IDs, separated with ; (numbered captures around the IDs ...) ... I'm kind of stuck at the regex syntax for that ... any help around?

Redirect the result, appending (>>), to a temp file with a .csv extension.

When you're thru with all files, the temp file contains two columns of neatly ordered data in CSV format, ready to load into Excel.
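For the filter-and-capture step, one possible pattern is sketched below in Python instead of grep and sed. The pattern is a guess based only on the single sample line in the question: it assumes both IDs sit on one line, separated by a semicolon, and contain no spaces. The names `ID_LINE`, `extract_ids` and `append_rows` are made up for the example.

```python
import csv
import re

# Two numbered captures: group 1 is the P&ID number, group 2 the drawing number.
ID_LINE = re.compile(r"P&ID:\s*([^\s;]+)\s*;\s*DWG:\s*([^\s;]+)")

def extract_ids(text):
    """Return a list of (pid, dwg) tuples for every match in `text`."""
    return [match.groups() for match in ID_LINE.finditer(text)]

def append_rows(csv_path, rows):
    """Append rows to a CSV file, like redirecting with >> in a batch file."""
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```

Calling `extract_ids` on the sample line from the question gives one pair, `MD-513-1A00-EG-PR-PID-2035` and `MD-513-1A00-ISO-CWR-AA22-0401-23`. If the real drawings put the two labels on separate lines, the pattern would need to change.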

You could even call that batch file from an Excel macro .... ;-)
LVL 53

Assisted Solution

by:Joe Winograd, EE MVE
Joe Winograd, EE MVE earned 125 total points
ID: 41877182
Hi Devendra,

I see that you joined EE today — welcome aboard!

Here's a design roadmap for you:

(1) Use Xpdf's PDFtoText command line utility to extract the text from each PDF. The EE 5-minute video Micro Tutorial, Xpdf - Command Line Utility for PDF Files - Part 1, introduces the Xpdf utilities, with download information.

And another EE 5-minute video Micro Tutorial, Xpdf - Convert PDF Files to Plain Text Files - Part 3, specifically discusses PDFtoText.

(2) Write a program/script in whatever language you prefer that loops through all of the PDF files, calling PDFtoText for each one. I've done this in the AutoHotkey language, described in the EE article, AutoHotkey - Getting Started.

(3) Look for the string "P&ID:" (assuming, of course, that this string occurs only once in each PDF). Extract the string following that into a variable.

(4) Look for the string "DWG:" (again, assuming that this string occurs only once in each PDF). Extract the string following that into a variable.

(5) Create a line in a plain text file in CSV format (one for each PDF file) that has the file name, the "P&ID:" value, and the "DWG:" value. If you want to get fancier, use Excel Component Object Model (COM) calls to create an actual Excel file (XLS or XLSX) instead of a CSV file, putting the file name in column A, the "P&ID:" value in column B, and the "DWG:" value in column C (one row for each PDF).

(6) You didn't say if all 8,000+ PDF files are in the same folder or in subfolders. If the latter, then the program/script should recurse into subfolders.
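Steps (2) through (6) could be sketched as follows, in Python rather than AutoHotkey. This is only an outline under the same assumptions as the roadmap: each marker occurs once per PDF and the value after it contains no spaces. The text extraction from step (1) is passed in as a hypothetical `to_text` function so the sketch does not depend on how PDFtoText is called; all names here are invented for the example.

```python
import csv
import os

def value_after(text, marker):
    """Return the first whitespace-delimited token after `marker`, or "" if absent."""
    pos = text.find(marker)
    if pos == -1:
        return ""
    tokens = text[pos + len(marker):].split()
    # Drop a trailing ";" in case the separator is glued to the value.
    return tokens[0].rstrip(";") if tokens else ""

def collect_rows(root, to_text):
    """Walk `root` and its subfolders; build [file name, P&ID, DWG] per PDF."""
    rows = []
    for folder, _subdirs, files in os.walk(root):
        for name in sorted(files):
            if name.lower().endswith(".pdf"):
                text = to_text(os.path.join(folder, name))
                rows.append([name, value_after(text, "P&ID:"), value_after(text, "DWG:")])
    return rows

def write_csv(rows, out_path):
    """Write a header plus one row per PDF: columns A, B and C in Excel."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["File", "P&ID", "DWG"])
        writer.writerows(rows)
```

Rows with an empty second or third column point at PDFs where a marker was not found, which are worth checking by hand.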

As mentioned above, I would do this in the AutoHotkey language, but any programming/scripting language that can call command line tools will suffice. If you want to create an XLS/XLSX file instead of a CSV file, the language needs to have COM support, which AutoHotkey does, as discussed in this recent EE thread — and note the links in this post at that thread. Regards, Joe

Author Closing Comment

by:Devendra Ram
ID: 41897352
I could not use the advice exactly as given; however, I found a workaround to figure out the numbering strategy. Thank you all.
LVL 53

Expert Comment

by:Joe Winograd, EE MVE
ID: 41897676
You're welcome. And thanks to you for coming back to close the question. It's always better when the asker closes the question, rather than the participants.
