Extracting P&ID Reference and the drawing number from a PDF Isometric drawing into excel

Posted on 2016-11-07
Medium Priority
Last Modified: 2016-11-22
There are more than 8,000 PDF files containing piping isometric drawings of a process plant. Each isometric drawing contains a Process and Instrumentation Diagram (P&ID) reference, and each file has a unique drawing number. I want to extract this drawing number as well as the P&ID drawing number into an Excel file. There is no provision here to attach a sample file, but the text looks something like this:

P&ID: MD-513-1A00-EG-PR-PID-2035       ;        DWG: MD-513-1A00-ISO-CWR-AA22-0401-23

Can anyone please help? It would save me a lot of time. Thanks in advance.
Question by:Devendra Ram

Assisted Solution

by:Rob Henson
Rob Henson earned 500 total points
ID: 41877000
Are these in the filename or within the drawing itself?

I recall a DOS routine that can create a list of filenames from a specific directory and output to a text file. The text file can then be opened in Excel and, assuming there is a standard convention to the filename, the relevant parts can be extracted using formulae or text to columns.
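The directory listing described above can also be sketched in Python rather than DOS. This is a minimal sketch that assumes all PDFs sit directly in one folder; the function names `list_pdf_names` and `write_name_list` are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def list_pdf_names(folder):
    """Return the sorted file names of all PDFs directly inside `folder`."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def write_name_list(folder, out_file):
    """Write one PDF file name per line to `out_file`; return how many were written."""
    names = list_pdf_names(folder)
    Path(out_file).write_text("\n".join(names) + "\n", encoding="utf-8")
    return len(names)
```

The resulting text file can be opened in Excel and split with Text to Columns, provided the file names follow a standard convention.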

If not, and the required info is within the drawing itself, then it depends on how the PDF files were created: if they were created as images, the text will be difficult to extract.

Rob H

Accepted Solution

frankhelk earned 1000 total points
ID: 41877005
I have no ready-to-use code at hand, but maybe some thoughts on the issue might help you anyhow (depends on your coding experience ...).

First, I don't know of a way to directly import PDF into Excel (which would in most cases make no sense, anyhow). Not even the textual part ... so that part needs to be worked around.

I presume that the text you'd like to extract is actual text content of the PDF file (as opposed to graphics). I've had a case where I got a bunch of PDF files to import data from into Excel, and I did it with a utility named "pdftotext" from the Xpdf package. The utility just runs over the PDF file and spits out all the plain text it finds.

So for extracting the text from the PDFs you'll just have to walk thru the files (there are numerous examples on the web for doing that with VBA) and call pdftotext with appropriate parameters. But I'd do that with a simple batch file ... nevertheless there are various examples on the web on how to call an external program from VBA.
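For readers who would rather script this in Python than in VBA or a batch file, here is a minimal sketch of calling pdftotext as an external program. It assumes the `pdftotext` executable is on the PATH; the helper names are hypothetical. The `-layout` and `-enc` options and the `-` output argument (write to stdout) are standard pdftotext options, but check them against your installed version.

```python
import subprocess


def pdftotext_command(pdf_path, exe="pdftotext"):
    """Build the command line that dumps the text of one PDF to stdout."""
    # -layout tries to keep text from the same visual line together,
    # -enc UTF-8 fixes the output encoding, "-" sends the text to stdout.
    return [exe, "-layout", "-enc", "UTF-8", str(pdf_path), "-"]


def extract_text(pdf_path, exe="pdftotext"):
    """Run pdftotext on one file and return its text (raises if the tool fails)."""
    result = subprocess.run(
        pdftotext_command(pdf_path, exe), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")
```

If `extract_text` returns an empty or nearly empty string, the PDF is probably image-only, and this approach will not find the IDs.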

In the batch file I'd use some command line filter program to pick out the relevant text (I'd recommend GNU grep, for which Windows ports exist).

With grep you can filter with regular expressions. For your text example, a regular expression that matches the "P&ID: ... ; DWG: ..." pattern would pick out every line matching your example. To beautify the results, you'd pipe them through GNU sed and strip everything except the two IDs, separated by ; (using numbered captures around the IDs). I'm kind of stuck on the exact regex syntax for that ... any help around?
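One possible shape for such a regex with two numbered captures, shown in Python rather than grep/sed. This is a sketch based only on the single sample line in the question; it assumes each ID is a run of characters containing no whitespace and no semicolon, which may not hold for every drawing.

```python
import re

# Group 1 captures the P&ID number, group 2 the drawing number.
ID_LINE = re.compile(r"P&ID:\s*([^\s;]+)\s*;\s*DWG:\s*([^\s;]+)")


def extract_ids(line):
    """Return (pid, dwg) if the line matches the sample layout, else None."""
    match = ID_LINE.search(line)
    return (match.group(1), match.group(2)) if match else None


sample = "P&ID: MD-513-1A00-EG-PR-PID-2035       ;        DWG: MD-513-1A00-ISO-CWR-AA22-0401-23"
print(extract_ids(sample))
```

Lines that do not match return `None`, so they are easy to skip or log for manual review.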

Redirect the result, appending (>>), to a temp file with a .csv extension.

When you're through with all files, the temp file contains two columns of neatly ordered data in CSV format, ready to load into Excel.
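The append step can be sketched in Python with the standard `csv` module, which takes care of quoting. The function name `append_row` and the two-column layout are assumptions for illustration.

```python
import csv


def append_row(csv_path, pid, dwg):
    """Append one (P&ID, drawing number) pair as a row, like >> in a batch file."""
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow([pid, dwg])
```

Depending on regional settings, Excel may expect a semicolon rather than a comma as the separator; passing `delimiter=";"` to `csv.writer` would cover that case.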

You even could call that batch file from an Excel macro .... ;-)

Assisted Solution

by:Joe Winograd, EE MVE 2015&2016
Joe Winograd, EE MVE 2015&2016 earned 500 total points
ID: 41877182
Hi Devendra,

I see that you joined EE today — welcome aboard!

Here's a design roadmap for you:

(1) Use Xpdf's PDFtoText command line utility to extract the text from each PDF. The EE 5-minute video Micro Tutorial, Xpdf - Command Line Utility for PDF Files - Part 1, introduces the Xpdf utilities, with download information.

Another EE 5-minute video Micro Tutorial, Xpdf - Convert PDF Files to Plain Text Files - Part 3, specifically discusses PDFtoText.

(2) Write a program/script in whatever language you prefer that loops through all of the PDF files, calling PDFtoText for each one. I've done this in the AutoHotkey language, described in the EE article, AutoHotkey - Getting Started.

(3) Look for the string "P&ID:" (assuming, of course, that this string occurs only once in each PDF). Extract the string following that into a variable.

(4) Look for the string "DWG:" (again, assuming that this string occurs only once in each PDF). Extract the string following that into a variable.

(5) Create a line in a plain text file in CSV format (one for each PDF file) that has the file name, the "P&ID:" value, and the "DWG:" value. If you want to get fancier, use Excel Component Object Model (COM) calls to create an actual Excel file (XLS or XLSX) instead of a CSV file, putting the file name in column A, the "P&ID:" value in column B, and the "DWG:" value in column C (one row for each PDF).

(6) You didn't say if all 8,000+ PDF files are in the same folder or in subfolders. If the latter, then the program/script should recurse into subfolders.

As mentioned above, I would do this in the AutoHotkey language, but any programming/scripting language that can call command line tools will suffice. If you want to create an XLS/XLSX file instead of a CSV file, the language needs to have COM support, which AutoHotkey does, as discussed in a recent EE thread (note the links in the post at that thread).

Regards, Joe
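Steps (3) through (6) of the roadmap can be sketched in Python instead of AutoHotkey. This is a sketch under the same assumption Joe states, namely that each label occurs only once per PDF. The names `value_after` and `build_rows` are hypothetical, and `read_text` stands for whatever function returns the extracted text of one PDF (for example a wrapper around PDFtoText).

```python
from pathlib import Path


def value_after(text, label):
    """Return the first whitespace-delimited token after `label`, or "" if absent."""
    pos = text.find(label)
    if pos == -1:
        return ""
    tokens = text[pos + len(label):].split()
    # Drop a trailing ";" in case the separator is glued to the ID.
    return tokens[0].rstrip(";") if tokens else ""


def build_rows(root, read_text):
    """Recurse into subfolders of `root`; return (file name, P&ID, DWG) per PDF."""
    rows = []
    # rglob walks all subfolders; the "*.pdf" match is case-sensitive on some systems.
    for pdf in sorted(Path(root).rglob("*.pdf")):
        text = read_text(pdf)
        rows.append((pdf.name, value_after(text, "P&ID:"), value_after(text, "DWG:")))
    return rows
```

Each returned tuple maps to one CSV line, or to columns A, B, and C of one spreadsheet row. An empty string in either value column flags a PDF where the label was not found.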

Author Closing Comment

by:Devendra Ram
ID: 41897352
I could not use the advice exactly as given; however, I found a workaround logic to figure out the numbering strategy. Thank you all.

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41897676
You're welcome. And thanks to you for coming back to close the question. It's always better when the asker closes the question, rather than the participants.
