Extracting the P&ID reference and the drawing number from PDF isometric drawings into Excel

Posted on 2016-11-07
Last Modified: 2016-11-22
There are more than 8,000 PDF files containing piping isometric drawings of a process plant. Each isometric drawing contains a Process and Instrumentation Diagram (P&ID) reference, and each file has a unique drawing number. I want to extract this drawing number, as well as the P&ID drawing numbers, into an Excel file. There is no provision here to attach a sample file, but the text looks something like this:

P&ID: MD-513-1A00-EG-PR-PID-2035       ;        DWG: MD-513-1A00-ISO-CWR-AA22-0401-23

Can anyone please help? It would save me a lot of time. Thanks in advance.
Question by:Devendra Ram
LVL 32

Assisted Solution

by:Rob Henson
Rob Henson earned 125 total points
ID: 41877000
Are these in the filename or within the drawing itself?

I recall a DOS routine that can create a list of filenames from a specific directory and output to a text file. The text file can then be opened in Excel and, assuming there is a standard convention to the filename, the relevant parts can be extracted using formulae or text to columns.
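If the numbers do turn out to be in the filenames, the listing step can be sketched in Python as well. This is only a rough equivalent of redirecting a bare directory listing (something like `dir /b`) to a text file; the function name `list_pdf_names` and the output file name are made up for illustration, and nothing here assumes a particular filename convention.

```python
from pathlib import Path


def list_pdf_names(folder, out_file):
    """Write the name of every PDF directly inside `folder` to `out_file`, one per line.

    Hypothetical helper: it does not recurse into subfolders.
    """
    names = sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    with open(out_file, "w", encoding="utf-8") as f:
        for name in names:
            f.write(name + "\n")
    return names
```

The resulting text file can be opened in Excel and split with Text to Columns, as described above.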

If not, and the required info is in the drawing itself, then the text will be difficult to extract if the PDF files were created as images.

Rob H
LVL 14

Accepted Solution

frankhelk earned 250 total points
ID: 41877005
I have no ready-to-use code at hand, but maybe some thoughts on the issue might help you anyhow (depends on your coding experience ...).

First, I don't know of a way to directly import PDF into Excel (which would in most cases make no sense, anyhow). Not even the textual part ... that part needs to be circumvented.

I presume that the text you'd like to extract is textual (as opposed to graphical ...) content of the PDF file. I once had a bunch of PDF files to import data from into Excel, and I did it with a utility named "pdftotext" from the Xpdf package (see here). The utility just runs over the PDF file and spits out all of its plain text.

So for extracting the text from the PDFs you'll just have to walk thru the files (there are numerous examples on the web for doing that with VBA) and call pdftotext with appropriate parameters. But I'd do that with a simple batch file ... nevertheless there are various examples on the web on how to call an external program from VBA.
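For readers who would rather script this than use a batch file or VBA, here is a minimal Python sketch of walking a folder and calling pdftotext on each file. It assumes `pdftotext` is on the PATH; the function names are hypothetical, and the `run` parameter exists only so the call can be swapped out for testing. Passing `-` as the output file asks pdftotext to write to standard output.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, exe="pdftotext"):
    # -layout keeps text roughly in its on-page arrangement,
    # -enc UTF-8 selects the output encoding,
    # and "-" as the text file sends the result to stdout.
    return [exe, "-layout", "-enc", "UTF-8", str(pdf_path), "-"]


def extract_text(pdf_path, run=subprocess.run):
    """Return the plain text of one PDF (assumes pdftotext is installed)."""
    result = run(pdftotext_command(pdf_path), capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def iter_pdf_texts(folder, run=subprocess.run):
    """Yield (file name, extracted text) for each *.pdf directly inside `folder`."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        yield pdf.name, extract_text(pdf, run)
```

If the PDFs are scanned images rather than text, this returns little or nothing useful, which is a quick way to check a few files before processing all of them.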

In the batch file I'd use some command line filter program (I'd recommend GNU grep, see here for a Windows port) to filter out the relevant text.

With grep you'll be able to filter with regular expressions. For your text example, a regular expression that matches the "P&ID: ... ; DWG: ..." pattern would filter out all lines with text matching your example. To beautify the results, you'd pipe 'em thru GNU sed (see here) and filter out everything except the IDs, separated with ; (numbered captures around the IDs ...) ... I'm kind of stuck at the regex syntax for that ... any help around?
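On the capture syntax: the following is a sketch in Python's `re` dialect rather than sed's, and it assumes every relevant line looks like the asker's sample (both labels on one line, separated by a semicolon, with no spaces inside the IDs). The names `LINE_RE` and `parse_line` are illustrative only.

```python
import re

# Two numbered captures: group 1 is the P&ID number, group 2 the drawing number.
# [^\s;]+ means "one or more characters that are neither whitespace nor ';'".
LINE_RE = re.compile(r"P&ID:\s*([^\s;]+)\s*;\s*DWG:\s*([^\s;]+)")


def parse_line(line):
    """Return (pid, dwg) if the line matches the sample layout, else None."""
    m = LINE_RE.search(line)
    return m.groups() if m else None


sample = "P&ID: MD-513-1A00-EG-PR-PID-2035       ;        DWG: MD-513-1A00-ISO-CWR-AA22-0401-23"
print(parse_line(sample))
# ('MD-513-1A00-EG-PR-PID-2035', 'MD-513-1A00-ISO-CWR-AA22-0401-23')
```

If the real drawings put the two labels on separate lines or in a different order, the pattern would need to be adjusted accordingly.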

Append (>>) the result to a temp file with a .csv extension.

When you're thru with all files, the temp file contains two columns of neatly ordered data in CSV format, ready to load into excel.

You even could call that batch file from an Excel macro .... ;-)
LVL 52

Assisted Solution

by:Joe Winograd, EE MVE
Joe Winograd, EE MVE earned 125 total points
ID: 41877182
Hi Devendra,

I see that you joined EE today — welcome aboard!

Here's a design roadmap for you:

(1) Use Xpdf's PDFtoText command line utility to extract the text from each PDF. The EE 5-minute video Micro Tutorial, Xpdf - Command Line Utility for PDF Files - Part 1, introduces the Xpdf utilities, with download information.

And another EE 5-minute video Micro Tutorial, Xpdf - Convert PDF Files to Plain Text Files - Part 3, specifically discusses PDFtoText.

(2) Write a program/script in whatever language you prefer that loops through all of the PDF files, calling PDFtoText for each one. I've done this in the AutoHotkey language, described in the EE article, AutoHotkey - Getting Started.

(3) Look for the string "P&ID:" (assuming, of course, that this string occurs only once in each PDF). Extract the string following that into a variable.

(4) Look for the string "DWG:" (again, assuming that this string occurs only once in each PDF). Extract the string following that into a variable.

(5) Create a line in a plain text file in CSV format (one for each PDF file) that has the file name, the "P&ID:" value, and the "DWG:" value. If you want to get fancier, use Excel Component Object Model (COM) calls to create an actual Excel file (XLS or XLSX) instead of a CSV file, putting the file name in column A, the "P&ID:" value in column B, and the "DWG:" value in column C (one row for each PDF).

(6) You didn't say if all 8,000+ PDF files are in the same folder or in subfolders. If the latter, then the program/script should recurse into subfolders.
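Steps (3) to (6) can be sketched in Python as follows. This is not Joe's AutoHotkey implementation, only an illustration under stated assumptions: each marker occurs at most once per PDF, the value after it contains no spaces or semicolons, and `get_text` is a caller-supplied function (hypothetical here) that returns the text PDFtoText produced for a given file. It writes the CSV variant, not an XLS/XLSX file via COM.

```python
import csv
import re
from pathlib import Path


def value_after(marker, text):
    """Return the first token following `marker` in `text`, or "" if the marker is absent."""
    m = re.search(re.escape(marker) + r"\s*([^\s;]+)", text)
    return m.group(1) if m else ""


def build_csv(root, out_csv, get_text):
    """Write one CSV row per PDF under `root` (subfolders included)."""
    rows = []
    # rglob recurses into subfolders, covering step (6).
    for pdf in sorted(Path(root).rglob("*.pdf")):
        text = get_text(pdf)
        rows.append([pdf.name, value_after("P&ID:", text), value_after("DWG:", text)])
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["File", "P&ID", "DWG"])  # columns A, B, C when opened in Excel
        writer.writerows(rows)
    return rows
```

Rows with an empty P&ID or DWG column flag the files where a marker was not found, so they can be checked by hand.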

As mentioned above, I would do this in the AutoHotkey language, but any programming/scripting language that can call command line tools will suffice. If you want to create an XLS/XLSX file instead of a CSV file, the language needs to have COM support, which AutoHotkey does, as discussed in this recent EE thread (note also the links in this post at that thread).

Regards, Joe

Author Closing Comment

by:Devendra Ram
ID: 41897352
I could not use the advice exactly as given; however, I found a workaround to figure out the numbering strategy. Thank you all.
LVL 52

Expert Comment

by:Joe Winograd, EE MVE
ID: 41897676
You're welcome. And thanks to you for coming back to close the question. It's always better when the asker closes the question, rather than the participants.
