Extracting the P&ID reference and the drawing number from PDF isometric drawings into Excel

Posted on 2016-11-07
Medium Priority
Last Modified: 2016-11-22
There are more than 8,000 PDF files containing piping isometric drawings of a process plant. Each isometric drawing contains a Process and Instrumentation Diagram (P&ID) reference, and each file has a unique drawing number. I want to extract this drawing number as well as the P&ID drawing number into an Excel file. There is no provision here to attach a sample file, but the text looks something like this:

P&ID: MD-513-1A00-EG-PR-PID-2035       ;        DWG: MD-513-1A00-ISO-CWR-AA22-0401-23

Can anyone please help? It would save me a lot of time. Thanks in advance.
Question by:Devendra Ram

Assisted Solution

by:Rob Henson
Rob Henson earned 500 total points
ID: 41877000
Are these in the filename or within the drawing itself?

I recall a DOS routine that can create a list of filenames from a specific directory and output to a text file. The text file can then be opened in Excel and, assuming there is a standard convention to the filename, the relevant parts can be extracted using formulae or text to columns.
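As an illustration of the "list the filenames into a text file" idea, here is a minimal Python sketch (Python is a swapped-in choice; the post itself refers to a DOS routine). The function names `list_pdf_names` and `write_listing` are hypothetical, and the sketch only looks at files directly inside one folder.

```python
from pathlib import Path


def list_pdf_names(folder: Path) -> list[str]:
    """Return the sorted names of the PDF files directly inside `folder`."""
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def write_listing(folder: Path, listing: Path) -> int:
    """Write one file name per line to `listing`; return how many were written."""
    names = list_pdf_names(folder)
    text = "\n".join(names) + ("\n" if names else "")
    listing.write_text(text, encoding="utf-8")
    return len(names)
```

The resulting text file can be opened in Excel and split with Text to Columns, provided the filenames follow a standard convention.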

If not and the required info is in the drawing, if the PDF files were created as an image then the text element will be difficult to extract.

Rob H

Accepted Solution

frankhelk earned 1000 total points
ID: 41877005
I have no ready-to-use code at hand, but maybe some thoughts on the issue might help you anyhow (depends on your coding experience ...).

First, I don't know of a way to import PDF directly into Excel (which in most cases would make no sense anyhow), not even the textual part, so that limitation needs to be worked around.

I presume that the text you'd like to extract is textual (as opposed to graphical) content of the PDF file. I've had a case where I got a bunch of PDF files to import data from into Excel, and I did it with a utility named "pdftotext" from the Xpdf package (see here). The utility just runs over the PDF file and spits out all the plain text.

So for extracting the text from the PDFs you'll just have to walk thru the files (there are numerous examples on the web for doing that with VBA) and call pdftotext with appropriate parameters. But I'd do that with a simple batch file ... nevertheless there are various examples on the web on how to call an external program from VBA.
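A minimal sketch of the "walk through the files and call pdftotext" step, written in Python rather than VBA or batch (a swapped-in language, not what the poster used). It assumes `pdftotext` from Xpdf is on the PATH; the helper names `build_command` and `convert_folder` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(pdf: Path, txt: Path) -> list[str]:
    """Build the pdftotext command line; -layout tries to keep the page layout."""
    return ["pdftotext", "-layout", str(pdf), str(txt)]


def convert_folder(folder: Path) -> list[Path]:
    """Run pdftotext on every *.pdf in `folder`; return the text files created.

    Raises FileNotFoundError if pdftotext is not installed or not on the PATH.
    """
    created = []
    for pdf in sorted(folder.glob("*.pdf")):
        txt = pdf.with_suffix(".txt")
        result = subprocess.run(build_command(pdf, txt))
        if result.returncode == 0:  # skip files pdftotext could not convert
            created.append(txt)
    return created
```

Calling `convert_folder(Path("drawings"))` (a made-up folder name) would leave one `.txt` file next to each PDF. Note that `glob("*.pdf")` may miss files named `*.PDF` on case-sensitive file systems.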

In the batch file I'd use some command line filter program (I'd recommend GNU grep, see here for a Windows port) to filter out the relevant text.

With grep you can filter with regular expressions. With a regular expression written for your text example, you could filter out all lines with text matching it. To beautify the results, you'd pipe 'em thru GNU sed (see here) and filter out everything except the IDs, separated with ; (numbered captures around the IDs ...) ... I'm kind of stuck at the regex syntax for that ... any help around?

Pipe the result appending (>>) to a temp file with .csv extension.

When you're thru with all files, the temp file contains two columns of neatly ordered data in CSV format, ready to load into Excel.
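The filter-and-capture idea can be sketched with Python's `re` module instead of grep and sed (a swapped-in tool). The pattern below is only a guess based on the single sample line in the question, not the expression from the original post, and the names `PATTERN`, `extract_ids`, `to_csv_lines`, and `append_csv` are hypothetical.

```python
import re
from pathlib import Path

# Assumed line shape: "P&ID: <id>   ;   DWG: <id>", as in the asker's sample.
# Each ID is taken to be a run of characters without whitespace or ';'.
PATTERN = re.compile(r"P&ID:\s*([^\s;]+)\s*;\s*DWG:\s*([^\s;]+)")


def extract_ids(text: str) -> list[tuple[str, str]]:
    """Return (P&ID, DWG) pairs for every matching line fragment in `text`."""
    return PATTERN.findall(text)


def to_csv_lines(pairs: list[tuple[str, str]]) -> list[str]:
    """Format each pair as 'P&ID;DWG', using ';' as the column separator."""
    return [f"{pid};{dwg}" for pid, dwg in pairs]


def append_csv(csv_path: Path, pairs: list[tuple[str, str]]) -> None:
    """Append the formatted lines to `csv_path`, like '>>' in a batch file."""
    with csv_path.open("a", encoding="utf-8") as fh:
        for line in to_csv_lines(pairs):
            fh.write(line + "\n")
```

If the real drawings put the two labels on separate lines or use different separators, the pattern would need to be adjusted accordingly.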

You even could call that batch file from an Excel macro .... ;-)

Assisted Solution

by:Joe Winograd, EE MVE 2015&2016
Joe Winograd, EE MVE 2015&2016 earned 500 total points
ID: 41877182
Hi Devendra,

I see that you joined EE today — welcome aboard!

Here's a design roadmap for you:

(1) Use Xpdf's PDFtoText command line utility to extract the text from each PDF. The EE 5-minute video Micro Tutorial, Xpdf - Command Line Utility for PDF Files - Part 1, introduces the Xpdf utilities, with download information.

And another EE 5-minute video Micro Tutorial, Xpdf - Convert PDF Files to Plain Text Files - Part 3, specifically discusses PDFtoText.

(2) Write a program/script in whatever language you prefer that loops through all of the PDF files, calling PDFtoText for each one. I've done this in the AutoHotkey language, described in the EE article, AutoHotkey - Getting Started.

(3) Look for the string "P&ID:" (assuming, of course, that this string occurs only once in each PDF). Extract the string following that into a variable.

(4) Look for the string "DWG:" (again, assuming that this string occurs only once in each PDF). Extract the string following that into a variable.

(5) Create a line in a plain text file in CSV format (one for each PDF file) that has the file name, the "P&ID:" value, and the "DWG:" value. If you want to get fancier, use Excel Component Object Model (COM) calls to create an actual Excel file (XLS or XLSX) instead of a CSV file, putting the file name in column A, the "P&ID:" value in column B, and the "DWG:" value in column C (one row for each PDF).

(6) You didn't say if all 8,000+ PDF files are in the same folder or in subfolders. If the latter, then the program/script should recurse into subfolders.
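Steps (3) through (6) can be sketched roughly as follows, in Python rather than AutoHotkey (a swapped-in language). The sketch assumes PDFtoText has already written a `.txt` file next to each PDF, that each marker occurs once per file, and that each value contains no spaces; the names `value_after`, `collect_rows`, and `write_csv` are hypothetical.

```python
import csv
from pathlib import Path


def value_after(text: str, marker: str) -> str:
    """Return the first whitespace-delimited token after `marker`, or ''."""
    _, found, rest = text.partition(marker)
    if not found:
        return ""
    tokens = rest.split()
    return tokens[0].rstrip(";") if tokens else ""


def collect_rows(root: Path) -> list[list[str]]:
    """Recurse into subfolders and build [file name, P&ID, DWG] rows."""
    rows = []
    for txt in sorted(root.rglob("*.txt")):
        # errors="replace" because the encoding of the extracted text is assumed
        text = txt.read_text(encoding="utf-8", errors="replace")
        rows.append([
            txt.stem + ".pdf",
            value_after(text, "P&ID:"),
            value_after(text, "DWG:"),
        ])
    return rows


def write_csv(rows: list[list[str]], out: Path) -> None:
    """Write a header plus one row per PDF: columns A, B, C in Excel."""
    with out.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["File", "P&ID", "DWG"])
        writer.writerows(rows)
```

A file where a marker is missing gets an empty cell instead of stopping the run, which makes problem drawings easy to spot in the spreadsheet.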

As mentioned above, I would do this in the AutoHotkey language, but any programming/scripting language that can call command line tools will suffice. If you want to create an XLS/XLSX file instead of a CSV file, the language needs to have COM support, which AutoHotkey does, as discussed in this recent EE thread (note also the links in this post at that thread).

Regards, Joe

Author Closing Comment

by:Devendra Ram
ID: 41897352
I could not use the advice exactly as given; however, I found a workaround to figure out the numbering strategy. Thank you all.

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41897676
You're welcome. And thanks to you for coming back to close the question. It's always better when the asker closes the question, rather than the participants.


Question has a verified solution.
