Solved

How to extract text from an .XPS file using VBA

Posted on 2010-09-23
19
2,806 Views
Last Modified: 2013-11-27
I am connected to Corporate via VPN. I cannot install ANY software. I am trying to "screen scrape" from a web page. I can copy-paste, but I need to save up to 5 screens for each of 2,000 customers. I can print to a file, but ONLY to a .XPS file. I am further constrained in that I can only use Internet Explorer 7 (drat, no Greasemonkey).

If I change .XPS to .ZIP, then I can see all of the internal files and folders. I've tried to manually strip the text from the "Glyph" nodes I found in one of the XML files, but that was less than satisfactory (sentences were returned as individual words; all of the text was duplicated).

I am looking for a way to automate this copy-paste nightmare. I can use Office 2007 (either Access or Excel). If I knew how, I would attach to the Internet Explorer process via the Windows API, but alas, my knowledge in this area is limited.

I am a seasoned veteran of the I.T. world (meaning I remember xenix). You can throw technical stuff at me and I will not recoil from it.

I will do ANYTHING to overcome this dilemma, within the restrictions placed upon me by Corporate. And no, They will not grant me authority to do anything, As far as They are concerned, I do not exist.

I am out of the office until 6:00 PM this evening, so please do not think I am ignoring your valued responses.
0
Comment
Question by:Badotz
  • 8
  • 5
  • 4
  • +2
19 Comments
 
LVL 82

Expert Comment

by:leakim971
Comment Utility
Why Javascript ? You should replace it by a VB zone
0
 
LVL 6

Expert Comment

by:CRJ2000
Comment Utility
Could you provide a bit more detail about the structure of the web page you're trying to "screen scrape?" There are a couple methods I've used in the past for such things, but they're very dependent on the structure of the web page(s).

1) Parse the HTML source. This works if the page is truly rendered as HTML (i.e. JavaScript isn't used to change the data display dynamically).

2) Embed a WebBrowser control in a form in Excel or Access, and drive it programatically. You can reference the DOM Document (and related) objects to step through the items you care about. This is very dependent on the structure of the data on the page.

Any additional information you can offer will help us understand what approaches may work,.

Chris
0
 
LVL 74

Expert Comment

by:Jeffrey Coachman
Comment Utility
Have you looked into all/any of the apps that claim that they can do the Screen Scraping for you?

http://www.google.com/#sclient=psy&hl=en&q=screen+scraper&aq=0&aqi=g5&aql=&oq=&gs_rfai=&pbx=1&fp=8d9c50a61d5b9175

Jeff
0
 
LVL 29

Author Comment

by:Badotz
Comment Utility
leakim971:
Why JavaScript? Because if there is an IE-7 version of something like GreaseMonkey, I would try that. A quick Google search was not promising, although perhaps a bookmarklet?

CRJ2000:
After system startup, CTRL-ALT-DEL logs me into the Corporate VPN, with a user name and password.

The web page I want to scrape is a DOS program running in an <iframe> - no chrome whatsoever. I have to use the mouse to select all of the screen because CTRL-A does not work (go figure). I can use the arrow keys to move to the top of the screen and then holding down SHIFT select the screen one character across then by line. There is no accelerator key (again, go figure).

After capturing the first screen, I must press Enter to move to the next screen, where once more I use the mouse to select all of the screen because CTRL-A does not work (go figure).

I do not remember if I can view the source of this Dos beast (I do not think I can). I can view the source of the <iframe>, but honestly I do not recall seeing any of the content I am after. I will double-check today.

boag2000:
Thanks, but I cannot install any applications. I can only use what is already on the machine. Corporate pre-installs what They think we will need, and that is that.

Of course, my wanting to do this is because Corporate has put an end-of-life date on the DOS app. Consequently, we are moving all of our clients from the DOS program into a new, shiny web app. Corporate will not provide us with any sort of the data conversion (don't ask). They actually recommend that we cut-and-paste from the old into the new. Truthfully, the apps present data in such different ways that there is no straightforward way to copy and paste; we must copy into Notepad, then pick that data apart (phone numbers, driver's license numbers, dates, comments, etc) and paste each item into a specific portion of the web app. And the web app is a p.o.s. - designed by sadists with no concept of workflow. I want to yell at someone, but all we have is the Corporate Help Desk, which is staffed by troglodytes. There is no way for me to speak to the Gods in their ivory tower.

But that is my cross to bear. I am just trying to make my life a little easier.
0
 
LVL 6

Expert Comment

by:CRJ2000
Comment Utility
It sounds like there's actually an embedded control that is displaying the content -- I don't think you're really looking for something to scrape a webpage, but rather than a "window" within the webpage. Could you confirm this by looking at the source of the IFRAME and looking for an embedded object?

If it truly isn't a web page that you're trying to scrape, then you may find a macro application such as Macro Express or AutoHotKey to be of help. If you're not afraid to learn a new scripting language (it's pretty straightforward), AutoHotKey would be a good way to go on this one. It could easily automate the repetitive cut-paste stuff.

To clarify -- the "p.o.s." web app is the new one that you're putting this information into? You may be able to automate some of that work, too, using one of these programs.

Here are the links to the product websites. Macro Express is a commercial product. AutoHotKey is freeware.

http://www.macroexpress.com/
http://www.autohotkey.com/

Chris




0
 
LVL 29

Author Comment

by:Badotz
Comment Utility
Thanks, but I cannot install any applications, as much as I might like to. Corporate has a draconian policy (which we in the office AND the owner had to sign), and there are -=NO=- exceptions. The policy states:

"...the result of installing Outside Applications upon a (CORPORATION NAME HERE) desktop and/or laptop system shall result in immediate termination of the individual registered to said desktop and/or laptop system; said termination may ultimately result in revocation of the owner's reseller license."

I will investigate the environment this morning, but I will not be able to respond to any messages until after 6:00 PM.

Thanks for your help.
0
 
LVL 6

Expert Comment

by:CRJ2000
Comment Utility
I understand about the no-install policy. Anything you can tell us about what's "really" happening on the web page would be helpful in figuring out next steps.

Chris
0
 
LVL 29

Author Comment

by:Badotz
Comment Utility
The DOS app is indeed running in an <iframe>- the @SRC@ attribute points to a file with a @.DO@ extension - could be anything, since it is on the Corporate server.
0
 
LVL 29

Author Comment

by:Badotz
Comment Utility
Not sure what is up with @ instead of "...either my KB is borked or EE borked something...looks like reboot time (sigh)...
0
Do You Know the 4 Main Threat Actor Types?

Do you know the main threat actor types? Most attackers fall into one of four categories, each with their own favored tactics, techniques, and procedures.

 
LVL 6

Expert Comment

by:CRJ2000
Comment Utility
.DO is the default extension for Struts (a Java-based framework). You can find more information about it here: http://struts.apache.org/#Welcome

I'm not sure what to suggest to you at this point. It's difficult to understand what this thing really looks and feels like, and there are many constraints. Unless you can get someone in your IT department to help out a bit, I'm afraid you may be stuck with a manual process.

Chris
0
 
LVL 39

Expert Comment

by:als315
Comment Utility
Did you try to record macro from Excel or Word, which is switching to your window, copy all necessary information and paste it to Excel or Word?
Scenario can be complicated enough.
0
 
LVL 29

Author Comment

by:Badotz
Comment Utility
CRJ2000:
Yeah, I'm just hosed. There is no way for me - a mere user - to reach anyone of import in the  I.T. department, except indirectly through the Help(less) Desk.

I might investigate using the Windows API to attach to the window, but I feel that the more complicated I make the process, the less likely it is to work. Diminishing returns, and all that.

als315:
Copying into Notepad works just fine, thanks. There is no need to use Excel or Word - especially NOT Word: yuck.
0
 
LVL 39

Expert Comment

by:als315
Comment Utility
In recorded office macro you can automate process of select field-copy-paste > move to next field > select-copy-paste, etc, which can't be done in Notepad.
0
 
LVL 29

Author Comment

by:Badotz
Comment Utility
I have written VBA code that rips the five captured screens apart and stuffs a database with the results. I then prepare an HTML report from this which, after printing, is filed for reference.

While Excel might be able to automate some formatting tasks, it cannot cut from the DOS app and paste into the web app.

Here's my question to you: how, exactly, would Excel help me? Please be specific.
0
 
LVL 39

Accepted Solution

by:
als315 earned 500 total points
Comment Utility
"I am trying to "screen scrape" from a web page".
If text in your application can be selected with mouse and and keys (Shift+Arrows), you can send this keys from macro. Example from help:
Dim ReturnValue, I
ReturnValue = Shell("CALC.EXE", 1)    ' Run Calculator.
AppActivate ReturnValue     ' Activate the Calculator.
For I = 1 To 100    ' Set up counting loop.
    SendKeys I & "{+}", True    ' Send keystrokes to Calculator
Next I    ' to add each value of I.
SendKeys "=", True    ' Get grand total.
SendKeys "%{F4}", True    ' Send ALT+F4 to close Calculator.

Switch to your application, select, then copy (Ctrl+C) and paste (CTRL+V) text to word (excel)

0
 
LVL 29

Author Comment

by:Badotz
Comment Utility
Instead of CALC.EXE, how do I refer to a running instance of Internet Explorer?
0
 
LVL 39

Expert Comment

by:als315
Comment Utility
You can try to use Alt+Tab to switch from one window to other.
0
 
LVL 29

Author Closing Comment

by:Badotz
Comment Utility
I never thought about SendKeys, and for that, I thank you. And while your example wasn't the exact solution, I think I can now figure it out.

Again, thanks for the prod in the right direction!
0
 
LVL 39

Expert Comment

by:als315
Comment Utility
Thanks. It will be not so simple, but may be possible.
0

Featured Post

How to improve team productivity

Quip adds documents, spreadsheets, and tasklists to your Slack experience
- Elevate ideas to Quip docs
- Share Quip docs in Slack
- Get notified of changes to your docs
- Available on iOS/Android/Desktop/Web
- Online/Offline

Join & Write a Comment

A little background as to how I came to I design this code: Around 5 years ago I designed an add-in that formatted Excel files to a corporate standard, applying different cell colours and font type depending on whether the cells contained inputs,…
I see at least one EE question a week that pertains to using temporary tables in MS Access.  But surprisingly, I was unable to find a single article devoted solely to this topic. I don’t intend to describe all of the uses of temporary tables in t…
This Micro Tutorial will demonstrate in Google Sheets how to use the HYPERLINK function to create live links inside your spreadsheet.
This Micro Tutorial will demonstrate how to use a scrolling table in Microsoft Excel using the INDEX function.

772 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

10 Experts available now in Live!

Get 1:1 Help Now