Displaying Microsoft Word Documents in a Web Page as Images Programmatically

Bruce Smith, Software Engineer II
This article explains how to display the first page of your Microsoft Word documents (.doc, .docx, etc.) as images in a web page programmatically. I scoured the web for a way to do this without success. The goal is to produce something similar to this list of resume templates: http://office.microsoft.com/en-us/templates/CT010144894.aspx

A functional example of this article can be found here: http://www.patsmitty.com/gview/word_image.php
The ENTIRE source code is attached as a .zip below.

Article Disclaimer: This article is somewhat advanced and DOES NOT cover or explain any of the HTML, CSS, PHP functions, and jQuery resources I have used here. It DOES detail and explain how to get a URL for the "src" attribute of an image of your msword docs. If you have questions, please comment here, as I'd love to help answer them.

Take note that this solution is a hack and relies on Google docs. So if Google docs changes or goes away... so does this solution! Also this solution is fairly complex, so allow about an hour to digest it. All of my source files will be attached down at the bottom.

This solution starts off with the Google docs application. Let's say you have a msword document at http://www.myserver.com/test.docx. If you navigate to http://docs.google.com/gview?url=http://www.myserver.com/test.docx, you will be able to view that document in the 'bulky' Google docs viewer. I choose the word 'bulky' because all I want is an image of the document; I'm not interested in zooming in or out, looking at all the pages, etc. If we look closer at the Google docs application and right-click anywhere inside the document, we see the browser's standard image context menu.

(Image: Standard Image Context Menu)

From this we can see that Google docs actually generates an image of each page of the msword document. This is exactly what I'm after! When we select "view image" from that context menu, we see the image with the image's URL in the address bar. Let's look at the parameters in the URL:
url - this is the actual URL of the msword document
docid - this is some generated id of the image
a - I don't know what this is, but it always has the same value (as far as I've tested...)
pagenumber - the page number of the msword document (in this tutorial, it'll always be 1...)
w - the width of the image in pixels
There it is! That's the URL that we need to generate programmatically. You might ask how we're going to accomplish this if Google docs randomly generates the docid (we already have the doc's URL, the page number, the width, and whatever the 'a' is; all we need is the docid). The answer lies in web-page scraping. The rest of this article explains in detail how we're going to obtain this URL programmatically for the first pages of our msword documents.
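
To make the target concrete, the two URLs discussed above can be assembled from their parameters. This is only a sketch: the helper names buildViewerUrl and buildImageUrl are hypothetical and not part of the attached source, the value a=bi is taken from the doURL code shown later in this article, and the docid still has to be scraped as the following steps describe. Unlike the article's own code, this sketch percent-encodes the document address, on the assumption that the viewer accepts an encoded url parameter.

```php
<?php
// Sketch only: helper names are hypothetical, not part of the attached source.

// URL of the Google docs viewer page for a publicly reachable document.
function buildViewerUrl($fileUrl)
{
    // http_build_query() percent-encodes the value (assumed to be accepted by gview).
    return 'http://docs.google.com/gview?' . http_build_query(array('url' => $fileUrl), '', '&');
}

// URL of the rendered page image, given a docid that was already scraped.
function buildImageUrl($fileUrl, $docId, $width, $pageNumber = 1)
{
    $params = array(
        'url'        => $fileUrl,     // address of the msword document
        'docid'      => $docId,       // id generated by Google docs
        'a'          => 'bi',         // constant value used by doURL later on
        'pagenumber' => $pageNumber,  // always 1 in this tutorial
        'w'          => $width,       // image width in pixels
    );
    return 'http://docs.google.com/gview?' . http_build_query($params, '', '&');
}

echo buildImageUrl('http://www.myserver.com/test.docx', '93c3e45e33f8c096913511ef8fe32a92', 125), "\n";
```

Building the query from an array keeps each parameter visible by name, which makes it easier to see that docid is the only value we cannot supply ourselves.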

For this article's purposes I have multiple msword documents in a directory on my website located at http://www.patsmitty.com/gview/word_documents/. It doesn't matter what you call your documents, as we will obtain them programmatically via PHP.

One more thing before we start: my example contains a js prototype progress bar and a frame for the images, so there are "additional" source files and "extraneous" code as well.

1. Scrape docs.google.com for the docid parameter


Download the PHP Simple HTML DOM Parser here: http://sourceforge.net/projects/simplehtmldom/files/simplehtmldom/1.5/simplehtmldom_1_5.zip/download
Create a blank php file and title it "scrapeIt.php".
Include() or require() the DOM Parser.
Create a function called getImageURL that takes 2 parameters: $file_url, $thumb_width.
This function contains 2 lines of code and returns the URL of the image we need.
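function getImageURL($file_url, $thumb_width) {
    // Load the HTML of the Google docs viewer page for this document
    $html = file_get_html('http://docs.google.com/gview?url=' . $file_url);
    // Cut the raw URL (with the docid) out of the page, then clean it up
    return doUrl(html_to_URL($html, "{svUrl:'", "46chan"), $thumb_width);
}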


The first line loads the HTML of the Google docs page that displays your msword document. The second line gleans the docid. Unfortunately, we can't just use jQuery or this parser to grab the image's src: the image element is generated programmatically by Google docs, so it won't show up in the page source. The information we need does show up inside a JavaScript variable in the source, though. It looks messy, so next we will run some string manipulations on it and retain the pertinent parts (the docid).
Create the function that the second line in the snippet above refers to: html_to_URL, which takes $string, $start, and $end as parameters. The $string parameter is the page HTML, which contains the long URL generated by Google docs plus some extraneous stuff that we trim away using the positions of $start and $end, which are "{svUrl:'" and "46chan". Both strings appear in a js function called by Google's gview app. Below is Google's js function code with the URL for my document:
<script type="text/javascript">
                                  
                                  function finalizeApp() {
                                    if (!gviewApp) {
                                      return;
                                    }
                        
                                    
                                    gviewApp.setDisplayData(
                                      {svUrl:'?url\75http://www.patsmitty.com/gview/word_documents/test.docx\46docid\7593c3e45e33f8c096913511ef8fe32a92\46chan\75DwAAAM2xz0nJvnNLiNHc/RtPoug%3D\46a\75sv',biUrl:'?url\75http://www.patsmitty.com/gview/word_documents/test.docx\46docid\7593c3e45e33f8c096913511ef8fe32a92\46chan\75DwAAAM2xz0nJvnNLiNHc/RtPoug%3D\46a\75bi',chanId:'DwAAAM2xz0nJvnNLiNHc/RtPoug\075',gpUrl:'http://doc-0k-8g-docsviewer.googleusercontent.com/viewer/securedownload/dsn1aovipa7l846lsfcf94nedj8q2p4u/vgceh33q6a9930abfmgliebnltb0nljm/1312587900000/dXJs/AGZ5hq8BgbJY1gwaOYx83cPOdNw6/aHR0cDovL3d3dy5wYXRzbWl0dHkuY29tL2d2aWV3L3dvcmRfZG9jdW1lbnRzL3BhdF9zbWl0aC5kb2N4?a\75gp\46filename\75test.docx\46chan\75DwAAAM2xz0nJvnNLiNHc/RtPoug%3D\46docid\7593c3e45e33f8c096913511ef8fe32a92\46sec\75AHSqidYnfLc11kcuxtjsvPOT1apoyI52utATnDA0dbZG7oiQ3GYFpmaw454_bppEvls9ZMLaqb-V',docId:'93c3e45e33f8c096913511ef8fe32a92',numPages:1,gtUrl:'?url\75http://www.patsmitty.com/gview/word_documents/test.docx\46docid\7593c3e45e33f8c096913511ef8fe32a92\46chan\75DwAAAM2xz0nJvnNLiNHc/RtPoug%3D\46a\75gt',thWidth:138,dlUrl:'http://www.patsmitty.com/gview/word_documents/test.docx',thHeight:179});
                                    gviewApp.finalizeApp();
                      
                                    
                                    gviewApp.loadLateDeps();
                                  }
                                  gviewApp.setProgress(90);
                                  finalizeApp();
                                  
                                    window.jstiming.load.tick('prt');
                                  
                                </script>


Notice on line 10 (the svUrl value passed to setDisplayData) that the docid parameter sits between the 2 strings noted above. So inside the html_to_URL function we're going to single out the following from the example above:
?url\75http://www.patsmitty.com/gview/word_documents/test.docx\46docid\7593c3e45e33f8c096913511ef8fe32a92
Here is the code that goes inside this function to accomplish this result:
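$string = " ".$string;                      // pad so a match at the very start is not position 0
$ini = strpos($string,$start);
if ($ini == 0) return "";                   // $start was not found
$ini += strlen($start);                     // move past the start marker
$len = strpos($string,$end,$ini) - $ini;    // distance from there to the end marker
$final = substr($string,$ini,$len-1);       // -1 drops the backslash in front of "46chan"
return $final;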
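
To see the extraction at work, the body above can be wrapped in its function signature and run against a shortened stand-in for the page source. This is a sketch: the $sample string is cut down from the script shown earlier (the chan value is truncated and most fields are left out), so it illustrates the string handling only and is not real viewer output.

```php
<?php
// The article's snippet wrapped in its function signature.
function html_to_URL($string, $start, $end)
{
    $string = " " . $string;                    // pad so a match at offset 0 is not read as "not found"
    $ini = strpos($string, $start);
    if ($ini == 0) return "";                   // start marker missing
    $ini += strlen($start);                     // first character after the start marker
    $len = strpos($string, $end, $ini) - $ini;  // length up to the end marker
    return substr($string, $ini, $len - 1);     // -1 drops the backslash before "46chan"
}

// Shortened stand-in for the page source (assumption: the real page is much longer).
$sample = "gviewApp.setDisplayData({svUrl:'?url\\75http://www.patsmitty.com/gview/word_documents/test.docx"
        . "\\46docid\\7593c3e45e33f8c096913511ef8fe32a92\\46chan\\75DwAAAM\\46a\\75sv',numPages:1});";

echo html_to_URL($sample, "{svUrl:'", "46chan"), "\n";
```

Because the end marker is "46chan" rather than "\46chan", the match stops one character late, which is why the function subtracts 1 from the length.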


Looking at my example URL above, we can see a couple of things that need fixing. It starts with "?url\75http://www...". This is the url parameter of the final image URL, so the beginning needs to look like this instead: "?url=http://www...". A little later we see "\46". This is supposed to be an ampersand introducing the next parameter, which is the most important one: docid. (\75 and \46 are JavaScript octal escapes for "=" and "&".) The last function takes care of this and makes the generated URL usable. It is called doURL and takes 2 parameters, $url_final and $width. $url_final is the raw URL quoted above. This function will turn
?url\75http://www.patsmitty.com/gview/word_documents/test.docx\46docid\7593c3e45e33f8c096913511ef8fe32a92
into
?url=http://www.patsmitty.com/gview/word_documents/test.docx&docid=93c3e45e33f8c096913511ef8fe32a92
Here is the code for doURL:
$url_final = str_replace("\\75", "=", $url_final);   // \75 becomes "="
$url_final = str_replace("\/", "%2F", $url_final);   // an escaped slash becomes %2F
$url_final = str_replace("\\46", "&", $url_final);   // \46 becomes "&"
$url_final = "http://docs.google.com/gview" . $url_final;
$url_final = $url_final . "&a=bi&pagenumber=1&w=" . $width;
return $url_final;


Notice lines 4 and 5, the last two assignments. They add "http://docs.google.com/gview" to the beginning and "&a=bi&pagenumber=1&w=" followed by $width to the end. Our URL now looks exactly like the one we get from "view image" or "copy image location" in the context menu when we right-click on the image of the first page of our msword document in the Google docs viewer!
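
As a worked example, here is the doURL body wrapped in a function and applied to the raw string from the previous step. The second function, decodeJsOctalEscapes, is a hypothetical helper that is not in the attached source: it sketches a more general way to decode JavaScript octal escapes, assuming the legacy rule that a three-digit escape must start with 0 to 3, so that "\75" followed by a docid digit is not misread as one longer escape.

```php
<?php
// The doURL snippet wrapped in a function (PHP function names are case-insensitive).
function doUrl($url_final, $width)
{
    $url_final = str_replace("\\75", "=", $url_final);   // \75 is the octal escape for "="
    $url_final = str_replace("\/", "%2F", $url_final);   // escaped slashes, if any
    $url_final = str_replace("\\46", "&", $url_final);   // \46 is the octal escape for "&"
    $url_final = "http://docs.google.com/gview" . $url_final;
    return $url_final . "&a=bi&pagenumber=1&w=" . $width;
}

// Hypothetical alternative: decode any octal escape instead of listing each one.
function decodeJsOctalEscapes($text)
{
    return preg_replace_callback(
        '/\\\\([0-3][0-7]{2}|[0-7]{1,2})/',               // backslash plus 1 to 3 octal digits
        function ($m) { return chr(octdec($m[1])); },     // octal value to character
        $text
    );
}

$raw = '?url\75http://www.patsmitty.com/gview/word_documents/test.docx\46docid\7593c3e45e33f8c096913511ef8fe32a92';
echo doUrl($raw, 125), "\n";
echo decodeJsOctalEscapes($raw), "\n";
```

The explicit str_replace calls are enough for the two escapes that appear in svUrl; the general decoder would only matter if Google docs started escaping other characters.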

We're not done though. For some reason, when we loaded the HTML from the Google docs page, the docid was generated but the image was not. This is probably because something in Google's gview app assumes that the root directory is "docs.google.com" and not another site like mine, "www.patsmitty.com". So when you try to take that URL and view it, an error 400 springs up. No problem: the next step explains how to get around this.

2. Force Google docs to load your image


In my example I have an upload form that uploads the msword document and then gets the image's URL via the methods described above. Next I have to get Google docs' gview app to actually render the image before I can use the URL without getting a 400. To do this I load the Google docs viewer URL into the src of a hidden iframe and make the js wait 5 seconds, which gives Google docs ample time to generate the image of my document. Once that finishes I call my final script, which ties everything together. Create a new file called "getImages.php". It goes through the documents in the given directory and generates the URL for each one. This is a bit redundant; a more efficient way would be to store the URLs in an XML file when the documents are uploaded. But this is doable. Here is the code for this file:
<?php
include('msword.php');
if ($handle = opendir('word_documents')) {
    echo '<table>';
    $c = 0; // cells written to the current row
    while (false !== ($file = readdir($handle))) {
        // Keep files whose name contains ".doc" (covers .doc and .docx)
        if (strripos($file, ".doc") !== false):
            if ($c == 4): // four thumbnails per row
                echo '</tr>';
                $c = 0;
            endif;
            if ($c == 0):
                echo '<tr>';
            endif;
            $url = getImageUrl('http://www.patsmitty.com/gview/word_documents/'.$file, '125');
            $click_url = 'http://docs.google.com/gview?url=http://www.patsmitty.com/gview/word_documents/'.$file;
            echo '<td><div style="padding-left:5px;padding-top:5px;" class="imgdiv"><img class="docs" onmouseover="this.style.cursor=\'pointer\'" onclick="window.location=\''.$click_url.'\'" src="'.$url.'" /></div></td>';
            $c++;
        endif;
    }
    if ($c > 0): // close the last row if one is still open
        echo '</tr>';
    endif;
    echo '</table>';
    closedir($handle);
}

?>
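
The hidden iframe step described before the listing is not shown in the article text, so here is one way it might look when generated from PHP. This is a sketch under assumptions: renderPrimingIframe and the JavaScript callback name showImages are hypothetical, and the real implementation in the attached .zip may differ. It prints a hidden iframe pointing at the viewer page and a script that calls the callback after the 5 second wait.

```php
<?php
// Hypothetical helper: HTML for a hidden iframe that makes the browser open the
// viewer page, plus a script that calls a JavaScript callback after a delay.
function renderPrimingIframe($fileUrl, $callbackName = 'showImages', $delayMs = 5000)
{
    $viewerUrl = 'http://docs.google.com/gview?url=' . $fileUrl;
    $src = htmlspecialchars($viewerUrl, ENT_QUOTES);                   // safe inside an HTML attribute
    $callback = preg_replace('/[^A-Za-z0-9_$.]/', '', $callbackName);  // keep it a plain identifier
    return '<iframe src="' . $src . '" style="display:none"></iframe>' . "\n"
         . '<script type="text/javascript">setTimeout(' . $callback . ', ' . (int) $delayMs . ');</script>';
}

echo renderPrimingIframe('http://www.patsmitty.com/gview/word_documents/test.docx'), "\n";
```

The callback would then request getImages.php (for example via an AJAX call) once Google docs has had time to render the page image.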


That's It!!!
We now have images of our Word documents that we can show on our page. Take note that Google docs supports formats other than .doc and .docx, such as PDFs and spreadsheets, so these should be feasible with my hack as well. I have only tested .doc, .docx, and .pdf. Also, please look at my source code, as it shows how to connect all these functions together; I am not explaining any of the HTML, parsing, CSS, or other PHP functions used here, only my hack, so you too can enhance your web applications. Also remember to change all the references from "www.patsmitty.com/gview/..." to your respective servers.
If you have any questions, please post here or email me at psmith@patsmitty.com and I'll post the answers here.
gview.zip