Extracting Metadata from MS Word documents using C#.

I would like to create an application in C# (Windows Forms VS2010) that allows me to do the following:

1). Select a folder on a hard drive.
2). Extract all the Metadata in each of the MS Word documents in that folder.  

I can handle selecting the folder, but I'm having a bit of a problem coding the metadata extraction part.

I would like a bit of help with the metadata extraction code that will let me gather the different metadata details contained in each document. By metadata I mean the author, the last saved date, the created date, the modified date, the last user that saved the document, and every other piece of metadata that I can extract from the document.

Thank you.
Mr_Fulano Asked:

Karrtik Iyer (Software Architect) Commented:
Do you have the following DLLs available to reference in your C# project? Which Office version are you targeting?
Microsoft.Office.Interop.Word
Microsoft.Office.Interop.Excel
Microsoft.Office.Interop.PowerPoint
Microsoft.Office.Core (Microsoft Office 14.0 Object Library for Office 2010)
You might need System.Reflection as well.
If so, you can try something like the code below:
// Requires: using System; using System.Reflection; using Microsoft.Office.Interop.Word;
// Use the Application interface rather than ApplicationClass; ApplicationClass does not compile when interop types are embedded.
Microsoft.Office.Interop.Word.Application wordObject = new Microsoft.Office.Interop.Word.Application(); // start Word
object file = pathToFile; // path of the file to open
object nullobject = System.Reflection.Missing.Value; // placeholder for optional arguments
Microsoft.Office.Interop.Word.Document docs = wordObject.Documents.Open(file, nullobject, nullobject, nullobject,
nullobject, nullobject, nullobject, nullobject, nullobject, nullobject, nullobject, nullobject, nullobject, nullobject, nullobject, nullobject); // open the file
// Get the author name
object wordProperties = docs.BuiltInDocumentProperties;
Type typeDocBuiltInProps = wordProperties.GetType();
object Authorprop = typeDocBuiltInProps.InvokeMember("Item", BindingFlags.Default | BindingFlags.GetProperty, null, wordProperties, new object[] { "Author" }); // look up the "Author" property
Type typeAuthorprop = Authorprop.GetType();
string strAuthor = typeAuthorprop.InvokeMember("Value", BindingFlags.Default | BindingFlags.GetProperty, null, Authorprop, new object[] { }).ToString(); // read the author name
Console.WriteLine(strAuthor);
docs.Close(WdSaveOptions.wdDoNotSaveChanges, nullobject, nullobject);
((Microsoft.Office.Interop.Word._Application)wordObject).Quit(WdSaveOptions.wdDoNotSaveChanges); // otherwise WINWORD.EXE keeps running
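Since the question asks about every Word document in a folder, here is a minimal sketch of how the snippet above might be wrapped in a loop. It assumes C# 4 on .NET 4 (VS2010), a reference to Microsoft.Office.Interop.Word, and Word installed on the machine. The class name FolderMetadata, the method PrintAuthors, and the folderPath parameter are hypothetical names chosen for illustration.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;
using Word = Microsoft.Office.Interop.Word;

static class FolderMetadata
{
    // folderPath is a hypothetical parameter: the folder the user picked in the UI.
    public static void PrintAuthors(string folderPath)
    {
        // Start Word once and reuse it for every file; starting it per file is slow.
        Word.Application word = new Word.Application();
        try
        {
            foreach (string path in Directory.GetFiles(folderPath, "*.doc*"))
            {
                // Skip the "~$name.docx" lock files Word leaves next to open documents.
                if (Path.GetFileName(path).StartsWith("~$")) continue;

                Word.Document doc = word.Documents.Open(path, ReadOnly: true, AddToRecentFiles: false);
                try
                {
                    // dynamic replaces the InvokeMember reflection calls (needs .NET 4).
                    dynamic props = doc.BuiltInDocumentProperties;
                    string author = Convert.ToString(props["Author"].Value);
                    Console.WriteLine("{0}: {1}", Path.GetFileName(path), author);
                }
                finally
                {
                    // Close each document even if reading its properties failed.
                    ((Word._Document)doc).Close(Word.WdSaveOptions.wdDoNotSaveChanges);
                }
            }
        }
        finally
        {
            // Always shut Word down, or WINWORD.EXE stays running in the background.
            ((Word._Application)word).Quit(Word.WdSaveOptions.wdDoNotSaveChanges);
            Marshal.ReleaseComObject(word);
        }
    }
}
```

A call such as FolderMetadata.PrintAuthors(selectedFolder) would then print one line per document. The try/finally blocks are there so that a single unreadable file does not leave a hidden Word process behind.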
Karrtik Iyer (Software Architect) Commented:
If you don't have the Office DLLs available to reference from your C# code, you can try the method illustrated in this Microsoft article. The article's code is C++, but similar code can be written in C# using COM interop.
https://support.microsoft.com/en-us/kb/186898
Mr_Fulano (Author) Commented:
Hi Karrtik, first and foremost, thank you for the great suggestions. I will have to wait until tomorrow to get to the office to make sure I have the DLLs you've specified. I think I might, but want to be sure.

One thing in the code you suggested that I don't understand is the part with all the "null objects", like what you listed below. What are all the null objects for?

Microsoft.Office.Interop.Word.Document docs = wordObject.Documents.Open(file, nullobject, nullobject, nullobject,
nullobject, nullobject, nullobject, nullobject, nullobject, nullobject, nullobject, nullobject, nullobject,  nullobject, nullobject, nullobject);//open the file

I'll try your suggestions in the AM and get back to you.

Thank you VERY much for a push in the right direction.
Fulano
Mr_Fulano (Author) Commented:
BTW Karrtik, to answer your other question...I was hoping that the code could work with different MS Office versions. I'm beginning to believe it may not...
Karrtik Iyer (Software Architect) Commented:
Hi Mr Fulano,
The null object parameters stand in for the optional parameters of the Open method. Check the documentation below for more details:
https://msdn.microsoft.com/en-us/library/microsoft.office.interop.word.documents.open.aspx
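As a hedged illustration of what those placeholders do: with the C# 4 compiler that ships with VS2010, COM calls accept named and optional arguments, so the Missing.Value placeholders can usually be left out. The sketch below assumes a reference to Microsoft.Office.Interop.Word; the class OpenExamples and the method OpenReadOnly are hypothetical names.

```csharp
using Word = Microsoft.Office.Interop.Word;

static class OpenExamples
{
    // Opens a document read-only. Every argument that is not named here is
    // filled in by the compiler, which is what the sixteen-argument call
    // does by hand with Missing.Value.
    public static Word.Document OpenReadOnly(Word.Application word, string path)
    {
        return word.Documents.Open(path, ReadOnly: true);
    }
}
```

On older compilers (C# 3 and earlier) each argument has to be passed explicitly with ref, which is why sample code from that period carries the long run of placeholder objects.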

Also check out the COM approach shown in the Microsoft article from my second response; it might give you version-independent code.
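For the newer .docx format there is another route that does not need Word installed at all. This is a different technique from the structured-storage code in the article, and it does not work for legacy .doc files. The sketch below uses System.IO.Packaging from WindowsBase.dll (part of the .NET Framework); the class OpenXmlMetadata and the method PrintCoreProperties are hypothetical names.

```csharp
using System;
using System.IO;
using System.IO.Packaging; // add a reference to WindowsBase.dll

static class OpenXmlMetadata
{
    // Reads the core properties stored inside a .docx package.
    // Works only for Open XML files (.docx, .docm), not for binary .doc files.
    public static void PrintCoreProperties(string docxPath)
    {
        using (Package package = Package.Open(docxPath, FileMode.Open, FileAccess.Read))
        {
            PackageProperties p = package.PackageProperties;
            Console.WriteLine("Title:            " + p.Title);
            Console.WriteLine("Author:           " + p.Creator);
            Console.WriteLine("Last modified by: " + p.LastModifiedBy);
            Console.WriteLine("Created:          " + p.Created);
            Console.WriteLine("Modified:         " + p.Modified);
            Console.WriteLine("Revision:         " + p.Revision);
        }
    }
}
```

Because this reads the file directly, it behaves the same regardless of which Office version, if any, is on the machine.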
Please let me know if you need further help.
Thanks,
Karrtik
Mr_Fulano (Author) Commented:
Hi Karrtik, I worked through the code you provided, found the website that the code came from, and was able to make it work. Below is the solution. -- I do however have one last question related to your solution. --- How do I get the rest of the properties? In other words, how do I know what they're called? In the code below, the Author property was referred to as "Authorprop." Is that just an arbitrary name for the object or does it have to be specifically that name?

I will award you the points once I hear back from you. -- Thank you again!

Additional Source code examples: https://wditot.wordpress.com/2012/05/10/extracting-metadata-from-ms-office-docs-programmatically/


My working code:

try
{
    Microsoft.Office.Interop.Word.Application wordObject = new Microsoft.Office.Interop.Word.Application();     //create word app class object
    object file = @"\\Test Folder\TestDOC.docx";                                               //this is the path to file to open
    object nullobject = System.Reflection.Missing.Value;

    Microsoft.Office.Interop.Word.Document docs = wordObject.Documents.Open(file, nullobject, nullobject, nullobject, nullobject,
        nullobject, nullobject, nullobject, nullobject, nullobject, nullobject, nullobject, nullobject, nullobject, nullobject, nullobject);    //open the file

    //Get Author Name
    object wordProperties = docs.BuiltInDocumentProperties;

    Type typeDocBuiltInProps = wordProperties.GetType();

    Object Authorprop = typeDocBuiltInProps.InvokeMember("Item", BindingFlags.Default | BindingFlags.GetProperty, null, wordProperties, new object[] { "Author" }); //query for author properties
    Type typeAuthorprop = Authorprop.GetType();

    string strAuthor = typeAuthorprop.InvokeMember("Value", BindingFlags.Default | BindingFlags.GetProperty, null, Authorprop, new object[] { }).ToString();        //get author name

    Console.WriteLine("The document's author is : " + strAuthor);

    //docs.Close(WdSaveOptions.wdDoNotSaveChanges, nullobject, nullobject);
    ((Microsoft.Office.Interop.Word._Application)wordObject).Quit(WdSaveOptions.wdDoNotSaveChanges);
}
catch (Exception j)
{
    Console.WriteLine(j.Message);
}

Karrtik Iyer (Software Architect) Commented:
Hi Mr Fulano,
Authorprop itself is just a variable name and can be anything. What is not arbitrary is the string "Author" passed to Item in the code below; that is the key that selects the property.
Object Authorprop = typeDocBuiltInProps.InvokeMember("Item", BindingFlags.Default | BindingFlags.GetProperty, null, wordProperties, new object[] { "Author" });//query for author properties
The DocumentProperties collection returned by docs.BuiltInDocumentProperties holds name/value pairs for all the built-in properties of the Word document. The key is the property name.
The Microsoft support article (https://support.microsoft.com/en-us/kb/186898) from my second comment, which has the C++ COM-based code, contains the array below listing the built-in document properties of a Word document.
You can use the names (the first field of the struct) as a guide to the built-in properties; they are predefined for Word documents, and some are common across all Office documents. The second field, pidsi, is the property identifier that the C++ COM code passes to the property set API (the VT_ comments show each property's data type); for C# code you can ignore it for now. Be aware that the strings Word's own collection accepts may differ from these labels (for example "Last author" and "Creation date"), so confirm a name against the collection before relying on it.
struct pidsiStruct {
   const char *name;   // display label used by the sample
   long pidsi;         // property identifier (PIDSI_*)
} pidsiArr[] = {
   {"Title",            PIDSI_TITLE}, // VT_LPSTR
   {"Subject",          PIDSI_SUBJECT}, // ...
   {"Author",           PIDSI_AUTHOR},
   {"Keywords",         PIDSI_KEYWORDS},
   {"Comments",         PIDSI_COMMENTS},
   {"Template",         PIDSI_TEMPLATE},
   {"LastAuthor",       PIDSI_LASTAUTHOR},
   {"Revision Number",  PIDSI_REVNUMBER},
   {"Edit Time",        PIDSI_EDITTIME}, // VT_FILETIME (UTC)
   {"Last printed",     PIDSI_LASTPRINTED}, // ...
   {"Created",          PIDSI_CREATE_DTM},
   {"Last Saved",       PIDSI_LASTSAVE_DTM},
   {"Page Count",       PIDSI_PAGECOUNT}, // VT_I4
   {"Word Count",       PIDSI_WORDCOUNT}, // ...
   {"Char Count",       PIDSI_CHARCOUNT},
   {"Thumbnail",        PIDSI_THUMBNAIL}, // VT_CF
   {"AppName",          PIDSI_APPNAME}, // VT_LPSTR
   {"Doc Security",     PIDSI_DOC_SECURITY}, // VT_I4
   {0, 0}
};
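Rather than relying on a fixed list of names, one option is to ask the collection itself. The sketch below walks docs.BuiltInDocumentProperties and prints each name with its value, which also shows the exact strings that Word accepts as keys. It assumes C# 4 on .NET 4 (for dynamic) and a reference to Microsoft.Office.Interop.Word. The names PropertyDump and PrintBuiltInProperties are hypothetical, and catching COMException for unset values is an assumption about how the failure surfaces.

```csharp
using System;
using System.Runtime.InteropServices;
using Word = Microsoft.Office.Interop.Word;

static class PropertyDump
{
    // Prints every built-in property of an already opened document.
    public static void PrintBuiltInProperties(Word.Document doc)
    {
        dynamic props = doc.BuiltInDocumentProperties;
        foreach (dynamic prop in props)
        {
            string name = prop.Name;
            try
            {
                string value = Convert.ToString(prop.Value);
                Console.WriteLine("{0} = {1}", name, value);
            }
            catch (COMException)
            {
                // Some properties have no value for a given document, and
                // reading Value then fails; report them instead of stopping.
                Console.WriteLine("{0} = (not available)", name);
            }
        }
    }
}
```

Running this once against a sample document is a quick way to build the list of property names to use with the Item lookup from the earlier code.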

Please note that a document may contain custom properties as well. To retrieve one, we need to know the name of the custom property.
In a way, Microsoft.Office.Interop.Word is a .NET wrapper over the COM (OLE) interfaces that the Office binaries provide.
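When the custom property names are not known in advance, they can be discovered by enumerating the document's CustomDocumentProperties collection. This is a minimal sketch under the same assumptions as the rest of the thread (C# 4 on .NET 4, a reference to Microsoft.Office.Interop.Word); the class CustomProperties and the method Read are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using Word = Microsoft.Office.Interop.Word;

static class CustomProperties
{
    // Returns the custom properties of an open document as name/value strings.
    public static Dictionary<string, string> Read(Word.Document doc)
    {
        var result = new Dictionary<string, string>();
        dynamic custom = doc.CustomDocumentProperties;
        foreach (dynamic prop in custom)
        {
            string name = prop.Name;
            string value = Convert.ToString(prop.Value);
            result[name] = value;
        }
        return result;
    }
}
```

A document with no custom properties simply yields an empty dictionary.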
Thanks,
Karrtik
Mr_Fulano (Author) Commented:
Thank you Karrtik, very helpful indeed!!!
Mr_Fulano (Author) Commented:
EXCELLENT!!!