Extracting Email Addresses from MS Word Documents Automatically

We have thousands of MS Word documents. Each one has Contact details for individual customers.

In the old days we used them as a rudementary ticketing system, each .doc was a ticket containging amoungst other things teh particular clines email addres, phone number, postal  address, etc.

I need to be able to extract the email address out of each .doc document in order to insert them into our new crm database system. It would take me days or weeks to open each ms word document individually.  Is there a product that can extract email address from either MS word documents, or from every file in a particular ms windows XP folder. ( same thing really )

I had though that http://www.mapilab.com had a product that might do this but I can no longer see it on the website.

Thanks,
Robert.

Thank you for your help

Robert
IP4IT StaffAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

ErezMorCommented:
are the docs constructed in a consistent format?
that is, is there an "email: " or some other known string that you can scan for , then extract the text following it?
if so, you can build a vb/vba code that will do the job (open each word file in a given folder, search for the string "Email: " or whatever, then copy what's following it to whereve you want, like an access table or an excel sheet)
NerdsOfTechTechnology ScientistCommented:
The easiest way would be to write a program that opens the .doc's for you and extracts the info.

Idea: Maybe have an excel spreadsheet run the VBA code to open the DIRECTORY containing of all of the .doc's and loop through each one --- finding the email address and outputing the email address(es) from each doc in a gridlike style (plus other fields like names, etc).


Jeffrey CoachmanMIS LiasonCommented:
<No Points wanted>

robbiebreslin,

Following up on NerdsOfTech's comment...

Please click the "Request Attention" link and ask that the MS Word Zone be added to this question.

The Experts there may be able to assist with the "Loop/Extraction"

;-)

JeffCoachman
OWASP: Avoiding Hacker Tricks

Learn to build secure applications from the mindset of the hacker and avoid being exploited.

IP4IT StaffAuthor Commented:
The ideas are good, Sould like I will need the services of a programmer, I do not6 have these skills or budget.

Thank you boaq2000, I will request that the MS Word Zone be added.

Doug BishopDatabase DeveloperCommented:
Per ErezMor's question: are the docs constructed in a consistent format? that is, is there an "email: " or some other known string that you can scan for , then extract the text following it?

What else do you want from the .doc file other than an email address? Is there other information (i.e. ticket number, date opened, etc.). Seems like a list of email addresses in and of itself is pretty much worhtless unless you are in the spam industry. Can you post a sample document?
jrb1senior developerCommented:
Put this procedure in your Normal document.  Change the pDir directory to point to the directory you want to scan.  Open up a new word document and run the procedure.  It will read through all of the *.doc files in the directory and copy the email addresses to the new word document.

Public Sub GetEmailAddresses()
Dim pWordApp As Object
Dim pWordDoc As Object
Dim pDir As String
Dim pFileName As String

Set pWordApp = CreateObject("word.application")
pWordApp.Visible = True

pDir = "c:\yourdirectory\"
pFileName = Dir(pDir + "*.doc")

Do Until Len(pFileName) = 0
  Set pWordDoc = pWordApp.Documents.Open(pDir & pFileName)
 
  With pWordApp.Selection.Find
    .Text = "[a-zA-Z0-9\-_.]{1,}\@[a-zA-Z0-9\-_.]{1,}"
    .Replacement.Text = ""
    .Forward = True
    .Wrap = wdFindContinue
    .Format = False
    .MatchCase = False
    .MatchWholeWord = False
    .MatchAllWordForms = False
    .MatchSoundsLike = False
    .MatchWildcards = True
  End With
 
  pWordApp.Selection.Find.Execute
 
  Do While pWordApp.Selection.Find.Found
     pWordApp.Selection.Copy
     Selection.Paste
     Selection.InsertParagraphAfter
     Selection.EndKey Unit:=wdStory
     pWordApp.Selection.EndKey Unit:=wdLine
 
    With pWordApp.Selection.Find
      .Text = "[a-zA-Z0-9\-_.]{1,}\@[a-zA-Z0-9\-_.]{1,}"
      .Replacement.Text = ""
      .Forward = True
      .Wrap = wdFindStop
      .Format = False
      .MatchCase = False
      .MatchWholeWord = False
      .MatchAllWordForms = False
      .MatchSoundsLike = False
      .MatchWildcards = True
    End With
    pWordApp.Selection.Find.Execute
  Loop

  pWordDoc.Close
  Set pWordDoc = Nothing
 
  pFileName = Dir
Loop

  pWordApp.Quit
  Set pWordApp = Nothing
End Sub

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
BrainBCommented:
Here is some code which I think will do what you want. Please read the instructions at the top.
Basically you hard-code a folder into the 'BaseFolder' variable and run the macro which searches that folder and its subfolders for .doc files. Opens each doc file, and extracts any email addresses it finds. The addresses go into the word document which is active at the time you run the macro.

Please run it on some test files & check in case it misses addresses with unusual formats.

I am attaching my test file containing the code.

'=============================================================================
'- MS WORD MACRO
'- EXTRACT EMAIL ADDRESSES FROM ALL .DOC FILES IN A FOLDER & ITS SUBFOLDERS
'- Puts results into the active document
'- NB. Requires VB Editor : Tools\Reference\ ..."Microsoft VBScript Regular Expressions"
'- Set the BaseFolder variable & run from where you want the results to go.
'-----------------------------------------------------------------------------
'- Brian Baulsom November 2010
'=============================================================================
'- 1. Get Files of specified base file - process documents
'- 2. Get subfolders & files of specified base folder - process documents
'=============================================================================
'- *** SET THE BASE FOLDER HERE ***********
Const BaseFolder As String = "F:\Test\"
'=============================================================================
Dim ResultsDoc As Word.Document        ' active document
'-----------------------------
Dim FSO As Object
Dim TempDoc As Word.Document           ' opened document
Dim MyFolderName As String
Dim MyFileName As String
Dim DocContents As String
Dim FolderCount As Long
Dim FileCount As Long
'-----------------------------
Dim MyRegExp As Object
Dim MyPattern As String
Dim MyMatches As Variant
Dim MyMatch As String
Dim MatchCount As Integer
'-----------------------------------------------------------------------------

'=============================================================================
'- GET FILES & SUBFOLDERS OF SPECIFIED BASE FOLDER
'=============================================================================
Sub GET_DOCUMENTS()
    Set ResultsDoc = ActiveDocument
    rsp = MsgBox("Base folder :=  " & BaseFolder & vbCr _
        & "Delete contents of " & ActiveDocument.Name & " ?" & vbCr _
        & "Answer 'No' to add to the existing list", vbYesNoCancel)
    If rsp = vbCancel Then End
    If rsp = vbYes Then ResultsDoc.Content.Delete
    '-------------------------------------------------------------------------
    Application.ScreenUpdating = False
    FolderCount = 0
    FileCount = 0
    '-------------------------------------------------------------------------
    '- FILE SYSTEM OBJECT
    Set FSO = CreateObject("Scripting.FileSystemObject")
    '-------------------------------------------------------------------------
    '- REGULAR EXPRESSION TO EXTRACT EMAIL ADDRESSES
    Set MyRegExp = CreateObject("VbScript.RegExp")
    MyPattern = "[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"
    '-------------------------------------------------------------------------
    '- CALL FILE SUBROUTINE FOR BASE FOLDER
    Application.StatusBar = BaseFolder
    GetFileList (BaseFolder)
    '-------------------------------------------------------------------------
    '- CALL FOLDER SUBROUTINE (WHICH CALLS THE FILE ROUTINE)
    GetFolderList (BaseFolder)
    '-------------------------------------------------------------------------
    Application.ScreenRefresh
    Application.ScreenUpdating = True
    MsgBox ("Done" & vbCr & FolderCount & " Folders" & vbCr _
            & FileCount & " files")
End Sub
'========== END OF MAIN ROUTINE ==============================================

'=============================================================================
'- SUBROUTINE 1 : GET SUBFOLDERS OF SPECIFIED FOLDER
'=============================================================================
Private Sub GetFolderList(FolderSpec)
    Dim f, f1, fc, s
    Set f = FSO.GetFolder(FolderSpec)
    Set fc = f.SubFolders
    '-----------------------------------------------------------------------
    '- CHECK SUBFOLDER COUNT
    If fc.Count = 0 Then
        Exit Sub
    Else
        '- LOOP FOLDERS
        For Each f1 In fc
            FolderName = f1.Path
            GetFileList (FolderName)
            '----------------------------------------------------------------
            '- CALL SELF TO GET ANY SUBFOLDERS IN THIS SUBFOLDER
            GetFolderList (FolderName)
            '----------------------------------------------------------------
            Application.ScreenRefresh
        Next
    End If
    '-------------------------------------------------------------------------
End Sub
'=========== END OF SUB 1 ====================================================

'=============================================================================
'- SUBROUTINE 2 : TO GET FILES FROM A FOLDER
'=============================================================================
Private Sub GetFileList(FileSpec)
    Dim f, fl, fc, Spec
    Set f = FSO.GetFolder(FileSpec)
    Set fc = f.Files
    '-------------------------------------------------------------------------
    '- CHECK FILE COUNT
    If fc.Count = 0 Then
        Exit Sub
    Else
        FolderCount = FolderCount + 1
        '---------------------------------------------------------------------
        '- LOOP FILES
        For Each fl In fc
            MyFileName = UCase(fl.Name)
                '- ignore temp files. get .doc's
            If InStr(1, MyFileName, "~", vbTextCompare) = 0 _
                    And MyFileName <> UCase(ThisDocument.Name) _
                    And Right(MyFileName, 4) = ".DOC" Then
                MyFolderName = FileSpec
                GET_ADDRESSES
            End If
        Next
        '---------------------------------------------------------------------
    End If
End Sub
'=============================================================================

'=============================================================================
'- SUBROUTINE 3 : OPEN WORD DOCUMENT - EXTRACT EMAIL ADDRESSES
'=============================================================================
Private Sub GET_ADDRESSES()
    FileCount = FileCount + 1
    Application.StatusBar = FileCount & " files. Processing Folder : " _
        & FolderCount & " - " & MyFolderName & "\"
    '-------------------------------------------------------------------------
    Set TempDoc = Documents.Open(MyFolderName & "\" & MyFileName)
    DocContents = ActiveDocument.Content.Text
    '-------------------------------------------------------------------------
    With MyRegExp
        .Global = True
        .ignorecase = True
        .multiline = True
        .Pattern = MyPattern
        Set MyMatches = .Execute(DocContents)
    End With
    '-----------------------------------------------------------------------
    '- RESULTS TO DOCUMENT
    '-----------------------------------------------------------------------
    MatchCount = MyMatches.Count
    Application.StatusBar = "Matches : " & MatchCount
    If MatchCount > 0 Then
        For m = 0 To MatchCount - 1
            MyMatch = MyMatches(m)
            ResultsDoc.Content.InsertAfter MyMatch & vbCr
        Next
    End If
    '-----------------------------------------------------------------------
    TempDoc.Close savechanges:=False
End Sub
'===========================================================================

Open in new window

Extract-email-addresses.doc
IP4IT StaffAuthor Commented:
Thank you, that is very good feedback indeed.

So it would appear that there is a way to aty least get the email addresses out. Thats a very good start.

THe MS Word document file name contains a) the company and b) the end users name.

Example:

1017284-Blue-Chip-Company-Inc-John-Smith-Issue-Email-is-bouncing-Owner-Tech-Mike-Priority-2.doc

Is there anyway that we could match the email address exctracted to the file name of the ms word document so that we would know who the email address below to?

Thanks again for all you help.

Robert.
Doug BishopDatabase DeveloperCommented:
Change line 152 in BrainB's code above to:
   
ResultsDoc.Content.InsertAfter MyMatch & " - " & MyFileName & vbCr

Open in new window

BrainBCommented:
does robbiebreslin's suggestion solve your problem ?
If not, what do you want it to do ?
BrainBCommented:
Sorry, I meant dbbishop
Doug BishopDatabase DeveloperCommented:
BTW, no points please if accepted.
jrb1senior developerCommented:
In mine, same idea after the Selection.Paste:

Selection.Insert " - " & pFileName
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
DB Reporting Tools

From novice to tech pro — start learning today.