Uploading sections in a Word document into an Access database

I have a large Word document, similar to a legal document, organised into sections and clauses.
I'm looking for sample code that uploads the text into an Access database, along with its section/clause reference (using ADO/SQL).


David Favor (Linux/LXD/WordPress/Hosting Savant) commented:
Likely this is better done as a Gig, because documents tend to be fairly unique in layout.

You can use libreoffice --headless to convert your .doc + .docx + .pdf files to text + then parse the text, inserting it into your database.

One caveat. Using Access makes accessing data very complex, if more than one person requires access.

It takes the same amount of time to set up a MariaDB (MySQL that works) database + you can use the built-in GRANT system to allow anyone to access your data.

Scott McDaniel (Microsoft Access MVP - EE MVE), Infotrakker Software, commented:
> "One caveat. Using Access makes accessing data very complex, if more than one person requires access."

I don't agree with this statement. There are thousands upon thousands of multiuser Access systems that work perfectly fine. MariaDB (or MySQL) and other server database systems are much more difficult to set up and maintain than an Access ACE or JET database, at least for the average user.

I do agree that you may be better off having someone else do this, if you're not comfortable with automating Access and Word. It's fairly simple to get data from Access into Word documents, but retrieving the actual text from a Word document can be a bit more tricky. You would need to open a Word Document in code, then walk through the Paragraphs in the Document to determine what you need to import. As David said, Word documents are unique so there is no one-size-fits-all code that can do what you ask.

hindersaliva (Author) commented:
Scott, yes. 'Walking through the sections and clauses' to determine what to upload to which record is the issue I'm researching. I was hoping to find some kind of paragraphID that uniquely identifies a clause.

I'd be interested to know if there is such an identifier in Word.

My main/everyday activity is using ADO/SQL to shunt data from Excel to Access/SQL Server and back. So that part of it I'm OK with. Access will simply hold the data in tables, and manage relations between them. In my travels, because the connection to the DB is brief (milliseconds), the number of users is practically 'unlimited'.

Scott McDaniel (Microsoft Access MVP - EE MVE), Infotrakker Software, commented:
I don't believe a Paragraph in Word would have any sort of unique identifier. It's just a block of (possibly formatted) text.

You may be able to modify the Word document to define specific portions of the document with ID values you provide. But I don't think Word does this automatically.

Jim Dettman (Microsoft MVP / EE MVE), President / Owner, commented:
If it's a one time deal, a cut & paste is probably best.

If not, then it is possible to:

1. Control Word through OLE automation.
2. Open a document
3. Search for specific keywords.
4. Copy text into Access.

And Scott is correct; there is nothing that uniquely identifies a paragraph unless you take the time to add bookmarks to the Word doc.

If you outlined what it is you're trying to accomplish, we may be able to suggest an approach.
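
Steps 1 to 3 might be sketched roughly as below. This is a minimal example under assumptions, not a finished importer: `FindKeyword` and its parameters are hypothetical names, Word is late-bound, the keyword is plain text, and each match is only printed to the Immediate window rather than copied into a table.

```vba
Sub FindKeyword(psPathFile As String, psKeyword As String)
   'Word constants declared locally so no Word reference is needed
   Const wdFindStop As Long = 0
   Const wdCollapseEnd As Long = 0
   Dim oWrd As Object, oDoc As Object, oRng As Object

   '1. control Word through automation
   Set oWrd = CreateObject("Word.Application")
   '2. open the document, ReadOnly
   Set oDoc = oWrd.Documents.Open(FileName:=psPathFile, ReadOnly:=True)

   '3. search the whole document for the keyword
   Set oRng = oDoc.Content
   With oRng.Find
      .ClearFormatting
      .Text = psKeyword
      .Forward = True
      .Wrap = wdFindStop   'stop at the end of the document
      Do While .Execute
         'oRng now covers the match: print its position and its paragraph
         Debug.Print oRng.Start, oRng.Paragraphs(1).Range.Text
         'continue searching after this match
         oRng.Collapse wdCollapseEnd
      Loop
   End With

   oDoc.Close SaveChanges:=False
   oWrd.Quit SaveChanges:=False
   Set oRng = Nothing
   Set oDoc = Nothing
   Set oWrd = Nothing
End Sub
```

There is no error handler here, so an error part-way through would leave a hidden Word instance running; real code would want one.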

hindersalivaAuthor Commented:
New info:
I now have a sample document. It has 411 pages.
The document is organised into Sections (eg. M680100) and Clauses (eg. M680110).
In each Clause is either a paragraph, a bulleted list (of up to 2 levels), or a Table - some with bulleted lists within them.
I want to grab each Clause intact (Lists/Table included), to put into the database.

(the objective being to be able to reverse the process, and pull in only the Sections and Clauses the user has selected for any particular specification. But that's another step beyond this step)

The huge number of Sections/Clauses makes it impractical to do this manually. (there are 10 of these 400+ page model specifications)

In the attached image I have marked (1) ClauseID (2) Clause Name (3) Clause Content.

crystal (strive4peace) - Microsoft MVP, Access; Remote Training and Programming, commented:

Word is pretty amazing. You can loop through documents and get all kinds of information.

> "some kind of paragraphID that uniquely identifies a clause"

For marking unique text, Bookmarks can be used. It might be a good idea to develop a naming convention so a pattern match could be done on the bookmark name to see what kind of a name it is. Bookmarks can contain text you want to replace or do something else with, or they can just be used to mark a spot. You can also use other features in Word, like index entries and other cross-references.
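
As a rough illustration of the naming-convention idea, here is a sketch that picks out clause bookmarks by name. The "cl_" prefix and the `ListClauseBookmarks` name are assumptions made up for this example; the bookmarks themselves would have to be added to the document first.

```vba
Sub ListClauseBookmarks(oDoc As Object)
   'oDoc is an open Word document (late-bound)
   'ASSUMPTION: clause bookmarks are named "cl_" + clause ID, e.g. cl_M680110
   Const PREFIX As String = "cl_"
   Dim oBookmark As Object
   Dim sClauseID As String

   For Each oBookmark In oDoc.Bookmarks
      'pattern match on the bookmark name: prefix plus at least one character
      If oBookmark.Name Like PREFIX & "?*" Then
         'strip the prefix to recover the clause ID
         sClauseID = Mid(oBookmark.Name, Len(PREFIX) + 1)
         Debug.Print sClauseID & ": " & Len(oBookmark.Range.Text) & " characters"
      End If
   Next oBookmark

   Set oBookmark = Nothing
End Sub
```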

A paragraph number will change as paragraphs are inserted and rearranged so while it is unique in a version of the document, it is not necessarily the same in another.

For specifying the nature of a paragraph, Style can be used. This might be Normal, Heading 1, Heading 2, myStyleName, etc.

For information about a paragraph, pass a WdInformation constant to Range.Information: wdActiveEndSectionNumber is the section number, wdActiveEndPageNumber is the page number it is on. If the document page number has been redefined in places, wdActiveEndAdjustedPageNumber is the adjusted page number.
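
A small sketch of reading those three values for one paragraph. `PrintParagraphLocation` is a hypothetical name; the constants are declared locally only so that the code also compiles without a reference to the Word object library.

```vba
Sub PrintParagraphLocation(oDoc As Object, nPara As Long)
   'WdInformation values, declared here for late binding
   Const wdActiveEndAdjustedPageNumber As Long = 1
   Const wdActiveEndSectionNumber As Long = 2
   Const wdActiveEndPageNumber As Long = 3

   'nPara is the 1-based paragraph number in the document
   With oDoc.Paragraphs(nPara).Range
      Debug.Print "section " & .Information(wdActiveEndSectionNumber) _
         & ", page " & .Information(wdActiveEndPageNumber) _
         & ", adjusted page " & .Information(wdActiveEndAdjustedPageNumber)
   End With
End Sub
```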

In the Word document, you might have tables with rows and columns. Like in Excel, each intersection is called a 'cell'.

Depending on how the document is formatted, you could loop through paragraphs and/or tables looking to see if the text matches the pattern for a new section number at the beginning. If you have specific questions about searching ranges for specific text, or a pattern, please start a new thread.
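
For the document described above, that loop might look something like this sketch. It assumes, from the two sample IDs only (M680100, M680110), that every heading paragraph starts with one letter followed by six digits; `GetHeadingID` and `ListClauses` are hypothetical names, and the pattern would need checking against the real document.

```vba
'Returns the ID at the start of sText, or "" if the text is not a heading.
'ASSUMPTION: an ID is one letter followed by six digits, e.g. M680110
Function GetHeadingID(ByVal sText As String) As String
   sText = Trim(sText)
   If sText Like "[A-Z]######*" Then
      GetHeadingID = Left(sText, 7)
   End If
End Function

'Prints the character range of the content under each heading
Sub ListClauses(oDoc As Object)
   Dim oPara As Object
   Dim sID As String, sCurrentID As String
   Dim nStart As Long, nEnd As Long

   For Each oPara In oDoc.Paragraphs
      sID = GetHeadingID(oPara.Range.Text)
      If Len(sID) > 0 Then
         'a new heading closes the previous clause
         If Len(sCurrentID) > 0 Then
            Debug.Print sCurrentID & ": characters " & nStart & " to " & nEnd
         End If
         sCurrentID = sID
         'content starts right after the heading paragraph
         nStart = oPara.Range.End
      End If
      nEnd = oPara.Range.End
   Next oPara

   'the last clause runs to the end of the document
   If Len(sCurrentID) > 0 Then
      Debug.Print sCurrentID & ": characters " & nStart & " to " & nEnd
   End If
   Set oPara = Nothing
End Sub
```

Keeping start and end positions rather than text means oDoc.Range(nStart, nEnd) can return each clause as one range, with any lists or tables inside it. A table cell whose text happens to start with the same pattern would be treated as a heading too, so the test may need tightening (by Style, for instance).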

I like to keep track of a counter in the loop for reporting and other reasons, but instead of, for instance, For i = 1 To oDoc.Tables.Count, you could use For Each oTable In oDoc.Tables.

Likewise, instead of booKeepGoing to keep track of when to stop, you could also do this:
   Dim oPara As Word.Paragraph 'or Object for late-binding
   For Each oPara In oDoc.Paragraphs
      'do something with oPara
   Next oPara

   Set oPara = Nothing

Before I looked at the image you attached, I thought you wanted the Word section number ... now I see you are using 'section' to mean something else.

The code shows how to put the contents of a paragraph (or range) into a string variable -- you could create records and write this information to a table in Access.

Sub runWord_ReadDocument()
'171217 s4p
   Dim sPathFile As String
   sPathFile = "c:\path\filename.docx"
   Call Word_ReadDocument(sPathFile)
End Sub

Function Word_ReadDocument(psPathFile As String _
   ) As Long
'171217 crystal (strive4peace)
   '  for early binding,
   '     Microsoft Word ##.0 Object Library  -- for instance 15.0
   'Open Word Document
   '  loop and write paragraph #, page#, section#, Style, etc to debug window
   '  document tables
   '  document bookmarks
   'set up error handler
   On Error GoTo Proc_Err

   Dim db As DAO.Database _
      , rs As DAO.Recordset
   '====================== ++++++++++++++++++++++++++++++++++++++++++ CHANGE
   'late-binding for deployment
   Dim oWrd As Object _
      , oDoc As Object _
      , oTable As Object _
      , oBookmark As Object _
      , oRng As Object
   'early-binding for development
   'needs Microsoft Word ##.0 Object Library
'   Dim oWrd As Word.Application _
'      , oDoc As Word.Document _
'      , oTable As Word.Table _
'      , oBookmark As Word.Bookmark _
'      , oRng As Word.Range
   Dim nParaTotal As Long _
      , nCharTotal As Long _
      , nParaCurrent As Long _
      , nPosStart As Long _
      , nPosEnd As Long
   Dim sText As String _
      , sStyle As String _
      , iSection As Integer _
      , iPage As Integer _
      , i As Integer _
      , sMsg As String
   Dim booKeepGoing As Boolean
   Set db = CurrentDb
   '---------------- Initialize Word
   Set oWrd = CreateObject("Word.Application")
   oWrd.Visible = True
   'open Word document, ReadOnly
   Set oDoc = oWrd.Documents.Open(fileName:=psPathFile, ReadOnly:=True)
   'total number of paragraphs
   nParaTotal = oDoc.Paragraphs.Count
   'total number of characters
   nCharTotal = oDoc.characters.Count
   Debug.Print "******* " & Format(nParaTotal, "#,##0") & " paragraphs in " & psPathFile
   nPosStart = 1 'where to start
   nPosEnd = nCharTotal 'where to end
   nParaCurrent = 1
   'define a range from a character start position to end position
   'not used -- only here to demonstrate
   Set oRng = oDoc.Range(nPosStart, nPosEnd)

   '----------------------------------------------------- Loop Paragraphs - get Style, Section
   booKeepGoing = True
   'loop through the document
   Do While booKeepGoing = True
      'stop if the current paragraph number is greater than the document
      If nParaCurrent > nParaTotal Then
         booKeepGoing = False
         Exit Do
      End If
      'With the current paragraph ...
      With oDoc.Paragraphs(nParaCurrent)
         'text of paragraph - can be searched, parsed, etc
         sText = .Range.Text
         'style name
         sStyle = .Style
         'what section end of range is in
         iSection = .Range.Information(2) 'wdActiveEndSectionNumber=2
         'what page the end of range is on
         iPage = .Range.Information(3)  'wdActiveEndPageNumber=3
         '  Help:
         '  WdInformation Enumeration (Word)
         '  https://msdn.microsoft.com/en-us/vba/word-vba/articles/wdinformation-enumeration-word
      End With
      'print information to debug window -- or do other things
      '  current paragraph number, section#, page#
      sMsg = "Para# " & nParaCurrent & ", section: " & iSection & ", page: " & iPage
      '  paragraph style
      sMsg = sMsg & ", " & sStyle
      '  number of characters, length trimmed text
      sMsg = sMsg & ", " & Format(Len(sText), "#,##0") & " characters"
      If Len(sText) <> Len(Trim(sText)) Then
         sMsg = sMsg & ", trimmed = " & Format(Len(Trim(sText)), "#,##0") & " characters"
      End If
      'comment this if you do not want to see the paragraph text
      sMsg = sMsg & ", " & sText
      Debug.Print sMsg

      'increment the current paragraph counter
      nParaCurrent = nParaCurrent + 1
   Loop

   '----------------------------------------------------- Document Tables
   If oDoc.Tables.Count > 0 Then
      Debug.Print "************ TABLES"
      For i = 1 To oDoc.Tables.Count
'      For Each oTable In oDoc.Tables 'this could be done instead
         With oDoc.Tables(i) 'oTable is not actually used -- but could be
            Debug.Print i & ", " & .Rows.Count & " Rows, " & .Columns.Count & " Columns, ";
            Debug.Print .Range.Cells.Count & " cells, ";
            Debug.Print "Start = " & Format(.Range.Start, "#,##0") & ", End = " & Format(.Range.End, "#,##0")
         End With
      Next i
   End If
   '----------------------------------------------------- Document Bookmarks
   If oDoc.Bookmarks.Count > 0 Then
      Debug.Print "************ BOOKMARKS"
      For Each oBookmark In oDoc.Bookmarks
         With oBookmark
            Debug.Print .Name & " = " & .Range.Text
         End With
      Next oBookmark
   End If
   'close Word without saving changes (opened ReadOnly so False may not be necessary)
   oDoc.Close SaveChanges:=False
   Set oDoc = Nothing
   'quit Word without saving changes
   oWrd.Quit SaveChanges:=False
   Set oWrd = Nothing

   MsgBox "Done - press Ctrl-G to look at Immediate (Debug) window", , "Done"

Proc_Exit:
   On Error Resume Next
   Set oRng = Nothing
   Set oBookmark = Nothing
   Set oTable = Nothing
   If Not oDoc Is Nothing Then
      oDoc.Close False
      Set oDoc = Nothing
   End If
   If Not oWrd Is Nothing Then
      oWrd.Quit False
      Set oWrd = Nothing
   End If

   Set db = Nothing
   Exit Function

Proc_Err:
   MsgBox Err.Description, , _
        "ERROR " & Err.Number _
        & "   Word_ReadDocument"

   Resume Proc_Exit
End Function
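
To go from the Immediate window to a table, each piece of text could be written with a DAO recordset, roughly as sketched here. The table tblClause, its three fields, and the `SaveClause` name are all assumptions made up for this example.

```vba
Sub SaveClause(psClauseID As String, psName As String, psContent As String)
   'ASSUMPTION: a table tblClause exists with the fields
   '  ClauseID (Short Text), ClauseName (Short Text), ClauseText (Long Text)
   Dim db As DAO.Database
   Dim rs As DAO.Recordset

   Set db = CurrentDb
   'append-only: existing records are not loaded
   Set rs = db.OpenRecordset("tblClause", dbOpenDynaset, dbAppendOnly)
   With rs
      .AddNew
      !ClauseID = psClauseID
      !ClauseName = psName
      !ClauseText = psContent
      .Update
      .Close
   End With

   Set rs = Nothing
   Set db = Nothing
End Sub
```

It could be called from inside the paragraph loop, for example SaveClause "M680110", "clause name", sText. Note that Range.Text is plain text, so list and table formatting is not kept when a clause is stored this way.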

For more information on error handling, here is a video tutorial on EE:

Hopefully this gives you a good start ~

have an awesome day,

hindersaliva (Author) commented:
Oh, wow. Crystal. That’s great. I’ve got your code running at the moment against the sample doc my client has given me. That’s showing me what Word elements I’m up against. I had to write the output to a text file as the count is 8,500+.

I’ll be back.

crystal (strive4peace) - Microsoft MVP, Access; Remote Training and Programming, commented:
thanks, and you're welcome!

crystal (strive4peace) - Microsoft MVP, Access; Remote Training and Programming, commented:

for those who are following this ... more is explained in this thread:
