Link to home
Start Free TrialLog in
Avatar of excel learner
excel learnerFlag for United Kingdom of Great Britain and Northern Ireland

asked on

Extraction of text from word document

Dear experts

I have the source document with vocabulary, meaning and its usage, each identified separately.
I need the help of the experts with a macro which can do the following onto a excel sheet:
Step1: Strip the word starting with bold and ending before the colon=>Put this in cell A1=> Example- abstruse (adjective):
Step2: Strip the words after the colon in step1 above but ending in the same line in cell B1=> difficult to understand; incomprehensible
Step3:Move to next line and copy entire stuff after step2 but before the next word beings in cell C1 example=> Physics textbooks can seem so abstruse to the uninitiated that readers feel as though they are looking at hieroglyphics.

Repeat the above steps by moving to the next word accolade (noun): and copying them in row 2 onwards.
The word document has 95 pages in it.
Each page starts with the logo (gif and the name below given in url). This detail need not be extracted.
I sincerely believe there is no copy right violation.
Kindly help.
Thank you


Extract from the page

abstruse (adjective): difficult to understand; incomprehensible
Physics textbooks can seem so abstruse to the uninitiated that readers feel as though they are looking at hieroglyphics.
accolade (noun): an award or praise granted as a special honor
Jean Paul-Sartre was not a fan of accolades, and as such, he refused to accept the Nobel Prize for
Literature in 1964.
Avatar of Rob Henson
Rob Henson
Flag of United Kingdom of Great Britain and Northern Ireland image

You may be able to achieve this with Excel Text to Columns feature.

Copy the text from Word and paste into cell A1 on an Excel sheet. The line breaks in the Word document should force new rows for each line of text in Excel.

You can then use the Text to Column feature to use the colon as the separator to split the Word and meaning into two columns; A & B.

If you then filter on lines containing ( or ) you can then copy the Example text and paste into a separate sheet because the example text will probably skip the blank rows so will be in a continuous block. You can then use AutoFilter to remove blank rows from columns A & B and then copy and paste the example text back onto the same sheet in column C.
Please attach the Word doc.
Alternatively, before using the Text to Column feature, filter for lines not containing a colon to separate out the Example text rows.

See attached using the sample text from question using the TTC and filter method.
Copy-of-text.xlsx
Avatar of excel learner

ASKER

Hi Martin, As requested please find attached the sample document.
alteration2.docx
There is an issue with using my method as some of the Example text lines have more than 1 line break so will end up on two lines.
If you first save the Word doc in plain text format then this will do it.
Sub DoIt()
Dim FF As Integer
Dim strLine As String
Dim lngNewRow As Long
Dim strParts() As String

FF = FreeFile

' Change the file name and location as needed
Open "C:\temp\alteration2.txt" For Input As #FF

Do While Not EOF(FF)
    Line Input #FF, strLine
    strParts = Split(strLine, ":")
    If UBound(strParts) > 0 Then
        lngNewRow = lngNewRow + 1
        Cells(lngNewRow, "A") = strParts(0)
        Cells(lngNewRow, "B") = strParts(1)
        Do Until Trim(strLine) = ""
            Line Input #FF, strLine
            Cells(lngNewRow, "C") = Cells(lngNewRow, "C") & " " & strLine
        Loop
    End If
Loop
Close

End Sub

Open in new window

Hi Martin, thank you for the macro

I have the following observation

As in the document

accolade (noun): an award or praise granted as a special honor
Jean Paul-Sartre was not a fan of accolades, and as such, he refused to accept the Nobel Prize for
Literature in 1964.

Column C extract included only the below
Jean Paul-Sartre was not a fan of accolades, and as such, he refused to accept the Nobel Prize for


I think the macro command is not seeking all the data till the next word starts

I have a question do you want me to put any sort of identifier infront of every new word say 77 or 88or !!

I am happy to standardize my document with the above identifier.

Kindly advise.

Thank you
I made a late update to the code in post ID: 42298497 that you may not have seen. In any case here is the code again, and it gives the results that you expect.
Sub DoIt()
Dim FF As Integer
Dim strLine As String
Dim lngNewRow As Long
Dim strParts() As String

FF = FreeFile

' Change the file name and location as needed
Open "C:\temp\alteration2.txt" For Input As #FF

Do While Not EOF(FF)
    Line Input #FF, strLine
    strParts = Split(strLine, ":")
    If UBound(strParts) > 0 Then
        lngNewRow = lngNewRow + 1
        Cells(lngNewRow, "A") = strParts(0)
        Cells(lngNewRow, "B") = strParts(1)
        Do Until Trim(strLine) = ""
            Line Input #FF, strLine
            Cells(lngNewRow, "C") = Cells(lngNewRow, "C") & " " & strLine
        Loop
    End If
Loop
Close

End Sub

Open in new window


Output:
User generated image
SOLUTION
Avatar of Subodh Tiwari (Neeraj)
Subodh Tiwari (Neeraj)
Flag of India image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Hi Martin,

Thank you for the updated code
I got the below error:
Run-time error:'62':
Input past end of file

The macro pulled only 6 records from the entire file

I have made some changes to the document like removed the blank rows between the end of the previous word and the next word.

I am attaching the sample once again.

Apologies for the change.

Kindly help.
for-extraction-partEE.txt
Hi Subodh,

Thank you for the macro

The error message i got is compile error: User-defined type not defined.

I saved the word document on desktop.

I think it will be easy if the entire path is defined in the macro rather than the approach you have taken. I am a neophyte so please get annoyed with my input.

Kindly help.
I have made some changes to the document like removed the blank rows between the end of the previous word and the next word.
My code depended on the blank line to know when a new definition was found. Let me see what I can do.
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Hi Martin,

Thank you

I think the macro is doing the job.

Let me complete the task. I will close the question by EOD today (India time).

Thank you
Okay try this and assign the correct path to DocPath variable to make sure the Document's name and it's path is correct.
Currently, the line#16 in the code assumes that your document is called "alteration2.docx" and it is saved in C drive. Change it accordingly.
DocPath = "C:\alteration2.docx"

Open in new window



Sub GetWordData()
Dim ws As Worksheet
Dim wdApp As Object
Dim Doc As Object
Dim par As Object
Dim DocPath As String
Dim i As Long, j As Long
Dim r As Long, c As Long
Dim str As String
Application.ScreenUpdating = False

Set ws = ActiveSheet
Set wdApp = CreateObject("Word.Application")

'Following line assumes that the word document "alteration2.docx" is saved in C drive
DocPath = "C:\alteration2.docx"

Set Doc = wdApp.Documents.Open(DocPath)

r = 1

For i = 1 To Doc.Paragraphs.Count
    If UBound(Split(Doc.Paragraphs(i).Range.Text, ":")) > 0 Then
        c = 1
        ws.Cells(r, c) = Split(Doc.Paragraphs(i).Range.Text, ":")(0)
        ws.Cells(r, c + 1) = Split(Doc.Paragraphs(i).Range.Text, ":")(1)
        On Error Resume Next
        For j = i + 1 To Doc.Paragraphs.Count
            If Len(Doc.Paragraphs(j).Range.Text) = 1 Then
                i = j
                Exit For
            End If
            str = str & " " & Doc.Paragraphs(j).Range.Text
        Next j
        ws.Cells(r, c + 2) = str
        str = ""
        r = r + 1
    End If
    
    
Next i
ws.UsedRange.Columns.AutoFit
wdApp.Quit
Set Doc = Nothing
Set wdApp = Nothing
Application.ScreenUpdating = True
End Sub

Open in new window

Hi Martin
I got disturbed with the continuity for checking the output of this macro.

Please allow me two to three days.

I wanted to give reponsible response rather than just a casual one.

Subodh I will test your macro once I complete my work with the output from Martin's macro.

Thank you
Martin, your Macro worked
Subodh I did not manage to get time.
I have kept this question for very long. I will give you my feedback to you tomorrow.
Thank you both
You're welcome and I'm glad I was able to help.

If you expand the “Full Biography” section of my profile you'll find links to some articles I've written that may interest you.

Marty - Microsoft MVP 2009 to 2017
              Experts Exchange MVE 2015
              Experts Exchange Top Expert Visual Basic Classic 2012 to 2017