excel learner
asked on
Extraction of text from word document
Dear experts
I have the source document with vocabulary, meaning and its usage, each identified separately.
I need the help of the experts with a macro which can do the following onto a excel sheet:
Step1: Strip the word starting with bold and ending before the colon=>Put this in cell A1=> Example- abstruse (adjective):
Step2: Strip the words after the colon in step1 above but ending in the same line in cell B1=> difficult to understand; incomprehensible
Step3:Move to next line and copy entire stuff after step2 but before the next word beings in cell C1 example=> Physics textbooks can seem so abstruse to the uninitiated that readers feel as though they are looking at hieroglyphics.
Repeat the above steps by moving to the next word accolade (noun): and copying them in row 2 onwards.
The word document has 95 pages in it.
Each page starts with the logo (gif and the name below given in url). This detail need not be extracted.
I sincerely believe there is no copy right violation.
Kindly help.
Thank you
Extract from the page
abstruse (adjective): difficult to understand; incomprehensible
Physics textbooks can seem so abstruse to the uninitiated that readers feel as though they are looking at hieroglyphics.
accolade (noun): an award or praise granted as a special honor
Jean Paul-Sartre was not a fan of accolades, and as such, he refused to accept the Nobel Prize for
Literature in 1964.
I have the source document with vocabulary, meaning and its usage, each identified separately.
I need the help of the experts with a macro which can do the following onto a excel sheet:
Step1: Strip the word starting with bold and ending before the colon=>Put this in cell A1=> Example- abstruse (adjective):
Step2: Strip the words after the colon in step1 above but ending in the same line in cell B1=> difficult to understand; incomprehensible
Step3:Move to next line and copy entire stuff after step2 but before the next word beings in cell C1 example=> Physics textbooks can seem so abstruse to the uninitiated that readers feel as though they are looking at hieroglyphics.
Repeat the above steps by moving to the next word accolade (noun): and copying them in row 2 onwards.
The word document has 95 pages in it.
Each page starts with the logo (gif and the name below given in url). This detail need not be extracted.
I sincerely believe there is no copy right violation.
Kindly help.
Thank you
Extract from the page
abstruse (adjective): difficult to understand; incomprehensible
Physics textbooks can seem so abstruse to the uninitiated that readers feel as though they are looking at hieroglyphics.
accolade (noun): an award or praise granted as a special honor
Jean Paul-Sartre was not a fan of accolades, and as such, he refused to accept the Nobel Prize for
Literature in 1964.
Please attach the Word doc.
Alternatively, before using the Text to Column feature, filter for lines not containing a colon to separate out the Example text rows.
See attached using the sample text from question using the TTC and filter method.
Copy-of-text.xlsx
See attached using the sample text from question using the TTC and filter method.
Copy-of-text.xlsx
ASKER
Hi Martin, As requested please find attached the sample document.
alteration2.docx
alteration2.docx
There is an issue with using my method as some of the Example text lines have more than 1 line break so will end up on two lines.
If you first save the Word doc in plain text format then this will do it.
Sub DoIt()
Dim FF As Integer
Dim strLine As String
Dim lngNewRow As Long
Dim strParts() As String
FF = FreeFile
' Change the file name and location as needed
Open "C:\temp\alteration2.txt" For Input As #FF
Do While Not EOF(FF)
Line Input #FF, strLine
strParts = Split(strLine, ":")
If UBound(strParts) > 0 Then
lngNewRow = lngNewRow + 1
Cells(lngNewRow, "A") = strParts(0)
Cells(lngNewRow, "B") = strParts(1)
Do Until Trim(strLine) = ""
Line Input #FF, strLine
Cells(lngNewRow, "C") = Cells(lngNewRow, "C") & " " & strLine
Loop
End If
Loop
Close
End Sub
ASKER
Hi Martin, thank you for the macro
I have the following observation
As in the document
accolade (noun): an award or praise granted as a special honor
Jean Paul-Sartre was not a fan of accolades, and as such, he refused to accept the Nobel Prize for
Literature in 1964.
Column C extract included only the below
Jean Paul-Sartre was not a fan of accolades, and as such, he refused to accept the Nobel Prize for
I think the macro command is not seeking all the data till the next word starts
I have a question do you want me to put any sort of identifier infront of every new word say 77 or 88or !!
I am happy to standardize my document with the above identifier.
Kindly advise.
Thank you
I have the following observation
As in the document
accolade (noun): an award or praise granted as a special honor
Jean Paul-Sartre was not a fan of accolades, and as such, he refused to accept the Nobel Prize for
Literature in 1964.
Column C extract included only the below
Jean Paul-Sartre was not a fan of accolades, and as such, he refused to accept the Nobel Prize for
I think the macro command is not seeking all the data till the next word starts
I have a question do you want me to put any sort of identifier infront of every new word say 77 or 88or !!
I am happy to standardize my document with the above identifier.
Kindly advise.
Thank you
I made a late update to the code in post ID: 42298497 that you may not have seen. In any case here is the code again, and it gives the results that you expect.
Output:
Sub DoIt()
Dim FF As Integer
Dim strLine As String
Dim lngNewRow As Long
Dim strParts() As String
FF = FreeFile
' Change the file name and location as needed
Open "C:\temp\alteration2.txt" For Input As #FF
Do While Not EOF(FF)
Line Input #FF, strLine
strParts = Split(strLine, ":")
If UBound(strParts) > 0 Then
lngNewRow = lngNewRow + 1
Cells(lngNewRow, "A") = strParts(0)
Cells(lngNewRow, "B") = strParts(1)
Do Until Trim(strLine) = ""
Line Input #FF, strLine
Cells(lngNewRow, "C") = Cells(lngNewRow, "C") & " " & strLine
Loop
End If
Loop
Close
End Sub
Output:
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Hi Martin,
Thank you for the updated code
I got the below error:
Run-time error:'62':
Input past end of file
The macro pulled only 6 records from the entire file
I have made some changes to the document like removed the blank rows between the end of the previous word and the next word.
I am attaching the sample once again.
Apologies for the change.
Kindly help.
for-extraction-partEE.txt
Thank you for the updated code
I got the below error:
Run-time error:'62':
Input past end of file
The macro pulled only 6 records from the entire file
I have made some changes to the document like removed the blank rows between the end of the previous word and the next word.
I am attaching the sample once again.
Apologies for the change.
Kindly help.
for-extraction-partEE.txt
ASKER
Hi Subodh,
Thank you for the macro
The error message i got is compile error: User-defined type not defined.
I saved the word document on desktop.
I think it will be easy if the entire path is defined in the macro rather than the approach you have taken. I am a neophyte so please get annoyed with my input.
Kindly help.
Thank you for the macro
The error message i got is compile error: User-defined type not defined.
I saved the word document on desktop.
I think it will be easy if the entire path is defined in the macro rather than the approach you have taken. I am a neophyte so please get annoyed with my input.
Kindly help.
I have made some changes to the document like removed the blank rows between the end of the previous word and the next word.My code depended on the blank line to know when a new definition was found. Let me see what I can do.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Hi Martin,
Thank you
I think the macro is doing the job.
Let me complete the task. I will close the question by EOD today (India time).
Thank you
Thank you
I think the macro is doing the job.
Let me complete the task. I will close the question by EOD today (India time).
Thank you
Okay try this and assign the correct path to DocPath variable to make sure the Document's name and it's path is correct.
Currently, the line#16 in the code assumes that your document is called "alteration2.docx" and it is saved in C drive. Change it accordingly.
Currently, the line#16 in the code assumes that your document is called "alteration2.docx" and it is saved in C drive. Change it accordingly.
DocPath = "C:\alteration2.docx"
Sub GetWordData()
Dim ws As Worksheet
Dim wdApp As Object
Dim Doc As Object
Dim par As Object
Dim DocPath As String
Dim i As Long, j As Long
Dim r As Long, c As Long
Dim str As String
Application.ScreenUpdating = False
Set ws = ActiveSheet
Set wdApp = CreateObject("Word.Application")
'Following line assumes that the word document "alteration2.docx" is saved in C drive
DocPath = "C:\alteration2.docx"
Set Doc = wdApp.Documents.Open(DocPath)
r = 1
For i = 1 To Doc.Paragraphs.Count
If UBound(Split(Doc.Paragraphs(i).Range.Text, ":")) > 0 Then
c = 1
ws.Cells(r, c) = Split(Doc.Paragraphs(i).Range.Text, ":")(0)
ws.Cells(r, c + 1) = Split(Doc.Paragraphs(i).Range.Text, ":")(1)
On Error Resume Next
For j = i + 1 To Doc.Paragraphs.Count
If Len(Doc.Paragraphs(j).Range.Text) = 1 Then
i = j
Exit For
End If
str = str & " " & Doc.Paragraphs(j).Range.Text
Next j
ws.Cells(r, c + 2) = str
str = ""
r = r + 1
End If
Next i
ws.UsedRange.Columns.AutoFit
wdApp.Quit
Set Doc = Nothing
Set wdApp = Nothing
Application.ScreenUpdating = True
End Sub
ASKER
Hi Martin
I got disturbed with the continuity for checking the output of this macro.
Please allow me two to three days.
I wanted to give reponsible response rather than just a casual one.
Subodh I will test your macro once I complete my work with the output from Martin's macro.
Thank you
I got disturbed with the continuity for checking the output of this macro.
Please allow me two to three days.
I wanted to give reponsible response rather than just a casual one.
Subodh I will test your macro once I complete my work with the output from Martin's macro.
Thank you
ASKER
Martin, your Macro worked
Subodh I did not manage to get time.
I have kept this question for very long. I will give you my feedback to you tomorrow.
Subodh I did not manage to get time.
I have kept this question for very long. I will give you my feedback to you tomorrow.
ASKER
Thank you both
You're welcome and I'm glad I was able to help.
If you expand the “Full Biography” section of my profile you'll find links to some articles I've written that may interest you.
Marty - Microsoft MVP 2009 to 2017
Experts Exchange MVE 2015
Experts Exchange Top Expert Visual Basic Classic 2012 to 2017
If you expand the “Full Biography” section of my profile you'll find links to some articles I've written that may interest you.
Marty - Microsoft MVP 2009 to 2017
Experts Exchange MVE 2015
Experts Exchange Top Expert Visual Basic Classic 2012 to 2017
Copy the text from Word and paste into cell A1 on an Excel sheet. The line breaks in the Word document should force new rows for each line of text in Excel.
You can then use the Text to Column feature to use the colon as the separator to split the Word and meaning into two columns; A & B.
If you then filter on lines containing ( or ) you can then copy the Example text and paste into a separate sheet because the example text will probably skip the blank rows so will be in a continuous block. You can then use AutoFilter to remove blank rows from columns A & B and then copy and paste the example text back onto the same sheet in column C.