Parsing an MS Word document with ColdFusion and XPath, and creating a new one with the same formatting as the original.

Greetings, Experts! I have a coding issue that I could use some help with. I'm using ColdFusion 10 to parse a Microsoft Word document, extract the contents, keep its formatting, and put it into a SQL Server database in multiple parts. For example, if the Word doc were HTML, each row would go into the database separately. This is turning out to be a tough nut to crack. I'm having trouble with the XPath part. I can find individual cells in the XML structure that ColdFusion creates, but I can't seem to list all child nodes for a given element. This is all an attempt to keep the MS Word formatting, including special characters. I know parsing HTML would be easier, but the goal is to do it with the original MS Word document. Given the key words from the left-hand column, the Word doc will filter out the Details Text in the right column. For example, if the key word search was "Bob", then only the rows that have "Bob" in the first column will be displayed. I'd like to stay away from third-party plug-ins. Any help is appreciated. Thank you.
EE_CFM_XML_DB.txt
TheDocument.docx
Garbonzo_Horowitz Asked:
Gurpreet Singh Randhawa (CEO) Commented:
Garbonzo_Horowitz (Author) Commented:
Thank you for the reply, Gurpreet. I don't think OpenOffice will help. OpenOffice converts Word to PDF, but I need the final format to be MS Word. Also, I'm using CF 10, not ColdFusion 2018. Parsing the XML seems like the right way to go, but I'm unsure how to traverse a table in the XML that MS Word produces.
Garbonzo_Horowitz (Author) Commented:
Yes, I did. The file EE_CFM_XML_DB.txt is actually a ColdFusion file that uses XmlParse, as Ben Nadel's example notes. I haven't been successful in going the extra step and actually building a parsing algorithm for the tables. There are actually 2 tables in the docx file. The first table is here: /w:document/w:body[1]/w:tbl[1] and the second one is nested inside it, here: /w:document[1]/w:body[1]/w:tbl[1]/w:tr[2]/w:tc[4]/w:tbl[1]

I'm not sure how to begin extracting the data from the rows and columns of the table while taking into account the second table.
_agx_ Commented:
It's certainly possible IF you're comfortable with XPath and have a good grasp of the OOXML structure. Just keep in mind styles are embedded in the OOXML too, which makes for some VERY gnarly nesting...

Something like this gets you part of the way. It loops through all table rows and cells and extracts the "text runs" in each cell. However, it doesn't account for that 2nd nested table, so the nested table's text ends up getting merged with the text from the 1st table. My XPath isn't great, so I'd probably open a separate thread and ask about the XPath expressions only. (CF isn't relevant to that part and might scare off someone knowledgeable about XPath but unfamiliar with CFML ;-)

<cfset rowData = []>

<!--- loop through rows in table --->
<cfset tableRows = xmlSearch(MyXml.document.body.tbl, ".//w:tr")>
<cfloop array="#tableRows#" index="row">
	<cfset columnData = []>
	<!--- loop through cells in current row --->
	<cfset tableCells = xmlSearch(row, ".//w:tc")>
	<cfloop array="#tableCells#" index="cell">
	
		<cfset nestedTables = xmlSearch(cell, ".//w:tbl")>
		<cfif arrayLen(nestedTables)>
			<!--- TODO: do the same type of extraction with the nested table --->
		</cfif>
		
		
		<!--- get all text runs in current cell --->
		<cfset textRuns = []>
		<cfset textNodes = xmlSearch(cell, ".//w:t")>
		<cfloop array="#textNodes#" index="node">
			<cfset arrayAppend(textRuns, node.xmlText)>
		</cfloop>
		
		<!--- save text for current cell/column --->
		<cfset arrayAppend(columnData, arrayToList(textRuns, " "))>
	</cfloop>
	
	<!--- save columns in current row --->
	<cfset arrayAppend(rowData, columnData)>
</cfloop>

<cfdump var="#rowData#">
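
One possible way to keep the nested table separate, as a rough sketch rather than a tested solution: use child-axis paths ("./w:tr", "./w:tc") instead of descendant paths (".//w:tr"), so each level only sees its own rows and cells, and recurse into any w:tbl that sits directly inside a cell. The function name extractTable and the text/tables struct keys are made up for illustration. It assumes MyXml is the parsed word/document.xml, as in the snippet above, and that cell text lives in w:p paragraphs directly under the cell (content controls and text boxes are not handled).

```cfm
<!--- Hypothetical helper: returns an array of rows; each row is an array of
      structs with the cell's own text and any tables nested in that cell --->
<cffunction name="extractTable" returntype="array" output="false">
	<cfargument name="tbl" required="true">
	<cfset var rows = []>
	<cfset var columns = []>
	<cfset var textRuns = []>
	<cfset var cellInfo = {}>
	<cfset var row = "">
	<cfset var cell = "">
	<cfset var node = "">
	<cfset var nested = "">

	<!--- "./w:tr" matches only rows that are direct children of this table --->
	<cfloop array="#xmlSearch(arguments.tbl, './w:tr')#" index="row">
		<cfset columns = []>
		<!--- "./w:tc" matches only cells that belong to this row --->
		<cfloop array="#xmlSearch(row, './w:tc')#" index="cell">
			<cfset cellInfo = { text = "", tables = [] }>

			<!--- text runs from the cell's own paragraphs; a nested w:tbl is a
			      sibling of w:p, so its text is not picked up here --->
			<cfset textRuns = []>
			<cfloop array="#xmlSearch(cell, './w:p//w:t')#" index="node">
				<cfset arrayAppend(textRuns, node.xmlText)>
			</cfloop>
			<cfset cellInfo.text = arrayToList(textRuns, " ")>

			<!--- recurse into tables nested directly inside this cell --->
			<cfloop array="#xmlSearch(cell, './w:tbl')#" index="nested">
				<cfset arrayAppend(cellInfo.tables, extractTable(nested))>
			</cfloop>

			<cfset arrayAppend(columns, cellInfo)>
		</cfloop>
		<cfset arrayAppend(rows, columns)>
	</cfloop>

	<cfreturn rows>
</cffunction>

<!--- Usage: top-level tables only, using the path from the question --->
<cfset topTables = xmlSearch(MyXml, "/w:document/w:body/w:tbl")>
<cfif arrayLen(topTables)>
	<cfdump var="#extractTable(topTables[1])#">
</cfif>
```

With the document described above, the nested table would presumably show up under the tables key of the 4th cell in the 2nd row instead of being merged into that cell's text. Note this only collects plain text; the run formatting (the w:rPr elements) would still need to be carried along separately if the styling has to survive.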
