
Parsing an MS Word document with ColdFusion and XPath, and creating a new one with the same formatting as the original

Greetings, Experts! I have a coding issue that I could use some help with. I'm using ColdFusion 10 to parse a Microsoft Word document, extract the contents while keeping their formatting, and put them into a SQL Server database in multiple parts. For example, if the Word doc were HTML, each row would go into the database separately. This is turning out to be a tough nut to crack, and the XPath part is where I'm having trouble: I can find individual cells in the XML structure that ColdFusion creates, but I can't seem to list all child nodes for a given element. This is all an attempt to keep the MS Word formatting, including special characters. I know parsing HTML would be easier, but the goal is to do it with the original MS Word document. The Word doc will filter the Details Text in the right-hand column by the keywords in the left-hand column. For example, if the keyword search were "Bob", only the rows that have "Bob" in the first column would be displayed. I'd like to stay away from third-party plug-ins. Any help is appreciated. Thank you.
EE_CFM_XML_DB.txt
TheDocument.docx

Author

Commented:
Thank you for the reply, Gurpreet. I don't think OpenOffice will help: it converts Word to PDF, and I need the final format to be MS Word. Also, I'm using CF 10, not ColdFusion 2018. Parsing the XML seems like the right way to go, but I'm unsure how to traverse a table in the XML that MS Word produces.

Author

Commented:
Yes, I did. The file EE_CFM_XML_DB.txt is actually a ColdFusion file that uses XmlParse, as Ben Nadel's example notes. I haven't been successful in going the extra step and building a parsing algorithm for the tables. There are actually two tables in the docx file. The first table is here: /w:document/w:body[1]/w:tbl[1] and the second one, which sits inside a cell of the first, is here: /w:document[1]/w:body[1]/w:tbl[1]/w:tr[2]/w:tc[4]/w:tbl[1]

I'm not sure how to begin extracting the data from the rows and columns of the table while taking the second table into account.
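For readers following along, the two locations above can be checked with a minimal sketch like the one below. It assumes the document body is read straight from the .docx archive (a .docx is a zip file whose main part is word/document.xml) and that the w: prefix is declared on the root element, as Word normally does. The variable names and the file location are hypothetical, and this is not necessarily how the attached EE_CFM_XML_DB.txt does it.

```cfml
<!--- Assumption: docxPath is a hypothetical location for TheDocument.docx --->
<cfset docxPath = expandPath("./TheDocument.docx")>

<!--- Read the main document part out of the zip archive as text --->
<cfzip action="read" file="#docxPath#" entrypath="word/document.xml"
	variable="docXmlText" charset="utf-8">
<cfset docXml = xmlParse(docXmlText)>

<!--- Top-level tables: direct children of w:body only --->
<cfset outerTables = xmlSearch(docXml, "/w:document/w:body/w:tbl")>

<!--- Tables that sit somewhere inside a cell of another table --->
<cfset innerTables = xmlSearch(docXml, "//w:tc//w:tbl")>

<cfoutput>
	Top-level tables: #arrayLen(outerTables)#<br>
	Nested tables: #arrayLen(innerTables)#
</cfoutput>
```

If the counts match what the paths above describe (one top-level table, one nested table), the XPath expressions are resolving the w: namespace as expected.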
Most Valuable Expert 2015
Commented:
It's certainly possible IF you're comfortable with XPath and have a good grasp of the OOXML structure. Just keep in mind that styles are embedded in the OOXML too, which makes for some VERY gnarly nesting...

Something like this gets you part of the way. It loops through all table rows and cells and extracts the "text runs" in each cell. However, it doesn't account for that 2nd nested table, so its text ends up getting merged with the text from the 1st table. My XPath isn't great, so I'd probably open a separate thread and ask about the XPath expressions only. (CF isn't relevant to that part and might scare off someone knowledgeable about XPath but unfamiliar with CFML ;-)

<cfset rowData = []>

<!--- loop through rows in table --->
<cfset tableRows = xmlSearch(MyXml.document.body.tbl, ".//w:tr")>
<cfloop array="#tableRows#" index="row">
	<cfset columnData = []>
	<!--- loop through cells in current row --->
	<cfset tableCells = xmlSearch(row, ".//w:tc")>
	<cfloop array="#tableCells#" index="cell">
	
		<cfset nestedTables = xmlSearch(cell, ".//w:tbl")>
		<cfif arrayLen(nestedTables)>
			<!--- TODO: do the same type of extraction with the nested table --->
		</cfif>
		
		
		<!--- get all text runs in current cell --->
		<cfset textRuns = []>
		<cfset textNodes = xmlSearch(cell, ".//w:t")>
		<cfloop array="#textNodes#" index="node">
			<cfset arrayAppend(textRuns, node.xmlText)>
		</cfloop>
		
		<!--- save text for current cell/column --->
		<cfset arrayAppend(columnData, arrayToList(textRuns, " "))>
	</cfloop>
	
	<!--- save columns in current row --->
	<cfset arrayAppend(rowData, columnData)>
</cfloop>

<cfdump var="#rowData#">
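One possible way to keep the nested table separate is to match only direct children (./w:tr, ./w:tc, ./w:tbl) instead of all descendants (.//), and to recurse whenever a cell contains a table of its own. The sketch below is untested against TheDocument.docx; extractTable is a hypothetical helper name, and it assumes rows and cells are direct children of their table and row (Word can wrap them in other elements such as content controls, which this does not handle).

```cfml
<cffunction name="extractTable" returntype="array" output="false"
	hint="Hypothetical helper: returns an array of rows, each an array of cell structs.">
	<cfargument name="tbl" type="any" required="true">
	<cfset var rows = []>
	<cfset var cells = "">
	<cfset var cellInfo = "">
	<cfset var textRuns = "">
	<cfset var row = "">
	<cfset var cell = "">
	<cfset var node = "">
	<cfset var nested = "">

	<!--- "./w:tr" matches only rows that are direct children of this table --->
	<cfloop array="#xmlSearch(arguments.tbl, './w:tr')#" index="row">
		<cfset cells = []>
		<!--- "./w:tc" matches only cells that belong directly to this row --->
		<cfloop array="#xmlSearch(row, './w:tc')#" index="cell">
			<cfset cellInfo = { text = "", tables = [] }>
			<cfset textRuns = []>
			<!--- text runs in paragraphs that are direct children of this cell,
			      which skips paragraphs inside a nested w:tbl --->
			<cfloop array="#xmlSearch(cell, './w:p//w:t')#" index="node">
				<cfset arrayAppend(textRuns, node.xmlText)>
			</cfloop>
			<cfset cellInfo.text = arrayToList(textRuns, " ")>
			<!--- recurse into tables nested directly inside this cell --->
			<cfloop array="#xmlSearch(cell, './w:tbl')#" index="nested">
				<cfset arrayAppend(cellInfo.tables, extractTable(nested))>
			</cfloop>
			<cfset arrayAppend(cells, cellInfo)>
		</cfloop>
		<cfset arrayAppend(rows, cells)>
	</cfloop>
	<cfreturn rows>
</cffunction>

<!--- Usage, assuming MyXml is the parsed document as in the code above --->
<cfset outerTables = xmlSearch(MyXml, "/w:document/w:body/w:tbl")>
<cfset rowData = extractTable(outerTables[1])>
<cfdump var="#rowData#">
```

Each cell comes back as a struct with its own text plus a tables array, so the nested table's rows stay attached to the cell that contains them rather than being merged into the outer table's text. Note that this still extracts plain text only; the formatting elements (w:rPr, w:pPr) would need to be carried along separately.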
