Robb Hill (United States of America) asked:

Removing HTML/DIV/P tags from an XML file without losing the value in the element.

I load an xml file dynamically.

Some of the xml files have some html tags that I need to clean out.

I cannot simply ignore the tag as that will remove the content within that element.

For example for a given xml file suppose I have this structure:

<my:ProjectDescription>
<html xmlns="http://www.w3.org/1999/xhtml" xml:space="preserve">
<p>BLAH BLAH BLAH
</p>
</html>
</my:ProjectDescription>

In this scenario I would want the contents of the <p> to be the ProjectDescription.

Is there a function or methodology that I can use to manipulate this?


Thanks,
Terry Woods (New Zealand)

Is there a specific tool or language you want to use? Is it a process you'll do regularly, or just a one-off?
If it's really HTML and not XHTML, then this is a pretty tough task. Thus you should look into the HTML Agility Pack.
Robb Hill (ASKER)

It would be nice to do it with SQL, as that is where I am processing these files, but it is a one-off, so I can do it another way if necessary. There are just many files.
Here is another approach that might help me "hack" around this, if cleaning the XML is too much effort.

Can I rename the tag if it is found?

So, for example, if within the ProjectDescription node I find html and then <p>,
can I replace the p with a div?
Well, you can check whether the html element of your XML contains CDATA. If not, it should be sufficient to use a regex to search for tags (something like <[^>]*>) and remove the matches. If there is CDATA, you need to do the replacement only before and after the CDATA sections.

But the basic problem remains: HTML is running text with optional hard breaks like p or br. And depending on the encoding, there may be Unicode characters that act as line breaks (NEL, LS, PS), which would remain in the resulting text and lead to different formatting.
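
For what it's worth, here is a minimal Python sketch of the regex idea above, assuming the fragment is already available as a plain string. The function name strip_tags and the three patterns are illustrative only, not something anyone in this thread posted. It keeps CDATA content as-is (apart from whitespace), collapses NEL/LS/PS and ordinary whitespace into single spaces, and does not decode entities such as &amp;.

```python
import re

# Hypothetical helper names; the patterns are a sketch, not a full HTML parser.
CDATA = re.compile(r"(<!\[CDATA\[.*?\]\]>)", re.DOTALL)
TAG = re.compile(r"<[^>]*>")
# Ordinary whitespace plus NEL (U+0085), LS (U+2028) and PS (U+2029).
BREAKS = re.compile(r"[\s\u0085\u2028\u2029]+")


def strip_tags(fragment: str) -> str:
    """Remove markup outside CDATA sections and flatten line breaks."""
    pieces = []
    # Splitting on a capturing group keeps the CDATA sections in the result.
    for part in CDATA.split(fragment):
        if part.startswith("<![CDATA["):
            # Keep the CDATA payload, drop only its delimiters.
            pieces.append(part[len("<![CDATA["):-len("]]>")])
        else:
            pieces.append(TAG.sub("", part))
    return BREAKS.sub(" ", "".join(pieces)).strip()


print(strip_tags("<html><p>BLAH BLAH BLAH\n</p></html>"))
```

A regex like this will misfire on a literal ">" inside an attribute value, so it is only reasonable when the markup is as simple as the sample in the question.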
Interesting.

I don't think formatting matters in this project, as the purpose is to get the data into SQL: not to ever recreate the XML, but just to store the data for auditing reasons. So it's just important that I get the data.

The problem arises for me with the HTML because, as my query goes over the nodes, the div/p/strong/u and similar tags are random, which changes my structure and breaks my inserts into my table.
Another idea, if this is possible: if it finds an html-like attribute, then iterate down to the deepest child node of the html nodes, as in the example below. It looks like I can guarantee html is a parent node in all of these cases, but I cannot guarantee what HTML follows. Sometimes it's div, sometimes p, etc.

html
div
p
strong
u
"GET THIS VALUE"
Well, when it is for auditing reasons, I would store the HTML. Displaying or modifying it later in the front end is much better, because it means that you have stored the original values, which is what audits are about.

Otherwise, you should provide more context on which values are relevant; then you may consider scraping them from the HTML and storing only the actual values.
Yeah, the second part of what you said is relevant, I think. If I still had access to the SharePoint site, I could remove the editor on the InfoPath form that allowed this rich text to get injected into the XML.

Here is a detailed XPath listing of the problem this creates.

AppropriationRequestForm[1]/RequestHeader[1]/ProjectDescription[1]
AppropriationRequestForm[1]/RequestHeader[1]/ProjectDescription[1]/html[1]
AppropriationRequestForm[1]/RequestHeader[1]/ProjectDescription[1]/html[1]/@space
AppropriationRequestForm[1]/RequestHeader[1]/ProjectDescription[1]/html[1]/div[1]
AppropriationRequestForm[1]/RequestHeader[1]/ProjectDescription[1]/html[1]/div[1]/span[1]
AppropriationRequestForm[1]/RequestHeader[1]/ProjectDescription[1]/html[1]/div[1]/span[1]/@style
AppropriationRequestForm[1]/RequestHeader[1]/ProjectDescription[1]/html[1]/div[1]/span[1]/font[1]
AppropriationRequestForm[1]/RequestHeader[1]/ProjectDescription[1]/html[1]/div[1]/span[1]/font[1]/@face
AppropriationRequestForm[1]/RequestHeader[1]/ProjectDescription[1]/html[1]/div[1]/span[1]/p[1]
AppropriationRequestForm[1]/RequestHeader[1]/ProjectDescription[1]/html[1]/div[1]/span[1]/p[1]/@class
AppropriationRequestForm[1]/RequestHeader[1]/ProjectDescription[1]/html[1]/div[1]/span[1]/p[1]/@style
AppropriationRequestForm[1]/RequestHeader[1]/ProjectDescription[1]/html[1]/div[1]/span[1]/p[1]/span[1]
AppropriationRequestForm[1]/RequestHeader[1]/ProjectDescription[1]/html[1]/div[1]/span[1]/p[1]/span[1]/@style
AppropriationRequestForm[1]/RequestHeader[1]/ProjectDescription[1]/html[1]/div[1]/span[1]/p[1]/span[1]/span[1]
AppropriationRequestForm[1]/RequestHeader[1]/ProjectDescription[1]/html[1]/div[1]/span[1]/p[1]/span[1]/span[1]/@style
AppropriationRequestForm[1]/RequestHeader[1]/ProjectDescription[1]/html[1]/div[1]/span[1]/p[1]/span[1]/sup[1]


If I could query this XML file for all values from the beginning to the end of ProjectDescription, and just let any level of child elements and values all be concatenated as the value for ProjectDescription, that would be just fine.

Is that possible?
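
Outside of SQL this kind of concatenation is straightforward. As a hedged illustration, here is a small Python sketch using the standard library's xml.etree.ElementTree, where itertext() walks every descendant text node regardless of which tags wrap it. The namespace URI urn:example:my, the sample document and the function name flattened_text are all made up for the example; the real InfoPath namespace is not given in this thread.

```python
import xml.etree.ElementTree as ET

# Assumed sample; "urn:example:my" is a placeholder namespace URI.
SAMPLE = """<my:RequestHeader xmlns:my="urn:example:my">
  <my:ProjectDescription>
    <html xmlns="http://www.w3.org/1999/xhtml" xml:space="preserve">
      <div><span style="x"><p class="c">BLAH <strong>BLAH</strong> BLAH</p></span></div>
    </html>
  </my:ProjectDescription>
</my:RequestHeader>"""


def flattened_text(elem):
    """Concatenate all descendant text and collapse runs of whitespace."""
    return " ".join("".join(elem.itertext()).split())


root = ET.fromstring(SAMPLE)
desc = root.find("my:ProjectDescription", {"my": "urn:example:my"})
print(flattened_text(desc))  # BLAH BLAH BLAH
```

Because the tag names are never inspected, it does not matter whether the rich text editor produced div, p, span or anything else. As far as I know, the XPath string value of an element works the same way, so a SQL-side equivalent may exist, but that is not verified here.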
I'm still not clear enough on the problem... can you give a more complex example of the data you've got to work with, and what you want to reduce it to? Is any of it nested?

I say more complex because we need to handle the worst case scenario, not just the simplest.
Essentially the issue was that HTML tags are showing up specifically in this node, because this was an InfoPath form and this entry had a rich text editor on it. Honestly, the worst-case scenario is every HTML tag on the planet :)

With that being said, I have just been manually filtering them out in a WHERE clause, treating them as nodes I do not want to see. It's not graceful, but it's working.

The only other thing I can say is that these types of nodes do have the HTML schema associated with them. But once again, I have it working with a WHERE filter. If you can do anything with that info then we can try; otherwise I will close using the WHERE filter as the solution.
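
The point about the HTML schema suggests filtering by namespace instead of listing tag names one by one. This is a rough Python sketch of that idea, not the WHERE clause actually used here. The sample document, the urn:example:my namespace, the my:Amount element and the function name data_elements are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree writes namespaced tags as "{uri}localname".
XHTML = "{http://www.w3.org/1999/xhtml}"

# Assumed sample; "urn:example:my" and my:Amount are placeholders.
SAMPLE = """<my:RequestHeader xmlns:my="urn:example:my">
  <my:ProjectDescription><html xmlns="http://www.w3.org/1999/xhtml"><p>BLAH</p></html></my:ProjectDescription>
  <my:Amount>100</my:Amount>
</my:RequestHeader>"""


def data_elements(root):
    """Yield (tag, text) for every element outside the XHTML namespace."""
    for elem in root.iter():
        # Skip anything the rich text editor injected, whatever the tag is.
        if not isinstance(elem.tag, str) or elem.tag.startswith(XHTML):
            continue
        yield elem.tag, (elem.text or "").strip()


for tag, text in data_elements(ET.fromstring(SAMPLE)):
    print(tag, repr(text))
```

Note that this only skips the markup nodes; the text they contain is dropped along with them, so it addresses the changing structure but not the missing description value.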
Filtered out the tags with a WHERE filter.
ASKER CERTIFIED SOLUTION
Robb Hill
If you've got it solved, there's probably no point in spending more time on it... glad you got something working!
Yeah, I always like to know a better way, but this project is anything but normal. It's literally a massive SQL migration to store data from forms indefinitely from about 20 SharePoint sites.