  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 4520

regex to strip illegal characters from xml

I have an XML file that I need to make sure is valid before I parse it.

I think the simplest way is to run the entire contents of the XML file through preg_replace() to delete all characters except these:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Will someone please help me write the regex?

Thanks.

Here is the post related to this:
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_24397128.html

Asked by: ray-solomon
 
TViYH Commented:
Maybe you could create an array of all the characters that you want to get rid of, then just run that through preg_replace().

ray-solomon (Author) Commented:
I would rather do it as explained in my question because I don't know what all the illegal characters are. There could be a lot.
So it would probably be best to write a regex that only looks for the valid characters. I think this would be more reliable.

preg_replace($regex, '', $xml);

Jonah11 Commented:
ray,

whenever you find yourself doing something like this, i recommend googling for an existing solution.  no sense trying to re-invent the wheel (and make one with bumpy edges :) )

here's something to start:
http://www.contentwithstyle.co.uk/content/xml-validation-in-php
ray-solomon (Author) Commented:
I already googled and I have already seen that web page before. That is not the answer I am looking for.
I don't need to know which lines contain invalid XML; I already know this. I simply want to make sure the XML file is valid before I parse it, by removing illegal characters with a regex pattern that uses hexadecimal notation.
Please re-read my question.

Maybe I could have written my question differently so I don't confuse some people.

ray-solomon (Author) Commented:
There used to be a few people here that were really good at regular expressions.

Jonah11 Commented:
Sorry, I misunderstood, Ray.  It seems that you already have the answer though, unless I am still misunderstanding.  Won't replacing the "#x" with "\x" in the string from your original post, and then using that in preg_replace() as you did above, work?


ray-solomon (Author) Commented:
I think I have the answer, but I am not sure how to implement it with regex.

As for replacing the #x with \x I was not sure about that. I will give that a try.

David S. Commented:
Try this:
$string = preg_replace("/[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-\x10FFFF]/","",$string);


ray-solomon (Author) Commented:
hmm... That does not work either.

Warning: preg_replace() [function.preg-replace]: Compilation failed: range out of order in character class at offset 22 in /home/....


The example code is below.

I am trying to get rid of characters like this, but not limited to this: &#x1C;b&#x1D; while keeping the rest of the XML code intact.
$contents = '<MyReader Version="1.0">
<SearchTableRows>
<Row Date="2009-05-2" Category="mycategory" SubCategory="mysubcategory" Location="mylocation" Title="I started this business for next to nothing, and it now pays me 7 figures annually.  Interested ?" Price="" Link="http://domain.com" Viewed="False" />
<Row Date="2009-05-2" Category="mycategory" SubCategory="mysubcategory" Location="mylocation" Title="Been downsized? what is your plan &#x1C;b&#x1D;? earn while you..." Price="" Link="http://domain.com" Viewed="False" />
<Row Date="2009-05-3" Category="mycategory" SubCategory="mysubcategory" Location="mylocation" Title="Get Paid to Give Away FREE Prescription Card!" Price="" Link="http://domain.com" Viewed="False" />
</SearchTableRows>
</MyReader>';
 
echo preg_replace("/[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-\x10FFFF]/", '', $contents);
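
A note on the warning above: inside a double-quoted PHP string, \xD7FF is the single byte 0xD7 followed by the literal letters FF, because \x takes at most two hex digits. The character class therefore ends up containing the range 0-\x10, which is what "range out of order" refers to. A minimal sketch of the whitelist idea, assuming the input is valid UTF-8 and using a hypothetical function name, could rely on PCRE's \x{...} syntax together with the u modifier:

```php
<?php
// Sketch only: strip raw characters outside the XML 1.0 Char production.
// Assumes $xml is valid UTF-8; strip_invalid_xml_chars() is a hypothetical name.
function strip_invalid_xml_chars(string $xml): string
{
    // Single quotes stop PHP from expanding \x.. itself. The /u modifier makes
    // PCRE treat pattern and subject as UTF-8, so \x{...} may exceed 0xFF.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    $clean = preg_replace($pattern, '', $xml);
    if ($clean === null) {
        // preg_replace() returns null on failure, e.g. malformed UTF-8 input.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}
```

Like the attempts above, this only removes raw characters. A character reference such as &#x1C; is spelled with ordinary valid characters (&, #, x and so on), so it is left alone.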


ray-solomon (Author) Commented:
So maybe the regex needs to just include the ranges of characters that need to be stripped out.
I think currently, we are trying to strip out the valid xml characters which is not what we want to do.

David S. Commented:
Yeah, I don't remember seeing 5 digit hex numbers used for that before. Try it without that last range of characters.

> I am trying to get rid of characters like this, but not limited to this: &#x1C;b&#x1D; while retaining the rest of the xml code intact.

Wait. You actually want to remove character entities? Why?
echo preg_replace("/[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD]/",'', $contents);


ray-solomon (Author) Commented:
I got no errors that time, but it did not remove this: &#x1C;b&#x1D; from one of the element nodes. You will still see it in the Title attribute.

David S. Commented:
That regular expression won't remove character entities, because that's not what it's written to do.

Why do you want to remove those character entities?

ray-solomon (Author) Commented:
I need to remove all illegal characters from the xml file so when I parse it, I won't get an error stating there is an invalid xmlChar anywhere.

1) I need to know what the invalid xml characters are.
2) remove them from the file by using preg_replace

Now in another post, someone gave me the ranges of valid characters, which are supposedly written like this:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

I thought it would be a good idea to use a regex to strip out all characters that don't match. However, I think we were about to strip all the valid characters. Not good.

So instead I think we need to know a list of invalid characters or the ranges and put that into a form of regex.


thehagman Commented:
If the characters are given as hexadecimal entities, try the code below.
It should transform
'Plan &#x1C;b&#x1D;' -> "Plan <badUCSchar codepoint="1C"/>b<badUCSchar codepoint="1D"/>"
or change it to produce your preferred kind of output.
However, that would merely detect (hexadecimal) character entities not belonging to the valid XML chars. It still won't detect totally malformed entities like
&#x123XYZ;


$output = preg_replace('/&#x0*([0-8BCEF]|1[0-9A-F]|D[8-9][0-9A-F][0-9A-F]|1[1-9A-F][0-9A-F]{4}|[2-9A-F][0-9A-F]{4}|[2-9A-F][0-9A-F]{5}|[1-9A-F][0-9A-F{6,});/', '<badUCSchar codepoint="$1"/>', $input)


ray-solomon (Author) Commented:
thehagman, Thank you!

It works flawlessly. I have not run into any problems with it.

I should note there was a closing bracket missing near the end of the regex, but I fixed it.

echo preg_replace('/&#x0*([0-8BCEF]|1[0-9A-F]|D[8-9][0-9A-F][0-9A-F]|1[1-9A-F][0-9A-F]{4}|[2-9A-F][0-9A-F]{4}|[2-9A-F][0-9A-F]{5}|[1-9A-F][0-9A-F]{6,});/', '', $contents);

Many thanks.
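
A few caveats on the pattern above, as far as I can tell: its [2-9A-F][0-9A-F]{4} branch also matches references such as &#x20000;, which fall inside the valid range [#x10000-#x10FFFF]; the D[8-9] branch covers only D800 to D9FF rather than the whole excluded block up to DFFF; it matches uppercase hex digits only; and decimal references such as &#28; are not covered. A rough alternative sketch, assuming it is acceptable to decode each numeric reference and compare its code point against the ranges quoted in the question (both function names are hypothetical):

```php
<?php
// Hypothetical helper: true if $cp is allowed by the ranges quoted in the question.
function is_valid_xml_codepoint(int $cp): bool
{
    return $cp === 0x9 || $cp === 0xA || $cp === 0xD
        || ($cp >= 0x20 && $cp <= 0xD7FF)
        || ($cp >= 0xE000 && $cp <= 0xFFFD)
        || ($cp >= 0x10000 && $cp <= 0x10FFFF);
}

// Hypothetical helper: drop numeric character references (hex or decimal)
// whose code point is not a valid XML character; everything else is kept.
function strip_invalid_char_refs(string $xml): string
{
    $result = preg_replace_callback(
        '/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/',
        function (array $m): string {
            $isHex = $m[1] !== '';
            // Leading zeros carry no value; drop them before measuring length.
            $digits = ltrim($isHex ? $m[1] : $m[2], '0');
            if (strlen($digits) > 7) {
                return ''; // far beyond 0x10FFFF in either notation
            }
            $cp = $isHex ? (int) hexdec($digits) : (int) $digits;
            return is_valid_xml_codepoint($cp) ? $m[0] : '';
        },
        $xml
    );
    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $xml;
}

echo strip_invalid_char_refs('what is your plan &#x1C;b&#x1D;? earn while you...');
```

Malformed references like &#x123XYZ; still do not match here and are left in place, so a parser may still reject them.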
Question has a verified solution.
