Solved

regex to strip illegal characters from xml

Posted on 2009-05-11
Last Modified: 2012-05-06
I have an XML file that I need to make sure is valid before I parse it.

I think the simplest way is to run the entire contents of the xml file through preg_replace() to delete all characters except these:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Will someone please help write the regex.

Thanks.

Here is the post related to this:
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_24397128.html

Question by:ray-solomon
16 Comments
 
LVL 1

Expert Comment

by:TViYH
ID: 24360331
Maybe you could create an array of all the characters that you want to get rid of, then just run that through preg_replace().
0
 
LVL 10

Author Comment

by:ray-solomon
ID: 24360367
I would rather do it as explained in my question because I don't know what all the illegal characters are. There could be a lot.
So it would probably be best to write a regex that only looks for the valid characters. I think this would be more reliable.

preg_replace($regex, '', $xml);
0
 
LVL 7

Expert Comment

by:Jonah11
ID: 24360422
ray,

whenever you find yourself doing something like this, i recommend googling for an existing solution.  no sense trying to re-invent the wheel (and make one with bumpy edges :) )

here's something to start:
http://www.contentwithstyle.co.uk/content/xml-validation-in-php
0
 
LVL 10

Author Comment

by:ray-solomon
ID: 24360460
I already googled and I have already seen that web page before. That is not the answer I am looking for.
I don't need to know which lines contain invalid xml. I already know that. I simply want to make sure the xml file is valid before I parse it, by removing illegal characters with a regex pattern using hexadecimal notation.
Please re-read my question.

Maybe I could have written my question differently so I don't confuse some people.
0
 
LVL 10

Author Comment

by:ray-solomon
ID: 24360475
There used to be a few people here that were really good at regular expressions.
0
 
LVL 7

Expert Comment

by:Jonah11
ID: 24360526
Sorry I misunderstood Ray.  It seems that you already have the answer tho, unless I am still misunderstanding.  Won't replacing the "#x" with "\x" in your op string, and then using that in preg_replace as you did above work?

0
 
LVL 10

Author Comment

by:ray-solomon
ID: 24360595
I think I have the answer, but I am not sure how to implement it with regex.

As for replacing the #x with \x I was not sure about that. I will give that a try.
0
 
LVL 42

Expert Comment

by:David S.
ID: 24360955
Try this:
$string = preg_replace("/[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-\x10FFFF]/","",$string);

0
 
LVL 10

Author Comment

by:ray-solomon
ID: 24361035
hmm... That does not work either.

Warning: preg_replace() [function.preg-replace]: Compilation failed: range out of order in character class at offset 22 in /home/....


The example code is below.

I am trying to get rid of characters like this, but not limited to this: &#x1C;b&#x1D; while retaining the rest of the xml code intact.
$contents = '<MyReader Version="1.0">

<SearchTableRows>

<Row Date="2009-05-2" Category="mycategory" SubCategory="mysubcategory" Location="mylocation" Title="I started this business for next to nothing, and it now pays me 7 figures annually.  Interested ?" Price="" Link="http://domain.com" Viewed="False" />

<Row Date="2009-05-2" Category="mycategory" SubCategory="mysubcategory" Location="mylocation" Title="Been downsized? what is your plan &#x1C;b&#x1D;? earn while you..." Price="" Link="http://domain.com" Viewed="False" />

<Row Date="2009-05-3" Category="mycategory" SubCategory="mysubcategory" Location="mylocation" Title="Get Paid to Give Away FREE Prescription Card!" Price="" Link="http://domain.com" Viewed="False" />

</SearchTableRows>

</MyReader>';
 

echo preg_replace("/[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-\x10FFFF]/", '', $contents);
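
A likely explanation for the "range out of order" warning: inside a double-quoted PHP string, \x consumes at most two hex digits, so "\xD7FF" is the byte 0xD7 followed by two literal F characters, and "\x10000-\x10FFFF" contains a range that runs from the character 0 (0x30) down to the byte 0x10. The sketch below only makes those bytes visible; the helper name describeBytes is made up for illustration.

```php
<?php
// Show the raw bytes of a string as space-separated hex pairs.
// (describeBytes is a hypothetical helper, not part of the thread.)
function describeBytes(string $s): string
{
    return strtoupper(implode(' ', str_split(bin2hex($s), 2)));
}

// "\xD7FF" is three bytes, not the single code point U+D7FF.
$threeBytes = describeBytes("\xD7FF");            // D7 46 46

// The last range of the pattern: byte 0x10, "000", "-", byte 0x10, "FFFF".
// PCRE therefore sees the range '0'-0x10, whose end is below its start.
$lastRange = describeBytes("\x10000-\x10FFFF");   // 10 30 30 30 2D 10 46 46 46 46
```

So the pattern never described the intended code point ranges; it described single bytes mixed with literal hex digits.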

0
 
LVL 10

Author Comment

by:ray-solomon
ID: 24361070
So maybe the regex needs to just include the ranges of characters that need to be stripped out.
I think currently, we are trying to strip out the valid xml characters which is not what we want to do.
0
 
LVL 42

Expert Comment

by:David S.
ID: 24361103
Yeah, I don't remember seeing 5 digit hex numbers used for that before. Try it without that last range of characters.

> I am trying to get rid of characters like this, but not limited to this: &#x1C;b&#x1D; while retaining the rest of the xml code intact.

Wait. You actually want to remove character entities? Why?
echo preg_replace("/[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD]/",'', $contents);
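
For raw characters (as opposed to character entities), the whitelist from the question can be expressed with PCRE's \x{...} syntax and the /u modifier. This is a minimal sketch that assumes the input is valid UTF-8; the function name stripInvalidXmlChars is made up for illustration.

```php
<?php
// Remove every character outside the XML 1.0 Char ranges quoted in the question.
// Assumption: $xml is UTF-8. With the /u modifier, preg_replace() returns null
// when the subject is not valid UTF-8, hence the nullable return type.
function stripInvalidXmlChars(string $xml): ?string
{
    // Single quotes, so that PCRE (not PHP) interprets the \x{...} escapes.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    return preg_replace($pattern, '', $xml);
}
```

Called on "a\x1Cb\x1Dc" this returns "abc". Called on the literal text &#x1C; it changes nothing, because that entity is spelled with six ordinary ASCII characters, all of which are valid XML characters.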

0
 
LVL 10

Author Comment

by:ray-solomon
ID: 24361133
I got no errors that time, but it did not remove this: &#x1C;b&#x1D; from one of the element nodes. You will still see it in the Title attribute.
0
 
LVL 42

Expert Comment

by:David S.
ID: 24361141
That regular expression won't remove character entities, because that's not what it's written to do.

Why do you want to remove those character entities?
0
 
LVL 10

Author Comment

by:ray-solomon
ID: 24361462
I need to remove all illegal characters from the xml file so when I parse it, I won't get an error stating there is an invalid xmlChar anywhere.

1) I need to know what the invalid xml characters are.
2) remove them from the file by using preg_replace

Now in another post, someone gave me the ranges of characters that are valid which is represented as this supposedly:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

I thought it would be a good idea to use a regex to strip out all characters that don't match; however, I think we were about to strip all the valid characters. Not good.

So instead I think we need to know a list of invalid characters or the ranges and put that into a form of regex.
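
A note on the worry above: a negated class such as [^...] matches everything that is not listed, so a pattern built from the valid ranges removes the invalid characters, not the valid ones. The same set can also be written as an explicit blacklist. The sketch below assumes UTF-8 input and the /u modifier; the function name stripByBlacklist is made up for illustration.

```php
<?php
// Explicit list of the code points that the XML 1.0 Char ranges exclude:
//   - C0 control characters other than tab (0x09), LF (0x0A) and CR (0x0D)
//   - the noncharacters U+FFFE and U+FFFF
// Surrogates (U+D800 to U+DFFF) cannot occur in valid UTF-8, so they are left out.
// Assumption: $xml is UTF-8; preg_replace() returns null if it is not.
function stripByBlacklist(string $xml): ?string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x{FFFE}\x{FFFF}]/u', '', $xml);
}
```

Like the whitelist form, this only affects raw characters in the file. Character entities such as &#x1C; are untouched and need a separate pass.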

0
 
LVL 20

Accepted Solution

by:
thehagman earned 500 total points
ID: 24363371
If the characters are given as hexadecimal entities, try the code below.
It should transform
'Plan &#x1C;b&#x1D;' -> "Plan <badUCSchar codepoint="1C"/>b<badUCSchar codepoint="1D"/>"
or you can change it to produce your preferred kind of output.
However, that would merely detect (hexadecimal) character entities not belonging to the valid XML chars. It still won't detect totally malformed entities like
&#x123XYZ;


$output = preg_replace('/&#x0*([0-8BCEF]|1[0-9A-F]|D[8-9][0-9A-F][0-9A-F]|1[1-9A-F][0-9A-F]{4}|[2-9A-F][0-9A-F]{4}|[2-9A-F][0-9A-F]{5}|[1-9A-F][0-9A-F{6,});/', '<badUCSchar codepoint="$1"/>', $input)
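
An alternative sketch, not the method used in this answer: instead of encoding the ranges as hex-digit patterns, capture the digits, convert them to a number, and compare against the valid ranges from the question. This also covers decimal references and lowercase hex digits. As far as I can tell, the pattern above would also match five-digit references such as &#x20000;, which fall inside the valid range #x10000-#x10FFFF, and would miss &#xDA00; through &#xDFFF;. The function names below are made up for illustration.

```php
<?php
// True if $cp lies in one of the valid XML 1.0 ranges quoted in the question.
function isValidXmlCodePoint(int $cp): bool
{
    return $cp === 0x9 || $cp === 0xA || $cp === 0xD
        || ($cp >= 0x20 && $cp <= 0xD7FF)
        || ($cp >= 0xE000 && $cp <= 0xFFFD)
        || ($cp >= 0x10000 && $cp <= 0x10FFFF);
}

// Remove numeric character references (&#xHEX; or &#DEC;) whose code point
// is not a valid XML character. Valid references are left untouched.
function stripInvalidCharRefs(string $xml): ?string
{
    return preg_replace_callback(
        '/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/',
        function (array $m): string {
            $isHex  = $m[1] !== '';
            $digits = ltrim($isHex ? $m[1] : $m[2], '0');
            if (strlen($digits) > 7) {
                return ''; // far beyond U+10FFFF; also avoids integer overflow
            }
            $cp = intval($digits, $isHex ? 16 : 10);
            return isValidXmlCodePoint($cp) ? $m[0] : '';
        },
        $xml
    );
}
```

With the sample Title attribute from earlier in the thread, 'plan &#x1C;b&#x1D;?' becomes 'plan b?'. Malformed entities such as &#x123XYZ; are still not matched and pass through unchanged.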

0
 
LVL 10

Author Closing Comment

by:ray-solomon
ID: 31580373
thehagman, Thank you!

It works flawlessly. I have not run into any problems with it.

I should note there was a closing bracket missing near the end of the regex ([0-9A-F{6,} should read [0-9A-F]{6,}), but I fixed it.

echo preg_replace('/&#x0*([0-8BCEF]|1[0-9A-F]|D[8-9][0-9A-F][0-9A-F]|1[1-9A-F][0-9A-F]{4}|[2-9A-F][0-9A-F]{4}|[2-9A-F][0-9A-F]{5}|[1-9A-F][0-9A-F]{6,});/', '', $contents);

Many thanks.
0
