Solved

regex to strip illegal characters from xml

Posted on 2009-05-11
Last Modified: 2012-05-06
I have an XML file that I need to make sure is valid before I parse it.

I think the simplest way is to run the entire contents of the XML file through preg_replace() to delete all characters except these:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Will someone please help write the regex?

Thanks.

Here is the post related to this:
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_24397128.html

Question by:ray-solomon
16 Comments
 
LVL 1

Expert Comment

by:TViYH
ID: 24360331
Maybe you could create an array of all the characters that you want to get rid of, then just run that through preg_replace().
 
LVL 10

Author Comment

by:ray-solomon
ID: 24360367
I would rather do it as explained in my question because I don't know what all the illegal characters are. There could be a lot.
So it would probably be best to write a regex that only looks for the valid characters. I think this would be more reliable.

preg_replace($regex, '', $xml);
 
LVL 7

Expert Comment

by:Jonah11
ID: 24360422
ray,

whenever you find yourself doing something like this, i recommend googling for an existing solution.  no sense trying to re-invent the wheel (and make one with bumpy edges :) )

here's something to start:
http://www.contentwithstyle.co.uk/content/xml-validation-in-php
 
LVL 10

Author Comment

by:ray-solomon
ID: 24360460
I already googled and I have already seen that web page before. That is not the answer I am looking for.
I don't need to know which lines contain invalid xml; I already know this. I simply want to make sure the xml file is valid before I parse it, by removing illegal characters with a regex pattern using hexadecimal notation.
Please re-read my question.

Maybe I could have written my question differently so I don't confuse some people.
 
LVL 10

Author Comment

by:ray-solomon
ID: 24360475
There used to be a few people here that were really good at regular expressions.
 
LVL 7

Expert Comment

by:Jonah11
ID: 24360526
Sorry, I misunderstood, Ray.  It seems that you already have the answer though, unless I am still misunderstanding.  Won't replacing the "#x" with "\x" in the string from your original post, and then using that in preg_replace as you did above, work?

 
LVL 10

Author Comment

by:ray-solomon
ID: 24360595
I think I have the answer, but I am not sure how to implement it with regex.

As for replacing the #x with \x I was not sure about that. I will give that a try.
 
LVL 42

Expert Comment

by:David S.
ID: 24360955
Try this:
$string = preg_replace("/[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-\x10FFFF]/","",$string);

 
LVL 10

Author Comment

by:ray-solomon
ID: 24361035
hmm... That does not work either.

Warning: preg_replace() [function.preg-replace]: Compilation failed: range out of order in character class at offset 22 in /home/....


The example code is below.

I am trying to get rid of characters like this, but not limited to this: &#x1C;b&#x1D; while retaining the rest of the xml code intact.
$contents = '<MyReader Version="1.0">

<SearchTableRows>

<Row Date="2009-05-2" Category="mycategory" SubCategory="mysubcategory" Location="mylocation" Title="I started this business for next to nothing, and it now pays me 7 figures annually.  Interested ?" Price="" Link="http://domain.com" Viewed="False" />

<Row Date="2009-05-2" Category="mycategory" SubCategory="mysubcategory" Location="mylocation" Title="Been downsized? what is your plan &#x1C;b&#x1D;? earn while you..." Price="" Link="http://domain.com" Viewed="False" />

<Row Date="2009-05-3" Category="mycategory" SubCategory="mysubcategory" Location="mylocation" Title="Get Paid to Give Away FREE Prescription Card!" Price="" Link="http://domain.com" Viewed="False" />

</SearchTableRows>

</MyReader>';
 

echo preg_replace("/[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-\x10FFFF]/", '', $contents);
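
A note on the error above: inside a PHP double-quoted string, \x consumes at most two hex digits, so "\xD7FF" becomes the single byte 0xD7 followed by the literal letters FF, and a later part of the class ends up as the range 0-\x10, which runs backwards. A minimal sketch of one way around this, assuming the input is UTF-8: write the code points with PCRE's \x{...} syntax in a single-quoted string and add the /u modifier. The function name strip_invalid_xml_chars is made up for illustration.

```php
<?php
// Sketch: delete every code point outside the valid ranges quoted in the question.
// strip_invalid_xml_chars is a hypothetical helper name, not from the thread.
function strip_invalid_xml_chars(string $xml): string
{
    // Single quotes so PHP hands \x{...} to PCRE untouched;
    // the /u modifier makes PCRE treat pattern and subject as UTF-8.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    $clean = preg_replace($pattern, '', $xml);
    if ($clean === null) {
        // preg_replace returns null on error, e.g. when $xml is not valid UTF-8.
        throw new InvalidArgumentException('Input could not be processed as UTF-8');
    }
    return $clean;
}

// Literal control characters 0x1C and 0x1D are removed.
echo strip_invalid_xml_chars("plan \x1Cb\x1D?"), "\n"; // prints "plan b?"
```

This only touches literal characters. Character references such as &#x1C; are plain ASCII text as far as the regex is concerned, so they pass through unchanged.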

 
LVL 10

Author Comment

by:ray-solomon
ID: 24361070
So maybe the regex needs to just include the ranges of characters that need to be stripped out.
I think currently, we are trying to strip out the valid xml characters which is not what we want to do.
 
LVL 42

Expert Comment

by:David S.
ID: 24361103
Yeah, I don't remember seeing 5 digit hex numbers used for that before. Try it without that last range of characters.

> I am trying to get rid of characters like this, but not limited to this: &#x1C;b&#x1D; while retaining the rest of the xml code intact.

Wait. You actually want to remove character entities? Why?
echo preg_replace("/[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD]/",'', $contents);

 
LVL 10

Author Comment

by:ray-solomon
ID: 24361133
I got no errors that time, but it did not remove this: &#x1C;b&#x1D; from one of the element nodes. You will still see it in the Title attribute.
 
LVL 42

Expert Comment

by:David S.
ID: 24361141
That regular expression won't remove character entities, because that's not what it's written to do.

Why do you want to remove those character entities?
 
LVL 10

Author Comment

by:ray-solomon
ID: 24361462
I need to remove all illegal characters from the xml file so when I parse it, I won't get an error stating there is an invalid xmlChar anywhere.

1) I need to know what the invalid xml characters are.
2) remove them from the file by using preg_replace

Now in another post, someone gave me the ranges of characters that are valid which is represented as this supposedly:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

I thought it would be a good idea to use a regex to strip out all characters that don't match; however, I think we were about to strip all the valid characters. Not good.

So instead I think we need to know a list of invalid characters or the ranges and put that into a form of regex.
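
On listing the invalid characters instead: a negated class [^...] already means "anything not in this list", so with the valid ranges inside it, only the invalid characters are removed. For completeness, here is a sketch of the explicit form, assuming UTF-8 input. Within valid UTF-8, the code points outside the ranges quoted above reduce to the control characters below 0x20 other than tab, LF and CR, plus U+FFFE and U+FFFF. The name strip_listed_invalid_chars is hypothetical.

```php
<?php
// Sketch: list the invalid code points explicitly instead of negating the valid ones.
// strip_listed_invalid_chars is a hypothetical helper name.
function strip_listed_invalid_chars(string $xml): string
{
    // Surrogates (D800-DFFF) are left out on purpose: they cannot occur in
    // valid UTF-8, and PCRE typically rejects them inside a /u pattern.
    $pattern = '/[\x{0}-\x{8}\x{B}\x{C}\x{E}-\x{1F}\x{FFFE}\x{FFFF}]/u';
    $clean = preg_replace($pattern, '', $xml);
    if ($clean === null) {
        // null signals a PCRE error, e.g. a subject that is not valid UTF-8.
        throw new InvalidArgumentException('Input could not be processed as UTF-8');
    }
    return $clean;
}

echo strip_listed_invalid_chars("a\x00b\x0Bc\x1Fd"), "\n"; // prints "abcd"
```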

 
LVL 20

Accepted Solution

by:
thehagman earned 500 total points
ID: 24363371
If the characters are given as hexadecimal entities, try the code below.
It should transform
'Plan &#x1C;b&#x1D;' -> "Plan <badUCSchar codepoint="1C"/>b<badUCSchar codepoint="1D"/>"
or change it to produce your preferred kind of output.
However, that would merely detect (hexadecimal) character entities not belonging to the valid XML chars. It still won't detect totally malformed entities like
&#x123XYZ;


$output = preg_replace('/&#x0*([0-8BCEF]|1[0-9A-F]|D[8-9][0-9A-F][0-9A-F]|1[1-9A-F][0-9A-F]{4}|[2-9A-F][0-9A-F]{4}|[2-9A-F][0-9A-F]{5}|[1-9A-F][0-9A-F{6,});/', '<badUCSchar codepoint="$1"/>', $input)

 
LVL 10

Author Closing Comment

by:ray-solomon
ID: 31580373
thehagman, Thank you!

It works flawlessly. I have not run into any problems with it.

I should note there was a closing bracket missing near the end of the regex, but I fixed it.

echo preg_replace('/&#x0*([0-8BCEF]|1[0-9A-F]|D[8-9][0-9A-F][0-9A-F]|1[1-9A-F][0-9A-F]{4}|[2-9A-F][0-9A-F]{4}|[2-9A-F][0-9A-F]{5}|[1-9A-F][0-9A-F]{6,});/', '', $contents);

Many thanks.
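
The pattern above appears to leave some gaps: it is case-sensitive, it ignores decimal references such as &#28;, it covers only D800-D9FF of the surrogate block, it does not flag FFFE or FFFF, and its [2-9A-F][0-9A-F]{4} branch matches 20000-FFFFF, which falls inside the valid ranges quoted in the question. An alternative sketch, not thehagman's method: capture the digits, convert them to a number, and compare that number against the valid ranges. The helper names is_xml_char and strip_invalid_char_refs are hypothetical.

```php
<?php
// Sketch: remove numeric character references whose code point is not a valid XML char.
// is_xml_char and strip_invalid_char_refs are hypothetical helper names.
function is_xml_char(int $cp): bool
{
    return $cp === 0x9 || $cp === 0xA || $cp === 0xD
        || ($cp >= 0x20 && $cp <= 0xD7FF)
        || ($cp >= 0xE000 && $cp <= 0xFFFD)
        || ($cp >= 0x10000 && $cp <= 0x10FFFF);
}

function strip_invalid_char_refs(string $xml): string
{
    $result = preg_replace_callback(
        // Group 1: hex digits of &#x...;   Group 2: decimal digits of &#...;
        '/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/',
        function (array $m): string {
            $isHex = $m[1] !== '';
            $digits = ltrim($isHex ? $m[1] : $m[2], '0');
            // More than 7 significant digits is above 0x10FFFF in either base.
            if (strlen($digits) > 7) {
                return '';
            }
            $cp = $isHex ? (int) hexdec($digits) : (int) $digits;
            // Keep the reference as written when valid, drop it otherwise.
            return is_xml_char($cp) ? $m[0] : '';
        },
        $xml
    );
    if ($result === null) {
        throw new RuntimeException('preg_replace_callback failed');
    }
    return $result;
}

echo strip_invalid_char_refs('plan &#x1C;b&#x1D;?'), "\n"; // prints "plan b?"
```

Like the accepted pattern, this leaves malformed references such as &#x123XYZ; alone, since they never match the digit groups.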