Solved

XSL: how to remove special characters

Posted on 2014-02-14
Last Modified: 2014-02-18
I'm trying to obtain the contents of Email, but sometimes the response that I'm working with has special characters, like: <Email>Â‡PERSON@TEST.COM‡</Email>

So how can I account for instances where the email does not look like
<Email>PERSON@TEST.COM</Email>?
Question by:badtz7229
LVL 82

Expert Comment

by:leakim971
Why do you get these special characters? Can you remove them at the source instead of at the end?

Author Comment

by:badtz7229
No, I cannot fix this at the source.
I need logic that retrieves the value between the special characters.
LVL 35

Expert Comment

by:mccarl
Are you using version 1 or 2 of XSL? If you are unsure about this, look at the root element of your xsl and it should have a version="???" attribute.

Author Comment

by:badtz7229
Xsl 1
LVL 35

Accepted Solution

by:mccarl (earned 500 total points)
Ok, so if you are constrained to using version 1.0 then it's not as nice (in version 2.0 you could use a regex to do this more cleanly), but it is possible. The following removes any characters that AREN'T specified in that long string in the translate call: the inner translate strips the allowed characters from the value, which leaves only the unwanted ones, and the outer translate then removes those unwanted characters from the original value. It works fine for the example that you have above, but if you have any other characters that should be kept as part of the email, just append them to the long string in the below...
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:template match="/Email">
      <Email>
         <xsl:value-of select="translate(., translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@.', ''), '')"/>
      </Email>
   </xsl:template>
</xsl:stylesheet>
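
For readers who are not tied to version 1.0, the regex approach mentioned above might look roughly like this. It is only a sketch: it assumes an XSLT 2.0 processor is available (Saxon, for example) and it reuses the same keep-list of letters, digits, @ and . as the version 1.0 stylesheet.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:template match="/Email">
      <Email>
         <!-- replace() is an XPath 2.0 function; the negated character class
              matches every character that is NOT a letter, digit, @ or . -->
         <xsl:value-of select="replace(., '[^A-Za-z0-9@.]', '')"/>
      </Email>
   </xsl:template>
</xsl:stylesheet>
```

As with the translate version, any extra characters that should survive would have to be added inside the square brackets.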


Note that if you would rather go the other way and only remove characters that you specify (this may be useful if you know that the special characters will only ever be limited to certain ones), then you could do the following instead...
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:template match="/Email">
      <Email>
         <xsl:value-of select="translate(., 'Â‡', '')"/>
      </Email>
   </xsl:template>
</xsl:stylesheet>

LVL 60

Expert Comment

by:Geert Bormans
Just noticed this question
Some comments and suggestions

- The problem likely comes from a misinterpretation of the character encoding somewhere down the chain. Â‡ smells like a double-byte UTF-8 character being falsely interpreted as ISO-8859-1 or Windows-1252 (I bet the latter). Please check carefully that only the email fields are "infected". Errors like this happen when information from non-XML files is merged into XML files without using XML tools. Patching the symptoms might hide a deeper pain that can bite at a later stage. "Fix the source or the chain if you can" should be the number 1 advice.

- Though the character (I have not checked, but I assume it is a quote type, a bracket type or another type of separator) would still be annoying if it were interpreted correctly as a single character. So encoding error or not, you want it to be removed.

- I would be careful with mccarl's first solution, because that involves parsing allowed email syntax, and the logic for a valid email address is complex and forgiving... some characters are allowed in the domain but not in the local part or vice versa, and email addresses can be pretty weird: me@[ip6:123.123.123.123.123] or something. Hard to be complete using XSLT 1.

- mccarl's second suggestion might be better, if you think you know the characters that can be removed. But you only want to remove them at the start and end, I assume. You should test a whole bunch of incoming XML, and there is a big chance you will never be complete. Note that translate() only works on single characters, so translate on "Â‡" will cause "Â" and "‡" to be removed anywhere in the email, not necessarily in sequence.

- Personally I would start with analysing the character sequences in as many test documents as you can get. I am very curious how an email address such as hervé@me.com would appear. I have a suspicion that it appears with a double-byte sequence such as Ã or Â and another character. If that is the case, you don't only need to remove the two-character sequences, but transform them into something else.

- Also... if you are getting the two-byte separator "Â‡" you might also get a "{" or "<" separator at the front and the closing equivalent at the end. So I would also investigate that, make a list of possible separator sequences and pull them out.

- For this small test set (one email address) both of mccarl's suggestions will work. More generally, I would make a conditional removal of the first two and last two characters, if the first is a "Â" or "Ã". Might be stronger in the long run.
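
The conditional removal suggested in the last point could be sketched in XSLT 1.0 along these lines. This is a rough illustration, not a tested fix for the real feed: it assumes the root element is Email, as in mccarl's stylesheets, that the wrapper is exactly two characters on each side, and, as an extra precaution not mentioned above, it also checks that the second-to-last character is "Â" or "Ã" before trimming. The variable names are made up for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:template match="/Email">
      <xsl:variable name="raw" select="string(.)"/>
      <xsl:variable name="len" select="string-length($raw)"/>
      <!-- First character, and second-to-last character, of the value -->
      <xsl:variable name="first" select="substring($raw, 1, 1)"/>
      <xsl:variable name="tail" select="substring($raw, $len - 1, 1)"/>
      <Email>
         <xsl:choose>
            <!-- Character references are used so the stylesheet does not depend
                 on its own file encoding: xC2 is Â and xC3 is Ã -->
            <xsl:when test="$len &gt; 4
                            and ($first = '&#xC2;' or $first = '&#xC3;')
                            and ($tail = '&#xC2;' or $tail = '&#xC3;')">
               <!-- Skip the first two characters and drop the last two -->
               <xsl:value-of select="substring($raw, 3, $len - 4)"/>
            </xsl:when>
            <xsl:otherwise>
               <!-- Leave values without the wrapper untouched -->
               <xsl:value-of select="$raw"/>
            </xsl:otherwise>
         </xsl:choose>
      </Email>
   </xsl:template>
</xsl:stylesheet>
```

Unlike translate(), this only touches the two ends of the value, so a "Â" or "‡" in the middle of an address would be left alone.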

But, as I said before, fixing this one way or another without a proper analysis is like taking away the symptoms of a severe illness, or like painting a car to hide the corrosion without removing it. It might open a fine can of worms.

Author Closing Comment

by:badtz7229
Thank you, this worked.
And to the other person's point: yes indeed, this is an issue on the other developers' side, where they are not parsing their response correctly. Unfortunately, they are not going to resolve this, so I need to work with what I've got.
LVL 60

Expert Comment

by:Geert Bormans
Mmh, my point was not necessarily that the others should fix it; my point was that you should investigate carefully what you got. mccarl's solution will likely break on all but the most common cases. Anyway, some appreciation of the effort would have been nice :-)
LVL 35

Expert Comment

by:mccarl
Cheers, glad I could help! :)