vbguy
asked on
parsing out non-iso characters in java
Hi,
I need to find a function that will strip out all non-ISO characters from a string.
There is a situation where people cut and paste from a Word document into an HTML field.
The string they paste contains non-ISO characters.
Is there a way to do this?
Thanks
That method should strictly be called 'stripNonIso88591Chars'
So if you create a String of all possible 'characters' and call that method, you will be left with only ISO8859-1 characters:
    StringBuffer sb = new StringBuffer(1 << 16);
    for (int i = 0; i <= 0xFFFF; i++) {
        sb.append((char) i);
    }
    System.out.println(stripNonIso88591Chars(sb.toString()));
public String strip(String s)
{
String buffer result = new StringBuffer();
char[] chars = s.toCharArray();
for (int i=0; i<chars.length; i++)
{
char c = chars[i];
if (!Character.isISOControl() )
{
result.append(c);
}
}
return result.toString();
}
Hello,
It seems you have some solutions already, but thinking about what your actual problem is, I am tempted to encourage you to take a look at java.net.URLEncoder and java.net.URLDecoder. This is just a guess, but it might get you closer to your solution.
br,
-jT
oops, that should have actually been:
public String strip(String s)
{
String buffer result = new StringBuffer();
char[] chars = s.toCharArray();
for (int i=0; i<chars.length; i++)
{
char c = chars[i];
if (Character.isISOControl(c) )
{
result.append(c);
}
}
return result.toString();
}
>>if (Character.isISOControl(c) )
That will only strip control characters out
ASKER
Hi CEHJ
I was testing stripNonIso88591Chars.
It didn't seem to strip the following: ¼.
I thought this was a non-ISO character as well.
Let me know if I am wrong.
Thanks,
Arthur
Did you try the code I posted? (Ignore CEHJ's comments about it.)
just noticed another typo :)
    public String strip(String s)
    {
        StringBuffer result = new StringBuffer();
        char[] chars = s.toCharArray();
        for (int i=0; i<chars.length; i++)
        {
            char c = chars[i];
            if (Character.isISOControl(c) )
            {
                result.append(c);
            }
        }
        return result.toString();
    }
Specifically what char encoding are you using for your form?
I think I misinterpreted your requirements a little. What you need is more like this:
    public static String strip(String s)
    {
        StringBuffer result = new StringBuffer(s.length());
        char[] chars = s.toCharArray();
        for (int i=0; i<chars.length; i++)
        {
            char c = chars[i];
            // keep only chars in the 0x00-0xFF range that are not control characters
            if (c<=0xff && !Character.isISOControl(c))
            {
                result.append(c);
            }
        }
        return result.toString();
    }
I don't think it's going to help you with your problem though :(
>>I thought this was a non-ISO character as well.
No, it's character 0xBC in ISO-8859-1.
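One way to check this kind of claim without a chart is to ask the JDK whether a character is encodable in ISO-8859-1. The following is only a sketch; the class and method names (`Latin1Check`, `isLatin1`) are made up for illustration and are not from the thread. Note that the encoder also accepts the control ranges (0x00-0x1F, 0x7F-0x9F), which the BitSet approach discussed here deliberately excludes.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class Latin1Check {
    // Hypothetical helper: true if c maps to a byte in ISO-8859-1 (0x00-0xFF).
    static boolean isLatin1(char c) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(c);
    }

    public static void main(String[] args) {
        System.out.println(isLatin1('\u00bc')); // the "one quarter" character
        System.out.println(isLatin1('\u201c')); // Word's left curly quote
    }
}
```

Under these assumptions ¼ counts as an ISO-8859-1 character while the curly quote does not, which matches what the asker observed.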
ASKER
Hi CEHJ,
Thanks. One final question:
How did you identify what the character actually is?
For certain characters, I would like to replace them with something instead of deleting them.
So something like this:
    public static String stripNonIso88591Chars(String s) {
        StringBuffer sb = new StringBuffer(s);
        final int NUM_CHARS = (1 << 8) + 1;
        BitSet bs = new BitSet(NUM_CHARS);
        for (int i = 0x20; i <= 0x7E; i++) {
            bs.set(i);
        }
        for (int i = 0xA1; i <= 0xFF; i++) {
            bs.set(i);
        }
        for (int i = sb.length() - 1; i >= 0; i--) {
            if (bs.get(sb.charAt(i)) == false) {
                if (sb.charAt(i) == '“') {
                    // replace “ with "
                } else {
                    sb.deleteCharAt(i);
                }
            }
        }
        return sb.toString();
    }
How would this be accomplished?
Thanks,
Arthur
What would you like to replace it with?
ASKER
Hi,
It would be ASCII code 34, the standard double quote.
I am not sure what Word replaces it with, but it is not the standard one.
Thanks,
Arthur
Change
>>
    for (int i = sb.length() - 1; i >= 0; i--) {
        if (bs.get(sb.charAt(i)) == false) {
            sb.deleteCharAt(i);
        }
    }
>>
to
    for (int i = sb.length() - 1; i >= 0; i--) {
        if (bs.get(sb.charAt(i)) == false) {
            sb.setCharAt(i, '\"');
        }
    }
ASKER
Hi CEHJ,
The only problem with that is that it would replace all non-ISO chars with ".
Is it possible to say: if it's a non-ISO char and that char is “ (the backwards quote that Word substitutes for the standard quote), replace it with "?
Thanks,
Arthur
One would need to know what character code is being used there. If your nickname is correct - maybe you can tell us using some vb?
ASKER
This is what I was looking for:
    String mytest = "“ test “";
    StringBuffer sb = new StringBuffer(mytest);
    System.out.println("this is my test " + sb);
    for (int i = 0; i < sb.length(); i++)
    {
        int x_int = (int) sb.charAt(i);
        if (x_int == 8220)
        {
            sb.setCharAt(i, '\"');
        }
    }
I figured out the 8220 by the following:
    char letter = '“';
    int x = (int) letter;
    System.out.println("this is the special char " + x);
Does that make sense?
Sure does - 8220 (0x201C) is the Unicode code point for the "LEFT DOUBLE QUOTATION MARK".
Presumably, you'll need to trap the right one too (8221)?
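Handling both the left (8220, U+201C) and right (8221, U+201D) marks could look roughly like the sketch below. It uses the standard `String.replace(char, char)` instead of the index loop from the thread; the class and method names (`QuoteFix`, `straightenQuotes`) are hypothetical.

```java
public class QuoteFix {
    // Hypothetical helper: map Word's curly double quotes to the plain ASCII quote (code 34).
    static String straightenQuotes(String s) {
        return s.replace('\u201C', '"')   // LEFT DOUBLE QUOTATION MARK (8220)
                .replace('\u201D', '"');  // RIGHT DOUBLE QUOTATION MARK (8221)
    }

    public static void main(String[] args) {
        System.out.println(straightenQuotes("\u201Ctest\u201D"));
    }
}
```

Everything other than these two characters is left untouched, so this would still need to be combined with one of the stripping methods if other non-ISO characters should be removed.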
>No, it's character 0xBC in ISO-8859-1.
0xBC != 8220
> I figured out the 8220
Replacing that with something else is fine, but what if the user actually enters that ISO char? Your question also implies you need to deal with all non-ISO characters, so how would you decide which others to ignore?
> char letter = '“';
You also should be testing the actual value returned from the browser.
If that code you posted solves your problem then feel free to close this question :)
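To see the actual values arriving from the browser, one option is to dump each char of the received string as a hex code. This is only a sketch: `CodePointDump` and `describe` are made-up names, and how the string is obtained from the form is assumed and not shown. It iterates over `char` values, so characters outside the Basic Multilingual Plane would show up as two surrogate entries.

```java
public class CodePointDump {
    // Hypothetical helper: one line per char, showing its hex value and
    // whether it falls inside the 0x00-0xFF range covered by ISO-8859-1.
    static String describe(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            String range = (c <= 0xFF) ? "ISO-8859-1" : "outside ISO-8859-1";
            out.append(String.format("U+%04X %s", (int) c, range)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the value submitted by the HTML form.
        System.out.print(describe("a\u00bc\u201C"));
    }
}
```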
>>0xBC != 8220
Please read the comments more carefully
I've read it very carefully, have you ;)
You seem to be getting confused - why else would you be relating two unrelated things?
If so then explain what you are referring to?
Precisely what do you mean to say by
>>0xBC != 8220
?
I'm still waiting for your clarification so I can answer your question ......
guess I'm not going to get it :-D
>>I'm still waiting for your clarification so I can answer your question ......
What clarification? The statement
>>0xBC != 8220
makes no sense whatever, which is why i assume you're confused
> No, it's character 0xBC in ISO-8859-1.
that statement.
Please follow the discussion so as not to waste people's time.
>>that statement.
Well we've moved on a long time ago since that statement. What's difficult to understand about
¼ == 0xBC in ISO8859-1
?
The following should provide what you need, and allows you to specify any characters that need replacing:
    // include any characters here you would like replaced
    private static String replace = "\u00bc\u00bd\u00be";

    public static String strip(String s)
    {
        StringBuffer result = new StringBuffer(s.length());
        char[] chars = s.toCharArray();
        for (int i=0; i<chars.length; i++)
        {
            char c = chars[i];
            // chars above 0xff and control chars are dropped
            if (c<=0xff && !Character.isISOControl(c))
            {
                if (replace.indexOf(c)==-1)
                {
                    result.append(c);
                }
                else
                {
                    // char is in the replace list: substitute a plain double quote
                    result.append("\"");
                }
            }
        }
        return result.toString();
    }
you can see the defined characters at:
http://en.wikipedia.org/wiki/ISO_8859-1
You just need to decide which ones you want to strip and which to replace, and tweak the above code accordingly.
Let me know if you have any questions.
ASKER
Hi objects,
Can you explain:
sb.charAt(i) <=0xff
and how do you know what "\u00bc\u00bd\u00be" represents in terms of characters?
Thanks,
Arthur
> sb.charAt(i) <=0xff
checking if char value is less than or equal to 0xff.
> and how do you know what "\u00bc\u00bd\u00be" represents in terms of characters?
They are ¼, ½ and ¾.
>>checking if char value is less than or equal to 0xff.
Why would you do that?
ASKER
How did you know \u00bc = ¼? Is there a chart you can point me to? The link you gave didn't have that kind of representation.
Thanks.
> The link you gave didnt have that kind of representation.
yes it does.
E.g. to find 0xBC, look up Bx on the left and xC on the top.
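The same lookup can be done in code by splitting the value into its high and low hex digits. The sketch below assumes the character is in the 0x00-0xFF range; `ChartLookup` and `chartCell` are hypothetical names. `Character.getName` (available since Java 7) is used in `main` to print the Unicode name as a cross-check against the chart.

```java
public class ChartLookup {
    // Hypothetical helper: row and column of a char in a 16x16 ISO-8859-1 chart.
    // Only meaningful for chars in the 0x00-0xFF range.
    static String chartCell(char c) {
        int row = (c >> 4) & 0xF;  // high hex digit, the "Bx" on the left
        int col = c & 0xF;         // low hex digit, the "xC" on the top
        return String.format("row %Xx, column x%X", row, col);
    }

    public static void main(String[] args) {
        char quarter = '\u00bc';
        System.out.println(chartCell(quarter));
        System.out.println(Character.getName(quarter));
    }
}
```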
I missed it the first time I looked at it too :)
http://www.proteanit.net/misc/iso8859-1.htm
But i'm a little puzzled vbguy. Why would you want to get rid of (if you do) 0xBC, since it *is* one of the ISO8859-1 chars?
If you do some testing you may find the code I posted is a little more efficient than that code.
>>the code I posted is a little more efficient than that code
The only trouble is it won't work.
vbguy, if you want the best of both worlds and want to eliminate non-iso chars and do replacements at the same time you can do:
    public static String stripNonIso88591Chars2(String s) {
        // Replace left and right double quotes with normal one
        String targets = "\u201C\u201D";
        String replacements = "\u0022\u0022";
        StringBuffer sb = new StringBuffer(s);
        final int NUM_CHARS = (1 << 8);
        BitSet bs = new BitSet(NUM_CHARS);
        for (int i = 0x20; i <= 0x7E; i++) {
            bs.set(i);
        }
        for (int i = 0xA1; i <= 0xFF; i++) {
            bs.set(i);
        }
        for (int i = sb.length() - 1; i >= 0; i--) {
            char c = sb.charAt(i);
            int ixFound = targets.indexOf(c);
            if (ixFound > -1) {
                sb.setCharAt(i, replacements.charAt(ixFound));
            }
            else if (bs.get(c) == false) {
                sb.deleteCharAt(i);
            }
        }
        return sb.toString();
    }
:-)
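    // Requires: import java.util.BitSet;
    public String stripNonIsoChars(String s) {
        StringBuffer sb = new StringBuffer(s);
        final int NUM_CHARS = 1 << 8;
        BitSet bs = new BitSet(NUM_CHARS);
        // printable ASCII, including '~' (0x7E)
        for (int i = 0x20; i <= 0x7E; i++) {
            bs.set(i);
        }
        // upper ISO-8859-1 range, including 0xFF
        for (int i = 0xA1; i <= 0xFF; i++) {
            bs.set(i);
        }
        // walk backwards so deletions do not shift the indices still to be visited
        for (int i = sb.length() - 1; i >= 0; i--) {
            if (bs.get(sb.charAt(i)) == false) {
                sb.deleteCharAt(i);
            }
        }
        return sb.toString();
    }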