parsing out non-iso characters in java

Hi,

I need to find a function that will parse out all non-iso characters from a string.

People sometimes cut and paste from a Word document into an HTML form field.
The string they paste contains non-ISO characters.

Is there a way to do this?

Thanks

Try something like:

import java.util.BitSet;

public String stripNonIsoChars(String s) {
      StringBuffer sb = new StringBuffer(s);
      final int NUM_CHARS = 1 << 8;
      // Mark the printable ISO-8859-1 characters; both upper bounds are inclusive
      BitSet bs = new BitSet(NUM_CHARS);
      for(int i = 0x20;i <= 0x7E;i++) {
            bs.set(i);
      }
      for(int i = 0xA1;i <= 0xFF;i++) {
            bs.set(i);
      }
      // Walk backwards so that deletions do not shift the indexes still to be visited
      for(int i = sb.length() - 1;i >= 0;i--) {
            if (!bs.get(sb.charAt(i))) {
                  sb.deleteCharAt(i);
            }
      }
      return sb.toString();
}

That method should strictly be called 'stripNonIso88591Chars'

So if you create a String of all possible 'characters' and call that method, you will be left with only the printable ISO-8859-1 characters:

StringBuffer sb = new StringBuffer(1 << 16);
for(int i = 0;i <= 0xFFFF;i++) {
      sb.append((char)i);
}
System.out.println(stripNonIso88591Chars(sb.toString()));
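As a rough check of that claim, here is a self-contained sketch. It assumes inclusive upper bounds (0x7E and 0xFF) for the two printable ranges, and the class name StripAllCharsDemo is made up for illustration. It counts the survivors instead of printing them, which avoids console encoding problems.

```java
import java.util.BitSet;

class StripAllCharsDemo {

    // Keeps only the printable ISO-8859-1 characters: 0x20..0x7E and 0xA1..0xFF.
    static String stripNonIso88591Chars(String s) {
        BitSet printable = new BitSet(1 << 8);
        printable.set(0x20, 0x7E + 1); // set(from, to) excludes 'to'
        printable.set(0xA1, 0xFF + 1);
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (printable.get(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Build a string holding every possible char value once.
        StringBuilder all = new StringBuilder(1 << 16);
        for (int i = 0; i <= 0xFFFF; i++) {
            all.append((char) i);
        }
        String kept = stripNonIso88591Chars(all.toString());
        // 0x20..0x7E is 95 characters and 0xA1..0xFF is another 95.
        System.out.println(kept.length()); // prints 190
    }
}
```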

public String strip(String s)
{
String buffer result = new StringBuffer();
char[] chars = s.toCharArray();
for (int i=0; i<chars.length; i++)
{
char c = chars[i];
if (!Character.isISOControl())
{
result.append(c);
}
}
return result.toString();
}

Hello,

It seems that you have some solutions already, but thinking about what your actual problem is, I am tempted to encourage you to take a look at java.net.URLEncoder and java.net.URLDecoder. This is just a guess, but it might get you closer to your solution.

br,
-jT

oops, that should have actually been:

public String strip(String s)
{
String buffer result = new StringBuffer();
char[] chars = s.toCharArray();
for (int i=0; i<chars.length; i++)
{
char c = chars[i];
if (Character.isISOControl(c))
{
result.append(c);
}
}
return result.toString();
}

>>if (Character.isISOControl(c))

That will only strip control characters out

ASKER

Hi CEHJ
I was testing stripNonIso88591Chars.

It didn't seem to strip the following character:
¼. I thought this was a non-ISO character as well.

Let me know if I am wrong.

Thanks,
Arthur

Did you try the code I posted? (Ignore CEHJ's comments about it.)

just noticed another typo :)

public String strip(String s)
{
StringBuffer result = new StringBuffer();
char[] chars = s.toCharArray();
for (int i=0; i<chars.length; i++)
{
char c = chars[i];
if (Character.isISOControl(c))
{
result.append(c);
}
}
return result.toString();
}

Specifically what char encoding are you using for your form?

I think I misinterpreted your requirements a little. What you need is more like this:

public static String strip(String s)
{
      StringBuffer result = new StringBuffer(s.length());
      char[] chars = s.toCharArray();
      for (int i=0; i<chars.length; i++)
      {
            char c = chars[i];
            // keep only non-control characters in the ISO-8859-1 range
            if (c<=0xff && !Character.isISOControl(c))
            {
                  result.append(c);
            }
      }
      return result.toString();
}

I don't think it's going to help you with your problem though :(

>>I though this was a non iso characters as well.

No, it's character 0xBC in ISO-8859-1.
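One way to check this kind of thing yourself is to ask an ISO-8859-1 encoder whether it can represent a character. The sketch below is only an illustration; the class and method names are made up, and it uses java.nio.charset.StandardCharsets, which needs Java 7 or later. Note that the encoder also accepts control characters, which the BitSet approach above deliberately strips.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class Latin1Check {

    // True if the character has a representation in ISO-8859-1.
    static boolean isLatin1(char c) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(c);
    }

    public static void main(String[] args) {
        char quarter = '\u00BC';   // the vulgar fraction one quarter
        char leftQuote = '\u201C'; // the left double quotation mark Word inserts
        System.out.println(Integer.toHexString(quarter) + " " + isLatin1(quarter));     // prints bc true
        System.out.println(Integer.toHexString(leftQuote) + " " + isLatin1(leftQuote)); // prints 201c false
    }
}
```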

ASKER

Hi CEHJ,

Thanks. One final question,

How did you identify what the character actually is?
For certain characters, I would like to replace them with something instead of deleting them.
So something like this:

public static String stripNonIso88591Chars(String s) {
      StringBuffer sb = new StringBuffer(s);
      final int NUM_CHARS = (1 << 8) + 1;
      BitSet bs = new BitSet(NUM_CHARS);
      for(int i = 0x20;i <= 0x7E;i++) {
            bs.set(i);
      }
      for(int i = 0xA1;i <= 0xFF;i++) {
            bs.set(i);
      }
      for(int i = sb.length() - 1;i >= 0;i--) {
            if (bs.get(sb.charAt(i)) == false) {
                  if (sb.charAt(i) == '“') {
                        // replace “ with "
                  }
                  else {
                        sb.deleteCharAt(i);
                  }
            }
      }
      return sb.toString();
}

How would the following be accomplished?

Thanks,
Arthur

What would you like to replace it with?

ASKER

Hi,
It would be ASCII code 34, the standard double quote.
I am not sure what Word replaces it with, but it is not the standard one.

Thanks,
Arthur

Change

>>
for(int i = sb.length() - 1;i >= 0;i--) {
if (bs.get(sb.charAt(i)) == false) {
sb.deleteCharAt(i);
}
}
>>

to

for(int i = sb.length() - 1;i >= 0;i--) {
if (bs.get(sb.charAt(i)) == false) {
sb.setCharAt(i, '\"');
}
}

ASKER

Hi CEHJ,

The only problem with that is it would replace all non-ISO chars with ".
Is it possible to say: if it's a non-ISO char, and that char is “ (the backwards quote that Word substitutes for the standard quote), replace it with "?

Thanks,
Arthur

One would need to know what character code is being used there. If your nickname is correct - maybe you can tell us using some vb?

ASKER

This is what I was looking for:
String mytest = "“ test “";
StringBuffer sb = new StringBuffer(mytest);
System.out.println ("this is my test " + sb);

for(int i = 0; i < sb.length(); i++)
{
      int x_int = (int) sb.charAt(i);

      if (x_int == 8220)
      {
            sb.setCharAt(i, '\"');
      }
}

I figured out the 8220 by the following:
char letter = '“';
int x = (int) letter;
System.out.println ("this is the special char " + x);

Does that make sense?

Sure does - 8220 is the Unicode code point (U+201C) for the "LEFT DOUBLE QUOTATION MARK"

Presumably, you'll need to trap the right one too (8221, U+201D)?
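For what it's worth, handling both quotes can be as small as two String.replace calls. This is only a sketch; the class and method names are made up for illustration.

```java
class SmartQuotes {

    // Replaces the left (U+201C, 8220) and right (U+201D, 8221) double
    // quotation marks with the plain ASCII double quote (code 34).
    static String straightenDoubleQuotes(String s) {
        return s.replace('\u201C', '"').replace('\u201D', '"');
    }

    public static void main(String[] args) {
        System.out.println(straightenDoubleQuotes("\u201Ctest\u201D")); // prints "test"
    }
}
```

Word's curly single quotes have their own code points and would need the same treatment if they turn up in the pasted text.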

>No, it's character 0xBC in ISO-8859-1.

0xBC != 8220

> I figured out the 8220

Replacing that with something else is fine, but what if the user actually enters that ISO char? Your question also implies you need to deal with all non-ISO characters, so how would you decide which others to ignore?

> char letter = '“';

You also should be testing the actual value returned from the browser.
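A simple way to do that is to dump the code points of whatever string the server actually receives. The sketch below is an illustration only: the class name is made up, and the sample string stands in for a real request parameter.

```java
class CharDump {

    // Returns each char of the string as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the submitted form value.
        String received = "\u201Ca\u00BC";
        System.out.println(dump(received)); // prints U+201C U+0061 U+00BC
    }
}
```

If the dump shows something other than U+201C where the curly quote should be, the form's character encoding is probably being decoded incorrectly on the way in.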

If that code you posted solves your problem then feel free to close this question :)

>>0xBC != 8220

Please read the comments more carefully

I've read it very carefully, have you ;)

You seem to be getting confused - why else would you be relating two unrelated things?

If so then explain what you are referring to?

Precisely what do you mean to say by

>>0xBC != 8220

?

I'm still waiting for your clarification so I can answer your question ......
guess I'm not going to get it :-D

>>I'm still waiting for your clarification so I can answer your question ......

What clarification? The statement

>>0xBC != 8220

makes no sense whatever, which is why i assume you're confused

> No, it's character 0xBC in ISO-8859-1.

that statement.
Please follow the discussion so as not to waste people's time.

>>that statement.

Well we've moved on a long time ago since that statement. What's difficult to understand about

¼ == 0xBC in ISO8859-1

?

The following should provide what you need, and lets you specify any characters that need replacing:

// include any characters here you would like replaced
private static String replace = "\u00bc\u00bd\u00be";

public static String strip(String s)
{
      StringBuffer result = new StringBuffer(s.length());
      char[] chars = s.toCharArray();
      for (int i=0; i<chars.length; i++)
      {
            char c = chars[i];
            if (c<=0xff && !Character.isISOControl(c))
            {
                  if (replace.indexOf(c)==-1)
                  {
                        result.append(c);
                  }
                  else
                  {
                        result.append("\"");
                  }
            }
      }
      return result.toString();
}

you can see the defined characters at:

http://en.wikipedia.org/wiki/ISO_8859-1

You just need to decide which ones you want to strip and which to replace, and tweak the above code accordingly.

Let me know if you have any questions.

ASKER

Hi objects,

Can you explain:

sb.charAt(i) <=0xff

and how do you know what "\u00bc\u00bd\u00be" represents in terms of characters?

Thanks,
Arthur

> sb.charAt(i) <=0xff

It checks whether the char value is less than or equal to 0xff (255), the highest code in ISO-8859-1.

> and how do you know what "\u00bc\u00bd\u00be" represents in terms of characters?

They are ¼, ½ and ¾ (one quarter, one half, three quarters).
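To make both points concrete, the following sketch prints the numeric value of each escaped character and whether it passes the `<= 0xff` test. The class and method names are made up; the values are printed as numbers so the output does not depend on the console encoding.

```java
class Latin1Range {

    // ISO-8859-1 uses the codes 0x00..0xFF, so anything above is outside it.
    static boolean inLatin1Range(char c) {
        return c <= 0xFF;
    }

    public static void main(String[] args) {
        // A backslash-u escape is just another way to write the char with that hex code.
        String samples = "\u00bc\u00bd\u00be\u201c";
        for (int i = 0; i < samples.length(); i++) {
            char c = samples.charAt(i);
            System.out.println((int) c + " " + inLatin1Range(c));
        }
        // prints:
        // 188 true
        // 189 true
        // 190 true
        // 8220 false
    }
}
```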

>>checking if char value is less than or equal to 0xff.

Why would you do that?

ASKER

How did you know \u00bc = 1/4? Is there a chart you can point me to? The link you gave didn't have that kind of representation.

Thanks.

> The link you gave didnt have that kind of representation.

yes it does.
eg. to find 0xBC look up Bx on the left, and xC on the top.
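The row and column are simply the two hex digits of the code. A tiny sketch of that split, with made-up names, valid only for codes up to 0xFF:

```java
class ChartLookup {

    // Splits a code in 0x00..0xFF into the chart's row label (high hex digit)
    // and column label (low hex digit).
    static String rowAndColumn(char c) {
        String row = Integer.toHexString(c >> 4).toUpperCase() + "x";
        String column = "x" + Integer.toHexString(c & 0xF).toUpperCase();
        return row + " / " + column;
    }

    public static void main(String[] args) {
        System.out.println(rowAndColumn('\u00BC')); // prints Bx / xC
    }
}
```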

I missed it the first time I looked at it too :)

http://www.proteanit.net/misc/iso8859-1.htm

But i'm a little puzzled vbguy. Why would you want to get rid of (if you do) 0xBC, since it *is* one of the ISO8859-1 chars?

If you do some testing you may find the code I posted is a little more efficient than that code.

>>the code I posted is a little more efficient than that code

The only trouble is it won't work.

vbguy, if you want the best of both worlds and want to eliminate non-iso chars and do replacements at the same time you can do:

      // requires: import java.util.BitSet;
      public static String stripNonIso88591Chars2(String s) {
            // Replace left and right double quotes with the normal one
            String targets = "\u201C\u201D";
            // Use \" here: a Unicode escape for the quote is translated before
            // parsing and would end the string literal early
            String replacements = "\"\"";
            StringBuffer sb = new StringBuffer(s);
            final int NUM_CHARS = (1 << 8);
            BitSet bs = new BitSet(NUM_CHARS);
            for (int i = 0x20; i <= 0x7E; i++) {
                  bs.set(i);
            }
            for (int i = 0xA1; i <= 0xFF; i++) {
                  bs.set(i);
            }
            for (int i = sb.length() - 1; i >= 0; i--) {
                  char c = sb.charAt(i);
                  int ixFound = targets.indexOf(c);
                  if (ixFound > -1) {
                        sb.setCharAt(i, replacements.charAt(ixFound));
                  }
                  else if (bs.get(c) == false) {
                        sb.deleteCharAt(i);
                  }
            }
            return sb.toString();
      }

:-)
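For anyone who wants to try the replace-then-strip idea on a sample string, here is a self-contained variant. It is a sketch, not the code posted above: it builds the result in a forward pass with StringBuilder, and the class, method and constant names are made up for illustration.

```java
import java.util.BitSet;

class Latin1Cleaner {

    // Characters to swap, and what to swap them for (same index in both strings).
    private static final String TARGETS = "\u201C\u201D";
    private static final String REPLACEMENTS = "\"\"";

    // Printable ISO-8859-1 characters: 0x20..0x7E and 0xA1..0xFF.
    private static final BitSet PRINTABLE = new BitSet(1 << 8);
    static {
        PRINTABLE.set(0x20, 0x7E + 1); // upper bound of set() is exclusive
        PRINTABLE.set(0xA1, 0xFF + 1);
    }

    static String clean(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int ix = TARGETS.indexOf(c);
            if (ix > -1) {
                out.append(REPLACEMENTS.charAt(ix)); // replace a known character
            } else if (PRINTABLE.get(c)) {
                out.append(c);                       // keep printable ISO-8859-1
            }                                        // everything else is dropped
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Curly quotes are replaced, the quarter sign is kept,
        // the bullet (U+2022) and the newline are dropped.
        String input = "\u201CPrice: \u00BC\u201D \u2022 done\n";
        String expected = "\"Price: \u00BC\"  done";
        System.out.println(clean(input).equals(expected)); // prints true
    }
}
```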