Solved

Parsing out non-ISO characters in Java

Posted on 2004-08-27
Last Modified: 2010-03-31
Hi,

I need a function that will strip all non-ISO characters from a string.

People sometimes cut and paste from a Word document into an HTML form field, and the pasted string contains non-ISO characters.

Is there a way to do this?

Thanks
Question by:vbguy
 
LVL 86

Expert Comment

by:CEHJ
ID: 11913965
Try something like:

public String stripNonIsoChars(String s) {
      StringBuffer sb = new StringBuffer(s);
      final int NUM_CHARS = 1 << 8;
      BitSet bs = new BitSet(NUM_CHARS);
      for(int i = 0x20;i < 0x7E;i++) {
            bs.set(i);
      }
      for(int i = 0xA1;i < 0xFF;i++) {
            bs.set(i);
      }
      for(int i = sb.length() - 1;i >= 0;i--) {
            if (bs.get(sb.charAt(i)) == false) {
                  sb.deleteCharAt(i);
            }
      }
      return sb.toString();
}
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11914116
That method should strictly be called 'stripNonIso88591Chars'
0
 
LVL 86

Accepted Solution

by:
CEHJ earned 500 total points
ID: 11914348
... and there are a couple of inaccuracies, so it should be:

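// Requires: import java.util.BitSet;
public static String stripNonIso88591Chars(String s) {
    StringBuffer sb = new StringBuffer(s);
    final int NUM_CHARS = (1 << 8) + 1;
    // Mark the printable ISO-8859-1 characters that should be kept
    BitSet bs = new BitSet(NUM_CHARS);
    for (int i = 0x20; i <= 0x7E; i++) {   // printable ASCII
        bs.set(i);
    }
    for (int i = 0xA1; i <= 0xFF; i++) {   // printable upper half of ISO-8859-1
        bs.set(i);
    }
    // Walk backwards so that deletions do not shift the indices still to be visited
    for (int i = sb.length() - 1; i >= 0; i--) {
        if (bs.get(sb.charAt(i)) == false) {
            sb.deleteCharAt(i);
        }
    }
    return sb.toString();
}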
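
For comparison, here is a minimal sketch of the same idea using the JDK's own charset encoder instead of a hand-built BitSet. The names Latin1Filter and keepEncodable are made up for this illustration, and it assumes Java 7 or later (for StandardCharsets). It is not an exact equivalent of the method above: the encoder accepts every code from 0x00 to 0xFF, so control characters and the no-break space (0xA0), which the BitSet version removes, are kept here.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the accepted answer.
final class Latin1Filter {

    // Keeps every char that ISO-8859-1 can encode (0x00-0xFF) and drops the rest.
    static String keepEncodable(String s) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (encoder.canEncode(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Curly double quotes (U+201C, U+201D) around text containing a quarter sign (U+00BC)
        System.out.println(keepEncodable("\u201Cquoted\u201D \u00BC cup"));
    }
}
```

If control characters must also go, the explicit ranges used in the method above remain the simpler choice.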
0

 
LVL 86

Expert Comment

by:CEHJ
ID: 11914414
So if you create a String of all possible 'characters' and call that method, you will be left with only ISO8859-1 characters:

StringBuffer sb = new StringBuffer(1 << 16);
for(int i = 0;i <= 0xFFFF;i++) {
      sb.append((char)i);
}
System.out.println(stripNonIso88591Chars(sb.toString()));
0
 
LVL 92

Expert Comment

by:objects
ID: 11918490
public String strip(String s)
{
   String buffer result = new StringBuffer();
   char[] chars = s.toCharArray();
   for (int i=0; i<chars.length; i++)
   {
      char c = chars[i];
      if (!Character.isISOControl())
      {
         result.append(c);
      }
   }
   return result.toString();
}
0
 
LVL 1

Expert Comment

by:talvio
ID: 11919578
Hello,

It seems that you have some solutions already, but thinking about what your actual problem is, I am tempted to encourage you to take a look at java.net.URLEncoder and java.net.URLDecoder. This is just a guess, but it might get you closer to your solution.

br,
-jT
0
 
LVL 92

Expert Comment

by:objects
ID: 11919647
oops, that should have actually been:

public String strip(String s)
{
   String buffer result = new StringBuffer();
   char[] chars = s.toCharArray();
   for (int i=0; i<chars.length; i++)
   {
      char c = chars[i];
      if (Character.isISOControl(c))
      {
         result.append(c);
      }
   }
   return result.toString();
}
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11920199
>>if (Character.isISOControl(c))

That will only strip control characters out
0
 

Author Comment

by:vbguy
ID: 11927985
Hi CEHJ
I was testing stripNonIso88591Chars.

It didn't seem to strip the following character:
¼. I thought this was a non-ISO character as well.

Let me know if I am wrong.

Thanks,
Arthur
0
 
LVL 92

Expert Comment

by:objects
ID: 11928153
Did you try the code I posted? (Ignore CEHJ's comments about it.)
0
 
LVL 92

Expert Comment

by:objects
ID: 11928206
just noticed another typo :)

public String strip(String s)
{
   StringBuffer result = new StringBuffer();
   char[] chars = s.toCharArray();
   for (int i=0; i<chars.length; i++)
   {
      char c = chars[i];
      if (Character.isISOControl(c))
      {
         result.append(c);
      }
   }
   return result.toString();
}
0
 
LVL 92

Expert Comment

by:objects
ID: 11928384
Specifically what char encoding are you using for your form?
0
 
LVL 92

Expert Comment

by:objects
ID: 11929277


I think I misinterpreted your requirements a little. What you need is more like this:

public static String strip(String s)
{
   StringBuffer result = new StringBuffer(s.length());
   char[] chars = s.toCharArray();
   for (int i=0; i<chars.length; i++)
   {
      char c = chars[i];
      if (c<=0xff && !Character.isISOControl(c))
      {
         result.append(c);
      }
   }
   return result.toString();
}
0
 
LVL 92

Expert Comment

by:objects
ID: 11929375
I don't think it's going to help you with your problem, though :(
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11930971
>>I thought this was a non-ISO character as well.

No, it's character 0xBC in ISO-8859-1.
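
One way to check this kind of thing yourself is to print a character's numeric code and see whether it fits in the single byte that ISO-8859-1 uses. The following is only a sketch; the class CharInspector and its method names are invented for this illustration.

```java
// Hypothetical helper for inspecting characters; the names are illustrative only.
final class CharInspector {

    // Formats a char as its Unicode code in the usual U+XXXX notation.
    static String describe(char c) {
        return String.format("U+%04X", (int) c);
    }

    // True if the code fits in the single byte that ISO-8859-1 uses per character.
    static boolean fitsInLatin1(char c) {
        return c <= 0xFF;
    }

    public static void main(String[] args) {
        // The quarter sign (code 0xBC), then Word's left curly double quote (code 0x201C)
        String sample = "\u00BC\u201C";
        for (int i = 0; i < sample.length(); i++) {
            char c = sample.charAt(i);
            System.out.println(describe(c) + " fits in ISO-8859-1: " + fitsInLatin1(c));
        }
    }
}
```

Note that fitting in the 0x00-0xFF range is a looser test than the method above applies, since that method also drops the control codes.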
0
 

Author Comment

by:vbguy
ID: 11934299
Hi CEHJ,

Thanks. One final question:

How did you identify what the actual character is?
For certain characters, I would like to replace the character with something else instead of deleting it.
So something like this:

  public static String stripNonIso88591Chars(String s) {
    StringBuffer sb = new StringBuffer(s);
    final int NUM_CHARS = (1 << 8) + 1;
    BitSet bs = new BitSet(NUM_CHARS);
    for(int i = 0x20;i <= 0x7E;i++) {
         bs.set(i);
    }
    for(int i = 0xA1;i <= 0xFF;i++) {
         bs.set(i);
    }
    for(int i = sb.length() - 1;i >= 0;i--) {
         if (bs.get(sb.charAt(i)) == false) {
              if (sb.charAt(i) == '“') {
                   // replace “ with " here
              } else {
                   sb.deleteCharAt(i);
              }
         }
    }
    return sb.toString();
  }

How would this be accomplished?

Thanks,
Arthur
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11934325
What would you like to replace it with?
0
 

Author Comment

by:vbguy
ID: 11934410
Hi,
It would be ASCII code 34, the standard double quote.
I am not sure what Word replaces it with, but it is not the standard one.

Thanks,
Arthur
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11934485
Change

>>
    for(int i = sb.length() - 1;i >= 0;i--) {
         if (bs.get(sb.charAt(i)) == false) {
              sb.deleteCharAt(i);
         }
    }
>>

to

    for(int i = sb.length() - 1;i >= 0;i--) {
         if (bs.get(sb.charAt(i)) == false) {
              sb.setCharAt(i, '\"');
         }
    }
0
 

Author Comment

by:vbguy
ID: 11934685
Hi CEHJ,

The only problem with that is that it would replace all non-ISO chars with ".
Is it possible to say: if it's a non-ISO char and that char is “ (the backwards quote that Word substitutes for the standard quote), replace it with "?

Thanks,
Arthur
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11934814
One would need to know what character code is being used there. If your nickname is correct - maybe you can tell us using some vb?
0
 

Author Comment

by:vbguy
ID: 11935442
This is what I was looking for:
String mytest = "“ test “";
StringBuffer sb = new StringBuffer(mytest);
System.out.println("this is my test " + sb);

for (int i = 0; i < sb.length(); i++) {
    int x_int = (int) sb.charAt(i);
    if (x_int == 8220) {
        sb.setCharAt(i, '\"');
    }
}

I figured out the 8220 by the following:

char letter = '“';
int x = (int) letter;
System.out.println("this is the special char " + x);

Does that make sense?


0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11935561
Sure does - it's the Unicode code for the "LEFT DOUBLE QUOTATION MARK"
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11935636
Presumably, you'll need to trap the right one too (8221)?
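
A minimal sketch of handling both quotes in one pass, using String.replace(char, char). The class and method names are invented for this illustration; 8220 and 8221 are U+201C and U+201D in hex.

```java
// Hypothetical helper; the names are illustrative only.
final class QuoteNormalizer {

    // Replaces the left (U+201C, decimal 8220) and right (U+201D, decimal 8221)
    // curly double quotes with the plain ASCII double quote (code 34).
    static String straightenDoubleQuotes(String s) {
        return s.replace('\u201C', '"').replace('\u201D', '"');
    }

    public static void main(String[] args) {
        System.out.println(straightenDoubleQuotes("\u201C test \u201D"));
    }
}
```

Running this before the stripping method means the quotes are converted rather than deleted.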
0
 
LVL 92

Expert Comment

by:objects
ID: 11936869
>No, it's character 0xBC in ISO-8859-1.

0xBC != 8220

> I figured out the 8220

Replacing that with something else is fine, but what if the user actually enters that ISO char? Your question also implies you need to deal with all non-ISO characters, so how would you decide which others to ignore?

> char letter = '“';

You also should be testing the actual value returned from the browser.


If that code you posted solves your problem then feel free to close this question :)
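
To illustrate why the value that actually arrives from the browser matters: the same bytes turn into different characters depending on the charset used to decode them. The sketch below assumes, purely for illustration, that the form data arrives as windows-1252 bytes, where Word's curly double quotes are 0x93 and 0x94; that is a guess about the setup, not something established in this thread. The class and method names are invented, and Charset.forName throws UnsupportedCharsetException if the JVM does not provide the named charset.

```java
import java.nio.charset.Charset;

// Illustrative only: the class and method names are not from the thread.
final class DecodeDemo {

    // Decodes raw bytes using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Assumed input: curly quotes as windows-1252 bytes 0x93 and 0x94 around "test"
        byte[] fromBrowser = { (byte) 0x93, 't', 'e', 's', 't', (byte) 0x94 };

        for (String name : new String[] { "windows-1252", "ISO-8859-1" }) {
            String text = decode(fromBrowser, name);
            System.out.println(name + ": first char is U+"
                    + Integer.toHexString(text.charAt(0)).toUpperCase());
        }
    }
}
```

Decoded as windows-1252, the first byte becomes the left curly quote (8220); decoded as ISO-8859-1, it becomes the control code 0x93, so a test against 8220 would never match.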
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11936918
>>0xBC != 8220

Please read the comments more carefully
0
 
LVL 92

Expert Comment

by:objects
ID: 11936944
I've read it very carefully, have you ;)
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11936973
You seem to be getting confused - why else would you be relating two unrelated things?
0
 
LVL 92

Expert Comment

by:objects
ID: 11937009
If so then explain what you are referring to?
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11937024
Precisely what do you mean to say by

>>0xBC != 8220

?
0
 
LVL 92

Expert Comment

by:objects
ID: 11937130
I'm still waiting for your clarification so I can answer your question ......
guess I'm not going to get it :-D
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11937271
>>I'm still waiting for your clarification so I can answer your question ......

What clarification? The statement

>>0xBC != 8220

makes no sense whatever, which is why i assume you're confused
0
 
LVL 92

Expert Comment

by:objects
ID: 11937325
> No, it's character 0xBC in ISO-8859-1.

that statement.
Please follow the discussion so as not to waste people's time.
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11937342
>>that statement.

Well we've moved on a long time ago since that statement. What's difficult to understand about
 
¼ == 0xBC in ISO8859-1

?
0
 
LVL 92

Expert Comment

by:objects
ID: 11937437
The following should provide what you need, and allows you to specify any characters that need replacing:


// include any characters here you would like replaced
private static String replace = "\u00bc\u00bd\u00be";  

public static String strip(String s)
{
   StringBuffer result = new StringBuffer(s.length());
   char[] chars = s.toCharArray();
   for (int i=0; i<chars.length; i++)
   {
      char c = chars[i];
      if (c<=0xff && !Character.isISOControl(c))
      {
         if (replace.indexOf(c)==-1)
         {
              result.append(c);
         }
         else
         {
             result.append("\"");
         }
      }
   }
   return result.toString();
}
0
 
LVL 92

Expert Comment

by:objects
ID: 11938154
you can see the defined characters at:

http://en.wikipedia.org/wiki/ISO_8859-1

You just need to decide which ones you want to strip and which to replace, and tweak the above code accordingly.

Let me know if you have any questions.
0
 

Author Comment

by:vbguy
ID: 11947184
Hi objects,

Can you explain:

sb.charAt(i) <=0xff

and how do you know what "\u00bc\u00bd\u00be" represents in terms of characters?

Thanks,
Arthur
0
 
LVL 92

Expert Comment

by:objects
ID: 11947244
> sb.charAt(i) <=0xff

checking if char value is less than or equal to 0xff.

> and how do you know what "\u00bc\u00bd\u00be" represents in terms of characters?

They are the fraction characters ¼, ½ and ¾ (one quarter, one half and three quarters).
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11947365
>>checking if char value is less than or equal to 0xff.

Why would you do that?
0
 

Author Comment

by:vbguy
ID: 11947612
How did you know \u00bc = 1/4? Is there a chart you can point me to? The link you gave didn't have that kind of representation.

Thanks.
0
 
LVL 92

Expert Comment

by:objects
ID: 11947643
> The link you gave didn't have that kind of representation.

yes it does.
eg. to find 0xBC look up Bx on the left, and xC on the top.
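
The chart lookup described here amounts to splitting the code into its two hex digits. A small sketch of that arithmetic follows; the class Latin1Chart and its methods are invented for this illustration and only handle codes up to 0xFF.

```java
// Hypothetical helper that mirrors the row/column layout of an ISO-8859-1 chart.
final class Latin1Chart {

    // Row label: the high hex digit followed by 'x', e.g. "Bx" for 0xBC.
    static String row(char c) {
        requireLatin1(c);
        return String.format("%Xx", c >> 4);
    }

    // Column label: 'x' followed by the low hex digit, e.g. "xC" for 0xBC.
    static String column(char c) {
        requireLatin1(c);
        return String.format("x%X", c & 0xF);
    }

    private static void requireLatin1(char c) {
        if (c > 0xFF) {
            throw new IllegalArgumentException("not on the ISO-8859-1 chart: " + (int) c);
        }
    }

    public static void main(String[] args) {
        char quarter = '\u00BC';
        System.out.println("row " + row(quarter) + ", column " + column(quarter));
    }
}
```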
0
 
LVL 92

Expert Comment

by:objects
ID: 11947648
I missed it the first time I looked at it too :)
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11947667
http://www.proteanit.net/misc/iso8859-1.htm

But i'm a little puzzled vbguy. Why would you want to get rid of (if you do) 0xBC, since it *is* one of the ISO8859-1 chars?
0
 
LVL 92

Expert Comment

by:objects
ID: 12001072
If you do some testing you may find the code I posted is a little more efficient than that code.
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 12001209
>>the code I posted is a little more efficient than that code

The only trouble is it won't work.

vbguy, if you want the best of both worlds and want to eliminate non-iso chars and do replacements at the same time you can do:


// Requires: import java.util.BitSet;
public static String stripNonIso88591Chars2(String s) {
    // Replace left and right double quotes with the normal one.
    // The replacement is written with the \" escape: the Unicode escape for
    // code 0x22 is translated before compilation and would end the string literal.
    String targets = "\u201C\u201D";
    String replacements = "\"\"";
    StringBuffer sb = new StringBuffer(s);
    final int NUM_CHARS = (1 << 8);
    BitSet bs = new BitSet(NUM_CHARS);
    for (int i = 0x20; i <= 0x7E; i++) {
        bs.set(i);
    }
    for (int i = 0xA1; i <= 0xFF; i++) {
        bs.set(i);
    }
    for (int i = sb.length() - 1; i >= 0; i--) {
        char c = sb.charAt(i);
        int ixFound = targets.indexOf(c);
        if (ixFound > -1) {
            // A target character: swap in the replacement at the same position
            sb.setCharAt(i, replacements.charAt(ixFound));
        }
        else if (bs.get(c) == false) {
            sb.deleteCharAt(i);
        }
    }
    return sb.toString();
}
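
If the list of replacements grows, a lookup table may be easier to maintain than two parallel strings. The following is a sketch of that variation, not the method above: the class WordTextCleaner and its members are invented, it assumes Java 7 or later, and the single-quote mappings (U+2018, U+2019) are an added assumption about what else Word might insert.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical table-driven variant; names and the extra mappings are assumptions.
final class WordTextCleaner {

    private static final Map<Character, Character> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('\u201C', '"');   // left double quote (from the thread)
        REPLACEMENTS.put('\u201D', '"');   // right double quote (from the thread)
        REPLACEMENTS.put('\u2018', '\'');  // left single quote (assumed extra)
        REPLACEMENTS.put('\u2019', '\'');  // right single quote (assumed extra)
    }

    // Same ranges as the BitSet in the thread: 0x20-0x7E and 0xA1-0xFF.
    static boolean isPrintableLatin1(char c) {
        return (c >= 0x20 && c <= 0x7E) || (c >= 0xA1 && c <= 0xFF);
    }

    // Replaces mapped characters, keeps printable ISO-8859-1, drops everything else.
    static String clean(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            Character replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                sb.append(replacement.charValue());
            } else if (isPrintableLatin1(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Building a new string instead of editing in place also avoids the repeated shifting that deleteCharAt causes on long inputs.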
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 12001226
:-)
0
