Solved

parsing out non-iso characters in java

Posted on 2004-08-27
Last Modified: 2010-03-31
Hi,

I need to find a function that will parse out all non-iso characters from a string.

The situation is that people cut and paste from a Word document into an HTML form field.
The pasted string contains non-ISO characters.

Is there a way to do this?

Thanks
Question by:vbguy
46 Comments
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
Try something like:

public String stripNonIsoChars(String s) {
      StringBuffer sb = new StringBuffer(s);
      final int NUM_CHARS = 1 << 8;
      BitSet bs = new BitSet(NUM_CHARS);
      for(int i = 0x20;i < 0x7E;i++) {
            bs.set(i);
      }
      for(int i = 0xA1;i < 0xFF;i++) {
            bs.set(i);
      }
      for(int i = sb.length() - 1;i >= 0;i--) {
            if (bs.get(sb.charAt(i)) == false) {
                  sb.deleteCharAt(i);
            }
      }
      return sb.toString();
}
0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
That method should strictly be called 'stripNonIso88591Chars'
0
 
LVL 86

Accepted Solution

by:
CEHJ earned 500 total points
Comment Utility
... and there are a couple of inaccuracies, so it should be:

// requires: import java.util.BitSet;
public static String stripNonIso88591Chars(String s) {
    StringBuffer sb = new StringBuffer(s);
    final int NUM_CHARS = (1 << 8) + 1;
    // Mark the printable ISO-8859-1 ranges as allowed
    BitSet bs = new BitSet(NUM_CHARS);
    for (int i = 0x20; i <= 0x7E; i++) {
        bs.set(i);
    }
    for (int i = 0xA1; i <= 0xFF; i++) {
        bs.set(i);
    }
    // Walk backwards so deletions do not shift the indices still to be visited
    for (int i = sb.length() - 1; i >= 0; i--) {
        if (bs.get(sb.charAt(i)) == false) {
            sb.deleteCharAt(i);
        }
    }
    return sb.toString();
}
0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
So if you create a String of all possible 'characters' and call that method, you will be left with only ISO8859-1 characters:

StringBuffer sb = new StringBuffer(1 << 16);
for(int i = 0;i <= 0xFFFF;i++) {
      sb.append((char)i);
}
System.out.println(stripNonIso88591Chars(sb.toString()));
0
 
LVL 92

Expert Comment

by:objects
Comment Utility
public String strip(String s)
{
   String buffer result = new StringBuffer();
   char[] chars = s.toCharArray();
   for (int i=0; i<chars.length; i++)
   {
      char c = chars[i];
      if (!Character.isISOControl())
      {
         result.append(c);
      }
   }
   return result.toString();
}
0
 
LVL 1

Expert Comment

by:talvio
Comment Utility
Hello,

It seems that you have some solutions already, but thinking about what your actual problem is, I am tempted to encourage you to take a look at java.net.URLEncoder and java.net.URLDecoder. This is just a guess, but it might get you closer to your solution.

br,
-jT
0
 
LVL 92

Expert Comment

by:objects
Comment Utility
oops, that should have actually been:

public String strip(String s)
{
   String buffer result = new StringBuffer();
   char[] chars = s.toCharArray();
   for (int i=0; i<chars.length; i++)
   {
      char c = chars[i];
      if (Character.isISOControl(c))
      {
         result.append(c);
      }
   }
   return result.toString();
}
0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
>>if (Character.isISOControl(c))

That will only strip control characters out
0
 

Author Comment

by:vbguy
Comment Utility
Hi CEHJ
I was testing stripNonIso88591Chars.

It didn't seem to strip the following character:
¼. I thought this was a non-ISO character as well.

Let me know if I am wrong.

Thanks,
Arthur
0
 
LVL 92

Expert Comment

by:objects
Comment Utility
Did you try the code I posted? (Ignore CEHJ's comments about it.)
0
 
LVL 92

Expert Comment

by:objects
Comment Utility
just noticed another typo :)

public String strip(String s)
{
   StringBuffer result = new StringBuffer();
   char[] chars = s.toCharArray();
   for (int i=0; i<chars.length; i++)
   {
      char c = chars[i];
      if (Character.isISOControl(c))
      {
         result.append(c);
      }
   }
   return result.toString();
}
0
 
LVL 92

Expert Comment

by:objects
Comment Utility
Specifically what char encoding are you using for your form?
0
 
LVL 92

Expert Comment

by:objects
Comment Utility


I think I misinterpreted your requirements a little. What you need is more like this:

public static String strip(String s)
{
   StringBuffer result = new StringBuffer(s.length());
   char[] chars = s.toCharArray();
   for (int i=0; i<chars.length; i++)
   {
      char c = chars[i];
      if (c<=0xff && !Character.isISOControl(c))
      {
         result.append(c);
      }
   }
   return result.toString();
}
0
 
LVL 92

Expert Comment

by:objects
Comment Utility
I don't think it's going to help you with your problem though :(
0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
>>I thought this was a non-ISO character as well.

No, it's character 0xBC in ISO-8859-1.
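
A quick way to check this kind of thing without a chart is to ask the charset itself. This is only a minimal sketch, assuming a JDK with java.nio.charset.StandardCharsets (Java 7 or later, so newer than this thread); the class and method names are made up for illustration.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class Latin1Check {
    // Hypothetical helper: true if c can be encoded in ISO-8859-1.
    static boolean isLatin1(char c) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(c);
    }

    public static void main(String[] args) {
        // Vulgar fraction one quarter, code 0xBC
        System.out.println(isLatin1('\u00BC'));
        // Left double quotation mark, code 0x201C
        System.out.println(isLatin1('\u201C'));
    }
}
```

Note that the encoder also accepts the control codes below 0x20 and in 0x7F to 0xA0, which the BitSet approach above deliberately strips, so the two checks are not interchangeable.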
0
 

Author Comment

by:vbguy
Comment Utility
Hi CEHJ,

Thanks. One final question:

How did you identify what the character actually is?
For certain characters, I would like to replace them with something instead of deleting them.
So something like this:

  public static String stripNonIso88591Chars(String s) {
    StringBuffer sb = new StringBuffer(s);
    final int NUM_CHARS = (1 << 8) + 1;
    BitSet bs = new BitSet(NUM_CHARS);
    for (int i = 0x20; i <= 0x7E; i++) {
      bs.set(i);
    }
    for (int i = 0xA1; i <= 0xFF; i++) {
      bs.set(i);
    }
    for (int i = sb.length() - 1; i >= 0; i--) {
      if (bs.get(sb.charAt(i)) == false) {
        if (sb.charAt(i) == '“') {
          // replace “ with "
        } else {
          sb.deleteCharAt(i);
        }
      }
    }
    return sb.toString();
  }

How could this be accomplished?

Thanks,
Arthur
0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
What would you like to replace it with?
0
 

Author Comment

by:vbguy
Comment Utility
Hi,
It would be ASCII code 34, the standard double quote.
I am not sure what Word replaces it with, but it is not the standard one.

Thanks,
Arthur
0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
Change

>>
    for(int i = sb.length() - 1;i >= 0;i--) {
         if (bs.get(sb.charAt(i)) == false) {
              sb.deleteCharAt(i);
         }
    }
>>

to

    for(int i = sb.length() - 1;i >= 0;i--) {
         if (bs.get(sb.charAt(i)) == false) {
              sb.setCharAt(i, '\"');
         }
    }
0
 

Author Comment

by:vbguy
Comment Utility
Hi CEHJ,

The only problem with that is it would replace all non-ISO chars with ".
Is it possible to say: if it's a non-ISO char and that char is “ (the curly quote that Word substitutes for the standard quote), replace it with "?

Thanks,
Arthur
0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
One would need to know what character code is being used there. If your nickname is correct - maybe you can tell us using some vb?
0
 

Author Comment

by:vbguy
Comment Utility
This is what I was looking for:
String mytest = "“ test “";
StringBuffer sb = new StringBuffer(mytest);
System.out.println("this is my test " + sb);

for (int i = 0; i < sb.length(); i++) {
    int x_int = (int) sb.charAt(i);
    if (x_int == 8220) {
        sb.setCharAt(i, '\"');
    }
}

I figured out the 8220 by the following:

char letter = '“';
int x = (int) letter;
System.out.println("this is the special char " + x);

Does that make sense?


0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
Sure does - 8220 (0x201C) is the Unicode code point for "LEFT DOUBLE QUOTATION MARK"
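
On newer JDKs the same lookup can be done in code rather than by hand. A minimal sketch, assuming Java 7 or later for Character.getName; the class and the describe helper are hypothetical names for illustration.

```java
class CharInfo {
    // Hypothetical helper: formats a char as "U+XXXX NAME".
    static String describe(char c) {
        return String.format("U+%04X %s", (int) c, Character.getName(c));
    }

    public static void main(String[] args) {
        // The two curly double quotes that Word typically inserts
        String sample = "\u201C test \u201D";
        for (int i = 0; i < sample.length(); i++) {
            System.out.println(describe(sample.charAt(i)));
        }
    }
}
```

Running something like this over the real form input, rather than a literal typed into the source file, shows exactly which code points need handling.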
0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
Presumably, you'll need to trap the right one too (8221)?
0
 
LVL 92

Expert Comment

by:objects
Comment Utility
>No, it's character 0xBC in ISO-8859-1.

0xBC != 8220

> I figured out the 8220

replacing that with something else is fine, but what if the user actually enters that iso char? And your question implies you need to deal with all non iso characters so how would you decide which others to ignore.

> char letter = '“';

You also should be testing the actual value returned from the browser.


If that code you posted solves your problem then feel free to close this question :)
0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
>>0xBC != 8220

Please read the comments more carefully
0
 
LVL 92

Expert Comment

by:objects
Comment Utility
I've read it very carefully, have you ;)
0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
You seem to be getting confused - why else would you be relating two unrelated things?
0
 
LVL 92

Expert Comment

by:objects
Comment Utility
If so then explain what you are referring to?
0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
Precisely what do you mean to say by

>>0xBC != 8220

?
0
 
LVL 92

Expert Comment

by:objects
Comment Utility
I'm still waiting for your clarification so I can answer your question ......
guess I'm not going to get it :-D
0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
>>I'm still waiting for your clarification so I can answer your question ......

What clarification? The statement

>>0xBC != 8220

makes no sense whatever, which is why i assume you're confused
0
 
LVL 92

Expert Comment

by:objects
Comment Utility
> No, it's character 0xBC in ISO-8859-1.

that statement.
Please follow the discussion so as not to waste people's time.
0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
>>that statement.

Well, we moved on from that statement a long time ago. What's difficult to understand about
 
¼ == 0xBC in ISO8859-1

?
0
 
LVL 92

Expert Comment

by:objects
Comment Utility
following should provide what you need, and allow you to specify any characters that need replacing


// include any characters here you would like replaced
private static String replace = "\u00bc\u00bd\u00be";  

public static String strip(String s)
{
   StringBuffer result = new StringBuffer(s.length());
   char[] chars = s.toCharArray();
   for (int i=0; i<chars.length; i++)
   {
      char c = chars[i];
      if (c<=0xff && !Character.isISOControl(c))
      {
         if (replace.indexOf(c)==-1)
         {
              result.append(c);
         }
         else
         {
             result.append("\"");
         }
      }
   }
   return result.toString();
}
0
 
LVL 92

Expert Comment

by:objects
Comment Utility
you can see the defined characters at:

http://en.wikipedia.org/wiki/ISO_8859-1

You just need to decide which ones you want to strip and which to replace, and tweak the above code accordingly.

Let me know if you have any questions.
0
 

Author Comment

by:vbguy
Comment Utility
Hi objects,

Can you explain:

sb.charAt(i) <=0xff

and how do you know what "\u00bc\u00bd\u00be" represents in terms of characters?

Thanks,
Arthur
0
 
LVL 92

Expert Comment

by:objects
Comment Utility
> sb.charAt(i) <=0xff

checking if char value is less than or equal to 0xff.

> and how do you know what "\u00bc\u00bd\u00be" represents in terms of characters?

They are ¼, ½ and ¾ (codes 0xBC, 0xBD and 0xBE).
0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
>>checking if char value is less than or equal to 0xff.

Why would you do that?
0
 

Author Comment

by:vbguy
Comment Utility
How did you know \u00bc = 1/4? Is there a chart you can point me to? The link you gave didn't have that kind of representation.

Thanks.
0
 
LVL 92

Expert Comment

by:objects
Comment Utility
> The link you gave didnt have that kind of representation.

yes it does.
eg. to find 0xBC look up Bx on the left, and xC on the top.
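
The row and column are just the two hex digits of the code. A small sketch of that lookup, only meaningful for codes up to 0xFF; the class and method names are invented for illustration.

```java
class ChartLookup {
    // Hypothetical helper: returns the chart's row and column labels,
    // e.g. "Bx / xC" for code 0xBC. Intended for codes 0x00 to 0xFF only.
    static String chartCell(char c) {
        String row = Integer.toHexString(c >> 4).toUpperCase() + "x"; // high hex digit
        String col = "x" + Integer.toHexString(c & 0xF).toUpperCase(); // low hex digit
        return row + " / " + col;
    }

    public static void main(String[] args) {
        System.out.println(chartCell('\u00BC'));
    }
}
```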
0
 
LVL 92

Expert Comment

by:objects
Comment Utility
I missed it the first time I looked at it too :)
0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
http://www.proteanit.net/misc/iso8859-1.htm

But i'm a little puzzled vbguy. Why would you want to get rid of (if you do) 0xBC, since it *is* one of the ISO8859-1 chars?
0
 
LVL 92

Expert Comment

by:objects
Comment Utility
If you do some testing you may find the code I posted is a little more efficient than that code.
0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
>>the code I posted is a little more efficient than that code

The only trouble is it won't work.

vbguy, if you want the best of both worlds and want to eliminate non-iso chars and do replacements at the same time you can do:


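// requires: import java.util.BitSet;
public static String stripNonIso88591Chars2(String s) {
    // Replace left and right double quotes with the normal one.
    // targets.charAt(n) is replaced by replacements.charAt(n).
    String targets = "\u201C\u201D";
    // Use an escaped quote here: a Unicode escape for the quote (u0022) is
    // translated before the string literal is parsed and would not compile.
    String replacements = "\"\"";
    StringBuffer sb = new StringBuffer(s);
    final int NUM_CHARS = (1 << 8);
    // Mark the printable ISO-8859-1 ranges as allowed
    BitSet bs = new BitSet(NUM_CHARS);
    for (int i = 0x20; i <= 0x7E; i++) {
        bs.set(i);
    }
    for (int i = 0xA1; i <= 0xFF; i++) {
        bs.set(i);
    }
    // Walk backwards so deletions do not shift the indices still to be visited
    for (int i = sb.length() - 1; i >= 0; i--) {
        char c = sb.charAt(i);
        int ixFound = targets.indexOf(c);
        if (ixFound > -1) {
            sb.setCharAt(i, replacements.charAt(ixFound));
        }
        else if (bs.get(c) == false) {
            sb.deleteCharAt(i);
        }
    }
    return sb.toString();
}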
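
If some replacements need more than one character (for example an ellipsis becoming three dots), a lookup table may be easier to maintain than two parallel strings. The following is only a sketch of that variation, not the accepted solution: the class name, the helper names and the extra entries beyond the two double quotes are assumptions about what Word typically pastes. It keeps the same printable ranges as the code above.

```java
import java.util.HashMap;
import java.util.Map;

class WordCleaner {
    // Assumed replacement table; extend or trim it to match your data.
    private static final Map<Character, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('\u201C', "\"");  // left double quotation mark
        REPLACEMENTS.put('\u201D', "\"");  // right double quotation mark
        REPLACEMENTS.put('\u2018', "'");   // left single quotation mark
        REPLACEMENTS.put('\u2019', "'");   // right single quotation mark
        REPLACEMENTS.put('\u2013', "-");   // en dash
        REPLACEMENTS.put('\u2026', "..."); // horizontal ellipsis
    }

    // Same printable ranges as the BitSet version: 0x20-0x7E and 0xA1-0xFF.
    static boolean isPrintableLatin1(char c) {
        return (c >= 0x20 && c <= 0x7E) || (c >= 0xA1 && c <= 0xFF);
    }

    static String clean(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else if (isPrintableLatin1(c)) {
                out.append(c);
            }
            // Anything else is dropped.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("\u201Chi\u201D \u2013 \u00BC\u20AC"));
    }
}
```

Building a new string forwards also avoids the repeated deleteCharAt calls, each of which shifts the rest of the buffer.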
0
 
LVL 86

Expert Comment

by:CEHJ
Comment Utility
:-)
0
