Java_Problem

International Character Restriction in JSP page

Hello ALL,

Is there a way to stop a user from entering French characters in a textbox? I want to avoid sending any French characters to the server, and instead show a message saying that the character set is not supported.

BTW - I am not very clear on character sets as such. I know one needs fonts to have the capability to enter french characters - Am i correct  or ??

Please advise with some example

Thanks a lot
suprapto45

As far as I know, if you do NOT allow French characters, you will also NOT allow any other foreign languages (Unicode).

You can either:
1. Manually check that every character in the textbox is in a-z
2. Check whether it is Unicode or not (not sure whether this is possible)

David
mrcoffee365
There's an "accept-charset" attribute for forms, but in my testing it had no effect on what Mozilla or Firefox sent to the server, or on what the server accepted.  This was true even if the charset for the page was set to us-ascii.  Those browsers are honoring what the user has as the set of valid charsets, rather than restricting based on the charset given for the page.

IE, though, turned all of the non us-ascii characters into question marks (?).  So this text in a textarea input box:

Le peuple français proclame solennellement son attachement aux Droits de l'homme et aux principes de la souveraineté nationale tels qu'ils ont été définis par la Déclaration de 1789.

became this text at the server:

Le peuple fran?ais proclame solennellement son attachement aux
Droits de l'homme et aux principes de la souverainet? nationale
tels qu'ils ont ?t? d?finis par la D?claration de 1789.
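
The question marks are what a lossy conversion to US-ASCII produces. Here is a minimal Java sketch of the same effect. The class name AsciiLossDemo is made up for illustration, and this only mimics the result seen above; it is not what IE actually runs.

```java
import java.nio.charset.StandardCharsets;

final class AsciiLossDemo {
    // Encode to US-ASCII and decode again. String.getBytes(Charset) substitutes
    // the charset's replacement byte ('?') for every character it cannot map.
    static String toAsciiLossy(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u00e7 is c cedilla; escaped so the source file stays pure ASCII
        System.out.println(toAsciiLossy("fran\u00e7ais")); // prints fran?ais
    }
}
```

Once the conversion has happened the original character is gone, so the server cannot tell a replaced character from a question mark the user really typed.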

>> I know one needs fonts to have the capability to enter french characters - Am i correct  or ??
No, you can enter the HTML entity for the international character you want.  For example, still in IE with the form and page charsets set to us-ascii, this was entered in the form textarea:

fran&#231;ais

and this was received at the server:

fran&#231;ais

Notice the lack of question marks.  So apparently IE applies the charset restriction only if the letters are typed as actual characters, not as HTML entities.

Which is all a long-winded way of saying that suprapto45 is right, you'll have to read the characters on the server, and reject the ones you don't want.

The form content is URL-encoded, so fran&#231;ais arrives as fran%26%23231%3Bais .  You could read the encoded input text byte by byte and check for the occurrence of the %26 encoding (&) followed by %23 (#), then read the number up to the %3B (;) and check whether it is on your proscribed character list.  In this case, that would be 231, which is the Unicode code point for c cedilla (ç).
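
A rough Java sketch of that check, as a variation that URL-decodes the text first and then looks for decimal entities with a regular expression rather than scanning the %-escapes byte by byte. The class and method names are invented for this example, and hexadecimal entities such as &#xE7; are deliberately not handled.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityScan {
    // Decimal numeric character references such as &#231;
    private static final Pattern NUMERIC_REF = Pattern.compile("&#(\\d{1,7});");

    // Returns the code points of all decimal entities above the ASCII range.
    static List<Integer> nonAsciiReferences(String urlEncoded) {
        String decoded;
        try {
            decoded = URLDecoder.decode(urlEncoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
        List<Integer> found = new ArrayList<Integer>();
        Matcher m = NUMERIC_REF.matcher(decoded);
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            if (codePoint > 127) {
                found.add(codePoint);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(nonAsciiReferences("fran%26%23231%3Bais")); // prints [231]
    }
}
```

An empty list means no proscribed entity was found; anything else can be turned into the rejection message.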

If you want to catch the user before the text is posted to the server, then you'll have to do this in Javascript.  Or perhaps you could use Ajax to send the text to the server, get a rejection message, and display it without the browser redisplaying the form's page.
Java_Problem

ASKER

So what I understood that there is

1. Not much that I can do as most of the times it's browser that is ignoring any encoding whatsoever is set from the page

2. Option of Ajax is open (For which I have no idea at all)

BTW: When do you need fonts then ??

Is there any way (easier or simpler to solve this)
>>1. Not much that I can do as most of the times it's browser that is ignoring any encoding whatsoever is set from the page

You can't be very effective from the server because the browser is going to mostly ignore your settings.  In fact, what you really want is for the browser to have a popup that says "illegal character" right?  Or, maybe to prevent the user from typing an illegal character in a text box or text area?  There isn't a way, that I know of, from the server (i.e., in the page HTML).

With fancy Javascript, you could probably do this.  Put Javascript handlers on every text field that read each character typed and either refuse to put it in the input box or show an alert popup.

>>2. Option of Ajax is open (For which I have no idea at all)
Ajax is just a way to have Javascript send the input text to the server, so you could write code on the server to check the input text before the user hits the Submit button.  Still requires a Javascript control on every text input box.

>>BTW: When do you need fonts then ??
I already answered this:  The user can use fonts, or not.  Either way, they can enter international characters in the input text boxes.

The fonts are declared in the browser and exist on the user's OS.  Did you have a different question here?

>>Is there any way (easier or simpler to solve this)
If we had known of one, we would have offered it.  Maybe another expert knows of one.


>>Is there any way (easier or simpler to solve this)
Did you mean on the server side?  You could use the Apache Commons Lang CharUtils.isAscii method (assuming that ASCII is the only charset you want to support).  You could try this nifty jar, a Java port of Mozilla's charset detector, for detecting the charset of a string (HTML pages in their case, but the same algorithm would work on untagged text):
http://jchardet.sourceforge.net/
You could do Java String pattern matching with the list of acceptable ASCII characters.
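
A small sketch of the pattern-matching route using only the JDK. AsciiValidator and its methods are names invented for this example, and the allowed range (printable ASCII, space through tilde) is an assumption you would adjust to your own list.

```java
import java.util.regex.Pattern;

final class AsciiValidator {
    // Printable ASCII only: space (0x20) through tilde (0x7E)
    private static final Pattern PRINTABLE_ASCII = Pattern.compile("[\\x20-\\x7E]*");

    // True when every character of the input is printable ASCII.
    static boolean isPrintableAscii(String input) {
        return input != null && PRINTABLE_ASCII.matcher(input).matches();
    }

    // Index of the first character above 0x7F, or -1 if there is none.
    // Useful for telling the user which character was rejected.
    static int firstNonAscii(String input) {
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) > 0x7F) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(isPrintableAscii("Declaration de 1789")); // prints true
        System.out.println(isPrintableAscii("fran\u00e7ais"));       // prints false
        System.out.println(firstNonAscii("fran\u00e7ais"));          // prints 4
    }
}
```

Note that this pattern also rejects tabs and line breaks, so for a textarea you would probably add \r and \n to the character class.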

For client-side checking, the above answers stand.
Well, I guess I am still circling around.

I can't understand how people type international characters from their keyboard. I understood your explanation of fran&#231;ais and français, but the thought that someone can see that little something under the c on their keyboard disturbs me. How can someone key that in from their keyboard? I don't have that on mine.

It will be phenomenally difficult for Canadians to type in fran&#231;ais or to remember all the combinations.

That brings me to last part of your answer -

"The fonts are declared in the browser and exist on the user's OS"

Where exactly? Are all the fonts there? Please walk me through it so that I can enter français with that ç.

I am still far from understanding international character set. Your help is very valuable - Thanks for all that, please do clarify on my confusion, Great Help
SOLUTION
mrcoffee365
You're looking at the problem the wrong way: your solution is not to prevent people from entering what you perceive as unusual characters, but rather that you work out how to accept absolutely everything the users throw at you.

If you restrict character input, you are creating a 1970s application, simple as that. You can, of course, restrict data to ASCII, or even vowels if you wish, and it's easily done using JavaScript, but it's something you really do not need to do. If the user can enter the characters, then accept the data. You should NEVER make the machine override what the human has decided to do.

As for the characters supported by the fonts: all (actually most) Windows fonts contain the characters in Codepage 1252 (often called Windows ANSI even though it's not ANSI-compatible) which covers all the languages in western Europe, including French. Of course, you can't assume that a US PC will have the necessary fonts for Russia and Thailand and Japan, but you can reasonably assume that if you receive Japanese data from a US PC, that this is what the user intended to provide.

For many years, the Japanese and Chinese folk had been complaining about the lack of ideographs for their names. Now that Unicode has added the extended ideographs off the basic multilingual plane, you're talking about taking a giant step backwards and preventing 90% of the population of the planet from typing what they perceive is normal text. Making the assumption that you know better than the user is simply bad design.

So, the solution is simple: a String is a String is a String, irrespective of what is in it. Tag all your pages as UTF-8 and you can forget about what's in the strings. The reality is that if the user puts in garbage, then that's what you give back: your job isn't to filter the content.
With all due respect, I totally agree with your point. I may have incorrect design or architecture in the first place itself.

Now, I got confused and need some help/directions in your comments -

"So, the solution is simple: a String is a String is a String, irrespective of what is in it. Tag all your pages as UTF-8 and you can forget about what's in the strings"

Do you suggest putting UTF-8 in all JSP pages, and that this will solve the problem? Is this correct? Please elaborate on this with a small example.

Thanks again


OK, let's look at the idea of a String. If you look at this page, you'll see something like:
      <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

Now, 8859-1 is fine for western Europe (and the Americas), although not much good if you want to write something in Polish, or Russian, or Japanese, or Traditional or Simplified Chinese, or ...

There is a universal character set called Unicode which has a number of incarnations, but we'll focus on UTF-8 (UTF=Unicode Transformation Format) which has the really nice feature that any ASCII data is exactly the same as the UTF-8 equivalent. By the way, I want to stress that, irrespective of what you have read elsewhere, there is NO SUCH THING as 8-bit ASCII: ASCII, USASCII, ISO-646-IRV are all the same thing, and encompass the 7-bit characters A-Z, a-z, 0-9 and a few punctuation and control characters. I won't bother going into how to convert other characters into UTF-8, but you should understand that each of these characters takes two to four bytes.
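
To make the byte counts concrete, here is a short Java sketch (Utf8Lengths is just a name picked for this example). It shows that ASCII text has identical bytes in both encodings, while other characters take two or three bytes in UTF-8, and characters outside the Basic Multilingual Plane take four.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Lengths {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when the ASCII and UTF-8 encodings of the string are byte-identical.
    static boolean sameAsAscii(String s) {
        return Arrays.equals(s.getBytes(StandardCharsets.US_ASCII),
                             s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(sameAsAscii("Hello"));       // prints true
        System.out.println(utf8Length("A"));            // prints 1
        System.out.println(utf8Length("\u00e7"));       // prints 2 (c cedilla)
        System.out.println(utf8Length("\u20ac"));       // prints 3 (euro sign)
        System.out.println(utf8Length("\uD840\uDC00")); // prints 4 (U+20000, an extended ideograph)
    }
}
```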

Anyway, this isn't really the issue. When you create your pages, set the meta tag to UTF-8:
     <meta http-equiv="content-type" content="text/html; charset=utf-8">

You can now simply allow the user to input ANYTHING in the text input fields (obviously, you could have special restrictions on numeric or date fields, etc.). This means that even if you have English pages, the user can input a name with accents, or in a non-English alphabet. Of course, your servlet has to behave correctly. Since Java strings are Unicode, held in another format known as UTF-16, the incoming bytes have to be converted to this. This means that you'll have to determine the character encoding of your request BEFORE referencing the parameters. This is done by doing something similar to:
      // Needs javax.servlet.http.HttpServletRequest and HttpServletResponse
      public void doGet(HttpServletRequest req, HttpServletResponse rsp) {
            // First, see if the request itself declares an encoding
            String enc = req.getCharacterEncoding();

            // If the request has no encoding info, search everywhere else
            if (null == enc) {
                  // Look for a charset in the Content-Type header
                  String cType = req.getHeader("Content-Type");
                  if (null != cType) {
                        int iX = cType.indexOf("charset=");
                        if (iX >= 0 && cType.length() > iX + 8) {
                              enc = cType.substring(iX + 8).trim();
                        }
                  }
            }

            // Still none: fall back to ISO-8859-1, the servlet default
            if (null == enc) {
                  enc = "ISO-8859-1";
            }
            try {
                  // Must run before the first getParameter() call
                  req.setCharacterEncoding(enc);
            } catch (Exception ex) {
                  System.out.println("Could not set request encoding: " + ex.getMessage());
            }
      }
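
To see why the encoding has to be settled before the parameters are read, here is a tiny stand-alone sketch of what happens when UTF-8 bytes are decoded with the wrong charset. WrongCharsetDemo is an invented name, and this only imitates the container's decoding step; it does not use the servlet API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WrongCharsetDemo {
    // Decode raw request bytes with the given charset, as a container would.
    static String decodeAs(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // What a browser sends for "francais" with a c cedilla from a UTF-8 page
        byte[] sent = "fran\u00e7ais".getBytes(StandardCharsets.UTF_8);

        String right = decodeAs(sent, StandardCharsets.UTF_8);
        String wrong = decodeAs(sent, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // prints 8
        System.out.println(wrong.length()); // prints 9: the cedilla became two junk characters
    }
}
```

The two bytes of the ç (0xC3 0xA7) come out as the two characters U+00C3 and U+00A7 when read as ISO-8859-1, which is the classic garbled-accent symptom.
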
Fair Enough,

So what exactly happens here (and correct me if I am wrong):

1. I write out the JSP page with utf-8 set. This allows users to enter whatever they like.
2. This gets submitted to the servlet, which checks whether an encoding was sent with the request or not.
3. If not, the servlet sets the encoding to a default (ISO-8859-1 in your code); otherwise it takes whatever encoding was sent.

I am not sure this solves the problem of not sending junk to the server. In fact, the code will accept the input if the page has utf-8, and then the backend database will not understand the user's credentials, cannot validate anything against them, and generates an error code that gets returned to the client.

Would you consider this worth a call to the server, or would you restrict the user at the client from entering international characters?

Please do not get me wrong, but I have a clear objective to restrict users from entering French and Spanish characters, all because the database backend system does not support international character sets (I should have given this background at the start).

Please advise with your thoughts - Thanks again

ASKER CERTIFIED SOLUTION
I guess the above is OK for names, but if you are providing feedback or comment or address information, you'll need to allow a lot more characters, probably an absolute minimum of:
     regexp = /[0-9A-Za-z (),\.';#"-]/;

But I still think whoever told you that your DB doesn't support these characters either doesn't know the DB or is lying :-)