
Need width of a string, how? (for asian fonts)

I am trying to write an email containing data in a table. The email will be read in a mail viewer that uses a fixed-width font so I would like to line up the data in columns.

If I were dealing with ASCII-only data that would be easy, as each printing character always has a width of exactly 1.

However I am writing out Japanese characters as data. In Japanese most characters have a width of two, but some have a width of only 1.

Is there any way for me to figure out (assuming I am using a fixed-width font):

#1 the width of a string

OR

#2 if a character is single-width or double-width (then I could just loop through all the characters in a string to figure out the width).


ASKER

Please show me how. Furthermore, how do I get a FontMetrics object since it is an abstract class?

FontMetrics fm = comp.getFontMetrics(font);
int width = fm.stringWidth(s);

ASKER

I don't have any components to call getFontMetrics() on ... This application has no GUI.

Then how is the string getting displayed?

ASKER

Read the question again. I am sending emails. Nothing needs to be displayed. The application is run from the console or called externally from another program.

Sorry I thought your application was the mail viewer :)

There is no way you can find the width then, in fact the width will be different depending on what application is viewing it and what font they have installed in their system.

ASKER

Yes you are absolutely right :) But ...

The mail viewer will be using a fixed-width font. So as long as I line things up on my end they will line up in the mail viewer.

You might still be able to use a FontMetrics to calculate your width - something like
FontMetrics fm = Toolkit.getDefaultToolkit().getFontMetrics(new Font(...));

ASKER

Yes, that would work but that method is deprecated ...

What is the "new" (non-deprecated) way of getting a FontMetrics object?

The API docs talk of getting a LineMetrics object but that object does not have a stringWidth() method ....

Did the java people forget to include such a method?

LineMetrics are not used for calculating rendered string width, that's what FontMetrics are for.

ASKER

Ok, so FontMetrics it is.

I still can't figure out how to get a FontMetrics object though.

FontMetrics fm = Toolkit.getDefaultToolkit().getFontMetrics(new Font(...));

Would seem the way to go but this method is deprecated.

You could create an image and get the font metrics objects from the associated Graphics object.
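A minimal sketch of that suggestion, assuming a 1x1 in-memory BufferedImage is an acceptable stand-in for a display. The class name HeadlessMetrics and the choice of the logical Font.MONOSPACED font at size 16 are illustrative only and do not come from the thread. No visible component is needed, but the numbers still depend on the fonts installed on the machine that runs it.

```java
import java.awt.Font;
import java.awt.FontMetrics;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper: obtains a FontMetrics without any on-screen component.
class HeadlessMetrics {
    static FontMetrics metricsFor(Font font) {
        // A 1x1 off-screen image is enough to get a Graphics context.
        BufferedImage image = new BufferedImage(1, 1, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = image.createGraphics();
        try {
            return g.getFontMetrics(font);
        } finally {
            g.dispose(); // release the graphics context once the metrics are obtained
        }
    }

    public static void main(String[] args) {
        // Font name and size are assumptions chosen for illustration.
        FontMetrics fm = metricsFor(new Font(Font.MONOSPACED, Font.PLAIN, 16));
        System.out.println("stringWidth: " + fm.stringWidth("ABC"));
        System.out.println("charWidth:   " + fm.charWidth('A'));
    }
}
```

The widths are in pixels for the font that Java actually resolved, so they are only useful for comparing characters with each other, not as a prediction of what the mail viewer will show.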

ASKER

I agree, but there must be a way to get a FontMetrics object without creating anything I don't need?

Or is this a bug in Java (i.e. they forgot a method to get a FontMetrics object when one has no visual components).

The size of a font is dependent on the attributes of what you are rendering to, thus you need to know what you are rendering to in order to get a FontMetrics instance.

> Or is this a bug in Java

No, it's not a bug. A method exists but it has been deprecated for the above reasons.

But through the LineMetrics also you can get the string width, right?

ASKER

ksivananth:

No, LineMetrics does not have a corresponding stringWidth() method

objects:

Hum ... To know how wide a string is all one needs to know is what Font is being used, no?

There is no need for a Component.

So there should be a way to get the width of a string knowing only which Font will be used to render it.

Or are things more complicated than I think?

> Or are things more complicated than I think?

Things are more complicated than you think :)
The width will vary depending on the device it is being rendered to.
This may not be a problem in your case, as you not only do not know what device the font will be rendered on, but you also do not know the details of the actual font being used. So whatever you use, it is not going to be accurate anyway.

ASKER

You're right.

I had forgotten about that since in my case I don't care about the actual size, but the relative size, i.e. I just want to know if a character takes up 1 "space" or 2, or whether two Strings are the same width; if not, I pad them with spaces until they are.

I guess I'll get a FontMetrics object from a dummy Component and use that.

Things were indeed not as simple as I had imagined (or hoped ... ;)

Points go to objects unless there are object(ions)s? :)

Are you using my suggestion?

ASKER

No because it is deprecated. Though it *is* the solution I *would* like to use since it means I don't have to create a dummy component ...

But objects explained pretty well why that method got deprecated and why it makes no sense to get a FontMetrics object without something that will actually display that font.

I'm thinking the only direction to go is taking your value and doing a length() on it. Then turn your string into, say, a ByteArrayInputStream using getBytes(). You can then take each character individually using its byte values. You'll probably have to work on the basis of your solution #2, but that way you'll end up with the data you want.

Interesting suggestion - but do the half-width characters correspond to single-byte characters? I think not, and I think objects has earnt the points. Have fun implementing, totsubo. It's kind of a strange thing to be doing... using AWT in a console app...
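The doubt about byte counts can be checked with a small sketch. It uses UTF-8 purely as an example encoding (an assumption; other encodings give different byte counts), and the class name ByteLengthCheck is made up for illustration. It prints the encoded length of an ASCII letter, a half-width katakana and a full-width hiragana.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: byte length in UTF-8 versus display width.
class ByteLengthCheck {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));      // ASCII letter, one column wide: 1 byte
        System.out.println(utf8Length("\uFF71")); // half-width katakana A, one column wide: 3 bytes
        System.out.println(utf8Length("\u3042")); // hiragana A, two columns wide: 3 bytes
    }
}
```

In this encoding the half-width and full-width kana both take three bytes, so the byte count alone does not separate the two width classes.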

ASKER

burtdav:

It might seem strange but it's not :)

The app is run from a cron job that picks up invoicing data from a database then automatically generates the emails to send out.

I want the invoicing data to line up properly, in a tabular form.

True, that if the user uses a mail viewer that doesn't use a fixed-width font this solution has no impact. But for those that do it will make for a nicely-formatted email :)

It's a lot of work for little pay-back, but it's what the customer wants ... :(

I was referring to the simple fact of using windowing components in a non-windowed app being weird. Well, it's the customer who pays the bills, isn't it? I hope you're able to provide value even in this case.
Cheers!

ASKER

I agree it's weird.

As for value ... I guess at the rate I get paid the company is getting value.

But I still have a long way to go before I can say my Java programs are well-written. Lots more practice needed.

ASKER

Object:

I had run some quick tests and your method seemed to be working, but now that I am using it on real data it doesn't anymore. Can you help?

I use the function below many times as I build up a line to make the columns line up. I pad with "+" at the end of the line until I reach the beginning of the next column.

Font font = new Font("Courier", Font.PLAIN, 2);
FontMetrics fm = Toolkit.getDefaultToolkit().getFontMetrics(font);
int LINE_LENGTH = 72;
String mpc, title, dsc, qty, price, total;

String line = mpc;
line = pad(line, 10) + title;
line = pad(line, 50) + dsc;
line = pad(line, 60) + qty;
line = pad(line, 63) + price;
line = pad(line, LINE_LENGTH - 5) + total;

// Appends "+" until the measured width of s reaches len, then returns the padded string
String pad(String s, int len) {
while (fm.stringWidth(s) < len) {s += "+";}
System.out.println(s + " (this line is " + fm.stringWidth(s) + " wide)");
return s;
}

Here is some sample output. If you copy-paste these lines into a Japanese text editor using a fixed-width font, the lines do not line up, but Java says they are the same width:

AIO-048++ジャイアントブッツ+++++++++++++++++++GOODS++1++5800++++++5800 (this line is 77 wide)
DOLL-010+少女いたずら　こんなコトするの初めてだよ (DVD)
++++++++++++++++++++++++++++++++++++++++++++++++++DVD+++++++1++3700++++ 3700 (this line is 77 wide)
DDGB-016+拷問診察室　美少女クリニック　16++++++++DVD+++++++1++4800++++++4800 (this line is 77 wide)

Help! :)

> Font font = new Font("Courier", Font.PLAIN, 2);

Shouldn't you be using a Japanese font?

ASKER

Good point. I changed my code to use a Japanese font but I still have the same problem:

Font font = new Font("FixedSys", Font.PLAIN, 16);
FontMetrics fm = Toolkit.getDefaultToolkit().getFontMetrics(font);

System.out.println("AAAAAAAAAA" + " (" + fm.stringWidth("AAAAAAAAAA") + " wide)");
System.out.println("1234567890" + " (" + fm.stringWidth("1234567890") + " wide)");
System.out.println("あああああ" + " (" + fm.stringWidth("あああああ") + " wide)");
System.out.println("１２３４５" + " (" + fm.stringWidth("１２３４５") + " wide)");
System.out.println("私は長いァ" + " (" + fm.stringWidth("私は長いァ") + " wide)");

OUTPUT:

AAAAAAAAAA (110 wide)
1234567890 (90 wide)
あああああ (80 wide)
１２３４５ (80 wide)
私は長いァ (80 wide)

Though it seems that the Japanese characters are always 16 pixels wide ...

(As a test) have you tried displaying them using Java to see whether they do in fact line up?

ASKER

No, as whether they line up in Java or not is not important.

One of the specs is that in the mail viewer one ASCII character takes up one space and one full-width japanese character takes up two spaces.

Unfortunately there are also Japanese half-width characters that take up one space, so I can't just check to see if a character falls in the ASCII range or not :(

I've tried to find a list of the Unicode ranges for half-width characters but with no luck. As far as I can tell they are all over the place ...

> One of the specs is that in the mail viewer one ASCII
> character takes up one space and one full-width japanese character takes up two spaces.

Does the Java fixed width font follow the same rules?

ASKER

Yes, as far as I can tell it does. All fonts that support Japanese that I have tested follow the same rules.

Then testing if it lines up in Java should be useful then.
As it should line up in Java.

ASKER

No, I guess I didn't quite catch your question.

The test case I gave shows that in Java the characters do not line up:

AAAAAAAAAA (110 wide)
1234567890 (90 wide)

What I meant in my answer to your question was this:

If a character is double-width in the email viewer, it will be double-width in Java, and the same for half-width characters.

But whereas in the email viewer all half-width characters have the same width (and the same goes for the full-width characters), I have yet to find in Java a truly fixed-width font where all half-width characters have the same width (in pixels) when using fm.stringWidth() to measure the width.

Can you populate a boolean[] reference array with false for single-width characters and true for double-width characters? Like this:
// in a class
private static boolean[] charIsDoubleWidth;
// in a constructor or method, before it needs to be used
if (charIsDoubleWidth == null) {
charIsDoubleWidth = new boolean[65536];
for (int i = 0; i < charIsDoubleWidth.length; i++) {
charIsDoubleWidth[i] = (i > 0xff && i != 0x1234 && i != 0x1235 // ...
);
}
}
It would be somehow better to initialise it with an aggregate (public static final boolean[] cidw={false,...}), but that would be prohibitively huge.
Then testing a character is as simple as evaluating charIsDoubleWidth[charToTest]. But if you don't know that list, or if it's impractical to express in terms of exceptions like I've tried to show, then this is not your solution.

I'm getting confused. So are you saying that in the email viewer all Japanese characters have the same width?
But the message may contain a mix of japanese and ascii characters.

ASKER

Objects:

You've almost got it. The text can contain a mix of Japanese and ASCII characters, *and* to make matters more complicated some Japanese characters take up the same space as ASCII characters whereas others (most) take up twice as much space.

burtdav:

Your suggestion is good but how do you know if a character is half or double width? All characters in the ASCII range are half-width but not all characters above that are full width ...

My rule above accounts for that: "i > 0xff" says that double-width characters are all above '\u00ff', and the "!="s after that specify single-width characters; read it like this:
charIsDoubleWidth[i] = (i > 0xff && i != 0x1234 && i != 0x1235 ...)
double width if (above ascii range BUT not '\u1234' AND not '\u1235' etc.)
You could do this if it was practical to type in the character codes of all the exceptions. You could do this using a target mail client: generate an email with characters next to character codes on separate lines, and it will be easy to differentiate between the two types.

ASKER

I agree that your solution would work; the only problem is that I don't know what all the half-width characters are ...

I can guess at most of them (all the half-width kana) but there are some I don't know about. There are many half-width punctuation marks and graphics that I don't know about.

I've tried looking for a chart of these but can't find one.

You can make one by generating a (fairly long) email...
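import java.io.*;

public class ListCharacters {
    public static void main(String[] args) throws IOException {
        // Write as UTF-8 so the characters survive whatever the platform default encoding is
        PrintWriter out = new PrintWriter(new OutputStreamWriter(new FileOutputStream("myoutputfile.txt"), "UTF-8"));
        // Count with an int: a char counter wraps around after '\uffff' and would loop forever
        for (int i = 1; i <= 0xffff; i++) {
            out.println((char) i + " " + i); // one character and its code per line
        }
        out.close(); // close() also flushes
    }
}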
Hopefully that will make a unicode file you can copy into an email and view in your email client - widths should become apparent.

On a side note, what character encoding are you using to mix Japanese and ASCII characters?

objects, I think that characters 0x00 through to 0x7f are fairly consistent between most modern encodings; so 0x0041 in an Asian character set would represent 'A'.

ASKER

burtdav:

You would like me to go through 65,535 characters by hand?

Objects:

I'm using iso-2022-jp and, though I am not an expert, I believe that for all Japanese encodings anything below char(256) is single-width.

Can you use the FontMetrics to determine which characters are single width and which are double width?

It's just an idea: if there are relatively few single-width characters, you can set up rules for finding them like I've explained. You would only have to search through the limited range of characters that are actually used. If it's still a mixture (ie a lot of single-width characters, and not just in a few ranges), then obviously it's not practical.

Again, you might be able to set up the same kind of thing as a literal array using FontMetrics - you could have a once-off java program to produce the code for that array by checking the width using a FontMetrics in a graphical context.

Thinking even more outside the square, can you use tab characters or html tables to do the formatting? Though I don't suppose you'd be here if you could.

ASKER

burtdav:

If I could use tabs I wouldn't be here :)

objects:

Using *anything* to find the width of a character would be fine. But as I showed with my little test, characters which have the same width in the email viewer (i.e. one "space") don't give the same width using FontMetrics ...

The following two strings take up the same width in the viewer but FontMetrics reports two different widths:

AAAAAAAAAA (110 wide)
1234567890 (90 wide)

I realise that the Java font you are using has varying widths, but you may still be able to use it to distinguish whether a character is single or double width. ie. you don't use the width directly, you just use it to determine if its a single or double width char.
You could then count how many single and double width characters there are and calculate width simply based on these numbers.

width = w * (s + (2 * d))
where
w = single char width in email viewer
s = # of single width chars
d = # of double width chars

ASKER

"ie. you don't use the width directly, you just use it to determine if its a single or double width char"

That's what I've been trying to do all along :) So how does one use a character's width to decide if it's single or double sized? I think I see where you are going with this but I just want to make sure ...

int w = getFontMetricsWidth(charToTest);
// compare to an arbitrary width in this font, below which all characters are "single-width" and at or above which all characters are "double-width"
boolean charIsDoubleWidth = w >= 150;

I'm assuming that a double width char will be about twice as wide as a single width characters.
eg.

width= 9 -> single
width=19 -> double
width=11 -> single
width=18 -> double
width=22 -> double
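The classification above, combined with the width formula, can be sketched roughly as follows. The class name ColumnCounter and the cut-off of 16 pixels are assumptions for illustration; the right threshold depends on the font being measured. The result is counted in viewer columns, i.e. with w taken as 1.

```java
// Hypothetical sketch: turn measured pixel widths into a column count.
class ColumnCounter {
    // Glyphs at least `threshold` pixels wide count as 2 columns, all others as 1.
    static int columns(int[] pixelWidths, int threshold) {
        int single = 0;
        int dbl = 0;
        for (int w : pixelWidths) {
            if (w >= threshold) {
                dbl++;
            } else {
                single++;
            }
        }
        return single + 2 * dbl; // s + (2 * d), with one column as the unit
    }

    public static void main(String[] args) {
        int[] widths = {9, 19, 11, 18, 22}; // the example widths listed above
        System.out.println(columns(widths, 16)); // 2 singles + 3 doubles = 8
    }
}
```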

ASKER

Yup, that's the hack I finally came up with last night at 2am. I'm assuming that any character that has a width < 16 is single and >=16 is double.

int getWidth(String s) {
int l, width = 0;
char c;
Character character;
for (int i = 0; i < s.length(); i++) {
c = s.charAt(i);
character = new Character(c);
l = fm.stringWidth(character.toString());
if (l == 16) width += 2;
else width++;
}
return width;
}

Seems I got lucky and the font I picked uses the same width for all double-width characters (16) and it's only the single-width characters that have variable widths.

Horrible hack and I was hoping for a better solution but I guess there might not be one.

You can make it a bit safer by changing (l == 16) to (l >= 16) or maybe even (l >= 15).

You can also save some time by getting rid of all references to Character and changing (fm.stringWidth(character.toString())) to (fm.stringWidth(String.valueOf(c)))

> l = fm.stringWidth(character.toString());

There's a charWidth() function you can use instead of creating a string.

ASKER

Thanks for the optimisation tips, they've been incorporated.

Optimisation was the last thing on my mind last night. Just getting the bloody thing to work was an achievement :)

// I'm curious about the character set... what does this method display for the font you're using?
private void printChangeCount() {
    // "new" is a reserved word in Java, so the flags are named isWide and wasWide
    boolean isWide, wasWide = false;
    int count = 0;
    for (int c = 1; c < 65535; c++) {
        isWide = fm.charWidth(c) >= 16;
        if (isWide ^ wasWide) { // ^ on two booleans is a logical XOR: true when the width class changes
            count++;
        }
        wasWide = isWide;
    }
    System.out.println(count);
}

ASKER

It prints out 8652.

What does your function check?

Also how can I print out all the single-width characters?

I can't find a way to convert an int to a char or Character ...

> I can't find a way to convert an int to a char or Character ...

char c = (char) i;

why do you need that?
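A small sketch of that cast in use, building a string from a run of code points. The class name CharRange is made up, and the range U+FF61 to U+FF9F (the half-width katakana and punctuation run in Unicode) is chosen only as an example of characters one might want to print.

```java
// Hypothetical demo of converting int values to chars.
class CharRange {
    // Returns a string containing every char from first to last inclusive.
    static String range(int first, int last) {
        StringBuilder sb = new StringBuilder();
        for (int i = first; i <= last; i++) {
            sb.append((char) i); // explicit narrowing cast from int to char
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Whether these display correctly depends on the console encoding and font.
        System.out.println(range(0xFF61, 0xFF9F));
    }
}
```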

It checks how often a wide char is next to a narrow one or vice-versa, thus measuring how many contiguous blocks of narrow characters there are. There are 4326. That's a lot, and I can conjecture that they may be well-mixed within the used range of characters. How many of the characters are actually used?

ASKER

objects:

I was just curious as to what the half-width characters were so I wanted to print them out.

I still can't figure out what burtdav's function does though.

*and* I was able to finally find a table giving the widths for characters. PHP has a mb_strwidth() function that returns the width of a string. They use these values:

Unicode range      Character width
----------------------------------
U+0000 - U+0019    0
U+0020 - U+1FFF    1
U+2000 - U+FF60    2
U+FF61 - U+FF9F    1
U+FFA0 -           2

Now, I know this is a simple question, but how does one check the unicode value of a char?

Would I just do:

int getWidth(String s) {
int width = 0, c;
for (int i = 0; i < s.length(); i++) {
c = (int)s.charAt(i);
if (c >= 0x0020 && c <= 0x1FFF) {
width++;
}
else if (c >= 0xFF61 && c <= 0xFF9F) {
width++;
}
else if (c >= 0x2000 && c <= 0xFF60) {
width += 2;
}
else if (c >= 0xFFA0) {
width += 2;
}
}
return width;
}

Yes that looks reasonable.

There's no need to cast to int (assigning to c) - that cast is implicit.
It might be "nicer" to declare c as char anyway, and compare with char literals: if (c >= '\u0020' && c <= '\u1FFF') etc
As char is an integer data type, char and int are almost interchangeable. (The only exception is that you can't implicitly cast int to char, because char is smaller.)
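For the original goal of lining up columns, the range check can be combined with a padding helper. This is only a sketch under the assumption that the PHP ranges quoted above match the mail viewer; the names ColumnPadder, displayWidth and padTo are hypothetical. Like the code in this thread it works on UTF-16 chars, so characters outside the Basic Multilingual Plane are not handled specially.

```java
// Hypothetical sketch: pad strings to a column count using range-based widths.
class ColumnPadder {
    // Display width according to the ranges quoted from PHP's mb_strwidth().
    static int displayWidth(String s) {
        int width = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '\u0020') {
                continue; // control characters take no space
            }
            boolean narrow = c <= '\u1FFF' || (c >= '\uFF61' && c <= '\uFF9F');
            width += narrow ? 1 : 2;
        }
        return width;
    }

    // Appends spaces until the string occupies at least `columns` columns.
    static String padTo(String s, int columns) {
        StringBuilder sb = new StringBuilder(s);
        for (int w = displayWidth(s); w < columns; w++) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(padTo("AIO-048", 10) + padTo("\u3042\u3042\u3042", 10) + "5800");
        System.out.println(padTo("DDGB-016", 10) + padTo("\uFF71\uFF72", 10) + "4800");
    }
}
```

A string that is already wider than the requested column count is returned unchanged, so an overlong title still pushes the following columns to the right.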

My function adds 1 to its count every time it finds a character with width >= 16 next to a narrower character, ie if '\uff60' is wide and '\uff61' is narrow. If that table was going to produce the same results as your FontMetrics width >= 16 check, that function would return 3. So, either the table's wrong for this charset, or the width method is very unreliable.
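The figure of 3 can be reproduced without any font by running the same counting loop against the table itself. A sketch, with the hypothetical names ChangeCounter and countChanges, and with the table's width-2 ranges written as a predicate:

```java
import java.util.function.IntPredicate;

// Hypothetical sketch: count wide/narrow boundaries for any width rule.
class ChangeCounter {
    // Counts how often the wide/narrow classification flips between neighbouring code points.
    static int countChanges(IntPredicate isWide) {
        boolean previous = false;
        int count = 0;
        for (int c = 1; c < 65535; c++) {
            boolean current = isWide.test(c);
            if (current != previous) {
                count++;
            }
            previous = current;
        }
        return count;
    }

    public static void main(String[] args) {
        // Width-2 ranges from the PHP table: U+2000 to U+FF60, and U+FFA0 upwards.
        IntPredicate tableSaysWide = c -> (c >= 0x2000 && c <= 0xFF60) || c >= 0xFFA0;
        System.out.println(countChanges(tableSaysWide)); // prints 3
    }
}
```

The three flips are at U+2000, U+FF61 and U+FFA0, which is far from the 8652 reported for the font-based check.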

ASKER

I'm not sure what the reason is for the PHP table and your program's output not agreeing, but I would probably say it has to do with the charset.

Do you know of any font that uses the same code-space as Unicode? If so I could re-run your test using that font to see if it matches the table.

MS Word (2000+, I think) comes with a "universal font" as an option under localisation in the install. I think that's a Unicode font. Another issue might just be the arbitrary 16 point limit we've been using - a little higher or lower would change the results, maybe dramatically.