fungho
asked on
Algorithm of files comparison
I am writing a Java program that compares 10 files and shows the differences, so that the user can easily edit them. It is similar to WinMerge, which can only compare 2 files. Is there a known algorithm for this, perhaps one that someone has come across on the web?
The program also needs to compare text in different languages, and it seems that different languages would need different algorithms. For now I need to support English and Chinese (in Unicode). Do I need to implement two algorithms for these two languages? What about other languages?
Also, if a file contains Chinese and English at the same time, how should the comparison be handled?
Thanks!
listening
ASKER
listening?
rajesh doesn't know the answer, but is interested in the answer...
Rajesh, this is not necessary; just press the subscribe button to receive e-mail notifications.
And, fungho, as for the different languages in Unicode, there's not really a problem. Each "letter", or better said each "symbol", is represented by a char value. Just do the comparison on char values. This way, the Latin letter "a" will be recognized as different from a Chinese symbol.
Say, you have two files:
--- file one ---
Hello, World!
--- file one ---
and
--- file two ---
Hello, <Chinese Symbol for Chinese>World!
--- file two ---
then the symbol in front of the capital Latin letter 'W' (the Chinese symbol for "Chinese") is treated as a single "letter". So the second file differs only by the additional letter <Chinese Symbol for Chinese>.
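The point about comparing on character values can be sketched in Java as follows. This is only an illustration, not a full diff algorithm: it strips the common prefix and suffix of two strings and reports what is left of the second one. The class name `CharDiffSketch` and the method `insertedPart` are made up for this example, and U+4E2D is used as a stand-in for the placeholder symbol above. It compares code points rather than raw `char` values so that symbols outside the Basic Multilingual Plane also count as one "letter".

```java
final class CharDiffSketch {

    /**
     * Returns the part of b that is left after removing the prefix and
     * suffix it shares with a. Hypothetical helper, not a general diff.
     */
    static String insertedPart(String a, String b) {
        // Work on code points so every Unicode symbol is one array element.
        int[] x = a.codePoints().toArray();
        int[] y = b.codePoints().toArray();

        // Length of the common prefix.
        int prefix = 0;
        while (prefix < x.length && prefix < y.length && x[prefix] == y[prefix]) {
            prefix++;
        }

        // Length of the common suffix, not overlapping the prefix.
        int suffix = 0;
        while (suffix < x.length - prefix && suffix < y.length - prefix
                && x[x.length - 1 - suffix] == y[y.length - 1 - suffix]) {
            suffix++;
        }

        return new String(y, prefix, y.length - prefix - suffix);
    }

    public static void main(String[] args) {
        String one = "Hello, World!";
        String two = "Hello, \u4E2DWorld!"; // stand-in Chinese symbol
        String extra = insertedPart(one, two);
        System.out.println(extra.codePointCount(0, extra.length()) + " extra symbol(s)");
    }
}
```

No language-specific branch is needed here: the Latin letters and the Chinese symbol go through the same equality test.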
1. What exactly are you trying to achieve?
2. Why different algorithms for different languages?
ASKER
Different algorithms are needed because in English there is a space between words, whereas in Chinese there is none. So I need to split Chinese text character by character.
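A minimal sketch of the splitting described here, assuming that "Chinese character" can be approximated by `Character.isIdeographic` (available since Java 7): space-separated text becomes word tokens and each ideograph becomes its own token, so one routine covers English, Chinese, and mixed text. The class name `MixedTokenizerSketch` and its methods are hypothetical, and this is not the accepted solution from the thread.

```java
import java.util.ArrayList;
import java.util.List;

final class MixedTokenizerSketch {

    /** Splits text into whitespace-delimited words and single ideographs. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isIdeographic(cp)) {
                // An ideograph ends any pending word and stands alone.
                flush(word, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isWhitespace(cp)) {
                // Whitespace only separates words; it is not a token itself.
                flush(word, tokens);
            } else {
                word.appendCodePoint(cp);
            }
        }
        flush(word, tokens);
        return tokens;
    }

    /** Moves the pending word, if any, into the token list. */
    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Mixed English and Chinese in one line.
        System.out.println(tokenize("Hello, \u4E2D\u6587World!").size() + " tokens");
    }
}
```

The resulting token lists can then be compared the same way regardless of language. Punctuation stays attached to the neighbouring word in this sketch; a real tool would probably want to split it off too.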
ASKER CERTIFIED SOLUTION
This solution is only available to members.
ASKER
Thanks, dnoelpp! In fact, I had already solved the problem myself, but I check whether a character is English or Chinese using this method: ( < "\u00FF")... I will try your method tomorrow. I also have another question: if a language is written from right to left and has numbers and English embedded in it, some words need to be read from left to right. How do I deal with this? It doesn't matter if you don't know... Thanks!
That's a problem for Swing. Some Swing components, say JLabel, have to cope with it. Imagine an editor where you edit Arabic and English text at the same time. What happens if you move the cursor from the English part to the Arabic part?
Strings, or Unicode sequences, however, follow the principle that the "reading" order is what matters, not the placement on screen. An example: two words, "english" and "arabic", are in a string, and "arabic" is printed from right to left. Then the string would be displayed like this:
english cibara
But the string sequence is as follows: 'e' 'n' 'g' 'l' 'i' 's' 'h' ' ' 'a' 'r' 'a' 'b' 'i' 'c'
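To illustrate the reading-order principle with real Arabic letters, here is a small sketch using `java.text.Bidi` from the standard library. The string is stored in logical order, and `Bidi` only reports which runs a renderer would draw right to left. The class name `BidiOrderSketch`, the helper `analyze`, and the sample text are assumptions made for this example.

```java
import java.text.Bidi;

final class BidiOrderSketch {

    /** Analyses the text, taking the base direction from its first strong character. */
    static Bidi analyze(String text) {
        return new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
    }

    public static void main(String[] args) {
        // "english" followed by four Arabic letters, stored in reading order.
        String text = "english \u0639\u0631\u0628\u064A";
        Bidi bidi = analyze(text);

        for (int run = 0; run < bidi.getRunCount(); run++) {
            // An odd embedding level means the run is displayed right to left.
            boolean rightToLeft = (bidi.getRunLevel(run) & 1) == 1;
            System.out.println("chars " + bidi.getRunStart(run) + " to "
                    + (bidi.getRunLimit(run) - 1)
                    + (rightToLeft ? " display right-to-left" : " display left-to-right"));
        }
    }
}
```

For a comparison tool this means the character sequence can be compared as stored; the direction only matters when the text is drawn.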
ASKER
Then does Java Swing handle Arabic text running from right to left, or do I need to deal with it myself?
Swing does. Please read this article:
http://java.sun.com/products/jfc/tsc/articles/bidi/index.html
The keyword is "bidi" (short for bi-directional text)
By the way, this could be interesting for you (collating and sorting in languages other than English):
http://java.sun.com/docs/books/tutorial/i18n/text/index.html
ASKER
Thanks for your help! I will read it later!