Solved

Algorithm of files comparison

Posted on 2001-08-06
12
848 Views
Last Modified: 2008-02-26
I am writing a Java program that compares 10 files and shows the differences, so that the user can easily edit them. This is similar to WinMerge, which can compare only 2 files. Has anyone come across a suitable algorithm, perhaps somewhere on the web?
Moreover, the program needs to compare text in different languages, and it seems that different languages need different algorithms. For now I need to support English and Chinese (in Unicode). Do I need to implement two algorithms for these two languages? What about other languages?

Also, if a file contains Chinese and English at the same time, how should the comparison be handled?


Thanks!
0
Comment
Question by:fungho
12 Comments
 
LVL 10

Expert Comment

by:rajesh_bala
ID: 6355379
listening
0
 

Author Comment

by:fungho
ID: 6355968
listening?
0
 
LVL 3

Expert Comment

by:dnoelpp
ID: 6358338
rajesh doesn't know the answer, but is interested in the answer...

Rajesh, this is not necessary; just press the subscribe button to receive e-mail notifications.
0
 
LVL 3

Expert Comment

by:dnoelpp
ID: 6358346
And, fungho, as for the different languages in Unicode, there's not really a problem. Each "letter", or better said each "symbol", is represented by a char value. Just do the comparison on char values. This way, the Latin letter "a" will be recognized as different from a Chinese symbol.

Say, you have two files:

--- file one ---
Hello, World!
--- file one ---

and

--- file two ---
Hello, <Chinese Symbol for Chinese>World!
--- file two ---

then the symbol in front of the capital Latin letter 'W' (the Chinese symbol for "Chinese") is treated as a single "letter". So the second file differs from the first only by the additional letter <Chinese Symbol for Chinese>.
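A minimal sketch of that idea, not a full diff algorithm: it compares two strings char by char, skips the common prefix and suffix, and returns what is left of the second string. The class and method names are made up for illustration, and \u4E2D is only a stand-in for the Chinese symbol in the example above. It assumes every symbol fits in one char; characters outside the Basic Multilingual Plane take two chars and would need code-point handling.

```java
class CharDiffSketch {
    // Length of the longest common prefix, compared char by char.
    static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // Length of the longest common suffix that does not overlap the prefix.
    static int commonSuffix(String a, String b, int prefix) {
        int n = Math.min(a.length(), b.length()) - prefix;
        int i = 0;
        while (i < n && a.charAt(a.length() - 1 - i) == b.charAt(b.length() - 1 - i)) {
            i++;
        }
        return i;
    }

    // The part of b that lies between the common prefix and the common suffix.
    static String changedPart(String a, String b) {
        int p = commonPrefix(a, b);
        int s = commonSuffix(a, b, p);
        return b.substring(p, b.length() - s);
    }

    public static void main(String[] args) {
        String one = "Hello, World!";
        String two = "Hello, \u4E2DWorld!"; // stand-in Chinese symbol (assumption)
        String extra = changedPart(one, two);
        System.out.println("extra chars in file two: " + extra.length());
    }
}
```

Latin and Chinese symbols go through exactly the same comparison here; no language-specific branch is needed at the char level.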
0
 
LVL 4

Expert Comment

by:omry_y
ID: 6358494
1. What exactly are you trying to achieve?
2. Why different algorithms for different languages?
0
 

Author Comment

by:fungho
ID: 6367421
It needs different algorithms because in English there is a space between words, but in Chinese there is none. So I need to split the Chinese text character by character.
0
 
LVL 3

Accepted Solution

by:
dnoelpp earned 200 total points
ID: 6367610
Now I understand your problem a little better. Do you want a sort of overview with the count of each word, so that in a text the word "I" is counted 4 times, "a" 5 times, "times" 3 times, and so on? To achieve this, we can use a sorted map keyed by word (for example a TreeMap), but that's not the problem. You want to know how to extract words from the text while taking into account how differently Chinese and English separate words.

There's the class Character.UnicodeBlock. One of its constants is CJK_UNIFIED_IDEOGRAPHS. I think this is the block for the Chinese symbols.

So, to find out whether a character is a Chinese symbol, use this:

Character.UnicodeBlock.of(c) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS

When a character is Chinese, just dump it into the hash map for counting.

Then, to find out whether the character is a Latin letter, use this:

Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN

I am not sure, however, whether BASIC_LATIN covers accented letters such as é or ç; that's up to you to find out.
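A quick way to check is to ask Character.UnicodeBlock directly. The sketch below wraps the two tests from above in helper methods (the class and method names are made up for illustration). BASIC_LATIN is the ASCII range U+0000 to U+007F, so an accented letter such as U+00E9 falls into LATIN_1_SUPPLEMENT instead. Note also that BASIC_LATIN contains digits and punctuation, not just letters, so it is probably worth combining it with Character.isLetter.

```java
class BlockCheckSketch {
    // True if c lies in the CJK Unified Ideographs block.
    static boolean isChinese(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    // True if c lies in the Basic Latin block (ASCII, U+0000 to U+007F).
    static boolean isBasicLatin(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN;
    }

    public static void main(String[] args) {
        char plain = 'a';
        char accented = '\u00E9'; // e with acute accent
        char chinese = '\u4E2D';  // a common CJK ideograph, used as a sample

        System.out.println("a is Basic Latin:      " + isBasicLatin(plain));
        System.out.println("U+00E9 is Basic Latin: " + isBasicLatin(accented));
        System.out.println("U+00E9 block:          " + Character.UnicodeBlock.of(accented));
        System.out.println("U+00E9 is a letter:    " + Character.isLetter(accented));
        System.out.println("U+4E2D is Chinese:     " + isChinese(chinese));
    }
}
```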

When you have a Latin letter, it can be part of a word. How to find out? I suggest partitioning the Unicode character set into three parts: 1. Chinese symbols 2. Latin letters 3. None of the first two.

Algorithm:

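Read the file character by character. Start with an empty word (for English). For each character do:

1. Chinese symbol? If the word is not empty, dump it into the hash map and make it empty. Then dump the symbol into the hash map!
2. Letter? Add it to the word.
3. None of the two? If the word is empty, don't do anything. Otherwise dump the word into the hash map and make it empty.

At the end of the file, dump any word that is still pending.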
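The algorithm could be sketched roughly as follows. This is only an illustration under a few assumptions: the class and method names are invented, the input is a String rather than a file, the "letter" test uses Character.isLetter (so accented letters count too) instead of a BASIC_LATIN check, and a TreeMap keeps the counts sorted. It flushes any pending word when a Chinese symbol is reached and again at the end of the input.

```java
import java.util.Map;
import java.util.TreeMap;

class MixedTokenizerSketch {
    // Adds the pending word (if any) to the counts and empties the buffer.
    private static void flush(StringBuilder word, Map<String, Integer> counts) {
        if (word.length() > 0) {
            counts.merge(word.toString(), 1, Integer::sum);
            word.setLength(0);
        }
    }

    // Counts English words and single Chinese symbols in the given text.
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
                flush(word, counts);                              // end any pending word
                counts.merge(String.valueOf(c), 1, Integer::sum); // each symbol is its own token
            } else if (Character.isLetter(c)) {
                word.append(c);                                   // still inside a word
            } else {
                flush(word, counts);                              // space, digit, punctuation
            }
        }
        flush(word, counts);                                      // word at end of input
        return counts;
    }

    public static void main(String[] args) {
        // Sample text with two CJK ideographs written as escapes (assumed sample data).
        Map<String, Integer> counts = countTokens("I saw a cat, I saw \u4E2D\u6587\u4E2D");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

For a file comparison rather than a word count, the same loop could collect the tokens into a list instead of a map, and the lists of two files could then be compared token by token.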

Link:

www.unicode.org
0
 

Author Comment

by:fungho
ID: 6368652
Thanks, dnoelpp! In fact, I have already solved the problem myself, but I check whether a character is English or Chinese with this test: ( < "\u00FF")... I will try your method tomorrow. I also have another question: if a language is written from right to left and has numbers and English embedded in it, some words need to be read from left to right. How do I deal with this? It doesn't matter if you don't know... Thanks!
0
 
LVL 3

Expert Comment

by:dnoelpp
ID: 6369332
That's a problem for Swing. Some Swing components, say JLabel, have to cope with it. Imagine an editor where you edit Arabic and English text at the same time. What happens when you move the cursor from the English part to the Arabic part?

Strings (Unicode sequences), however, follow the principle that the "reading" order is what matters, not the placement on screen. An example: two words, "english" and "arabic", are in a string, and "arabic" is printed from right to left. The string would then be displayed like this:

english cibara

But the string sequence is as follows: 'e' 'n' 'g' 'l' 'i' 's' 'h' ' ' 'a' 'r' 'a' 'b' 'i' 'c'
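To see this from code, one option is the JDK class java.text.Bidi, which is not mentioned in this thread and is brought in here only as an illustration. The sketch below (class and method names invented, and a real Arabic word written as escapes as sample data) shows that the chars stay in reading order in the string, while Bidi reports which of them would be laid out right to left. An even level means left-to-right, an odd level right-to-left.

```java
import java.text.Bidi;

class LogicalOrderSketch {
    // Returns the bidi embedding level of every char in the text.
    static int[] levels(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        int[] out = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            out[i] = bidi.getLevelAt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // "english" followed by a four-letter Arabic word, stored in reading order.
        String text = "english \u0639\u0631\u0628\u064A";
        int[] lv = levels(text);
        for (int i = 0; i < text.length(); i++) {
            // Print code points in hex so the output does not depend on console fonts.
            System.out.println(i + ": U+" + Integer.toHexString(text.charAt(i)).toUpperCase()
                    + " level " + lv[i]);
        }
    }
}
```

A comparison that works on the char sequence therefore does not need to know about the display direction; only the component that paints the text does.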
0
 

Author Comment

by:fungho
ID: 6370886
So does Java Swing handle Arabic text running from right to left by itself, or do I need to deal with it?
0
 
LVL 3

Expert Comment

by:dnoelpp
ID: 6371280
Swing does. Please read this article:
http://java.sun.com/products/jfc/tsc/articles/bidi/index.html

The keyword is "bidi" (short for bi-directional text).

By the way, this could be interesting for you (collating and sorting in languages other than English):
http://java.sun.com/docs/books/tutorial/i18n/text/index.html
0
 

Author Comment

by:fungho
ID: 6371417
Thanks for your help! I will read it later!
0

Question has a verified solution.
