Link to home
Start Free TrialLog in
Avatar of PearlJamFanatic
PearlJamFanatic

asked on

need help in designing a comparator with mapping rules

Hi,

I am trying to implement a comparator in JAVA. The objective of the comparator is to compare 3 CSV files. the CSV files will contain the table dump from sybase 12 , sybase 15 and oracle dbs. The is supposed to be identical except a few exceptions caused by the variation in available datatypes in these databases. The comparision will be something like
1) take the first row from all the 3 csv files
2) compare each field of this row with the coprable field of the other two csv files
3)PASS if identical or complies with the mapping rule mentioned below
   a)date in format mmddyyy is same as date in mon, day, year
   b) ignore numbers beyond 4th decimal points (1.11125 is equal to 1.11126)
   c) sybase char field containing trailing spaces will be trimmed in oracle. Hence oracle won't have trailing spaces in varchar field even if the corresponding field in sybase has trailing space.
and  some more rules.

I want to know the optimal design for such a comparator. Please give me the high level design and if i need further detail i will ask as we go along.


and a few other rules

ASKER CERTIFIED SOLUTION
Avatar of for_yan
for_yan
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
I think you want to define the standard within the rules you specified and then
transofomr to that standard the files as you read and then just equals would allow you compare them
For eaxmple when you encounter date convert it to mmddyy
when you encounter number ignore the digits beyond 4 decimal points
trim all spaces.
Then you can open all three files and read first line from them
and then you can first tranform all items in the lines to the standard view
(you can have such method for the line)
then you can compare lines as awhole
if the match - all elenments match you go to the next line
if they don't match the go one by one by item and detemne which items are different

Waht shoud be the ultimate output?
The best approach will depends on whether you expect
that in general data should be more similar tha different.
If so than the outline which I mentioned above
may be useful.
otherwise you can start comparing item by item from
the very begginining
In general if your files are not too big, then
time will not be an issue anyway.


Avatar of Mick Barry
Create a class that represents a single row, and override its equals() method (and optionally hashCode()) so you can compare two rows
then loop through your file creating instances of each row and compare them.
Avatar of PearlJamFanatic
PearlJamFanatic

ASKER

So first trasnsform the data as per mapping rule and then compare. Is this the consensus?
The output is report file that contains all the mismatches. some of the tables contain millions of records. there are 92 tables in total that need comparision across these 3 databases.
Yes, this is probably reasonable thing to do

Comparing lines makes sense first - probably most of them will still be equal
If not equal then comparing item by item
Actually million records  will probably not be too bad if lines are not too long.
Say 200 chrs line pe million will be 200 mb file - that's not too terrible
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
You can probably attempt to standardize fields as  you retrieve them from database.
In Oracle all those which you mentioned can be provided in the query.
Most probably Sybase allows to impose the same standards  (mnumer of digits, format of date,etc)
OK. So this is what is the algorith looks like so far. I planning to use opencsv http://opencsv.sourceforge.net/

1) Read the current line from file1, file2 and file3. The opencsv has a method readNext(). It can be used like below
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
String [] nextLine;
while ((nextLine = reader.readNext()) != null)
Hence I will have 3 arrays each containing the current line from the respective DBs

2) Now search for the field/datatype that requires transformation. For example search for the Date type field in the oracle array. Once found transform the value using the mapping rule. Do this for all the mapping rules. At the end of this step all the arrays ideally would be identical.

3) compare the arrays field by field and report the mismatch to the report file.

Is this OK?

for_yan: If I transform at the time of retriving the table from the DB then the CSVs too will have the transformed value. Do you think this is a good approach?
sounds ok, you could also use a bean class (representing each row) to structure your code better
I don't see a problem with that.
Simply whe you select the date field you make sure that it is ij your stndard format
when you select number - make sure that it is in a standard format - i a snse of decimal points
and say length (add spacesfor xome number if necessary).
Then you cab just compare the whole lines and iif they are different then the difference would be because some field is raelly different
Object: can you please give me a example how this could be done?
public class MyRow {
    private String fieldWhatever1;
    private Date fieldWhatever2;
    ...

    public MyRow(String[] row) {
       // populate fields from row
    }

    ....

    public boolean equals(Object o) {
        // test if rows are equal
    }
}

Say, with the date, you will format using to_char(date_field 'YY-MM-DD')
I'm sure SQL has functions to specify how many digits you want in the number
(you can round the decimal places for sure, but you can also say how many
digist should be taken by those digits which preced the point)
and if you have smaller number you canadd spaces, etc.
If you standardize your csv files on output then your first step iof daeling with
line would be just to check thei equals method and only if that fails,
you can go back and compare item by item

And I guess you dump csv files also through java programs
then you can do standardisation not necessarily in SQL but also
in the printout in java. Though, I'm sure all theswe formattting options
are present in SQL  either
for example you may have a rule tohave all strings to be always trimmed,
dates always in MMDDYYYY form, and numbers in csv file should not have any leading spaces
and should always have 4 decimal points.

then from one table where you may by default extract

05/03/11  1.25  "  Tom"

from anothe table by default it will be

3-may-2011  1.2500012  "Tom"

you in both cases produce csv line:

05032011,1.2500,Tom and you just can use equlas() method for these lines

and now you can just compare the whole line

otherwise you'dd have to parse and standradize each piece at comparison time


Use opencsv ColumnPositionMappingStrategy to define a bean for each row. Bale out of equals as soon as the iteration therein produces bean1.equals(bean2) == false, otherwise return true
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial