need help in designing a comparator with mapping rules

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

I think you want to define the standard within the rules you specified and then
transofomr to that standard the files as you read and then just equals would allow you compare them

For eaxmple when you encounter date convert it to mmddyy
when you encounter number ignore the digits beyond 4 decimal points
trim all spaces.
Then you can open all three files and read first line from them
and then you can first tranform all items in the lines to the standard view
(you can have such method for the line)
then you can compare lines as awhole
if the match - all elenments match you go to the next line
if they don't match the go one by one by item and detemne which items are different

Waht shoud be the ultimate output?

The best approach will depends on whether you expect
that in general data should be more similar tha different.
If so than the outline which I mentioned above
may be useful.
otherwise you can start comparing item by item from
the very begginining
In general if your files are not too big, then
time will not be an issue anyway.

Create a class that represents a single row, and override its equals() method (and optionally hashCode()) so you can compare two rows
then loop through your file creating instances of each row and compare them.

ASKER

So first trasnsform the data as per mapping rule and then compare. Is this the consensus?

ASKER

The output is report file that contains all the mismatches. some of the tables contain millions of records. there are 92 tables in total that need comparision across these 3 databases.

Yes, this is probably reasonable thing to do

Comparing lines makes sense first - probably most of them will still be equal
If not equal then comparing item by item

Actually million records will probably not be too bad if lines are not too long.
Say 200 chrs line pe million will be 200 mb file - that's not too terrible

SOLUTION

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

You can probably attempt to standardize fields as you retrieve them from database.
In Oracle all those which you mentioned can be provided in the query.
Most probably Sybase allows to impose the same standards (mnumer of digits, format of date,etc)

ASKER

OK. So this is what is the algorith looks like so far. I planning to use opencsv http://opencsv.sourceforge.net/

1) Read the current line from file1, file2 and file3. The opencsv has a method readNext(). It can be used like below
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
String [] nextLine;
while ((nextLine = reader.readNext()) != null)
Hence I will have 3 arrays each containing the current line from the respective DBs

2) Now search for the field/datatype that requires transformation. For example search for the Date type field in the oracle array. Once found transform the value using the mapping rule. Do this for all the mapping rules. At the end of this step all the arrays ideally would be identical.

3) compare the arrays field by field and report the mismatch to the report file.

Is this OK?

ASKER

for_yan: If I transform at the time of retriving the table from the DB then the CSVs too will have the transformed value. Do you think this is a good approach?

sounds ok, you could also use a bean class (representing each row) to structure your code better

I don't see a problem with that.
Simply whe you select the date field you make sure that it is ij your stndard format
when you select number - make sure that it is in a standard format - i a snse of decimal points
and say length (add spacesfor xome number if necessary).
Then you cab just compare the whole lines and iif they are different then the difference would be because some field is raelly different

ASKER

Object: can you please give me a example how this could be done?

public class MyRow {
private String fieldWhatever1;
private Date fieldWhatever2;
...

public MyRow(String[] row) {
// populate fields from row
}

....

public boolean equals(Object o) {
// test if rows are equal
}
}

Say, with the date, you will format using to_char(date_field 'YY-MM-DD')
I'm sure SQL has functions to specify how many digits you want in the number
(you can round the decimal places for sure, but you can also say how many
digist should be taken by those digits which preced the point)
and if you have smaller number you canadd spaces, etc.

If you standardize your csv files on output then your first step iof daeling with
line would be just to check thei equals method and only if that fails,
you can go back and compare item by item

And I guess you dump csv files also through java programs
then you can do standardisation not necessarily in SQL but also
in the printout in java. Though, I'm sure all theswe formattting options
are present in SQL either