Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 239
  • Last Modified:

need help in designing a comparator with mapping rules

Hi,

I am trying to implement a comparator in JAVA. The objective of the comparator is to compare 3 CSV files. the CSV files will contain the table dump from sybase 12 , sybase 15 and oracle dbs. The is supposed to be identical except a few exceptions caused by the variation in available datatypes in these databases. The comparision will be something like
1) take the first row from all the 3 csv files
2) compare each field of this row with the coprable field of the other two csv files
3)PASS if identical or complies with the mapping rule mentioned below
   a)date in format mmddyyy is same as date in mon, day, year
   b) ignore numbers beyond 4th decimal points (1.11125 is equal to 1.11126)
   c) sybase char field containing trailing spaces will be trimmed in oracle. Hence oracle won't have trailing spaces in varchar field even if the corresponding field in sybase has trailing space.
and  some more rules.

I want to know the optimal design for such a comparator. Please give me the high level design and if i need further detail i will ask as we go along.


and a few other rules

0
PearlJamFanatic
Asked:
PearlJamFanatic
  • 11
  • 5
  • 4
  • +1
3 Solutions
 
for_yanCommented:
Comparator in normal situation would allow yto impose order  to say which one is bigger than another
Taht is oprobably not hwat you mean

You jsut want to read files and compare them field by field
0
 
for_yanCommented:
I think you want to define the standard within the rules you specified and then
transofomr to that standard the files as you read and then just equals would allow you compare them
0
 
for_yanCommented:
For eaxmple when you encounter date convert it to mmddyy
when you encounter number ignore the digits beyond 4 decimal points
trim all spaces.
Then you can open all three files and read first line from them
and then you can first tranform all items in the lines to the standard view
(you can have such method for the line)
then you can compare lines as awhole
if the match - all elenments match you go to the next line
if they don't match the go one by one by item and detemne which items are different

Waht shoud be the ultimate output?
0
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
for_yanCommented:
The best approach will depends on whether you expect
that in general data should be more similar tha different.
If so than the outline which I mentioned above
may be useful.
otherwise you can start comparing item by item from
the very begginining
In general if your files are not too big, then
time will not be an issue anyway.


0
 
objectsCommented:
Create a class that represents a single row, and override its equals() method (and optionally hashCode()) so you can compare two rows
then loop through your file creating instances of each row and compare them.
0
 
PearlJamFanaticAuthor Commented:
So first trasnsform the data as per mapping rule and then compare. Is this the consensus?
0
 
PearlJamFanaticAuthor Commented:
The output is report file that contains all the mismatches. some of the tables contain millions of records. there are 92 tables in total that need comparision across these 3 databases.
0
 
for_yanCommented:
Yes, this is probably reasonable thing to do

Comparing lines makes sense first - probably most of them will still be equal
If not equal then comparing item by item
0
 
for_yanCommented:
Actually million records  will probably not be too bad if lines are not too long.
Say 200 chrs line pe million will be 200 mb file - that's not too terrible
0
 
objectsCommented:
> So first trasnsform the data as per mapping rule and then compare. Is this the consensus?

No, do it line by line
0
 
for_yanCommented:
You can probably attempt to standardize fields as  you retrieve them from database.
In Oracle all those which you mentioned can be provided in the query.
Most probably Sybase allows to impose the same standards  (mnumer of digits, format of date,etc)
0
 
PearlJamFanaticAuthor Commented:
OK. So this is what is the algorith looks like so far. I planning to use opencsv http://opencsv.sourceforge.net/

1) Read the current line from file1, file2 and file3. The opencsv has a method readNext(). It can be used like below
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
String [] nextLine;
while ((nextLine = reader.readNext()) != null)
Hence I will have 3 arrays each containing the current line from the respective DBs

2) Now search for the field/datatype that requires transformation. For example search for the Date type field in the oracle array. Once found transform the value using the mapping rule. Do this for all the mapping rules. At the end of this step all the arrays ideally would be identical.

3) compare the arrays field by field and report the mismatch to the report file.

Is this OK?

0
 
PearlJamFanaticAuthor Commented:
for_yan: If I transform at the time of retriving the table from the DB then the CSVs too will have the transformed value. Do you think this is a good approach?
0
 
objectsCommented:
sounds ok, you could also use a bean class (representing each row) to structure your code better
0
 
for_yanCommented:
I don't see a problem with that.
Simply whe you select the date field you make sure that it is ij your stndard format
when you select number - make sure that it is in a standard format - i a snse of decimal points
and say length (add spacesfor xome number if necessary).
Then you cab just compare the whole lines and iif they are different then the difference would be because some field is raelly different
0
 
PearlJamFanaticAuthor Commented:
Object: can you please give me a example how this could be done?
0
 
objectsCommented:
public class MyRow {
    private String fieldWhatever1;
    private Date fieldWhatever2;
    ...

    public MyRow(String[] row) {
       // populate fields from row
    }

    ....

    public boolean equals(Object o) {
        // test if rows are equal
    }
}
0
 
for_yanCommented:

Say, with the date, you will format using to_char(date_field 'YY-MM-DD')
I'm sure SQL has functions to specify how many digits you want in the number
(you can round the decimal places for sure, but you can also say how many
digist should be taken by those digits which preced the point)
and if you have smaller number you canadd spaces, etc.
0
 
for_yanCommented:
If you standardize your csv files on output then your first step iof daeling with
line would be just to check thei equals method and only if that fails,
you can go back and compare item by item

And I guess you dump csv files also through java programs
then you can do standardisation not necessarily in SQL but also
in the printout in java. Though, I'm sure all theswe formattting options
are present in SQL  either
0
 
for_yanCommented:
for example you may have a rule tohave all strings to be always trimmed,
dates always in MMDDYYYY form, and numbers in csv file should not have any leading spaces
and should always have 4 decimal points.

then from one table where you may by default extract

05/03/11  1.25  "  Tom"

from anothe table by default it will be

3-may-2011  1.2500012  "Tom"

you in both cases produce csv line:

05032011,1.2500,Tom and you just can use equlas() method for these lines

and now you can just compare the whole line

otherwise you'dd have to parse and standradize each piece at comparison time


0
 
CEHJCommented:
Use opencsv ColumnPositionMappingStrategy to define a bean for each row. Bale out of equals as soon as the iteration therein produces bean1.equals(bean2) == false, otherwise return true
0
 
CEHJCommented:
The page http://opencsv.sourceforge.net/ contains an example of ColumnPositionMappingStrategy
0

Featured Post

What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

  • 11
  • 5
  • 4
  • +1
Tackle projects and never again get stuck behind a technical roadblock.
Join Now